As the name suggests, Automated Speech Recognition (ASR) is software that interprets spoken words, captured through an input device (such as a microphone) or read from an audio file, and outputs them as text. ASR relieves users of tedious data entry by letting them dictate to their device rather than type. Many industries rely on ASR every day. One of the best-known examples is Amazon’s Alexa.
There is a standard metric called Word Error Rate (WER) for comparing the accuracy of different ASR software. WER is computed by comparing the transcript produced by the ASR software against the correct (reference) transcript. The formula consists of 4 components:
Component | Stands For |
---|---|
S | Substitution: The number of words in the correct transcript that the ASR output replaced with a different word. |
D | Deletion: The number of words in the correct transcript that are missing from the ASR output. |
I | Insertion: The number of extra words in the ASR output that do not appear in the correct transcript. |
N | Number: The total number of words in the correct transcript. |
By combining the above components, we get the following formula to compute WER:

$$\text{WER} = \frac{S + D + I}{N}$$
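Let’s look at an example. Suppose the actual phrase "Please turn around" gets converted into "Please burn a round" by some ASR software. Here, we can notice that:

- S = 2: "turn" is replaced by "burn", and "around" is replaced by "round" (or, equivalently, by "a").
- D = 0: no word from the actual phrase is dropped.
- I = 1: the output contains one extra word, since "around" has become the two words "a round".
- N = 3: the actual phrase contains three words.

After putting all of this together, the computed WER for the conversion above turns out to be:

$$\text{WER} = \frac{S + D + I}{N} = \frac{2 + 0 + 1}{3} = 1$$

In other words, the WER is 100%, even though the first word was recognized correctly.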
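The counting in the example above can also be done programmatically. The following is a minimal sketch, not a reference implementation: it assumes words are separated by whitespace and compared case-insensitively, and it finds a minimum-cost word-level alignment with the classic edit-distance dynamic program. The function name `word_error_rate` and its return layout are hypothetical choices made for this illustration.

```python
def word_error_rate(reference: str, hypothesis: str):
    """Return (S, D, I, N, WER) for a reference and an ASR hypothesis.

    Assumption for this sketch: whitespace tokenization, case-insensitive match.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # words match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Keep the cheapest option; ties favor substitution.
            dp[i][j] = min(substitution, deletion, insertion, key=lambda t: t[0])

    _, subs, dels, ins = dp[len(ref)][len(hyp)]
    total = len(ref)
    return subs, dels, ins, total, (subs + dels + ins) / total


print(word_error_rate("Please turn around", "Please burn a round"))
# (2, 0, 1, 3, 1.0)
```

Because insertions are counted in the numerator but not in N, WER can exceed 1 when the ASR output contains many extra words.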
Fun Fact: Even humans are not perfect transcribers. A commonly cited estimate puts human WER at roughly 4% (about 0.04)!