Improving Speech Recognition in the Browser

By Guy Levy

It seems like the future is all about talking to your devices, whether it is your smartphone, game console, TV, Google Glass or plain old PC. We even have a Web Speech API in the browser that gives web developers access to speech recognition.

Still, there is a small problem: speech recognition software often fails to detect the correct word or phrase. For example, a user with an accent may have a difficult time:

https://youtu.be/5FFRoYhTJQQ

Just imagine writing an app that waits for a user to say a command. The user repeats the command over and over, but the speech recognition API keeps returning the wrong string. What can you do? The correct way to tackle this would be to use a speech recognition grammar.

The bad news is that speech recognition grammars do not appear to be supported by the Web Speech API at the moment, and there is no roadmap for adding them (source). However, the good news is that, since it’s a JavaScript API, we can solve the problem ourselves. In this article I will discuss a method for solving this problem using the well-known Levenshtein distance algorithm.

First, let’s quickly lay out the problem:

  • Our app is waiting for a specific speech input;
  • The user speaks, supplying the input;
  • The Web Speech API supplies the results of the speech recognition but they do not match the exact input required.
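To make the third step concrete, here is a rough sketch of how the recognition results might be read in the browser. The helper name collectTranscripts is hypothetical, and trimming the transcripts is only a precaution. The results[i][j].transcript shape, the maxAlternatives property and the webkit-prefixed constructor belong to the Web Speech API, though browser support varies.

```javascript
// Collects the transcript of every alternative in a recognition event.
// `event` is expected to have the shape of a SpeechRecognitionEvent:
// event.results[i] is one result, event.results[i][j] one alternative.
function collectTranscripts(event) {
    var transcripts = [];
    for (var i = event.resultIndex || 0; i < event.results.length; i++) {
        var result = event.results[i];
        for (var j = 0; j < result.length; j++) {
            transcripts.push(result[j].transcript.trim());
        }
    }
    return transcripts;
}

// Browser-only wiring; skipped where the API is unavailable (for example in Node).
if (typeof window !== 'undefined') {
    var Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    if (Recognition) {
        var recognition = new Recognition();
        recognition.maxAlternatives = 3; // ask for up to three guesses per utterance
        recognition.onresult = function (event) {
            console.log(collectTranscripts(event));
        };
        // Call recognition.start() from a user action such as a button click.
    }
}
```

Each string returned by collectTranscripts is a candidate that can then be compared against the phrases the app is waiting for.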

The Levenshtein Distance Algorithm to the Rescue

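The Levenshtein Distance Algorithm is used to determine the distance between two strings. The algorithm takes two strings and returns a number that reflects how different they are: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.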

Here is an example:

levenshteinDistance("hello and good morning", "hello and good morning")
> 0
levenshteinDistance("hello and good morning", "hello and good night")
> 6

This is exactly what we need: an algorithm that tells us how far the API results are from our expected (grammar) results. This way, we can decide whether the user input is close enough to the expected phrase to be accepted.
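The body of levenshteinDistance is not listed in this article (it comes from tUnE.js, introduced further down). For readers who want to follow along, here is a minimal sketch of the classic dynamic-programming version. It is an illustration only, not the library's source, and may differ from it in details.

```javascript
// Sketch of a Levenshtein distance function (not the tUnE.js source).
function levenshteinDistance(a, b) {
    if (a === b) return 0;
    if (a.length === 0) return b.length;
    if (b.length === 0) return a.length;

    // previous[j] is the distance between the characters of `a` handled
    // so far and the first j characters of `b`.
    var previous = [];
    for (var j = 0; j <= b.length; j++) {
        previous[j] = j;
    }

    for (var i = 1; i <= a.length; i++) {
        var current = [i];
        for (var k = 1; k <= b.length; k++) {
            var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
            current[k] = Math.min(
                previous[k] + 1,        // delete a character from a
                current[k - 1] + 1,     // insert a character from b
                previous[k - 1] + cost  // substitute, or keep a match
            );
        }
        previous = current;
    }
    return previous[b.length];
}

console.log(levenshteinDistance("hello and good morning", "hello and good night")); // 6
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends solely on the current and the previous row.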

Putting the Algorithm to Use

Let’s look at an example of how this might work. In this example, we have:

  • An array of strings – our grammar object (for example ["good morning", "hello and good night", "good night son"]).
  • A single string – result from the speech recognition API (for example "good night").

We can now loop through each grammar string and calculate its distance from the result:

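// Returns the distance between the input and each grammar string,
// in the same order as the grammar array.
var getDistanceFromArray = function (input, grammar) {
    var confidenceArray = [],
        len = grammar.length;
    while (len--) {
        confidenceArray[len] = levenshteinDistance(grammar[len], input);
    }
    return confidenceArray;
};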

getDistanceFromArray("good night", ["good morning", "hello and good night", "good night son"]);
> [6, 10, 4]

The result is an array in which each item is the distance between a grammar string and the result string. From this we simply choose the result with the smallest number:

var minDistance = Math.min.apply( Math, confidenceArray );
Math.min.apply( Math, [6, 10, 4] )
> 4

Next, we let our app decide if the smallest distance is a valid distance.

Here is the full code for the example:

// Returns the grammar string closest to the input, or null if even the
// closest one is more than validDistance edits away.
var getDistanceFromArray = function (input, grammar, validDistance) {
    var confidenceArray = [],
        len = grammar.length;
    while (len--) {
        confidenceArray[len] = levenshteinDistance(grammar[len], input);
    }
    var minDistance = Math.min.apply(Math, confidenceArray);
    if (minDistance <= validDistance) {
        return grammar[confidenceArray.indexOf(minDistance)];
    }
    return null;
};
getDistanceFromArray("good night", ["good morning", "hello and good night", "good night son"], 5)
> "good night son"
getDistanceFromArray("good night", ["good morning", "hello and good night", "good night son"], 3)
> null

So although the speech recognition API returned the wrong result (“good night”), we can reasonably assume that the user actually meant to say the command we are waiting for (“good night son”).
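One possible refinement, which is not part of the original example, is to scale the valid distance with the length of the phrase, since a fixed limit of 5 is generous for a short command and strict for a long sentence. The sketch below assumes a hypothetical matchWithRelativeThreshold function and a maxRatio parameter, and repeats a compact Levenshtein helper so that it runs by itself.

```javascript
// Compact Levenshtein helper so the snippet runs on its own
// (a sketch, not the tUnE.js implementation).
function levenshteinDistance(a, b) {
    var previous = [];
    for (var j = 0; j <= b.length; j++) {
        previous[j] = j;
    }
    for (var i = 1; i <= a.length; i++) {
        var current = [i];
        for (var k = 1; k <= b.length; k++) {
            var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
            current[k] = Math.min(previous[k] + 1, current[k - 1] + 1, previous[k - 1] + cost);
        }
        previous = current;
    }
    return previous[b.length];
}

// Returns the closest grammar phrase, or null when even the closest one
// differs from the input by more than maxRatio of the phrase's length.
function matchWithRelativeThreshold(input, grammar, maxRatio) {
    var best = null;
    var bestDistance = Infinity;
    for (var i = 0; i < grammar.length; i++) {
        var distance = levenshteinDistance(grammar[i], input);
        if (distance < bestDistance) {
            bestDistance = distance;
            best = grammar[i];
        }
    }
    if (best === null) {
        return null; // empty grammar
    }
    return bestDistance <= best.length * maxRatio ? best : null;
}

var grammar = ["good morning", "hello and good night", "good night son"];
console.log(matchWithRelativeThreshold("good night", grammar, 0.3)); // "good night son"
console.log(matchWithRelativeThreshold("good night", grammar, 0.2)); // null
```

With a ratio of 0.3, the 14-character phrase “good night son” tolerates up to 4 edits, so the match is accepted; with 0.2 it tolerates only 2 and the result is rejected. The right ratio is something to tune per app.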

tUnE.js

You might be wondering where the code for levenshteinDistance is coming from. Well, I’ve encapsulated that and a number of other related methods in an open source JavaScript library called tUnE.js.

Let’s look at a proof of concept that uses the library to achieve a slightly more complex result.

https://youtu.be/SrXxLkWRf8A

As you can see in the video, the example draws a little red cube and waits for the following voice inputs:

  • “red cube” will tell the app to start listening for user commands;
  • “move right” will move the cube right;
  • “move left” will move the cube left;
  • “move up” will move the cube up;
  • “move down” will move the cube down.

You may notice that the command still works when I say something similar, for instance “go right” or “fast right” instead of “move right”.
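The demo's own source is not shown here, so the following is only a guess at how such a command matcher could be wired up, not the tUnE.js API. The names commands, closestCommand and moveCube, the grid-unit offsets and the maximum distance of 4 are all assumptions made for illustration, and the “red cube” activation phrase is left out for brevity.

```javascript
// Compact Levenshtein helper so the snippet runs on its own
// (a sketch, not the tUnE.js implementation).
function levenshteinDistance(a, b) {
    var previous = [];
    for (var j = 0; j <= b.length; j++) {
        previous[j] = j;
    }
    for (var i = 1; i <= a.length; i++) {
        var current = [i];
        for (var k = 1; k <= b.length; k++) {
            var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
            current[k] = Math.min(previous[k] + 1, current[k - 1] + 1, previous[k - 1] + cost);
        }
        previous = current;
    }
    return previous[b.length];
}

// Hypothetical command table: phrase -> movement of the cube in grid units.
var commands = {
    "move right": { dx: 1, dy: 0 },
    "move left": { dx: -1, dy: 0 },
    "move up": { dx: 0, dy: -1 },
    "move down": { dx: 0, dy: 1 }
};

// Returns the command phrase closest to what was heard, or null if none
// is within maxDistance edits.
function closestCommand(spoken, maxDistance) {
    var best = null;
    var bestDistance = Infinity;
    Object.keys(commands).forEach(function (phrase) {
        var distance = levenshteinDistance(phrase, spoken);
        if (distance < bestDistance) {
            bestDistance = distance;
            best = phrase;
        }
    });
    return bestDistance <= maxDistance ? best : null;
}

// Returns the cube's new position; unrecognized input leaves it unchanged.
function moveCube(position, spoken) {
    var phrase = closestCommand(spoken, 4);
    if (phrase === null) {
        return position;
    }
    return {
        x: position.x + commands[phrase].dx,
        y: position.y + commands[phrase].dy
    };
}

console.log(moveCube({ x: 0, y: 0 }, "go right"));   // { x: 1, y: 0 }
console.log(moveCube({ x: 0, y: 0 }, "fast right")); // { x: 1, y: 0 }
```

“go right” is 3 edits away from “move right” and “fast right” is 4, so both fall within the assumed limit, while an unrelated phrase such as “hello world” is far from every command and is ignored.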

Conclusion

While we are still waiting for browsers to give us an API for specifying a grammar within the Web Speech API, we can use the Levenshtein algorithm as a stand-in grammar engine. This may not be a bulletproof solution, but it is a workable one for our web app, since it gives us control over the distance we consider valid.

References:

https://github.com/LevyGuy/tUnE.js

https://www.w3.org/TR/speech-grammar/

https://www.w3.org/TR/semantic-interpretation/
