An Arduino With Better Speech Recognition Than Siri

The lowly Arduino, an 8-bit AVR microcontroller with a pitiful amount of RAM, terribly small Flash storage space, and effectively no peripherals to speak of, has better speech recognition capabilities than your Android or iDevice. Eighty percent accuracy, compared to Siri’s sixty. Here’s the video to prove it.

This uSpeech library created by [Arjo Chakravarty] uses a Goertzel algorithm to turn input from a microphone connected to one of the Arduino’s analog pins into phonemes. From there, it’s relatively easy to turn these captured phonemes into function calls for lighting an LED, turning a servo, or even replicating Siri, the modern-day version of the Microsoft paperclip.
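For readers who have not met the Goertzel algorithm, here is a minimal desktop C++ sketch of the general technique: it measures how much energy a block of samples carries at a single target frequency without computing a full FFT. This is not uSpeech's code. The function name `goertzelPower`, its parameters, and the use of `double` are assumptions made for illustration; code for an AVR would more likely work on integer ADC readings.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: squared magnitude of the DFT bin nearest targetHz
// for one block of samples taken at sampleRateHz.
double goertzelPower(const std::vector<double>& samples,
                     double targetHz, double sampleRateHz) {
    const double kPi = 3.14159265358979323846;
    const std::size_t n = samples.size();
    if (n == 0) return 0.0;

    // Snap the target frequency to the nearest DFT bin index.
    const double k = std::round(static_cast<double>(n) * targetHz / sampleRateHz);
    const double omega = 2.0 * kPi * k / static_cast<double>(n);
    const double coeff = 2.0 * std::cos(omega);

    // Second-order recurrence: s[i] = x[i] + coeff * s[i-1] - s[i-2].
    double sPrev = 0.0;
    double sPrev2 = 0.0;
    for (double x : samples) {
        const double s = x + coeff * sPrev - sPrev2;
        sPrev2 = sPrev;
        sPrev = s;
    }

    // Power at the chosen bin, taken from the last two states.
    return sPrev * sPrev + sPrev2 * sPrev2 - coeff * sPrev * sPrev2;
}
```

Running one such filter per frequency band of interest costs only a multiply and two additions per sample, which is the usual reason to pick Goertzel over an FFT on a small microcontroller.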

There is one caveat for the uSpeech library: it will only respond to predefined phrases and not normal speech. Still, that’s an extremely impressive accomplishment for a simple microcontroller.

This isn’t the first time we’ve seen [Arjo]’s uSpeech library, but it is the first time we’ve seen it in action. When this was posted months and months ago, [Arjo] was behind the Great Firewall of China and couldn’t post a proper demo. Since the uSpeech library is a spectacular achievement, we asked for a few videos showing off a few applications. No one made the effort, so [Arjo] decided to make use of his new VPN and show off his work to the world.

Video below.

37 thoughts on “An Arduino With Better Speech Recognition Than Siri”

        1. Don’t forget weaponized 2+ KW lasers, RFID implants or quadrocopters with tasers.
          Welcome to hackaday, it’s safe here.

  1. “has better speech recognition capabilities than your Android or iDevice”

    BS much? Please don’t drink and write ‘articles’.

    1. 80% vs 60%, statistics don’t lie! 9 out of 10 people who only know 15 words agree! It’s a bit cheeky but technically correct, (the best kind of correct). Of course you can’t *actually* beat Google / Apple’s banks of servers with an AVR, but it’s nice to win in a small way.

      1. 80% vs. 60% is comparing apples and oranges, however I have to compliment Arjo on the nice work: being able to recognize a few words (too many conflicts with just 6 phonemes) vs. being able to recognize the whole vocabulary.

        When I was at the end of high school, about 17 years ago, I wrote something quite close to this project for a 386, similar to the Atmega in processing power. Although my approach was quite a bit more complex (sliding hamming windows > cepstral coefficients > a small multi-layer perceptron whose outputs were the set of words it recognized) it performed not much better than Arjo’s.

        Splitting the problem into recognizing phonemes and then using that knowledge in a second phase is the key and that’s something widely adopted in modern speech recognition systems. Nice work in simplifying this approach to the very core.

        In terms of complexity, you can think about uSpeech as a woodblock toy train and what Siri or Google voice recognition do as a self-driving car. I am not pulling this comparison out of thin air: although I have not directly worked on Google’s voice recognition systems, I am a Googler and I do have general knowledge about its architecture and implementation.

        Arjo, keep up the nice work and maybe join us (or our Chinese or Indian competitors) sometime in the future :)

        Internships are a great way to start and you don’t have to wait until you finish the university to work on very cool things.

        1. Actually, splitting the problem into phoneme recognition and then using the recognized phonemes in a separate phase is exactly how modern systems do NOT work! Recognition systems try to integrate their models as fully as possible. In Google Voice Search, for instance, the system is using predictive search to anticipate what word you’re likely to say next and use that to weigh the probabilities of the next phoneme to be recognized.

          However, µSpeech is certainly an interesting approach given the constraints of an Arduino system.

        2. Thanks a lot for the encouragement. I had originally looked into a method using FFT>MFCCs>ANN, but then I didn’t want to have people having to say the word they wanted the recognizer to recognize into the microphone 100s of times. This algorithm also means I don’t need to take a large amount of time to sample and then process. That said, the headline is too sensationalist. You cannot build Siri using uSpeech.

  2. I think uSpeech is really cool, and have used it in a project myself; but I hesitate to call it speech recognition. It recognizes 6 phonemes (f, e, o, v, s, h). So it is handy for simple commands (if they have one of those phonemes), but it isn’t generalized at all.

    Indeed in that video, the “right” command is a cheat. The code looks for the “F” sound in “left”, the “S” sound in “center” and it then assumes that anything else is “right”. See lines 40 through 62 of https://github.com/arjo129/uSpeech/blob/master/examples/servocontrol/servocontrol.ino

    So “Squirrel”, “Sopwith”, and “Squish” will all cause the servo to center. Our favorite four letter F based expletive will cause it to go left, and “Jabberwocky” will cause it to turn right.
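The decision logic described in this comment can be sketched in a few lines of C++. This is a paraphrase of the behavior as the commenter describes it, not a copy of the linked example: the names `ServoCommand` and `commandForPhonemes` are hypothetical, and checking for “f” before “s” when both are heard is an assumption the comment does not settle.

```cpp
#include <string>

enum class ServoCommand { Left, Center, Right };

// Hypothetical helper: `phonemes` holds the phoneme characters heard
// during one spoken word, in the order they were detected.
ServoCommand commandForPhonemes(const std::string& phonemes) {
    // An "f" anywhere is taken to mean "left".
    if (phonemes.find('f') != std::string::npos) return ServoCommand::Left;
    // An "s" anywhere is taken to mean "center".
    if (phonemes.find('s') != std::string::npos) return ServoCommand::Center;
    // Everything else, including silence or noise, falls through to "right".
    return ServoCommand::Right;
}
```

Written this way, it is easy to see why any word starting with an “s” sound centers the servo and why “right” is never actually recognized.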

    1. Spot on. I don’t agree with the claim that it’s better than Siri. It gets the job done though. I think I said it can do a maximum of ten words somewhere in the docs…

    2. It would be entertaining to walk into a communal lab space to see an engineer yelling “SQUISH F%$& JABBERWOCKY” at a device.

      1. Without being a dig, the algorithm is relatively unsophisticated, so I am sure you could. I haven’t dug in enough to see how difficult it would be.

        Even without doing that, you could make the recognition a little more sophisticated and reduce the number of synonyms. The library could log the duration of the word, and the relative time that each phoneme was recognized.

        By doing this you could then differentiate “Squish” from “Snugglebunny” by length. By knowing the order and relative position of each of the phonemes you could tell “fish” from “safe” (Squish the Snugglebunny in the Fish Safe?)
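A rough C++ sketch of that suggestion follows, assuming the recognizer can hand back a timestamped list of detected phonemes. The `PhonemeEvent` and `WordSummary` types and the `summarize` function are invented for illustration and are not part of uSpeech.

```cpp
#include <string>
#include <vector>

// Hypothetical record of one detection: which phoneme, and when (ms).
struct PhonemeEvent {
    char phoneme;
    unsigned long timeMs;
};

// What we keep per spoken word: phoneme order and overall length.
struct WordSummary {
    std::string order;
    unsigned long durationMs;
};

// Assumes `events` is sorted by increasing timeMs.
WordSummary summarize(const std::vector<PhonemeEvent>& events) {
    WordSummary out{"", 0};
    if (events.empty()) return out;
    for (const PhonemeEvent& e : events) {
        // Collapse consecutive repeats so "ffss" becomes "fs".
        if (out.order.empty() || out.order.back() != e.phoneme) {
            out.order.push_back(e.phoneme);
        }
    }
    out.durationMs = events.back().timeMs - events.front().timeMs;
    return out;
}
```

With summaries like these, “fish” and “safe” would differ in `order` (an “f” before an “s” versus the reverse), while a short word and a long one that share a first phoneme would differ in `durationMs`.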

  3. I have never looked into speech recognition, so I found this pretty interesting. There is a lot of handwaving in the documentation when it comes to the algorithm. I was a bit surprised when I looked at the code – is this really based on solid theory or is it a hack that just works somehow? There are some parts that appear weird to me. For example the entire code is lacking a timebase, the sample intervals are basically arbitrary.

  4. Last time I sent a project to H-a-D I was called a “lowly amateur” and worse.
    Join the club :-)
    On the flip side, the project did work and I even used it to make some night vision goggles.

  5. I think it’s the “natural language” bit in Siri (and all the other so-called personal assistants) that’s a bit tricky. That and not taking offence when hearing imaginary swearing.

  6. People seem to be missing the main point: you do not need A: a connection to the internet and a backend speech recognition system like Google’s or B: a high powered CPU/memory to process a few simple commands.
    With those benefits, the shortcomings listed in the above comments seem arbitrary.

    1. Well it depends how accurate you need those commands to be. A uSpeech system that is free listening is going to respond to a lot of arbitrary words with potentially undesired results.

      Or you use a sentinel word, and potentially give up on one of your sounds. Either way you are going to have to contort your vocabulary to fit the sounds, and be happy with lots of false positives.

      uSpeech is useful, but the use cases are fairly limited.

  7. Pretty impressive, but I have to admit it really feels like a case of using the wrong tool for the job. I’m more curious what you could do with something like an ARM Cortex and more memory instead since that’s more like what I’d want to use in the field for this type of thing.

    Speech recognition is generally pretty processor intensive (and memory intensive when you add in the language and acoustic models) but this type of project shows that you can get away with shortcuts if you’re careful.

    1. It depends on the kind of speech recognition in question. Natural language recognition requires a lot of resources, but command/phrase recognition isn’t nearly as intensive. If you look around there are a number of no-name boards using the SPCE061A (a 16 bit “audio processor”). Sensory’s speech recognition is done with a 16 bit DSP. These systems can recognize between 10 and 50 speaker independent words/phrases.

  8. This is pretty impressive work. I plan to check this library out for myself.

    In fact my only complaint is that he mispronounces his own library’s name.
    It’s not “You Speech” it’s “Micro Speech” or “Mu (moo) Speech”.
    If you want to pronounce it “You Speech” change the name to U-Speech or You-Speech.

    Other than that, keep up the good work.

  9. hey , Arjo…..thanku very much for uspeech….how did u learn to write this library… .and i tried as u showed in the calibration demo videos.. but my getPhoneme function is not displaying the letters in the end….

  10. Arjo, this is good for startup AVR projects. It is even completely independent of the PC, unlike BitVoicer. Instead of comparing single phonemes, I used a 32 char buffer and compared predefined phoneme patterns using minimum distance. Of course you need to have a certain threshold. But it quite does the job. Tnx.
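The commenter does not say which distance measure was used, so the C++ sketch below assumes Levenshtein (edit) distance between the buffered phoneme string and each stored pattern. The names `editDistance` and `bestMatch` and the threshold parameter are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance using two rolling rows of the DP table.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Returns the index of the closest pattern, or -1 when even the best
// candidate is further away than maxDistance (the threshold).
int bestMatch(const std::string& heard,
              const std::vector<std::string>& patterns,
              std::size_t maxDistance) {
    int best = -1;
    std::size_t bestDist = maxDistance + 1;
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        const std::size_t d = editDistance(heard, patterns[i]);
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

On an actual AVR the `std::string` and `std::vector` containers would likely give way to fixed-size `char` arrays such as the 32-character buffer mentioned above, but the comparison logic would stay the same.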
