An Arduino With Better Speech Recognition Than Siri

The lowly Arduino, an 8-bit AVR microcontroller with a pitiful amount of RAM, terribly small Flash storage space, and effectively no peripherals to speak of, has better speech recognition capabilities than your Android or iDevice. Eighty percent accuracy, compared to Siri’s sixty. Here’s the video to prove it.

This uSpeech library created by [Arjo Chakravarty] uses a Goertzel algorithm to turn input from a microphone connected to one of the Arduino’s analog pins into phonemes. From there, it’s relatively easy to turn these captured phonemes into function calls for lighting an LED, turning a servo, or even replicating Siri, the modern-day version of the Microsoft paperclip.
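
For readers who have not met it, the Goertzel algorithm measures how much of one chosen frequency is present in a block of samples, without computing a full FFT. The sketch below is a generic illustration of that algorithm only. It is not taken from uSpeech (the library's author states in the comments that uSpeech does not use it), and the function name `goertzelPower` and its parameters are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Generic Goertzel filter: returns the squared magnitude of one target
// frequency in a block of samples. Hypothetical helper, not uSpeech code.
double goertzelPower(const std::vector<double>& samples,
                     double targetHz, double sampleRateHz) {
    assert(sampleRateHz > 0.0);
    const double kPi = 3.14159265358979323846;
    const double omega = 2.0 * kPi * targetHz / sampleRateHz;
    const double coeff = 2.0 * std::cos(omega);

    double sPrev = 0.0;   // s[n-1]
    double sPrev2 = 0.0;  // s[n-2]
    for (double x : samples) {
        const double s = x + coeff * sPrev - sPrev2;
        sPrev2 = sPrev;
        sPrev = s;
    }
    // Squared magnitude from the last two filter states.
    return sPrev * sPrev + sPrev2 * sPrev2 - coeff * sPrev * sPrev2;
}
```

Feeding it a block that contains a tone at the target frequency yields a large value, while a block without that tone yields a value near zero, which is why the technique is popular on small microcontrollers for detecting a handful of frequencies.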

There is one caveat for the uSpeech library: it will only respond to predefined phrases and not normal speech. Still, that’s an extremely impressive accomplishment for a simple microcontroller.

This isn’t the first time we’ve seen [Arjo]’s uSpeech library, but it is the first time we’ve seen it in action. When this was posted months and months ago, [Arjo] was behind the Great Firewall of China and couldn’t post a proper demo. Since the uSpeech library is a spectacular achievement, we asked for a few videos showing off a few applications. No one made the effort, so [Arjo] decided to make use of his new VPN and show off his work to the world.

Video below.

Comments

  1. arjo129 says:

    My library DOES NOT USE THE GOERTZEL ALGORITHM

  2. Random Commenter says:

    “has better speech recognition capabilities than your Android or iDevice”

    BS much? Please don’t drink and write ‘articles’.

    • Squirrel says:

      Seems like a BuzzFeed article title

    • Greenaum says:

      80% vs 60%, statistics don’t lie! 9 out of 10 people who only know 15 words agree! It’s a bit cheeky but technically correct, (the best kind of correct). Of course you can’t *actually* beat Google / Apple’s banks of servers with an AVR, but it’s nice to win in a small way.

      • Roberto Lupi says:

        80% vs. 60% is comparing apples and oranges, however I have to compliment Arjo on the nice work: being able to recognize a few words (too many conflicts with just 6 phonemes) vs. being able to recognize the whole vocabulary.

        When I was at the end of high school, about 17 years ago, I wrote something quite close to this project for a 386, similar to the ATmega in processing power. Although my approach was quite a bit more complex (sliding Hamming windows > cepstral coefficients > a small multi-layer perceptron whose outputs were the set of words it recognized), it performed not much better than Arjo’s.

        Splitting the problem into recognizing phonemes and then using that knowledge in a second phase is the key and that’s something widely adopted in modern speech recognition systems. Nice work in simplifying this approach to the very core.

        In terms of complexity, you can think of uSpeech as a woodblock toy train and what Siri or Google voice recognition do as a self-driving car. I am not pulling this comparison out of thin air: although I have not directly worked on Google’s voice recognition systems, I am a Googler and I do have general knowledge about its architecture and implementation.

        Arjo, keep up the nice work and maybe join us (or our Chinese or Indian competitors) sometime in the future :)

        Internships are a great way to start and you don’t have to wait until you finish the university to work on very cool things.

        • microtherion says:

          Actually, splitting the problem into phoneme recognition and then using the recognized phonemes in a separate phase is exactly how modern systems do NOT work! Recognition systems try to integrate their models as fully as possible. In Google Voice Search, for instance, the system is using predictive search to anticipate what word you’re likely to say next and use that to weigh the probabilities of the next phoneme to be recognized.

          However, µSpeech is certainly an interesting approach given the constraints of an Arduino system.

        • arjo129 says:

          Thanks a lot for the encouragement. I had originally looked into a method using FFT>MFCCs>ANN, but I didn’t want people to have to say the word they wanted the recognizer to recognize into the microphone 100s of times. This algorithm also means I don’t need to take a large amount of time to sample and then process. That said, the headline is too sensationalist. You cannot build Siri using uSpeech.

  3. Jock Murphy says:

    I think uSpeech is really cool, and have used it in a project myself; but I hesitate to call it speech recognition. It recognizes 6 phonemes (f, e, o, v, s, h). So it is handy for simple commands (if they have one of those phonemes), but it isn’t generalized at all.

    Indeed in that video, the “right” command is a cheat. The code looks for the “F” sound in “left”, the “S” sound in “center” and it then assumes that anything else is “right”. See lines 40 through 62 of https://github.com/arjo129/uSpeech/blob/master/examples/servocontrol/servocontrol.ino

    So “Squirrel”, “Sopwith”, and “Squish” will all cause the servo to center. Our favorite four letter F based expletive will cause it to go left, and “Jabberwocky” will cause it to turn right
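
The behaviour Jock Murphy describes can be sketched as a three-way dispatch on a single recognized phoneme. This is a reconstruction from his description only, not the code in servocontrol.ino; the `ServoCommand` enum and the function name `commandForPhoneme` are hypothetical.

```cpp
// Commands the demo appears to support, per the comment above.
enum class ServoCommand { Left, Center, Right };

// 'f' (as in "left") selects Left, 's' (as in "center") selects Center,
// and every other phoneme falls through to Right.
// Hypothetical helper reconstructed from the comment, not library code.
constexpr ServoCommand commandForPhoneme(char phoneme) {
    return phoneme == 'f' ? ServoCommand::Left
         : phoneme == 's' ? ServoCommand::Center
                          : ServoCommand::Right;
}
```

Because Right is the fall-through case, any sound that is neither "f" nor "s" turns the servo right, which is why a word like "Jabberwocky" works as well as "right" does.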

  4. rue_mohr says:

    so I still have some time to write my own fixed vocab. speech rec’g for avr, cool….

  5. Andrew says:

    Well, this is timely… I was planning to do some speech recognition for home automation in the new year.

  6. cpldcpu says:

    I have never looked into speech recognition, so I found this pretty interesting. There is a lot of handwaving in the documentation when it comes to the algorithm. I was a bit surprised when I looked at the code – is this really based on solid theory or is it a hack that just works somehow? There are some parts that appear weird to me. For example the entire code is lacking a timebase, the sample intervals are basically arbitrary.

  7. Last time I sent a project to H-a-D I was called a “lowly amateur” and worse.
    Join the club :-)
    On the flip side, the project did work and I even used it to make some night vision goggles.

  8. Dutado says:

    20 years back, you would be sued by Currah (or dk’tronics) for using their trademark….. :-)

  9. Anonymous says:

    Finally, your Arduino can wreck a nice beach!

  10. Bacchus says:

    I think it’s the “natural language” bit in Siri (and all the other so-called personal assistants) that’s a bit tricky. That and not taking offence when hearing imaginary swearing.

  11. dr memals says:

    People seem to be missing the main point: you do not need A: a connection to the internet and a backend speech recognition system like Google’s, or B: a high-powered CPU/memory to process a few simple commands.
    With those benefits, the shortcomings listed in the above comments seem arbitrary.

    • Jock Murphy says:

      Well, it depends how accurate you need those commands to be. A uSpeech system that is free listening is going to respond to a lot of arbitrary words with potentially undesired results.

      Or you use a sentinel word, and potentially give up on one of your sounds. Either way you are going to have to contort your vocabulary to fit the sounds, and be happy with lots of false positives.

      uSpeech is useful, but the use cases are fairly limited.

  12. Eric says:

    Pretty impressive, but I have to admit it really feels like a case of using the wrong tool for the job. I’m more curious what you could do with something like an ARM Cortex and more memory instead since that’s more like what I’d want to use in the field for this type of thing.

    Speech recognition is generally pretty processor intensive (and memory intensive when you add in the language and acoustic models) but this type of project shows that you can get away with shortcuts if you’re careful.

  13. tachyon1 says:

    This is pretty impressive work. I plan to check this library out for myself.

    In fact my only complaint is that he mispronounces his own library’s name.
    It’s not “You Speech” it’s “Micro Speech” or “Mu (moo) Speech”.
    If you want to pronounce it “You Speech” change the name to U-Speech or You-Speech.

    Other than that, keep up the good work.

  14. tejaswi says:

    Hey Arjo, thank you very much for uSpeech. How did you learn to write this library? I tried what you showed in the calibration demo videos, but my getPhoneme function is not displaying the letters in the end.

  15. rasika says:

    Arjo, this is good for startup AVR projects. It is even completely independent of the PC, unlike BitVoicer. Instead of comparing single phonemes, I used a 32 char buffer and compared predefined phoneme patterns using minimum distance. Of course you need to have a certain threshold. But it does the job quite well. Tnx.
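
The buffer-and-threshold idea in rasika's comment might look something like the sketch below. The comment only says "minimum distance", so Levenshtein edit distance is an assumption here, as are the names `editDistance`, `closestPattern` and `kBufferSize`; none of this comes from uSpeech or from rasika's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const std::size_t kBufferSize = 32;  // buffer length mentioned in the comment

// Levenshtein edit distance between two phoneme strings.
// Assumed metric: the comment does not say which distance was used.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, substitution});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Returns the index of the closest predefined pattern, or -1 when even the
// best match is farther away than the threshold.
int closestPattern(const std::string& heard,
                   const std::vector<std::string>& patterns,
                   std::size_t threshold) {
    assert(heard.size() <= kBufferSize);
    int best = -1;
    std::size_t bestDistance = threshold + 1;
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        const std::size_t d = editDistance(heard, patterns[i]);
        if (d < bestDistance) {
            bestDistance = d;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The threshold is what keeps unrelated noise from triggering a command: a buffer that is not close to any pattern returns -1 instead of being forced onto the nearest one.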
