An Arduino With Better Speech Recognition Than Siri

December 31, 2013

The lowly Arduino, an 8-bit AVR microcontroller with a pitiful amount of RAM, terribly small Flash storage space, and effectively no peripherals to speak of, has better speech recognition capabilities than your Android or iDevice. Eighty percent accuracy, compared to Siri’s sixty.Here’s the video to prove it.

This uSpeech library created by [Arjo Chakravarty] ~~uses a Goertzel algorithm~~ to turn input from a microphone connected to one of the Arduino’s analog pins into phonemes. From there, it’s relatively easy to turn these captured phonemes into function calls for lighting a LED, turning a servo, or even replicating the Siri, the modern-day version of the Microsoft paperclip.

There is one caveat for the uSpeech library: it will only respond to predefined phrases and not normal speech. Still, that’s an extremely impressive accomplishment for a simple microcontroller.

This isn’t the first time we’ve seen [Arjo]’s uSpeech library, but it is the first time we’ve seen it in action. When this was posted months and months ago, [Arjo] was behind the Great Firewall of China and couldn’t post a proper demo. Since this the uSpeech library is a spectacular achievement we asked for a few videos showing off a few applications. No one made the effort, so [Arjo] decided to make use of his new VPN and show off his work to the world.

Video below.

42 thoughts on “An Arduino With Better Speech Recognition Than Siri”

arjo129 says:

December 31, 2013 at 4:29 pm

My library DOES NOT USE THE GOERTZEL ALGORITHM

Report comment

Reply
1. XOIIO says:
  
  December 31, 2013 at 4:44 pm
  
  Is there a reason you need to go into an all caps rage about it?
  
  Report comment
  
  Reply
  1. Preparation H says:
    
    January 1, 2014 at 11:41 am
    
    Is there a reason you are so anal about it!?
    
    Report comment
    
    Reply
2. Brian Benchoff says:
  
  December 31, 2013 at 5:01 pm
  
  Sorry about that.
  
  Report comment
  
  Reply
  1. arjo129 says:
    
    December 31, 2013 at 6:06 pm
    
    No problem.
    
    Report comment
    
    Reply
3. Matt says:
  
  December 31, 2013 at 5:30 pm
  
  That’s probably the last time you’ll be featured on hackaday.
  
  Report comment
  
  Reply
  1. F says:
    
    December 31, 2013 at 5:50 pm
    
    whereas projects featuring toxic waste and poison gas are always welcome
    
    Report comment
    
    Reply
    1. Ryan says:
      
      January 5, 2014 at 2:52 am
      
      Don’t forget weaponized 2+ KW lasers, RFID implants or quadrocopters with tasers.
      Welcome to hackaday, it’s safe here.
      
      Report comment
      
      Reply
Random Commenter says:

December 31, 2013 at 4:52 pm

“has better speech recognition capabilities than your Android or iDevice”

BS much? Please don’t drink and write ‘articles’.

Report comment

Reply
1. Squirrel says:
  
  December 31, 2013 at 5:16 pm
  
  Seems like a BuzzFeed article title
  
  Report comment
  
  Reply
2. Greenaum says:
  
  January 1, 2014 at 12:23 am
  
  80% vs 60%, statistics don’t lie! 9 out of 10 people who only know 15 words agree! It’s a bit cheeky but technically correct, (the best kind of correct). Of course you can’t *actually* beat Google / Apple’s banks of servers with an AVR, but it’s nice to win in a small way.
  
  Report comment
  
  Reply
  1. Roberto Lupi says:
    
    January 1, 2014 at 10:09 am
    
    80% vs. 60% is comparing apples and oranges, however I have to compliment Arjo on the nice work: being able to recognize a few words (too many conflicts with just 6 phonemes) vs. being able to recognize the whole vocabulary.
    
    When I was at the end of high school, about 17 years ago, I wrote something quite close to this project for 386, similar to the Atmega in processing power. Although my approach was quite a bit more complex (sliding hamming windows > cepstral coefficients > a small multi-layer perceptron whose output where the set of words it recognized) it performed not much better than Arjo’s.
    
    Splitting the problem into recognizing phonemes and then using that knowledge in a second phase is the key and that’s something widely adopted in modern speech recognition systems. Nice work in simplifying this approach to the very core.
    
    In terms of complexity, you can think about uSpeech a woodblock toy train and what Siri or Google voice recognition do as a self-driving car. I am not pulling this comparison out of thin air, although I have not directly worked on Google’s voice recognition systems, I am a Googler and I do have general knowledge about its architecture and implementation.
    
    Arjo, keep up the nice work and maybe join us (or our Chinese or Indian competitors) sometimes in the future :)
    
    Internships are a great way to start and you don’t have to wait until you finish the university to work on very cool things.
    
    Report comment
    
    Reply
    1. microtherion says:
      
      January 1, 2014 at 7:10 pm
      
      Actually, splitting the problem into phoneme recognition and then using the recognized phonemes in a separate phase is exactly how modern systems do NOT work! Recognition systems try to integrate their models as fully as possible. In Google Voice Search, for instance, the system is using predictive search to anticipate what word you’re likely to say next and use that to weigh the probabilities of the next phoneme to be recognized.
      
      However, µSpeech is certainly an interesting approach given the constraints of an Arduino system.
      
      Report comment
      
      Reply
    2. arjo129 says:
      
      January 1, 2014 at 9:10 pm
      
      Thanks a lot for the encouragement, I had originally looked into a method using FFT>MFCCs>ANN but then I didn’t want to have people having to say the word they wanted the recognizer to recognize into the microphone 100s of times. This algorithm also means I don’t need to take a large amount of time to sample and then process. That said the headline is too sensationalist. You cannot build siri using uSpeech.
      
      Report comment
      
      Reply
Jock Murphy says:

December 31, 2013 at 6:00 pm

I think uSpeech is really cool, and have used it in a project myself; but I hesitate to call it speech recognition. It recognizes 6 phonemes (f, e, o, v, s, h). So it is handy for simple commands (if they have one of those phonemes), but it isn’t generalized at all.

Indeed in that video, the “right” command is a cheat. The code looks for the “F” sound in “left”, the “S” sound in “center” and it then assumes that anything else is “right”. See lines 40 through 62 of https://github.com/arjo129/uSpeech/blob/master/examples/servocontrol/servocontrol.ino

So “Squirrel”, “Sopwith”, and “Squish” will all cause the servo to center. Our favorite four letter F based expletive will cause it to go left, and “Jabberwocky” will cause it to turn right

Report comment

Reply
1. arjo129 says:
  
  December 31, 2013 at 6:09 pm
  
  Spot on. I don’t agree with the fact its better than siri. It gets the job done though. I think I said it can do a maximum of ten words somewhere in the docs…
  
  Report comment
  
  Reply
2. Paul says:
  
  December 31, 2013 at 7:11 pm
  
  It would be entertaining to walk into a communal lab space to see an engineer yelling “SQUISH F%$& JABBERWOCKY” at a device.
  
  Report comment
  
  Reply
  1. Jock Murphy says:
    
    December 31, 2013 at 8:09 pm
    
    Its good to have a dream ;)
    
    Report comment
    
    Reply
3. Philip says:
  
  December 31, 2013 at 9:51 pm
  
  is there a way to increase the number of phonemes?
  
  Report comment
  
  Reply
  1. Jock Murphy says:
    
    December 31, 2013 at 10:06 pm
    
    Without being a dig, the algorithm is relatively unsophisticated, so I am sure you could. I haven’t dug in enough to see how difficult it would be.
    
    Even without doing that, you could make the recognition a little more sophisticated and reduce the number of synonyms, The library could log the duration of the word, and the relative time that each phoneme was recognized.
    
    By doing this you could then differentiate “Squish” from “Snugglebunny” by length. By knowing the order and relative position of each of the phonemes you could could tell “fish” from “safe” (Squish the Snugglebunny in the Fish Safe?)
    
    Report comment
    
    Reply
4. Greenaum says:
  
  January 1, 2014 at 12:24 am
  
  Hm, be nice to get an N in there for “off” and “on”. Sure Arjo thought of that already tho.
  
  Report comment
  
  Reply
rue_mohr says:

December 31, 2013 at 7:06 pm

so I still have some time to write my own fixed vocab. speech rec’g for avr, cool….

Report comment

Reply
Andrew says:

January 1, 2014 at 12:14 am

Well, this is timely… I was planning to do some speech recognition for home automation in the new year.

Report comment

Reply
cpldcpu says:

January 1, 2014 at 1:17 am

I have never looked into speech recognition, so I found this pretty interesting. There is a lot of handwaving in the documentation when it comes to the algorithm. I was a bit surprised when I looked at the code – is this really based on solid theory or is it a hack that just works somehow? There are some parts that appear weird to me. For example the entire code is lacking a timebase, the sample intervals are basically arbitrary.

Report comment

Reply
BotherSaidPooh says:

January 1, 2014 at 2:49 am

Last time I sent a project to H-a-D I was called a “lowly amateur” and worse.
Join the club :-)
On the flip side, the project did work and I even used it to make some night vision goggles.

Report comment

Reply
Dutado says:

January 1, 2014 at 3:49 am

20 years back, you would be sued by Currah (or dk’tronics) for using their trademark….. :-)

Report comment

Reply
Anonymous says:

January 1, 2014 at 5:11 am

Finally, your Arduino can wreck a nice beach!

Report comment

Reply
Bacchus says:

January 1, 2014 at 9:24 am

I think it’s the “natural language” bit in Siri (and all the other so-called personal assistants) that’s a bit tricky. That and not taking offence when hearing imaginary swearing.

Report comment

Reply
dr memals says:

January 1, 2014 at 1:34 pm

people seem to be missing the main point, you do not need A: connection to internet and backend speech recognition system like Google’s or B: high powered CPU/Memory to process a few simple commands.
With those benifits the short cummings listed in the above comments seem arbitary.

Report comment

Reply
1. Jock Murphy says:
  
  January 1, 2014 at 1:52 pm
  
  Well it depends how accurate you need those commands to be. A uSpeeh system that is free listening is going to respond to a lot of arbitrary words with potentially undesired results.
  
  Or you use a sentinel word, and potentially give up on one of your sounds. Either way you are going to have to contort your vocabulary to fit the sounds, and be happy with lots of false positives.
  
  uSpeech is useful, but the use cases are fairly limited.
  
  Report comment
  
  Reply
Eric says:

January 2, 2014 at 11:06 am

Pretty impressive, but I have to admit it really feels like a case of using the wrong tool for the job. I’m more curious what you could do with something like an ARM Cortex and more memory instead since that’s more like what I’d want to use in the field for this type of thing.

Speech recognition is generally pretty processor intensive (and memory intensive when you add in the language and acoustic models) but this type of project shows that you can get away with shortcuts if you’re careful.

Report comment

Reply
1. Jock Murphy says:
  
  January 2, 2014 at 4:52 pm
  
  It depends on the kind of speech recognition in question. natural language recognition requires a lot of resources, but command/phrase recognition isn’t nearly as intensive. If you look around there are a number of no-name boards using the SPCE061A (a 16 bit “audio processor”). Sensory’s speech recognition is done with a 16 bit DSP. These systems can recognize between 10 and 50 speaker independent words/phrases
  
  Report comment
  
  Reply
2. Andrew says:
  
  January 2, 2014 at 5:09 pm
  
  I’m leaning towards something that harnesses Wolfram or maybe Google Speech API to do the heavy lifting and then use my hardware to provide the appropriate response. Steven Hickson has made some interesting progress in that direction with Raspberry Pi based speech recognition: http://stevenhickson.blogspot.ca/2013/06/voice-command-v30-for-raspberry-pi.html
  
  Here’s a link to his YouTube demo: http://www.youtube.com/watch?v=6NHklmXMouY
  
  Report comment
  
  Reply
  1. Jock Murphy says:
    
    January 2, 2014 at 5:33 pm
    
    PocketSphix (http://cmusphinx.sourceforge.net/) is fairly easy to work with. I have done professional projects using it on both the Raspberry Pi and the BeagleBone Black. It has the added benefit that it will work even if not connected to the net
    
    Report comment
    
    Reply
tachyon1 says:

January 2, 2014 at 6:43 pm

This is pretty impressive work. I plan to check this library out for myself.

In fact my only complaint is that he mispronounces his own library’s name.
It’s not “You Speech” it’s “Micro Speech” or “Mu (moo) Speech”.
If you want to pronounce it “You Speech” change the name to U-Speech or You-Speech.

Other than that, keep up the good work.

Report comment

Reply
tejaswi says:

February 27, 2014 at 7:31 am

hey , Arjo…..thanku very much for uspeech….how did u learn to write this library… .and i tried as u showed in the calibration demo videos.. but my getPhoneme function is not displaying the letters in the end….

Report comment

Reply
rasika says:

August 15, 2014 at 9:37 pm

arjo, this is good for startup AVR projects. Even this is completely independent of the PC, unlike BitVoicer. Instead of comparing single phonemes, i used a 32 char buffer and compared predefined phoneme patterns using minimum distance. Ofcourse you need to have a certain threshold. But it quite does the job. Tnx.

Report comment

Reply
yourmom says:

January 26, 2015 at 11:32 am

Could the documentation for this be any worse…

Report comment

Reply
1. Jock Murphy says:
  
  January 26, 2015 at 12:40 pm
  
  There could be no documentation at all…
  
  Report comment
  
  Reply
Tony C says:

December 12, 2015 at 12:52 am

Hi all,

I know this is a fairly old ‘article’ but I had a question:

I’m fairly new to Arduino so please bare with me. I currently have a door access system setup at my shop using a magnetic lock (similar to a electric door strike)… I would like to tie in a arduino that is constantly listening for the word “open” which will simply trigger a relay for 5 seconds or so to unlock the door. Will this work for my intended use? Are there any instructions/tutorials on how I would set it up? Is there another solutions out there that may work for me? Any help is appreciated.

Report comment

Reply
Terah says:

January 18, 2016 at 3:47 pm

Since we’re necroposting.. Yes, it would recognise ‘open’ (near the top it states that ‘o’ is recognised) and trigger the relay. Of course, the way this currently works it would also open the door when you say ‘Oh no’, ‘Hello’, and basically any other word with the ‘o’ sound in it. If you want to arduino it.. I’d go with a remote control or something ^^.

Report comment

Reply
juergenuk says:

January 18, 2018 at 1:56 am

I know this is Arduino here, but how easy would it be to port to MSP430 and ARM? Are the sources somewhere?. And the other point: is there an implementation that would generate a “Speech Recognition Chip” and just outputs either a serial or a parallel byte to be used with other systems?

Report comment

Reply

Hackaday

An Arduino With Better Speech Recognition Than Siri

42 thoughts on “An Arduino With Better Speech Recognition Than Siri”

Leave a ReplyCancel reply

Search

Never miss a hack

If you missed it

NPAPI And The Hot-Pluggable World Wide Web

The Time Clock Has Stood The Test Of Time

The Rise And Fall Of The In-Car Fax Machines

How Advanced Autopilots Make Airplanes Safer When Humans Go AWOL

2025: As The Hardware World Turns

Our Columns

For The Fun Of It

Fighting Food Poisoning With A Patch

Hackaday Podcast Episode 352: Visualizing Sound, And Windows 11 Is A Dog

How Do PAL And NTSC Really Work?

Linux Fu: Yet Another Shell Script Trick

42 thoughts on “An Arduino With Better Speech Recognition Than Siri”

Leave a ReplyCancel reply

Search

Never miss a hack

Subscribe

If you missed it

Our Columns