Roll Your Own Amazon Echo on a Raspberry Pi

Speech recognition coupled with AI is the new hotness. Amazon’s Echo is a pretty compelling device, for a largish chunk of change. But if you’re interested in building something similar yourself, it’s just gotten a lot easier. Amazon has opened up a GitHub repository with instructions and code that will get you up and running with their Alexa Voice Service in short order.

If you read Hackaday as avidly as we do, you’ve already read that Amazon opened up their SDK (confusingly called a “Skills Kit”) and that folks have started working with it already. This newest development is Amazon’s “official” hello-world demo, for what that’s worth.

There are also open source alternatives, so if you just want to get something up and running without jumping through registration and licensing hoops, you’ve got that option as well.

Whichever way you slice it, there seems to be a real interest in having our machines listen to us. It’s probably time for an in-depth comparison of the various options. If you know of a voice recognition system that runs on something embeddable — a single-board computer or even a microcontroller — and you’d like to see us look into it, post up in the comments. We’ll see what we can do.

Thanks to [vvenesect] for the tip!

40 thoughts on “Roll Your Own Amazon Echo on a Raspberry Pi”

  1. It’s not your own Amazon Echo, or even a half-decent substitute, unless it supports a wake word. Which I haven’t seen any evidence the Alexa Voice Service provides.

    I haven’t found ANY decent way to implement a wake word on a DIY project. Folks keep telling me to use general-purpose speech recognition engines, like Sphinx. I’m guessing none of the people making these suggestions have ever tried it. The number of false positive/negative detections is ridiculous, even with lots of tweaks and training.

    Proper wake word support seems not to use general-purpose speech recognition at all. I know Google uses a set of filters and neural nets, highly trained to recognize one word only. That approach yields excellent results, with minimal CPU and memory usage, and without constantly streaming every sound to the cloud. I suspect the other big players use a similar, if not identical approach.

    That’s what we need MOST at this point. Without this piece of the puzzle, DIY speech-enabled projects are never really going to be very useful.

    1. I would also add we need a good far-field microphone solution like the one in the Echo. Otherwise, we’re all stuck within a few feet of a regular mic, carrying around a mic (or cellphone), or wearing some sort of [bluetooth] headset. That’s why I doubt the Amazon Tap will ever take off. The Tap is an Echo with a normal mic and no wake word – the two best selling points of the original.

      1. I’m sure all the tin foil hat people (seems like half of hackaday) will appreciate the button…but of course they’ll think it’s all just a ruse and it’s listening all the time and sending the voice data to every three letter agency, plus Amazon, Apple, Microsoft and Google, even in the face of router logs that show otherwise.

        I suspect the lack of a good microphone may be part of the OP’s problem…

        1. It’s not an issue of the microphone quality, but of background noise. I usually have music playing.

          I want to speak the wake word/phrase, from anywhere in the room and at a louder volume than the music, and have it recognized. It can then automatically turn down the music if needed, for good recognition of the commands to follow. Older versions of desktop Chrome would detect the “Ok Google” wake word. It recognized it successfully in my scenario, almost without any false positive/negative detections at all. And it did so even with an awful mic. There’s just no easy way to integrate this functionality into a project of your own. (Chrome only listens for the wake word when it’s the active window, with Google in the current tab. I set that up in a VM which I never touched, so that it would always be listening even when I’m using my computer for other stuff. And scanned the screen to determine when Chrome heard the wake word. But that’s horribly inelegant, and wasteful of resources.)

          Contrast that to a general-purpose speech recognition engine which assumes if it’s active, then someone must be speaking; and attempts to fit *everything* it hears to speech. Try this. Play some music with no lyrics, and watch it spit out words continuously, even though no speech is present whatsoever. Cough or make your chair creak in a silent room, more words. Give it a dictionary consisting only of the wake word, and it will hear that word in every noise. Give it a large dictionary, and it frequently fails to detect the wake word in the presence of background noise. A better microphone often makes this worse instead of better, by giving the engine more non-speech sound to mistranslate.

          Seriously, we just need a good hacker-friendly wake word detector.

          1. That’s the second suggestion for “wake on whistle”, first made by Elliot. It’s so retro, I honestly didn’t consider it. But it would be easy to implement. I just wonder how immune it could be made against false positives. What could one easily whistle, that is unique enough that it wouldn’t be found in music?

            Maybe the “officer on deck” whistle from old Star Trek?
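For anyone curious how cheap “wake on whistle” could be, here is a minimal Python sketch. It is not taken from Elliot’s link or any product; it assumes the Goertzel algorithm as the single-frequency detector, and the function names, the 8 kHz sample rate, and the two-note 1000 Hz then 2000 Hz pattern are all made up for illustration. Requiring two notes in a fixed order is just one guess at how to cut down on false positives from music.

```python
import math


def goertzel_power(samples, sample_rate, freq):
    """Return the squared DFT magnitude at the bin nearest to `freq`."""
    n = len(samples)
    k = round(n * freq / sample_rate)  # nearest DFT bin for this frame length
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def is_tone(samples, sample_rate, freq, min_ratio=0.5):
    """True when most of the frame's energy sits at `freq`.

    The ratio is scaled so a pure sine exactly on the bin scores about 1.0
    and broadband noise scores close to 0. The 0.5 threshold is an
    illustrative assumption, not a tuned value.
    """
    total = sum(x * x for x in samples)
    if total == 0:
        return False
    ratio = goertzel_power(samples, sample_rate, freq) / (total * len(samples) / 2.0)
    return ratio >= min_ratio


def detect_sequence(frames, sample_rate, freqs, min_ratio=0.5):
    """True if the tones in `freqs` show up in order across the frames."""
    idx = 0
    for frame in frames:
        if is_tone(frame, sample_rate, freqs[idx], min_ratio):
            idx += 1
            if idx == len(freqs):
                return True
    return False


def sine_frame(freq, sample_rate, length):
    """Helper for experiments: one frame of a pure sine wave."""
    return [math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(length)]
```

To try it, build a few frames with `sine_frame(1000, 8000, 200)` and `sine_frame(2000, 8000, 200)` and pass them to `detect_sequence(frames, 8000, [1000, 2000])`. A real version would need a timeout between notes and some tolerance for a whistle that drifts off pitch.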

          2. In the mid 80’s there was a little toy VW van with voice control. It took four commands: forward, right, left, back. It calculated ratios between different audio frequencies in a specific order, which let it recognize these different words. It was not speaker-specific and needed no training, of course. I have not read the terms on this forum, so I won’t tell you my favorites, but the fun part was the words you could substitute for the commands and still make it move.

            Something similar is doable in software with no need for server voice recognition in the traditional way.
            Here’s a video on the toy controller chip:

            Another way is with the Philips IC: http://www.futurlec.com/News/Philips/SpeechChip.html
            It can learn 100 words and understand simple sentences. I remember there being an Arduino shield with one. Lots of hits on “Arduino robot voice recognition”.

          3. Another edit: The IC in the video is more advanced than the one I describe, and I think I first saw this car in the late 70’s. So this kind of frequency lookup state machine is really old tech.

          4. I wonder if a two-stage two-word filter approach would work…

            1. A “dirty” wake word detector running on the user side (on the Pi). The detector skews toward false positives (but isn’t /too/ bad). Recognizing the second word activates Alexa and starts streaming.
            2. You then pass the second word to Alexa in the cloud. I’m assuming Alexa’s speech or command recognition is better/easier than rolling your own (maybe not…I played with the Echo once or twice). The cloud’s job at this point is to recognize the second word.
            a. Second word is recognized: get an “OK” signal back. The user side then starts streaming to Alexa.
            b. Second word is /not/ recognized: nothing happens (or rather a “NOT OK, FOO” is sent back).

            or you can save the last 0.5 seconds and re-send the first word to the cloud….

            Of course this does not solve the sound problems that are addressed by the far-field microphone array / tech…

      2. Building the hardware for a far-field microphone would be really easy with the TI chips (the same as in the Echo); the trouble lies in setting up the parameters/firmware for these (it uses the proprietary TI PurePath software).

    2. That makes a ton of sense. The wake word needs to be unambiguously detected and is independent of everything else, so you’d gain a lot by designing a special-purpose detector for it.

      Wake word is finding a needle in a haystack — after that it’s just identifying which needle you’ve just been handed.

      It must help to pick something totally improbable as your wake word. Makes me think “OK Google” is a horrid choice. “Alexa” or “Cortana” don’t seem all that much better. They must have _strong_ algorithms. I’m gonna call mine “Bandersnatch”.

      Or wake on whistle. http://www.limpkin.fr/index.php?post/2013/04/26/The-whistled%3A-how-to-remake-a-dozen-years-old-project-the-right-way

      1. When Ericsson first introduced voice dialing on their phones around 2000, they suggested Abracadabra as the wake word (magic word was their term for it), but it could be anything.

    3. I would use (or rather will, because I have a parrot toy robot with limited speech recognition that needs a Pi Zero or micro Odroid C2) the Speakup Cortex-M4 for wake words and limited other actions to save energy.

      Twelve years ago I tried out Sphinx and other ASR engines under Linux. The best result I got was with the IBM ViaVoice Linux SDK used in conjunction with Mr.House (free home automation software). With a headset it worked great (e.g. “Computer Music Louder” with music in the background), and all that on a single-core 1.6GHz Intel that was in use as a desktop at the same time.
      Later (8 years ago), I got good results with the Simon-listens project: http://www.simon-listens.org/index.php?id=122&L=1 . It used the Julius ASR: https://en.wikipedia.org/wiki/Julius_(software) and the HTK (ASR): http://htk.eng.cam.ac.uk

      Sound input quality IS essential: without a good far-field microphone AND a bit of automated sound engineering you won’t get anywhere. That’s where it gets tricky with Linux and USB mics, but reverse engineering the Echo’s far-field mics could be done (see other comment).

    4. What about doing a two-layer, two-word setup?

      1. A “dirty” wake word recognizer that runs on your hardware (in my case, Pi) that skews toward false positives (but isn’t /too/ terrible). This starts streaming to Alexa.

      2. Alexa then recognizes the second word. I’m gambling that Alexa’s speech recognition is more impressive than rolling your own. I haven’t tried it other than playing with an Echo once or twice…
      a. If this word is /also/ recognized, an “OK” signal is sent back to the device, and you are now directly talking to Alexa.
      b. If the word is not recognized, nothing happens (or rather, an “erm…wut?” is sent back to the device).

      …Or you can just use the same word twice, by continuously recording the last 0.5 seconds of sound.

      Of course, this won’t solve any of the sound quality problems supposedly addressed by the microphone array / tech in the Echo….
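The two-layer idea, including the “keep the last 0.5 seconds” trick, could be wired up roughly like the Python sketch below. Everything here is an assumption for illustration: `local_detector` and `remote_confirm` are hypothetical callables standing in for the “dirty” on-device detector and for whatever the cloud side would do, and nothing in it is the actual Alexa Voice Service API.

```python
from collections import deque


class TwoStageGate:
    """Gate streaming behind a sloppy local detector plus a remote check.

    local_detector: hypothetical callable, frame -> bool (cheap, on the Pi).
    remote_confirm: hypothetical callable, list of frames -> bool (stand-in
        for sending the buffered audio to a cloud recognizer).
    """

    def __init__(self, local_detector, remote_confirm, frame_ms=20, preroll_ms=500):
        self.local_detector = local_detector
        self.remote_confirm = remote_confirm
        # Ring buffer holding roughly the last `preroll_ms` of audio frames.
        self.preroll = deque(maxlen=max(1, preroll_ms // frame_ms))
        self.streaming = False

    def feed(self, frame):
        """Process one frame; return "idle", "rejected", or "streaming"."""
        if self.streaming:
            return "streaming"
        self.preroll.append(frame)
        if not self.local_detector(frame):
            return "idle"
        # Stage 1 fired: hand the buffered audio to stage 2 for confirmation.
        if self.remote_confirm(list(self.preroll)):
            self.streaming = True
            return "streaming"
        # Stage 2 said no: drop the buffer so the same noise is not re-sent.
        self.preroll.clear()
        return "rejected"
```

Because the ring buffer already contains the audio that triggered stage 1, the second stage can re-check the same word instead of asking the user to say a second one. The cost of a local false positive is then one short upload rather than an open stream.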

  2. I suspect Amazon published this to make people think the new Echo Dot ($90) is a good deal in comparison to hacking one together. I just received two Dots today, and they’re (almost) everything I wanted the original Echo to be (USB power so it can run on batteries, and a stereo output jack for connecting to a real audio system).

    1. Totally agree. In contrast to my comment above about Amazon removing the two best features of the Echo for the Tap, they kept them in the Dot and only removed the part that we can most easily provide ourselves – the speakers. What I really expected when the Dot was released was for the Dot to have an inductive charger, and for Amazon to sell additional pads you could place around your house. Like keep one pad in the kitchen, one on the end table beside your couch, one in the garage, etc. Just pick up the Dot and move it to the nearest pad and forget about it till you need it.

      1. But the Dot doesn’t have batteries (because batteries & always-on speech recognition aren’t a good fit for more than brief use). My comment about running the Dot on batteries meant that a USB battery pack could be used short-term or, better, it can be used in a vehicle without the power inverter that would be needed by the original Echo.

  3. Bought an Echo, used it for a few days. Forgot about it. I keep trying to get myself to use it, but as I don’t care about sports and have a real sound system (old 5.1 Pro Logic II hooked up to an Ouya connecting through a USB Sound Blaster), that’s a no-go. Of course my biggest issue is connectivity: it really hates my 63-char password for the WiFi, that or it just doesn’t play well with my routers. Can anyone tell me what these things are actually useful for?

    Also, once I found signs of an SSH server running on it I called, emailed, and filed a web form with Amazon support trying to find out the password/credentials, not surprisingly to no avail. All I bought the darn thing for was to get root access and use its AMAZING mic array for a real VI. Sorry Amazon, but your crappy assistant and I just don’t work well together. It’s not the cloud connectivity issue alone, but it really annoys me knowing I can’t risk using it as an alarm for fear it might have another (dreaded) connectivity-related failure and I sleep into work time.

    Anyway, if anyone knows of a sub-$200 mic array similar to the Echo’s please let me know; the closest I’ve found was $999+. (I’d throw $100 hard-earned bucks at the kind fella who figures out a way in to our Echos, rooted access that is.)

    1. Add another $1000+ from me! Hearing-impaired people really need access to that mic array for speech-to-text!!! Amazon, please expose the API!!!

      These ones aren’t as good:
      $100 for the Samson: http://www.juno.co.uk/products/samson-go-mic-connect-portable-stereo-usb/561561-01//?currency=GBP&flt=1&gclid=Cj0KEQjw5ti3BRD89aDFnb3SxPcBEiQAssnp0k-6DNwBYZwcVsiaAGbtwDtddsEl1rR2g95T9rvDk2YaAuBA8P8HAQ

      $400 for Acoustic Magic

      Please please please PM me if you have a solution!!

      Thanks
      Stephen Morrell

  4. A search for “beam forming microphone array” turns up links like these:

    http://research.microsoft.com/en-us/projects/microphone_array/

    https://www.researchgate.net/publication/220735082_A_portable_USB-based_microphone_array_device_for_robust_speech_recognition

    http://www.ecs.umass.edu/ece/sdp/sdp14/…/Team15FinalMDRReport.pdf

    Some of these dig heavily into math, and unfortunately that’s not a strong suit of mine. The data is out there…it just needs to be parsed into an easier-to-digest form for everyone to understand and use.
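As a starting point for the easier-to-digest version, the simplest beamforming idea is delay-and-sum: shift each microphone’s signal so that sound from one chosen direction lines up, then average. The Python sketch below is not taken from any of the linked papers. It assumes a straight-line array, a far-away source, a speed of sound of 343 m/s, and delays rounded to whole samples (real designs use fractional delays and more math); the function names are invented for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, rough room-temperature value


def arrival_delays(mic_positions, angle_deg, sample_rate):
    """How many samples later the sound reaches each mic, relative to the first hit.

    mic_positions: positions in metres along the array axis.
    angle_deg: source direction measured from straight ahead (broadside),
        positive toward the larger positions, so those mics hear it earlier.
    """
    sin_a = math.sin(math.radians(angle_deg))
    raw = [-p * sin_a / SPEED_OF_SOUND * sample_rate for p in mic_positions]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]


def delay_and_sum(channels, delays):
    """Undo each channel's arrival delay, then average the aligned samples.

    Sound from the steered direction adds up coherently; sound from other
    directions is misaligned across channels and partly cancels.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]
```

For example, `arrival_delays([0.0, 0.05, 0.10], 90, 16000)` gives the per-mic shifts for a source off the end of a three-mic array with 5 cm spacing, and those shifts go straight into `delay_and_sum` along with the recorded channels.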

  5. So does this project work without pressing the Start Listening button? I have tried the project on GitHub which was released by Amazon, and I had to press this button before starting the voice recognition. I am thinking of pairing the Raspberry Pi with a Bluetooth speaker with an integrated microphone and controlling it from anywhere just with my voice, for example on the balcony or while taking a shower. Not sure how sensitive the built-in microphone will be, but I think it is worth trying.
