Friday Hack Chat: Hacking Voice Assistants

The future of consumer electronics is electronic voice assistants, at least that’s what the manufacturers are telling us. Everything from Alexas to Google Homes to Siris is invading our lives, and if predictions hold, your next new car might just have a voice assistant in it. It’s just a good thing we have enough samples of Majel Barrett’s voice for a quality virtual assistant.

For this week’s Hack Chat, we’re going to be talking all about voice interfaces. There are hundreds of Alexa and Google Home hacks around, but this is just the tip of the iceberg. What else can we do with these neat pieces of computer hardware, and how do we get them to do it?

Our guest for this week’s Hack Chat will be Nadine Lessio, a designer and technologist out of Toronto with a background in visual design and DIY peripherals. Nadine holds an MDes from OCADU, where she spent her time investigating the Internet of Things through personal assistants. Currently, she’s working at OCADU’s Adaptive Context Environments Lab, where she’s researching how humans and devices work together.

During this Hack Chat, Nadine will be talking about voice assistants and answering questions like:

  • What languages can be used to program voice assistants?
  • How do you use voice and hardware together?
  • What goes into the UX of a voice assistant?
  • How do these assistants interface with microcontrollers, Pis, and other electronics platforms?

You are, of course, encouraged to add your own questions to the discussion. You can do that by leaving a comment on the Hack Chat Event Page and we’ll put that in the queue for the Hack Chat discussion.

Our Hack Chats are live community events in the Hackaday.io Hack Chat group messaging. This week is just like any other, and we’ll be gathering ’round our video terminals at noon, Pacific, on Friday, July 13th. Need a countdown timer? Yes you do.

Click that speech bubble to the right, and you’ll be taken directly to the Hack Chat group on Hackaday.io.

You don’t have to wait until Friday; join whenever you want and you can see what the community is talking about.

17 thoughts on “Friday Hack Chat: Hacking Voice Assistants”

    1. (Not to plug my project, but) I actually have a Hackaday project on this:
      https://hackaday.io/project/32425-modular-smart-speaker-assistant-jarvis-pi

      It uses a Raspberry Pi Zero and a couple of open-source projects, specifically PocketSphinx: https://github.com/cmusphinx/pocketsphinx
      While not at the same level as something processed by a cloud service, it can accomplish a targeted subset of actual commands completely offline through some optimization techniques, even on a low-power computer like the Raspberry Pi Zero.
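
      For anyone curious what that kind of targeted, offline recognition can look like, here is a minimal sketch of keyword spotting with the PocketSphinx Python bindings. The wake phrase, threshold value, and the assumption that the older pocketsphinx package’s LiveSpeech API is available are mine, not details from the project above:

      ```python
      from pocketsphinx import LiveSpeech

      # Keyword spotting: listen for one phrase, fully offline, no cloud.
      # Both words must exist in the pronunciation dictionary, and
      # kws_threshold trades false alarms against missed detections,
      # so it needs tuning per phrase and per microphone.
      speech = LiveSpeech(lm=False, keyphrase='hey jarvis', kws_threshold=1e-20)

      for phrase in speech:
          print('Wake phrase heard:', phrase)
      ```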

        1. Running large vocabulary speech recognition on the very limited resources of an rPi is a challenge on its own. Why not start from a reasonably sized machine (4 cores, 4 GB) and then painfully scale the system down to the rPi’s limited resources?
          A highly optimized speech recognition decoder can drastically reduce the required computational resources, but it has only an indirect impact on overall recognition accuracy. I think that acoustic models, the lexicon, language models, and voice activity detection (along with any audio pre-processing steps) have a more drastic impact on overall speech recognizer accuracy. Please note that compiling a 40,000+ word phonetic lexicon is a boring and time-consuming task. Training acoustic models requires hundreds of hours of manually transcribed spoken data, while language model training needs lots of textual data. Moreover, ASR in general, and model training in particular, require some black-magic/hacker spirit :-). Therefore I would start by looking at which pre-trained models are already available for free, for example: http://kaldi-asr.org/models.html

          1. OK, this limits vocabulary size and hence computational requirements. This is a sensible choice for recognizing simple commands and queries like “turn on the light” or “is it going to rain?”. However, it will be next to impossible to recognize commands such as “play ZZZZZZ” or “search TTTTTTT on wikipedia”, where ZZZZZZ can be any obscure group/song/album title and TTTTTTTT any obscure topic available on wikipedia.
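
            To make that trade-off concrete, a grammar-constrained decoder only ever hypothesizes phrases from a fixed command set. A rough sketch, assuming the pocketsphinx Python package passes a jsgf option through to the decoder (the grammar contents and file name are made up for illustration):

            ```python
            from pocketsphinx import LiveSpeech

            # A tiny JSGF grammar: the decoder can only hypothesize these
            # patterns, which keeps both vocabulary size and CPU load small.
            GRAMMAR = (
                "#JSGF V1.0;\n"
                "grammar home;\n"
                "public <command> = (turn on | turn off) the (light | fan)"
                " | is it going to rain;\n"
            )

            with open('home.gram', 'w') as f:
                f.write(GRAMMAR)

            # lm=False disables the large language model; the grammar replaces it.
            for phrase in LiveSpeech(lm=False, jsgf='home.gram'):
                print('Command:', phrase)
            ```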

      1. Do you have a schematic/sketch for the hardware hookup? I can’t figure out what you’re doing with the transformers and/or three microphones.

        I ask b/c the Alexa / others do some pretty complicated beamforming stuff with an array of 8 mics that supposedly helps with the audio. There’s a ton of clever front-end preprocessing that you can do as well if you’re only interested in human voices. (Of course, if you’re not CPU constrained, go ahead and DSP it…)

        OK, I’m placing an order for a couple Pis right now. Hacks where my mouth is and all that…
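
        On the front-end preprocessing point above: nothing here resembles real multi-mic beamforming, but even a simple voice-band filter can help when the interfering noise sits outside the speech band. A rough sketch with scipy, where the sample rate and band edges are just typical values rather than anything from this build:

        ```python
        import numpy as np
        from scipy.signal import butter, sosfiltfilt

        def voice_bandpass(samples, rate, low=300.0, high=3400.0):
            # Keep roughly the telephone voice band; discard rumble and hiss.
            sos = butter(4, [low, high], btype='bandpass', fs=rate, output='sos')
            return sosfiltfilt(sos, samples)

        # Example: filter one second of synthetic noisy audio at 16 kHz.
        rate = 16000
        noisy = np.random.randn(rate)
        cleaned = voice_bandpass(noisy, rate)
        ```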

        1. The audio transformers are for the speaker amplifier: a homemade mini ground-loop isolator.

          I am using three electret condenser mics in parallel with an off-the-shelf amplifier module.
          They seem to be quite sensitive wired this way.
          Those are connected to the USB audio amplifier input.

          I am using some sound processing in my scripts, mostly built-in Linux utilities like ALSA plugins and SoX.
          If you have any recommendations, I am open to trying them.
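
          For what it’s worth, the SoX side of that kind of cleanup can be driven from a script. A hedged sketch, where the file names and effect parameters are placeholders rather than values from the project:

          ```python
          import subprocess

          # High-pass the capture to cut mains hum and rumble, then
          # normalize the level to -3 dBFS before handing it to the recognizer.
          subprocess.run(
              ['sox', 'capture.wav', 'cleaned.wav', 'highpass', '100', 'norm', '-3'],
              check=True,
          )
          ```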

  1. Buffer overflows in vocoder and A.I. stacks… remote code execution payloads delivered through sound and pictures… something that is super advanced and will be making headlines soon. Advanced because you have to RE a lot of codec and parsing stuff, fuzz allocations, etc.

    ex: Jailbreak an iPhone with a picture while in retina parsing mode

  2. I actually wrote a paper on this: https://drive.google.com/file/d/1ByrSzbkMNXoF-iJ1uwdhpyixC0_7D1Wy/view?usp=sharing

    Here’s the abstract: “The recent surge in the performance of speech recognition has led to the rapid proliferation and adoption of a variety of its applications. However, possible vulnerabilities within these systems have the potential to be rather critical. Previous research has shown how components of speech recognition applications such as preprocessing and hardware can be leveraged by malicious actors. However, a method leveraging neural networks used inside of speech recognition systems is notably absent. Hence, a method was developed that could enable an adversary to craft noises that could be added to the input to deliberately cause misclassification. Not only is this attack inconspicuous, but the crafted noises are both universal and transformable, increasing the feasibility and practicality of this attack.”
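
    The paper’s exact method isn’t reproduced here, but the general shape of gradient-crafted additive noise looks something like a targeted FGSM step. A generic sketch in PyTorch, where the model, input features, and attacker-chosen label are all placeholders:

    ```python
    import torch
    import torch.nn.functional as F

    def targeted_noise(model, features, wrong_label, epsilon=0.01):
        # Craft a small additive perturbation that nudges the classifier
        # toward an attacker-chosen label (a generic FGSM-style step,
        # not the paper's specific algorithm).
        x = features.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), wrong_label)
        loss.backward()
        # Step against the gradient to decrease the loss on the wrong label.
        return (-epsilon * x.grad.sign()).detach()
    ```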

    1. You should have gone against the grain and mentioned that even after configuration and obvious memory bugs are gone, there are still entire classes of bugs lying dormant, and at least a fraction of them are reachable through injection vectors like the network and parsing.
