Friday Hack Chat: Hacking Voice Assistants

The future of consumer electronics is electronic voice assistants, at least that’s what the manufacturers are telling us. Everything from Alexas to Google Homes to Siris is invading our lives, and if predictions hold, your next new car might just have a voice assistant in it. It’s just a good thing we have enough samples of Majel Barrett’s voice for a quality virtual assistant.

For this week’s Hack Chat, we’re going to be talking all about voice interfaces. There are hundreds of Alexa and Google Home hacks around, but this is just the tip of the iceberg. What else can we do with these neat pieces of computer hardware, and how do we get them to do it?

Our guest for this week’s Hack Chat will be Nadine Lessio, a designer and technologist out of Toronto with a background in visual design and DIY peripherals. Nadine holds an MDes from OCADU, where she spent her time investigating the Internet of Things through personal assistants. Currently, she’s working at OCADU’s Adaptive Context Environments Lab, where she’s researching how humans and devices work together.

During this Hack Chat, Nadine will be talking about voice assistants and answering questions like:

  • What languages can be used to program voice assistants?
  • How do you use voice and hardware together?
  • What goes into the UX of a voice assistant?
  • How do these assistants interface with microcontrollers, Pis, and other electronics platforms?

You are, of course, encouraged to add your own questions to the discussion. You can do that by leaving a comment on the Hack Chat Event Page and we’ll put that in the queue for the Hack Chat discussion.

Our Hack Chats are live community events in the Hackaday.io Hack Chat group messaging. This week is just like any other, and we’ll be gathering ’round our video terminals at noon, Pacific, on Friday, July 13th. Need a countdown timer? Yes you do.

Click that speech bubble to the right, and you’ll be taken directly to the Hack Chat group on Hackaday.io.

You don’t have to wait until Friday; join whenever you want and you can see what the community is talking about.

17 thoughts on “Friday Hack Chat: Hacking Voice Assistants”

    1. (Not to plug my project, but) I actually have a Hackaday project on this:
      https://hackaday.io/project/32425-modular-smart-speaker-assistant-jarvis-pi

      It uses a Raspberry Pi Zero and a couple of open-source projects, specifically PocketSphinx: https://github.com/cmusphinx/pocketsphinx
      While not at the same level as something processed by a cloud service, it can accomplish a targeted subset of actual commands completely offline through some optimization techniques, even on a low-power computer like the Raspberry Pi Zero.
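
      For anyone curious what that kind of targeted, offline recognition can look like, here is a minimal sketch of keyword spotting with the PocketSphinx Python bindings. The wake phrase, threshold value, and the assumption that the older pocketsphinx package’s LiveSpeech API is available are mine, not details from the project above:

      ```python
      from pocketsphinx import LiveSpeech

      # Keyword spotting: listen for one phrase, fully offline, no cloud.
      # Both words must exist in the pronunciation dictionary, and
      # kws_threshold trades false alarms against missed detections,
      # so it needs tuning per phrase and per microphone.
      speech = LiveSpeech(lm=False, keyphrase='hey jarvis', kws_threshold=1e-20)

      for phrase in speech:
          print('Wake phrase heard:', phrase)
      ```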

        1. Running large vocabulary speech recognition on the very limited resources of an rPi is a challenge on its own. Why not start from a reasonably sized machine (4 cores, 4 GB) and then painfully scale the system down to the rPi’s limited resources?
          A highly optimized speech recognition decoder can drastically reduce the required computational resources, but it has only an indirect impact on overall recognition accuracy. I think that acoustic models, the lexicon, language models, and voice activity detection (along with any audio pre-processing steps) have a more drastic impact on overall speech recognizer accuracy. Please note that compiling a 40,000+ word phonetic lexicon is a boring and time-consuming task. Training acoustic models requires hundreds of hours of manually transcribed spoken data, while language model training needs lots of textual data. Moreover, ASR in general, and model training in particular, require some black-magic/hacker spirit :-). Therefore I would start by looking at which pre-trained models are already available for free, for example: http://kaldi-asr.org/models.html

          1. OK, this limits vocabulary size and hence computational requirements. This is a sensible choice for recognizing simple commands and queries like “turn on the light” or “is it going to rain?”. However, it will be next to impossible to recognize commands such as “play ZZZZZZ” or “search TTTTTTT on wikipedia”, where ZZZZZZ can be any obscure group/song/album title and TTTTTTTT any obscure topic available on wikipedia.
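
            To make that trade-off concrete, a grammar-constrained decoder only ever hypothesizes phrases from a fixed command set. A rough sketch, assuming the pocketsphinx Python package passes a jsgf option through to the decoder (the grammar contents and file name are made up for illustration):

            ```python
            from pocketsphinx import LiveSpeech

            # A tiny JSGF grammar: the decoder can only hypothesize these
            # patterns, which keeps both vocabulary size and CPU load small.
            GRAMMAR = (
                "#JSGF V1.0;\n"
                "grammar home;\n"
                "public <command> = (turn on | turn off) the (light | fan)"
                " | is it going to rain;\n"
            )

            with open('home.gram', 'w') as f:
                f.write(GRAMMAR)

            # lm=False disables the large language model; the grammar replaces it.
            for phrase in LiveSpeech(lm=False, jsgf='home.gram'):
                print('Command:', phrase)
            ```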

      1. Do you have a schematic/sketch for the hardware hookup? I can’t figure out what you’re doing with the transformers and/or three microphones.

        I ask b/c the Alexa / others do some pretty complicated beamforming stuff with an array of 8 mics that supposedly helps with the audio. There’s a ton of clever front-end preprocessing that you can do as well if you’re only interested in human voices. (Of course, if you’re not CPU constrained, go ahead and DSP it…)

        OK, I’m placing an order for a couple Pis right now. Hacks where my mouth is and all that…
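
        On the front-end preprocessing point above: nothing here resembles real multi-mic beamforming, but even a simple voice-band filter can help when the interfering noise sits outside the speech band. A rough sketch with scipy, where the sample rate and band edges are just typical values rather than anything from this build:

        ```python
        import numpy as np
        from scipy.signal import butter, sosfiltfilt

        def voice_bandpass(samples, rate, low=300.0, high=3400.0):
            # Keep roughly the telephone voice band; discard rumble and hiss.
            sos = butter(4, [low, high], btype='bandpass', fs=rate, output='sos')
            return sosfiltfilt(sos, samples)

        # Example: filter one second of synthetic noisy audio at 16 kHz.
        rate = 16000
        noisy = np.random.randn(rate)
        cleaned = voice_bandpass(noisy, rate)
        ```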

        1. The audio transformers are for the speaker amplifier: a homemade mini ground-loop isolator.

          I am using three electret condenser mics in parallel with an off-the-shelf amplifier module.
          They seem to be quite sensitive wired this way.
          Those are connected to the USB audio amplifier input.

          I am using some sound processing in my scripts, mostly built-in Linux utilities like ALSA plugins and SoX.
          If you have any recommendations, I am open to trying them.
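
          For what it’s worth, the SoX side of that kind of cleanup can be driven from a script. A hedged sketch, where the file names and effect parameters are placeholders rather than values from the project:

          ```python
          import subprocess

          # High-pass the capture to cut mains hum and rumble, then
          # normalize the level to -3 dBFS before handing it to the recognizer.
          subprocess.run(
              ['sox', 'capture.wav', 'cleaned.wav', 'highpass', '100', 'norm', '-3'],
              check=True,
          )
          ```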

  1. Buffer overflows in vocoder and A.I. stacks… remote code execution payloads delivered through sound and pictures… something that is super advanced and will be making headlines soon. Advanced because you have to RE a lot of codec and parsing stuff, fuzz allocations, etc.

    ex: Jailbreak an iPhone with a picture while in retina parsing mode

  2. I actually wrote a paper on this: https://drive.google.com/file/d/1ByrSzbkMNXoF-iJ1uwdhpyixC0_7D1Wy/view?usp=sharing

    Here’s the abstract: “The recent surge in the performance of speech recognition has led to the rapid proliferation and adoption of a variety of its applications. However, possible vulnerabilities within these systems have the potential to be rather critical. Previous research has shown how components of speech recognition applications such as preprocessing and hardware can be leveraged by malicious actors. However, a method leveraging neural networks used inside of speech recognition systems is notably absent. Hence, a method was developed that could enable an adversary to craft noises that could be added to the input to deliberately cause misclassification. Not only is this attack inconspicuous, but the crafted noises are both universal and transformable, increasing the feasibility and practicality of this attack.”
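
    The paper’s exact method isn’t reproduced here, but the general shape of gradient-crafted additive noise looks something like a targeted FGSM step. A generic sketch in PyTorch, where the model, input features, and attacker-chosen label are all placeholders:

    ```python
    import torch
    import torch.nn.functional as F

    def targeted_noise(model, features, wrong_label, epsilon=0.01):
        # Craft a small additive perturbation that nudges the classifier
        # toward an attacker-chosen label (a generic FGSM-style step,
        # not the paper's specific algorithm).
        x = features.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), wrong_label)
        loss.backward()
        # Step against the gradient to decrease the loss on the wrong label.
        return (-epsilon * x.grad.sign()).detach()
    ```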

    1. You should have gone against the grain and mentioned that even after configuration and obvious memory bugs are gone, there are still entire classes of bugs lying dormant, and at least a fraction of them are reachable through injection vectors like the network and parsing.
