Building A Smart Speaker Outside The Corporate Cloud

If you’re not worried about corporate surveillance bots scraping your shopping list and manipulating you through marketing, you can buy any number of off-the-shelf smart speakers for your home. Alternatively, you can roll your own like [arpy8] did, and keep your life a little more private.

The build is based around an ESP32 microcontroller. It connects to the ‘net via its inbuilt Wi-Fi connection, and listens out for your voice with an INMP441 omnidirectional microphone module. The audio data is trucked off to a backend server running a Whisper speech-to-text model. The resulting text is then passed to Google’s Gemini 2.5 Flash large language model. The generated response is passed to the Piper Neural Voice text-to-speech engine, and the synthesized audio is sent back to the ESP32, where it’s spat out via the device’s DAC output and a speaker attached to an LM386 amplifier. Basically, anything you could ask Gemini, you can ask this device.
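To make the flow of data a little more concrete, here is a minimal sketch of how the backend's three stages might chain together. This is not [arpy8]'s code: the `VoicePipeline` class, its field names, and the stand-in stage functions are all hypothetical, chosen only to illustrate the order of operations. In a real backend, each callable would wrap Whisper, the Gemini API, and Piper respectively.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains speech-to-text, LLM, and text-to-speech stages (hypothetical layout)."""
    transcribe: Callable[[bytes], str]   # assumed: wraps a Whisper model
    generate: Callable[[str], str]       # assumed: wraps a call to Gemini
    synthesize: Callable[[str], bytes]   # assumed: wraps Piper

    def handle(self, audio: bytes) -> bytes:
        """Turn one chunk of recorded audio into reply audio for the ESP32."""
        text = self.transcribe(audio)
        if not text.strip():
            # Nothing intelligible was heard, so skip the LLM call entirely.
            return b""
        reply = self.generate(text)
        return self.synthesize(reply)


# Stand-in stages so the sketch runs without any models or network access.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8", errors="ignore")


def fake_generate(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_generate, fake_synthesize)
reply_audio = pipeline.handle(b"hello")
```

Keeping each stage behind a plain callable means any one of them could be swapped out, say, for a locally hosted language model, without touching the rest of the chain.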

Because it relies on a commercial large language model, it’s not perfectly private by any means. Still, it’s at least a little further removed than a smart speaker that’s directly logged in to your Amazon/Google/Hulu/Beanstikk account. Files are on GitHub for those eager to dive into the code. We’ve seen some other fun builds along these lines before, too. Video after the break.
