With only two hundred-odd days ’til Christmas, you just know we’re already feeling the season’s magic. Well, maybe not, but [Sean Dubois] has decided to give us a head start with this WebRTC demo built into a Santa stuffie.
The details are a little sparse (hopefully he finishes the documentation on GitHub by the time this goes out), but the project is really neat. Hardware-wise, it’s an audio-enabled ESP32-S3 dev board living inside Santa, running OpenAI’s Realtime Embedded SDK (as implemented by Espressif), with some customization by [Sean]. Looks like the audio is going through the newest version of libpeer, and the heavy lifting is all happening in the cloud, as you’d expect with this SDK. (An API key is required, but hey! It’s all open source; if you have a locally hosted AI that can do the job, you can probably figure out how to connect to it instead.)
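For the curious, the general pattern with this kind of Realtime setup is simple enough that swapping endpoints isn’t scary: the device generates a WebRTC SDP offer, POSTs it over HTTPS with the key as a bearer token, and the SDP answer comes back in the response body to be handed to the peer connection. Here’s a rough, untested sketch of that exchange in ESP-IDF terms — the URL, key, and function name are placeholders rather than anything from [Sean]’s code — just to show where a self-hosted, API-compatible endpoint would slot in.

```c
// Hypothetical sketch (ESP-IDF): exchange a WebRTC SDP offer with a
// Realtime-style endpoint. REALTIME_URL, API_KEY, and send_offer() are
// placeholders, not the SDK's actual internals.
#include <string.h>
#include "esp_http_client.h"
#include "esp_crt_bundle.h"
#include "esp_log.h"

#define REALTIME_URL "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
#define API_KEY      "sk-..."   /* or point REALTIME_URL at a self-hosted, API-compatible server */

static const char *TAG = "santa";

/* POST the local SDP offer; the reply body is the SDP answer that would be
 * handed back to the peer connection (e.g. libpeer). */
esp_err_t send_offer(const char *sdp_offer)
{
    esp_http_client_config_t cfg = {
        .url = REALTIME_URL,
        .method = HTTP_METHOD_POST,
        .crt_bundle_attach = esp_crt_bundle_attach,  /* TLS via the cert bundle */
    };
    esp_http_client_handle_t client = esp_http_client_init(&cfg);

    esp_http_client_set_header(client, "Authorization", "Bearer " API_KEY);
    esp_http_client_set_header(client, "Content-Type", "application/sdp");
    esp_http_client_set_post_field(client, sdp_offer, strlen(sdp_offer));

    esp_err_t err = esp_http_client_perform(client);
    if (err == ESP_OK) {
        ESP_LOGI(TAG, "answer status: %d", esp_http_client_get_status_code(client));
        /* ...read the response body here and feed it to the peer connection... */
    }
    esp_http_client_cleanup(client);
    return err;
}
```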
This speech-to-speech AI doesn’t need to emulate Santa Claus, of course; you can prime the AI with any instructions you’d like. If you want to delight children, though, it’s hard to beat the Jolly Old Elf, and you certainly have time to get it ready for Christmas. Thanks to [Sean] for sending in the tip.
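As a for-instance, Realtime-style APIs take that persona as part of the session configuration, so changing characters is a one-string edit. A hedged sketch follows — the event name and fields are assumptions on our part, not pulled from [Sean]’s firmware:

```c
// Hypothetical sketch: the kind of session-configuration message a
// Realtime-style API accepts over the data channel. Exact event name,
// fields, and voice are assumptions; the point is that the persona lives
// in a text prompt, so Santa is one string edit away from anything else.
static const char *session_update =
    "{"
      "\"type\": \"session.update\","
      "\"session\": {"
        "\"instructions\": \"You are Santa Claus. Be jolly, brief, and kid-friendly.\","
        "\"voice\": \"ash\""
      "}"
    "}";
/* send_json_over_datachannel(session_update);  -- however the SDK exposes it */
```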
If you like this project but want to avoid paying OpenAI API fees, here’s a speech-to-text model to get you started. We covered this AI speech generator last year to handle the talky bit. If you put them together and make your own Santa Claus (or perhaps something more seasonal to this time of year), don’t forget to drop us a tip!
Neat… but the turn-off for me is “and the heavy lifting is all happening in the cloud.” Back to the drawing board for me :). Got to be local or not at all at this house.
You can likely connect it to things running ollama – you’ll just need a beefy GPU and a similar LLM running first. The “cloud” option skips those steps.
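Roughly, the local route looks like this: ollama listening on its default port and something posting text at it. A quick sketch (build with -lcurl); the model tag and host are whatever you’ve pulled locally, not gospel:

```c
// Rough sketch of the local route described above: a box on the LAN runs
// ollama, and the device (or a bridge) posts text to it instead of a cloud
// API. Model name and host are assumptions.
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* ollama's generate endpoint; non-streaming for simplicity */
    const char *body =
        "{\"model\":\"llama2:13b\","
        "\"prompt\":\"Reply as Santa: what do you want for Christmas?\","
        "\"stream\":false}";

    struct curl_slist *hdrs = curl_slist_append(NULL, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:11434/api/generate");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(curl);  /* response JSON prints to stdout by default */
    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```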
Totally. The magic for me on this is that the board is powerful enough to do bi-directional audio streaming.
Since the device uses WebRTC, you could swap out the backend trivially. Use any of the WHIP/WHEP servers on https://webrtchacks.com/webrtc-cracks-the-whip-on-obs/
One I maintain (and tested the embedded stuff against) is https://github.com/glimesh/broadcast-box
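For anyone wondering what “trivially” means in practice: a WHIP publish is basically one HTTP POST of the SDP offer with the stream key as a bearer token, and the SDP answer comes back in the response body. A rough sketch against a local broadcast-box instance — the path, port, and key here are assumptions, so check the server’s README:

```c
// Minimal WHIP-style publish sketch (assumed endpoint layout): POST the SDP
// offer, receive the SDP answer in the response. "myStreamKey", the port,
// and the /api/whip path are assumptions; check the broadcast-box docs.
#include <curl/curl.h>

CURLcode whip_publish(const char *sdp_offer)
{
    CURL *curl = curl_easy_init();
    if (!curl) return CURLE_FAILED_INIT;

    struct curl_slist *hdrs = NULL;
    hdrs = curl_slist_append(hdrs, "Content-Type: application/sdp");
    hdrs = curl_slist_append(hdrs, "Authorization: Bearer myStreamKey");

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/api/whip");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, sdp_offer);

    CURLcode rc = curl_easy_perform(curl);  /* response body is the SDP answer */
    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc;
}
```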
Wonderful. Can’t wait for Halloween.
I have been trying to buy enough GPUs to do this entire thing at home.
Sure, I may only be able to run a 13B model, but if it’s text-to-speech, speech-to-text, and a 13B LLM all responding within milliseconds… that’s amazing!