Robust Speech-to-Text, Running Locally On Quest VR Headset

August 16, 2024 by Donald Papp 3 Comments

[saurabhchalke] recently released whisper.unity, a Unity package that implements whisper locally on the Meta Quest 3 VR headset, bringing nearly real-time transcription of natural speech to the device in an easy-to-use way.

Whisper is a robust and free open source neural network capable of quickly recognizing and transcribing multilingual natural speech with nearly-human level accuracy, and this package implements it entirely on-device, meaning it runs locally and doesn’t interact with any remote service.

It used to be that voice input for projects was a tricky business with iffy results and a strong reliance on speaker training and wake-words, but that’s no longer the case. Reliable and nearly real-time speech recognition is something that’s easily within the average hacker’s reach nowadays.

We covered Whisper getting a plain C/C++ implementation which opened the door to running on a variety of platforms and devices. [Macoron] turned whisper.cpp into a Unity binding which served as inspiration for this project, in which [saurabhchalke] turned it into a Quest 3 package. So if you are doing any VR projects in Unity and want reliable speech input with a side order of easy translation, it’s never been simpler.

ChatGPT Powers A Different Kind Of Logic Analyzer

April 6, 2023 by Dan Maloney 17 Comments

If you’re hoping that this AI-powered logic analyzer will help you quickly debug that wonky digital circuit on your bench with the magic of AI, we’re sorry to disappoint you. But you’re in luck if you’re in the market for something to help you detect the logical fallacies someone spouts in conversation. With the magic of AI, of course.

First, a quick review: logical fallacies are errors in reasoning that lead to the wrong conclusions from a set of observations. Enumerating the kinds of fallacies has become a bit of a cottage industry in this age of fake news and misinformation, to the extent that many of the common fallacies have catchy names like “Texas Sharpshooter” or “No True Scotsman”. Each fallacy has its own set of characteristics, and while it can be easy to pick some of them out, analyzing speech and finding them all is a tough job.

Continue reading “ChatGPT Powers A Different Kind Of Logic Analyzer” →

Cursive Out Loud: Dealing With Dragons

October 28, 2022 by Kristina Panos 21 Comments

When we last left this broadening subject of handwriting, cursive, and moveable type, I was threatening to sing the praises of speech-to-text programs. To me, these seem like the summit of getting thoughts committed to what passes for paper these days.

A common thread in humanity’s tapestry is that we all walk around with so much going on in our heads, and no real chance to get it out stream-of-consciousness style without missing a word — until we start talking to each other. I don’t care what your English teacher told you — talking turns to writing quite easily; all it takes is a willingness to follow enough of the rules, and to record it all in a readable fashion.

But, alas! That suggests that linear thinking is not only possible, but that it’s easy and everyone else is already doing it. While that’s (usually) not true, simply thinking out loud can get you pretty far down the road in a lot of mental vehicles. You just have to record it all somehow. And if your end goal is to have the words typed out, why not skip the voice recorder and go the speech-to-text route?

Continue reading “Cursive Out Loud: Dealing With Dragons” →

EMOJO Chatbot Will Be There For You

July 7, 2021 by Kristina Panos 1 Comment

We all need someone to talk to sometimes, and the pandemic has only made matters worse when it comes to the number of people living with anxiety and depression. Exchanging the simplest of pleasantries can make you feel whole again, but the masks make it hard to engage with strangers and judge their emotions, so your big trip to the grocery store can make you feel lonely in a crowd.

So you go back home, still feeling lonely, and maybe you turn on the TV. Watching people interact is probably the next best thing to actual interaction, and it might even make you laugh. But have you ever wished you could talk to the people on TV? With [aniketdhole]’s EMOJO chatbot, you’ll feel as though you’re among friends. And technically you are — all the dialogue is from the TV show Friends.

In Cast Away, Tom Hanks didn’t give that volleyball a frowny face, now did he? Nor does the volleyball have a dopey grin. Instead, it wears a wry smile that suggests depth of character and a grasp of the dire situation at hand. But now we have emoji, and they do a pretty good job of conveying and evoking emotion. EMOJO is a visual chatbot that uses voice and emoji to make easy, two-way conversation to help chase the loneliness away. It uses a Raspberry Pi and a TFT display to take voice input from a Bluetooth headset, convert it to text, and then respond in kind with both voice and text. It was a finalist in the rethink displays round of the Hackaday Prize, and we can’t wait to see how its character develops. Be sure to check out the demo after the break.

Continue reading “EMOJO Chatbot Will Be There For You” →

Ted The Talking Toaster

December 23, 2019 by Sharon Lin 9 Comments

The team behind [8 Bits and a Byte] have built a talking toaster. More accurately, they retrofitted their existing toaster with some hardware components to make it appear to talk and get angry at its users. While the actual toaster functionality isn’t necessary for the build, it certainly allows the project to have a more whimsical vibe.

The project uses a Raspberry Pi 3 and a Google AIY kit, consisting of a HAT, microphone, and speaker. Servos control the movement of the toaster’s eyebrows with the help of the HAT. Some decorative materials in the form of googly eyes and pipe cleaners help bring other features of the talking toaster to life.

The control flow for the chatbot uses Google’s speech-to-text to pick up text from audio input, the Dialogflow API to match intent, and Text-to-Speech to pipe a possible answer back to the Raspberry Pi to play over a speaker. They also used Remo.tv to broadcast live updates from the toaster to anyone on an online feed, allowing users in a chatroom to talk directly to Ted.

While Ted’s communications may be quite limited, there’s certainly no limit to the number of interactions he’ll be having online now!

Continue reading “Ted The Talking Toaster” →

Using AI To Pull Call Signs From SDR-Processed Signals

October 9, 2018 by Steven Dufresne 8 Comments

AI is currently popular, so [Chris Lam] figured he’d stimulate some interest in amateur radio by using it to pull call signs from radio signals processed using SDR. As you’ll see, the AI did just okay, so [Chris] augmented it with an algorithm invented for gene sequencing.

His experiment was simple enough. He picked up a Baofeng handheld radio transceiver to transmit messages containing a call sign and some speech. He then used a 0.5 meter antenna to receive it and a little connecting hardware and a NooElec SDR dongle to get it into his laptop. There he used SDRSharp to process the messages and output a WAV file. He then passed that on to the AI, Google’s Cloud Speech-to-Text service, to convert it to text.

Despite speaking his words one at a time and making an effort to pronounce them clearly, the result wasn’t great. In his example, only the first two words of the call sign and the actual message were correct. Perhaps if the AI had been trained on actual off-air conversations with background noise, it would have done better. It’s not quite the same issue, but we’re reminded of those MIT researchers who fooled Google’s Inception image recognizer into thinking that a turtle was a gun.

Rather than train his own AI, [Chris’s] clever solution was to turn to the Smith-Waterman algorithm. This is the same algorithm used for finding similar nucleic acid sequences when analyzing genes. It allowed him to use a list of correct call signs to find the best match for what the AI did come up with. As you can see in the video below, it got the call signs right.
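To give a feel for how this kind of matching works, here is a minimal Python sketch of Smith-Waterman local alignment used to pick the closest entry from a list of known call signs. It is not [Chris’s] code: the word-level comparison, the scoring values, the function names, and the example call signs are all assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between two sequences (strings or word lists).

    The scoring values are illustrative assumptions, not taken from the project.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + step,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # skip an element of a
                curr[j - 1] + gap,   # skip an element of b
            )
            best = max(best, curr[j])
        prev = curr
    return best


def best_call_sign(transcript, known_call_signs):
    """Return the known call sign whose words align best with the transcript."""
    heard = transcript.lower().split()
    return max(
        known_call_signs,
        key=lambda sign: smith_waterman_score(heard, sign.lower().split()),
    )


if __name__ == "__main__":
    # Hypothetical call signs, spelled out phonetically as they would be spoken.
    known = [
        "whiskey six xray yankee zulu",
        "kilo one alpha bravo charlie",
        "november two delta echo foxtrot",
    ]
    # A made-up transcript with two misheard words.
    print(best_call_sign("kilo one alfa bravo charley this is a test", known))
```

Because the alignment is local, the extra words after the call sign don’t count against a candidate, and a misheard word costs only a small penalty rather than ruling the match out.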

Continue reading “Using AI To Pull Call Signs From SDR-Processed Signals” →

Don’t Look Now, But Your Necklace Is Listening

September 7, 2018 by Tom Nardi 9 Comments

There was a time when the average person was worried about the government or big corporations listening in on their every word. It was a quaint era, full of whimsy and superstition. Today, a good deal of us are paying for the privilege to have constantly listening microphones in multiple rooms of our house, largely so we can avoid having to use our hands to turn the lights on and off. Amazing what a couple years and a strong advertising push can do.

So if we’re going to be funneling everything we say to one or more of our corporate overlords anyway, why not make it fun? For example, check out this speech-to-image necklace developed by [Stephanie Nemeth]. As you speak, the necklace listens in and finds (usually) relevant images to display. Conceptually this could be used as an assistive communication technology, but we’re cool with it being a meme display device for now.

Hardware-wise, the necklace is just a Raspberry Pi 3, a USB microphone, and a HyperPixel 4.0 touch screen. The Pi Zero would arguably be the better choice for hanging around your neck, but [Stephanie] notes that there are some compatibility issues with Node.js on the Zero’s ARMv6 processor. She details a workaround, but says there’s no guarantee it will work with her code.

The JavaScript software records audio from the microphone with SoX, and then runs that through the Google Cloud Speech-to-Text service to figure out what the wearer is saying. Finally it does a Google image search on the captured words using the custom search JSON API to find pictures to show on the display. There’s a user-supplied list of words to ignore so it doesn’t try looking up images for function words (such as “and” or “however”), though presumably it can also be used to blacklist certain imagery you might not want popping up on your chest in mixed company.
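The ignore-list step is simple enough to sketch. The snippet below is in Python rather than the project’s JavaScript, and it is not [Stephanie’s] code: the function name, the word list, and the tokenizing rule are assumptions for illustration. It takes a transcript as it might come back from a speech-to-text service and keeps only the words worth sending to an image search.

```python
import re

# Hypothetical ignore list; in the project this is supplied by the user.
IGNORED_WORDS = {"a", "an", "and", "however", "of", "the"}


def search_terms(transcript, ignored=IGNORED_WORDS):
    """Split a transcript into lowercase words and drop the ignored ones."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return [word for word in words if word not in ignored]


if __name__ == "__main__":
    # Each remaining word would become one image search query.
    print(search_terms("The cat and the dog, however"))
```

Keeping the list as a plain set makes the lookup cheap and lets the same mechanism double as a blocklist for words you’d rather not illustrate.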

We’d be interested in seeing somebody implement this software on a Raspberry Pi powered digital frame to display artwork that changes based on what the people in the room are talking about. Like in Antitrust, but without Tim Robbins offing anyone.