Voice recognition is becoming more and more common, but anyone who’s ever used a smart device can attest that they aren’t exactly foolproof. They can activate seemingly at random, fail to activate when called, or, most annoyingly, completely fail to understand voice commands. Thankfully, researchers from the University of Tokyo are looking to improve the performance of devices like these by attempting to use them without any spoken voice at all.
The project is called SottoVoce and uses an ultrasound imaging probe placed under the user’s jaw to detect internal movements of the speaker’s larynx. The images generated by the probe are fed into a series of neural networks trained on hundreds of speech patterns from the researchers themselves. The neural networks then piece together the likely sounds being made and generate an audio waveform, which is played to an unmodified Alexa device. Obviously a few improvements would need to be made to the ultrasound imaging device to make this usable in real-world situations, but it is interesting from a research perspective nonetheless.
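To make the idea concrete, here is a rough sketch of an image-sequence-to-audio pipeline like the one described above. This is not the authors’ actual architecture; the layer sizes, frame rate, and the choice of a mel-spectrogram output (which a separate vocoder would turn into the waveform sent to the Alexa) are all illustrative assumptions, written here in PyTorch.

```python
# Hedged sketch, not the paper's model: a small convolutional encoder
# summarizes each ultrasound frame of the larynx, a recurrent layer maps
# the frame sequence to per-frame audio features (a mel-spectrogram here),
# and a separate vocoder (not shown) would render the final waveform.
import torch
import torch.nn as nn

class UltrasoundToSpeech(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        # Per-frame encoder: 1-channel ultrasound image -> feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch*time, 32, 1, 1)
            nn.Flatten(),             # -> (batch*time, 32)
        )
        # Sequence model over frames -> per-frame audio features
        self.rnn = nn.GRU(input_size=32, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, n_mels)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 1, height, width) ultrasound clip
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out)  # (batch, time, n_mels) mel-spectrogram estimate

# Example: a one-second clip of 64x64 ultrasound frames at 30 fps
model = UltrasoundToSpeech()
clip = torch.randn(1, 30, 1, 64, 64)
mel = model(clip)
print(mel.shape)  # torch.Size([1, 30, 80])
```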
The research paper with all the details is also available (PDF warning). It’s an intriguing approach to improving the performance and quality of voice recognition, especially in situations where the voice may be muffled, non-existent, or drowned out by background noise. Machine learning like this seems to be one of the more powerful tools for improving speech recognition, as we saw with this robot that can walk across town and order food for you using voice commands only.
The Ender’s Game series had these, as well as tablet computers. Now all we need is faster-than-light communications and relativistic travel speeds.
That’s what I was thinking too. And don’t forget Jane, the AI in the ‘net, that Ender was communicating with.
What about a related field, lip reading? HAL did it. But it is “really hard”, even for the human mind.
A vocabulary word immediately popped into my head when I read this: fricative. What about the fricatives? Those and other sounds are formed more in the upper mouth and lip area. Are they reasonably accurate?
Maybe with some self-discipline/training, e.g. many of us can speak coherently without moving our mouths if we think about it.
I read an article a few years ago, and basically it said people unconsciously form speech silently and that there are decipherable laryngeal movements while thinking. While you may not think you’re doing anything other than thinking, research says otherwise.
Different subset of language specifically for these devices?
Didn’t NASA nail this problem ages ago? They were even able to detect subvocalizations, so the user didn’t even need to actually make a sound; the tiny changes in electrical activity in the neck muscles were enough.
The military has had that for a while.
They tried using it to aim and fire automatic gun turrets and remotely operated vehicles.
SottoVoce is such an awesome name!
Named after this, no doubt:
https://youtu.be/o84uUs40ql4
There have been several devices like this over the past 5 years or so. Still cool and I hope it takes off
https://www.smithsonianmag.com/innovation/device-can-hear-voice-inside-your-head-180972785/
For ages (think WWII) the military had throat mic things that just used vibrations instead of sound itself. Called a voiceless mic, if I recall correctly. Used for loud aircraft and stuff. Sometimes they would be in movies where they pinch their neck when talking. I played with some surplus ones like 30 years ago. But if you want to use this to explore AI and neural net learning and stuff, fine.
They had bone conduction lollipops in the nineties. You could hear music playing through your teeth as you ate it. You just don’t get stuff like that these days.
Lol I remember those! Man, they made my teeth ache!