On Getting A Computer’s Attention And Striking Up A Conversation

With the rise of voice-driven virtual assistants over the years, the sight of people talking to various electronic devices in public and in private has become rather commonplace. While such voice-driven interfaces are decidedly useful for a range of situations, they also come with complications. One of these is the trigger phrase or wake word that a voice assistant listens for while in standby. Much like in Star Trek, where uttering ‘Computer’ would get the computer’s attention, so do we have our ‘Siri’, ‘Cortana’ and a range of custom trigger phrases that enable the voice interface.

Unlike in Star Trek, however, our virtual assistants do not know when we really desire to interact. Unable to distinguish context, they’ll happily respond to someone on TV mentioning their trigger phrase, possibly followed by a ludicrous purchase order or other mischief. This underlines the complexity of voice-based interfaces, which still lack any sense of self-awareness or intelligence.

Another issue is that the process of voice recognition itself is very resource-intensive, which limits the amount of processing that can be performed on the local device. This usually leads to voice assistants like Siri, Alexa and Cortana processing recorded voices in a data center, with obvious privacy implications.

Just Say My Name

Radio Rex, a delightful 1920s toy for young and old (Credit: Emre Sevinç)

The idea of a trigger word that activates a system is an old one, with one of the first known practical examples being roughly a hundred years old. This came in the form of a toy called Radio Rex, which featured a robot dog that would sit in its little dog house until its name was called. At that moment it would hop outside to greet the person calling it.

The way this was implemented was simple and rather limited, courtesy of the technologies available in the 1910s and 1920s. Essentially it used the acoustic energy of a formant corresponding roughly to the vowel [eh] in ‘Rex’. As noted by some, an issue with Radio Rex is that it is tuned for 500 Hz, which corresponds to the [eh] vowel when spoken by an (average) adult male voice.

This tragically meant that for children and women Rex would usually refuse to come out of its dog house, unless they used a different vowel that happened to put their voice in the 500 Hz range. Even then they were likely to run into the other major issue with this toy: the sheer acoustic pressure required. Essentially, some yelling might be required to make Rex move.

What is interesting about this toy is that in many ways ol’ Rex isn’t too different from how modern-day Siri and friends work. The trigger word that wakes them up from standby is less crudely interpreted, using a microphone and signal processing hardware and software rather than a mechanical contraption, but the effect is the same. In the low-power trigger search mode the assistant’s software constantly compares the incoming sound samples’ formants for a match with the sound signature of the predefined trigger word(s).
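A Rex-style single-frequency trigger can be sketched in a few lines of code. The example below uses the Goertzel algorithm, a cheap way to measure a signal’s energy at one frequency, to decide whether enough 500 Hz energy is present to ‘wake’ the dog. The 500 Hz target matches the toy; the sample rate and threshold are illustrative assumptions.

```python
import math

def goertzel_power(samples, sample_rate, target_freq):
    """Signal power at one frequency bin, computed via the Goertzel algorithm."""
    n = len(samples)
    k = round(target_freq * n / sample_rate)   # nearest DFT bin
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def rex_hears_name(samples, sample_rate=8000, threshold=1000.0):
    """Crude 'wake' decision: is there enough acoustic energy near 500 Hz?"""
    return goertzel_power(samples, sample_rate, 500.0) > threshold

# A 500 Hz tone wakes Rex; a 200 Hz tone does not.
fs = 8000
tone = lambda f: [math.sin(2 * math.pi * f * t / fs) for t in range(800)]
print(rex_hears_name(tone(500)))  # True
print(rex_hears_name(tone(200)))  # False
```

A real wake-word detector compares whole formant patterns over time rather than a single frequency bin, but the principle of continuously scanning incoming audio for one cheap acoustic signature is the same.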

Once a match has been detected and the mechanism kicks into gear, the assistant will pop out of its digital house as it switches to its full voice processing mode. At this stage a stand-alone assistant – as one might find in e.g. older cars – may use a simple Hidden Markov Model (HMM) to try and piece together the intent of the user. Such a model is generally trained on a fairly limited vocabulary, and will be specific to a particular language and often a regional accent and/or dialect to increase accuracy.

Too Big For The Dog House

The internals of the Radio Rex toy. (Credit: Emre Sevinç)

While it would be nice to run the entire natural language processing routine on the same system, the fact of the matter is that speech recognition remains very resource-intensive. Not just in terms of processing power, as even an HMM-based approach has to sift through thousands of probabilistic paths per utterance, but also in terms of memory. Depending on the vocabulary of the assistant, the in-memory model can range from dozens of megabytes to multiple gigabytes or even terabytes. This would obviously be rather impractical on the latest whizbang gadget, smartphone or smart TV, which is why this processing is generally moved to a data center.

When accuracy is considered to be even more of a priority – such as with the Google assistant when it gets asked a complex query – the HMM approach is usually ditched for the newer Long Short-Term Memory (LSTM) approach. Although LSTM-based RNNs deal much better with longer phrases, they also come with much higher processing and memory usage requirements.

With the current state-of-the-art in speech recognition moving towards ever more complex neural network models, it seems unlikely that progress in consumer hardware will catch up with these system requirements any time soon.

As a reference point of what a basic lower-end system on the level of a single-board computer like a Raspberry Pi might be capable of with speech recognition, we can look at a project like CMU Sphinx, developed at Carnegie Mellon University. The version that is aimed at embedded systems is called PocketSphinx, and like its bigger siblings it uses an HMM-based approach. In the Sphinx FAQ it’s mentioned explicitly that large vocabularies won’t work on SBCs like the Raspberry Pi due to the limited RAM and CPU power of these platforms.

When you limit the vocabulary to around a thousand words, however, the model may just fit in RAM and the processing will be fast enough to appear instantaneous to the user. This is fine if the voice-driven interface only needs decent accuracy, within the limits of the training data, while offering limited interaction. If the goal is to, say, allow the user to turn a handful of lights on or off, this may be sufficient. On the other hand, if the interface is called ‘Siri’ or ‘Alexa’, expectations are a lot higher.
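For such a handful-of-lights interface, the intent-matching stage can be almost trivially simple. The sketch below is a toy, not how PocketSphinx or any commercial assistant actually works: it maps a (possibly misrecognized) transcription onto the closest phrase in a tiny fixed command set using Python’s standard-library `difflib`; the commands and the 0.6 similarity cutoff are made-up values for illustration.

```python
import difflib

# A tiny fixed grammar: recognized phrase -> (device, on/off).
COMMANDS = {
    "turn on the kitchen light": ("kitchen", True),
    "turn off the kitchen light": ("kitchen", False),
    "turn on the porch light": ("porch", True),
    "turn off the porch light": ("porch", False),
}

def interpret(utterance):
    """Map a transcription to the closest known command, or None if nothing fits."""
    match = difflib.get_close_matches(utterance.lower(), COMMANDS, n=1, cutoff=0.6)
    return COMMANDS[match[0]] if match else None

print(interpret("turn on the kitchen lite"))  # ('kitchen', True)
print(interpret("play some jazz"))            # None
```

The appeal of a closed vocabulary is visible here: a slightly garbled transcription still lands on the right command, while anything outside the grammar is simply rejected rather than misinterpreted.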

Essentially, these virtual assistants are supposed to act like they understand natural language, the context in which it is used, and to reply in a way that is consistent with the way that the average civilized human interaction is expected to occur. Not surprisingly, this is a tough challenge to meet. Having the speech recognition part off-loaded to a remote data center, and using recorded voice samples to further train the model are natural consequences of this demand.

No Smarts, Just Good Guesses

Something which we humans are naturally pretty good at, and which we get further drilled on during our school years, is ‘part-of-speech tagging’, also called grammatical tagging. This is where we classify the parts of a phrase into their grammatical constituents, including nouns, verbs, articles, adjectives, and so on. Doing so is essential for understanding a sentence, as the meaning of words can change wildly depending on their grammatical classification, especially in languages like English with its common use of nouns as verbs and vice versa.

Using grammatical tagging we can then understand the meaning of a sentence. Yet this is not what these virtual assistants do. Instead, using the Viterbi algorithm (for HMMs) or an equivalent RNN approach, they determine the probability that the given input fits a specific subset of the language model. As most of us are undoubtedly aware, this is an approach that feels almost magical when it works, and makes you realize that Siri is as dumb as a bag of bricks when it fails to find an appropriate match.
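The Viterbi algorithm itself is surprisingly compact. The sketch below runs it on the classic textbook toy model (inferring hidden weather states from observed activities); the states and probabilities are the standard made-up example values, not anything from a real assistant, but the core idea carries over: the decoder does not understand the input, it merely finds the hidden-state sequence most likely to have produced it.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a sequence of observations."""
    # For each step, store the best path probability ending in each state,
    # plus a backpointer to the previous state on that path.
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
            row[s] = (prob, prev)
        best.append(row)
    # Trace the highest-probability path back from the final step.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
# ['Sunny', 'Rainy', 'Rainy']
```

In a speech recognizer the hidden states would be phonemes or words and the observations acoustic features, with models many orders of magnitude larger, which is exactly where the memory and processing costs discussed earlier come from.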

As demand for ‘smart’ voice-driven interfaces increases, engineers will undoubtedly work tirelessly to find ever more ingenious methods to improve the accuracy of today’s systems. The reality for the foreseeable future would appear to remain that of voice data being sent to data centers, where powerful server systems can perform the requisite probability curve fitting to figure out that you were asking ‘Hey Google’ where the nearest ice cream parlor is. Never mind that you were actually asking for the nearest bicycle store, but that’s technology for you.

Speak Easy

Perhaps slightly ironic about the whole natural language and computer interaction experience is that speech synthesis is more or less a solved problem. As early as the 1980s the Texas Instruments TMS (of Speak & Spell fame) and the General Instrument SP0256 Linear Predictive Coding (LPC) speech chips used a fairly crude approximation of the human vocal tract to synthesize a human-sounding voice.

Over the intervening years, LPC has become ever more refined for use in speech synthesis, while also finding use in speech encoding and transmission. By using a real-life human’s voice as the basis for an LPC vocal tract, virtual assistants can also switch between voices, allowing Siri, Cortana, etc. to sound like whatever gender and ethnicity appeals most to an end user.
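The core of LPC is modest: model each audio sample as a weighted sum of the previous few samples, with the weights derived from the signal’s autocorrelation. The sketch below implements the standard Levinson-Durbin recursion for finding those weights; the test signal is a synthetic decaying exponential, chosen only because its ideal one-tap predictor coefficient is known to be 0.5.

```python
def autocorrelation(x, max_lag):
    """Raw autocorrelation of signal x for lags 0..max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for LPC coefficients a[1..order] from autocorrelation values r."""
    a = [0.0] * (order + 1)
    e = r[0]                              # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        # Levinson recursion: update all coefficients for the new order.
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)
    return a[1:]                          # predictor: x[n] ~ sum(a[k] * x[n-k])

# A decaying exponential x[n] = 0.5**n is perfectly predicted by x[n] = 0.5*x[n-1].
signal = [0.5 ** n for n in range(64)]
coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
print(coeffs)  # first coefficient is approximately 0.5, second approximately 0
```

A synthesizer like the chips mentioned above runs this in reverse: it feeds an excitation signal through a filter built from stored coefficients like these to reproduce the modeled vocal tract.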

Hopefully within the next few decades we can make speech recognition work as well as speech synthesis, and perhaps even grant these virtual assistants a modicum of true intelligence.


17 thoughts on “On Getting A Computer’s Attention And Striking Up A Conversation”

  1. Two examples of why speech recognition will never work 100% with English:

    “Eats shoots and leaves” [a book on grammar]

    “I’m hungry let’s eat Dad” [a phrase used to teach the importance of commas]

    1. Those are examples of something “taken out of context”. In order for any machine to be able to unambiguously “understand” speech, that speech must be unambiguous as well. Maybe it’s a corollary to Gödel’s incompleteness theorems?

  2. Didn’t Microsoft integrate some rudimentary speech recognition into their OS as early as Win98? And what happened to Dragon Naturally Speaking? Both engines did have decent voice recognition capabilities and even came with an API and back then the systems were way less powerful than today’s multicore Gigabyte Ram and Terabyte SSD versions, but still we seem to haven’t gone too far in this field. Still am dreaming of my very own offline assistant to tell me the daytime and system status by voice command…

      1. I would easily pay 150 or more for such a thing, the biggest reasons I haven’t adopted smart assistant tech yet despite being very early adopter keen on tech generally (ask me about the ridiculous number of VR headsets I own some time) is the combination of not having truly custom wake words and the lack of supported local processing. the lack of local is especially irksome considering how many GPUs there are in my house. surely some version of local voice processing could borrow available local processing there…

        or is this totally a thing and i’ve just missed it?

        1. How many people like you do you think there are in the world (or even just the usa)? If you look at the average number of VR headsets per household (hint: its probably less than 0.05), you’ll see that you’re a bit of an outlier, and so basing product decisions on your input may not be the best idea for a company.

    1. Mac OS 7.1 running on a av mac with dsp chip had speech recognition, combined with macintalk. You could ask it the time or date and tell it to shut down. Worked quite good, although i never saw a usercase for it. Even now I don’t.

    2. I use the latest iteration of Dragon Naturally Speaking. It’s accuracy is way better than what google’s, apple’s or any other online voice recognition could hope to accomplish. As long as you use a quality microphone. It does make mistakes but they are much fewer. As a disabled person it is the only way I can type with any sense of speed.
