We get it. We also watched Star Trek and thought how cool it would be to talk to our computer. From Kirk setting a self-destruct sequence to Scotty talking into a mouse to Picard ordering Earl Grey, we intuitively know that talking to a computer is better than typing, right? Well, computers talking back and forth with us is no longer science fiction, and maybe we aren’t as happy about it as we thought we’d be.
We weren’t able to pinpoint the first talking computer in fiction. Asimov and van Vogt had talking computers in the 1940s. “I, Robot” by Eando Binder, and not the more famous Asimov story, had a fully speaking robot in 1939. You could argue that “The Machine” in E. M. Forster’s “The Machine Stops” was probably speaking — the text is a little vague — and that was in 1909. The robot from Metropolis (1927) spoke after transforming, but you could argue that doesn’t count.
Meanwhile, In Real Life
In real life, computers weren’t as quick to speak. Before the middle of the twentieth century, machine-generated speech was an oddity. In 1779, a mechanical contrivance by Wolfgang von Kempelen, famous for the mechanical Turk chess-playing automaton, could form simple words. By 1939, Bell Labs could do even better speech synthesis electronically but with a human operator. It didn’t sound very good, as you can see in the video below, but it was certainly expressive.
Speech recognition would wait until 1952, when Bell Labs showed a system that required training to understand someone speaking numbers. IBM could recognize 16 different utterances in 1961 with “Shoebox,” and, of course, that same year, Bell Labs coaxed an IBM 7094 into singing “Daisy Bell,” which would later inspire HAL 9000 to do the same.
Recent advances in neural network systems and other AI techniques mean that now computers can generate and understand speech at a level even most fiction didn’t anticipate. These days, it is trivially easy to interact with your phone or your PC by using your voice. Of course, we sometimes question if every device needs AI smarts and a voice. We can maybe do without a smart toaster, for instance.
So What’s the Problem?
Patrick Blower’s famous cartoon about Amazon buying Whole Foods is both funny and tragically possible. In it, Jeff Bezos says, “Alexa, buy me something from Whole Foods.” To which Alexa replies, “Sure, Jeff. Buying Whole Foods.” Misunderstandings are one of the problems with voice input.
Every night, I say exactly the same phrase right before I go to sleep: “Hey, Google. Play my playlist sleep list.” About seven times out of ten, I get my playlist going. Two times out of ten, I get children’s lullabies or something even stranger. Occasionally, for variety, I get “Something went wrong. Try again later.” You can, of course, make excuses for this. The technology is new. Maybe my bedroom is noisy or has lousy acoustics. But still.
That’s not the only problem. Science fiction often predicts the future and, generally, newer science fiction is closer than older science fiction. But Star Trek sometimes turns that on its head. Picard had an office. Kirk worked out of his quarters at a time when working from home was almost unheard of. Offices are a forgotten luxury for many people, and if you are working from home, that’s fine. But if you are in a call center, a bullpen, or the bridge of the Enterprise, all this yakking back and forth with your computer will drive everyone crazy. Even if you train the computer to only recognize the user’s voice, it will still annoy you to have to hear everyone else’s notifications, messages, and alerts.
Today, humans are still better at understanding people than computers are. We all have a friend who consistently mispronounces “Arduino,” but we still know what he means. Or the colleague with a very thick accent, like Chekov trying to enter authorization code “wictor wictor two” in the recent movie. You knew what he meant, too.
Some of the problems are social. I can’t tell you the number of times I’m in the middle of dictating an e-mail, and someone just comes up and starts talking to me, which then shows up in the middle of my sentence. Granted, that’s not a computer issue. But it is another example of why voice input systems are not always as delightful as you’d think.
Solutions?

Sure, maybe you could build a cone of silence over each station, but that has its own problems. Then again, Spock and Uhura sometimes wore the biggest Bluetooth earbud ever, so maybe that’s half of the answer. The other half could be subvocalization, but that’s mostly science fiction, although not entirely.
What do you think? Even telepathy probably has some downsides. You’d have to be careful what you think, right? What is the ideal human-computer interface? Or will future Star Fleet officers be typing on molecular keyboards? Or will it wind up all in our brains? Tell us what you think in the comments.
Why would less than 100% recognition mean something is still not right?
Do two human beings talking to one another hear each other correctly 100% of the time?
Maybe it’s not an achievable task.
Yes, coworkers talking to their computers would suck.
Damn I miss working from home and hate Fox “News” for convincing the boss we all had to come back!
My coworkers talking to one another makes it hard to concentrate and drives me nuts. And no, I’m not being anti-social, it is work, not chat they are talking about. Some people just can’t do a thing without talking through it I guess. I was much happier AND more productive alone in my home office!
Demonstrably, we don’t.
Even if the words are 100%, there’s semantic ambiguity on top.
That’s why conversations have to be bidirectional, so that the intention is received and acknowledged in both directions (and then, disagreed with!)
Hearing each other correctly 100% of the time isn’t a requirement for humans though, right? (As the article points out, if someone pronounces something inaccurately/differently, a lot of humans will still understand what’s meant.) More often than not, computers require the human to pronounce something ‘wrong’/phonetically instead of ‘correctly’/in their local dialect, rather than the other way around. It would be considered more ‘successful’ if a computer could detect a word like ‘aluminum’ whether I pronounce it in an American, British, or ESL dialect (as an example).
Instructions unclear, programming voice software to hiss whenever it hears “aluminium”
Could be worse, the computer could reply “I’m sorry Dave, I’m afraid I can’t do that”.
Ask Alexa to do that…
One night I was watching a comedy show on TV and one of the actors said something like “what you need is sex..” …. Suddenly a voice from my kitchen (Google Max) said….”I’m sorry I can’t help you with that right now.” 🤣
Give it a few years
I miss my old Wince phone, so easy to set HAL 9000 up; at no cost and just a little work. The good ole days, eh? Even the handwriting recognition worked on the old Motorola i810, with no training.
We should be using gender neutral brainwaves by now. 2025
What an odd thing to say.
Forget Star Trek! I don’t mind if Rommie can’t understand what I’m saying! If you know what I mean. Damn, now you must find the reference…
“So when you handled certain parts of me, did you wear gloves?”
Should just use the keyboard and mouse when interfacing to a computer… Then the computer knows ‘exactly’ what you want to do. $ ls /home . :) . No need to complicate things. But alas it isn’t the ‘cool’ thing to do.
If you have a playlist before bed, set up a physical button or switch attached to an SBC. It’ll be right every time ;) . Hackaday users should be right at home with this solution.
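For the curious, here is a minimal sketch of that button-on-an-SBC idea, assuming a Raspberry Pi-style board with the gpiozero library and an MPD music daemon driven by the mpc command-line client; the pin number, playlist name, and player choice are placeholders, not anything the commenter specified.

    # A physical "sleep playlist" button: wire a momentary button between GPIO17 and GND.
    import subprocess
    from signal import pause
    from gpiozero import Button

    button = Button(17)  # gpiozero enables the internal pull-up by default

    def start_playlist():
        # "sleep" is a placeholder playlist name already saved in MPD
        subprocess.run(["mpc", "clear"], check=True)
        subprocess.run(["mpc", "load", "sleep"], check=True)
        subprocess.run(["mpc", "play"], check=True)

    button.when_pressed = start_playlist
    pause()  # wait forever, handling button presses as they arrive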
sudo rm -r * ;)
You missed the -f to make it more dramatic lol
Why would anyone need a mouse? A monitor and a “Model M” keyboard (not the Language Model which we heavily relied on for Automatic Speech Recognition until not so long ago…) is all that is needed to interact with any local or remote computer. Anything more is superfluous.
Actually, if you’re not entering each byte by hand in binary using a bank of rocker switches you’re just being lazy.
Years ago I had a colleague who purchased one of the first Altair 8800s but he couldn’t afford the monitor and keyboard. He figured he’d just be extra careful as he flipped the switches …
Everything else is just needless luxury.
Real programmers only use C-x, M-c, M-butterfly…
https://xkcd.com/378/
Ha… I added the mouse for the mouse cripples (as we call them). We have people here that can’t operate if they don’t have a mouse handy ;) .
Gah! I hate it when I’m helping someone fix a problem on their Linux computer and we’re sitting in a terminal window and I say, “Go to the blah-blah directory and look for…” and they grab the mouse and go to the file browser and start hunting for the blah-blah directory. Dude, just use “cd and ls”.
I’m all for buttons but I can’t find good ones! I recently switched to Zigbee for my home automation but most zigbee buttons aren’t compatible with Home Assistant or use button cells instead of rechargeable AAAs.
There’s lots of standard buttons you can buy… just wire up a standard button (arcade style, or fancy chrome push button/latches, even some with LEDs to wire in) to an SBC/Micro controller. Uses one GPIO pin and GND to detect. Done. Doesn’t have to be fancy to detect and use a switch or button press. If the controller supports Wifi/bluetooth/Zigbee/etc. easy enough to get the status into an application to read.
Ikea has a bunch of zigbee remotes that work just fine with home assistant. Obviously you need a zigbee modem, like the sonoff zigbee dongle, but that also works great in HA
I’ve got Ikea’s bulbs and a motion sensor, I’ll give their buttons a try. So far I’ve tried a Sonoff button (coin cells), a battery-less 3 gang set (not compatible with ZHA yet), and a 2 gang set I can’t remember the name of that takes AAA batteries (works but not out of the box).
I’ll second the Ikea buttons; they work great in HA. They even have magnets in the back so you can stick them to e.g. the fridge (er, i have the ‘rocker’ ones, not sure if they all do).
Definitely optical mouse + buckling-spring keyboard!
I have a “dark mode” routine that turns off ALL the lights. No matter how carefully I enunciate it, Alexa often thinks I said “bark mode” and then responds to all commands and queries with dog barks.
I think I’d change that to “blackout”, less likely to misinterpret.
… and shortly after, you realize that the local power grid was updated to digital control recently
Or “pitch black”.
Try adding a word: ‘dark mode on’ or ‘dark mode activate’ – this should get rid of the dark vs. bark ambiguity
Uhura, not Uhuru.
The two big challenges I have with computer-interpreted speech are timing-related. I trigger the process with “Hey Siri” and it takes about half a second longer than I expect and doesn’t catch my message.
Or I start the process, open my mouth to make a request and find out that I don’t really know how I want to phrase something “Hey Siri, schedule an appointment… for the 19th…” and some half-baked calendar item gets created.
Siri works unless you’re Barry Kripke from TBBT….
https://youtu.be/Q3bdXctq7DM?si=j7rsql6v_TjtWlcT
one of the best things i saw from google…several several years ago, i think a side story to advertising some new android feature on pixel or nexus phones, the google presenter said that their roadmap was for people to be doing some number of voice searches per day. and they straight up acknowledged that today, a lot of voice searches end in failure, and that’s why it’s a future goal still some distance ahead on the roadmap. the idea that they deployed something that wasn’t quite ready and knew that’s what they had done was really impressive to me.
for a counterpoint, the AI results at the top of every search result, full of hallucinations, seems like the opposite mentality. they must know it’s not ready and that most users don’t like it and that the users who get tricked into relying on it are getting brainwashed by misinformation as we speak…but they still shove it in front of all of our faces anyways. not as a novelty feature for early adopters but as a replacement for the top result which used to be pretty useful.
the other day i asked for a simple question, what are the dimensions of a 608 bearing. and i happened to already kind of know the answer (OD 22, ID 8, depth 7, right??) i just wanted something more reliable than my memory, and google AI popped up the right answer, and i knew it was the right answer because it was exactly the numbers i was expecting to see. that really made me realize how deeply screwed we are — a machine aid for confirmation bias
maybe we’ve conquered voice but we haven’t gotten to having anything worthwhile to say yet i guess
I am finding AI summaries quite useful. I know when to click through to cited sources. I also get useful results from google search. Pick search terms likely to be unique to the docs you’re looking for.
The problem is obviously that they are often just wrong enough to confuse people who can’t tell. For the record, I have seen it say things that are completely wrong, instructing the use of features that don’t exist or are from another product, etc., even when its sources are correct. The sequences it uses should verify its result before posting.
What about an EEG that reads the impulses from the speech centre and then interprets the words you’re thinking, easy as speaking but no noise, and you get to wear a fashionable little hat to boot!
I know this is a joke, but it won’t work, simply because a hat isn’t good enough. There is too much noise and the problem is far more complicated than this solution.
but “hat to boot” is a full body dress, no?
Computers in video have voice interfaces for exactly one reason:
Watching people type is boring.
It’s a concession to the constraints of the medium, supported by the fact that narrative causality runs backwards compared to the real world.
There’s no reason to waste screen time letting actors say things that don’t matter. A computer that answers a question is delivering the next plot point. The narrative function of the question is to cue the answer, so the question is selected to fit the answer.
The bit with Picard asking for Earl Grey tea is especially bad in terms of information transfer: an actual request would be more like, “seventeen centiliters of Earl Grey tea, seventy percent keemun with thirty percent lapsang souchong blend, steeped for eighty five seconds at one hundred ninety five celsius, served at one hundred thirty celsius in Royal Dansk teacup with periwinkle pattern” but no director will tolerate that. We hand-wave it by saying the computer already knows all those preferences, but that includes the ‘Earl Grey’ part. Just saying “tea” should be enough.
The specifier would do something useful if Picard ever drank some other kind of tea, but he never does. A computer capable of human-speech interaction that doesn’t notice the guy only ever asks for one thing is embarrassing.
In terms of interface design, pushing a “tea” button linked to those preferences is better in every way. But as a narrative element that would require a “what does that button do?” exchange that’s more expensive than a semi-sentient computer.
Yes, voice recognition is an interesting technical challenge. Yes there are useful real world applications for it. But let’s try not to base our expectations on a model that doesn’t match real applications in any way.
Excellent analysis. However, it’s easy to imagine Picard saying “seventeen centiliters of Earl Grey…” once, on his first interaction with the network 40 years before the episode you’re viewing, and tying it to the “Earl Grey” keywords. Perhaps he set it up as a cadet at Star Fleet Academy.
No reason to show it in the episode; when I saw Picard ordering tea, I immediately guessed he’d set it up in detail a long time ago. This was long before I had personal experience with AI and learned not to trust it to get things right.
A vocal macro.
But that would imply there’s a “tea” button everywhere Mr. Picard can go on the spaceship that has a replicator. It’s far more plausible he’s just recorded a macro under the title “Earl Grey” that specifies the exact type of tea he wants, instead of the program default “tea” that every replicator would produce.
“served at one hundred thirty celsius”
I’d like to see you try.
Are you putting him under pressure?
surely i can’t be the only reader here who honestly has no idea how to pronounce “arduino”? i’ve never heard it spoken. if i had to SWAG, i’d try to go pseudo-Italianish on it, and probably bungle that since i don’t speak Italian.
i never use voice interfaces, either, due to a brick-thick accent i have trouble getting people to parse sometimes.
Break up the syllables and pronounce them separately, it has 4, or 3 if you are Italian, French or Spanish.
A coworker of mine once nearly drove off the side of the interstate while trying to get Siri to give him directions to Newk’s Eatery. Despite the restaurant being a chain with over 100 locations in over a dozen states all around us, it gave him directions to Los Alamos but couldn’t figure out the restaurant, even when he tried to spell the name out.
To recognize speech or to wreck a nice beach? That is the problem, when some people talk fast with little space to separate sounds.
I have surround sound in places and sound everywhere else, such that any eavesdropping device will get jammed. If this is needed, a handheld mic with a push-to-enable button would be all I would accept.
Add a large language model to the speech recognition, so you don’t only get a transcription but also understanding of context (or at least the probability of it). You’d be surprised how close you get to “Her”, which is now over a decade old.
Little spoiler: If you give it access to your command line on your good old Linux system, it is not only able to check out the USB ports, but will also write a blinky sketch to your file system, download a compiler, flash the hex file to your favorite MCU board, and afterwards find a sound file on the data dumpster to give itself a round of applause… and all without touching the keyboard a single time 😇
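For what it’s worth, the transcription-plus-context pipeline described above can be strung together in a few lines. This is only a sketch under assumptions: the openai-whisper and openai Python packages, an API key in the environment, and a pre-recorded command.wav; the file, model names, and prompt are placeholders, not anything the commenter used.

    # Transcribe a spoken command, then ask an LLM what the speaker actually wants.
    import whisper
    from openai import OpenAI

    stt = whisper.load_model("base")              # local speech-to-text
    text = stt.transcribe("command.wav")["text"]  # e.g. "play my sleep list"

    client = OpenAI()                             # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Turn the user's spoken request into one line describing the "
                        "intended action, or ask for clarification if it is ambiguous."},
            {"role": "user", "content": text},
        ],
    )
    print(reply.choices[0].message.content)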
It’s still terrible at python.
https://clip.cafe/colossus-the-forbin-project-1970/we-can-co-exist/
“Colossus: The Forbin Project,” a fantastic, intelligent, classic sci-fi film that is a true underappreciated sleeper. I won’t spoil it by saying why I think it’s a sleeper and is so little known.
Ideal interface? What’s wrong with a Northgate 102 (or was it 104?) and an AdLib ISA card?
It was moderately better than a light gun and a Votrax.
There is an old joke about the first computer that could understand human voice and execute commands. It ran on DOS. It took years to create, and now it is being shown at a public conference. After asking and getting confirmation that the only copy of this software is on that computer, a guy screams “del space see colon space slash eff star dot star”, then another one, catching the vibe, also screams “yes I am sure”.