We get it. We also watched Star Trek and thought how cool it would be to talk to our computer. From Kirk setting a self-destruct sequence to Scotty talking into a mouse to Picard ordering Earl Grey, we intuitively know that talking to a computer is better than typing, right? Well, computers talking back and forth with us is no longer science fiction, and maybe we aren’t as happy about it as we thought we’d be.
We weren’t able to pinpoint the first talking computer in fiction. Asimov and van Vogt had talking computers in the 1940s. “I, Robot” by Eando Binder, and not the more famous Asimov story, had a fully speaking robot in 1939. You could argue that “The Machine” in E. M. Forster’s “The Machine Stops” was probably speaking — the text is a little vague — and that was in 1909. The robot from Metropolis (1927) spoke after transforming, but you could argue that doesn’t count.
Meanwhile, In Real Life
In real life, computers weren’t as quick to speak. Before the middle of the twentieth century, machine-generated speech was an oddity. In 1779, a mechanical contrivance by Wolfgang von Kempelen, famous for the Mechanical Turk chess-playing automaton, could form simple words. By 1939, Bell Labs could do even better speech synthesis electronically, but with a human operator. It didn’t sound very good, as you can hear in the video below, but it was certainly expressive.
Speech recognition would wait until 1952, when Bell Labs showed a system that required training to understand someone speaking numbers. IBM could recognize 16 different utterances in 1961 with “Shoebox,” and, of course, that same year, Bell Labs made an IBM 704 sing “Daisy Bell,” which would later inspire HAL 9000 to do the same.
Recent advances in neural network systems and other AI techniques mean that now computers can generate and understand speech at a level even most fiction didn’t anticipate. These days, it is trivially easy to interact with your phone or your PC by using your voice. Of course, we sometimes question if every device needs AI smarts and a voice. We can maybe do without a smart toaster, for instance.
So What’s the Problem?
Patrick Blower’s famous cartoon about Amazon buying Whole Foods is both funny and tragically possible. In it, Jeff Bezos says, “Alexa, buy me something from Whole Foods.” To which Alexa replies, “Sure, Jeff. Buying Whole Foods.” Misunderstandings are one of the problems with voice input.
Every night, I say exactly the same phrase right before I go to sleep: “Hey, Google. Play my playlist sleep list.” About seven times out of ten, I get my playlist going. Two times out of ten, I get children’s lullabies or something even stranger. Occasionally, for variety, I get “Something went wrong. Try again later.” You can, of course, make excuses for this. The technology is new. Maybe my bedroom is noisy or has lousy acoustics. But still.
That’s not the only problem. Science fiction often predicts the future and, generally, newer science fiction is closer than older science fiction. But Star Trek sometimes turns that on its head. Picard had an office. Kirk worked out of his quarters at a time when working from home was almost unheard of. Offices are a forgotten luxury for many people, and if you are working from home, that’s fine. But if you are in a call center, a bullpen, or the bridge of the Enterprise, all this yakking back and forth with your computer will drive everyone crazy. Even if you train the computer to only recognize the user’s voice, it will still annoy you to have to hear everyone else’s notifications, messages, and alerts.
Today, humans are still better at understanding people than computers are. We all have a friend who consistently mispronounces “Arduino,” but we still know what he means. Or the colleague with a very thick accent, like Chekov trying to enter authorization code “wictor wictor two” in the recent movie. You knew what he meant, too.
Some of the problems are social. I can’t tell you the number of times I’m in the middle of dictating an e-mail, and someone just comes up and starts talking to me, which then shows up in the middle of my sentence. Granted, that’s not a computer issue. But it is another example of why voice input systems are not always as delightful as you’d think.
Solutions?

Sure, maybe you could build a cone of silence over each station, but that has its own problems. Then again, Spock and Uhura sometimes wore the biggest Bluetooth earbud ever, so maybe that’s half of the answer. The other half could be subvocalization, but that’s mostly science fiction, although not entirely.
What do you think? Even telepathy probably has some downsides. You’d have to be careful what you think, right? What is the ideal human-computer interface? Or will future Star Fleet officers be typing on molecular keyboards? Or will it wind up all in our brains? Tell us what you think in the comments.
Why would less than 100% recognition mean something is still not right?
Do two human beings talking to one another hear each other correctly 100% of the time?
Maybe it’s not an achievable task.
Yes, coworkers talking to their computers would suck.
Damn I miss working from home and hate Fox “News” for convincing the boss we all had to come back!
My coworkers talking to one another makes it hard to concentrate and drives me nuts. And no, I’m not being anti-social, it is work, not chat they are talking about. Some people just can’t do a thing without talking through it I guess. I was much happier AND more productive alone in my home office!
Demonstrably, we don’t.
Even if the words are 100%, there’s semantic ambiguity on top.
That’s why conversations have to be bidirectional, so that the intention is received and acknowledged in both directions (and then, disagreed with!)
You have no idea who Panondorf is or what he/she does. Are you so arrogant that you think you know what’s better for a stranger you’ve never met, and have no idea what they do, that you think you should get a say in something that has zero effect on you personally but 100% effect on that stranger? How about we start making decisions about your personal rights/freedoms that you have to live with? That’s not okay? … well then you are being hypocritical. Seriously, HAD needs to do something about these comment trolls who add literally nothing to the topic at hand and only try to sow discord.
Hearing each other correctly 100% of the time isn’t a requirement for humans though, right? (As the article points out, if someone pronounces something inaccurately/differently, a lot of humans will still understand what’s meant.) More often than not, computers require the human to pronounce something ‘wrong’/phonetically instead of ‘correctly’/with whatever local dialect, as opposed to the other way around. It would be considered more ‘successful’ when a computer can detect a word like ‘aluminum’ whether I pronounce it in American, British, or ESL dialects (as an example).
Instructions unclear, programming voice software to hiss whenever it hears “aluminium”
Humans don’t require 100% fidelity in transmission because we’re context-aware and can doubt inaccurate data. The computer isn’t smart enough to say “Wait, what?”
Could be worse, the computer could reply “I’m sorry Dave, I’m afraid I can’t do that”.
Ask Alexa to do that…
One night I was watching a comedy show on TV and one of the actors said something like “what you need is sex..” …. Suddenly a voice from my kitchen (Google Max) said….”I’m sorry I can’t help you with that right now.” 🤣
Give it a few years
I miss my old WinCE phone, so easy to set HAL 9000 up at no cost and with just a little work. The good ole days, eh? Even the handwriting recognition worked on the old Motorola i810, with no training.
Forget Star Trek! I don’t mind if Rommie can’t understand what I’m saying, if you know what I mean. Damn, now you must find the reference…
“So when you handled certain parts of me, did you wear gloves?”
Damn, you’re good! Original Airdate: 30 Oct, 2000
Should just use the keyboard and mouse when interfacing with a computer… Then the computer knows ‘exactly’ what you want to do. $ ls /home . :) . No need to complicate things. But alas it isn’t the ‘cool’ thing to do.
If you have a playlist before bed, set up a physical button or switch attached to an SBC. It’ll be right every time ;) . Hackaday users should be right at home with this solution.
sudo rm -r * ;)
You missed the -f to make it more dramatic lol
Why would anyone need a mouse? A monitor and a “Model M” keyboard (not the Language Model which we heavily relied on for Automatic Speech Recognition until not so long ago…) is all that is needed to interact with any local or remote computer. Anything more is superfluous.
Actually, if you’re not entering each byte by hand in binary using a bank of rocker switches you’re just being lazy.
Years ago I had a colleague who purchased one of the first Altair 8800s, but he couldn’t afford the monitor and keyboard. He figured he’d just be extra careful as he flipped the switches …
Everything else is just needless luxury.
Real programmers only use C-x, M-c, M-butterfly…
https://xkcd.com/378/
Ha… I added the mouse for the mouse cripples (as we call them). We have people here that can’t operate if they don’t have a mouse handy ;) .
Gah! I hate it when I’m helping someone fix a problem on their Linux computer and we’re sitting in a terminal window and I say, “Go to the blah-blah directory and look for…” and they grab the mouse and go to the file browser and start hunting for the blah-blah directory. Dude, just use “cd and ls”.
I’m all for buttons but I can’t find good ones! I recently switched to Zigbee for my home automation but most zigbee buttons aren’t compatible with Home Assistant or use button cells instead of rechargeable AAAs.
There’s lots of standard buttons you can buy… just wire up a standard button (arcade style, or fancy chrome push button/latches, even some with LEDs to wire in) to an SBC/Micro controller. Uses one GPIO pin and GND to detect. Done. Doesn’t have to be fancy to detect and use a switch or button press. If the controller supports Wifi/bluetooth/Zigbee/etc. easy enough to get the status into an application to read.
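For the bedtime-playlist case mentioned above, a rough sketch of that one-button idea could look like this on a Raspberry Pi; the GPIO pin and the Home Assistant webhook ID are just example placeholders, and any SBC with a free GPIO and a network connection would do the same job.

```python
# A minimal sketch, assuming a Raspberry Pi with the gpiozero and requests
# libraries installed, a momentary button wired between GPIO 17 and GND,
# and a Home Assistant automation with a webhook trigger. The pin number
# and webhook ID ("bedtime_playlist") are example placeholders.
from signal import pause

import requests
from gpiozero import Button

# Replace with your own Home Assistant host and webhook ID.
HA_WEBHOOK_URL = "http://homeassistant.local:8123/api/webhook/bedtime_playlist"

button = Button(17)  # gpiozero enables the internal pull-up; pressing shorts the pin to GND


def start_playlist():
    # Fire the webhook; the Home Assistant automation decides what happens next.
    try:
        requests.post(HA_WEBHOOK_URL, timeout=5)
    except requests.RequestException as err:
        print(f"Webhook call failed: {err}")


button.when_pressed = start_playlist
pause()  # keep the script alive, waiting for presses
```

The nice part is that a dedicated physical button never mishears you, which is the whole point of the suggestion.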
Ikea has a bunch of zigbee remotes that work just fine with home assistant. Obviously you need a zigbee modem, like the sonoff zigbee dongle, but that also works great in HA
I’ve got Ikea’s bulbs and a motion sensor, I’ll give their buttons a try. So far I’ve tried a Sonoff button (coin cells), a battery-less 3 gang set (not compatible with ZHA yet), and a 2 gang set I can’t remember the name of that takes AAA batteries (works but not out of the box).
I’ll second the Ikea buttons; they work great in HA. They even have magnets in the back so you can stick them to e.g. the fridge (er, I have the ‘rocker’ ones, not sure if they all do).
IKEA have some of them that work well with HA: https://www.ikea.com/au/en/p/somrig-shortcut-button-white-smart-90560346/
Definitely optical mouse + buckling-spring keyboard!
I have a “dark mode” routine that turns off ALL the lights. No matter how carefully I enunciate it, Alexa often thinks I said “bark mode” and then responds to all commands and queries with dog barks.
I think I’d change that to “blackout”, less likely to misinterpret.
… and shortly after, you realize that the local power grid was updated to digital control recently
Or “pitch black”.
‘Make me good looking.’
Try adding a word: ‘dark mode on’ or ‘dark mode activate’ – this should get rid of dark v bark ambiguity
Uhura, not Uhuru.
What an odd thing to say.
The two big challenges I have with computer interpreted speech are timing related. I trigger the process “Hey Siri” and it takes about half a second longer than I expect and doesn’t catch my message.
Or I start the process, open my mouth to make a request and find out that I don’t really know how I want to phrase something “Hey Siri, schedule an appointment… for the 19th…” and some half-baked calendar item gets created.
Siri works unless you’re Barry Kripke from TBBT….
https://youtu.be/Q3bdXctq7DM?si=j7rsql6v_TjtWlcT
one of the best things i saw from google…several years ago, i think a side story to advertising some new android feature on pixel or nexus phones, the google presenter said that their roadmap was for people to be doing some number of voice searches per day. and they straight up acknowledged that today, a lot of voice searches end in failure, and that’s why it’s a future goal still some distance ahead on the roadmap. the idea that they deployed something that wasn’t quite ready and knew that’s what they had done was really impressive to me.
for a counterpoint, the AI results at the top of every search result, full of hallucinations, seems like the opposite mentality. they must know it’s not ready and that most users don’t like it and that the users who get tricked into relying on it are getting brainwashed by misinformation as we speak…but they still shove it in front of all of our faces anyways. not as a novelty feature for early adopters but as a replacement for the top result which used to be pretty useful.
the other day i asked a simple question, what are the dimensions of a 608 bearing. and i happened to already kind of know the answer (OD 22, ID 8, depth 7, right??) i just wanted something more reliable than my memory, and google AI popped up the right answer, and i knew it was the right answer because it was exactly the numbers i was expecting to see. that really made me realize how deeply screwed we are — a machine aid for confirmation bias
maybe we’ve conquered voice but we haven’t gotten to having anything worthwhile to say yet i guess
I am finding AI summaries quite useful. I know when to click through to cited sources. I also get useful results from google search. Pick search terms likely to be unique to the docs you’re looking for.
The problem is obviously that they are often just wrong enough to confuse people who can’t tell. For the record, I have seen it say things that are completely wrong, instructing the use of features that don’t exist or are from another product, etc., even when its sources are correct. The sequences it uses should verify its result before posting.
The AI is also very likely to just agree to whatever question, like “Can you do X?” – “Yes you can, here’s why…” and then list a bunch of non-sequitur stuff.
What about an EEG that reads the impulses from the speech centre and then interprets the words you’re thinking, easy as speaking but no noise, and you get to wear a fashionable little hat to boot!
I know this is a joke, but simply put, a hat isn’t good enough. There is too much noise, and the problem is far more complicated than this solution.
but “hat to boot” is a full body dress, no?
Computers in video have voice interfaces for exactly one reason:
Watching people type is boring.
It’s a concession to the constraints of the medium, supported by the fact that narrative causality runs backwards compared to the real world.
There’s no reason to waste screen time letting actors say things that don’t matter. A computer that answers a question is delivering the next plot point. The narrative function of the question is to cue the answer, so the question is selected to fit the answer.
The bit with Picard asking for Earl Grey tea is especially bad in terms of information transfer: an actual request would be more like, “seventeen centiliters of Earl Grey tea, seventy percent keemun with thirty percent lapsang souchong blend, steeped for eighty five seconds at one hundred ninety five celsius, served at one hundred thirty celsius in Royal Dansk teacup with periwinkle pattern” but no director will tolerate that. We hand-wave it by saying the computer already knows all those preferences, but that includes the ‘Earl Grey’ part. Just saying “tea” should be enough.
The specifier would do something useful if Picard ever drank some other kind of tea, but he never does. A computer capable of human-speech interaction that doesn’t notice the guy only ever asks for one thing is embarrassing.
In terms of interface design, pushing a “tea” button linked to those preferences is better in every way. But as a narrative element that would require a “what does that button do?” exchange that’s more expensive than a semi-sentient computer.
Yes, voice recognition is an interesting technical challenge. Yes there are useful real world applications for it. But let’s try not to base our expectations on a model that doesn’t match real applications in any way.
Excellent analysis. However, it’s easy to imagine Picard saying “seventeen centiliters of Earl Grey…” and binding it to the “Earl Grey” keywords on his first interaction with the network, 40 years before the episode you’re viewing. Perhaps he set it up as a cadet at Star Fleet Academy.
No reason to show it in the episode; when I saw Picard ordering tea, I immediately guessed he’d set it up in detail a long time ago. This was long before I had personal experience with AI and learned not to trust it to get things right.
A vocal macro.
Exactly, a macro which should require no AI beyond voice recognition. The badge would have Picard’s personal info, which could easily be sensed by his proximity to a replicator. This certainly doesn’t require an LLM to do any of the work.
But that would imply there’s a “tea” button at every replicator Mr. Picard might use on the ship. It’s far more plausible he’s just recorded a macro under the title “Earl Grey” that specifies the exact type of tea he wants, instead of the program default “tea” that every replicator would produce.
“served at one hundred thirty celsius”
I’d like to see you try.
Are you putting him under pressure?
In a smooth interior container you can get well over 100.
To the point it’s dangerous.
Always scratch the inside of a new brazil press if you plan on using a microwave to heat it.
Otherwise adding grounds can be good fun.
A computer that can synthesize any food, can deliver 130C tea.
But you don’t want it to.
If there were food synthesizers, there would be food synthesizer hacker/trolls.
Sooner or later, everybody is eating Surströmming, until a key is cracked, then Lutfisk.
surely i can’t be the only reader here who honestly has no idea how to pronounce “arduino”? i’ve never heard it spoken. if i had to SWAG, i’d try to go pseudo-Italianish on it, and probably bungle that since i don’t speak Italian.
i never use voice interfaces, either, due to a brick-thick accent i have trouble getting people to parse sometimes.
Break up the syllables and pronounce them separately, it has 4, or 3 if you are Italian, French or Spanish.
A coworker of mine once nearly drove off the side of the interstate while trying to get Siri to give him directions to Newk’s Eatery. Despite the restaurant being a chain with over 100 locations in over a dozen states all around us, it gave him directions to Los Alamos but couldn’t figure out the restaurant, even when he tried to spell the name out.
To recognize speech or to wreck a nice beach? That is the problem, when some people talk fast with little space to separate sounds.
I have surround sound in some places and sound everywhere else, such that any eavesdropping device will get jammed. If this were needed, a handheld mic with a push-to-enable button would be all I would accept.
Add a large language model to the speech recognition, so you don’t only get a transcription but also an understanding of context (or at least the probability of it). You’d be surprised how close you get to “Her”, which is now over a decade old.
Little spoiler: If you give it access to your command line on your good old Linux system, it not only is able to check out the USB ports, but also write a blinky sketch to your file system, download a compiler, flash the hex file to your favorite MCU board and afterwards find a sound file on the data dumpster to give itself a round of applause… and all without touching the keyboard a single time 😇
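As a rough illustration of the “transcription plus an LLM for context” idea, here’s a minimal sketch assuming the open-source openai-whisper package for the speech-to-text half and the OpenAI Python client for the intent half; the audio file name, model names, and the little action list are placeholders rather than anything from the comment above.

```python
# A minimal sketch of "transcription plus an LLM for context", assuming the
# open-source openai-whisper package for speech-to-text and the OpenAI Python
# client (openai>=1.0) for the intent step. The audio file name, model names,
# and the action list are placeholders.
import whisper
from openai import OpenAI


def transcribe(path: str) -> str:
    # Local speech-to-text; "base" is the smallest practical Whisper model.
    model = whisper.load_model("base")
    return model.transcribe(path)["text"]


def interpret(transcript: str) -> str:
    # Ask the LLM to map the raw transcript onto a fixed set of intents, so
    # "play my sleep playlist" and "put on the bedtime music" land on the
    # same action despite different wording.
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {
                "role": "system",
                "content": "Map the user's spoken request to one of: "
                           "play_playlist, set_timer, lights_off, unknown. "
                           "Reply with the action name only.",
            },
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    text = transcribe("command.wav")  # placeholder audio clip
    print(text, "->", interpret(text))
```

The LLM step is what buys the fuzziness the thread keeps asking for: the transcript doesn’t have to match a fixed phrase, it just has to be close enough for the model to pick the right action.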
It’s still terrible at python.
https://clip.cafe/colossus-the-forbin-project-1970/we-can-co-exist/
“Colossus: The Forbin Project,” a fantastic, intelligent, classic sci-fi film that is a true underappreciated sleeper. I won’t spoil it by saying why I think it’s a sleeper and is so little known.
Ideal interface? What’s wrong with a Northgate 102 (or was it 104?) and an AdLib ISA card?
It was moderately better than a light gun and a Votrax.
There is an old joke about the first computer that could understand human voice and execute commands. It ran on DOS. It took years to create, and finally it was shown at a public conference. After asking and getting confirmation that the only copy of the software was on that computer, a guy in the audience screams “del space see colon space slash eff star dot star”, and then another one, catching the vibe, also screams “yes i am sure”.
“del c: /f.” Hmm…
I’m going to be that guy. Was the computer giving the confirmation, or a person? How was the voice input activated to accept a command from the screaming guy? How would the computer know to press enter? Why did the guy yell an invalid command to the computer? And what if the program was running from a floppy?
He should yell, “eff oh are em ay tee space see colon space forward slash eff space forward slash cue you eye see kay” and then the second guy, named Dave, screams “enter!!”
And then the computer responds, “I’m sorry, Dave, I cannot do that.”
And then their faces just drop, and the camera zooms in on the computer monitor as an ominous tune begins to play, and we hear key clacking from the monitor as it displays “Initiating HAL protocol” in the center of the screen, using an ASCII border around it, and the whole box is using blink text, and there’s a dramatic sci-fi synthesized “ZZZAAAMMMM” as the camera zooms in on their faces, and then the camera cuts to the guys’ faces and another dramatic “ZZZZAAAMMM” and then the camera cuts to a helicopter view of the building they are in and then there’s a 3rd voice that screams. And then the camera cuts to Big Ben showing 12 o’clock, and a loud GONG plays, and at each of the next 9 gongs, it cuts to views of random cities around the world, no people to be seen. Some papers blowing by with the wind. A tumbleweed rolling in the desert, and for the 11th gong it shows earth from space, and for the final 12th gong it cuts to a much further away view of earth from space, where the earth appears the size of a quarter. And you can see the sun in the center of the screen .. fade to black.
I think I just created a prequel movie… Y2K: NOT YET A SPACE ODYSSEY
it would be even better if before the presentation, we are made aware that the date is December 31st, 1999, and it’s minutes til midnight. And we see various scenes of people out in the streets partying and celebrating. And each new scene zooms out until they appear rapidly and the scene blocks form the Y 2 K on the screen, with a nice sci fi string crescendo. And then the Y2K fades out and then we have a black background and the words “SOMEWHERE ON EARTH…” appear on the screen in a text entry animation, as we hear key clacks for each letter. Then we get a fade in to the room this is happening in. And just before the event starts, it shows someone looking at the clock on the wall, and zooms in on the clock as the person says, “Gosh, I hope this doesn’t take too long! I don’t want to miss the party to ring in the new year!”
Personally, I’ve never liked voice for computer input or output. I played with Votrax speech synthesizers in the ’80s; I had OS/2 Warp 4 in the ’90s with VoiceType. Both were good enough to be usable, but after a few minutes I never had any inclination to use them again. As other commenters noted, keyboards are much more precise, and for a good typist faster (particularly if punctuation is important), than voice input; and I’ve never found myself wanting to use a voice assistant to control, trigger, or request anything.
And, of course, with voice assistants there are the usual issues of false-positive activation and security. My wife and I don’t use them at all. We don’t have any dedicated ones, and disable them on our phones and other devices which include them as bloatware.
For output, voice suffers from synchronicity: it’s impossible, or at least infeasible, to go at my own pace and skip around, as I can do when reading.
The only voice application I ever used to any significant extent was a Japanese-language tutor package, which was useful for practicing pronunciation outside class.
Speaking with humans has important social and psychological affordances. Speaking with machines is mostly just affectation.
Odd isn’t it? Voice control is sold as the be all and end all but, once again, it singularly & consistently fails to deliver in any meaningful way: by which I mean reliably & consistently.
With all of the neurons being applied to this issue, I cannot help but ask why just a few haven’t been applied to the very premise that voice is the only answer. Has no one taken the time to even scratch the surface of what is involved in voice communications? Patently not, as even with many of my own neurons fried by too many late nights, too much loud music, over-consumption of alcohol and general abuse, even I appreciate & understand that voice is just a single element within a plethora of others within audible communications.
For example: normal mature human beings instantly understand how inflection, emphasis, tone & pace are occasionally used to instantly turn the simple spoken negative word ‘No’ into the most affirmative of ‘Yeses’ & vice versa.
To me, the simplest, most inclusive, interface ever devised is one I invented. It consists of just two physically different & lit buttons coupled with audio for instruction and feedback. It requires NO instructions & so is suitable for those of limited vision as well as those of limited cognitive abilities such as early onset dementia.
Current interfaces are designed by 25-40 year old males FOR, sadly, 25-40 year old males. In most western societies that group represents just 10% of the population (in the UK it’s 10.06%). What that means is that stuff is being designed to exclude over 70% of the rest of society. BTW, for those about to argue with my maths, the rest are children. Men design essentially mathematical interfaces. Women hate mathematical interfaces and find them hard work to use. The general confidence of people begins to decline once past 55. By that I mean the confidence to engage with the unfamiliar, not their core skills. Even worse, older men (65 plus) consistently “exaggerate” their technical competence around their spouses. Post 70 years of age, generally one’s cognitive abilities begin to decline too.
When I talk about this, everyone tries to justify current design by talking about user & focus group testing. The truth is, user & focus groups are a literal waste of time. In the real world, your Mr Average or Ms Miggins is no extrovert. Neither are they particularly introverted. They just want to get on, get by in life. People who join user & focus groups ALL feel superior to others. They are ALL either opinionated or simply there for the money. They press their own agenda or they promote a view that gets them the best ROI on their time. In these groups, unlike the real world, there are zero consequences for wrong or poor decision making, so no pressure. No one ever in a user or focus group had to go without dinner because they’d pressed the wrong button on a heating control they didn’t understand, so it ran 24/7!!
My apologies dear reader for wittering on, but perhaps some of those involved in tech might one day take the time to look at how people behave, and prove why, before deciding on the very next bestest thing ever…
This is great information.
Unfortunately, you haven’t generalised about how people with different skin colours don’t understand or interact with technology appropriately. Then move on to nationality and sexuality. Finally, make sure to explain how political and religious beliefs affect individual use of technology. /s
There is a nice YouTube video about two Scotsmen trying to control a voice-controlled elevator.