“Sorry. I had music playing. Would you say that again?” If we had a money-unit every time someone tried talking to us while we were wearing headphones, we could afford a super-nice pair. For an Embedded C class, [extremerockets] built Listen Up!, a cutoff switch that pauses your music when someone wants your attention.
The idea was born while sheltering in place with his daughter, who likes loud music; he doesn’t want to holler to get her attention, but he doesn’t want to deny her some auditory privacy either. Instead, Listen Up! samples the ambient noise level, listens for a sustained rise in amplitude, like speech, and sends a pause signal to the phone. Someday there may be an option to route the microphone’s audio into the headphones, but for now there is a text-to-speech module for verbalizing character strings. It might be a bit jarring to hear a call to dinner in the middle of a guitar riff, but we don’t like missing dinner either, so we’re with [extremerockets] on this one.
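For flavor, here’s a minimal desktop-Python sketch of that detection loop, assuming the sounddevice and pynput libraries and entirely made-up threshold values; the real Listen Up! firmware is its own embedded beast:

```python
# Sketch of the Listen Up! idea: watch the ambient RMS level and fire a
# "pause" action on a sustained rise. Library choices (sounddevice, pynput)
# and all thresholds here are our assumptions, not the project's firmware.
import numpy as np
import sounddevice as sd
from pynput.keyboard import Controller, Key

RATE = 16000
BLOCK = 1024          # ~64 ms of audio per callback
THRESHOLD = 0.05      # RMS level that counts as "someone is talking"
SUSTAIN_BLOCKS = 8    # ~0.5 s above threshold before we trust it

keyboard = Controller()
loud_run = 0

def on_audio(indata, frames, time, status):
    global loud_run
    rms = np.sqrt(np.mean(indata[:, 0] ** 2))
    loud_run = loud_run + 1 if rms > THRESHOLD else 0
    if loud_run == SUSTAIN_BLOCKS:          # a sustained rise, not a door slam
        keyboard.tap(Key.media_play_pause)  # ask the music player to pause

with sd.InputStream(channels=1, samplerate=RATE, blocksize=BLOCK,
                    callback=on_audio):
    input("Listening for speech-like noise; press Enter to quit\n")
```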
Back in the early 1980s, there was a certain fad in making your computer produce something resembling human speech. There were several hardware solutions to this, adding voices to everything from automated telephone systems to video game consoles, all the way to Steve Jobs using the gimmick to introduce the Macintosh to the world in 1984. In 1982, a software-based version of this synthesis was released for the Atari 8-bit line of computers, and ever since then [rossumur] has wondered whether or not it could run on the very constrained 2600.
Fast-forward 38 years, and he found out that the answer is yes: it is indeed possible to port a semblance of the original 1982 Software Automatic Mouth (or SAM) to run entirely on the Atari 2600, without any additional hardware. To fit such a seemingly complicated piece of software into the paltry 128 bytes (yes, bytes) of RAM, [rossumur] uses an authoring tool to pre-calculate the allophones and store only those in the ROM. The 2600 alone can’t convert text to phonemes this way, but enough space is left for the pre-computed allophones, which the console converts into sound, that about two minutes of speech can fit into one cartridge. As for why he went through the trouble, we quote the author himself: “Because creating digital swears with 1982 speech synthesis technology on a 1977 game console is exactly what we need right now.”
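To make the authoring-tool trick concrete, here’s a toy Python sketch of the offline step, with an invented allophone table standing in for SAM’s real one and a byte-per-allophone ROM format we made up for illustration:

```python
# Conceptual sketch: do the text-to-allophone work offline on a PC, then
# bake only compact allophone indices into the cartridge ROM. The table
# and encoding are invented; SAM's real allophone set and [rossumur]'s
# actual format differ.
ALLOPHONES = ["PA", "HH", "EH", "LL", "OW", "WW", "ER", "DD"]  # toy subset
INDEX = {name: i for i, name in enumerate(ALLOPHONES)}

def author(phrase):
    """Pack a list of allophone names into ROM bytes, one byte each."""
    return bytes(INDEX[a] for a in phrase)

# "hello world", hand-transcribed into our toy allophone names:
rom_blob = author(["HH", "EH", "LL", "OW", "PA", "WW", "ER", "LL", "DD"])
print(rom_blob.hex())  # this blob gets burned into ROM next to the player code
```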
Even in a world as far off the rails as this one currently is, we’re going to go out on a limb and say that this machine learning, servo-powered prayer bot is going to be the strangest thing you see today. We’re happy to be wrong about that, though, and if we are, please send links.
“The Prayer,” as [Diemut Strebe]’s work is called, may look strange, but it’s another in a string of pieces by various artists that explores just what it means to be human at a time when machines are blurring the line between them and us. The hardware is straightforward: a silicone rubber representation of a human nasopharyngeal cavity, servos for moving the lips, and a speaker to create the vocals. Those are generated by a machine-learning algorithm that was trained against the sacred texts of many of the world’s major religions, including the Christian Bible, the Koran, the Bhagavad Gita, Taoist texts, and the Book of Mormon. The algorithm analyzes the structure of sacred verses and uses Amazon Polly to recreate random prayers and hymns that sound a lot like the real thing. That the lips move in synchrony with the ersatz devotions only adds to the otherworldliness of the piece. Watch it in action below.
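The Amazon Polly leg of that pipeline is the straightforward part. Here’s a hedged Python sketch using boto3, with a stub standing in for the trained model and a voice we picked arbitrarily:

```python
# Sketch of the output side of "The Prayer": take a line from a text
# generator (stubbed here; the installation uses a model trained on sacred
# texts) and voice it with Amazon Polly. Voice and file handling are our
# assumptions.
import boto3

def generate_verse():
    # Stand-in for the trained model's output.
    return "Blessed are the servos, for they shall move the lips."

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=generate_verse(),
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("prayer.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```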
The more glass we punch with our fingertips, the more we miss fun physical interfaces like the rotary phone. Sure, they took forever to dial, and you did not want to be one of those kids stuck with one during the transition to DTMF, especially if you were trying to be the 9th caller to a radio station, but the solidly electromechanical experience of it all was just cool, okay? The sound and the heft made them seem so adult.
[Tal O] gets it. He’s all but finished bringing this old girl into the 21st century without giving anything away on her surface. Inside are some things you’d expect, like a SIM800 GSM module for the telephony part, and an ESP32 that counts the pulses from the dialer and acts as the go-between for the phone’s hardware and the GSM module. But it also has a few things we haven’t seen before. The entire journey is outlined in a five-part video series, and we’ve got part one dialed in for you after the break.
Although [Tal] got the ringer working to prove it could be done, he didn’t want a separate 12V circuit just to run the bells. Also, the bells and their electromagnets take up a lot of space, so he compromised with an MP3 of a rotary ringer. [Tal] also wanted a way to have dialed-number feedback without cutting up the phone to add a screen, so he found a text-to-speech library and made the phone speak each number aloud as soon as it’s dialed. It uses the same internal speaker as the ringer, but we think it would be neat if the feedback came through the handset speaker.
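Here’s roughly how the pulse counting and spoken feedback could hang together, sketched in MicroPython with an assumed pin number, crude debounce values, and a print() standing in for the actual text-to-speech call:

```python
# MicroPython sketch of decoding a rotary dial on an ESP32: the dial closes
# a contact once per unit (ten times for "0"), so count edges and call the
# digit done after a quiet gap. Pin number, debounce, and say_digit() are
# our assumptions, not [Tal]'s firmware.
from machine import Pin
import time

PULSE_PIN = 4            # assumed wiring for the dial's pulse contact
pulse_count = 0
last_edge = 0

def on_pulse(pin):
    global pulse_count, last_edge
    now = time.ticks_ms()
    if time.ticks_diff(now, last_edge) > 20:   # crude debounce
        pulse_count += 1
        last_edge = now

def say_digit(d):
    print("dialed:", d)  # stand-in for the phone's text-to-speech library

dial = Pin(PULSE_PIN, Pin.IN, Pin.PULL_UP)
dial.irq(trigger=Pin.IRQ_FALLING, handler=on_pulse)

while True:
    time.sleep_ms(100)
    # No edges for ~300 ms after at least one pulse means the digit is done.
    if pulse_count and time.ticks_diff(time.ticks_ms(), last_edge) > 300:
        say_digit(pulse_count % 10)            # ten pulses -> digit 0
        pulse_count = 0
```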
If [Tal] is looking for another modern convenience to add to this phone, how about speed dial?
What’s worse than unleashing a monster on the internet? Allowing the internet to control the monster! But that’s just what [8BitsAndAByte] did: they created a monster that anyone on the internet can control. Luckily for us, this monster only talks.
This is a very simple project, and most of the parts are off the shelf. Hardware-wise, the monster’s body is made out of a plastic flowerpot; its mouth is a bit of wood that covers the top of the flowerpot; its eyes, two halves of a plastic sphere painted white with some felt for irises. And then the whole thing is covered in blue fake fur.
Electronics-wise, a Raspberry Pi is running the show, and an AIY Voice HAT handles the text-to-speech. A servo fits inside the flowerpot to open and close the monster’s mouth. On the software end of things, a bit of Python waits for some text, sends it off to the Voice HAT’s text-to-speech module, and moves the servo to open and close the mouth. The scary part, connecting the monster to the internet, is done with remo.tv, some open-source code hosted on GitHub specifically for allowing control of robots over the internet.
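As a sketch, the whole stack could look something like this, assuming the AIY kit’s Python API and a servo on one of the HAT’s servo pins, with a plain input() prompt standing in for the remo.tv hookup:

```python
# Minimal sketch of the monster's brain. The AIY imports and servo pin are
# assumptions based on the AIY kit's documented API; the real project takes
# its text from remo.tv instead of the console.
from gpiozero import Servo
from aiy.pins import PIN_B
from aiy.voice.tts import say

mouth = Servo(PIN_B)

def speak(text):
    mouth.max()   # open the mouth...
    say(text)     # ...speak through the Voice HAT's speaker...
    mouth.min()   # ...and close it again (a real build would flap per word)

while True:
    speak(input("text for the monster: "))
```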
This is a neat little project which is simple enough that kids could build one themselves. The instructions and the Python script are up on the Instructables page, and you can see the monster in action at its page on remo.tv. Perhaps [8BitsAndAByte] could add a couple of these internet-controlled robot arms to the monster to create one that could wreak some real havoc!
[pepelepoisson]’s Miroir Magique (“Magic Mirror”) is an interesting take on the smart mirror concept; it’s intended to be a playful, interactive learning tool for kids who are at an age where language and interactivity are deeply interesting to them, but whose ceaseless demands for examples of spelling and writing can be equally exhausting. Inspiration came from his own five-year-old, who can neither read nor write but nevertheless has a bottomless fascination with the writing and spelling of words, phrases, and numbers.
The magic is all in the simple interface. Magic Mirror waits for activation (a simple pass of the hand over a sensor) then shows that it is listening. Anything it hears, it then displays on the screen and reads back to the user. From an application perspective it’s fairly simple, but what’s interesting is the use of speech-to-text and text-to-speech functions not as a means to an end, but as an end in themselves. A mirror in more ways than one, it listens and repeats back, while writing out what it hears at the same time. For its intended audience of curious children fascinated by the written and spoken aspects of language, it’s part interactive toy and part learning tool.
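The core loop might look something like this Python sketch; speech_recognition and pyttsx3 are our stand-ins, since the real project has its own gesture sensor and speech stack:

```python
# Sketch of the mirror's loop: wait for a trigger, capture speech, show the
# text, and read it back. Libraries and the Enter-key "gesture" are our
# assumptions for illustration.
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
voice = pyttsx3.init()

def wait_for_hand_wave():
    input("(pretend a hand just passed over the sensor; press Enter) ")

while True:
    wait_for_hand_wave()
    with sr.Microphone() as mic:
        audio = recognizer.listen(mic)
    try:
        text = recognizer.recognize_google(audio, language="fr-FR")
    except sr.UnknownValueError:
        continue               # didn't catch that; wait for the next wave
    print(text)                # "write" it on the mirror's display
    voice.say(text)            # and read it back to the child
    voice.runAndWait()
```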
Like most smart mirror projects the technological elements are all hidden; the screen is behind a one-way mirror, speakers are out of sight, and the only inputs are a gesture sensor and a microphone embedded into the frame. Thus equipped, the mirror can tirelessly humor even the most demanding of curious children.
[pepelepoisson] explains some of the technical aspects on the project page (English translation link here) and all the code and build details are available (in French) on the project’s GitHub repository. Embedded below is a demonstration of the Magic Mirror, first in French then switching to English.
First Google gradually improved its WaveNet text-to-speech neural network to the point where it sounds almost perfectly human. Then they introduced Smart Reply which suggests possible replies to your emails. So it’s no surprise that they’ve announced an enhancement for Google Assistant called Duplex which can have phone conversations for you.
What is surprising is how well it works, as you can hear below. The first example is Duplex calling to book an appointment at a hair salon, and the second is it making a reservation at a restaurant.
Note that this reverses the usual roles when a human and a computer talk on the phone. Here the computer is the customer calling the business, and the human is on the business side. The computer’s goal is to book a hair appointment or reserve a table at a restaurant, and it has to carry out the conversation without the human realizing they’re talking to a computer. It’s for communicating with all those businesses which don’t have online booking systems and instead rely on human operators answering the phone.
Not knowing that they’re talking to a computer, the human will therefore speak as they would with another human, with all the pauses, “hmm”s and “ah”s, changes in speed, dropped words, and even mid-sentence changes of context. There’s also the problem of multiple meanings for a phrase. The “four” in “Ok for four” can mean 4 pm or four people.
The component which decides what to say is a recurrent neural network (RNN) trained on many anonymized phone calls. Its inputs are the audio, the output of Google’s automatic speech recognition (ASR) software, and context such as the conversation’s history and its parameters (e.g. booking a table at a restaurant: for how many people, and when), among other things.
The speech itself is produced using Google’s text-to-speech technologies, WaveNet and Tacotron. “Hmm”s and “ah”s are inserted for a more natural sound. Timing is also taken into account: “Hello?” gets an immediate response, but Duplex deliberately introduces latency when responding to more complex questions, since replying too soon would sound unnatural.
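To illustrate the timing trick, and only to illustrate it, here’s a toy Python version; Duplex’s real behavior is learned, not hard-coded like this:

```python
# Toy version of the response-timing idea: easy prompts get an instant
# reply, harder ones get deliberate latency plus an inserted disfluency.
import random
import time

FAST_PROMPTS = {"hello?", "hi", "are you there?"}

def respond(prompt, answer):
    if prompt.lower().strip() in FAST_PROMPTS:
        return answer                      # immediate, like a human "Hello?"
    time.sleep(random.uniform(0.5, 1.5))   # thinking time for harder questions
    return random.choice(["Um, ", "Hmm, "]) + answer

print(respond("Hello?", "Hi, I'd like to book a haircut."))
print(respond("What time works for you?", "Twelve works, for four people."))
```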
There are limitations though. If it decides it can’t complete a task then it hands the conversation over to a human operator. Also, Duplex can’t handle a general conversation. Instead, multiple instances are trained on different domains. So this isn’t the singularity which we’ve talked about before. But if you’re tired of talking to computers at businesses, maybe this will provide a little payback by having the computer talk to the business instead.
On a more serious note, would you want to know if the person you were speaking to was in fact a computer? Perhaps Google should preface each conversation with “Hi! This is Google Assistant calling.” And even knowing that, would you want to have a human conversation with a computer, knowing that its “um”s were artificial? This may save time for the person on whose behalf the call is made, but the person being called might wish the computer would be a little more computer-like and just speak efficiently. Let us know your thoughts in the comments below. Or just check out the following Google I/O ’18 keynote presentation video where all this was announced.