Bark is a universal text-to-audio model that can not only create realistic speech but also incorporate music, background noises, and sound effects. It can even include non-speech sounds like laughter, sighs, and throat clearings. But although it can deliver such complex results, it’s important to understand some of its peculiarities.
Bark is not a conventional text-to-speech program, and how it works has a lot more in common with large language model AI chatbots. This means that results can deviate from expectations, and outputs aren’t necessarily going to be studio-quality speech. As the project’s README points out, “(generated outputs can) be anything from perfect speech to multiple people arguing at a baseball game recorded with bad microphones.” That being said, there is some support for voice presets as a way to help guide the model with some consistency.
Bark was designed by a company called Suno for research purposes and is available under the MIT License. It can be installed and run locally, and has some demos available as well as an online implementation.
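For a sense of what guiding the model with a voice preset looks like, here is a minimal sketch based on the usage shown in the project’s README. The helper names (build_prompt, synthesize), the output filename, and the choice of preset are our own assumptions for illustration; the bracketed cues are among those the README lists, and results will vary from run to run.

```python
# Non-speech cues taken from the list in the Bark README.
CUES = {"laugh": "[laughs]", "sigh": "[sighs]", "throat": "[clears throat]"}


def build_prompt(text, cue=None):
    """Return the text, optionally prefixed with a non-speech cue (hypothetical helper)."""
    if cue is None:
        return text
    return f"{CUES[cue]} {text}"


def synthesize(prompt, preset="v2/en_speaker_6", out_path="bark_out.wav"):
    """Generate audio with a voice preset and save it as a WAV file.

    Requires Bark and SciPy to be installed; the imports are deferred so
    build_prompt() can be used without them.
    """
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads and caches the model weights on first use
    audio = generate_audio(prompt, history_prompt=preset)
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path
```

Calling synthesize(build_prompt("Well, that did not go as planned.", "sigh")) should write a short clip, though, as noted above, the model may not always honor the cue or the preset.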
Being mute or speech-challenged can be a barrier, and [Raymond Li] has an interesting project to contribute to the 2023 Hackaday Prize: a pair of discreet chording keyboards that allow the user to emit live text-to-speech as quickly as one can manipulate them.
The project leverages recent developments to deliver high-quality speech via an open-source web app called VoiceBox, while making sure the input devices themselves don’t get in the way of personal interaction. Keeping the chorders at waist level and ensuring high-quality speech is generated and delivered quickly goes a long way towards making interaction and communication flow more naturally.
The VoiceBox software is doing the heavy lifting, and there’s not yet much detail about the rest of the hardware used in the prototype. It’s currently up to the user to figure out a solution for a wearable computer or a suitable chording keyboard. Still, the prototype appears to use a pair of Charachorder keyboards with a 3D-printed mounting solution to locate them at one’s beltline. Of course, the beauty of the underlying system being so standard is that one can use whatever is most comfortable.
AIs can now apparently carry on a passable conversation, depending on what you classify as passable conversation. The quality of your local pub’s banter aside, an AI stuck in a text box doesn’t have much of a living quality. An AI that holds a conversation aloud, though, is another thing entirely. [William Franzin] has whipped up just that on amateur radio. (Video, embedded below.)
The concept is straightforward, if convoluted. A DSTAR digital voice transmission is received, which is then transcoded to regular digital audio. The audio then goes through a voice recognition engine, and that is used as a question for a ChatGPT AI. The AI’s output is then fed to a text-to-speech engine, and it speaks back with its own voice over the airwaves.
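The chain of stages is easier to see as code. The sketch below is not [William]’s implementation; it only shows the shape of the pipeline, with each stage passed in as a function. All of the names here (answer_over_radio, transcribe, ask_model, synthesize) are hypothetical, and the DSTAR transcoding is assumed to have already happened before the audio arrives.

```python
def answer_over_radio(audio_in, transcribe, ask_model, synthesize):
    """Run one received transmission through the speech -> AI -> speech chain.

    audio_in   : decoded digital audio (bytes) from the received transmission
    transcribe : function turning audio bytes into text (voice recognition)
    ask_model  : function turning a question string into a reply string
    synthesize : function turning reply text into audio bytes (text-to-speech)
    """
    question = transcribe(audio_in).strip()
    if not question:
        # Nothing intelligible was heard, so stay off the air.
        return b""
    reply = ask_model(question)
    return synthesize(reply)


if __name__ == "__main__":
    # Stand-in stages so the flow can be tried without any radio or API access.
    heard = lambda audio: audio.decode("utf-8")
    canned = lambda q: "You asked: " + q
    spoken = lambda text: text.encode("utf-8")
    print(answer_over_radio(b"How do I get a licence?", heard, canned, spoken))
```

Keeping each stage as a plain function means the recognition engine, the chatbot, or the voice can be swapped without touching the rest of the chain.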
[William] demonstrates the system, keying up a transmitter to ask the AI how to get an amateur radio licence. He gets a pretty comprehensive reply in return.
The result is that radio amateurs can call in to ChatGPT with questions, and can receive actual spoken responses from the AI. We can imagine within the next month, AIs will be chatting it up all over the airwaves with similar setups. After all, a few robots could only add more diversity to the already rich and varied ham radio community. Video after the break.
[geyes30]’s Raspberry Pi project does one thing: it finds arbitrary text in the camera’s view and reads it out loud. Does it do so flawlessly? Not really. Was it at least effortless to put together? Also no, but it does wonderfully illustrate the process of gluing together different bits of functionality to make something new. Also, [geyes30]’s kids find it fascinating, and that’s a win all on its own.
The device is made from a Raspberry Pi and camera and works by sending a still image from the camera to an optical character recognition (OCR) program, which converts any visible text in the image to its ASCII representation. The recognized text is then piped to the espeak engine and spoken aloud. Getting all the tools to play nicely took a bit of work, but [geyes30] documented everything so well that even a novice should be able to get the project up and running in an afternoon.
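The glue between the two tools might look something like the sketch below. It is not [geyes30]’s code: we are assuming Tesseract as the OCR program (the write-up above does not name one), the function names are our own, and capturing the still image is left out entirely. The espeak engine is fed over standard input using its --stdin option.

```python
import subprocess


def clean_text(raw):
    """Reduce OCR output to plain ASCII and collapse runs of whitespace."""
    ascii_only = raw.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.split())


def ocr_command(image_path):
    """Command line asking Tesseract (an assumption) to print text to stdout."""
    return ["tesseract", image_path, "stdout"]


def read_aloud(image_path):
    """OCR a still image and pipe whatever text was found to espeak."""
    result = subprocess.run(
        ocr_command(image_path), capture_output=True, text=True, check=True
    )
    text = clean_text(result.stdout)
    if text:
        # --stdin makes espeak read the text to speak from standard input.
        subprocess.run(["espeak", "--stdin"], input=text, text=True, check=True)
    return text
```

Cleaning the text before speaking it matters more than it might seem, since OCR output tends to be full of stray line breaks and odd characters that a speech engine will stumble over.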
Seeing other projects come to life and learning about new tools is a great way to get new ideas, and documenting them helps cross-pollinate among creative types. Did something inspire you recently, or have you documented your own project? We want to hear about it and so do others, so let us know via the tips line!
Those of us who were around in the late 70s and into the 80s might remember the Speak & Spell, a children’s toy with a remarkable text-to-speech synthesizer. While it sounds dated by today’s standards, it was revolutionary for the time and was riding a wave of text-to-speech functionality that was starting to arrive on various computers of the era. While a lot of them used dedicated hardware to perform the speech synthesis, some computers were powerful enough to do this in software, but others were not quite able. The VIC-20 was one of the latter, but thanks to an ESP8266 it has been retroactively given this function.
This project comes to us from [Jan Derogee], a connoisseur of this retrocomputer, and builds on the work by [Earle F. Philhower], who ported the retro speech synthesis software known as SAM from assembly to C, which made it possible to run on the ESP8266. Audio playback is handled on the I2S port, but some work needed to be done to get this to work smoothly since this port also handles the communication with the VIC-20. Once this was sorted out, a patch was made to be able to hear the computer’s audio as well as the speech synthesizer’s. Finally, a serial command interface was designed by [Jan] which allows for control of the module.
While not many of us have VIC-20s sitting at home, it’s still an interesting project that shows the broad scope of a small and inexpensive chip like the ESP8266, which would have had a hefty price tag back in the 1980s. If you have other 80s hardware lying around waiting to be put to work, though, take a look at this project which brings new vocabulary words to that old classic Speak & Spell.
We all need someone to talk to sometimes, and the pandemic has only made matters worse when it comes to the number of people living with anxiety and depression. Exchanging the simplest of pleasantries can make you feel whole again, but the masks make it hard to engage with strangers and judge their emotions, so your big trip to the grocery store can make you feel lonely in a crowd.
So you go back home, still feeling lonely, and maybe you turn on the TV. Watching people interact is probably the next best thing to actual interaction, and it might even make you laugh. But have you ever wished you could talk to the people on TV? With [aniketdhole]’s EMOJO chatbot, you’ll feel as though you’re among friends. And technically you are — all the dialogue is from the TV show Friends.
In Cast Away, Tom Hanks didn’t give that volleyball a frowny face, now did he? Nor does the volleyball have a dopey grin. Instead, it wears a wry smile that suggests depth of character and a grasp of the dire situation at hand. But now we have emoji, and they do a pretty good job of conveying and evoking emotion. EMOJO is a visual chatbot that uses voice and emoji to make easy, two-way conversation to help chase the loneliness away. It uses a Raspberry Pi and a TFT display to take voice input from a Bluetooth headset, convert it to text, and then respond in kind with both voice and text. It was a finalist in the rethink displays round of the Hackaday Prize, and we can’t wait to see how its character develops. Be sure to check out the demo after the break.
“Sorry. I had music playing. Would you say that again?” If we had a money-unit every time someone tried talking to us while we were wearing headphones, we could afford a super-nice pair. For an Embedded C class, [extremerockets] built Listen Up!, a cutoff switch that pauses your music when someone wants your attention.
The idea was born while [extremerockets] was sheltering in place with his daughter, who likes loud music, and he would rather not holler to get her attention. Rather than deny her some auditory privacy, Listen Up! samples the ambient noise level, listens for a sustained rise in amplitude, like speech, and sends a pause signal to the phone. Someday, there may be an option to route the microphone’s audio into the headphones, but for now there is a text-to-speech module for verbalizing character strings. It might be a bit jarring to hear a call to dinner in the middle of a guitar riff, but we don’t like missing dinner either, so we’re with [extremerockets] on this one.
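The "sustained rise" part is what keeps a slammed door from pausing the music. A minimal sketch of that idea follows, written in Python for readability rather than the Embedded C the project was built in. The function name, the 2x threshold, and the three-sample hold are all assumptions of ours, not values from [extremerockets]’ build.

```python
def should_pause(levels, baseline, rise=2.0, hold=3):
    """Return True once `hold` consecutive samples exceed baseline * rise.

    levels   : iterable of ambient amplitude readings, oldest first
    baseline : the quiet-room amplitude measured beforehand
    rise     : how far above baseline counts as "someone is talking"
    hold     : how many loud samples in a row are needed to trigger
    """
    run = 0
    for level in levels:
        # A loud sample extends the run; a quiet one resets it.
        run = run + 1 if level > baseline * rise else 0
        if run >= hold:
            return True
    return False
```

With a baseline of 1.0, the readings 1, 5, 1, 5, 5, 5 trigger a pause on the final sample, while 1, 5, 5, 1, 5, 5 never do, because the loud stretch is interrupted before it reaches three in a row.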