How To Train A New Voice For Piper With Only A Single Phrase

July 9, 2025 by Dave Rowntree 7 Comments

[Cal Bryant] hacked together a home automation system years ago, which more recently utilizes Piper TTS (text-to-speech) voices for various undisclosed purposes. Not satisfied with the robotic-sounding standard voices available, [Cal] set about an experiment to fine-tune the Piper TTS AI voice model using a clone of a single phrase created by a commercial TTS voice as a starting point.

Before the release of Piper TTS in 2023, existing free-to-use TTS systems such as espeak and Festival sounded robotic and flat. Piper delivered much more natural-sounding output, without requiring massive resources to run. To change the voice style, the Piper AI model can be either retrained from scratch or fine-tuned with less effort. In the latter case, the first problem to solve was how to generate the volume of training phrases needed to fine-tune Piper’s AI model. This was solved using a heavyweight AI model, Chatterbox, which is capable of so-called zero-shot voice cloning: imitating a voice from a short reference sample without any retraining. Check out the Chatterbox demo here.

As the loss function gets smaller, the model’s accuracy gets better

Training began with a corpus of test phrases in text format to ensure decent coverage of everyday English. [Cal] used Chatterbox to clone the voice from a single test phrase generated by a ‘mystery TTS system’ and generated 1,300 test phrases in this new voice. This audio set served as training data to fine-tune the Piper AI model on the lashed-up GPU rig.

To verify accuracy, [Cal] used OpenAI’s Whisper software to transcribe the audio back to text, in order to compare with the original text corpus. To overcome issues with punctuation and differences between US and UK English, the text was converted into phonemes using espeak-ng, resulting in a 98% phrase matching accuracy.
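The phoneme-level comparison described above can be sketched roughly as follows. This is a minimal illustration and not [Cal]’s actual script: the function names, the toy phonemizer, and the normalization rules are all assumptions made for the example. The optional `espeak_ng_phonemize` helper assumes the `espeak-ng` command-line tool is installed and accepts the `-q` (no audio) and `--ipa` (print phonemes) flags.

```python
import re
import subprocess
from typing import Callable, Iterable, Tuple


def normalize(phonemes: str) -> str:
    """Strip whitespace, stress marks and punctuation so only phoneme symbols remain."""
    return re.sub(r"[\s'ˈˌ,.!?;:_-]+", "", phonemes)


def phrase_match_rate(
    pairs: Iterable[Tuple[str, str]],
    phonemize: Callable[[str], str],
) -> float:
    """Return the fraction of (original, transcribed) pairs whose phonemes agree."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    matches = sum(
        normalize(phonemize(original)) == normalize(phonemize(transcribed))
        for original, transcribed in pairs
    )
    return matches / len(pairs)


def espeak_ng_phonemize(text: str) -> str:
    """Hypothetical helper: ask an installed espeak-ng for the phonemes of `text`."""
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def toy_phonemize(text: str) -> str:
    """Stand-in phonemizer for demonstration only: it just hides one spelling
    difference and drops punctuation, so the sketch runs without espeak-ng."""
    text = text.lower().replace("colour", "color")
    return re.sub(r"[^a-z ]", "", text)
```

Comparing phonemes rather than raw strings means a transcription like "the color red" still counts as a match for "The colour red.", which is the kind of punctuation and spelling noise the write-up mentions.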

After down-sampling the training set using SoX, it was ready for the Piper TTS training system. Despite all the preparation, running the software felt anticlimactic. A few inconsistencies in the dataset necessitated the removal of some data points. After five days of training, with the rig parked outside in the shade due to concerns about heat, TensorBoard indicated that the model’s loss function was converging. That’s AI-speak for: the model was tuned and ready for action! We think it sounds pretty slick.
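As a rough illustration of what a converging loss means in practice, the sketch below compares the average loss over the most recent steps with the average over the steps just before them. This is not how TensorBoard or Piper decides anything (in the write-up the curve was judged by eye); the function name, window size, and tolerance are assumptions chosen for the example.

```python
from statistics import mean
from typing import Sequence


def has_converged(losses: Sequence[float], window: int = 5, tolerance: float = 0.01) -> bool:
    """Return True when the mean loss of the last `window` values differs from
    the mean of the preceding `window` values by less than `tolerance` (relative)."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(losses[-2 * window:-window])
    recent = mean(losses[-window:])
    if previous == 0:
        return recent == 0
    return abs(previous - recent) / abs(previous) < tolerance
```

A loss curve that is still dropping steeply fails the check, while one that has flattened out passes it.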

If all this new-fangled AI speech synthesis is too complex and, well, a bit creepy for you, may we offer a more 1980s solution to making stuff talk? Finally, most people take the ability to speak for granted, until they can no longer do so. Here’s a team using cutting-edge AI to give people back that ability.

“Glasses” That Transcribe Text To Audio

March 19, 2025 by Lewin Day 10 Comments

Glasses for the blind might sound like an odd idea, given the traditional purpose of glasses and the issue of vision impairment. However, eighth-grade student [Akhil Nagori] built these glasses with an alternate purpose in mind. They’re not really for seeing. Instead, they’re outfitted with hardware to capture text and read it aloud.

Yes, we’re talking about real-time text-to-audio transcription, built into a head-worn format. The hardware is pretty straightforward: a Raspberry Pi Zero 2W runs off a battery and is outfitted with the usual first-party camera. The camera is mounted on a set of eyeglass frames so that it points at whatever the wearer might be “looking” at. At the push of a button, the camera captures an image, and then passes it to an API which does the optical character recognition. The text can then be passed to a speech synthesizer so it can be read aloud to the wearer.
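The capture, recognize, and speak flow can be sketched as a small button handler. This is only an outline under stated assumptions: the three callables are hypothetical placeholders for the camera, the OCR API, and the speech synthesizer, none of which are named in the write-up.

```python
from typing import Callable


def make_button_handler(
    capture_image: Callable[[], bytes],
    recognize_text: Callable[[bytes], str],
    speak: Callable[[str], None],
) -> Callable[[], str]:
    """Build a handler that runs one capture -> OCR -> speech cycle per button press."""

    def on_button_press() -> str:
        image = capture_image()              # grab a still from the camera
        text = recognize_text(image).strip() # send it off for character recognition
        if text:                             # stay silent when nothing was recognized
            speak(text)
        return text

    return on_button_press
```

Passing the three stages in as functions keeps the handler testable on a desktop with stand-ins before any hardware is attached.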

It’s funny to think about how advanced this project really is. Jump back to the dawn of the microcomputer era, and such a device would have been a total flight of fancy—something a researcher might make a PhD and career out of. Indeed, OCR and speech synthesis alone were challenge enough. Today, you can stand on the shoulders of giants and include such mighty capability in a homebrewed device that cost less than $50 to assemble. It’s a neat project, too, and one that we’re sure taught [Akhil] many valuable skills along the way.


A Robot Meant For Humans

November 26, 2024 by Bryan Cockfield 11 Comments

Humanity was hoping for an optimistic robotic future in the post-war era, a sentiment reflected in media like The Jetsons or Lost in Space. Since then, we seem to have shifted our collective consciousness (for good reasons) to a more Black Mirror/Terminator future, as real-world companies like Boston Dynamics are actually building these styles of machines instead of helpful Rosies. But this future isn’t guaranteed, and a PhD researcher is hoping to claim back a more hopeful outlook with a robot called Blossom, which is specifically built to investigate how humans interact with robots.

As a platform, this robot is not too complex, consisting of an accessible frame that can be laser-cut from wood, with only a few moving parts controlled by servos. The robot is not too large, either, and can be set on a desk to be used as a telepresence robot. But Blossom’s creator [Michael] wanted it to help understand how humans interact with robots, so the latest version is outfitted not only with a large language model with text-to-speech capabilities, but also with a compelling backstory, lore, and a voice derived from Animal Crossing that’s neither human nor recognizably synthetic, all in an effort to make the device more approachable.

To that end, [Michael] set the robot up at a Maker Faire to see what sorts of interactions Blossom would have with passersby. Most were interested in the web-based control system for the robot, but a few others came by and had conversations with it. It’s certainly an interesting project and reminds us a bit of this other piece of research from MIT that looked at how humans and robots can work productively alongside one another.

Text-to-Speech Model Can Do Music, Background Noises, And Sound Effects

July 24, 2023 by Donald Papp 8 Comments

Bark is a universal text-to-audio model that can not only create realistic speech but also incorporate music, background noises, and sound effects. It can even include non-speech sounds like laughter, sighs, throat clearings, and similar elements. But despite the fact that it can deliver such complex results, it’s important to understand some of its peculiarities.

The model takes a prompt and generates the resulting sound from scratch. Results might sometimes be unexpected.

Bark is not a conventional text-to-speech program, and how it works has a lot more in common with large language model AI chatbots. This means that results can deviate from expectations, and outputs aren’t necessarily going to be studio-quality speech. As the project’s README points out, “(generated outputs can) be anything from perfect speech to multiple people arguing at a baseball game recorded with bad microphones.” That being said, there is some support for voice presets as a way to help guide the model with some consistency.

Bark was designed by a company called Suno for research purposes and is available under the MIT License. It can be installed and run locally, and has some demos available as well as an online implementation.

The ability to install and run Bark locally is promising territory for incorporating it into projects. And should you be more interested in speech-to-text instead, don’t forget about this plain C/C++ implementation of AI-powered speech recognition.

Hackaday Prize 2023: Wear-a-Chorder Lets Discreet Chording Keyboards Do The Talking

July 8, 2023 by Donald Papp 9 Comments

Being mute or speech-challenged can be a barrier, and [Raymond Li] has an interesting project to contribute to the 2023 Hackaday Prize: a pair of discreet chording keyboards that allow the user to emit live text-to-speech as quickly as one can manipulate them.

Rapid generation of input to high-quality speech helps normalize interactions.

The project leverages recent developments to deliver high-quality speech via an open-source web app called VoiceBox, while making sure the input devices themselves don’t get in the way of personal interaction. Keeping the chorders at waist level and ensuring high-quality speech is generated and delivered quickly goes a long way towards making interaction and communication flow more naturally.

The VoiceBox software is doing the heavy lifting, and there’s not yet much detail about the rest of the hardware used in the prototype. It’s currently up to the user to figure out a solution for a wearable computer or a suitable chording keyboard. Still, the prototype appears to use the Charachorder, with a 3D-printed mounting solution to locate the keyboards at one’s beltline. Of course, the beauty of the underlying system being so standard is that one can use whatever is most comfortable.

The Voice Of ChatGPT Is Now On The Air

January 28, 2023 by Lewin Day 39 Comments

AIs can now apparently carry on a passable conversation, depending on what you classify as passable conversation. The quality of your local pub’s banter aside, an AI stuck in a text box doesn’t have much of a living quality. An AI that holds a conversation aloud, though, is another thing entirely. [William Franzin] has whipped up just that on amateur radio. (Video, embedded below.)

The concept is straightforward, if convoluted. A DSTAR digital voice transmission is received, which is then transcoded to regular digital audio. The audio then goes through a voice recognition engine, and that is used as a question for a ChatGPT AI. The AI’s output is then fed to a text-to-speech engine, and it speaks back with its own voice over the airwaves.

[William] demonstrates the system, keying up a transmitter to ask the AI how to get an amateur radio licence. He gets a pretty comprehensive reply in return.

The result is that radio amateurs can call in to ChatGPT with questions, and can receive actual spoken responses from the AI. We can imagine within the next month, AIs will be chatting it up all over the airwaves with similar setups. After all, a few robots could only add more diversity to the already rich and varied ham radio community. Video after the break.


Raspberry Pi Reads What It Sees, Delights Children

October 31, 2021 by Donald Papp 13 Comments

[Geyes30]’s Raspberry Pi project does one thing: it finds arbitrary text in the camera’s view and reads it out loud. Does it do so flawlessly? Not really. Was it at least effortless to put together? Also no, but it does wonderfully illustrate the process of gluing together different bits of functionality to make something new. Also, [geyes30]’s kids find it fascinating, and that’s a win all on its own.

The device is made from a Raspberry Pi and camera and works by sending a still image from the camera to an optical character recognition (OCR) program, which converts any visible text in the image to its ASCII representation. The recognized text is then piped to the espeak engine and spoken aloud. Getting all the tools to play nicely took a bit of work, but [geyes30] documented everything so well that even a novice should be able to get the project up and running in an afternoon.
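The last step, handing recognized text to the speech engine, might look something like the sketch below. It is an illustration and not [geyes30]’s documented setup: the cleanup rules and function names are assumptions, and the `speak` helper assumes an installed `espeak` binary that reads text from standard input via its `--stdin` option.

```python
import re
import subprocess
from typing import Sequence


def clean_ocr_text(raw: str) -> str:
    """Reduce raw OCR output to printable ASCII with single spaces between words."""
    ascii_only = raw.encode("ascii", errors="ignore").decode("ascii")
    printable = re.sub(r"[^\x20-\x7E\s]", " ", ascii_only)  # drop stray control characters
    return re.sub(r"\s+", " ", printable).strip()           # collapse line breaks and runs of spaces


def speak(text: str, command: Sequence[str] = ("espeak", "--stdin")) -> None:
    """Pipe `text` to a speech engine on stdin; the default command is an assumption."""
    if text:
        subprocess.run(list(command), input=text, text=True, check=True)
```

Cleaning the text first avoids the speech engine stumbling over the line breaks and stray control characters that OCR output often contains.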

Sometimes a function like text-to-speech is an end result in and of itself. This was also true of another similar project: Magic Mirror, whose purpose was to tirelessly indulge children’s curiosity about language.

Seeing other projects come to life and learning about new tools is a great way to get new ideas, and documenting them helps cross-pollinate among creative types. Did something inspire you recently, or have you documented your own project? We want to hear about it and so do others, so let us know via the tips line!
