How To Train A New Voice For Piper With Only A Single Phrase

July 9, 2025 by Dave Rowntree 7 Comments

[Cal Bryant] hacked together a home automation system years ago, which more recently utilizes Piper TTS (text-to-speech) voices for various undisclosed purposes. Not satisfied with the robotic-sounding standard voices available, [Cal] set about an experiment to fine-tune the Piper TTS AI voice model using a clone of a single phrase created by a commercial TTS voice as a starting point.

Before the release of Piper TTS in 2023, existing free-to-use TTS systems such as espeak and Festival sounded robotic and flat. Piper delivered much more natural-sounding output, without requiring massive resources to run. To change the voice style, the Piper AI model can be either retrained from scratch or fine-tuned with less effort. In the latter case, the first problem to solve was how to generate the volume of training phrases needed to fine-tune Piper’s AI model. This was solved using a heavyweight AI model, Chatterbox, which is capable of so-called zero-shot voice cloning: it can imitate a voice from a short reference clip without being retrained on it. Check out the Chatterbox demo here.

As the loss function gets smaller, the model’s accuracy gets better

Training began with a corpus of test phrases in text format to ensure decent coverage of everyday English. [Cal] used Chatterbox to clone the voice from a single test phrase generated by a ‘mystery TTS system’, then synthesized 1,300 phrases in this new voice. This audio set served as training data to fine-tune the Piper AI model on the lashed-up GPU rig.
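For the curious, here is a minimal sketch of how a cloned-voice training set like this might be laid out on disk. It assumes an LJSpeech-style layout (a wav folder plus a pipe-delimited metadata.csv), which is a common convention for TTS training data; the source does not describe [Cal]'s actual layout. The `synthesize` argument is a hypothetical stand-in for whatever call drives the cloning model.

```python
from pathlib import Path
from typing import Callable, Iterable


def build_dataset(phrases: Iterable[str], out_dir: Path,
                  synthesize: Callable[[str], bytes]) -> Path:
    """Write one WAV per phrase plus an LJSpeech-style metadata.csv.

    `synthesize` is a hypothetical callable returning complete WAV file
    bytes for a phrase, e.g. a thin wrapper around a voice-cloning model.
    """
    wav_dir = out_dir / "wav"
    wav_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for index, phrase in enumerate(phrases):
        text = " ".join(phrase.split())  # collapse stray whitespace and newlines
        if not text:
            continue  # skip blank lines in the corpus
        clip_id = f"{index:04d}"
        (wav_dir / f"{clip_id}.wav").write_bytes(synthesize(text))
        rows.append(f"{clip_id}|{text}")  # one "id|text" row per clip
    metadata = out_dir / "metadata.csv"
    metadata.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return metadata
```

Keeping the clip ID tied to the phrase's position in the corpus makes it easy to trace a bad clip back to its source text later.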

To verify accuracy, [Cal] used OpenAI’s Whisper software to transcribe the audio back to text, in order to compare with the original text corpus. To overcome issues with punctuation and differences between US and UK English, the text was converted into phonemes using espeak-ng, resulting in a 98% phrase matching accuracy.
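The phoneme trick is easy to picture in code. The sketch below is not [Cal]'s script: it assumes espeak-ng is installed and uses its `-q` (no audio) and `-x` (print phoneme mnemonics) options, with the `en-gb` voice picked arbitrarily. The transcripts would come from Whisper; here they are just strings passed in by the caller.

```python
import subprocess
from typing import Callable, Iterable, Tuple


def espeak_phonemes(text: str, voice: str = "en-gb") -> str:
    """Return espeak-ng's phoneme mnemonics for `text` (needs espeak-ng installed)."""
    result = subprocess.run(
        ["espeak-ng", "-q", "-x", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return " ".join(result.stdout.split())  # normalize whitespace


def match_rate(pairs: Iterable[Tuple[str, str]],
               phonemize: Callable[[str], str] = espeak_phonemes) -> float:
    """Fraction of (original, transcript) pairs whose phoneme strings are identical."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(phonemize(original) == phonemize(heard) for original, heard in pairs)
    return hits / len(pairs)
```

Comparing phonemes rather than raw text means "colour" versus "color", or a missing comma, no longer counts as a mismatch. Phrases that still fail the comparison are candidates for removal from the training set.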

After down-sampling the training set using SoX, it was ready for the Piper TTS training system. Despite all the preparation, running the software felt anticlimactic. A few inconsistencies in the dataset necessitated the removal of some data points. After five days of training, with the rig parked outside in the shade due to concerns about heat, TensorBoard indicated that the model’s loss function was converging. That’s AI-speak for: the model was tuned and ready for action! We think it sounds pretty slick.
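A down-sampling pass with SoX might look something like the sketch below. The target of 22,050 Hz mono 16-bit is an assumption (it is a rate commonly used for Piper voices); the source does not say which rate [Cal] chose, and the directory names are whatever the caller supplies.

```python
import subprocess
from pathlib import Path
from typing import List


def sox_resample_command(src: Path, dst: Path, rate_hz: int = 22050) -> List[str]:
    """Build a SoX command converting `src` to mono 16-bit audio at `rate_hz`."""
    # Output format options (-r rate, -c channels, -b bit depth) go before the output file.
    return ["sox", str(src), "-r", str(rate_hz), "-c", "1", "-b", "16", str(dst)]


def resample_tree(src_dir: Path, dst_dir: Path, rate_hz: int = 22050) -> int:
    """Resample every .wav in `src_dir` into `dst_dir`; returns the file count."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.glob("*.wav")):
        subprocess.run(sox_resample_command(src, dst_dir / src.name, rate_hz), check=True)
        count += 1
    return count
```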

If all this new-fangled AI speech synthesis is too complex and, well, a bit creepy for you, may we offer a more 1980s solution to making stuff talk? Finally, most people take the ability to speak for granted, until they can no longer do so. Here’s a team using cutting-edge AI to give people back that ability.

Convert Any Book To A DIY Audiobook?

July 6, 2025 by Dave Rowntree 12 Comments

If the idea of reading a physical book sounds like hard work, [Nick Bild’s] latest project, the PageParrot, might be for you. While AI gets a lot of flak these days, one thing modern multimodal models do exceptionally well is image interpretation, and PageParrot demonstrates just how accessible that’s become.

[Nick] demonstrates quite clearly how little code is needed to get from those cryptic black and white glyphs to sounds the average human can understand, specifically a paltry 80 lines of Python. Admittedly, many of those lines are pulling in libraries, and some are just blank, so functionally speaking, it’s even shorter than that. Of course, the whole application is mostly glue code, stitching together other people’s hard work, but it’s still instructive and fun to play with.

The hardware required is a Raspberry Pi Zero 2 W, a camera (in this case, a USB webcam), and something to hold it above the book. Any Pi with the ability to connect to a camera should also work, however, with just a little configuration.

On the software side, [Nick] pulls in the CV2 library (which is the interface to OpenCV) to handle the camera interfacing, programming it to full HD resolution. Google’s GenAI is used to interface the Gemini 2.5 Flash LLM via an API endpoint. This takes a captured image and a trivial prompt, and returns the whole page of text, quick as a flash.
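To give a feel for how little glue is involved, here is a rough sketch of the capture-and-read step. It is not [Nick]'s code: the prompt wording, function names, and camera index are assumptions, and it needs the opencv-python and google-genai packages plus an API key to actually run.

```python
FULL_HD = (1920, 1080)  # width, height
PROMPT = "Transcribe all of the text on this page. Return only the text."  # assumed wording


def capture_page_jpeg(camera_index: int = 0) -> bytes:
    """Grab one full-HD frame from the camera and return it as JPEG bytes."""
    import cv2  # imported lazily so the module loads without OpenCV installed

    cap = cv2.VideoCapture(camera_index)
    try:
        cap.set(cv2.CAP_PROP_FRAME_WIDTH, FULL_HD[0])
        cap.set(cv2.CAP_PROP_FRAME_HEIGHT, FULL_HD[1])
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError("camera returned no frame")
        ok, encoded = cv2.imencode(".jpg", frame)
        if not ok:
            raise RuntimeError("JPEG encoding failed")
        return encoded.tobytes()
    finally:
        cap.release()


def read_page(jpeg: bytes, api_key: str, model: str = "gemini-2.5-flash") -> str:
    """Send the page image and prompt to Gemini and return the transcribed text."""
    from google import genai  # google-genai package, imported lazily
    from google.genai import types

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model=model,
        contents=[types.Part.from_bytes(data=jpeg, mime_type="image/jpeg"), PROMPT],
    )
    return response.text or ""
```

Something like `read_page(capture_page_jpeg(), api_key)` would then return the page text, assuming the camera and the API key are both set up.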

Finally, the script hands that text over to Piper, which turns that into a speech file in WAV format. This can then be played to an audio device with a call out to the console aplay tool. It’s all very simple at this level of abstraction.
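The speech step can be sketched in a few lines. This assumes the original Piper command-line tool, which reads text on standard input and takes `--model` and `--output_file` options, along with ALSA's aplay; the voice model path and the page.wav filename are placeholders, not taken from [Nick]'s script.

```python
import subprocess
from typing import List


def piper_command(model_path: str, wav_path: str) -> List[str]:
    """Piper reads the text to speak from stdin and writes a WAV file."""
    return ["piper", "--model", model_path, "--output_file", wav_path]


def aplay_command(wav_path: str) -> List[str]:
    """aplay plays a WAV file on the default ALSA audio device."""
    return ["aplay", wav_path]


def speak(text: str, model_path: str, wav_path: str = "page.wav") -> None:
    """Synthesize `text` to a WAV file with Piper, then play it."""
    subprocess.run(piper_command(model_path, wav_path), input=text, text=True, check=True)
    subprocess.run(aplay_command(wav_path), check=True)
```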


Christmas Comes Early With AI Santa Demo

May 18, 2025 by Tyler August 7 Comments

With only two hundred odd days ’til Christmas, you just know we’re already feeling the season’s magic. Well, maybe not, but [Sean Dubois] has decided to give us a head start with this WebRTC demo built into a Santa stuffie.

The details are a little bit sparse (hopefully he finishes the documentation on GitHub by the time this goes out) but the project is really neat. Hardware-wise, it’s an audio-enabled ESP32-S3 dev board living inside Santa, running OpenAI’s Realtime Embedded SDK (as implemented by Espressif), with some customization by [Sean]. It looks like the audio is going through the newest version of libpeer, and the heavy lifting is all happening in the cloud, as you’d expect with this SDK. (A key is required, but hey! It’s all open source; if you have an AI that can do the job locally-hosted, you can probably figure out how to connect to it instead.)

This speech-to-speech AI doesn’t need to emulate Santa Claus, of course; you can prime the AI with any instructions you’d like. If you want to delight children, though, it’s hard to beat the Jolly Old Elf, and you certainly have time to get it ready for Christmas. Thanks to [Sean] for sending in the tip.

If you like this project but want to avoid paying OpenAI API fees, here’s a speech-to-text model to get you started. We covered this AI speech generator last year to handle the talky bit. If you put them together and make your own Santa Claus (or perhaps something more seasonal to this time of year), don’t forget to drop us a tip!

“Glasses” That Transcribe Text To Audio

March 19, 2025 by Lewin Day 10 Comments

Glasses for the blind might sound like an odd idea, given the traditional purpose of glasses and the issue of vision impairment. However, eighth-grade student [Akhil Nagori] built these glasses with an alternate purpose in mind. They’re not really for seeing. Instead, they’re outfitted with hardware to capture text and read it aloud.

Yes, we’re talking about real-time text-to-audio transcription, built into a head-worn format. The hardware is pretty straightforward: a Raspberry Pi Zero 2W runs off a battery and is outfitted with the usual first-party camera. The camera is mounted on a set of eyeglass frames so that it points at whatever the wearer might be “looking” at. At the push of a button, the camera captures an image, and then passes it to an API which does the optical character recognition. The text can then be passed to a speech synthesizer so it can be read aloud to the wearer.

It’s funny to think about how advanced this project really is. Jump back to the dawn of the microcomputer era, and such a device would have been a total flight of fancy—something a researcher might make a PhD and career out of. Indeed, OCR and speech synthesis alone were challenge enough. Today, you can stand on the shoulders of giants and include such mighty capability in a homebrewed device that cost less than $50 to assemble. It’s a neat project, too, and one that we’re sure taught [Akhil] many valuable skills along the way.


Speaking Computers From The 1970s

March 5, 2025 by Al Williams 24 Comments

Talking computers are nothing these days. But in the old days, a computer that could speak was quite the novelty. Many computers from the 1970s and 1980s used an AY-3-8910 chip and [InazumaDenki] has been playing with one of these venerable chips. You can see (and hear) the results in the video below.

The chip uses PCM, and there are different ways to store and play sounds. The video shows how different they are and even looks at the output on the oscilloscope. The chip has three voices and was produced by General Instrument, the company that initially made PIC microcontrollers. It found its way into many classic arcade games, as well as consoles and home computers like the Intellivision, Vectrex, MSX, and ZX Spectrum. Sound cards for the TRS-80 Color Computer and the Apple II used these chips. The Atari ST used a variant from Yamaha, the YM2149F.

There’s some code for an ATmega, and the video says it is part one, so we expect to see more videos on this chip soon.

General Instrument had other speech chips, and some of them are still around in emulated form. In fact, you can emulate the AY-3-8910 with little more than a Raspberry Pi.


RP2040 Emulator Brings The Voice Of The 80s Back To Life

September 22, 2023 by Dan Maloney 25 Comments

You may not have heard, but there’s a chip shortage out there. And it’s not just the fancy new chips that are in short supply; the chips that were fancy and new back when you could still buy them from Radio Shack are getting hard to come by, too. For different reasons, of course, but it does pose a problem that requires a little hacking to fix.

The chip in question here is the General Instrument SP0256, a 1980s-era speech synthesizer chip that [Andrew Menadue] relies on. The LSI chip stored 59 unique allophones, or basic sounds the vocal tract is capable of, and synthesized speech by rapidly concatenating these sounds. The chip and its descendants made regular appearances in computers and games throughout the 80s, so chances are good you’ve heard it. If not, think WarGames (yes, we know that wasn’t actually a computerized voice) or [Stephen Hawking] and you’ll be pretty close.

[Andrew]’s need for such a chip stems from his attempts to give voice to his collection of Psion Organisers, another 80s relic that was one of the first pocket computers. Some time ago he built a speech board for the Psion based on the SP0256-AL2, but had to resort to building an emulator for the chip since none were to be had. The emulator uses an RP2040 and lives on a PCB that has the same footprint as the original chip, so it can just plug right in. He dug up WAV files of the allophones and translated those to sequences of bytes, allowing the RP2040 to output the correct sounds as they’re called for. Speaker problems notwithstanding, it sounds pretty good in the video below.
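The WAV-to-bytes step and the concatenation idea are simple enough to sketch. What follows is an illustration in Python, not [Andrew]'s firmware: it strips the WAV container from each allophone recording, formats the samples as a C array that could be compiled into a microcontroller project, and strings allophones together on request. The allophone names and array names are whatever the caller chooses.

```python
import io
import wave
from typing import Dict, Iterable


def wav_to_samples(wav_bytes: bytes) -> bytes:
    """Strip the WAV container and return the raw PCM frames."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.readframes(wav.getnframes())


def to_c_array(name: str, samples: bytes) -> str:
    """Format raw samples as a C byte array for inclusion in firmware."""
    body = ", ".join(f"0x{b:02x}" for b in samples)
    return f"const uint8_t {name}[{len(samples)}] = {{ {body} }};"


def render(sequence: Iterable[str], table: Dict[str, bytes]) -> bytes:
    """Concatenate the stored samples for each allophone in `sequence`."""
    return b"".join(table[name] for name in sequence)
```

On the real hardware, the equivalent of `render` would stream each allophone's samples to the audio output as the host requests them, rather than building one big buffer first.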

We’ve featured a fair number of SP0256 projects before, on everything from Amstrad to Z80. We’ve also shown off a few of [Andrew]’s builds before, including this exploration of the voltage tolerance of the RP2040.


Make Your ESP32 Talk Like It’s The 80s Again

April 25, 2023 by Donald Papp 21 Comments

80s-era electronic speech has an undeniable retro appeal, and it can also be a useful data output method, since it can be implemented on very little hardware. [luc] demonstrates this with a talking thermometer project that requires no display and no special hardware to communicate temperatures to a user.

Back in the day, there were chips like the Votrax SC-01A that could play phonemes (distinct sounds that make up a language) on demand. These would be mixed and matched to create identifiable words, in that distinctly synthesized Speak & Spell manner that is so charming-slash-uncanny.

Software-only speech synthesis isn’t new, but it’s better now than it was in Atari’s day.

Nowadays, even hobbyist microcontrollers have more than enough processing power and memory to do a similar job entirely in software, which is exactly what [luc]’s talking thermometer project does. All this is done with the Talkie library, originally written for the Arduino and updated for the ESP32 and other microcontrollers. With it, one only needs headphones or a simple audio amplifier and speaker to output canned voice data from a project.
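Speaking a reading from canned voice data mostly comes down to picking the right words in the right order. The sketch below shows that logic in Python for illustration only; it is not the Talkie API or [luc]'s code, and the vocabulary and function names are made up. On a microcontroller, each word in the returned list would map to a stored clip that gets played in turn.

```python
from typing import List

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
        60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}


def number_words(n: int) -> List[str]:
    """Return the canned words needed to say an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("only 0 to 99 is supported in this sketch")
    if n < 20:
        return [ONES[n]]
    tens, ones = divmod(n, 10)
    words = [TENS[tens * 10]]
    if ones:
        words.append(ONES[ones])
    return words


def temperature_phrase(celsius: float) -> List[str]:
    """Turn a temperature reading into the sequence of words to play."""
    rounded = round(celsius)
    words = ["minus"] if rounded < 0 else []
    words += number_words(abs(rounded))
    words.append("degree" if abs(rounded) == 1 else "degrees")
    return words
```

A vocabulary of around thirty clips covers every whole-number temperature a household thermometer is likely to report.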

[luc] uses it to demonstrate how to communicate to a user in a hands-free manner without needing a display. We also saw this output method in an electric unicycle with a talking speedometer (judged to better allow the user to keep their eyes on the road, as well as minimizing the parts count).

Would you like to listen to an authentic, somewhat-understandable 80s-era text-to-speech synthesizer? You’re in luck, because we can show you an authentic vintage MicroVox unit in action. Give it a listen, and compare it to a demo of the Talkie library in the video below.
