How To Train A New Voice For Piper With Only A Single Phrase

[Cal Bryant] hacked together a home automation system years ago, which more recently utilizes Piper TTS (text-to-speech) voices for various undisclosed purposes. Not satisfied with the robotic-sounding standard voices available, [Cal] set about an experiment to fine-tune the Piper TTS AI voice model using a clone of a single phrase created by a commercial TTS voice as a starting point.

Before the release of Piper TTS in 2023, existing free-to-use TTS systems such as espeak and Festival sounded robotic and flat. Piper delivered much more natural-sounding output without requiring massive resources to run. To change the voice style, the Piper model can either be retrained from scratch or, with less effort, fine-tuned. In the latter case, the first problem to solve was generating a large enough set of training phrases to fine-tune Piper's model. This was solved with a heavyweight AI model, Chatterbox, which is capable of so-called zero-shot voice cloning. Check out the Chatterbox demo here.

As the loss function gets smaller, the model’s accuracy gets better

Training began with a corpus of test phrases in text format to ensure decent coverage of everyday English. [Cal] used Chatterbox to clone the voice from a single test phrase generated by a 'mystery TTS system', then generated audio for 1,300 test phrases in that new voice. This audio set served as the training data to fine-tune the Piper model on a lashed-up GPU rig.
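For a sense of the shape of that generation loop, here's a minimal sketch assuming Chatterbox's published Python API and hypothetical file names (phrases.txt for the corpus, reference.wav for the cloned phrase); it's an illustration, not [Cal]'s actual script:

```python
# Sketch: zero-shot clone a voice from one reference clip, then synthesise every phrase.
import pathlib
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
pathlib.Path("dataset").mkdir(exist_ok=True)

with open("phrases.txt") as f:                       # hypothetical phrase list, one per line
    phrases = [line.strip() for line in f if line.strip()]

for i, phrase in enumerate(phrases):
    # audio_prompt_path is the single cloned reference phrase used as the voice prompt
    wav = model.generate(phrase, audio_prompt_path="reference.wav")
    ta.save(f"dataset/{i:05d}.wav", wav, model.sr)
```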

To verify accuracy, [Cal] used OpenAI's Whisper to transcribe the audio back to text and compared it with the original text corpus. To overcome issues with punctuation and differences between US and UK English, the text was converted into phonemes using espeak-ng, which yielded a 98% phrase-matching accuracy.
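A rough sketch of that verification loop, assuming the open-source whisper package, the espeak-ng command-line tool, and the same hypothetical file layout as above, might look like this:

```python
# Transcribe generated audio with Whisper, then compare phonemes rather than raw text
# to sidestep punctuation and US/UK spelling differences.
import subprocess
import whisper

def phonemes(text: str) -> str:
    # espeak-ng -q suppresses audio output, --ipa prints IPA phonemes for the text
    out = subprocess.run(["espeak-ng", "-q", "--ipa", text],
                         capture_output=True, text=True)
    return " ".join(out.stdout.split())

model = whisper.load_model("base")

with open("phrases.txt") as f:                       # hypothetical phrase list, one per line
    phrases = [line.strip() for line in f if line.strip()]

matches = 0
for i, phrase in enumerate(phrases):
    result = model.transcribe(f"dataset/{i:05d}.wav")
    if phonemes(result["text"]) == phonemes(phrase):
        matches += 1

print(f"phrase match accuracy: {matches / len(phrases):.1%}")
```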

After down-sampling the training set with SoX, it was ready for the Piper TTS training system. Despite all the preparation, running the software felt anticlimactic. A few inconsistencies in the dataset necessitated the removal of some data points. After five days of training, with the rig parked outside in the shade over heat concerns, TensorBoard indicated that the model's loss function was converging. That's AI-speak for: the model was tuned and ready for action! We think it sounds pretty slick.
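The down-sampling step is little more than a loop around SoX; a sketch, assuming the 22,050 Hz mono format that Piper's medium-quality voices typically use:

```python
# Resample every clip to 22,050 Hz mono with SoX, writing into a parallel directory.
import pathlib
import subprocess

src = pathlib.Path("dataset")
dst = pathlib.Path("dataset_22k")
dst.mkdir(exist_ok=True)

for wav in sorted(src.glob("*.wav")):
    subprocess.run(["sox", str(wav), "-r", "22050", "-c", "1", str(dst / wav.name)],
                   check=True)
```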

If all this new-fangled AI speech synthesis is too complex and, well, a bit creepy for you, may we offer a more 1980s solution to making stuff talk? Finally, most people take the ability to speak for granted, until they can no longer do so. Here’s a team using cutting-edge AI to give people back that ability.

Image: a hand holding a potato-shaped enclosure studded with electronics, with a red alligator clip attached to a nail protruding from it.

Building A Potato-based GLaDOS As An Introduction To AI

Although not nearly as intimidating as her ceiling-mounted hanging arm body, GLaDOS spent a significant portion of the Portal 2 game as a stripped-down computer powered by a potato battery. [Dave] had already made a version of her original body, but it was built around a robotic arm that was too expensive for the project to be really accessible. For his latest project, therefore, he's created an AI-powered version of GLaDOS's potato-based incarnation, which also serves as a fun introduction to building AI systems.

[Dave] wanted the system to work offline, so he needed a computer powerful enough to run all of his software locally. He chose an Nvidia Jetson Orin Nano, which was powerful enough to run a workable software system, albeit slowly and with some memory limitations. A potato cell unfortunately doesn’t generate enough power to run a Jetson, and it would be difficult to find a potato large enough to fit the Jetson inside. Instead, [Dave] 3D-printed and painted a potato-shaped enclosure for the Jetson, a microphone, a speaker, and some supplemental electronics.

A large language model handles interactions with the user, but most models were too large to fit on the Jetson. [Dave] eventually selected Llama 3.2 and used LlamaIndex to preprocess information from the Portal wiki for retrieval-augmented generation. Getting the model's prompt right was a bit difficult, but after contacting a prompt engineer, [Dave] managed to get it to respond to the hapless user in an appropriately acerbic manner. For speech generation, [Dave] used Piper after training it on audio files from the Portal wiki, and for speech recognition he used Vosk (a good programming exercise, Vosk being, in his words, "somewhat documented"). He's made all of the final code available on GitHub under the fitting name of PotatOS.
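As a rough illustration of that retrieval-augmented setup (not [Dave]'s actual code), here's a minimal LlamaIndex sketch; it assumes Llama 3.2 is served locally through Ollama, a local embedding model to keep everything offline, and a portal_wiki/ directory of scraped pages, none of which is confirmed by the write-up:

```python
# Minimal offline RAG sketch: index the scraped wiki pages, answer queries with a local Llama 3.2.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)    # assumes Ollama is running locally
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")  # local embeddings

documents = SimpleDirectoryReader("portal_wiki").load_data()      # hypothetical dump of Portal wiki pages
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("Who are you, and what do you think of test subjects?"))
```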

The end result is a handheld device that sarcastically insults anyone seeking its guidance. At least [Dave] had the good sense not to give this pernicious potato control over his home.

Convert Any Book To A DIY Audiobook?

If the idea of reading a physical book sounds like hard work, [Nick Bild’s] latest project, the PageParrot, might be for you. While AI gets a lot of flak these days, one thing modern multimodal models do exceptionally well is image interpretation, and PageParrot demonstrates just how accessible that’s become.

[Nick] demonstrates quite clearly how little code is needed to get from those cryptic black and white glyphs to sounds the average human can understand, specifically a paltry 80 lines of Python. Admittedly, many of those lines are pulling in libraries, and some are just blank, so functionally speaking, it’s even shorter than that. Of course, the whole application is mostly glue code, stitching together other people’s hard work, but it’s still instructive and fun to play with.

The hardware required is a Raspberry Pi Zero 2 W, a camera (in this case, a USB webcam), and something to hold it above the book. Any Pi that can connect to a camera should also work, though, with just a little configuration.

On the software side, [Nick] pulls in the CV2 library (the Python interface to OpenCV) to handle the camera, configuring it for full-HD resolution. Google's GenAI library is used to call the Gemini 2.5 Flash LLM via an API endpoint. This takes a captured image and a trivial prompt, and returns the whole page of text, quick as a flash.
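Stripped of error handling, the capture-and-read step comes down to something like this sketch using the google-genai client; the prompt wording and the in-memory JPEG hand-off are our guesses, not [Nick]'s exact code:

```python
# Grab a full-HD frame from the webcam and ask Gemini 2.5 Flash to read the page.
import cv2
from google import genai
from google.genai import types

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
ret, frame = cap.read()
cap.release()

_, jpeg = cv2.imencode(".jpg", frame)          # encode the frame in memory
client = genai.Client()                        # picks up GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=jpeg.tobytes(), mime_type="image/jpeg"),
        "Transcribe all of the text on this book page.",
    ],
)
page_text = response.text
```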

Finally, the script hands that text over to Piper, which turns it into a speech file in WAV format. This can then be played on an audio device with a call out to the console aplay tool. It's all very simple at this level of abstraction.
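Continuing from the sketch above, that last hop is just two external calls; the voice model name below is a placeholder for whichever Piper voice is actually in use:

```python
# Hand the page text to the Piper CLI, then play the resulting WAV through aplay.
import subprocess

subprocess.run(
    ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "page.wav"],
    input=page_text, text=True, check=True,
)
subprocess.run(["aplay", "page.wav"], check=True)
```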


AI Is Only Coming For Fun Jobs

In the past few years, what marketers and venture capital firms term "artificial intelligence" (more often an advanced predictive-text model of some sort) has started taking people's jobs and threatening others. But not the tedious jobs that society might like to have automated away in the first place. These AI tools have generally been taking rewarding or enjoyable jobs like artist, author, filmmaker, programmer, and composer. This project from a research team might soon be able to add astronaut to that list.

The team was working within the confines of the Kerbal Space Program Differential Game Challenge, an open-source plugin from MIT that allows developers to test various algorithms and artificial intelligences in simulated spacecraft scenarios. Generally, purpose-built models are used here, with many rounds of refinement and testing, but since this process can be time-consuming and costly, the researchers on this team decided to hand over control to ChatGPT with only limited instructions. A translation layer built by the researchers allows the generated text to be converted to spacecraft controls.
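The paper's translation layer isn't reproduced here, but the idea is easy to caricature: constrain the model to a simple key=value reply format and map that onto throttle and attitude commands. A purely hypothetical sketch:

```python
# Hypothetical translation layer: turn a constrained LLM reply such as
# "throttle=0.6 pitch=5 yaw=-2 roll=0" into numeric spacecraft controls.
import re

def parse_controls(reply: str) -> dict[str, float]:
    controls = {"throttle": 0.0, "pitch": 0.0, "yaw": 0.0, "roll": 0.0}
    for key, value in re.findall(r"(\w+)\s*=\s*(-?\d+(?:\.\d+)?)", reply):
        if key.lower() in controls:
            controls[key.lower()] = float(value)
    # clamp throttle to a sane range before handing the values to the simulator
    controls["throttle"] = min(max(controls["throttle"], 0.0), 1.0)
    return controls

print(parse_controls("throttle=0.6 pitch=5 yaw=-2 roll=0"))
```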

We'll note that, at least as of right now, large language models haven't taken the jobs of any actual astronauts. The game challenge is generally meant for uncrewed spacecraft like orbital satellites, which often need to make their own decisions to maintain orbits and avoid obstacles. This specific model was able to place second in a recent competition as well, although we'll keep rooting for the humans in contests like these.

Why GitHub Copilot Isn’t Your Coding Partner

These days ‘AI’ is everywhere, including in software development. Coming hot on the heels of approaches like eXtreme Programming and Pair Programming, there’s now a new kind of pair programming in town in the form of an LLM that’s been digesting millions of lines of code. Purportedly designed to help developers program faster and more efficiently, these ‘AI programming assistants’ have primarily led to heated debate and some interesting studies.

In the case of [Jj], their undiluted feelings towards programming assistants like GitHub Copilot burn as brightly as the fire of a thousand Suns, and not a happy kind of fire.

Whether it's Copilot or ChatGPT or some other chatbot that may or may not be integrated into your IDE, the frustration with what often feels like StackOverflow-powered autocomplete is something that many of us can likely sympathize with. Although [Jj] lists a few positives of using an LLM trained on codebases and documentation, their overall view is that using Copilot degrades a programmer, mostly because of how it takes critical thinking skills out of the loop.

Regardless of whether you agree with [Jj] or not, the research so far on using LLMs with software development and other tasks strongly suggests that they’re not a net positive for one’s mental faculties. It’s also important to note that at the end of the day it’s still you, the fleshy bag of mostly salty water, who has to justify the code during code review and when something catches on fire in production. Your ‘copilot’ meanwhile gets off easy.

AI Might Kill Us All (With Carbon Emissions)

So-called artificial intelligence (AI) is all the rage right now, whether it's your grandma asking ChatGPT how to code in Python or influencers making videos without having to hire extras, but one growing concern is where the power for all the data centers is going to come from. The MIT Technology Review team did a deep dive on the current situation and whether AI is going to kill us all (with carbon emissions).

Probably of most interest to you, dear hacker, is how they came up with their numbers. With no agreed-upon methods and different companies doing different types of processing, there were a number of assumptions baked into their estimates. Given the lack of information on closed-source models, open-source models were used as the benchmark for energy usage and extrapolated to the industry as a whole. Unsurprisingly, larger models have a larger energy footprint.

While data center power usage remained roughly flat from 2005 to 2017, as increases in efficiency offset the growth of online services, data centers had doubled their energy consumption from those earlier numbers by 2023. The power running into those data centers is already 48% more carbon-intensive than the US average, and that figure is expected to rise as new data centers push for increased fossil fuel usage, like Meta in Louisiana or the X data center found to be using methane generators in violation of the Clean Air Act.

Technology Review did find that "researchers estimate that if data centers cut their electricity use by roughly half for just a few hours during the year, it will allow utilities to handle some additional 76 gigawatts of new demand." This would mean either reallocating requests to servers in other geographic regions or just slowing down responses for the 80 to 90 hours a year when the grid is at its highest loads.

If you’re interested in just where a lot of the US-based data centers are, check out this map from NREL. Still not sure how these LLMs even work? Here’s an explainer for you.

ELIZA Reanimated

The last time we checked in with the ELIZA archeology project, they had unearthed the earliest known copy of the code for the infamous computer psychiatrist written in MAD-SLIP. After a lot of work, that version is now running again, and there were a number of interesting surprises.

While chatbots are all the modern rage, [Joseph Weizenbaum] created what could be the first one, ELIZA, in the mid-1960s. Of course, it wasn’t as capable as what we have today, but it is a good example of how simple it is to ape human behavior.
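For anyone who hasn't peeked under the hood, ELIZA's trick was a script of keyword rules with decomposition and reassembly templates. A toy illustration in Python (nothing to do with the recovered MAD-SLIP source) captures the flavor:

```python
# Toy ELIZA-style responder: match a keyword pattern, then reflect the user's words back.
import re

RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
]

def respond(line: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(line)
        if match:
            return template.format(*match.groups())
    return "Please go on."                 # the classic fallback when no keyword matches

print(respond("I need a vacation"))        # -> "Why do you need a vacation?"
```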

The original host was an IBM 7094, and MAD-SLIP has long since fallen out of favor. Most previously known versions were in Lisp or even BASIC. But once the original code was found, it wasn't enough to simply understand it. They wanted to run it.
