Talking Neural Nets

Speech synthesis is nothing new, but it has gotten better lately. It is about to get even better thanks to DeepMind’s WaveNet project. The Alphabet (or is it Google?) project uses neural networks to analyze audio data, learning to speak by example. Unlike other text-to-speech systems, WaveNet creates sound one sample at a time, which produces surprisingly human-sounding results.
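Sample-at-a-time generation is easy to sketch in miniature. The toy loop below is plain Python with a hand-written "predictor" standing in for the trained network, so none of it is the real WaveNet architecture; it only shows the autoregressive shape of the idea, where each new sample is computed from the samples before it and then fed back in as input.

```python
# Toy autoregressive generation loop, WaveNet-like in shape only.
# predict_next() is a stand-in for a trained neural network; here it
# just returns a damped average of the most recent samples.

def predict_next(history):
    """Predict the next audio sample from a short window of history."""
    recent = history[-4:]          # a tiny "receptive field"
    return 0.9 * sum(recent) / len(recent)

def generate(seed, n_samples):
    """Generate audio one sample at a time, feeding each prediction
    back in as input for the next step."""
    samples = list(seed)
    for _ in range(n_samples):
        samples.append(predict_next(samples))
    return samples

# Start from four seed samples and generate eight more.
out = generate([1.0, 0.5, 0.25, 0.125], 8)
```

The real network replaces `predict_next` with a deep stack of dilated convolutions and predicts a probability distribution over the next sample rather than a single value, but the feed-the-output-back-in loop is the same.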

Before you rush to comment “Not a hack!” you should know we are seeing projects pop up on GitHub that use the technology. For example, there is a concrete implementation by [ibab], and [Tomlepaine] has an optimized version. In addition to learning English, they successfully trained it on Mandarin and even to generate music. If you don’t want to build a system yourself, the original paper has audio files (about midway down) comparing traditional parametric and concatenative voices with the WaveNet voices.

Another interesting project is the reverse path — teaching WaveNet to convert speech to text. Before you get too excited, though, you might want to note this quote from the README file:

“We’ve trained this model on a single Titan X GPU during 30 hours until 20 epochs and the model stopped at 13.4 ctc loss. If you don’t have a Titan X GPU, reduce batch_size in the file from 16 to 4.”

Last time we checked, you could get a Titan X for a little less than $2,000.

There is a multi-part lecture series on reinforcement learning (the foundation for DeepMind’s work). If you want to tackle a project yourself, that might be a good starting point (the first part appears below).

We’ve seen DeepMind playing Go before. We have to admit, though, we prefer the practical side of speech analysis over playing with stones. We are waiting to cover the first hacker project that uses this technology.

29 thoughts on “Talking Neural Nets”

      1. All previous speech synthesis has been sample based or really limited in tone/accent/inflection etc. This method means that, in a really simple example, I could feed the system all the speeches Barack Obama ever made and get a system that could replicate anything he can say, without him ever having said it in the first place. It also means that, going backwards, you could have speech recognition systems which are able to train themselves to recognize new and unfamiliar accents just by listening. Those two are just the trivial examples. There’s some deep AI stuff that’s honestly a little scary to think about. This is big, big stuff here.

        1. If someone can use this to make it possible to roll your own Echo or Google Home type of device that doesn’t ship all of your voice recordings off for a permanent record somewhere that would be awesome.

      1. See, this is a great reason NOT to do the whole IoT thing.

        When my RNN-driven AI assistant goes rogue, he’ll be limited to yelling at me from the Raspberry Pi he’s stuck in. Maybe make a couple angry tweets and clear my RSS feed.

        AI-controlled life-support, locks, or any heavy machinery is a bad, bad idea.

  1. This gives me weird feelings and not in a good way. For years neural nets have been in the category of magical solutions to problems that, if they worked, would change the way technology is done. Like nanobots. Stuck in a few categorizing applications like OCR, it never did exceed its own hype.

    Now, on the horizon, Google especially is pushing applications that intend to make programming obsolete. Understanding the issues and being able to write code to solve them are going to mean nothing in the wake of teaching a computer by example. And you are never going to know if the way it understands things is fundamentally right.

      1. I don’t think hardware has to change that much. Modern RNN implementations boil down to simple matrix math, which GPUs are already pretty well optimized for. RNN-specific coprocessors would be further optimized, but still pretty similar.

        Now, if you wanted to replace the digital simulation entirely, an FPGA-like analog computer would be incredible for this. A whole neural net with thousands of layers and millions of connections could be evaluated in a single processor step.

          1. For modern RNNs, I’m pretty new myself. I learned neural networks in the early 00’s from a book written in the 90’s. Back then you actually ran a function for every single neuron, which was slow and inefficient.

            Today, each layer of a network is expressed as a 2D matrix, and evaluating the whole RNN is just multiplying one matrix after another and interpreting the result. I think.

            The TensorFlow website has a good explanation and some simple tutorials. There are also a LOT of GitHub repositories and lists, such as Awesome-Machine-Learning.

            The bulk of development uses a mixture of Python and C++, and runs on GPUs. Also, almost none of it will work on Windows, which is a pain in the butt for me. All my Linux boxes are recycled and low-powered, with my 8-core CPU and GTX1080 workstation running Windows for games, CGI, and physics simulation.
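The layer-as-matrix-math point made above can be sketched in a few lines of NumPy. Everything here (the layer sizes, the random stand-in weights, the tanh nonlinearity) is made up for illustration; a real RNN would use trained weights, but the shape of the computation is the same: matrix multiplies instead of a function call per neuron.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4

# Random stand-in weights; a trained RNN would have learned these.
W_xh = rng.standard_normal((hidden_size, input_size))   # input -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)

def rnn_step(x, h):
    """One recurrent step: two matrix multiplies plus a nonlinearity,
    instead of evaluating a function for every single neuron."""
    return np.tanh(W_xh @ x + W_hh @ h + b)

# Run a 5-step input sequence through the layer, carrying the
# hidden state forward at each step.
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):
    h = rnn_step(x, h)
```

This is also why GPUs fit so well: the `@` products are exactly the dense matrix operations graphics hardware is optimized for.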
