Speech synthesis is nothing new, but it has gotten better lately. It is about to get even better thanks to DeepMind’s WaveNet project. The Alphabet (or is it Google?) project uses neural networks to analyze audio data and it learns to speak by example. Unlike other text-to-speech systems, WaveNet creates sound one sample at a time and affords surprisingly human-sounding results.
Before you rush to comment “Not a hack!” you should know we are seeing projects pop up on GitHub that use the technology. For example, there is a concrete implementation by [ibab]. [Tomlepaine] has an optimized version. In addition to learning English, they successfully trained it for Mandarin and even to generate music. If you don’t want to build a system out yourself, the original paper has audio files (about midway down) comparing traditional parametric and concatenative voices with the WaveNet voices.
Another interesting project is the reverse path — teaching WaveNet to convert speech to text. Before you get too excited, though, you might want to note this quote from the read me file:
“We’ve trained this model on a single Titan X GPU during 30 hours until 20 epochs and the model stopped at 13.4 ctc loss. If you don’t have a Titan X GPU, reduce batch_size in the train.py file from 16 to 4.”
Last time we checked, you could get a Titan X for a little less than $2,000.
There is a multi-part lecture series on reinforced learning (the foundation for DeepMind). If you wanted to tackle a project yourself, that might be a good starting point (the first part appears below).