Like a lot of people, we’ve been pretty interested in TensorFlow, the Google neural network software. If you want to experiment with using it for speech recognition, you’ll want to check out [Silicon Valley Data Science’s] GitHub repository which promises you a fast setup for a speech recognition demo. It even covers which items you need to install if you are using a CUDA GPU to accelerate processing or if you aren’t.
Another interesting thing is the use of TensorBoard to visualize the resulting neural network. This tool offers up a page in your browser that lets you visualize what’s really going on inside the neural network. There’s also speech data in the repository, so it is practically a one-stop shop for getting started. If you haven’t seen TensorBoard in action, you might enjoy the video from Google, below.
Tech artist [Alexander Reben] has shared some work in progress with us. It’s a neural network trained on various famous peoples’ speech (YouTube, embedded below). [Alexander]’s artistic goal is to capture the “soul” of a person’s voice, in much the same way as death masks of centuries past. Of course, listening to [Alexander]’s Rob Boss is no substitute for actually watching an old Bob Ross tape — indeed it never even manages to say “happy little trees” — but it is certainly recognizable as the man himself, and now we can generate an infinite amount of his patter.
Behind the scenes, he’s using WaveNet to train the networks. Basically, the algorithm splits up an audio stream into chunks and tries to predict the next chunk based on the previous state. Some pre-editing of the training audio data was necessary — removing the laughter and applause from the Colbert track for instance — but it was basically just plugged right in.
The network seems to over-emphasize sibilants; we’ve never heard Barack Obama hiss quite like that in real life. Feeding noise into machines that are set up as pattern-recognizers tends to push them to the limits. But in keeping with the name of this series of projects, the “unreasonable humanity of algorithms”, it does pretty well.
He’s also done the same thing with multiple speakers (also YouTube), in this case 110 people with different genders and accents. The variation across people leads to a smoother, more human sound, but it’s also not clearly anyone in particular. It’s meant to be continuously running out of a speaker inside a sculpture’s mouth. We’re a bit creeped out, in a good way.
We’ve covered some of [Alexander]’s work before, from the wince-inducing “Robot Bites Man” to the intellectual-conceptual “All Prior Art“. Keep it coming, [Alexander]!
[carykh] took a dive into neural networks, training a computer to replicate Baroque music. The results are as interesting as the process he used. Instead of feeding Shakespeare (for example) to a neural network and marveling at how Shakespeare-y the text output looks, the process converts Bach’s music into a text format and feeds that to the neural network. There is one character for each key on the piano, making for an 88 character alphabet used during the training. The neural net then runs wild and the results are turned back to audio to see (or hear as it were) how much the output sounds like Bach.
The video embedded below starts with a bit of a skit but hang in there because once you hit the 90 second mark things get interesting. Those lacking patience can just skip to the demo; hear original Bach followed by early results (4:14) and compare to the results of a full day of training (11:36) on Bach with some Mozart mixed in for variety. For a system completely ignorant of any bigger-picture concepts such as melody, the results are not only recognizable as music but can even be pleasant to listen to.
The concept of artificial intelligence dates back far before the advent of modern computers — even as far back as Greek mythology. Hephaestus, the Greek god of craftsmen and blacksmiths, was believed to have created automatons to work for him. Another mythological figure, Pygmalion, carved a statue of a beautiful woman from ivory, who he proceeded to fall in love with. Aphrodite then imbued the statue with life as a gift to Pygmalion, who then married the now living woman.
Throughout history, myths and legends of artificial beings that were given intelligence were common. These varied from having simple supernatural origins (such as the Greek myths), to more scientifically-reasoned methods as the idea of alchemy increased in popularity. In fiction, particularly science fiction, artificial intelligence became more and more common beginning in the 19th century.
But, it wasn’t until mathematics, philosophy, and the scientific method advanced enough in the 19th and 20th centuries that artificial intelligence was taken seriously as an actual possibility. It was during this time that mathematicians such as George Boole, Bertrand Russel, and Alfred North Whitehead began presenting theories formalizing logical reasoning. With the development of digital computers in the second half of the 20th century, these concepts were put into practice, and AI research began in earnest.
Over the last 50 years, interest in AI development has waxed and waned with public interest and the successes and failures of the industry. Predictions made by researchers in the field, and by science fiction visionaries, have often fallen short of reality. Generally, this can be chalked up to computing limitations. But, a deeper problem of the understanding of what intelligence actually is has been a source a tremendous debate.
Despite these setbacks, AI research and development has continued. Currently, this research is being conducted by technology corporations who see the economic potential in such advancements, and by academics working at universities around the world. Where does that research currently stand, and what might we expect to see in the future? To answer that, we’ll first need to attempt to define what exactly constitutes artificial intelligence.
[Bjørn Karmann]’s Objectifier is a device that lets you control domestic objects by allowing them to respond to unique actions or behaviour, using machine learning and computer vision. The Objectifier can turn on a table lamp when you open a book, and turn it off when you close the book. Switch on the coffee maker when you place the mug next to the pot, and switch it off when the mug is removed. Turn on the belt sander when you put on the safety glasses, and stop it when you remove the glasses. Charge the phone when you put a banana in front of it, and stop charging it when you place an apple in front of it. You get the drift — the possibilities are endless. Hopefully, sometime in the (near) future, we will be able to interact with inanimate objects in this fashion. We can get them to learn from our actions rather than us learning how to program them.
The device uses computer vision and a neural network to learn complex behaviours associated with your trigger commands. A training mode, using a phone app, allows you to train it for the On and Off actions. Some actions require more human effort in training it — such as detecting an open and closed book — but eventually, the neural network does a fairly good job.
The current version is the sixth prototype in the series and [Bjørn] has put in quite a lot of work refining the project at each stage. In its latest avatar, the device hardware consists of a Pi Zero, a Raspberry-Pi camera module, an SMPS power brick, a relay block to switch the output, a 230 V plug for input power and a 230 V socket outlet for the final output. All the parts are put together rather neatly using acrylic laser cut support pieces, and then further enclosed in a nice wooden enclosure.
On the software side, all of the machine learning part is taken care of using “Wekinator” — a free, open source software that allows building musical instruments, gestural game controllers, computer vision or computer listening systems using machine learning. The computer vision is handled via Processing. All the code is wrapped using openframeworks, with ml4A providing apps for working with machine learning.
All of the above is what we could deduce looking at the pictures and information on his blog post. There isn’t much detail about the hardware, but the pictures are enough to tell us all. The software isn’t made available, but maybe this could spur some of you hackers into action to build another version of the Objectifier. Check out the video after the break, showing humans teaching the Objectifier its tricks.
Deep Learning — the use of neural networks with modern techniques to tackle problems ranging from computer vision to speech recognition and synthesis — is certainly a current buzzword. However, at the core is a set of powerful methods for organizing self-learning systems. Multi-layer neural networks aren’t new, but there is a resurgence of interest primarily due to the availability of massively parallel computation platforms disguised as video cards.
The problem is getting started in something like this. There are plenty of scholarly papers that can be hard to wade through. Or you can grab some code from GitHub and try to puzzle it out.
Speech synthesis is nothing new, but it has gotten better lately. It is about to get even better thanks to DeepMind’s WaveNet project. The Alphabet (or is it Google?) project uses neural networks to analyze audio data and it learns to speak by example. Unlike other text-to-speech systems, WaveNet creates sound one sample at a time and affords surprisingly human-sounding results.
Before you rush to comment “Not a hack!” you should know we are seeing projects pop up on GitHub that use the technology. For example, there is a concrete implementation by [ibab]. [Tomlepaine] has an optimized version. In addition to learning English, they successfully trained it for Mandarin and even to generate music. If you don’t want to build a system out yourself, the original paper has audio files (about midway down) comparing traditional parametric and concatenative voices with the WaveNet voices.
Another interesting project is the reverse path — teaching WaveNet to convert speech to text. Before you get too excited, though, you might want to note this quote from the read me file:
“We’ve trained this model on a single Titan X GPU during 30 hours until 20 epochs and the model stopped at 13.4 ctc loss. If you don’t have a Titan X GPU, reduce batch_size in the train.py file from 16 to 4.”
Last time we checked, you could get a Titan X for a little less than $2,000.
There is a multi-part lecture series on reinforced learning (the foundation for DeepMind). If you wanted to tackle a project yourself, that might be a good starting point (the first part appears below).