Tech artist [Alexander Reben] has shared some work in progress with us. It’s a neural network trained on the speech of various famous people (YouTube, embedded below). [Alexander]’s artistic goal is to capture the “soul” of a person’s voice, in much the same way that death masks of centuries past captured a face. Of course, listening to [Alexander]’s Rob Boss is no substitute for actually watching an old Bob Ross tape (indeed, it never even manages to say “happy little trees”), but it is certainly recognizable as the man himself, and now we can generate an infinite amount of his patter.
Behind the scenes, the networks are built on WaveNet. Basically, the algorithm splits up an audio stream into tiny chunks (individual samples) and tries to predict the next one based on the ones that came before. Some pre-editing of the training audio was necessary (removing the laughter and applause from the Colbert track, for instance), but otherwise the data was just plugged right in.
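To make the "predict the next bit of audio from what came before" idea concrete, here is a minimal sketch in plain Python. It is not [Alexander]’s code and it is not WaveNet: a simple lookup table of counts stands in for the neural network, and the training audio is a toy sine wave. The one detail borrowed from the WaveNet paper is the 8-bit mu-law quantization, which turns each sample into one of 256 levels so that prediction becomes a classification problem. All function names and parameters here are made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict

MU = 255  # 8-bit mu-law companding, giving 256 discrete levels


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer level in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(level, mu=MU):
    """Invert mu_law_encode, up to quantization error."""
    compressed = 2 * level / mu - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed
    )


def train(levels, context=2):
    """Count which level follows each run of `context` previous levels.

    Assumption: this table is a stand-in for the trained network.
    """
    table = defaultdict(Counter)
    for i in range(context, len(levels)):
        table[tuple(levels[i - context:i])][levels[i]] += 1
    return table


def generate(table, seed, n, rng):
    """Autoregressive loop: sample the next level, append it, repeat."""
    out = list(seed)
    context = len(seed)
    for _ in range(n):
        counts = table.get(tuple(out[-context:]))
        if not counts:  # context never seen in training: hold the last level
            out.append(out[-1])
            continue
        choices, weights = zip(*counts.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out


# Toy "training audio": one second of a 220 Hz sine sampled at 8 kHz.
audio = [0.8 * math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
levels = [mu_law_encode(s) for s in audio]
table = train(levels)

# Seed with the first two levels, then let the model run on its own output.
generated = generate(table, levels[:2], 1000, random.Random(0))
samples = [mu_law_decode(level) for level in generated]
```

The important part is the loop in `generate`: every new sample is drawn from a distribution conditioned on the model's own previous output, which is why the thing can keep talking forever. The real network conditions on a far longer history than two samples and learns its probabilities rather than counting them.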
The network seems to over-emphasize sibilants; we’ve never heard Barack Obama hiss quite like that in real life. Sibilants are essentially bursts of noise, and feeding noise into machines that are set up as pattern recognizers tends to push them to their limits. But in keeping with the name of this series of projects, the “unreasonable humanity of algorithms”, it does pretty well.
He’s also done the same thing with multiple speakers (also YouTube), in this case 110 people of different genders and accents. The variation across people leads to a smoother, more human sound, but the result is also not clearly anyone in particular. It’s meant to play continuously from a speaker inside a sculpture’s mouth. We’re a bit creeped out, in a good way.