EMO: Alibaba’s Diffusion Model-Based Talking Portrait Generator

Alibaba’s EMO (or Emote Portrait Alive) framework is a recent entry in a series of attempts to generate a talking head from existing audio (spoken word or vocals) and a single reference portrait image. At its core is a diffusion model trained on 250 hours of video footage and over 150 million images. Unlike previous attempts, it adds what the researchers call a speed controller and a face region controller, which serve to stabilize the generated frames, along with an additional module that stops the diffusion model from outputting frames that drift too far from the reference image used as input.
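
For those curious about the general shape of such a system, here is a minimal sketch of audio-conditioned frame denoising: a toy noise-prediction network takes a noisy frame latent plus an audio feature vector and a reference-identity embedding, and a DDPM-style loop walks the latent from pure noise toward a clean frame. This is not EMO’s actual architecture or code; every module, dimension, and name below is a stand-in for illustration.

```python
# Hypothetical sketch of audio-conditioned frame denoising (not EMO's code).
import torch
import torch.nn as nn

class ToyFrameDenoiser(nn.Module):
    def __init__(self, latent_dim=64, audio_dim=32, id_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + id_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),  # predicts the noise in the frame latent
        )

    def forward(self, z_t, t, audio_feat, id_embed):
        t_col = t.float().unsqueeze(-1) / 1000.0      # normalized timestep
        x = torch.cat([z_t, audio_feat, id_embed, t_col], dim=-1)
        return self.net(x)

# DDPM-style sampling loop over a short linear noise schedule.
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

model = ToyFrameDenoiser()
audio_feat = torch.randn(1, 32)   # stand-in for per-frame audio features
id_embed = torch.randn(1, 32)     # stand-in for the reference-portrait identity
z = torch.randn(1, 64)            # start from pure noise

for t in reversed(range(T)):
    t_batch = torch.full((1,), t)
    eps = model(z, t_batch, audio_feat, id_embed)
    a, ab = alphas[t], alpha_bars[t]
    z = (z - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
    if t > 0:
        z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
# z now plays the role of a denoised latent for one video frame.
```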

In the accompanying paper, [Linrui Tian] and colleagues show a number of comparisons between EMO and other frameworks, claiming significant improvements over them. The researchers also provide a number of examples of talking and singing heads generated with the framework, which gives some idea of what are probably the ‘best case’ outputs. In some examples, like [Leslie Cheung Kwok Wing] singing ‘Unconditional’, big glitches are obvious and there’s a definite mismatch between the vocal track and the facial motions. Despite this, the results are quite impressive, especially the fairly realistic movement of the head, including blinking of the eyes.

Meanwhile, some seem extremely impressed, such as [Matthew Berman] in a recent video on EMO, in which he states that Alibaba releasing this framework to the public might be ‘too dangerous’. The level-headed folks over at PetaPixel, however, also note the obvious visual imperfections that are a dead giveaway for this kind of generative technology. Much like other diffusion model-based generators, EMO seems to still be very much stuck in the uncanny valley, with no clear path to becoming a real human yet.

Continue reading “EMO: Alibaba’s Diffusion Model-Based Talking Portrait Generator”

Feast Your Eyes On These AI-Generated Sounds

The radio hackers in the audience will be familiar with a spectrogram display, but for the uninitiated, it’s basically a visual representation of how the strength of a range of frequencies changes over time. Usually such a display is used to identify a clear transmission in a sea of noise, but with the right software, it’s possible to generate a signal that shows up as text or an image when viewed as a spectrogram. Musicians even occasionally use the technique to hide images in their songs. Unfortunately, the audio side of such a trick generally sounds like gibberish to human ears.
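
If you’ve never generated one yourself, a spectrogram is only a few lines of Python away. The snippet below (our own example, not anything from the research) sweeps a chirp from 200 Hz to 3 kHz and plots its spectrogram with SciPy and Matplotlib; the rising diagonal line you’ll see is exactly the kind of structure that can be shaped into text or pictures.

```python
# Plot the spectrogram of a simple frequency sweep.
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

fs = 8000                                        # sample rate in Hz
t = np.arange(0, 2.0, 1 / fs)                    # two seconds of samples
x = signal.chirp(t, f0=200, f1=3000, t1=2.0)     # sweep from 200 Hz to 3 kHz

f, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256)
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.title("Spectrogram of a frequency sweep")
plt.show()
```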

Or at least, it used to. Students from the University of Michigan have found a way to use diffusion models to not only create a spectrogram image for a given prompt, but to do it with audio that actually makes sense given what the image shows. So for example if you asked for a spectrogram of a race car, you might get an audio track that sounds like a revving engine.
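
The underlying image-to-audio trick is easy to play with even without the diffusion model: treat a grayscale image as an STFT magnitude and invert it to a waveform with the Griffin-Lim algorithm. The sketch below uses librosa and a synthetic placeholder “image” rather than anything from the students’ pipeline; horizontal bands in the image come out as steady tones in the audio.

```python
# Turn an "image" (here a synthetic placeholder) into audio by treating it as
# an STFT magnitude and inverting with Griffin-Lim. Sketch only, not the
# students' actual pipeline.
import numpy as np
import librosa
import soundfile as sf

n_fft, hop = 1024, 256
freq_bins = n_fft // 2 + 1             # rows the magnitude spectrogram needs
frames = 400

# Placeholder image: a few horizontal bands, which become steady tones.
img = np.zeros((freq_bins, frames))
img[50:60, :] = 1.0
img[120:130, 100:300] = 0.8

audio = librosa.griffinlim(img, n_iter=32, hop_length=hop, win_length=n_fft)
sf.write("spectrogram_image.wav", audio, 22050)
```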

Continue reading “Feast Your Eyes On These AI-Generated Sounds”

ForceGen: Using A Diffusion Model To Help Design Novel Proteins

Although proteins are composed of only a small number of distinct amino acids, this deceptive simplicity quickly vanishes when you consider the many possible sequences a protein can have, not to mention the many ways in which a single 1D protein sequence can fold into a 3D protein shape with a specific functionality. Although natural evolution has done much of the legwork here already, figuring out new sequences and their functionality is a daunting task to which deep learning algorithms are increasingly being applied. As [Bo Ni] and colleagues report in a research article in Science Advances, the hardest challenge is designing a protein sequence starting from the desired functionality, and they demonstrate a way to use a generative model to speed up this process.

They set out to design proteins with specific mechanical properties, and used the known unfolding characteristics of various protein sequences to train a diffusion model, making this approach more akin to the technology behind image generators like DALL-E than to LLMs. Using the trained diffusion model, it was then possible to generate likely sequences whose properties could then be checked in simulation, with favorable results.
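
To give a feel for what ‘conditioning a diffusion model on mechanical properties’ means in practice, here is a minimal, hypothetical sketch of the standard noise-prediction training objective: a small network learns to predict the noise added to an embedded protein sequence, given a target mechanical-response vector as conditioning. None of the shapes, names, or data below come from the ForceGen paper; they are stand-ins for the general technique.

```python
# Hypothetical sketch of conditional diffusion training for sequences.
import torch
import torch.nn as nn

seq_len, aa_dim, cond_dim = 64, 20, 16   # residues, amino-acid channels, property vector

class CondDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seq_len * aa_dim + cond_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, seq_len * aa_dim),
        )

    def forward(self, x_t, t, cond):
        flat = x_t.flatten(1)
        t_col = t.float().unsqueeze(-1) / 1000.0
        out = self.net(torch.cat([flat, cond, t_col], dim=-1))
        return out.view_as(x_t)

model = CondDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)

# One training step on a random stand-in batch of (sequence, property) pairs.
x0 = torch.randn(8, seq_len, aa_dim)     # embedded sequences (stand-in data)
cond = torch.randn(8, cond_dim)          # target mechanical response (stand-in)
t = torch.randint(0, 1000, (8,))
noise = torch.randn_like(x0)
ab = alpha_bars[t].view(-1, 1, 1)
x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * noise   # forward diffusion
loss = nn.functional.mse_loss(model(x_t, t, cond), noise)
loss.backward()
opt.step()
```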

As an aid for working through large data sets, such a diffusion model could be very useful in fields well beyond protein synthesis, automating tedious tasks and conceivably speeding up discoveries.

GETMusic Uses Machine Learning To Generate Music, Understands Tracks

Music generation guided by machine learning can make for great projects, but there’s usually not much apparent control over the results. The system makes what it makes, and it’s an achievement if the results are not obvious cacophony. That’s all different with GETMusic, which allows for a much more involved approach because it understands and is able to create music track by track. Among other things, this means one can generate a basic rhythm and melody first, then add additional elements on top while leaving the existing ones unchanged.

GETMusic can make music from scratch or guided by examples, and under the hood it uses a diffusion-based approach similar to the method behind AI image generators like Stable Diffusion. We’ve previously covered how Stable Diffusion works; here the same basic principles are used to guide the model from random noise to useful tracks of music instead of images.
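
The track-wise editing described above maps naturally onto a masked-generation (inpainting-style) sampling loop: at each denoising step, the tracks you want to keep are re-imposed at the matching noise level, so the model only has to fill in the masked tracks. The sketch below is a generic Gaussian-diffusion illustration of that idea with a dummy denoiser, not GETMusic’s actual model or data format.

```python
# Generic masked-generation sketch: keep known tracks fixed while denoising.
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def toy_denoiser(x, t):
    # Stand-in for a trained noise-prediction network over a (tracks, steps) grid.
    return 0.1 * x

n_tracks, n_steps = 4, 128
known = torch.zeros(n_tracks, n_steps)                     # existing material
known[0] = torch.sin(torch.linspace(0, 12.0, n_steps))     # pretend melody track
keep = torch.zeros(n_tracks, n_steps, dtype=torch.bool)
keep[0] = True                                             # only track 0 is fixed

x = torch.randn(n_tracks, n_steps)
for t in reversed(range(T)):
    eps = toy_denoiser(x, t)
    a, ab = alphas[t], alpha_bars[t]
    x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        # Re-noise the known tracks to the current level and overwrite them,
        # so the fixed parts stay consistent with the original material.
        noisy_known = torch.sqrt(alpha_bars[t - 1]) * known + \
                      torch.sqrt(1 - alpha_bars[t - 1]) * torch.randn_like(known)
        x = torch.where(keep, noisy_known, x)
x = torch.where(keep, known, x)   # final output keeps the original tracks exactly
```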

Just a few years ago we saw a neural network trained to generate Bach, and while it was capable of moments of brilliance, it didn’t produce uniformly listenable output. GETMusic is on an entirely different level. The model and code are available online, and there is a research paper to accompany it.

You can watch a video putting it through its paces just below the page break, and there are more videos on the project summary page.

Continue reading “GETMusic Uses Machine Learning To Generate Music, Understands Tracks”