Alibaba’s EMO (Emote Portrait Alive) framework is a recent entry in a long series of attempts to generate a talking head from existing audio (spoken word or vocals) and a reference portrait image. At its core is a diffusion model trained on 250 hours of video footage and over 150 million images. Unlike previous attempts, it adds what the researchers call a speed controller and a face region controller, which serve to stabilize the generated frames, along with an additional module that keeps the diffusion model from outputting frames that drift too far from the reference image used as input.
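For a rough mental model of how such a pipeline hangs together, here is a minimal, purely illustrative sketch in Python. It is not Alibaba’s code and every name in it is invented; the toy ‘networks’ are just random projections. It only shows the general shape of a diffusion-based approach: the reference portrait and the audio are encoded into conditioning signals, and each frame is produced by iteratively denoising from noise while extra controls (standing in for the speed and face-region controllers) constrain how far each step can move.

```python
# Illustrative sketch only: invented names, toy "networks" (random projections),
# and no real model weights. It mirrors the general shape of a diffusion-based
# talking-head pipeline: encode the reference image and the audio, then
# iteratively denoise each frame while conditioning on those signals plus
# stability controls (speed and face-region masks, in EMO's terminology).
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 64

def encode_reference(image: np.ndarray) -> np.ndarray:
    """Toy identity encoder: project the flattened portrait into latent space."""
    proj = rng.standard_normal((image.size, LATENT_DIM)) / np.sqrt(image.size)
    return image.ravel() @ proj

def encode_audio(audio: np.ndarray, num_frames: int) -> np.ndarray:
    """Toy audio encoder: one conditioning vector per output video frame."""
    chunks = np.array_split(audio, num_frames)
    proj = rng.standard_normal((1, LATENT_DIM))
    return np.stack([np.array([c.mean()]) @ proj for c in chunks])

def denoise_step(latent, identity, audio_vec, speed_ctrl, region_mask, t):
    """Toy denoiser: nudge the noisy latent toward the conditioning signals."""
    target = 0.7 * identity + 0.3 * audio_vec
    step = (target - latent) * (1.0 / (t + 1))
    return latent + speed_ctrl * region_mask * step

def generate_frames(image, audio, num_frames=8, steps=20,
                    speed_ctrl=0.8, region_mask=None):
    identity = encode_reference(image)
    audio_cond = encode_audio(audio, num_frames)
    if region_mask is None:
        region_mask = np.ones(LATENT_DIM)   # stand-in for a face-region mask
    frames = []
    for f in range(num_frames):
        latent = rng.standard_normal(LATENT_DIM)      # start from pure noise
        for t in reversed(range(steps)):              # iterative denoising
            latent = denoise_step(latent, identity, audio_cond[f],
                                  speed_ctrl, region_mask, t)
        frames.append(latent)   # a real model would decode this to pixels
    return np.stack(frames)

if __name__ == "__main__":
    portrait = rng.standard_normal((32, 32))   # stand-in reference portrait
    speech = rng.standard_normal(16000)        # stand-in second of audio
    print(generate_frames(portrait, speech).shape)   # (8, 64)
```

In the real framework the denoiser is a large trained network and the output latents are decoded back into video frames; the sketch only illustrates where the identity, audio, and stability controls enter the loop.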
In the related paper, [Linrui Tian] and colleagues show a number of comparisons between EMO and other frameworks, claiming significant improvements over them. The researchers also provide a number of examples of talking and singing heads generated with the framework, which give some idea of what are probably the ‘best case’ outputs. In some of them, like [Leslie Cheung Kwok Wing] singing ‘Unconditional’, big glitches are obvious and there’s a definite mismatch between the vocal track and the facial motions. Despite this, it’s quite impressive, especially the fairly realistic movement of the head, right down to the blinking of the eyes.
Meanwhile, some seem extremely impressed, such as [Matthew Berman] in a recent video on EMO, in which he states that Alibaba releasing this framework to the public might be ‘too dangerous’. The more level-headed folks over at PetaPixel, however, note the obvious visual imperfections that are a dead giveaway for this kind of generative technology. Much like other diffusion-model-based generators, EMO still seems to be very much stuck in the uncanny valley, with no clear path to becoming a real human yet.
Thanks to [Daniel Starr] for the tip.
I just think where this sort of technology could really be used is not in translating to flat images, but in driving a 3D mesh. The ‘lip-syncing’ of 3D characters in games has been simply utterly *awful* for so long now; something like this would do a much better job.
Start from 1999-era animated textures?
That’s pretty amazing. And it can only get better.
Aaaannd, this will instantly be used to make videos of people (i.e., politicians) saying things they never said in real life.
Who cares? That already happens
I can think of some politicians I’d like to hear wishing everyone a Happy Pride Month. And maybe expressing their desire for women’s rights? Perhaps even acknowledging the institutional racism in the country and supporting efforts to eliminate it?
So, perhaps not entirely a bad thing? (I can dream…Nixon: “Yeah, I am a crook, after all”)
More preferably, it will be used by politicians to deny whatever they said in real life was real :D
NVidia CEO: “Everybody in the world is now a programmer…”
We’re doomed !
I’m going to relocate to an area without electricity and grow my own vegetables.
Yep…. Where are the checks and balances…. Hate to be a politician who has to ‘defend’ against this stuff all the time and still get their job done. Or anyone with an agenda can ‘speak for you’ and ruin your day…. “See, you said it, we have a video of it.” Yowsa. Not a good place to be. Exciting? Not so much. Who really wants it? I bet there are some countries that will love this tech for holding control over their people…. And it’s only going to get ‘better’ and more realistic.
And remember, anything that is said or uploaded ‘stays’ somewhere on the internet. So even if a video (or whatever) is found to be bogus…. it is still out there for someone to exploit and use…. No tin foil hat here. It just is.
This isn’t exactly new. Those with appreciable resources have been able to engage in this sort of trickery for decades. The only thing that has changed is a lower barrier to entry. In my view, this is a positive because it makes us all acutely aware that media we are exposed to might be forged.
Think of Photoshop: in the days of chemical photography, photo manipulation was entirely possible, but people were more likely to fall for a hoax because subtle manipulation was uncommon. Now, anything vaguely non-credible will be accused of being a forgery because anyone with a computer can do image manipulation.
I’m not sure whether that is the exact quote he said (or may have said). Though note, what he definitely did *not* say is “Everybody in the world is now an IC designer!” ;)
Someone should train it not to make the eyes stare… the pupils never move, and it’s quite obvious in the Mona Lisa example, where she stares directly at the camera like a 1986 local yokel used car salesman.
As with many generated videos, you can still spot where it is failing on background details because they aren’t consistent; look at the details around the edge of the head and you’ll see them subtly change.
I’m sure this will be fixed eventually, like the inability to count fingers and teeth in early image generators
These tools are useful for everyone. They even let you find harmful patterns in comments. Here is what ChatGPT 4-o says about the content of your comment:
– Conspiracy Thinking
– Hostile Intent
– Us vs. Them Mentality
– Extreme Language
– Misinterpretation and Generalization
– References to Specific Individuals
These points suggest that the writer may be experiencing paranoid thoughts, characterized by distrust, suspicion, and a belief in hidden threats.
I’m sure Jensen will hire all these “i dont know how to code” monkeys. :)
It’s impressive, and quite honestly, the quality is already usable.
Yes, when you know you are looking at an AI-generated video at its best quality, you will see background glitches or whatever.
But now change the context and the platform.
Take a video re-uploaded to social media and watched on a smartphone, and the glitches are gone (well, they pass as encoding artefacts). As for the context, well, it depends on what the sender is willing to tell you. And if it’s a re-upload, fat chance you even know it’s AI-generated content.
I’m not judging the usage of this tech here (whether it’s good or bad), I’m just pointing out that all the defects talked about earlier are moot in daily usage.
This tech is already usable in daily media. Be careful with what you see.
“Obvious visual imperfections”… Obvious to who?
Half the US can’t see what is right in front of them, or disbelieves it when they see it. Do you really think some visual imperfection will enter into their thought processes? Or enter in a linear manner? I can see it now: “visual imperfections were added to the recording to make us think that the video was AI-generated….”
I wonder if the algorithm would work for animals? I’d love to see a singing ‘possum or a tardigrade.
I wonder if the algorithm would work with the other end of the human digestive tract.