How The Image-Generating AI Of Stable Diffusion Works

October 24, 2022

[Jay Alammar] has put up an illustrated guide to how Stable Diffusion works, and the principles in it are perfectly applicable to understanding how similar systems like OpenAI’s Dall-E or Google’s Imagen work under the hood as well. These systems are probably best known for their amazing ability to turn text prompts (e.g. “paradise cosmic beach”) into a matching image. Sometimes. Well, usually, anyway.

‘System’ is an apt term, because Stable Diffusion (and similar systems) are actually made up of many separate components working together to make the magic happen. [Jay]’s illustrated guide really shines here, because it starts at a very high level with only three components (each with their own neural network) and drills down as needed to explain what’s going on at a deeper level, and how it fits into the whole.

Spot any similar shapes and contours between the image and the noise that preceded it? That’s because the image is a result of removing noise from a random visual mess, not building it up from scratch like a human artist would do.

It may surprise some to discover that the image creation part doesn’t work the way a human does. That is to say, it doesn’t begin with a blank canvas and build an image bit by bit from the ground up. It begins with a seed: a bunch of random noise. Noise gets subtracted in a series of steps that leave the result looking less like noise and more like an aesthetically pleasing and (ideally) coherent image. Combine that with the ability to guide noise removal in a way that favors conforming to a text prompt, and one has the bones of a text-to-image generator. There’s a lot more to it of course, and [Jay] goes into considerable detail for those who are interested.
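
The subtract-noise-in-steps idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not Stable Diffusion's code: `toy_noise_predictor` is a hypothetical stand-in for the trained network and cheats by already knowing the target, the eight-number "image" is arbitrary, and so is the step count. It only shows the shape of the loop: seeded noise in, a little predicted noise removed per step, image out.

```python
import random

def make_noise(seed, size):
    """Seeded Gaussian noise: the same seed always gives the same starting point."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def toy_noise_predictor(image, target):
    """Hypothetical stand-in for the trained network.

    A real model estimates the noise from the noisy image (and the prompt).
    This toy cheats: it knows the target and reports the difference.
    """
    return [x - t for x, t in zip(image, target)]

def denoise(seed, target, steps=20):
    """Start from pure noise and subtract a fraction of the predicted noise per step."""
    image = make_noise(seed, len(target))
    for step in range(steps):
        predicted = toy_noise_predictor(image, target)
        fraction = 1.0 / (steps - step)  # reaches 1.0 on the final step
        image = [x - fraction * n for x, n in zip(image, predicted)]
    return image

def distance(a, b):
    """Euclidean distance between two equally sized lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

target = [0.0, 0.2, 0.4, 0.6, 0.8, 0.6, 0.4, 0.2]  # made-up "image"
start = make_noise(42, len(target))
result = denoise(42, target)
print(distance(start, target), distance(result, target))
```

Running it twice with the same seed gives the same result, which is why sharing a seed along with a prompt lets someone else reproduce an image.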

If you’re unfamiliar with Stable Diffusion or art-creating AI in general, it’s one of those fields that is changing so fast that it sometimes feels impossible to keep up. Luckily, our own Matthew Carlson explains all about what it is, and why it matters.

Stable Diffusion can be run locally. There is a fantastic open-source web UI, so there’s no better time to get up to speed and start experimenting!

40 thoughts on “How The Image-Generating AI Of Stable Diffusion Works”

Dab says:

October 24, 2022 at 4:36 am

As exciting as new technology is, this presents dangers which need to be addressed. It can be used to produce abusive and even illegal images, as well as those which may prove politically subversive. It is reasonably acceptable for these to be run online with proper filters, but they shouldn’t be made publicly available for modification. The consequences of individuals having unrestricted access to the technology could be disastrous for individuals and democracy.

1. Artenz says:
  
  October 24, 2022 at 11:16 am
  
  Too late. The source code is already available, and it can be re-invented by thousands of people if necessary. We’ll have to learn to deal with it.
  
2. J. Samson says:
  
  October 24, 2022 at 11:47 am
  
  Which images would be illegal, that wouldn’t be if simply drawn instead of “generated”?
  
3. Charles Lamb says:
  
  October 24, 2022 at 1:02 pm
  
  That can be done with a piece of paper and pencil as well.
  
  1. Raffaello Cellucci says:
    
    October 24, 2022 at 9:25 pm
    
    I was thinking the same, however it’s a lot more inconvenient to make millions of photocopies of your drawing and mail them out. The internet makes it conveniently easy to “share” your stuff, good or bad…
    
    1. bcdesigner says:
      
      October 25, 2022 at 8:17 am
      
      Why would anyone do that? They would just make it on the computer where they have a digital copy.
      
    2. Muinainen says:
      
      October 28, 2022 at 1:47 am
      
      Yeah. Thankfully we haven’t yet invented a way to scan your physical drawing into digital form and share it online.
      
      1. Emmanuel says:
        
        October 29, 2022 at 9:00 am
        
        Yeah right?
        More thankfully
        
4. RBMK says:
  
  October 24, 2022 at 2:31 pm
  
  It’s just automatic Photoshop-like image manipulation. Anyone can make any image, just manually. AI only speeds things up. I would rather teach people to tell reality from crap than reach for the banhammer and inflate silly laws. Maybe people want bans and laws because they fear the freedom to make anything you can imagine.
  
  1. Fabrice says:
    
    October 25, 2022 at 7:48 am
    
    It has nothing to do with PS-like manipulation. The AI creates everything from its database. You just give a general idea of what you want and the AI pulls what it thinks looks like it. It doesn’t matter how specific you get, the AI will pull imagery from its databank in the hope it fits the general idea of what you want. In PS image manipulation you have got to come up with the imagery you want to see manipulated, and it’s a lot trickier and creative already than what most people would think, even before getting into the compositing stuff.
    
5. Francois Otis says:
  
  October 24, 2022 at 7:58 pm
  
  Politically subversive images…?!! Just curious: where are you writing from?
  
6. Elliot Williams says:
  
  October 25, 2022 at 12:21 am
  
  This reads as obvious troll to me. No?
  
7. Mohamed Ahmed Abd El Magid says:
  
  October 25, 2022 at 4:37 am
  
  Spare us that nonsense. Corporate shills really need to stop this shameless hollow posturing and vapid virtue-signaling, because multi-billion corporations are no more or less ethical than individuals (as evidenced by the almost weekly cases that big pharma lose and have to pay millions in compensation to their victims or their families).
  
  More importantly literally nothing that could be done with SD couldn’t have been done before using photobashing with any decent photo editing software, heck…tyrannical authoritarian regimes like Soviet Russia have been editing photos and removing disappeared people from them for decades now, it’s nothing new.
  
  But the fact that SD is open-source, and that anybody can check the code and understand how it works, makes it far more likely that countermeasures will appear to recognize AI-generated images and even regularly falsified images.
  
8. combinatorylogic says:
  
  October 27, 2022 at 2:24 pm
  
  What about individuals having access to canvas, brush and paint?
  
9. JB says:
  
  January 26, 2023 at 12:28 am
  
  Weird that people never get all alarmist about pencils being used to create “illegal” or “politically subversive” imagery.
  
Frankel says:

October 24, 2022 at 5:33 am

I’m by no means an expert on deep/machine learning or AI. But some points in the article rub me the wrong way. Like the description of the gif. Why do you call it “a random visual mess”? Isn’t it a training model? Isn’t it really a probability of what thousands of images from the learning set look like?

I’d like to point to this excellent article:
https://hackaday.com/2019/07/16/neural-network-in-glass-requires-no-power-recognizes-numbers/
“AI” really is just Bayesian statistics. There sadly is no intelligence. The linked article shows this nicely: if you run the whole process in reverse, a number-detecting “AI” just maps pixel probabilities and locations to a certain path the light can take through the substrate. In hardware, “AI” illustrates the smoke and mirrors. It’s just a statistical black box.

And now I make a rush for the door not to be torn up by AI researchers.

1. AndrzejKKZ says:
  
  October 24, 2022 at 6:23 am
  
  Hypes come and go, but 26 years on I still earn my bread by programming microcontrollers in assembly and C. And most surprisingly, I only had to change jobs 3 times during my career. I guess I will retire and 8-bit PICs will still be a thing, keeping the world running when everything else fails.
  
2. chango says:
  
  October 24, 2022 at 6:24 am
  
  > “AI” really is just bayesian statistics. There sadly is no intelligence.
  
  Which is why researchers have tried and failed to rebrand the field as “machine learning”. “ML” doesn’t get articles read and research funded as effectively as “AI”. For most people, “AI” is seen as futuristic via popular sci-fi tropes, and “ML” is yet another bit of jargon.
  
  1. TG says:
    
    October 24, 2022 at 9:44 am
    
    Define where your consciousness or intelligence comes from without guesses or pseudoscience and prove it’s not Bayesian statistics. The spirit still haunts us
    
    1. Eric Holloway says:
      
      October 24, 2022 at 4:47 pm
      
      The irony is Bayes invented his theory as a response to Laplace’s atheism. The intelligent design of his day. Now repurposed from its original packaging to try and prove the opposite. Just like Mendel’s anti-evolution theory of genetics was repackaged as part of Neodarwinism. If you can’t refute them, reuse them :D
      
3. UnderSampled says:
  
  October 24, 2022 at 6:44 am
  
  The “random visual mess” is the random noise they are applying the trained model on. It is truly just noise, which is then run through a semantic-aware denoiser.
  
  1. Frankel says:
    
    October 24, 2022 at 8:33 am
    
    https://en.wikipedia.org/wiki/File:Visual_crypto_animation_demo.gif
    As “random” as this example. But you can tune it to have a message, prior construction.
    
    1. UnderSampled says:
      
      October 24, 2022 at 12:45 pm
      
      https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/5aa9525046b7520d39fe8fc8c5c6cc10ab4d5fdb/modules/processing.py#L243
      
      ```
      noise = devices.randn(seed, noise_shape)
      ```
      
      The noise is random.
      
      If you want, you can select the seed yourself. If you provide your own non-noise input instead, that’s how image-to-image works.
      
  2. Frankel says:
    
    October 24, 2022 at 8:34 am
    
    en[.]wikipedia[.]org/wiki/File:Visual_crypto_animation_demo.gif
    As random as this example, but you can tune it prior construction.
    
4. tilk says:
  
  October 24, 2022 at 7:49 am
  
  I’m not claiming that neural nets are intelligent, but we actually don’t know what “intelligence” is, and therefore it’s hard to definitively claim they are not. Maybe our brains are just Bayesian statistics, too?
  
  1. Frankel says:
    
    October 24, 2022 at 8:52 am
    
    You are absolutely right on this. Though I struggle to define human intelligence. In terms of what makes human intelligence special over animal intelligence, I have an anecdote: humans have the capability to build tools for the manufacturing of tools. Like a machine that can cut out a wrench.
    
    Can an AI design a tool (like a drug)? Yes, but not in an intelligent way like a human. Usually they do random walks, check electron potentials and molecule interactions on proteins. They can design billions of candidates and output the best 100. A human has to actually give it thought before synthesizing something that would cost the company money. For example, where to slap a chlorine or fluorine on a steroid to increase its binding strength.
    
    I still think a computer AI is just a fool with very fast fingers. Having recently read a lot of Asimov and Lem, I shudder to think what a true AI could do for a government. The current ones are already effective tools of oppression.
    
    1. Christian says:
      
      October 24, 2022 at 9:02 am
      
      I think human intelligence simply boils down to the conscious choice to determine what data is paid attention to.
      
      1. TG says:
        
        October 24, 2022 at 9:49 am
        
        Conscious choice, huh? Didn’t they do some brain scans a while back and find out that for any given action, the hind brain and motor cortex have already been set in motion a few seconds before the forebrain finalizes its “decision” and the ego takes credit for it?
        We’re like Maggie in the opening sequence from The Simpsons, pretending to drive the car with a fake toy wheel while Marge sits next to us actually controlling things.
        
  2. Dude says:
    
    October 24, 2022 at 9:20 am
    
    > we actually don’t know what “intelligence” is, and therefore it’s hard to definitively claim they are not.
    
    On the contrary. You can always say what something isn’t even when you don’t know what it is. That is simply narrowing your definition of “it” – to the point that it may stop existing altogether, but that simply reveals that your concept of it was not coherent in the first place and no such thing can exist.
    
    1. Dude says:
      
      October 24, 2022 at 9:43 am
      
      The trick is that you need to know what something is before you can definitely say it exists. If you search the entire universe and account for everything that is “not X”, and you still have something left over, that doesn’t mean what you found is indeed X. It might be something else, because you didn’t define what you were looking for.
      
      If you don’t know that, then you can’t say something might be it – it defaults to not existing until you come up with a definition that allows you to identify it. A good starting place would be to try and identify why something is definitely not it, and forming your definition that way.
      
      For example, a classical deterministic computer program is not intelligent, because it has no power of independent action and only follows what the environment (the programmer, the input) made it to do. Therefore, what is intelligent must have “causal powers” to do something that is not pre-determined in a classical way.
      
      1. Artenz says:
        
        October 24, 2022 at 11:20 am
        
        > a classical deterministic computer program is not intelligent
        
        So a system that gives you the same optimal answer to a question is not intelligent? It can only be intelligent if it also gives random sub-optimal answers?
        
    2. TG says:
      
      October 24, 2022 at 9:47 am
      
      Out: “AI isn’t intelligent, it’s just parroting.”
      In: “Humans aren’t intelligent, they’re just parroting.”
      
5. Ben says:
  
  October 24, 2022 at 9:26 am
  
  As I understand it, it’s a random mess that’s compared to the model and asked if it looks like what you’ve requested. It then iterates until it matches with a high degree of confidence.
  
  Humans do something similar when they see an animal in a cloud (random noise, they see a shape) or a face in a slice of toast etc.
  If you asked that human to draw what it looks like in their mind, not the cloud itself, it would then be refined until it matches, e.g. a lion roaring.
  
  It can likewise be argued that although this is just copying and adapting the process, most humans do the same thing – learn from parents/peers and copy. There is imho very little unique thought unless thinking out of the box, and even then it’s probably standing on the backs of giants. Copying is evident in peer pressure, advertising, tribalism etc.
  
  Going down the rabbit hole…
  If an AI can mimic a human such as recent ML models GPT-3 and better, is that not just copying and doing the same thing?
  As a thought experiment does that also mean humans, via a massive model, is the computer creating the question to the answer “42” since all our combined knowledge is mashed together, relationships found, that will provide a front end for it?
  
7. Nick says:
  
  October 25, 2022 at 12:48 pm
  
  No, that random mess is indeed a random mess. It has nothing to do with the model or the training data set; it is pure Gaussian noise.
  
  The algorithm starts with pure noise, and the model is trained to remove that noise slowly, step by step, adding a bit of noise back in after each step. Rinse and repeat until all the noise is gone.
  
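
The remove-some-noise, add-a-little-back loop described in the thread above can be sketched as a toy sampler. Everything here is an assumption for illustration: `toy_predict_clean` is a hypothetical stand-in for the trained denoiser (it simply nudges the image halfway toward a known target, which a real model never has), and the linear noise schedule and step count are made up.

```python
import random

def toy_predict_clean(image, target):
    """Hypothetical stand-in for the trained denoiser.

    It nudges the image halfway toward a known target. A real model has no
    target to peek at; it has only what it learned from training data.
    """
    return [0.5 * x + 0.5 * t for x, t in zip(image, target)]

def sample(seed, target, steps=50):
    """Denoise, then re-add a smaller amount of fresh noise, until none is left."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure Gaussian noise
    for step in range(steps, 0, -1):
        guess = toy_predict_clean(image, target)
        noise_level = (step - 1) / steps  # shrinks every step, exactly 0 on the last one
        image = [g + noise_level * rng.gauss(0.0, 1.0) for g in guess]
    return image

target = [0.1, 0.9, 0.5, 0.3]  # made-up "image"
image = sample(7, target)
worst_error = max(abs(x - t) for x, t in zip(image, target))
print(worst_error)
```

The re-added noise keeps the process from committing too early; because its level falls to zero on the final step, the last pass is a pure denoise.
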
Mark Topham says:

October 24, 2022 at 5:55 am

I’ve got a version running reasonably well on my Apple M1 Max. You’re a little restricted in size, but following up with upscaling processes works amazingly well.

I did enjoy playing with MidJourney a little more however, as the results were generally more pleasing.

Ben says:

October 24, 2022 at 6:07 am

When you see a shape like an animal in clouds, you’re ultimately de-noising something in your mind, as you are when seeing faces in toast etc. Ultimately the memory is like the trained model: you could effectively draw what your mind sees (an animal in the same pose).
It’s not too far off human imagination, it’s just far more abstract.

justsayin says:

October 24, 2022 at 7:10 am

Has anyone fed this thing an image of the cosmic microwave background radiation yet?

Elliot Williams says:

October 25, 2022 at 12:45 am

Diffusion, the way these image generators work, is really clever. If noising-up an image is a function y=f(x), then finding the image that’s hidden in a bunch of noise is just the inverse: x=f'(y). Start with noise, apply f'().

Figuring out these mappings is horrible. But neural nets are good at fitting arbitrary mappings given enough data.

The other beautiful thing about these programs is that we don’t care in the end if they come up with x=f'(y) exactly. So there’s actually a very wide universe of x-like solutions that we’ll be willing to call good. We don’t know what x we’re looking for anyway, we just care that the tree looks like a tree. Birch, maple. Potato, pohtahto. So if actually finding the inverse is impossible, getting close-ish seems pretty doable.

Or, viewed from a human / non-math perspective:

My son would do scribble drawings, or random watercolor washes, and then try to figure out what it looked like, drawing the outline of the “horse” that he saw in the picture. (Technique from my mother-in-law.) Some of these worked much better than he could draw at that age.
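
The "fit an approximate inverse from data" idea in this comment can be shown with a deliberately tiny sketch. The assumptions are loud ones: the forward function `noising` is an invented one-dimensional toy, and an ordinary least-squares line stands in for the neural network. It illustrates the principle only, not how any real diffusion model is trained.

```python
import random

def noising(x, rng):
    """Toy forward process y = f(x): scale, shift, and add a little Gaussian noise."""
    return 0.5 * x + 1.0 + rng.gauss(0.0, 0.05)

def fit_inverse(xs, ys):
    """Least-squares fit of x ~ w * y + c, standing in for training a network."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((y - mean_y) * (x - mean_x) for x, y in zip(xs, ys))
    var = sum((y - mean_y) ** 2 for y in ys)
    w = cov / var
    c = mean_x - w * mean_y
    return w, c

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # "clean" training examples
ys = [noising(x, rng) for x in xs]                  # their noised versions
w, c = fit_inverse(xs, ys)

x_true = 0.3
x_guess = w * noising(x_true, rng) + c  # close-ish to x_true, not exact
print(w, c, x_guess)
```

The noise-free inverse would be x = 2y - 2. The fitted line lands near that but not on it, and the recovered value is only approximately right, which is the "close-ish is good enough" point.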

futureleadershipacademy says:

January 12, 2023 at 8:25 am

https://flacademy.school/2023/01/12/artificial-intelligence-and-human-rights-action-plan-recommendations-for-human-rights-sensitive-and-ethical-artificial-intelligence/?amp=1
