How The Image-Generating AI Of Stable Diffusion Works

[Jay Alammar] has put up an illustrated guide to how Stable Diffusion works, and the principles in it are perfectly applicable to understanding how similar systems like OpenAI’s Dall-E or Google’s Imagen work under the hood as well. These systems are probably best known for their amazing ability to turn text prompts (e.g. “paradise cosmic beach”) into a matching image. Sometimes. Well, usually, anyway.

‘System’ is an apt term, because Stable Diffusion (like its peers) is actually made up of many separate components working together to make the magic happen. [Jay]’s illustrated guide really shines here, because it starts at a very high level with only three components (each with their own neural network) and drills down as needed to explain what’s going on at a deeper level, and how it fits into the whole.

Spot any similar shapes and contours between the image and the noise that preceded it? That’s because the image is a result of removing noise from a random visual mess, not building it up from scratch like a human artist would do.

It may surprise some to discover that the image creation part doesn’t work the way a human does. That is to say, it doesn’t begin with a blank canvas and build an image bit by bit from the ground up. It begins with a seed: a bunch of random noise. Noise gets subtracted in a series of steps that leave the result looking less like noise and more like an aesthetically pleasing and (ideally) coherent image. Combine that with the ability to guide noise removal in a way that favors conforming to a text prompt, and one has the bones of a text-to-image generator. There’s a lot more to it of course, and [Jay] goes into considerable detail for those who are interested.
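The denoise-in-a-loop idea described above can be sketched in a few lines of Python. This is a toy illustration only: the function `denoise_step` is a made-up stand-in for the real noise-predicting neural network (in Stable Diffusion, a text-conditioned U-Net), and the names here are not any actual API.

```python
import numpy as np

def denoise_step(image, step):
    """Stand-in for the noise-predicting network.

    A real system would predict the noise present in `image` at this
    step, guided by a text-prompt embedding. Here we simply pretend a
    fixed fraction of the image is noise.
    """
    return 0.1 * image

def generate(steps=50, size=(64, 64), seed=0):
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(size)        # the seed: pure random noise
    for step in range(steps):
        predicted_noise = denoise_step(image, step)
        image = image - predicted_noise      # subtract a little noise each step
    return image

img = generate()
print(img.shape)  # (64, 64)
```

The structure is the important part: the output image is whatever remains after repeatedly subtracting predicted noise from the random starting point, which is why the final image often echoes shapes already present in the initial noise.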

If you’re unfamiliar with Stable Diffusion or art-creating AI in general, it’s one of those fields that is changing so fast that it sometimes feels impossible to keep up. Luckily, our own Matthew Carlson explains all about what it is, and why it matters.

Stable Diffusion can be run locally. There is a fantastic open-source web UI, so there’s no better time to get up to speed and start experimenting!

40 thoughts on “How The Image-Generating AI Of Stable Diffusion Works”

  1. As exciting as new technology is, this presents dangers which need to be addressed. It can be used to produce abusive and even illegal images, as well as those which may prove politically subversive. It is reasonably acceptable for these to be run online with proper filters, but they shouldn’t be made publicly available for modification. The consequences of individuals having unrestricted access to the technology could be disastrous for individuals and democracy.

      1. I was thinking the same, however it’s a lot more inconvenient to make millions of photocopies of your drawing and mail them out. The internet makes it conveniently easy to “share” your stuff, good or bad…

    1. It’s just automatic Photoshop-like image manipulation. Anyone can make any image, just manually. AI only speeds things up. I would rather teach people to differentiate reality from crap instead of reaching for the banhammer and inflating silly laws. Maybe people want bans and laws because they fear the freedom to make anything you can imagine.

      1. It has nothing to do with PS-like manipulation. The AI creates everything from its database. You just give a general idea of what you want and the AI pulls what it thinks looks like it. It doesn’t matter how specific you get, the AI will pull imagery from its databank in the hope it fits the general idea of what you want. In PS image manipulation you have got to come up with the imagery you want to see manipulated, and it’s a lot trickier and creative already than what most people would think, even before getting into the compositing stuff.

    2. Spare us that nonsense. Corporate shills really need to stop this shameless hollow posturing and vapid virtue-signaling, because multi-billion corporations are no more or less ethical than individuals (as evidenced by the almost weekly cases that big pharma loses, having to pay millions in compensation to their victims or their families).

      More importantly, literally nothing that could be done with SD couldn’t have been done before using photobashing with any decent photo editing software. Heck, tyrannical authoritarian regimes like Soviet Russia were editing photos and removing disappeared people from them decades ago. It’s nothing new.

      But the fact that SD is open-source and anybody can check the code and understand how it works makes it far more likely that countermeasures will appear to recognize AI-generated images, and even regularly falsified images.

  2. I’m by no means an expert on deep/machine learning or AI. But some points in the article rub me wrong. Like the description of the gif. Why do you call it “a random visual mess”. Isn’t it a training model? Isn’t it really a probability of like what thousands of images from the learning set look like?

    I like to point to this excellent article:
    https://hackaday.com/2019/07/16/neural-network-in-glass-requires-no-power-recognizes-numbers/
    “AI” really is just Bayesian statistics. There sadly is no intelligence. The linked article shows nicely that if you run the whole process in reverse, a number-detecting “AI” just maps pixel probabilities and locations to a certain path the light can take through the substrate. The hardware “AI” illustrates the smoke and mirrors: it’s just a statistical black box.

    And now I make a rush for the door not to be torn up by AI researchers.

    1. Hypes come and go, but 26 years on I still earn my bread by programming microcontrollers in assembly and C. And most surprising, I only had to change jobs 3 times during my career. I guess I will retire and 8-bit PICs will still be a thing keeping the world running when everything else fails.

    2. > “AI” really is just bayesian statistics. There sadly is no intelligence.

      Which is why researchers have tried and failed to rebrand the field as “machine learning”. “ML” doesn’t get articles read and research funded as effectively as “AI”. For most “AI” is seen as futuristic via popular sci-fi tropes and “ML” is yet another bit of jargon.

        1. The irony is Bayes invented his theory as a response to Laplace’s atheism. The intelligent design of his day. Now repurposed from its original packaging to try and prove the opposite. Just like Mendel’s anti-evolution theory of genetics was repackaged as part of Neodarwinism. If you can’t refute them, reuse them :D

    3. I’m not claiming that neural nets are intelligent, but we actually don’t know what “intelligence” is, and therefore it’s hard to definitively claim they are not. Maybe our brains are just Bayesian statistics, too?

      1. You are absolutely right on this. Though I struggle to define human intelligence. In terms of what makes human intelligence special over animal intelligence, I have an anecdote: humans have the capability to build tools for the manufacturing of tools. Like a machine that can cut out a wrench.

        Can an AI design a tool (like a drug)? Yes, but not in an intelligent way like a human. Usually they do random walks, checking electron potentials and molecule interactions on proteins. They can design billions of candidates and output the best 100. A human has to actually give it thought before synthesizing something that would cost the company money, e.g. where to slap a chlorine or fluorine on a steroid to increase its binding strength.

        I still think a computer AI is just a fool with very fast fingers. Having recently read a lot of Asimov and Lem, I shudder what a true AI could do for a government. The current ones are already effective tools of oppression.

          1. Conscious choice huh? Didn’t they do some brain scans a while back and find out that for any given action, the hind brain and motor cortex has already set to motion a few seconds before the forebrain finalizes its “decision” and the ego takes credit for it?
            We’re like Maggie in the opening sequence from The Simpsons, pretending to drive the car with a fake toy wheel while Marge sits next to us actually controlling things.

      2. > we actually don’t know what “intelligence” is, and therefore it’s hard to definitively claim they are not.

        On the contrary. You can always say what something isn’t even when you don’t know what it is. That is simply narrowing your definition of “it” – to the point that it may stop existing altogether, but that simply reveals that your concept of it was not coherent in the first place and no such thing can exist.

        1. The trick is that you need to know what something is before you can definitely say it exists. If you search the entire universe and account for everything that is “not X”, and you still have something left over, that doesn’t mean what you found is indeed X. It might be something else, because you didn’t define what you were looking for.

          If you don’t know that, then you can’t say something might be it – it defaults to not existing until you come up with a definition that allows you to identify it. A good starting place would be to try and identify why something is definitely not it, and forming your definition that way.

          For example, a classical deterministic computer program is not intelligent, because it has no power of independent action and only follows what the environment (the programmer, the input) made it to do. Therefore, what is intelligent must have “causal powers” to do something that is not pre-determined in a classical way.

          1. > a classical deterministic computer program is not intelligent

            So a system that gives you the same optimal answer to a question is not intelligent? It can only be intelligent if it also gives random sub-optimal answers?

    4. As I understand, it’s random mess that’s compared to the model and asked if it looks like what you’ve requested. Which then iterates until it matches with a high degree of confidence.

      Humans do something similar when they see an animal in a cloud (random noise, they see a shape) or a face in a slice of toast etc.
      If you asked that human to draw what it looks like in their mind, not the cloud itself, it would then be refined until it matches, e.g. a lion roaring.

      It can likewise be argued that although this is just copying and adapting the process, most humans do the same thing – learn from parents/peers and copy. There is imho very little unique thought unless thinking out of the box, and even then it’s probably standing on the backs of giants. Copying is evident in peer pressure, advertising, tribalism etc.

      Going down the rabbit hole…
      If an AI can mimic a human such as recent ML models GPT-3 and better, is that not just copying and doing the same thing?
      As a thought experiment does that also mean humans, via a massive model, is the computer creating the question to the answer “42” since all our combined knowledge is mashed together, relationships found, that will provide a front end for it?

    6. No, that random mess is indeed a random mess. It has nothing to do with the model or the training data set; it is pure Gaussian noise.

      The algorithm works by starting with pure noise. The network is trained to remove the noise slowly, step by step, with a bit of fresh noise added back in between steps. Rinse and repeat until all the noise is gone.
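The training side of what this comment describes, often called the forward diffusion process, can be sketched as below. This is a simplified illustration under common assumptions (a linear beta schedule and the closed-form noising step used in DDPM-style models); the names are illustrative, not any library's actual API.

```python
import numpy as np

def add_noise(x0, t, betas, rng):
    """Forward diffusion: blend a clean image x0 with Gaussian noise.

    `betas` is the per-step noise schedule. By the last step the result
    is essentially pure noise, unrelated to x0. The denoising network
    is trained to predict `eps` given the noised image `xt` and step t;
    generation then runs this in reverse, starting from pure noise.
    """
    alpha_bar = np.cumprod(1.0 - betas)[t]   # fraction of signal kept after t steps
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # a common linear schedule
x0 = np.ones((4, 4))                          # stand-in 'clean image'
xt, eps = add_noise(x0, 999, betas, rng)      # at the final step, xt is nearly pure noise
```

At `t = 999` the cumulative `alpha_bar` is tiny, so `xt` is almost entirely the fresh noise `eps` — which is why the starting point of generation carries no trace of the training data.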

  3. I’ve got a version running reasonably well on my Apple M1 Max. You’re a little restricted in size, but following up with upscaling processes works amazingly well.

    I did enjoy playing with MidJourney a little more however, as the results were generally more pleasing.

  4. When you see a shape like an animal in clouds you’re ultimately de-noising something in your mind, as is seeing faces in toast etc. Ultimately the memory is like the trained model, you could effectively draw what your mind sees (animal on the same pose).
    It’s not too far off human imagination, it’s just far more abstract.

  5. Diffusion, the way these image generators work, is really clever. If noising-up an image is a function y=f(x), then finding the image that’s hidden in a bunch of noise is just the inverse: x=f'(y). Start with noise, apply f'().

    Figuring out these mappings is horrible. But neural nets are good at fitting arbitrary mappings given enough data.

    The other beautiful thing about these programs, is that we don’t care in the end if they come up with x=f'(y) exactly. So there’s actually a very wide universe of x-like solutions that we’ll be willing to call good. We don’t know what x we’re looking for anyway, we just care that the tree looks like a tree. Birch, maple. Potato, pohtahto. So if actually finding the inverse is impossible, getting close-ish seems pretty doable.
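The "wide universe of x-like solutions" point can be demonstrated numerically. In the sketch below (a toy, with a 1-D signal standing in for an image and made-up names), heavy noising destroys almost all correlation with the original `x`, so many different inputs map to statistically similar outputs, and any inverse can only hope to recover something x-like:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, noise_level):
    """Toy forward map: mix a clean signal with Gaussian noise."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - noise_level) * x + np.sqrt(noise_level) * eps

x = np.sin(np.linspace(0, 4 * np.pi, 256))   # stand-in 'image'
y_light = f(x, 0.1)    # mostly signal
y_heavy = f(x, 0.99)   # almost pure noise

# With heavy noise, y carries almost no information about which x
# produced it, so inverting it exactly is hopeless -- and unnecessary.
print(np.corrcoef(x, y_light)[0, 1] > np.corrcoef(x, y_heavy)[0, 1])  # True
```

The more noise `f()` adds, the more different `x` values collapse onto indistinguishable `y` values, which is exactly why "close-ish" is the only achievable, and the only needed, standard for the inverse.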

    Or, viewed from a human / non-math perspective:

    My son would do scribble drawings, or random watercolor washes, and then try to figure out what it looked like, drawing the outline of the “horse” that he saw in the picture. (Technique from my mother-in-law.) Some of these worked much better than he could draw at that age.
