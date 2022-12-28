You’ve likely heard quite a bit of buzz over the last few months about Stable Diffusion. The new version (v2) has come out, and in addition to the standard image-to-image and text-to-image modes, it also has a depth-image-to-image that can be incredibly useful. [Andrew] has a write-up that guides you on using this mode.
The basic idea is that you can feed both an image and its depth map into the model, which allows you to control what gets put where. Stable Diffusion is a bit confusing, but we already have some great resources to help you wrap your head around it. For the depth input, you can use a depth map from a camera with lidar (many recent phones include this) or have another model (like MiDaS) estimate it from a 2D picture. This becomes powerful when you want to preserve a specific composition, such as an iconic scene from a well-known movie. You can keep the characters’ poses on the screen but transform the style of the scene into whatever you wish (as seen above).
We have already covered a technique to generate textures right in Blender, and the new depth mode has already been integrated there to make the textures more accurate.
[Justin Alvey] used it to create architectural photos from dollhouse furniture. Using the MiDaS model, he estimated the depth and threw away the RGB aspects by setting the denoising strength to maximum. The simplified dollhouse furniture was easily recognizable to the model, which helped produce great results.
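To see why maximum denoising strength discards the original colors, here is a minimal sketch of the step arithmetic that img2img-style samplers commonly use. The function name `img2img_schedule` is hypothetical, and whether [Justin]'s tooling computes it exactly this way is an assumption. The idea is that strength decides how much of the noise schedule is actually run: at 1.0 nothing is skipped, so the input picture is fully replaced by noise and only the depth conditioning carries the composition.

```python
def img2img_schedule(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Return (steps_skipped, steps_run) for an img2img-style sampler.

    Assumption: strength in [0, 1] selects what fraction of the schedule
    is run, starting from a partially noised copy of the input image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


for strength in (0.0, 0.5, 1.0):
    skipped, run = img2img_schedule(40, strength)
    print(f"strength={strength}: skip {skipped} steps, denoise for {run}")
```

At strength 0.0 no denoising happens and the input comes back unchanged; at 1.0 the full schedule runs and none of the original RGB content survives.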
The only downside is that the perspective gives the results a rather dollhouse feel. Changing the focal length and moving the camera farther away helps. Overall, it’s a clever use of what the new AI model can do. It’s a fast-moving space, so this will likely be out of date in a few months.
6 thoughts on “Giving Stable Diffusion Some Depth”
Still no discussion about the source material used to train the machine learning model, or how it affects those whose source material it is.
Copyright and associated licensing are quite important in our society, not just for art but for any other work, be it sound, video, source code, or even the compiled result of said source code.
That none of these articles even seem to mention the legal ramifications surrounding the gathering of publicly available data for use in generative machine learning applications is frankly a bit concerning.
There is an exception in the USA and the EU covering data mining for search engines. It counts as fair use on the grounds that it helps people find the source material: the search engine doesn’t provide the material itself, it only says where it is and how relevant it is to what one tried to find.
This fair use exception to copyright is often stated as the reason why any machine learning system can use publicly available data in its training dataset without prior permission. (The EU copyright directive uses a different term for fair use, but in essence it is the same thing with more or less the same exceptions.)
Yet generative machine learning systems that create content of their own don’t help people find the original; they compete directly with the author of the original. So this ground for fair use seems misapplied here, to say the least. (And ignoring the reason why an exception was relevant is legally quite dangerous, since then we wouldn’t be far from castle doctrine allowing indiscriminate murder.)
At least OpenAI’s/Microsoft’s Copilot system made for GitHub is currently facing a lawsuit, so some discussion is happening on that front. But that is frankly not bringing much light to the issue at large, mainly since Copilot has many cases of making exact copies of people’s source code, including comments, so there are somewhat decent grounds for calling it copyright infringement. Visual art is more nuanced, however.
The rather distinct difference between:
1. An ML system used for categorization/analysis of data (mainly content recognition)
2. An ML system used to generate content. (art/audio/text/code/etc generation)
is something we should probably discuss more often, especially as far as copyright/licensing of source material for the training dataset is concerned.
Personally, I am of the opinion that any ML system used for content generation should explicitly rely on authorized/licensed data for its training dataset. However, Pandora’s box has already been opened.
It’s an interesting topic. The major difficulty for the AI is the fact that it lacks “grounding”.
In linguistics this refers to the idea that words need to be “grounded” on a lower abstraction level to have any meaning at all – otherwise all words are just circular references to each other. You can look up in a dictionary the meaning of a word, and all you get is a synonym or a paragraph where each word is also pointing to other words in the same dictionary – you’re stuck in a loop because nothing really explains what any of the words mean. The fact that we know anything at all is because we have direct experience of what at least some of the words mean.
For the AI, every picture it sees is just data. Seeing a picture tagged as “cat”, it does not know what in the picture actually is the “cat” – it treats all of the data as relevant: it is not grounded. Having a depth map to separate picture elements is one step on the way to giving the algorithm some sense of what it is actually seeing, in that it can separate objects in an image more easily than by simply comparing a million pictures and finding what’s common between data labeled as “cat”.
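As a rough illustration of how a depth map can separate picture elements, here is a minimal sketch that splits pixels into “near” and “far” with a single threshold. The helper name `split_by_depth`, the toy values, and the assumption that depth is given in metres (smaller meaning closer) are all made up for the example; a real depth estimator may output relative or inverse depth, in which case the comparison would need to change.

```python
def split_by_depth(depth_map, threshold):
    """Return a mask that is True where a pixel is nearer than the threshold."""
    return [[d < threshold for d in row] for row in depth_map]


# Toy 4x4 depth map (hypothetical values): a near object in front of a far wall
depth = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 1.2, 1.1, 5.0],
    [5.0, 1.0, 1.3, 5.0],
    [5.0, 5.0, 5.0, 5.0],
]

mask = split_by_depth(depth, threshold=2.0)
foreground_pixels = sum(cell for row in mask for cell in row)
print(foreground_pixels, "pixels belong to the near object")
```

Note that this only tells the algorithm that something is standing in front of something else, not what either thing is.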
Does that, however, work to “ground” the algorithm in the same sense as what we do with words?
A typical way to make a content recognition ML system is to scan pixel by pixel through the image and look at the neighboring pixels around the currently scanned one. From there the ML system uses this array of pixels to generate a value for how certain it is that the scanned pixel is on the object it is trained to detect. So one more or less gets a heat map of where the object is in the image.
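The scanning described above can be sketched in a few lines. This is only an illustration under stated assumptions: `heat_map` and `mean_brightness` are hypothetical names, and the scorer is a trivial stand-in (average brightness) where a real system would use a trained model.

```python
def heat_map(image, score_patch, radius=1):
    """Score every pixel from its neighborhood and return a per-pixel map."""
    height, width = len(image), len(image[0])
    heat = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the pixel and its neighbors, clipped at the image border
            patch = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            heat[y][x] = score_patch(patch)
    return heat


def mean_brightness(patch):
    """Stand-in scorer (hypothetical): average brightness of the patch."""
    return sum(patch) / len(patch)


# Toy grayscale image: a bright 2x2 "object" on a dark background
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

heat = heat_map(image, mean_brightness)
for row in heat:
    print(" ".join(f"{value:.2f}" for value in row))
```

The values come out highest over the bright square and fall off toward the corners, which is the heat map idea in miniature.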
To start with, we can just give it a few hundred images, a good portion of which contain the object while the rest don’t. After a fair bit of training, once it correctly identifies the pictures, one can manually look at the heat map and see if it is on the right track.
One can likewise manually make a heat map to compare the results to and pick out the ML iterations with the best result and further improve on those.
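Comparing against a hand-made heat map could look something like the sketch below. Mean squared error is just one possible distance measure, picked here as an assumption, and the names `mean_squared_error`, `pick_best`, and the iteration labels are hypothetical.

```python
def mean_squared_error(predicted, reference):
    """Average squared difference between two heat maps of the same size."""
    total, count = 0.0, 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            total += (p - r) ** 2
            count += 1
    return total / count


def pick_best(candidates, reference):
    """Return the name of the candidate whose heat map is closest to the reference."""
    return min(candidates, key=lambda name: mean_squared_error(candidates[name], reference))


# Hand-made reference: the object occupies the right-hand column
reference = [[0.0, 1.0], [0.0, 1.0]]

# Heat maps produced by two hypothetical training iterations
candidates = {
    "iteration_a": [[0.5, 0.5], [0.5, 0.5]],
    "iteration_b": [[0.1, 0.9], [0.2, 0.8]],
}

best = pick_best(candidates, reference)
print("keep training from:", best)
```

Here the second iteration wins because its map is already concentrated where the reference says the object is.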
But yes. If we just give the ML system a million images, it won’t have a clue what a cat or a dragon is. Without guidance it is just left to its own devices and likely won’t do much of interest at all.
>the object it is trained to detect
Yes, but that’s assuming the algorithm is already trained to detect “cats”. You may throw it any object, so wouldn’t that mean you first have to train the algorithm to detect everything under the sun so that it could then take an arbitrary command to switch from cats to dogs to cars?
There are many ways of detecting objects, but does that give the algorithm any understanding of what it is requested to do?
I imagine combining the two methods would turn out quite powerful. Separating objects first allows the ML algorithm to concentrate on the detailed differences between objects with fewer examples given.
But without “grounding”, the algorithm still doesn’t understand that a “cat” is not just the statistical commonality between sets of data, just as it doesn’t understand that a peg-legged stool turned upside down is actually a boot drying rack. It doesn’t have access to the underlying reality of the object.
A good illustration of the point is the famous Gripsholm Castle lion.
Having been sent a lion pelt, the taxidermists attempted to stuff it, but none of them had actually seen a lion or knew what it was, so they made their best guesses. At least they had some examples appearing in heraldry, so they got it “mostly right”. With only second hand information, would you have done better?
https://en.wikipedia.org/wiki/Lion_of_Gripsholm_Castle