MIDI isn’t just about music, as [Johannes Stelzer] shows by using dials to adjust AI-generated imagery in real time. The results are wild, with an interactivity to them that we don’t normally see in such things.
[Johannes] uses Stable Diffusion’s SDXL Turbo to create a baseline image from the prompt “photo of a red brick house, blue sky”. The hardware dials act as manual controls for applying different embeddings to this baseline, such as “coral”, “moss”, “fire”, “ice”, “sand”, “rusty steel”, and “cookie”.
By adjusting the dials, those embeddings are applied to the base image in varying strengths. The results are generated on the fly and are pretty neat to see, especially since there is no appreciable processing delay.
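For the curious, here’s a rough sketch of how the effect could be approximated with the diffusers library: encode the base prompt and a “style” prompt, then let a dial value interpolate between the two embeddings before a single-step SDXL Turbo render. This is our own illustration, not [Johannes]’s actual code, and the prompts and weights are assumptions.

```python
import torch
from diffusers import AutoPipelineForText2Image

# Sketch only: prompts, weights, and structure are illustrative assumptions.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

def encode(prompt):
    # SDXL pipelines expose encode_prompt(); with guidance disabled we only
    # need the positive (and pooled) embeddings.
    embeds, _, pooled, _ = pipe.encode_prompt(
        prompt=prompt, device="cuda", num_images_per_prompt=1,
        do_classifier_free_guidance=False,
    )
    return embeds, pooled

base_e, base_p = encode("photo of a red brick house, blue sky")
moss_e, moss_p = encode("photo of a red brick house covered in moss, blue sky")

def render(dial):
    # dial is 0.0..1.0, e.g. a MIDI CC value divided by 127
    return pipe(
        prompt_embeds=torch.lerp(base_e, moss_e, dial),
        pooled_prompt_embeds=torch.lerp(base_p, moss_p, dial),
        num_inference_steps=1,   # SDXL Turbo is built for single-step output
        guidance_scale=0.0,
    ).images[0]

render(0.5).save("half_moss.png")
```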
The MIDI controller is integrated with the help of lunar_tools, a software toolkit on GitHub for building interactive exhibits. As for the image end of things, we’ve previously covered how AI image generators work.
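We don’t know exactly how lunar_tools wires up the controller, but reading dial positions over MIDI is simple enough that a minimal sketch with the mido library shows the idea (the controller numbers and embedding names here are hypothetical):

```python
import mido

# Hypothetical mapping of MIDI CC numbers to embedding names; yours will differ.
DIALS = {1: "coral", 2: "moss", 3: "fire", 4: "ice"}
strengths = {name: 0.0 for name in DIALS.values()}

with mido.open_input() as port:  # opens the default MIDI input
    for msg in port:
        if msg.type == "control_change" and msg.control in DIALS:
            strengths[DIALS[msg.control]] = msg.value / 127.0  # CC range is 0-127
            print(strengths)  # feed these weights into the render loop above
```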
In before people lose it over AI image generation.
Very cool project. Visually fascinating, and the minimal processing time makes it feel highly polished.
This is a hacker crowd. I’d expect they support genAI.
Nope, no support for genAI. genAI just produces crap.
I agree. genAI steals artists’ works without permission.
They have no one to blame but themselves. The devs BLATANTLY lied about how the models are trained and what they used to train them, while keeping lists of artists and entire catalogs of works. THEN the Discord leaks pretty much proved the entire team lied through their teeth and KNEW it was lies. I’ve got zero sympathy for them. They should have been transparent and taken whatever regulations may come, but they didn’t.
Real developers lie to the larger dumb audience, the politicians, and the money grabbers; it’s the only thing to do, really.
There are many ready-made MIDI controllers with dials available.
So this ‘only way he knew’ snark seems a bit out of place.
FUCKING HELL why don’t my replies go where they should??
Drives me nuts, excuse me.
A little more responsiveness and it has good potential as an EQ visualizer plugin.
Would be a cool concept, though I don’t think one’s GPU turning into a jet engine as it goes into overdrive to keep generating makes it a practical one.
The demo video missed the juiciest opportunity: using multiple dials at the same time.
Yeah, I was thinking about that too. I wonder if the outputs were pre-baked to make it more responsive, which would make it prohibitive to bake every possible combination of dial positions. Or maybe they just forgot to film the coolest part; that happens more often than you’d think.
How is it so responsive? My images with Stable Diffusion take at least 10 seconds for a 512×512 image.
They used a “distilled” model, which is a model that tries to condense what a set of bigger models does while being smaller and faster. Essentially it’s a model of other models. It’s as crazy as it sounds.
It enables stupidly fast generation (this one does it within a single step), but as seen in the demo, the accuracy and ability to deviate take a nosedive, making it more of something for experimental showcases like this.
SDXL Turbo (as mentioned in the article) can render an output in one or two steps. I can generate a 512×512 image on my 4080 Super in about 300 ms.
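Something like this (a rough sketch, not my exact script; timings will vary with hardware and resolution):

```python
import time
import torch
from diffusers import AutoPipelineForText2Image

# Rough timing sketch: single-step SDXL Turbo at its native 512x512 resolution.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

start = time.time()
image = pipe(
    "photo of a red brick house, blue sky",
    num_inference_steps=1, guidance_scale=0.0,
    height=512, width=512,
).images[0]
print(f"generated in {time.time() - start:.2f} s")
```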
Can’t deny that generative AI used for interactive showcases like this shows potential.
Just wish it didn’t involve Stability AI. They are definitely one of the more ethically dubious of the bunch.
This makes me like it a lot more
Why do you say SAI is more ethically dubious? All of the major ones are trained on scraped data, and at least SAI isn’t then putting the result in a corporate walled garden.
Very cool. I’m glad they used a self-hostable image generator and not some API a corporation could take away on a whim. I did something similar (without a physical interface) for a puzzle in a table-top RPG – the device had dials corresponding to the classical elements that changed the overall environment, and switches to toggle on or off specific elements. The players had to use it to match descriptions from an NPC’s journal.
Dang, that’s cool! Any plans for a project writeup?
Now, apply it to a photo of a face, with variables like “ear size” and “hair color”. We’ve long seen this with selections from discrete images, but it would be a lot more fun with continuous variation.
Or use it on a piece of writing like a short story or a poem. That would be interesting.
Sweet dreams are made of this…[insert sound track here]
What does MIDI have to do with this, other than being the only way he knew to read dials?
“If all you have is a hammer…”
The developer probably chose MIDI because you can buy a cheap box with a bunch of knobs on it and a CPU that encodes them and sends messages over the interface. Otherwise you have to build your own, which is expensive and takes a long time, and the developer wasn’t interested in hardware anyway.
This reminds me so much of a video I saw of a talk called “Inventing on Principle” by Bret Victor. As a means of illustrating his point, he talks about his own guiding principle of immediate feedback in creative endeavors. Sure, the relationship between AI/ML/what-have-you and human creativity is absolute flame war fodder, but I think this is a fantastic ‘fuzzy’ way to interact with the ‘fuzzy’ black-box logic of AI image generation engines. Also, would 100% recommend looking up the above video on YouTube. Well worth your 55 minutes.
I’m curious if it can sidestep prompts and just remix images without external learning data.
Interesting, but what is the use case?