Feast Your Eyes On These AI-Generated Sounds

The radio hackers in the audience will be familiar with a spectrogram display, but for the uninitiated, it’s basically a visual representation of how the energy across a range of frequencies changes over time. Usually such a display is used to identify a clear transmission in a sea of noise, but with the right software, it’s possible to generate a signal that shows up as text or an image when viewed as a spectrogram. Musicians even occasionally use the technique to hide images in their songs. Unfortunately, the audio side of such a trick generally sounds like gibberish to human ears.
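For a concrete sense of what a spectrogram is, here’s a minimal sketch of how one is computed: slice the signal into overlapping windowed frames and take the FFT magnitude of each. The function name and parameters are our own for illustration, not from any particular library.

```python
import numpy as np

def spectrogram(signal, n_fft=256, hop=128):
    """Magnitude spectrogram via short-time FFT: rows are
    frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        # rfft keeps only the non-negative frequency bins.
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T  # shape: (n_fft//2 + 1, n_frames)

# A pure 1 kHz tone at an 8 kHz sample rate should light up
# a single horizontal band in the display.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(tone)
peak_bin = spec.mean(axis=1).argmax()
# bin -> frequency: peak_bin * sr / n_fft == 1000 Hz
```

Hiding an image in audio amounts to running this in reverse: choosing the per-frame frequency content so the resulting magnitude picture draws the shape you want.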

Or at least, it used to. Students from the University of Michigan have found a way to use diffusion models to not only create a spectrogram image for a given prompt, but to do it with audio that actually makes sense given what the image shows. So for example if you asked for a spectrogram of a race car, you might get an audio track that sounds like a revving engine.

The first step of the technique is easy enough — two separate pre-trained models are used, Stable Diffusion to create the image, and Auffusion to produce the audio. The results are then combined via a weighted average and fed into an iterative denoising process to refine the end result. Normally the process produces a grayscale image, but as the paper explains, a third model can be brought in to produce a more visually pleasing result without impacting the audio itself.
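The core idea of averaging two models inside a denoising loop can be sketched in a few lines. Everything here is a toy stand-in: the two “noise estimate” functions are hypothetical placeholders for Stable Diffusion and Auffusion, and the update rule is a simplification of a real diffusion sampler, shown only to illustrate where the weighted average happens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two pre-trained denoisers: each
# returns an estimate of the noise present in x at step t. In the
# real pipeline these would be Stable Diffusion and Auffusion,
# both operating on the same spectrogram-shaped latent.
def image_noise_estimate(x, t):
    return x * 0.1  # placeholder, not a real model

def audio_noise_estimate(x, t):
    return x * 0.2  # placeholder, not a real model

def joint_denoise(x, steps=50, w_image=0.5):
    """Iteratively denoise x, averaging the two models' noise
    estimates with weight w_image on the image model."""
    for t in reversed(range(steps)):
        eps = (w_image * image_noise_estimate(x, t)
               + (1 - w_image) * audio_noise_estimate(x, t))
        x = x - eps / steps  # greatly simplified update rule
    return x

# Start from pure noise shaped like a small spectrogram.
x0 = rng.standard_normal((64, 64))
result = joint_denoise(x0)
```

Because both denoisers pull the same latent toward their own data distribution at every step, the final spectrogram ends up plausible to each — which is why the image looks like the prompt and the audio sounds like it too.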

Ultimately, neither the visual nor audio component is perfect. But they both get close enough that you get the idea, and that alone is pretty impressive. We won’t hazard a guess as to what practical applications exist for this technique, but the paper does hint at some potential use for steganography. Perhaps something to keep in mind the next time we try to hide data in an episode of the Hackaday Podcast.

7 thoughts on “Feast Your Eyes On These AI-Generated Sounds”

  1. That is quite interesting. But I must disagree with this statement:

    ” Unfortunately, the audio side of such a trick generally sounds like gibberish to human ears. Or at least, it used to.”

    Software for inserting visuals into a music spectrogram has been available since at least the late ’90s. I remember Richard D. James (Aphex Twin) did that on his track “ΔMi−1 = −αΣn=1NDi[n][Σj∈C[i]Fji[n − 1] + Fexti[n−1]]” (also known as “formula”), in which he hid a distorted image of his face — that was 1999. I know — for some, that track “sounds like gibberish,” but I swear it’s music ;-) Also, he was not the only one.

      1. Never let it be said that the Hackaday commenters will allow a little thing like having a point get in the way of them being argumentative…

        This literally sounds like a 56K modem, it’s the definition of gibberish digital noise.

  2. I’m imagining setting up a wide spectrographic display with a persistence of a few seconds, and feeding a loop of these sounds into it. Wall art for the über-nerd!
