Focus Your Ears With The Visual Microphone


A group of MIT, Microsoft, and Adobe researchers has managed to reproduce sound using video alone. The sounds we make bounce off every object in the room, causing microscopic vibrations. The Visual Microphone uses a high-speed video camera and some clever signal processing to extract an audio signal from these vibrations. Using video of everyday objects such as snack bags, plants, Styrofoam cups, and water, the team was able to reproduce tones, music, and speech. Capturing audio from light isn’t exactly new. Laser microphones have been around for years. The difference here is that the visual microphone is a completely passive device. No laser or special illumination is required.

The secret is in the signal processing, which the team explains in their SIGGRAPH paper (pdf link). They used a complex steerable pyramid along with wavelet filters to obtain local pixel motion values. These local values are averaged into a global motion value. From this global motion value, the team is able to measure movement down to 1/1000 of a pixel. That is plenty of resolution to decode audio data.
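To get a feel for why averaging helps, here is a minimal sketch in Python. It does not implement the steerable pyramid from the paper. It only simulates many noisy local motion estimates of one shared vibration and averages them into a single global signal. The tone frequency, frame rate, noise level, number of locations, and the plain unweighted mean are all assumptions picked for illustration, not values or methods taken from the paper.

```python
import math
import random


def local_motion_signals(true_motion, n_locations, noise_std, rng):
    """Simulate noisy per-location motion estimates (in pixels) of one shared vibration."""
    return [[m + rng.gauss(0.0, noise_std) for m in true_motion]
            for _ in range(n_locations)]


def global_motion(signals):
    """Average the local signals sample by sample into one global motion signal."""
    n = len(signals)
    return [sum(column) / n for column in zip(*signals)]


def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true motion."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical setup: a 440 Hz tone filmed at 2200 frames per second,
# moving the object by only 1/1000 of a pixel.
frame_rate = 2200
true_motion = [0.001 * math.sin(2 * math.pi * 440 * k / frame_rate)
               for k in range(220)]

# Each local estimate is assumed to be ten times noisier than the motion itself.
signals = local_motion_signals(true_motion, n_locations=2500,
                               noise_std=0.01, rng=rng)
averaged = global_motion(signals)

print("single location RMS error:", rms_error(signals[0], true_motion))
print("global average RMS error: ", rms_error(averaged, true_motion))
```

With independent noise, the error of the average shrinks roughly with the square root of the number of locations, which is how a motion far smaller than the noise in any one spot can still be recovered.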

Most of the research was performed with high-speed video cameras, which are well outside the budget of the average hacker. Don’t despair though: the team did prove that the same magic can be performed with consumer cameras, albeit with lower-quality results. The team took advantage of the rolling shutter found in most of today’s CMOS-based consumer cameras. Rolling shutter CMOS sensors capture images one row at a time. Each row can be processed in a similar fashion to the frames of the high-speed camera. There are some gaps between frames when the camera isn’t recording anything, though. Even with the reduced resolution, it’s easy to pick out “Mary had a little lamb” in the video below.
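The rolling shutter trick comes down to timing: every row is captured at a slightly different moment, so one frame holds many time samples instead of one. The sketch below works out those row timestamps in Python. The function names, the 60 fps and 720-row figures, and the `readout_fraction` parameter (the share of each frame period spent reading rows, with the rest being the gap between frames) are hypothetical values for illustration, not numbers from the paper or from any particular camera.

```python
def row_sample_times(frame_rate, rows_per_frame, readout_fraction, n_frames):
    """Return the capture time in seconds of every row of every frame.

    readout_fraction is the assumed share of each frame period spent
    reading rows; the remainder is the gap before the next frame starts.
    """
    frame_period = 1.0 / frame_rate
    row_period = frame_period * readout_fraction / rows_per_frame
    return [f * frame_period + r * row_period
            for f in range(n_frames)
            for r in range(rows_per_frame)]


def effective_row_rate(frame_rate, rows_per_frame, readout_fraction):
    """Rows sampled per second while the sensor is actually reading out."""
    return frame_rate * rows_per_frame / readout_fraction


# Hypothetical consumer camera: 60 fps, 720 rows, 80% of each frame spent reading.
times = row_sample_times(frame_rate=60, rows_per_frame=720,
                         readout_fraction=0.8, n_frames=2)

row_step = times[1] - times[0]      # spacing between neighbouring rows
frame_gap = times[720] - times[719]  # jump from last row to the next frame

print("row spacing (s):      ", row_step)
print("gap between frames (s):", frame_gap)
print("rows per second during readout:", effective_row_rate(60, 720, 0.8))
```

Under these assumptions the rows arrive tens of thousands of times per second during readout, but the jump between the last row of one frame and the first row of the next is much larger than the row spacing. That jump is the gap in which nothing is recorded.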

We’re blown away by this research, and we’re sure certain organizations will be looking into it for their own use. Don’t pull out your tin foil hats yet though. Foil containers proved to be one of the best sound reflectors.

Thanks [Zach]!

53 thoughts on “Focus Your Ears With The Visual Microphone”

    1. I saw someone on Fark point out that they had already seen this on another site yesterday, so your comment is old news.
      Seriously- do people not understand how the internet works? Do you think Adam Fabio flew out to interview these guys a few days after your alphabet soup site did the same thing?
      What are you trying to accomplish by pointing this out? Are we supposed to be impressed with your internet skills? Is Mr. Fabio expected to apologize for wasting your time? Should the offending post be removed? If old news is such a waste of your time why waste more time pointing it out? Do you send messages to CBS to complain that the plane crash they are covering was also reported on NBC 35 seconds earlier? Was your intent to get some attention (mission accomplished)? You have the modern equivalent of the Library of Alexandria at your fingertips- Find a better use for it than just squawking to hear yourself squawk.

  1. Honestly my BS meter was ringing like mad watching that video, thinking there’s absolutely no way that visual noise wouldn’t mess the recording up, but no they’re using a camera with a pretty great lens system and sensor that’s also running 25x slower than the recommended speed.

    1. I believe they have methods that either vibrate the window to mask sound or they use two windows where the intervening empty space eliminates the sound, this system would defeat both of these defence methods, so I guess buy some drapes.

    2. kind of hard to shine a laser pointer into the past/at a youtube video

      this technique should allow NSA, and other criminal organisations, to analyze video material for audible hints. From the looks of it you can get something as long as it was recorded with a rolling shutter type of sensor (meaning all cellphones)

          1. Whilst I can’t say for certain without further reading, I’d suggest that since this relies on tiny sub-pixel variations, the compression applied to normal video posted on YouTube etc. is going to treat the pixels of the target object as unchanging, so the variations will only appear in key frames, destroying the audio. The camera in such video is also less likely to be mounted on a tripod. If it is held in the hand, the strongest signal, besides waving it around, might be the pulse of the person holding it. Still, it remains an interesting and impressive achievement.

  2. now places that have no-audio-recording rules will have to add video to their rules.

    what could you get from a movie?

    that could be a new way for hollywood to watermark the movie and will survive better than other methods.

    while the watermark may not stop recording or playing, it will allow for identifying the source, say for example by each theater having its own ID number spoken.

    here is how that would work.

    1. hollywood would make a movie.

    2. someone would speak out the identity of each theater (potentially thousands of unique movies personalized to that theater (digital projection theaters the modulation of the object could be done by the theater’s projector (maybe say like modulating the frame hold or focus adjustment or even the brightness))).

    1. hehe.
      I must admit I was also thinking of a Smart clip a few weeks ago when the Senate Investing Committee was being investigated by the Senate investigating Committee.

    1. The frame rate is too low. The human voice is probably about 80Hz to 1200Hz. So the frame rate would need to be double this, and that is not the case for silent movies. For best recovery they were using high speed cameras, to get 38K frames a second.

      1. You could potentially interpolate some of the missing data, perhaps processing the motion blur occurring throughout the individual frames?

        The output would probably be very tinny, but would be interesting nonetheless

    2. Movie sets are very noisy. That’s why movies (talkies) record everything in a studio.
      Now if you could recover sound from silent film, it would be picking up the noise from the camera, the director yelling, and god knows if the actors even read their dialogue.

      1. That has been done with expert lip readers on stuff shot in a documentary mode with the subject talking at a silent camera. Transcribed of course.
        As with all bugs just turn up the music and drown them in jams.

  3. This would be an interesting way to authenticate products or currency… If you manufactured say a dollar bill in a way that when exposed to audio of a certain frequency, it would resonate at a specific frequency and absorb others… This could be done by adding certain materials like rubber or something that would attenuate whatever frequencies you want to eliminate… or the reverse could be done where it only attenuates a certain frequency.. Merchandise could be packaged in materials that allow manufacturers to identify it as authentic.. Or maybe not.. :)
    Either way, it would be interesting to see what materials react in what ways to different tones, etc..

    1. It might also be interesting to think of the possibility of using this for identification.. Imagine a bank card, or driver’s license that had a chip or something that the machine it is inserted into examines under exposure to some combination of tones with a tiny camera that analyzes the chip’s vibrational response… and each person has a slightly different combination allowing a unique identifier.. Just thoughts off the top of my head.. :)

    2. How well are the auditory properties of an object preserved once it’s been creased/folded/torn? I am no expert, but isn’t resonance of a piece of paper affected by folding it?

      1. That would be a good point.. Yes, I suppose it would.. :) Also wear and tear on the bill, probably moisture, and whatever oils and other contaminants… Hmm.. There goes reality again…shooting down my daydreams.. Might still work for the ID card though if the ‘chip’ were protected well enough..but I doubt the benefit outweighs the effort.. :/

    3. What is going on with people posting insane and pointless uses for this technology? If you have control of the recording environment and the object, anything you can do with this technology you can do an order of magnitude better with an actual microphone. This is useful when you can’t use a microphone.

  4. Great research by these MIT researchers, and an idea I am very much interested in making a hack for in the future. I made an article in the forums a few days ago about this here:

    I’m looking for any takers, researchers, hackers, anyone with ideas who would like to pursue this further and make any kind of device with me. if you are interested throw a post in there or send a pm my way!!

  5. “Rolling shutter CMOS sensors capture images one row at a time. ”

    Not a great description. Exposure is controlled by a pixel reset ahead of the pixel read. If the light to the sensor is bright enough a rolling shutter camera will reduce the lag to the order of one line.

    In terms of paleoacoustics applications, no cinematographer or camera man would ever do this deliberately in the normal course of filming something. 24fps footage for example normally aims for an exposure of 1/48th (180 degrees in cine terms) so things blur instead of warp.

    1. That’s what I was thinking – they probably used a very fast shutter speed on the rolling shutter tests. AND an extremely stable platform for the camera. I doubt that this would work for just random video footage.

  6. You could use that to analyse some old movies and YouTube videos without sound to try and recover what was going on…even with the 60fps from the SLR you could tell the notes…
