Focus Your Ears With The Visual Microphone

August 6, 2014

A group of MIT, Microsoft, and Adobe researchers has managed to reproduce sound using video alone. The sounds we make bounce off every object in the room, causing microscopic vibrations. The Visual Microphone uses a high-speed video camera and some clever signal processing to extract an audio signal from these vibrations. Using video of everyday objects such as snack bags, plants, Styrofoam cups, and water, the team was able to reproduce tones, music, and speech. Capturing audio from light isn’t exactly new; laser microphones have been around for years. The difference here is that the visual microphone is a completely passive device: no laser or special illumination is required.

The secret is in the signal processing, which the team explains in their SIGGRAPH paper (PDF). They used a complex steerable pyramid along with wavelet filters to obtain local pixel motion values. These local values are averaged into a global motion value, which lets the team measure movement down to 1/1000 of a pixel. That is plenty of resolution to decode audio data.
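
For readers who want to experiment, here is a minimal Python sketch of the local-to-global averaging idea. It is not the team’s implementation: instead of a full complex steerable pyramid it uses a single one-sided Gaussian band-pass filter along the horizontal axis, and the function name `global_motion_signal` and the parameters `k0` and `bandwidth` are assumptions made for illustration. The local phase change at each pixel stands in for the local motion value, and an amplitude-weighted average turns those into one global motion sample per frame.

```python
import numpy as np

def global_motion_signal(frames, k0=0.125, bandwidth=0.05):
    """Estimate one horizontal motion sample (in pixels) per frame.

    frames: array of shape (T, H, W). k0 and bandwidth (cycles per pixel)
    are illustrative choices, not values taken from the paper.
    """
    frames = np.asarray(frames, dtype=float)
    freqs = np.fft.fftfreq(frames.shape[2])          # cycles per pixel
    # One-sided Gaussian band-pass: keeps positive frequencies near k0,
    # so the filtered rows are complex and carry a local phase.
    band = np.exp(-0.5 * ((freqs - k0) / bandwidth) ** 2)
    band[freqs <= 0] = 0.0
    resp = np.fft.ifft(np.fft.fft(frames, axis=2) * band, axis=2)
    ref = resp[0]                                    # first frame is the reference
    dphi = np.angle(resp * np.conj(ref))             # local phase change per pixel
    weight = np.abs(ref) ** 2                        # trust strong texture more
    mean_dphi = (weight * dphi).sum(axis=(1, 2)) / weight.sum()
    # A shift of d pixels changes the phase at frequency k0 by -2*pi*k0*d.
    return -mean_dphi / (2 * np.pi * k0)

# Synthetic check: a striped texture wobbling by at most 1/1000 of a pixel.
x = np.arange(256)
true_shift = 0.001 * np.sin(2 * np.pi * np.arange(64) / 16)
frames = np.stack([np.tile(np.cos(2 * np.pi * 0.125 * (x - d)), (8, 1))
                   for d in true_shift])
est = global_motion_signal(frames)
print("max error (pixels):", np.max(np.abs(est - true_shift)))
```

The synthetic texture is noise-free and sits exactly at the filter’s centre frequency, so this only shows that averaging phase over many pixels can resolve very small shifts in principle. Real footage would need several scales and orientations plus noise handling.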

Most of the research was performed with high-speed video cameras, which are well outside the budget of the average hacker. Don’t despair though: the team did prove that the same magic can be performed with consumer cameras, albeit with lower-quality results. The team took advantage of the rolling shutter found in most of today’s CMOS-based consumer cameras. Rolling shutter CMOS sensors capture images one row at a time, so each row can be processed in a similar fashion to the frames of the high-speed camera. There are some gaps between frames when the camera isn’t recording anything, though. Even with the reduced resolution, it’s easy to pick out “Mary had a little lamb” in the video below.
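
To get a feel for the numbers, the Python sketch below computes when each row of a rolling-shutter frame is read out. The camera figures (60 fps, 720 rows) and the `readout_fraction` parameter, the share of each frame period spent reading rows, are hypothetical values picked for illustration; real sensors differ.

```python
import numpy as np

def rolling_shutter_row_times(fps, rows, n_frames, readout_fraction=0.8):
    """Return the readout time (seconds) of every row, plus the time per row.

    readout_fraction is an assumed parameter: the part of each frame period
    during which rows are being read. The rest is the gap between frames.
    """
    frame_period = 1.0 / fps
    line_time = readout_fraction * frame_period / rows
    frame_start = np.arange(n_frames)[:, None] * frame_period
    row_offset = np.arange(rows)[None, :] * line_time
    return (frame_start + row_offset).ravel(), line_time

times, line_time = rolling_shutter_row_times(fps=60, rows=720, n_frames=2)
row_rate = 1.0 / line_time             # samples per second inside one frame
gap = times[720] - times[719]          # last row of frame 0 to first row of frame 1
print(f"row rate within a frame: {row_rate:.0f} Hz")
print(f"gap between frames: {gap * 1e3:.2f} ms")
```

Within a frame the rows arrive at tens of kilohertz under these assumptions, but every frame ends with a stretch where nothing is sampled, which is one reason the recovered audio is rougher than with a high-speed camera.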

We’re blown away by this research, and we’re sure certain organizations will be looking into it for their own use. Don’t pull out your tin foil hats yet though. Foil containers proved to be one of the best sound reflectors.

Thanks [Zach]!

53 thoughts on “Focus Your Ears With The Visual Microphone”

nioga says:

August 6, 2014 at 1:08 pm

I’ve seen it on niebezpiecznik.pl at least 2 days ago. Old news.

1. chuck says:
  
  August 6, 2014 at 3:26 pm
  
  I saw someone on Fark point out that they had already seen this on another site yesterday, so your comment is old news.
  Seriously- do people not understand how the internet works? Do you think Adam Fabio flew out to interview these guys a few days after your alphabet soup site did the same thing?
  What are you trying to accomplish by pointing this out? Are we supposed to be impressed with your internet skills? Is Mr. Fabio expected to apologize for wasting your time? Should the offending post be removed? If old news is such a waste of your time why waste more time pointing it out? Do you send messages to CBS to complain that the plane crash they are covering was also reported on NBC 35 seconds earlier? Was your intent to get some attention (mission accomplished)? You have the modern equivalent of the Library of Alexandria at your fingertips- Find a better use for it than just squawking to hear yourself squawk.
  
  1. fgdghf says:
    
    August 6, 2014 at 3:51 pm
    
    u mad bro?
    
    1. FooBarBaz says:
      
      August 7, 2014 at 2:33 pm
      
      2012 is over, it’s okay to grow up now
      
Max Siegieda says:

August 6, 2014 at 1:11 pm

Honestly my BS meter was ringing like mad watching that video, thinking there’s absolutely no way that visual noise wouldn’t mess the recording up, but no they’re using a camera with a pretty great lens system and sensor that’s also running 25x slower than the recommended speed.

LK says:

August 6, 2014 at 1:14 pm

The possibility to do this and the current state of video processing is awesome, but wouldn’t a laser microphone be easier and cheaper than the high speed camera and image processor? DIY laser mic: http://www.lucidscience.com/pro-laser%20spy%20device-1.aspx

And I don’t think extracting conversations after the fact from normal (not specially recorded) videos is feasible, which would be the main advantage over laser microphones.

1. franklyn says:
  
  August 6, 2014 at 1:35 pm
  
  Laser mics need a lot of setup and alignment.
  
  1. Truth says:
    
    August 6, 2014 at 2:17 pm
    
    And night time is better than daytime as well – See this link:
    
    http://www.lucidscience.com/pro-laser%20spy%20device-6.aspx
    “As you have probably guessed, the Laser Spy system will not perform very well in the daytime due to ambient light sources competing with your laser beam, but this is fine since real spies usually operate in the darkness!”
    
2. Hirudinea says:
  
  August 6, 2014 at 3:29 pm
  
  I believe they have methods that either vibrate the window to mask sound or they use two windows where the intervening empty space eliminates the sound, this system would defeat both of these defence methods, so I guess buy some drapes.
  
  1. Adam says:
    
    August 6, 2014 at 4:02 pm
    
    I believe the device you’re thinking of is a vibrator.
    
  2. twdarkflame says:
    
    August 8, 2014 at 5:58 am
    
    “two windows where the intervening empty space”
    
    double glazing not enough?
    
3. rasz_pl says:
  
  August 6, 2014 at 5:57 pm
  
  kind of hard to shine a laser pointer into the past/at a youtube video
  
  this technique should allow NSA, and other criminal organisations, to analyze video material for audible hints. From the looks of it you can get something as long as it was recorded with a rolling shutter type of sensor (meaning all cellphones)
  
  1. anon says:
    
    August 7, 2014 at 2:49 am
    
    “utilizes a high-speed video camera”
    
    This won’t help anyone extract sound from youtube videos.
    
    1. Angus says:
      
      August 7, 2014 at 5:20 am
      
      Or a normal camera with a rolling shutter sensor.
      
      1. Jim says:
        
        August 7, 2014 at 6:40 am
        
        Whilst I can’t say for certain without further reading, I’d suggest that since this relies on tiny sub-pixel variations, the compression applied to normal video posted on youtube etc is going to consider the pixels of the target object to be unchanging, so any variation will only appear in key frames, destroying the audio. The camera in such video is also less likely to be mounted on a tripod, and if it is held in the hand, the strongest signal, besides waving it around, might be the pulse of the person holding it. Still, it remains an interesting and impressive achievement.
        
      2. cplamb says:
        
        August 7, 2014 at 9:10 am
        
        In the video they show it working with a normal camera with a rolling shutter sensor.
        
4. Whatnot says:
  
  August 7, 2014 at 8:43 am
  
  I think the damn disgusting NSA people want to be able to take your existing video feeds and spy on people really, and that that’s what it is all about.
  
ejonesss says:

August 6, 2014 at 1:15 pm

places that have no-audio-recording rules will now have to add video to their rules.

what could you get from a movie?

that could be a new way for hollywood to watermark the movie and will survive better than other methods.

while the watermark may not stop recording or playing, it will allow for identifying the source, for example by having each theater’s own id number spoken.

here is how that would work.

1. hollywood would make a movie.

2. someone would speak out the identity of each theater (potentially thousands of unique copies of the movie, each personalized to a theater). in digital projection theaters the modulation of the object could be done by the theater’s projector (say, by modulating the frame hold, the focus adjustment, or even the brightness).

1. Aaron Lee Kafton says:
  
  August 6, 2014 at 1:39 pm
  
  A pirate’s camera would be in the same room as the loud soundtrack, being exposed to its own vibrations, making it near impossible to pick out micro vibrations hidden in the original video from those introduced by the second recording.
  
2. rasz_pl says:
  
  August 6, 2014 at 5:59 pm
  
  whats a theater? is that this thing my grandpa used to tell me about?
  also only clueless people watch CAM releases.
  
3. Jim says:
  
  August 7, 2014 at 6:41 am
  
  Watermarking of movies – look up Cinavia.
  
John says:

August 6, 2014 at 1:33 pm

http://youtu.be/HWtPPWi6OMQ?t=56s

1. twdarkflame says:
  
  August 8, 2014 at 6:01 am
  
  hehe.
  I must admit I was also thinking of a Smart clip a few weeks ago when the Senate Investing Committee was being investigated by the Senate investigating Committee.
  
a3 says:

August 6, 2014 at 1:37 pm

I guess this is useful for when you download porn that is missing audio.

1. Aaron Lee Kafton says:
  
  August 6, 2014 at 1:41 pm
  
  People don’t always mute porn? They must live alone.
  
  1. Greenaum says:
    
    August 8, 2014 at 7:57 am
    
    Of course they live alone. That’s why they’ve got the time and freedom for the pr0n.
    
Toot says:

August 6, 2014 at 2:37 pm

Could this be a way to add (the original) sound to silent movies? :)

1. Truth says:
  
  August 6, 2014 at 3:06 pm
  
  The frame rate is too low. The human voice is probably about 80Hz to 1200Hz. So the frame rate would need to be double this, and that is not the case for silent movies. For best recovery they were using high speed cameras, to get 38K frames a second.
  
  1. justice099 says:
    
    August 6, 2014 at 3:28 pm
    
    You could potentially interpolate some of the missing data, perhaps processing the motion blur occurring throughout the individual frames?
    
    The output would probably be very tinny, but would be interesting nonetheless.
    
    1. John says:
      
      August 6, 2014 at 4:37 pm
      
      Old movies (ignoring the poor quality issues) have a framerate at 0.00075% of that needed for full audio range (40k frames per second).
      
      1. Angus says:
        
        August 7, 2014 at 5:22 am
        
        You meant 0.075%.
        
  2. rasz_pl says:
    
    August 6, 2014 at 6:01 pm
    
    have you missed the part about rolling shutter?
    basically you get (framerate of camera * vertical resolution of said camera) of samples
    
    1. Truth says:
      
      August 6, 2014 at 7:17 pm
      
      Is there a rolling shutter when using analogue ? I do not think that there could be, it is chemical and no scanning is involved to generate the effect.
      
      1. Truth says:
        
        August 6, 2014 at 7:25 pm
        
        Wow, there is!
        ‘The “Rolling Shutter” can be either mechanical or electronic.’ – https://en.wikipedia.org/wiki/Rolling_shutter
        And there is a good example image of a 1920’s Dixi race car showing the distortion here:
        https://en.wikipedia.org/wiki/Focal-plane_shutter#Two-curtain_shutters
        
2. tekkieneet says:
  
  August 6, 2014 at 3:10 pm
  
  Movie sets are very noisy. That’s why talkies record everything in a studio.
  Now if you could recover sound from silent film, you would be picking up the noise from the camera, the director yelling, and god knows whether the actors even read their dialog.
  
3. justice099 says:
  
  August 6, 2014 at 3:22 pm
  
  Might be easier to program something to read lips. lol Well, more likely to work somewhat, anyway
  
  1. echodelta says:
    
    August 7, 2014 at 1:44 am
    
    That has been done with expert lip readers on stuff shot in a documentary mode with the subject talking at a silent camera. Transcribed of course.
    As with all bugs just turn up the music and drown them in jams.
    
fl@c@ says:

August 6, 2014 at 4:06 pm

This would be an interesting way to authenticate products or currency… If you manufactured say a dollar bill in a way that when exposed to audio of a certain frequency, it would resonate at a specific frequency and absorb others… This could be done by adding certain materials like rubber or something that would attenuate whatever frequencies you want to eliminate… or the reverse could be done where it only attenuates a certain frequency.. Merchandise could be packaged in materials that allow manufacturers to identify it as authentic.. Or maybe not.. :)
Either way, it would be interesting to see what materials react in what ways to different tones, etc..

1. fl@c@ says:
  
  August 6, 2014 at 4:11 pm
  
  It might also be interesting to think of the possibility of using this for identification.. Imagine a bank card, or driver’s license that had a chip or something that the machine it is inserted into examines under exposure to some combination of tones with a tiny camera that analyzes the chip’s vibrational response… and each person has a slightly different combination allowing a unique identifier.. Just thoughts off the top of my head.. :)
  
2. danieljlouw says:
  
  August 7, 2014 at 12:09 am
  
  How well are the auditory properties of an object preserved once it’s been creased/folded/torn? I am no expert, but isn’t resonance of a piece of paper affected by folding it?
  
  1. fl@c@ says:
    
    August 7, 2014 at 4:06 am
    
    That would be a good point.. Yes, I suppose it would.. :) Also wear and tear on the bill, probably moisture, and whatever oils and other contaminants… Hmm.. There goes reality again…shooting down my daydreams.. Might still work for the ID card though if the ‘chip’ were protected well enough..but I doubt the benefit outweighs the effort.. :/
    
3. HC says:
  
  August 7, 2014 at 7:58 pm
  
  What is going on with people posting insane and pointless uses for this technology? If you have control of the recording environment and the object, anything you can do with this technology you can do an order of magnitude better with an actual microphone. This is useful when you can’t use a microphone.
  
Soo-Hyun says:

August 6, 2014 at 7:26 pm

Related research by some of the same authors: http://people.csail.mit.edu/mrub/vidmag/

mannanj says:

August 6, 2014 at 9:32 pm

great research by these mit researchers and an idea i am very much interested in making a hack for in the future. I made an article in the forums a few days ago about this here:
http://forums.hackaday.com/viewtopic.php?f=10&t=4755

I’m looking for any takers, researchers, hackers, anyone with ideas who would like to pursue this further and make any kind of device with me. if you are interested throw a post in there or send a pm my way!!

1. Whatnot says:
  
  August 7, 2014 at 8:48 am
  
  Don’t be a disgusting tool for the fascists of the world. Do something somewhat useful for the better side of the planet.
  
OneShot Willie says:

August 6, 2014 at 10:46 pm

My initial thought was what two random words will the NSA & CIA give this one in their ToolKit? I’m voting for RollingShutter…

1. exit151 says:
  
  August 7, 2014 at 3:42 pm
  
  And I’m wondering why Microsoft and Adobe are involved in this.. Kinect Xbox One not spying enough, they want more data from it??
  
2. BrightBlueJim says:
  
  August 7, 2014 at 6:10 pm
  
  I don’t think those would qualify as random words in this case.
  
OneShot Willie says:

August 6, 2014 at 10:47 pm

http://hackaday.com/?s=NSA+catalog

Marvin says:

August 7, 2014 at 8:27 am

“Rolling shutter CMOS sensors capture images one row at a time. ”

Not a great description. Exposure is controlled by a pixel reset ahead of the pixel read. If the light to the sensor is bright enough a rolling shutter camera will reduce the lag to the order of one line.

In terms of paleoacoustics applications, no cinematographer or camera man would ever do this deliberately in the normal course of filming something. 24fps footage for example normally aims for an exposure of 1/48th (180 degrees in cine terms) so things blur instead of warp.

1. BrightBlueJim says:
  
  August 7, 2014 at 6:12 pm
  
  That’s what I was thinking – they probably used a very fast shutter speed on the rolling shutter tests. AND an extremely stable platform for the camera. I doubt that this would work for just random video footage.
  
sjamaan says:

August 7, 2014 at 6:17 pm

You could use that to analyse some old movies and YouTube videos without sound to try and recover what was going on… even with the 60fps from the SLR you could tell the notes…

ERROR_user_unknown says:

August 8, 2014 at 3:13 am

fuck more tools for the fed.
