It’s sometimes useful for a system to not just have a flat 2D camera view of things, but to have an understanding of the depth of a scene. Dual RGB cameras can be used to sense depth by contrasting the two slightly different views, in much the same way that our own eyes work. It’s considered an economical but limited method of depth sensing, or at least it was before FoundationStereo came along and blew previous results out of the water. That link has a load of interactive comparisons to play with and see for yourself, so check it out.

The FoundationStereo paper explains how researchers leveraged machine learning to create a system that can not only outperform existing dual RGB camera setups, but even active depth-sensing cameras such as the Intel RealSense.
FoundationStereo is specifically designed for strong zero-shot performance, meaning it delivers useful general results with no additional training needed to handle any particular scene or environment. The framework and models are available from the project’s GitHub repository.
Microsoft may have discontinued the Kinect and Intel similarly discontinued RealSense, but depth sensing remains an enabling technology that opens possibilities and gives rise to interesting projects, like a headset that allows one to see the world through the eyes of a depth sensor.
The ability to easily and quickly gain an understanding of the physical layout of a space is a powerful tool, and if a system like this one can deliver such fantastic results with nothing more than two RGB cameras, that’s a great sign. Watch it in action in the video below.
Oh Deities, an AI voice…
I’ll let it slide this time, but don’t do it again, nVidia. You have money enough to hire someone with actual educational skills to narrate the text.
Knowing Nvidia and how almost all of their profits are from selling proverbial shovels and pickaxes to the “AI” goldminers, I’d bet that using an AI voice was a hard requirement from the higher-ups.
You don’t actually need two cameras. You can recover 3D structure if you have a video shot using a moving camera (basically every video ever) where different frames offer different perspectives.
This is the foundational principle behind SLAM, which everyone seems to think is impossible (but which is readily available on GitHub in multiple implementations, most notably ORB-SLAM3).
The one catch is you don’t get scale unless you have objects in the environment you know the size of… or you have an accelerometer strapped to your camera, as all smartphones do.
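For the curious, here’s a minimal two-view sketch of that idea using plain OpenCV rather than ORB-SLAM3 (illustrative only; the intrinsic matrix K and grayscale frames are assumed to come from elsewhere). Note how the recovered translation comes back only as a direction, which is exactly the missing-scale catch:

```python
# Minimal two-view sketch of recovering camera motion from a moving camera
# with OpenCV (illustrative only -- not ORB-SLAM3). K is the camera intrinsic
# matrix from calibration; frame_a and frame_b are grayscale images.
import cv2
import numpy as np

def relative_pose(frame_a, frame_b, K):
    """Estimate rotation R and direction-only translation t between two frames."""
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)

    # Match ORB descriptors between the two views.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # Essential matrix from the matches, then decompose into rotation + translation.
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)

    # t comes back as a unit vector: the geometry alone gives no metric scale,
    # which is why an IMU or a known-size object is needed to pin it down.
    return R, t
```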
No they don’t, and claiming incorrectly that they do doesn’t bolster your argument.
Also, you can do everything with a single low-res camera connected to a potato, but you can’t do it well. The entire point of this is the quality of the output, and if you’d actually bothered to look at the website you’d have seen copious interactive examples of just how much better the output of this is over the alternatives.
This is a contest. The rules of the contest were to use stereoscopic images. They won the contest. By using moving cameras they could not have entered the contest.
There are dozens of ways to obtain 3D information from 2D images. Stereoscopic images are just one way to get this.
The advantage of using stereoscopic images vs. a moving camera is that you instantly have depth information, and if objects in the scene are moving it remains accurate. With a moving camera there is latency, and if objects are moving you get errors (think of those images of running pets in panorama shots).
Another point is that it is not either/or. This method could be combined with other methods if there is a moving stereoscopic camera.
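As a concrete illustration of why a calibrated stereo pair gives depth immediately: with a known focal length and baseline, metric depth follows directly from disparity. A minimal sketch, assuming a rectified pair (the function and parameter names here are just for illustration):

```python
# Depth from disparity for a rectified stereo pair: Z = f * B / d.
# focal_px (focal length in pixels) and baseline_m (baseline in meters) come
# from calibration, which is why stereo gives metric depth right away,
# unlike a single moving camera.
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    d = np.asarray(disparity_px, dtype=np.float32)
    depth = np.full(d.shape, np.inf, dtype=np.float32)
    valid = d > 0                      # zero disparity means no valid match
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```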
What part of this needed an NN? You’d think this stuff would just be done by phase detection techniques like a camera autofocus. Does the AI make some stage of that process (e.g. matching up the waveforms) better than algebra?
Phase detection autofocus tends to work on maybe 16 spots on a really high end camera. In order to do the kind of modeling they’re doing here, you’d need roughly one autofocus zone per pixel.
Fortunately they have that. This is pretty directly equivalent to a split-pixel system.
I bet you could do some cool stuff with camera autofocus, but it would require a lot of work and specialized gear. Cameras don’t report that raw autofocus information (not that I’ve seen, at least), and it can’t be added to standard image/video formats, so now you are looking at custom files, etc. Also, from my understanding I don’t think “every pixel is a split pixel” CMOS sensors are common, and they would probably be crazy expensive. I believe for most DSLR cameras it’s a small subset of the pixels.
I also don’t think it would be as fast. If you’ve played around with a DSLR before, you’ve probably noticed the autofocus isn’t instantaneous (they’ve gotten pretty snappy, but it isn’t instant). This is for two reasons: 1) Phase measurement on your image sensor only works if focus is already close enough to correct, which is why when it fails to find focus it looks like it’s “hunting”. Because that is what it’s doing: brute-force sweeping through the focal settings to find the closest match. 2) Once the closest match is found it’s an iterative process: check phases, get consensus on gradient direction across AF points, take a step, check again, etc.
All that just for answering “what is a single depth plane that fits my view the best”. It’s optimized for answering that, and doing it fast in a handheld system, and for that it does it really well. But it’s a square peg/round hole for the per-pixel depth measurements needed by most robotics applications.
That being said, you should check out the Lytro cameras, some really cool tech that unfortunately failed. It was a “light field” sensor rather than an “image sensor”: you could take a picture and decide the focal point later. Theoretically it captured all the depth information at once, but the company went under because that tech is really hard.
The TL;DR is that industry has moved to deep stereo because it’s the cheapest and most performant option available so far for this type of sensor (dense near-field depth aligned with an RGB image). It’s what all of the major robotics companies use in their stereo setups (Tesla, Waymo, etc.; even the off-the-shelf Stereolabs stereo cameras ship with their own deep-stereo SDK). Neural nets can learn semantic information about the local pixel region that things like AF points or stereo block matching (the off-the-shelf naive approach you get in OpenCV) can’t.
It turns out it’s just really hard to beat being able to understand, in each image, that the sink in front of the wall is a single contiguous object separate from the wall with clean boundaries, before you go in and try to associate which pixels match to which. That’s a pretty human interpretation of what this net does, but it’s a decent enough analogy. And neural nets can do this at 10 Hz or a lot more on a moving platform, depending on the image resolution and GPU, which is again critical for most robotics applications (although not for offline photogrammetry or structure-from-motion).
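For reference, the “off-the-shelf naive approach you get in OpenCV” looks roughly like this, a minimal sketch assuming a rectified grayscale stereo pair (the file names are placeholders); it matches small pixel windows with no notion of objects, which is where it falls apart at boundaries and on textureless surfaces:

```python
# Minimal OpenCV block-matching baseline for a rectified grayscale stereo pair.
import cv2

left  = cv2.imread("left.png",  cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Plain block matching: compares small windows along the scanline,
# with no semantic understanding of the scene.
bm = cv2.StereoBM_create(numDisparities=128, blockSize=15)  # numDisparities must be a multiple of 16

# OpenCV returns disparity as fixed-point int16, scaled by 16.
disparity = bm.compute(left, right).astype("float32") / 16.0

# Normalize to 0-255 for a quick visual check.
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```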
I skimmed it. Saw Nvidia, and that AI was a major talking point, then moved on to validate the “we’re number one on the leaderboards” claim. They’re no longer #1, and there’s been no GitHub activity for about a month. That leads me to believe the article here is based on the YouTube video itself, as there’s no mention of the other solutions that are now listed #1 on the leaderboard this project so proudly claims to be at the top of.
Just because things have changed since the paper was published doesn’t suddenly make their technique or their paper invalid.
But it does mean the article should be different.
I did something similar a few years ago as a proof of concept, just to see if I could. It turned into a poorly written multi-threaded C++ program. It uses no AI and no phase detection, just raw brute-force pixel matching. Super inefficient, and the results are pretty meh in comparison, but I did find that it will extract depth from Magic Eye pictures. The small startup company I’ve been working with decided to make it open source.
https://github.com/Haptic-Solutions/DepthExtrapolator
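For a rough idea of what “raw brute-force pixel matching” means here (this is an illustrative sketch in Python, not the actual multi-threaded C++ code in the repo above): for each pixel, slide a small window along the scanline and keep the offset with the lowest sum of absolute differences.

```python
# Illustrative brute-force block matching: sum-of-absolute-differences search
# over candidate disparities for every pixel. Deliberately simple and slow.
import numpy as np

def brute_force_disparity(left, right, max_disp=64, win=4):
    """left, right: rectified grayscale images as 2D uint8 arrays."""
    h, w = left.shape
    L = left.astype(np.int32)
    R = right.astype(np.int32)
    disp = np.zeros((h, w), dtype=np.int32)

    for y in range(win, h - win):
        for x in range(win + max_disp, w - win):
            patch = L[y - win:y + win + 1, x - win:x + win + 1]
            best_cost, best_d = None, 0
            for d in range(max_disp):
                cand = R[y - win:y + win + 1, x - d - win:x - d + win + 1]
                cost = np.abs(patch - cand).sum()   # sum of absolute differences
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```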
Just FYI, Intel RealSense is still in business and doing very well. They only discontinued their lidar cameras.