High Quality 3D Scene Generation From 2D Source, In Realtime

Here’s some fascinating work presented at SIGGRAPH 2023: a method for radiance field rendering using a novel technique called Gaussian Splatting. What does that mean? It means synthesizing a 3D scene from 2D images, in high quality and in real time, as the short animation above demonstrates.

Neural Radiance Fields (NeRFs) are a method of leveraging machine learning to, in a way, do what photogrammetry does: synthesize complex scenes and views based on input images. But NeRFs work in a fraction of the time and require only a fraction of the source material. There are different ways to go about this, and unsurprisingly there tends to be a clear speed vs. quality tradeoff. But as the video accompanying this new work seems to show, clever techniques can deliver the best of both worlds.

A short video summary is embedded just below the page break. Interested in deeper details? The research PDF is here. The amount of development this field has seen is nothing short of staggering, and the results are certainly higher in quality than what was state-of-the-art for NeRFs only a year ago.

28 thoughts on “High Quality 3D Scene Generation From 2D Source, In Realtime”

  1. I still don’t get what the “original” is…
    Is it a series of photographs that are rendered 3D?
    Is it a single photograph?
    Is it a computer image?

      1. Yes, but what makes this really interesting is how they also applied a Fourier-transform filter (NOT! fast Fourier) to the individual vector components to re-align the main focal points of each frame, allowing for a smoother render in the end pass. Wait, what?!

    1. I haven’t grokked the paper and the code yet, but here’s my understanding so far.

      They take a set of still images and interpolate the objects, conceptually like photogrammetry interpolates objects.

      They then encode the objects as 3D gaussians: a 1D gaussian is a bell curve, a 2D gaussian is like a sand dune, and a 3D gaussian is like a sausage or gel-cap shape.

      Just as you can encode any 1D signal as a sum of sines and cosines, with higher frequencies making for higher resolution, you can encode objects as a sum of 3D gaussians, with smaller gaussians making for higher resolution of the 3D objects.

      Once you have the 3D objects encoded as gaussians, synthesizing a camera viewpoint from anywhere in the scene is easy: rotate and scale the gaussians and add them together (toy sketch at the end of this comment).

      Once you have trained on some images, you can move a virtual camera around the scene in real time.

      I think that’s the thrust of the paper, but note that I haven’t looked deeply into it yet.
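
      To make the “rotate and scale the gaussians and add them together” step concrete, here’s a toy NumPy sketch of how I picture the rendering pass. Everything in it (the render function, the camera parameters cam_R/cam_t/f, the per-gaussian arrays) is my own made-up interface, with isotropic blobs and a simple pinhole camera; it is definitely not the paper’s actual CUDA rasterizer:

      # Toy splatting sketch in plain NumPy: my own simplification, not the
      # paper's implementation. Isotropic blobs, pinhole camera looking down +Z.
      import numpy as np

      def render(means, radii, colors, opacities, cam_R, cam_t, f, W, H):
          """Project 3D gaussians into a W x H image and alpha-blend front to back."""
          img = np.zeros((H, W, 3))
          transmittance = np.ones((H, W))
          cam_pts = (cam_R @ means.T).T + cam_t      # world -> camera coordinates
          ys, xs = np.mgrid[0:H, 0:W]
          for i in np.argsort(cam_pts[:, 2]):        # blend near-to-far
              x, y, z = cam_pts[i]
              if z <= 0:
                  continue                           # behind the camera
              u = f * x / z + W / 2                  # perspective-project the center...
              v = f * y / z + H / 2
              sigma = f * radii[i] / z               # ...and the blob's size (in pixels)
              footprint = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
              alpha = np.clip(opacities[i] * footprint, 0.0, 0.99)
              img += (transmittance * alpha)[..., None] * colors[i]
              transmittance *= 1.0 - alpha
          return img

      # e.g. a single reddish blob five units in front of an untransformed camera:
      image = render(np.array([[0.0, 0.0, 5.0]]), np.array([0.3]),
                     np.array([[1.0, 0.2, 0.2]]), np.array([0.8]),
                     np.eye(3), np.zeros(3), f=400.0, W=64, H=64)

      The actual paper uses anisotropic covariances, spherical-harmonic colors, and a fast tile-based GPU rasterizer, but the sketch shows why a new viewpoint is cheap once the gaussians exist: it’s just projection and blending, no meshing.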

        1. Right, so the way I’d explain it, to sum it up: they’ve taken a lot of photos and basically exploded everything into little tiny particles. Those particles (Gaussians) are basically the DNA/bones of the photo set, and using mathematics and some machine-learning photo/video magic, the program connects all the “dots” and synthesizes a movable, explorable video or 3D experience.

          It’s an interesting way to achieve some form of photogrammetry. I’ve never even thought of this before.

    2. You were right the first time. The “original” is a series of photographs from different angles, processed to become a 3D scene you can interact with and explore. Kind of like photogrammetry, except with fewer photos and better results. And it does not involve meshing a point cloud (which is a big source of dog shit models).

  2. Been following NeRFs for a good while, mostly due to a coworker who is much more knowledgeable about the guts n gears involved than I am. The way it can render things like transparency and reflection, or lighting effects such as bloom… it’s pretty astounding. You have to see it to believe it (and you have to know a bit about how nasty these things looked just a few years ago).

  3. It’s not realtime; it can show something, but you have to wait, like progressive rendering. Of course it requires a 4090, and the quality is worse than professional photogrammetry software. The best is Reality Capture, which can run on a low-end PC. So this is all ads from Nvidia.

    1. This is not *just* an ad. Yes, it is not realtime end-to-end; the realtime part refers to rendering after the data processing has been done. And yes, it requires high-end hardware. I’m not sure where you are getting that it’s worse than existing photogrammetry software, though; that’s just incorrect. It’s also a lot faster.

    2. It uses CUDA? I think it’s a bit odd that universities allow a proprietary tie-in like CUDA, especially when there are alternatives.
      Regardless of whether they had any payments from, or contact with, Nvidia.

    3. Oh, and you are advertising software owned by Epic Games that requires ‘activation’ and all kinds of agreements, while complaining about ads…
      So do you have a list of which companies you think ads are allowed from and which ones not?

  4. I’m so powerfully ignorant. Taking a bunch of photos of something and then making a “3D scene”: isn’t that just… making a video? I thought the trick would be taking a single still image, some hand waving, AI, etc., and generating a 3D scene like in the title. But the comments lead me to believe otherwise.

    1. Imagine that you can create a video flyby of a scene, moving at angles that you do not have pictures for, without a 3D, mesh-based scene being generated. Imagine that the “scene” is built as a collection of clouds of light that don’t care about “solid,” “transparent,” or “reflective,” just the way light moves, and that from it you can render out arbitrary videos of in-scene motion. And you create the “scene” from just a handful of 2D images.
