Isaac Asimov wrote The Caves of Steel in 1953. In it, he mentions something called trimensional personification. In an age before WebEx and Zoom, imagining that people would have remote meetings replete with 3D holograms was pretty far-sighted. We don’t know if any Google engineers read the book, but they are trying to create a very similar experience with Project Starline.
The system is one of those that seems simple on the face of it, but we are sure the implementation isn’t easy. You sit facing something that looks like a window. The other person shows up in 3D as though they were on the other side of the window. Think prison visitation without the phone handset. The camera is mounted such that you look naturally at the other person through your virtual window.
Since you are sitting in a relatively fixed position, making a 3D display without headgear is much easier. From the video demonstrations, the display is awfully good, too. Of course, there are only a few Starline setups in Google offices today, but it does give you an idea of where things are probably going.
Then again, there’s no reason you couldn’t try cooking something like this up on your own. Granted, making a really good 3D display is still pretty difficult, but you could always go retro.
36 thoughts on “Project Starline Realizes Asimov’s 3D Vision”
Ah great. Add some “agony” to that booth and realize our Trek fantasies.
Thing is these booths run smack into the desire for telecommuting because only the most monied (and hooked up) will be able to use these.
That’s how all technologies start out. Computers were exceptionally expensive machines when they got started but now you would be hard pressed to buy something that doesn’t have a computer in it.
i always figured that would work with some kind of neural scanner. which is likely a bit of kit in the medical scanning field. this of course would be tied to a cnc torture machine that fires laser pulses at all your nerve endings. don’t ever vote for me.
Deep learning researchers are producing systems that can synthesize plausible front-on facial images from simple off-axis cameras. That’s going to show up in standard video conferencing applications soon enough, using the standard webcam or phone camera.
So you need multiple cameras to capture proper depth information. This information is sent across, and you need to be rendered in 3D on the other end, projected according to how the other person is looking at you. So their eye position also needs to be calculated. You don’t really need a 3D display to be very effective, but it helps. The eye tracking and corresponding rendering is what makes the “screen” disappear and seem like glass instead.
There is another, more cheesy, way to do this: Use a screen with an array of pinhole cameras behind it. Choose one camera image to send across to the other side, according to the eye position of the remote viewer. I suppose it would “pop” when the camera changed, so to avoid this, choose the 4 closest cameras and use bilinear interpolation to compute the image to send across. I imagine you’d run into a bunch of alignment issues, so throw an AI at it to solve them. :-)
This approach breaks as the viewer changes distance from the screen.
That’s true. Making people sit in a chair helps address that.
The problem could be solved by using the appropriate light field, which is probably being captured by those cameras already. You’d just need to choose the appropriate sets of pixels to combine. That’s really what was being done already, but just being simplistic about which pixels to choose.
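The four-nearest-camera blend suggested in the pinhole-array idea above can be sketched roughly as follows. This is only a minimal illustration, not anything Starline is known to do: it assumes the cameras sit on a regular grid of at least 2×2 and that the remote viewer's eye position has already been mapped into grid units. The function names and parameters are made up for the example.

```python
import math

def bilinear_camera_weights(eye_x, eye_y, cols, rows):
    """Return [((col, row), weight), ...] for the four cameras nearest the eye.

    eye_x, eye_y: remote viewer's eye position in camera-grid units
    (assumption: (0, 0) is the top-left camera, (cols - 1, rows - 1) the
    bottom-right). Requires cols >= 2 and rows >= 2.
    """
    # Clamp so an eye outside the array falls back to the edge cameras.
    x = min(max(eye_x, 0.0), cols - 1)
    y = min(max(eye_y, 0.0), rows - 1)
    # Corner of the enclosing cell, kept one short of the edge so x0 + 1 exists.
    x0 = min(int(math.floor(x)), cols - 2)
    y0 = min(int(math.floor(y)), rows - 2)
    fx, fy = x - x0, y - y0
    return [
        ((x0, y0), (1 - fx) * (1 - fy)),
        ((x0 + 1, y0), fx * (1 - fy)),
        ((x0, y0 + 1), (1 - fx) * fy),
        ((x0 + 1, y0 + 1), fx * fy),
    ]

def blend_pixel(values, weights):
    """Blend one pixel value from each of the four cameras, in weight order."""
    return sum(v * w for v, (_, w) in zip(values, weights))

weights = bilinear_camera_weights(1.25, 0.5, cols=4, rows=3)
```

The weights always sum to one, so the image fades smoothly between cameras instead of popping. It does nothing about the distance problem raised in the replies; that still needs a light-field style selection of pixels.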
Cheesy approach #2: For the display, use an angled half-silvered mirror that shows a screen above/below/or off to one side. Then put a camera on a motion rig behind the mirror. The position and angle of the camera is controlled according to the position and gaze of the remote viewer. If you use some kind of mechanical tracking, then no computing power is needed here.
Btw, you may have already noticed a limitation for these systems: one viewer (per side) only. If you wear shutter glasses, or use some other means to make sure each viewer sees a unique image, then you can have more.
So are they tracking the head position and modifying what is displayed based on head position?
And how does it handle 3, 4, 5 or 20 people simultaneously in each room?
I doubt it does. That’s why the demo has the child in the Mom’s lap.
There are videos on the project page with a moving camera behind the person talking, and the 3D effect does not seem to go away just because the camera goes to the side of the person in front of the mirror. You are thinking of something like the experiments Johnny Lee got famous for in the mid-2000s?
I think the interesting part seems to be the display technology used.
I do not know if Al is referring to the Pepper’s ghost pyramid “hologram” as a joke (as it is just a translucent plastic pyramid reflecting four different images, one on each side).
(I am still sore that no one got my joke about Pepper’s ghost on the video with a similar pyramid displaying an Iron Man suit.)
Most likely, the videos on the project page were made by tracking the camera instead of tracking the person viewing (ie, treating the camera as the viewer).
Why would he call it a 3D display in that case? It would just be a normal display that changes its content depending on the viewer’s position.
I have done development with multiple types of 3D cameras: time of flight (Fotonic, Kinect), structured light (Orbbec) and others.
So the 3D data collection is very familiar ground for me.
I am not as convinced as you that the display uses the point of view to create the 3D feeling. I am hoping for a 3D display with quality as good as the one in the video.
It could just be stereoscopic. If it were some amazing new 3D display technology, I’m sure the hype would focus on that instead of the remote connection part.
I wonder how long until they add the overlay with exaggerated face temperature difference and neural network amplified micro expression detection. Knowing google, all the interactions using Starline are permanently recorded and will be used for training data at a later date.
At 27 to 28 seconds in, you can just make out the cameras.
Old news: https://www.youtube.com/watch?v=9YOEEpWAXgU
Lol. But not the same
This is amazing 👏. Would be great to make a diy version.
I’m on it!
…when I have time ⏲️😏
A good excuse to keep that old 3D TV out of the dumpster :)
We can’t even get enough bandwidth to make a “flat” image display in a semi-realistic fashion (smooth movement, etc.). This will never work in Australia – at least on our NBN :lol:
Try flipping your NBN modem upside down. That usually helps.
Heinlein wrote a 1942 story, “Waldo”, which is how telemanipulation got its name.
And in the sequel, The Naked Sun (1956), they can walk around outdoors visiting virtually. This is made possible by the precision with which the robots can move with cameras and projectors.
Asimov has a cultural insight in which there is a difference between ‘viewing’ and ‘seeing’. People will view in any state of dress or situation but be appalled if seen at anything but their best. I notice a trend like this with Zoom and Facetime and QQ and WeChat, etc.
Now we just need one of the participants to be holding a telephone handset with a curly cord.
A little late for the covid party, also feels like a prison’s visitor room :D
Why do you need depth if you can have a stereo camera with a 3D screen?
Also, can’t 3D TVs display split images as 3D? In that case Skype could be used; you would just need to merge the stereo cameras into a split screen.
The “compression” — from a few artifacts (hair, baby moving hand quickly) it looks like it’s making a 3D model, skinning it, and presenting that on the other side. Cool.
But somehow creepy. He says, even though phone calls have been linear speech models for decades now…
Don’t they say it directly? They basically use something like https://hackaday.com/2012/02/27/make-any-photo-3d-using-the-gimp/ but start with real-time 3D models instead of 2D pictures. I would suspect this needs some serious hardware to encode and decode the changes in the model in real time, not to mention render the output.
But what about the bandwidth requirements and latency? We’re seeing it at its best, but show me that on a connection that we can reasonably expect and I’ll be impressed. The video and site don’t go into how much data we’re seeing, compressed or not. The only way to deal with latency on a high bandwidth connection is to prioritize the bits, and if EVERYONE is prioritizing bits, they’re all prioritized.
Still, neat idea.
There are various ways to transmit a 3D representation. One way is to send a regular color image as well as a depth image. Of course, since only one viewpoint is captured, there will be artifacts when rendering the scene from a different viewpoint (where data wasn’t captured). Capturing more views can address this, but it multiplies the bandwidth issues unless some fancy compression is used to avoid sending redundant data.
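To make the color-plus-depth idea concrete, here is a rough sketch of how a receiver might re-render one captured pixel for a viewer whose eye has moved. It assumes a simple pinhole camera model with made-up intrinsics (focal lengths fx, fy and principal point cx, cy in pixels) and handles translation only; none of this comes from the Starline material.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth in metres -> camera-space point (x, y, z)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def project(point, fx, fy, cx, cy):
    """Camera-space point -> pixel coordinates (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def reproject(u, v, depth, intrinsics, eye_offset):
    """Where a captured pixel lands when the virtual camera moves by eye_offset.

    eye_offset is the (x, y, z) shift of the viewer in metres, expressed in
    the capturing camera's frame. Rotation is ignored in this sketch.
    """
    x, y, z = unproject(u, v, depth, *intrinsics)
    ox, oy, oz = eye_offset
    return project((x - ox, y - oy, z - oz), *intrinsics)

# Hypothetical 640x480 camera; the viewer moves 10 cm to the right.
intrinsics = (500.0, 500.0, 320.0, 240.0)
far = reproject(320, 240, 2.0, intrinsics, (0.1, 0.0, 0.0))
near = reproject(320, 240, 1.0, intrinsics, (0.1, 0.0, 0.0))
```

The near point shifts twice as far across the image as the far one, which is exactly the parallax that sells the window illusion. It also shows where the artifacts come from: whatever the near point was hiding now needs pixels that a single camera never captured.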
You can imagine lots of ways to process or preprocess the data to minimize bandwidth, but each has its own issues. For instance, you could try to generate and transmit an accurate initial model as a preprocessing step, then try to capture and send only differences in real-time. Figuring out the differences is an interesting problem.
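As a toy version of the "send only the differences" idea, the sketch below treats a frame as a flat list of depth samples and transmits just the samples that moved by more than a threshold. The representation, the threshold, and the function names are all assumptions for illustration; a real system would work on meshes or compressed video rather than raw lists.

```python
def encode_delta(receiver_state, current, threshold):
    """Return [(index, new_value), ...] for samples that changed noticeably.

    receiver_state is the sender's copy of what the receiver currently holds,
    not the previous raw frame, so small skipped changes cannot accumulate
    into a large unsent error.
    """
    return [
        (i, new)
        for i, (old, new) in enumerate(zip(receiver_state, current))
        if abs(new - old) > threshold
    ]

def apply_delta(state, delta):
    """Apply a delta to a copy of the state and return the updated copy."""
    updated = list(state)
    for i, value in delta:
        updated[i] = value
    return updated

initial = [1.0, 2.0, 3.0]          # sent once, up front
frame = [1.0, 2.5, 3.001]          # next captured frame
delta = encode_delta(initial, frame, threshold=0.01)
received = apply_delta(initial, delta)
```

Only one of the three samples crosses the wire here. The hard part the comment alludes to is doing this on geometry that deforms, where "the same sample" in two frames is not simply the same index.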
As far as latency, there’s a reasonable amount of leeway here, since the interaction is a conversation. Most folks will tolerate a fair fraction of a second in this, depending on the back-and-forth speed of the conversation. You can see the upper bounds of what folks tolerate when you see folks interacting over a geosynchronous satellite link, where the latency can be up to one second.
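The satellite figure is easy to sanity check. A geostationary satellite sits about 35,786 km above the equator, so the sketch below computes the best-case propagation delay for a ground station directly underneath it. The extra_processing_s parameter is a placeholder for coding and routing delays, which are not known here.

```python
C_KM_PER_S = 299_792.458      # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786.0    # geostationary altitude above the equator

def one_way_delay_s(extra_processing_s=0.0):
    """Ground -> satellite -> ground for a single hop, best case."""
    return 2 * GEO_ALTITUDE_KM / C_KM_PER_S + extra_processing_s

def round_trip_delay_s(extra_processing_s=0.0):
    """Time from speaking until the other side's reply can start arriving."""
    return 2 * one_way_delay_s(extra_processing_s)

print(f"one way:    {one_way_delay_s():.3f} s")
print(f"round trip: {round_trip_delay_s():.3f} s")
```

That works out to roughly a quarter of a second each way and about half a second for the round trip before any processing, so real links with longer slant paths and codec delays landing somewhere short of a second is plausible.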
Well I live in London now. I was born in New Zealand. So when I catch up with my friends I should spend a couple thousand £ and endure the 2×12 hour flights with 4-8 hour stopover to do that?
What!? You mean IRL? People don’t do that any more, do they?
The right-hand picture looks like Guardian Bob after he got back from Mainframe to help Dot Matrix and Enzo defeat Megabyte and Hexadecimal… (ReBoot – animated series)
Nice work by Evil Incorporated. I wonder how they will utilize it against us.
I guess this is related to the ‘Glasses Free 3D’ tech of the Nintendo 3DS and some others. Using a striped film on a regular display, you serve different pixel stripes to each eye, like those rulers with moving dinosaurs we had in school, or 3D postcards. The display I had on a Toshiba Qosmio F750-125 used a diagonal-stripe lens film, with webcam face tracking to work out which eye could see which pixel.
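For anyone wanting to play with the striped-film idea, the core of it is just interleaving two views. The sketch below assumes the simplest case of vertical stripes one pixel wide, with images held as plain lists of rows; the swap flag stands in for what face tracking would toggle as the head moves sideways. Diagonal stripes like the Qosmio's would need a different mask, and the names here are made up.

```python
def interleave_columns(left, right, swap=False):
    """Build the panel image by alternating columns from two views.

    left, right: equal-sized images as lists of rows of pixel values.
    swap: when True, even columns take the right view instead of the left
    (hypothetical hook for head tracking).
    """
    if len(left) != len(right) or any(
        len(a) != len(b) for a, b in zip(left, right)
    ):
        raise ValueError("left and right images must have the same size")
    even, odd = (right, left) if swap else (left, right)
    return [
        [(even if x % 2 == 0 else odd)[y][x] for x in range(len(left[y]))]
        for y in range(len(left))
    ]

left_view = [["L"] * 4 for _ in range(2)]
right_view = [["R"] * 4 for _ in range(2)]
panel = interleave_columns(left_view, right_view)
shifted = interleave_columns(left_view, right_view, swap=True)
```

Each eye ends up with half the horizontal resolution, which is the usual price of this kind of glasses-free display.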