[Dennis] is on YouTube with his channel “Made By Dennis,” but for the record he is a maker, not a V-tuber. On the other hand, his latest project (creating a professional-level tracking rig with DIY IR cameras and a whole lot of moxie) does mean he’s now equipped to make the move to the prestigious, high-status world of pretending to be an anime girl.
That is of course not why he did it. Like most projects around here, the motivation was more a case of “I wonder if I can…”. In this case, [Dennis] wondered what it would take for him to pull off the same sort of optical motion capture, or MoCap, that is used in Hollywood studios. Optical MoCap has the advantage of being very precise, able to track things at high speeds, and not being in any way limited to the human form like the slew of AI-assisted methods hitting the market right now. The disadvantage is that you need to place markers on any part of your subject you want tracked, film them from all angles, and process a whole lot of pixels. In [Dennis]’s case, it ended up being about four billion. Keep in mind, too, that actually locating those points in 3D space depends on knowing exactly where your cameras are: if you want sub-millimeter precision, your cameras need to be fixed with sub-millimeter tolerance. It’s a big project, hence a long video, which is embedded below.
The DIY cameras use an AR0234 MIPI camera on a custom PCB with M12 lenses and IR filters. To improve the signal-to-noise ratio on optical MoCap, it’s standard to use near-IR light. The camera boards, as you might expect given the MIPI interface, hook into Raspberry Pi compute modules. The cheapest CM4 should work, though he’s using CM5s. The compute modules sit on custom boards that provide PoE and some other niceties, like a small microcontroller driven by the pulse-per-second pin to help trigger the cameras in sync.
Each camera gets a ring light of near-IR LEDs that pulse at 160 W, which would be way more than PoE is specced to provide, but since the LEDs are only on when the camera is taking a frame, the average power is well within allowable limits. With 16 cameras each having their own ring light, that’s a lot of near-IR photons. Don’t forget your safety squints!
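To put rough numbers on that duty-cycle argument, here is a back-of-the-envelope sketch. Only the 160 W peak comes from the project; the pulse length and frame rate below are assumed values for illustration, and the PoE budget used for comparison is the 25.5 W that IEEE 802.3at (PoE+) makes available at the powered device, since the article doesn't say which PoE class the boards use.

```python
def average_power_w(peak_w: float, pulse_s: float, fps: float) -> float:
    """Average electrical power of a light that is pulsed once per frame."""
    duty_cycle = pulse_s * fps  # fraction of each second the LEDs are lit
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("pulse is longer than the frame period")
    return peak_w * duty_cycle


PEAK_W = 160.0        # peak ring-light power, from the article
PULSE_S = 250e-6      # ASSUMED pulse/exposure length (not from the project)
FPS = 120.0           # ASSUMED frame rate (not from the project)
POE_PLUS_PD_W = 25.5  # IEEE 802.3at power available at the powered device

avg = average_power_w(PEAK_W, PULSE_S, FPS)
print(f"duty cycle: {PULSE_S * FPS:.1%}")
print(f"average LED power: {avg:.2f} W of a {POE_PLUS_PD_W} W budget")
```

With these made-up timings the LEDs are lit only a few percent of the time, so the average draw is a small fraction of the peak. The real figures depend on the exposure and frame rate [Dennis] actually runs.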
Rather than process the images with OpenCV, [Dennis] wrote his own solution optimized for this use case, which he reports is 300x faster. Luckily, he’s put his implementation on GitHub, along with the rest of the project. Even if you don’t have any V-tubing ambitions, this project is very impressive and worth checking out in its entirety.
Optical MoCap isn’t the only game in town, of course. If you want to do this cheap and easy, you can strap a bunch of IMU sensors to yourself; just don’t expect the same precision.
Thanks to [Dennis] for the tip!

“professional”
No, you don’t. At worst, you want precise angles, not positions. But in reality, you need to calibrate your cameras; they can be anywhere, whatever the tolerance. Put a single object in the scene and observe it with all your cameras. Move your object and do the same (you don’t even need to know how you moved your object).
There’s a single solution that would match all the observations. In practice, it’s a least-squares solution you’ll find by solving the linear system (of the projection by the pinhole equation). You’ll know both the object and camera poses (that is, attitude and position in 3D space). Bonus point: you can redo your calibration before each shot, since the rig is likely to move whenever it gets bumped anyway.
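As a rough illustration of the least-squares idea, here is a sketch of the simpler half of the problem: triangulating one marker when the camera poses are already known. It is not the calibration itself (which also treats the camera poses as unknowns and is usually solved iteratively), and it is not taken from the project's code. Each camera contributes a ray from its center through the observed marker, and the 3D point closest to all rays comes out of a 3x3 linear system. The camera positions and the 1 mm offset are made-up numbers.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def triangulate(centers, directions):
    """Point minimizing the summed squared distance to all camera rays."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        d = normalize(d)
        for i in range(3):
            for j in range(3):
                # P = I - d d^T projects onto the plane normal to the ray
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += p
                b[i] += p * c[j]
    return solve3(a, b)


# Hypothetical setup: three cameras (metres) looking at one marker.
marker = [0.2, 0.1, 1.5]
centers = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rays = [[m - c for m, c in zip(marker, ctr)] for ctr in centers]

exact = triangulate(centers, rays)

# Same observations, but we *believe* camera 0 sits 1 mm from where it is.
wrong_centers = [[-0.999, 0.0, 0.0]] + centers[1:]
shifted = triangulate(wrong_centers, rays)
error = math.dist(exact, shifted)
print(f"3D error from a 1 mm camera position error: {error * 1000:.3f} mm")
```

The second half shows why the argument in this thread matters either way: if the assumed camera position is off by a millimeter, the reconstructed marker moves by a comparable amount, whether the cure is rigid mounting or frequent recalibration.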
Yes. There should be some well-defined calibration triangle that is presented to all cameras at the same time. Trying to set up all cameras mechanically beforehand seems futile.
Should be a tetrahedron with IR LEDs and well defined lengths. That should allow for camera calibration, as it is moved around, and calibration of the cameras against each other.
And the input should not be filtered to just pixel on/off. Maybe to find the lights/reflectors, but after that use the intensity of the pixels for subpixel accuracy. Don’t just throw it away.
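A minimal sketch of what that looks like, assuming a grayscale image stored as a list of rows and a single marker blob in view: the threshold is only used to decide which pixels belong to the blob, and the pixel intensities then act as weights for the centroid. The function names and the sample image are made up for illustration; this is not the project's detector.

```python
def binary_centroid(image, threshold):
    """Centroid (x, y) treating every pixel above threshold as equal."""
    pts = [(x, y) for y, row in enumerate(image)
           for x, v in enumerate(row) if v > threshold]
    if not pts:
        return None
    return (sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts))


def weighted_centroid(image, threshold):
    """Centroid (x, y) weighting each blob pixel by its intensity."""
    total = sx = sy = 0.0
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            if v > threshold:
                total += v
                sx += v * x
                sy += v * y
    if total == 0.0:
        return None
    return sx / total, sy / total


# Made-up 5x5 blob whose brightness leans toward the right-hand side.
blob = [
    [0,  0,   0,   0, 0],
    [0, 40, 120,  80, 0],
    [0, 60, 200, 140, 0],
    [0, 40, 120,  80, 0],
    [0,  0,   0,   0, 0],
]

print("binary:  ", binary_centroid(blob, 20))
print("weighted:", weighted_centroid(blob, 20))
```

The on/off version can only report that the blob is centered on pixel column 2, while the weighted version lands a fraction of a pixel to the right, where the extra brightness says the marker really is.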
That’s a much fuller explanation. That said: the cameras must stay fixed to within that fraction of a millimeter from one another during filming, or you’ll lose precision. That’s what the article was trying to say.
Seems like it would make sense to use accelerometer data and average the cameras. No, it won’t be sub-millimeter, but asking a rig to be that stable is a big ask. I suppose you could use damped mounts?
The point of this was to answer how big an ask it really is. Turns out it’s actually achievable.
One way is to have a few markers stationary in the surroundings, so the errors can be compensated for each frame.
It looks like it can track features with markers on really well, but I’m not sure it’ll be that useful for Vtubing. The biggest challenges I’ve come across with Vtubing are tracking hands, fingers, thumbs and facial features including eyes: the bits that, on the whole, seem to benefit from having some level of AI model to track.
The bit about v-tubing was supposed to be a joke. That said, you can put a small marker on each finger for hand tracking and it should work great. Gluing retroreflective spheres to your pupils for eye tracking would be much less comfortable.
Need a way to get an eye shine like Riddick. Then it’ll be built in, and you can save on electricity by needing less lighting.
Sputter just enough aluminum to be half transparent onto hard contact lenses.
Like coating a telescope mirror, but cut short.
Retroreflective prisms have been glued to contact lenses and used in eyes.
Sounds like a DIY version of the system StuffMadeHere uses for his robotic sports equipment.
It’s called OptiTrack. It’s a well-known piece of equipment in robotics labs. You can see it in almost every robotics lab video.
FreeMoCap project? It’s working towards real time.
https://freemocap.org/
This is real-time.
Useful pieces of in-depth hardware knowledge in there.
Instructions unclear. Watched video 15 times to learn about pain.
“but for the record he is a maker, not a V-tuber. ”
A critical bit of information for this audience. 😂
Excellent equipment he owns.
Clever design! A bit overcomplicated with everything over LAN, would be easier with an RF sync.
Also a great opportunity for a Kalman filter to eliminate the final jitter in the dataset.
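For anyone curious what that might look like, here is a minimal sketch of the simplest possible version: a scalar Kalman filter with a constant-position model, applied to one coordinate of one marker. The noise variances and the sample data are made-up tuning values, and a real tracker would more likely use a constant-velocity state so that fast motion isn't smeared.

```python
def kalman_1d(measurements, process_var, meas_var):
    """Scalar Kalman filter with a constant-position (random walk) model."""
    x = measurements[0]  # state estimate, seeded from the first sample
    p = meas_var         # variance of that estimate
    out = [x]
    for z in measurements[1:]:
        p += process_var        # predict: position may have drifted a little
        k = p / (p + meas_var)  # Kalman gain: how much to trust the new sample
        x += k * (z - x)        # update the estimate toward the measurement
        p *= 1.0 - k            # the estimate is now more certain
        out.append(x)
    return out


# Made-up data: a stationary marker at 10.0 mm with +/-0.05 mm of jitter.
raw = [10.0 + (0.05 if i % 2 == 0 else -0.05) for i in range(20)]
smooth = kalman_1d(raw, process_var=1e-6, meas_var=2.5e-3)

raw_spread = max(raw[10:]) - min(raw[10:])
smooth_spread = max(smooth[10:]) - min(smooth[10:])
print(f"raw jitter:      {raw_spread:.4f} mm")
print(f"filtered jitter: {smooth_spread:.4f} mm")
```

The trade-off sits in `process_var`: a small value gives heavy smoothing but makes the estimate lag behind real movement, while a large value follows motion closely and lets more jitter through.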