Linux Containers The Hard Way

If you want to make containers under Linux, plenty of high-level options exist. [Lucavallin] wanted to learn more about how containers really work, so he decided to tackle the problem using low-level kernel functions, and he shared the code with us on GitHub.

Containers are more isolated than ordinary processes but not quite full virtual machines. While a virtual machine creates a fake computer, a container is more like a fake operating system: applications run with their own idea of libraries, devices, and other resources, but the container doesn’t try to abstract the underlying hardware.

[Lucavallin] tells us the key features include namespaces, which group kernel resources into isolated sets and control which of them a process can see. The seccomp facility restricts which system calls a process may make, while the capabilities system limits what root can do inside the container. Finally, the cgroups system lets you cap resources so each container gets a fair share of things like CPU time or disk I/O.
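You can poke at most of these from a shell before writing any C. Here’s a minimal sketch using the unshare(1) and capsh(1) wrappers instead of raw syscalls; it assumes a root shell and a cgroup v2 hierarchy mounted at /sys/fs/cgroup, and names like my-container and demo are arbitrary. (Seccomp is the odd one out: a syscall filter has to be installed from code, which is exactly the sort of thing the low-level approach handles.)

    # New UTS, PID, and mount namespaces: the child gets PID 1 and its own hostname.
    unshare --uts --pid --mount --fork --mount-proc bash -c '
      hostname my-container
      hostname   # prints my-container; the host is unaffected
      ps ax      # sees only processes inside the new PID namespace
    '

    # cgroup v2: cap the current shell to roughly 25% of one CPU
    # (25 ms of CPU time per 100 ms period).
    echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control   # may already be enabled
    mkdir /sys/fs/cgroup/demo
    echo "25000 100000" > /sys/fs/cgroup/demo/cpu.max
    echo $$ > /sys/fs/cgroup/demo/cgroup.procs            # move this shell into the group

    # Capabilities: drop CAP_SYS_ADMIN from the bounding set, and even
    # "root" in the child can no longer mount filesystems.
    capsh --drop=cap_sys_admin -- -c 'mount -t tmpfs none /mnt'   # fails with EPERM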

These capabilities are available in kernels starting with version 6.0.x, so you’ll need that. In addition, namespaces and cgroup v2 have to be enabled. If you aren’t sure, skim your /boot/config-* file (use the one that matches what uname -r tells you). For the user namespace, for example, you should find CONFIG_USER_NS set to y. You can also look at /proc/self/ns and see if it has the namespace you are looking for. If you want to be sure cgroup v2 is enabled, try “grep cgroup /proc/filesystems” and you should see a “cgroup2” entry.
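Put together, the checks look something like this (the config file name follows the usual distro convention of embedding the kernel release):

    # Does the running kernel have user namespaces compiled in?
    grep CONFIG_USER_NS "/boot/config-$(uname -r)"   # expect CONFIG_USER_NS=y

    # Which namespaces does this process already have handles for?
    ls -l /proc/self/ns

    # Is cgroup v2 available?
    grep cgroup /proc/filesystems                    # expect a cgroup2 line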

Do you need to roll your own container solution? No. Do you want to? We do, because we love to learn more about why things work the way they do on a Linux system.

10 thoughts on “Linux Containers The Hard Way”

  1. These capabilities are available in kernels starting with version 6.0.x,

    Which capabilities exactly? Because AFAIK most of them (namespaces, seccomp, capabilities, cgroups) have existed for quite some time.

    1. Capabilities go back to kernel 2.1 in 1998, but they didn’t really work right. The capabilities we know were merged in 2.6.24 at the end of 2007. Not sure on the others, but I remember there was mailing-list drama in 2006 when Madore wanted to fix up capabilities into something actually useful. And it was a feature I cared about, so I paid attention.

    1. FreeBSD jail gets you close. Chroot on its own isn’t sufficient for containers. A union mount helps. But isolating networking, process IDs, shared memory, and other things would be missing. More importantly, the ability to compose a group of dependent containers is also lacking until you start diving into kernels that support something like namespaces.
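      To make the distinction concrete, here’s a rough illustration; /srv/chroot stands in for a prepared directory tree, and both commands assume root:

        # chroot alone: the filesystem root changes, but the process still
        # shares the host's PID table, network stack, and shared memory.
        chroot /srv/chroot /bin/sh

        # chroot wrapped in namespaces: PID 1 in its own PID namespace,
        # with empty network and IPC namespaces of its own.
        unshare --pid --net --ipc --mount --fork chroot /srv/chroot /bin/sh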

    1. Back in the day, Solaris (what many people thought of as THE Unix) wasn’t POSIX compliant. The attitude was “I’m UNIX, I’m better than POSIX.” Then NT had a POSIX subsystem. Useless, except for checking a box that said “yep, we support this specific test you can throw at us.” Sun then made Solaris POSIX compliant.

  2. Ten years ago, I patched a kernel with the grsecurity patches to try and make a Linux server survive in a hostile environment without getting hacked. I manually built versions of nginx, PHP, and MySQL that could live happily in hardened chroot jails. Despite being left out like a lamb, the server was never hacked before it was retired in 2021 when the business closed. I admittedly stopped following hardened Linux solutions after implementing that one, so it’s interesting to see what’s available now in the mainline kernels. Evidently, grsecurity is still a thing but has lost the war. My brief curiosity reading suggests it’s a hell of a lot easier to accomplish much of what I implemented years ago by leveraging containers.

    1. Many of the container _solutions_ could be described that way, certainly.

      But the concept itself is a good one (if you have certain needs). Much of the drama and stress comes from the fact that large devops environments have _very_ different expectations and needs than more traditional Unix environments. The burden of those unneeded details often makes it hard to reuse those solutions for non-fleet needs.

      It’s not actually that hard to implement containerization from scratch, with the option to make it fit your needs exactly. Or, if your needs are just a little less specialized, you can use something like bwrap to set up your initial isolation.

      You can, for example, set up a loopback interface and a macvlan interface in an isolated netns, then use bwrap to fully isolate a chroot (see the sketch at the end of this comment). Processes within it cannot see the PIDs, filesystems, UIDs, networks, etc. of the rest of the machine, and the macvlan gives you a simple, isolated network path to the outside physical network with much less need to play around with firewall isolation.

      The result is a chroot, not a full rootfs. You don’t need to (and cannot) set up hardware, but you can configure networks, use runit or another service manager to create interdependent services, and so on. The host machine and the isolated chroot are layer-two network neighbors, with no more and no less security than if they were both separate machines plugged into the same external hub.

      It doesn’t scratch every itch, but it’s a surprisingly useful tool to keep in the toolbox.

      If bwrap itself doesn’t quite meet your needs, none of the underlying syscalls are particularly complicated. There have been several “container technology from scratch” walkthroughs designed to familiarize people with the capabilities. It’s not hard to wrap your head around enough of the details to sketch out just the bits you need…
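      Here’s the promised sketch. Treat it as a hedged illustration rather than a recipe: the interface name (eth0), the addresses, and the chroot path (/srv/chroot) are placeholders for whatever your network and filesystem actually look like. Run as root.

        # Isolated network namespace with a loopback plus a macvlan child of
        # the physical NIC, so the sandbox is a layer-2 neighbor of the host.
        ip netns add sandbox
        ip link add mv0 link eth0 type macvlan mode bridge
        ip link set mv0 netns sandbox
        ip netns exec sandbox ip link set lo up
        ip netns exec sandbox ip addr add 192.168.1.50/24 dev mv0
        ip netns exec sandbox ip link set mv0 up
        ip netns exec sandbox ip route add default via 192.168.1.1

        # Enter that netns, then let bwrap unshare everything else
        # (users, PIDs, IPC, UTS) around a plain chroot directory tree.
        ip netns exec sandbox bwrap \
            --unshare-user --unshare-pid --unshare-ipc --unshare-uts \
            --bind /srv/chroot / \
            --proc /proc --dev /dev \
            /bin/sh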
