Linux Fu: Don’t Share Well With Others

In kindergarten, you learn that you should share. But for computer security, sharing is often a bad thing. Linux has supported namespaces for a long time: mount namespaces date back to kernel 2.4.19, and PID and network namespaces arrived around version 2.6.24. Even so, few people use them directly, although the tools to manipulate them exist. Granted, you don’t always need namespaces, but they are one of those things that are priceless when you do need them. In a nutshell, namespaces let you give a process its own private resources and, more importantly, prevent a process from seeing resources in other namespaces.

Turns out, you use namespaces all the time because every process you run lives in some set of namespaces. I say set, because there are a number of namespaces for different resources. For example, you can set a different network namespace to give a process its own set of networking items including routing tables, firewall rules, and everything else network-related.

So let’s have a look at how Linux doesn’t share names.

The possible namespaces are:

  • Mount – File system mounts. It is possible to share mounts with other namespaces, but you have to do so explicitly.
  • UTS – This namespace controls things like hostname and domain name.
  • IPC – A program with a separate IPC namespace will have its own message queues, semaphores, shared memory, and other interprocess communications items.
  • Network – Processes in the namespace will have their own networking stacks and related configurations.
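  • PID – Processes in a PID namespace can’t see processes outside the namespace.
  • Cgroup – A namespace that provides a virtualized view of the cgroup hierarchy, which the kernel uses to manage resources such as CPU and memory.
  • User – User and group IDs. A process can have a different ID, even root, inside the namespace than it has outside.
  • Time – Offsets for the monotonic and boot-time clocks, available in newer kernels.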

Obviously, some of these are more useful than others. It is easy to see, however, that if you had a system of cooperating programs, you might find it attractive to create a private space for IPC or networking between them.
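
As a small taste of what a private networking space looks like, here is a minimal sketch that lists the network devices visible inside a brand-new network namespace. It assumes root access via sudo and the ip tool from iproute2; neither assumption comes from the text above.

```shell
#!/bin/sh
# Sketch: list the network devices inside a fresh network namespace.
# Assumes sudo and the iproute2 "ip" command are available.
if links=$(sudo unshare --net ip -o link show); then
    # A new network namespace typically holds only the loopback device,
    # and it starts out DOWN.
    echo "$links"
else
    echo "could not create a network namespace (needs root and iproute2)"
fi
```

Compare the result with a plain ip -o link show on the host, which will usually list your real interfaces as well.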

Go to Shell

If you want to experiment with namespaces from the shell, you can use unshare. The name might seem odd, but the command takes its name from the fact that a new process typically shares the namespaces of its parent. The unshare command lets you create new namespaces.

One key feature or quirk of unshare shows up with PID namespaces. By default, unshare runs your program directly, and that program keeps its original process ID; only the children it creates land in the new PID namespace. You can add the --fork option to make unshare fork first, so the program you name runs as a child inside the new namespace, which is more what you’d expect.

For example, let’s start a new shell in its own private Idaho:

sudo unshare --pid --fork --mount-proc /bin/bash
ps alx

If you try that command without a separate namespace, you’ll get a long list of processes. But the output inside our new namespace is much less bulky:

F   UID     PID    PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND 
4     0       1       0  20   0  10820  4376 -      S    pts/6      0:00 /bin/bash 
0     0       9       1  20   0  12048  1168 -      R+   pts/6      0:00 ps alx

You do have to think a bit about how the different utilities work. For example, ps reads from /proc, so if we didn’t provide --mount-proc, it would still display all the processes on the host. (Try it.) You wouldn’t be able to interact with them, but since you can still read the old /proc, you’d still see them. The --mount-proc flag is really just a shorthand for --mount (to get a new mount namespace) followed by mounting a fresh proc filesystem.
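
To make that shorthand concrete, here is a rough sketch of the longhand version: create the PID and mount namespaces, then mount proc by hand before running ps. It assumes sudo, and it is an approximation of what --mount-proc does rather than an exact equivalent.

```shell
#!/bin/sh
# Sketch: roughly what --mount-proc does, spelled out by hand.
# Assumes sudo; the first process started becomes PID 1 of the new namespace.
if listing=$(sudo unshare --pid --fork --mount sh -c \
        'mount -t proc proc /proc && ps ax'); then
    echo "$listing"    # only the processes inside the new namespace
else
    echo "could not set up the namespaces (needs root)"
fi
```

Because the proc mount happens inside the new mount namespace, the host’s own /proc is left alone.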

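Omitting the --fork option will cause strange shell behavior because the shell itself stays in the old PID namespace while the processes it spins off land in the new one. The first of those children becomes PID 1 there, and once it exits, the namespace cannot accept new processes, so later commands tend to fail with fork errors.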

If you add a filename to most of the namespace options (like --pid=file or --mount=file), unshare bind mounts the new namespace onto that file, which creates a persistent namespace that other processes can join later with nsenter. You can also use virtual ethernet adapters (type veth) or a network bridge to expose a network in one namespace to another.
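
A persistent namespace can be sketched as follows, loosely following the example in the unshare man page. It assumes sudo and the nsenter tool from util-linux; the hostname persistent-demo and the temporary file are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch: keep a UTS namespace alive by bind mounting it onto a file.
# Assumes sudo; "persistent-demo" is an arbitrary example hostname.
ns_file=$(mktemp)                        # empty file to mount the namespace on

sudo unshare --uts="$ns_file" hostname persistent-demo
sudo nsenter --uts="$ns_file" hostname   # a separate process sees the new name
echo "host is still: $(uname -n)"

sudo umount "$ns_file"                   # drop the bind mount to release the namespace
rm -f "$ns_file"
```

The namespace outlives the first command only because of the bind mount, so unmounting the file is what finally lets the kernel discard it.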

Mounts and More Options

Another useful isolation is in the mount table. Mount namespaces are more flexible than the others because you can choose how mount and unmount events propagate between them. If you want total privacy, you can have it (private), but you can also share events within a group (shared), or receive changes from another group without propagating your own (slave). The unshare --propagation option selects the mode, and you can read more on the man page.
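
Here is a minimal sketch of the fully private case. It assumes sudo and uses a scratch directory from mktemp as the mount point; the tmpfs mount is just a convenient stand-in for any file system.

```shell
#!/bin/sh
# Sketch: a mount made inside a private mount namespace stays invisible outside.
# Assumes sudo; the scratch directory comes from mktemp.
dir=$(mktemp -d)

# Inside the namespace: mount a tmpfs and show its entry in the mount table.
sudo unshare --mount --propagation private sh -c "
    mount -t tmpfs tmpfs '$dir' &&
    grep ' $dir ' /proc/self/mounts
"

# Back in the original namespace, the mount table has no such entry.
if grep -q " $dir " /proc/self/mounts; then
    echo "mount leaked to the host"
else
    echo "mount is not visible outside"
fi

rmdir "$dir"    # succeeds because nothing is mounted here on the host
```

Swapping private for shared or slave is a quick way to see how the other propagation modes behave.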

One interesting thing is that since the namespaces are isolated, it is possible for a normal user to have quasi-root privileges in the new namespaces. The --map-root-user option allows for this by creating a user namespace in which your user ID maps to root. It also denies calls to setgroups inside the namespace, which could otherwise let a user gain elevated permissions.
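
The following sketch shows the idea without sudo. It assumes your kernel allows unprivileged user namespaces, which some distributions and container sandboxes disable; the hostname sandbox-demo is an arbitrary example.

```shell
#!/bin/sh
# Sketch: act as root inside new user and UTS namespaces, without sudo.
# Fails where unprivileged user namespaces are disabled.
outside_name=$(uname -n)

unshare --user --map-root-user --uts sh -c '
    echo "uid inside: $(id -u)"          # prints 0 if the mapping worked
    hostname sandbox-demo                # allowed: we are root in here
    echo "hostname inside: $(uname -n)"
' || echo "could not create the namespaces (user namespaces may be disabled)"

echo "hostname outside: $(uname -n)"     # unchanged on the host
```

The root identity only counts inside the namespace. Files and processes on the host still see your ordinary user.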

There’s more, of course. If you have util-linux installed, just ask for the unshare man page to read more. If you want to use these things in a program, which is probably the easier use case to imagine, there is an unshare system call. Use man 2 unshare to see the details. Note that you can exercise even more control with the system call. For example, with CLONE_FS you can stop sharing file system attributes such as the root directory, current directory, and umask. The call is closely tied to the clone system call, which is sort of a super version of fork.

You might find it interesting that all the namespace data for a process show up in /proc. For example, try:

sudo ls -l /proc/$$/ns/*

You’ll see specialized symlinks with information about the different namespaces for the current process. For example:

lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/net -> 'net:[4026531992]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/time -> 'time:[4026531834]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/user -> 'user:[4026531837]'
lrwxrwxrwx 1 alw alw 0 Dec 8 07:29 /proc/2182630/ns/uts -> 'uts:[4026531838]'
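
Since two processes share a namespace exactly when these links point to the same ID, you can compare them from a script. This is a minimal sketch; the unshare step assumes unprivileged user namespaces are enabled and falls back to a placeholder string if they are not.

```shell
#!/bin/sh
# Sketch: compare namespace IDs to see which processes share a namespace.
mine=$(readlink /proc/$$/ns/uts)
child=$(sh -c 'readlink /proc/$$/ns/uts')    # ordinary child: inherits ours
echo "shell:    $mine"
echo "child:    $child"

# A child started through unshare gets a fresh UTS namespace.
# --user --map-root-user avoids sudo but may be disabled on some systems.
isolated=$(unshare --user --map-root-user --uts readlink /proc/self/ns/uts) \
    || isolated="(unshare failed)"
echo "unshared: $isolated"
```

The first two lines should show the same ID, while the third should differ whenever unshare succeeds.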

This is one of those Linux-isms that is somewhat obscure but can be very useful when you need it. Even if you don’t need it right now, it is worth understanding because it just might solve your next development challenge. Sure, you could run your program in its own virtual machine, but that’s a pretty heavy option compared to simply isolating what you want in a clean and simple way. Even from a shell script.

14 thoughts on “Linux Fu: Don’t Share Well With Others”

  1. Between unshare and chroot and docker it seems like there are three hacks and no good answers.
    This is high security stuff and should be well thought out and well architected but this is just a mess. Mountains of code with little oversight for critical security issues, what could possibly go wrong? There are undoubtedly multiple drop dead fix now bugs hiding in there just waiting to be discovered.

    1. that was my feeling as well when i set up docker. it’s so massively complicated, and it’s clearly changing very quickly. obviously a lot of people who are even very seasoned at installing containers do not realize all the implications, and i’m not convinced that there’s anyone who really understands all of it.

      the funniest realization for me was that anyone who is authorized to run the docker commandline utility can easily get root privileges on the host. it’s not necessarily an unworkable design but when my naive self stumbled onto that fact, i found it hard to imagine that there aren’t a ton of people out there who didn’t realize they’re handing out host root. the virtualization isn’t nearly as complete as you might imagine.

      i found lxc a little easier to use, able to run without root, and a little more transparent about how it uses the underlying kernel magic that makes it all happen. but there were still lots of things about the namespace isolation that i wasn’t sure i understood.

      i mean, i’m complaining about being ignorant, clearly. but that’s one of the hazards when you introduce facilities that aren’t inspectable with the old interfaces. it reminds me of how hidden network interface state accumulated that ifconfig couldn’t touch until the ‘ip’ tool took off that has an interface more representative of the actual interface state. presumably this stuff will settle down somewhat and the tooling and practice will catch up.

      1. “[T]he virtualization isn’t nearly as complete as you might imagine.” because it is NOT virtualization, it’s containerization/segmentation. Containers all use the same kernel instance, base file system etc. IMO this misunderstanding is perpetuated by the Docker docs. You can think of a container as a virtual machine most of the time and be fine. But it will eventually bite you if you don’t understand the difference.

        This is also a reason to never run the processes in your containers as root unless you absolutely need to because they’re running as root on system as a whole.

      2. Containers are great, but instead of docker I recommend using podman. It runs rootless by default, separates namespaces etc. and as another commenter said, containers are not virtualization. It is a great way to isolate and run a ton of services

  2. Thanks Al, I’m always happy to stumble across things like this. Although I’ve been living with *nix since the early 80’s, there are always new things to learn. Looks like we’ve got even more ways to confound the support team now :)

  3. I know a number of homeowners who share a well with their neighbors.
    They all contribute to pay for the electrical power to pump the water and any repairs the well might need.

  4. Ah yes, the tools that Docker builds on, but without the declarative Dockerfile standard.

    I’ve heard critics of Docker say ‘oh, we’ve had cgroups forever’, but it’s a big lift from hacking together cgroups to getting shareable, composable process isolation containers. Docker brought some de facto standards to the table to glue these tools together and changed the industry.

      1. I don’t understand the criticism being leveled at Docker. Seems like a slippery slope that leads to NIH syndrome. I get it, but just because I can write software doesn’t mean I have the time or the skills to write all the software I use. Most of the time software is just a means to an end. I must accept some tradeoffs if I’m going to get anything done before I die.

  5. Linux has a distro for most use cases. For example: if you want security, use QubesOS; if you want a general-purpose PC, use Debian or Ubuntu; if you build a server, use FreeBSD. A “jack of all trades” like Windows has many bugs.

  6. I used to use unshare with ecryptfs so that I could restrict only a few applications to have access to inside an encrypted directory. With GUI apps it quickly fell apart due to their reliance on shared per-session daemons that also needed to have access to the files.
