Linux Fu: Simple Pipes

In the old days, you had a computer and it did one thing at a time. Literally. You would load your cards or punch tape or whatever and push a button. The computer would read your program, execute it, and spit out the results. Then it would go back to sleep until you fed it some more input.

The problem is computers — especially then — were expensive. And for a typical program, the computer is spending a lot of time waiting for things like the next punched card to show up or the magnetic tape to get to the right position. In those cases, the computer was figuratively tapping its foot waiting for the next event.

Someone smart realized that the computer could be working on something else while it was waiting, so you should feed more than one program in at a time. When program A is waiting for some I/O operation, program B could make some progress. Of course, if program A didn’t do any I/O then program B starved, so we invented preemptive multitasking. In that scheme, program A runs until it can’t run anymore or until a preset time limit occurs, whichever comes first. If time expires, the program is forced to sleep a bit so program B (and other programs) get their turn. This is how virtually all modern computers outside of tiny embedded systems work.

But there is a difference. Most computers now have multiple CPUs and special ways to quickly switch tasks. The desktop I’m writing this on has 12 CPUs and each one can act like two CPUs. So the computer can run up to 12 programs at one time and have 12 more that can replace any of the active 12 very quickly. Of course, the operating system can also flip programs on and off that stack of 24, so you can run a lot more than that, but the switch between the main 12 and the backup 12 is extremely fast.

So the case is stronger than ever for writing your solution using more than one program. There are a lot of benefits. For example, I once took over a program that did a lot of calculations and then spent hours printing out results. I spun off the printing to separate jobs on different printers and cut like 80% of the run time — which was nearly a day when I got started. But even outside of performance, process isolation is like the ultimate encapsulation. Things you do in program A shouldn’t be able to affect program B. Just like we isolate code in modules and objects, we can go further and isolate them in processes.

Doubled-Edged Sword

But that’s also a problem. Presumably, if you want to have two programs cooperate, they need to affect each other in some way. You could just use a file to talk between them but that’s notoriously inefficient. So operating systems like Linux provide IPC — interprocess communications. Just like you make some parts of an object public, you can expose certain things in your program to other programs.

The most fundamental way to do this is with the fork call. When you fork a new process, the new process is a total copy of its parent. You don’t always realize this because the next thing you often do is call something like exec to load a new program or you use a wrapper-like system that calls fork and exec for you. But each time you run, say, ls at the command prompt, your running ls program starts life as a total copy of the shell. That copy then loads the ls executable and runs it.

But what if it didn’t? That was how my report writer worked. The big math, which took hours on a Sequent computer with a lot of CPUs, took place in one process. When it was time to print, I forked off a bunch of subprocesses. Each one had a total copy of the data which I then treated as read-only and started printing. That’s one way to communicate between processes.

Another way is pipes. Imagine a command line like:

cat data.txt | sort | more

Here, you are creating three processes. One dumps data from a text file. It sends that data to a pipe that is connected to the sort program. It outputs to another pipe that is connected to the more program.

One Way

Pipes like that are one-way affairs, but you can create named pipes and talk over them in both directions. You can do both of these in the shell — mknod makes a named pipe — but you can also do both of them in a program. (popen is very easy to use for regular pipes and there is a mknod API call, too.)

There are several other methods you can use to talk across processes:

  • Message queues – A way to send messages to another process asynchronously
  • Semaphores – A way to share a counter with another program
  • Shared memory – Share a block of memory
  • Signal – You can send signals to other ​processes which can be used as a form of communication

You might wonder why you need anything beyond shared memory. Honestly, you don’t, but it is easier in many cases to use a different method. The problem is you need some way to have an atomic operation and things like semaphores manage that for you. Imagine if we had a variable in shared memory called busy. If busy is 1, then we know we shouldn’t change data in our shared memory because someone is using it.

We might write:

while (busy) ; // wait for busy==0
busy=1;
do_stuff();
busy=0;

Looks great, right? No. Somewhere in the CPU that while loop looks like this:

while_loop2384: TST busy ; set flags on busy
                JNZ while_loop2384 ; if no zero flag, jump
                MOV busy,#1 ; move 1 to busy

Most of the time this will work fine. Most of the time. But what happens if I do the TST instruction and then I get put to sleep so another program can run the same code? Or another CPU is running the exact same code at the exact same time? It can happen. So both programs will now see that busy is zero. Then they will both set busy to 1 and continue on. That’s a fail.

Semaphores manage this through an atomic access mechanism that allows the program to test and set the operation in one place. There is more to worry about, like what happens if I’m waiting for process B to release a semaphore and process B is waiting for me to release a different one. But that situation — deadlock –is a topic for the future, along with other hiddle gotchas like priority inversion.

In the Pipeline

I have a bit of a made up problem. On Linux, if I type df I can find out all the mounted things and their characteristics. But that list includes things like the root directory and swap file. What if you wanted to just read the loop devices and show the same format output? There are plenty of ways to do that, of course. You could read the loop files out /etc/mtab and then read the other data from /sys or wherever else it resides. Sounds like a lot of work.

Of course, running df gets us almost there. In fact, I could just run a pipeline in the shell to get what I want, sort of:

df | grep '^/dev/loop'

That works but the output is scrambled. On my system /dev/loop3 is first and /dev/loop0 is last and there’s no clear reason why number 4 is between 8 and 14. So I want to sort it. Piping through sort doesn’t help much because it is going to sort alphabetically. You might think about the -n flag to sort, but that won’t work because the number is at the end of the string. Sure, I could use some strange combo of cut or sed to maybe fix all this, but it is getting too complicated. Let’s just write C code.

The first step is to get df to just print everything and capture the output. Since we want to process the output we need to read a pipe, and popen() is an easy way to set this up:

#include <stdio.h>
int main(int argc, char * argv[]) {
// This part reads the output of DF into the lines array (with some modifications)
   FILE * result = popen("df", "r"), * sort;
   int i;
   if (!result) {
      perror("Can't open df");
      return 1;
   }
   while (!feof(result)) {
     int c = getc(result);
     if (c != EOF) putchar(c);
   }
 pclose(result);
 return 0;
}

Half Solved

That’s half the problem solved. If you have the characters, you could do all the sorting and filtering you want, but… wait a minute! I’m still lazy. So let’s ask the shell to help us out. Here’s my plan. I know I only want lines that start with /dev/loop so let’s do this:

  • Read a whole line at one time
  • If it isn’t a /dev/loop line throw it out
  • If it is a /dev/loop line then save it in an array but chop off the /dev/loop part
  • After we have all the lines ask the shell to sort and then add the /dev/loop back in after sorting

Easy enough:

#include <stdio.h>
#include <string.h>

char buffer[4097];
char * lines[512];
unsigned int maxline = 0;

int main(int argc, char * argv[]) {
// This part reads the output of DF into the lines array (with some modifications)
   FILE * result = popen("df", "r"), * sort;
   int i;
   if (!result) {
      perror("Can't open df");
   return 1;
   }
   while (!feof(result)) {
 // get a line from df
     char * rc = fgets(buffer, sizeof(buffer), result);
// only save lines that start with /dev/loop
    if (rc &amp;amp;&amp;amp; !strncmp(buffer, "/dev/loop", 9)) {
// out of space
       if (maxline >= sizeof(lines) / sizeof(char * )) {
       fprintf(stderr, "Too many loops\n");
       return 2;
     }
   lines[maxline++] = strdup(buffer + 9); // copy just the number part and the rest of the line
   // should check lines[maxline[1] for null here
   }
 }
 pclose(result);

// Now we are going to print through sort
// The sed replaces the /dev/loop at the front of the line after sorting
 sort = popen("sort -n | sed 's/^/\\/dev\\/loop/'", "w");
 if (!sort) {
   perror("Can't open sort");
   return 3;
  }
// for each line, send to pipe (note order didn't really matter here ;-)
 for (i = 0; i < maxline; i++)
   fputs(lines[i], sort);
 pclose(sort);
 return 0;
}

And there you have it. Yeah, you could do this with the shell alone but it would be a lot more difficult unless you resort to another programming language like awk and then you really aren’t just using a shell. Besides, this makes a nice example and there are lots of things you could do like this that would be very hard to do otherwise.

You might wonder if you could spin off something like sort and both feed it input and read its output. The answer is yes, but not with popen(). The popen() call is just a handy wrapper around pipe and fork. If you wanted to do both ends you would have to use the pipe() call directly (twice) and then execute sort or whatever. But that’s a topic for the future.

There are plenty of other future topics for interprocess communications, too. But for now, try out pipes using popen. Critical sections come up in shell scripts, too. If you prefer to write your scripts in C, that’s possible, too.

15 thoughts on “Linux Fu: Simple Pipes

  1. Just a note: you can use pipe() then fork() and finally an exec*() if you need the flexibility of using one of the many different exec*() functions. This also provides you with the opportunity to set up the user environment before actually executing the program.

  2. Also note: An application itself can have many ‘tasks’ (some call threads) working together also. All this means is you can spread out a lot of processes (what you talk about above) and internal tasks (application threads) across all the cores available. An interesting job for the preemptive kernel to keep track of … and then there is the priorities of who gets to run and when….

  3. Instead of writing C code to re-invent the wheel, why not check the man page for sort and use ‘-h’ (human)? OK OK I get it’s an exercise and ‘-h’ would spoilt it :)

    Annoyingly it doesn’t exist on Mac (and I expect BSD), i suspect it’s a GNU extension, but you can easily install the ‘better/more human like’ sort to fix that.

  4. If you’re not willing to use awk/cut/grep/paste/sed/sort/tr/xargs directly, you’ve already lost the way. Those tools exist for a reason and your purpose-built program doesn’t justify not using them.

    Furthermore, I’m not sure if this article is just full of bad examples, or if the author actually does things this way, but most people should not heed any of the advice in this piece, particularly new Linux users. All of the other articles in this series have been useful and informative. This one is not. It could also literally be dangerous.

    Writing a C program to shell out using popen or execvp is asking for trouble if someone puts a like-named malicious program in your $PATH. Someone sneaks a setuid binary in as one of those programs, or you’re running something as the root user, and you’re in very deep trouble. This is not only “the long way ’round” but it’s bad practice. Put your commonly used program patterns into scripts, drop them in ~/bin, add that to your path and be done with it. Please. If you’re on a shared system, everyone else will thank you too.

    1. I’m sure your a little tongue and cheek because I’m convinced you must know that the simple examples are to illustrate a technique. I mentioned in the post it’s a made-up problem and I certainly don’t think anybody needs a specialized tool to print out a list of their loop devices. But if you don’t know how to use popen, this is an accessible example and presumably, you would use it for something smarter yourself.

  5. I’m pretty sure that’s not how hyper-threading works. You seem to describe it as one core acting like two virtual cores but with only one virtual core being awake at a time, which one being determined by the OS (which is slow). Indeed, it is a bit like this in some ways, but the whole point is that it switches fast enough (each one consumes only a few instructions at most before switching to the other one) that it’s as if they’re both running at once. When the CPU is single-threaded it can only feed itself instructions so fast due to the limits of out-of-order execution which means that not all parts of the CPU can be kept busy all the time. By having the single CPU pretend to be two it can run two programs at once with the hope that the CPU resources that are unused by one will be consumed by the other thereby keeping all parts of the CPU busy. This means that as far as the operating system can tell it genuinely is two CPUs, not a single CPU that can quickly switch between two processes upon command.

    1. Replace the amps with ampersands – someone has failed to properly escape the code in the article so ampersands become “amps” – and if I write the actual HTML code in a comment I get a security error when trying to reply!

  6. Even easier (for me) than “just writing a little program” is creating an alias for one or more piped shell commands that massage the output the way I want it. I’ve managed to avoid learning to program properly for many years this way!

  7. I tend to ‘decorate’ my shell programs using an ipython macro:
    This does more or less the same, filter on loop and sort by second column
    (I did this on my phone, BusyBox df does not recognize the ‘-r’ switch
    “`python
    df_output = !df
    loop = [d.split() for d in df_output if ‘loop’ in d]
    2darr = sorted(loop,key=lambda x : int(x[2]))
    for row in 2darr:
    for cell in row:
    print(cell, end=’\t’)
    print()

    “`

  8. “…., but you can create named pipes and talk over them in both directions. ”

    No. Don’t ever use one pipe to implement two way communication. I can be fine with the rest, but just suggesting to implement two directional communication over ONE pipe is a nono for me (and that how that could be read). You might want to take that bit out of your article.

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.