Linux Fu: Preprocessing Beyond Code

If you glanced at the title and thought, “I don’t care, I don’t write C code,” then hang on a minute. It is true that C has a preprocessor, and you can notoriously do strange and, depending on your point of view, horrible or wonderful things with it. But there are other options, and you don’t have to use any of them with a C program. You can use the C preprocessor with almost any kind of text file. And it’s not the only preprocessor you can abuse this way. For example, the m4 preprocessor is wildly complex, vastly underused, and can handle C source code or anything else you care to send to it.

Definitions

I’ll define a preprocessor as a program that transforms its input file into an output file, reacting to commands that are probably embedded in the file itself. Most often, that output is then sent to some other program to do the “real” work. That covers cpp, the C preprocessor. It also covers things like sed. Honestly, you can easily create custom preprocessors using C, awk, Python, Perl, or any other programming language. There are many other standard programs that you could think of as preprocessors, for example, tr. However, one of the most powerful, made to preprocess complex input files, is m4. For some reason, maybe because of its complexity, you don’t see much m4 in the wild.
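To give a sense of how little a custom preprocessor needs, here is a minimal Python sketch. The `@define` directive and the `preprocess` function are made up for this illustration and don't belong to any standard tool. It only does whole-word substitution and doesn't rescan the replacement text.

```python
import re


def preprocess(text):
    """Expand '@define NAME value' directives embedded in the input text.

    '@define' is a hypothetical directive invented for this sketch.
    """
    macros = {}
    output = []
    for line in text.splitlines():
        if line.startswith("@define "):
            # Directive line: remember the macro and emit nothing.
            parts = line.split(None, 2)
            if len(parts) >= 2:
                macros[parts[1]] = parts[2] if len(parts) > 2 else ""
            continue

        # Replace each identifier-like word that matches a known macro name.
        def expand(match):
            word = match.group(0)
            return macros.get(word, word)

        output.append(re.sub(r"[A-Za-z_]\w*", expand, line))
    return "\n".join(output)


if __name__ == "__main__":
    sample = "@define GREETING Good Morning\nmessage1: GREETING"
    print(preprocess(sample))  # prints "message1: Good Morning"
```

Anything beyond this, such as conditionals, includes, or macros with arguments, is where an existing tool like cpp or m4 starts to earn its keep.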

What Preprocessor?

If you’ve only used modern C compilers, you may wonder where the preprocessor even is. An ordinary system now does the entire compile in, as far as you can tell, one single pass. However, your compiler should offer a separate executable that runs just the preprocessor, if you prefer. For gcc (and many other compilers), it is named, unsurprisingly, cpp. The preprocessor has four major tasks:

  1. Substitute one string for another, including “macros” that look like a function call.
  2. Evaluate expressions and include parts of the input or exclude them based on the expression’s value.
  3. Strip out comments.
  4. Read in other files.

Of course, usually, the input is C source code, and the output is headed for the compiler, but it doesn’t have to be that way.

A Simple Example

Suppose you have a configuration file of some sort that has messages in it, originally in English. The file looks like this:

message1: Good Morning
message2: Good Night
message3: The cat is white

We want to arrange it so we can easily change the messages and build a new configuration file. There are several ways you could do this, each with some advantages and disadvantages.

Imagine you have a file called langs:

#define ENGLISH 0
#define SPANISH 1

Obviously, you could add more languages here, and the numbers are arbitrary as long as they are unique.

Now, we can create a template for the final configuration file:

#include "langs"

#ifndef LANG
#define LANG ENGLISH
#endif

#include "xlat"

message1: GOOD_MORNING
message2: GOOD_NIGHT
message3: CAT(WHITE)

There are a few things to notice about this file. First, it includes our language definition file. It then defines LANG as one of those symbols unless something else has already defined it. We will soon see what that might be, but assume this sets LANG to ENGLISH for now.

The include of xlat populates the tags like GOOD_MORNING with the correct string in whatever language we choose. Here’s what xlat looks like:

#if LANG==ENGLISH
#define WHITE white
#define GOOD_MORNING Good Morning
#define GOOD_NIGHT Good Night
#define CAT(clr) The cat is clr

#endif

#if LANG==SPANISH
#define WHITE blanco
#define GOOD_MORNING Buenos Días
#define GOOD_NIGHT Buenas Noches
#define CAT(clr) El gato es clr
#endif

Note that the good morning message has a Unicode character in it. That’s one small issue with using tools like this. The encoding will come out as a C-style escape character. Depending on what you are going to use the output for, that may or may not be acceptable. In fact, there are several things the preprocessor does for the compiler that we probably want to suppress.

If you just run:

cpp template

You get:

# 0 "template" 
# 0 "<built-in>" 
# 0 "<command-line>" 
# 1 "/usr/include/stdc-predef.h" 1 3 4 
# 0 "<command-line>" 2 
# 1 "template" 
# 1 "langs" 1 
# 2 "template" 2 


# 1 "xlat" 1 
# 8 "template" 2 

message1: Good Morning 
message2: Good Night 
message3: The cat is white

What we want is at the bottom, true, but there’s a lot of stuff to help the compiler generate error messages and other things.

The trick is to put a few options on the command line:

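cpp -undef -P template

These options are for gcc’s preprocessor: -undef stops it from predefining system-specific macros, and -P suppresses the line markers. If you use something else, you may have to find the equivalent options yourself.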

Customizing

If you want the Spanish version, you could simply edit the file. But you can also tell the preprocessor to force the LANG symbol, and since the template won’t redefine it, you’ll get the language of your choice:

cpp -undef -P -D LANG=SPANISH template

As I mentioned, the Unicode character will look funny after this, depending on how you look at it.

Another Way

This isn’t the only way to use the preprocessor in this example. You could detect the language and then include a different file — ENGLISH or SPANISH — to get the same result. This would have the advantage of many small independent files you could send to different translators, for example.

There are probably dozens of other ways you could do this, too. The preprocessor is like a multitool. There are lots of ways to do almost anything.
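To make the one-table-per-language idea concrete outside of cpp, here is a rough Python stand-in. It is not what the preprocessor does internally. The `TRANSLATIONS` dictionary and `build_config` function are hypothetical names, and each inner dictionary plays the role of one of the small per-language files you might hand to a translator.

```python
# Hypothetical per-language tables; in the cpp version each would be its own file.
TRANSLATIONS = {
    "ENGLISH": {
        "GOOD_MORNING": "Good Morning",
        "GOOD_NIGHT": "Good Night",
        "CAT": "The cat is {clr}",
        "WHITE": "white",
    },
    "SPANISH": {
        "GOOD_MORNING": "Buenos Días",
        "GOOD_NIGHT": "Buenas Noches",
        "CAT": "El gato es {clr}",
        "WHITE": "blanco",
    },
}


def build_config(lang="ENGLISH"):
    """Fill in the three-message template from the chosen language table."""
    table = TRANSLATIONS[lang]
    return "\n".join([
        "message1: " + table["GOOD_MORNING"],
        "message2: " + table["GOOD_NIGHT"],
        # CAT takes an argument, like the function-style macro in xlat.
        "message3: " + table["CAT"].format(clr=table["WHITE"]),
    ])


if __name__ == "__main__":
    print(build_config("SPANISH"))
```

The default argument mirrors the template’s fallback to ENGLISH when nothing else has chosen a language.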

Preprocessor on Steroids

If you really want to get fancy with the preprocessor, try m4. It is similar in idea to the C preprocessor but has many superpowers. It isn’t specific to C, so there’s not much you have to do to coax it to work with your files. Unlike the C preprocessor, m4 doesn’t care about lines. For example, consider this input:

Hello!
define(HACKADAY,1)
Testing our macro:
HACKADAY
The End

If you run that through m4, you’ll notice there is a strange blank line between Hello and the line that says “Testing.” Why? Because the macro definition only consumes the characters up to the close parenthesis. Everything else is still in the file, including that newline at the end. If you type some text in after the definition, there’s no problem, and it will show up in the output.

If you want to ignore the rest of the line, you use dnl (delete to new line) like this:

define(HACKADAY,1)dnl

Arguments in m4 use the dollar sign notation, much like the shell. Quoting is strange, too, since you use the back quote to open and the apostrophe to close. Like this:

define(HACKADAY,`eval(10**$1)')

As you might expect, this allows you to say HACKADAY(2) and get 100 as the result — the double asterisk is exponentiation.

A Pleasant Diversion

One of the best features of m4 is that it has at least ten different output streams. The default is stream 0, and the rest are numbered from 1 to 9. You can write to any of the streams easily, or divert to a negative stream number like -1 to discard text. At the end, the output streams are put together in numeric order. Hypothetically, then, you could have a macro that adds an item to a report. The report has a header, a body, and a totals section. You could put all the header code into the first stream (or “diversion”, in m4-speak), the body code in diversion 2, and the totals code in diversion 3.

At the end, the generated program would have all the headers, then all the body items, and, finally, the totals, and you could write them in any order you find convenient. Some m4 implementations, including the GNU one, allow more diversions than the standard requires.

As a simple example, consider this script:

dnl These comments will be discarded
dnl First, we are going to divert to #1
dnl Then we will print each word along with a count
dnl incrementing the count (_c)
dnl At the end, we will switch back to 0 and output the count
dnl This way, the header of the "report" will have the count
dnl followed by the words we wanted to count
divert(1)dnl
define(_c,0)dnl
define(WC,`
define(`_c',incr(_c))dnl
_c: $1')dnl
WC(Hello)
WC(There)
WC(Hackaday)
WC(2024)
divert(0)dnl
List of _c words:

Note that the lines that start with dnl are essentially comments. The rest is cryptic, but the idea is to define a macro to output a list of words with sequence numbers. The header contains a total count which, of course, we don’t know until the end. But since the header is put in diversion 0 and the rest in diversion 1, everything comes out in the right order.

There’s too much about m4 to cover in a single post, but you can read more about it on your own. Honestly, if you really need the power of m4, maybe you should be thinking about awk or Python anyway. You’ll probably have to recreate your own version of the divert system, though, so if you really need that functionality, maybe there is something to m4.
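For a sense of what recreating the divert system might involve, here is a minimal Python sketch. The `Diversions` class and `word_report` function are invented for this example, and the sketch simplifies things by buffering every stream, including stream 0, until the end. It reproduces the word-count report from the m4 script above, without the blank lines the m4 version emits.

```python
class Diversions:
    """Collect text in numbered buffers and emit them in numeric order."""

    def __init__(self):
        self.buffers = {}
        self.current = 0

    def divert(self, number):
        # Switch the stream that subsequent writes go to.
        self.current = number

    def write(self, text):
        if self.current < 0:
            return  # a negative diversion discards the text
        self.buffers.setdefault(self.current, []).append(text)

    def flush(self):
        # Join the streams in ascending numeric order.
        return "".join("".join(self.buffers[n]) for n in sorted(self.buffers))


def word_report(words):
    out = Diversions()
    out.divert(1)  # body first, while we are still counting
    count = 0
    for word in words:
        count += 1
        out.write(f"{count}: {word}\n")
    out.divert(0)  # header last, now that the total is known
    out.write(f"List of {count} words:\n")
    return out.flush()


if __name__ == "__main__":
    print(word_report(["Hello", "There", "Hackaday", "2024"]), end="")
```

The header is written last but comes out first, which is the whole trick.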

On the other hand, maybe try awk. Or mix awk, shell script, and the C preprocessor in terrible ways.

15 thoughts on “Linux Fu: Preprocessing Beyond Code”

    1. I came here to make a comment about m4, and then was surprised to see it mentioned in the body. Back in the day I knew of a website that made extensive use of m4 to produce the final HTML from templates. I don’t remember the details any more but I remember the site owner mentioning it on his about page.

  1. Ooooh, this is great! I don’t write C, but I use Make and CPP directives in Dockerfiles, because Docker is *adamant* about not having includes or anything that would allow modularization of images. So, defines and macros to the rescue for refactoring common parts of images my company’s CI systems use.

        1. Actually used that too on one of my company’s projects. Only a simple change, though, just had to change the start command to accommodate a debugger in dev but not in production (performance hit and all).

      1. Great to hear that you’re interested? But what would you be interested in, in particular? I’d love to share but I don’t know what would be interesting for all y’all.

  2. Remember in 1986 or thereabouts, writing an M4 macro to expand an x86 (80186 actually) assembler interrupt routine into an inline switch decoding six status bits. About 250 lines in and 4000 lines of gibberish out, but it worked and for the time was fast.

    M4 was very flexible, yes I did use it to support my Sendmail configuration as well. Wife and I both used it for Sendmail (from scratch) but she never ventured into assembler.

    Suspect I’m showing my (our) age.

  3. Back in the 1980s I developed the software for the Newbury Data Recording NDR3000 series of terminals in Z80 assembler based around a fully cooperative MASCOT (Modular Approach to Software Construction and Test) kernel.

    To validate the terminal emulations (Televideo, VT100, etc) my team wrote a test suite in M4 that exercised all the terminal functions and automatically checked the results comparing screen dumps to an exemplar file; where required we added code to capture the screen display and saved this to a file which each M4 test script could generate and then compare to the master exemplar copy. If they matched the test passed, simple but effective.

    To add another emulation just meant defining a new set of primitives; the bulk of the test suite stayed the same as it just called the primitives. This approach saved us weeks of development effort and eliminated many bugs.

    The development environment was Xenix running on a Vax 11/750, the requirements spec. was written using nroff and we wrote the code outline in pseudo Pascal, then hand coded the assembler statements using the pseudo Pascal as code comments. The final code was hand optimised using an HP64000 emulator to optimise the character pathway.

    The NDR3000 terminals were as fast as the fastest terminal on the market at the time, the Wyse terminal, but I would argue that the NDR was faster because the Wyse failed to scan the keyboard when running flat out. We cheated and dropped the keyboard scan rate to 10 Hz, just enough to capture a Control-C keypress.

  4. What you’d actually get from #define GOOD_NIGHT Good Morning is message2: Good Morning. Unless I missed something.

    As a good lesson in that auto correct’s no substitution for proofing your code :). Although I’m giving the benefit of the doubt and assuming it wasn’t run.

  5. Thanks! This article re-piqued my interest in m4 to the point that I searched for some tutorials. The m4 docs are “great” 🙄 if you know m4 already!

    Professionally, I found the cpp and Jon Bentley’s m1 (written in awk) to be sufficient for my needs in writing C and assembly code for microcontrollers.

    One puzzling item is ‘dnl’ – I would have thought it should be a global switch, rather than forcing the user to continually type it in…

    Thanks again,
    –Rich

  6. You don’t actually need cpp to run just the preprocessor step, since gcc/clang will do just that if you pass the -E flag to them. Though I admit I don’t know if they would be willing to preprocess non-C/C++ input files in that mode. Never occurred to me to try.
