Linux Fu: Scripting For Binary Files

June 29, 2018

If you ever need to write a binary file from a traditional language like C, it isn’t all that hard to do. About the worst thing you might have to deal with is attempts to fake line endings across Windows and Linux, but there’s usually a way to turn that off if it is on by default. However, if you are using some type of scripting language, binary file support might be a bit more difficult. One answer is to use a tool like xxd or t2b (text-to-binary) to handle the details. You can find the code for t2b on GitHub including prebuilt binaries for many platforms. You should be able to install xxd from your system repository.

These tools take very different approaches. You might be familiar with tools like od or hexdump for producing readable representations of binary files. The xxd tool can do the same thing, although it is not as flexible. What xxd can do, however, is reverse itself and rebuild a binary file from a hex dump it created (something other tools can’t do). The t2b tool takes a much different approach. You issue commands to it that cause it to write a new binary file.

Both of these approaches have some merit. If you are editing a binary file in a scripting language, xxd makes perfect sense. You can convert the file to text, process it, and then roll it back to binary using one program. On the other hand, if you are creating a binary file from scratch, the t2b program has some advantages, too.

I decided to write a few test scripts using bash to show how it all works. These aren’t production scripts so they won’t be as hardened as they could be, but there is no reason they couldn’t be made as robust as you were willing to make them.

Cheating a Little

I decided to write two shell scripts. One will generate an image file. I cheated in two ways there. First, I picked the PPM (Portable Pixmap) format, which is very simple to create. Second, I ignored the variant of the format that stores the pixel data as ASCII instead of binary. That’s not strictly cheating because the ASCII variant does make a larger file, as you’d expect. So there is a benefit to using the binary format.

The other script takes a file in the same format and cuts the color values within it by half. This shows off both tools since the first job is generating an image file from data and the second one is processing an image file and writing out a new one. I’ll use t2b for the first job and xxd for the second.

PPM File Format

The PPM format is part of a family of graphics formats from the 1980s. They are very simple to construct and deconstruct, although they aren’t known for being small. However, if you need to create a graphic from a Raspberry Pi program, it is sometimes handy to write it in this simple file format and then use ImageMagick or some other tool to convert it to a nicer format like PNG.

There are actually three variants of the format. One for black and white, one for grayscale, and another for color. In addition, each of them can contain ASCII data or binary data. There is a very simple header which is always in ASCII.

We’ll only worry about the color format. The header starts with the string “P6.” A newline usually follows, although, to be defensive, you ought to allow any whitespace character to end the header fields. Then the X and Y limits appear, in decimal and still in ASCII, separated by whitespace. In practice, that usually means a space between the two and a newline at the end. The next part of the header is another ASCII decimal value indicating the maximum value for the color components in the image. After that, the data is binary RGB (red/green/blue) triplets. By the way, if the P6 had been a P3, everything would remain the same, but the RGB triplets would be in ASCII, not binary. This could be handy in some cases but, as I mentioned, will result in a larger file.
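To make the layout concrete, here is a minimal sketch that writes a 2×2 binary PPM using nothing but the shell’s printf. The function name write_tiny_ppm and the pixel colors are just illustrative choices, and the octal escapes are used because they work in plain POSIX shells as well as bash.

```shell
# Write a 2x2 binary (P6) PPM to the file named in $1.
# write_tiny_ppm is a made-up name for this illustration.
write_tiny_ppm() {
  {
    printf 'P6\n2 2\n255\n'    # magic, width and height, maximum value
    printf '\377\000\000'      # top left: red
    printf '\000\377\000'      # top right: green
    printf '\000\000\377'      # bottom left: blue
    printf '\377\377\377'      # bottom right: white
  } > "$1"
}
```

Calling write_tiny_ppm tiny.ppm should leave an 11-byte header followed by 12 bytes of pixel data.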

Here’s a sample header with a little bit of binary data following it:

The green text represents hex numbers and the other boxes contain ASCII characters. You can see the first 15 bytes are header and after that, it is all image data.
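Because the header is plain ASCII, the shell can read it directly. The sketch below assumes the common layout described above (one field group per line and no comment lines in the header); read_ppm_header is a hypothetical helper name, and it leaves its results in ordinary global shell variables.

```shell
# Print the magic number, dimensions, and maximum value of the PPM file in $1.
# Assumes the usual "P6\nX Y\nMAX\n" layout with no comment lines.
read_ppm_header() {
  {
    read -r magic            # e.g. P6
    read -r width height     # e.g. 800 600
    read -r maxval           # e.g. 255
  } < "$1"
  echo "$magic ${width}x${height} max=$maxval"
}
```

A more defensive reader would accept any whitespace between fields, as the format allows.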

T2B

The t2b program takes a variety of commands to generate output. You can write a string or various sizes of integers. You can also do things like repeat output a given number of times and even choose what to output based on conditions. There’s a way to handle variables and even macros.

As an example, my script will write out an image with three color bars in it. The background will be black with a white border. The color bars will automatically space to fit the box. I won’t use too many of the t2b features, but I did like using the macros to make the resulting output easier to read. Here’s the code for creating the header (with comments added):

strl P6  # Write P6 followed by a newline (no quotes needed because no whitespace in the string)
str $X   # Write the X coordinate (no newline)
u8 32    # a space
strl $Y  # The Y coordinate (with newline)
strl 255 # Maximum subpixel value (ASCII)

That’s all there is to it. The RGB triples use the u8 command, although you could probably use a 24-bit command, too. I also set up some macros for the colors I used:

macro RED 
  begin 
    u8 255 
    times 2 u8 0 
    endtimes 
endmacro

Once you have the t2b language down, the rest is just math. You can find the complete code on GitHub, but you’ll see it just computes 7 equal-sized regions and draws different colors as it runs through each pixel in a nested set of for loops. There’s also a one-pixel white border around the edges for no good reason.
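If you don’t want to pull up the repository, the general shape of such a generator can be sketched as follows. This is not the actual script from GitHub: the function names, the choice of red, green, and blue bars, and the decision to put the bars in regions 1, 3, and 5 are all assumptions made for illustration. It only emits the strl, str, and u8 commands shown above, one per line.

```shell
# Hypothetical sketch of a color bar generator that emits t2b commands.
# pixel R G B writes one RGB triplet as three u8 commands.
pixel() { printf 'u8 %d\nu8 %d\nu8 %d\n' "$1" "$2" "$3"; }

emit_image() {
  X=${1:-800}
  Y=${2:-600}
  # Header, as shown earlier
  echo "strl P6"
  echo "str $X"
  echo "u8 32"
  echo "strl $Y"
  echo "strl 255"
  y=0
  while [ "$y" -lt "$Y" ]; do
    x=0
    while [ "$x" -lt "$X" ]; do
      if [ "$x" -eq 0 ] || [ "$y" -eq 0 ] ||
         [ "$x" -eq $((X - 1)) ] || [ "$y" -eq $((Y - 1)) ]; then
        pixel 255 255 255              # one-pixel white border
      else
        case $((x * 7 / X)) in         # which of the 7 regions is this column in?
          1) pixel 255 0 0 ;;          # assumed bar color: red
          3) pixel 0 255 0 ;;          # assumed bar color: green
          5) pixel 0 0 255 ;;          # assumed bar color: blue
          *) pixel 0 0 0 ;;            # black background
        esac
      fi
      x=$((x + 1))
    done
    y=$((y + 1))
  done
}
```

Piping the output of emit_image 700 700 into t2b should then work the same way as the real script, although a pure shell loop like this will be slow for large images.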

When you want to run the code you can either specify the X and Y coordinates or take the 800×600 default:

./colorbar.sh 700 700 | t2b >outputfile.ppm

If you intercept the output before the t2b program, you’ll see the commands rolling out of the script. Here’s the default output to the ppm file:

Shades of Gray

The other script is a little different. The goal is to divide all the color values in a PPM file in half. If it were just binary data, that would be easy enough, but you need to skip the header so as not to corrupt it. That takes a little extra work. I used gawk (GNU awk) to make the work a little simpler.

The code expects output from xxd, which looks like this:

00000000: 5036 0a38 3030 2036 3030 0a32 3535 0aff  P6.800 600.255.. 
00000010: ffff ffff ffff ffff ffff ffff ffff ffff  ................ 
00000020: ffff ffff ffff ffff ffff ffff ffff ffff  ................ 
00000030: ffff ffff ffff ffff ffff ffff ffff ffff  ................ 
00000040: ffff ffff ffff ffff ffff ffff ffff ffff  ................

The address isn’t important to us. You can ask xxd to suppress it, but it is also easy to just skip it. The character representations to the right aren’t important either. The xxd program will ignore that when it rebuilds the binary. Here’s the code in awk (which is embedded in the shell script):

# need to find 4 white space fields
BEGIN  { noheader=4 }
    {
    lp=1
    }
    {
    split($0, chars, "")
# skip initial address
    while (chars[lp++]!=":");
    n=0;  # number of bytes read
# get two characters
    while (n<16 && lp<length(chars)) {
# heuristically two space characters out of xxd ends the hex dump line (ascii follows)
      if (chars[lp] ~ /[ \t\n\r]/) {
        if (chars[++lp] ~ /[ \t\n\r]/) {
          break;  # no need to look at rest of line
        }
      }
      b=chars[lp++] chars[lp++];
      n++;
# if header then skip white space
      if (noheader>0) {
      if (b=="20" || b=="0a" || b=="0d" || b=="09") noheader--;
    }
    else {
    # if not header then /2
     bn=strtonum("0x" b)/2;
     bs=sprintf("%02x",bn);
     chars[lp-2]=substr(bs,1,1);
     chars[lp-1]=substr(bs,2,1);
    }
  }
# recombine array and print
  p=""
  for (i=1;i<=length(chars);i++) p=p chars[i];
  print p
  }

The awk code simply skips the address and then pulls up to 16 items from a line of data. The first task is to count whitespace characters to skip over the header. I made the assumption that there would not be runs of whitespace, although a more robust program would probably consume multiple spaces (easy to fix). After that, each byte gets divided and reassembled. This task is more character oriented and awk doesn’t handle characters well without a trick.

In particular, I used the split command to convert the current line into an array with each element containing a character. This includes any whitespace characters because I used an empty string as the split delimiter:

split($0, chars, "")

After processing the array — which isn’t hard to do — you can build a new string back like this:

p=""
for (i=1;i<=length(chars);i++) p=p chars[i];
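Here is the same split-and-rejoin trick in isolation, as a small sketch. The helper name swap_char is made up for this example. Splitting on an empty string works in gawk and several other awk implementations, but POSIX does not guarantee it, so treat that as an assumption about your awk.

```shell
# Replace the character at position $1 with $2 on each input line.
# swap_char is a hypothetical helper, shown only to isolate the trick.
swap_char() {
  awk -v pos="$1" -v c="$2" '{
    n = split($0, chars, "")     # one array element per character
    chars[pos] = c               # edit a single character in place
    p = ""
    for (i = 1; i <= n; i++) p = p chars[i]   # glue the line back together
    print p
  }'
}
```

For example, echo "ffff 00" | swap_char 6 7 changes the sixth character of the line from 0 to 7.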

The output file will feed back to xxd with the -r option and you are done:

xxd infile.ppm | ./half.sh | xxd -r >outfile.ppm
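For comparison, here is a sketch of the same job done without awk, using xxd’s plain hex mode (-p) so there is no address or ASCII column to skip. It is an alternative illustration, not the half.sh script itself. The function name half_ppm is invented, and it makes the same assumption as the awk code: exactly four single whitespace characters end the header fields.

```shell
# Halve every color byte of a binary PPM read from stdin.
# Assumes xxd is installed and the header holds exactly four
# single whitespace separators (no comments, no runs of spaces).
half_ppm() {
  ws=4                                   # whitespace bytes left in the header
  xxd -p -c 1 | while read -r b; do      # one hex byte per line
    if [ "$ws" -gt 0 ]; then
      case $b in 20|0a|0d|09) ws=$((ws - 1)) ;; esac
      echo "$b"                          # header bytes pass through unchanged
    else
      printf '%02x\n' $((0x$b / 2))      # image data: integer divide by two
    fi
  done | xxd -r -p                       # plain hex back to binary
}
```

You would run it as half_ppm <infile.ppm >outfile.ppm. A byte-at-a-time shell loop is far slower than the awk version, so this is best kept for small files.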

Two is the Loneliest

This is a great example of how the Unix philosophy makes it possible to build tools that are greater than the sum of their parts. A simple program changes a text-processing language like awk into a binary file manipulation language. Great. By the way, if your idea of manipulating binary is Intel hex or Motorola S records, be sure to check out the srec_cat and related software which can manipulate those, too.

Once you have a bunch of binary files, you might appreciate an online hex editor. By the way, a couple of years ago, I mentioned using od to process binary files in awk. That’s still legitimate, of course, but xxd allows you to go both ways, which is a lot more useful.

31 thoughts on “Linux Fu: Scripting For Binary Files”

Jerry says:

June 29, 2018 at 7:09 am

TUX links in bookmarks

Ostracus says:

June 29, 2018 at 7:25 am

“This is a great example of how the Unix philosophy makes it possible to build tools that are greater than the sum of their parts.”

DIY DSL.

Shinsukke says:

June 29, 2018 at 7:41 am

lol i just use printf "\x{number}" and pipe that into a file to create small binary files if i have to do it from bash
printf is present in most linux systems, so you don’t even have to install anything

1. Al Williams says:
  
  June 29, 2018 at 8:09 am
  
  Nice thing about Unix-y stuff is there are dozens of ways to do anything.
  
2. jcamdr says:
  
  June 29, 2018 at 11:17 am
  
  Exactly. In the following bin2bash script, printf commands are generated automatically from a binary file so that they reconstruct it when executed. The goal is to send a binary file to systems that only have a serial console and nothing like x|y|zmodem, kermit, SLIP, PPP or whatever available that could transmit a binary file without hitting an escape sequence.
  ———-
  #!/bin/bash
  echo "f='/tmp/$1'"
  echo "touch \$f; rm \$f"
  c=0
  for b in $(od -An -tx1 -v "$1"); do
  if [[ "$c" -ne 0 && "$((c%32))" -eq 0 ]]; then
  echo
  fi
  echo -n "\x$b"
  ((c++))
  done | sed "s/^/printf '/;s/$/' >>\$f/"
  echo
  echo "x=$(sha512sum "$1" | cut -d' ' -f1)"
  echo "y=\$(sha512sum \$f | cut -d' ' -f1)"
  echo "if [ \"\$x\" != \"\$y\" ]; then echo \"File transfer failed !\"; fi; echo \"File \$f is ready.\""
  ———-
  The file will end up in the /tmp folder of the receiving system and its SHA512 sum is compared to that of the original file.
  
yeti says:

June 29, 2018 at 7:56 am

Please write “gawk” instead of “awk” if you use gawk-features.

1. yeti says:
  
  June 29, 2018 at 7:58 am
  
  …I mean consistently instead of once.
  
2. Al Williams says:
  
  June 29, 2018 at 8:05 am
  
  Well, that’s true. I’m spoiled and think of gawk as awk, but you are correct. I don’t think I’ve had to deal with a non-gawk awk though since I left AIX. But, like you say, I did identify it as such once ;-)
  
RustyHydrogen says:

June 29, 2018 at 8:19 am

… bvi

RoGeorge says:

June 29, 2018 at 8:32 am

If you have a Rigol DS1054Z oscilloscope, plug a network cable into your oscilloscope, and type the following in a Terminal on your PC:

echo ":DISPLAY:DATA? ON,OFF,PNG" | nc -w1 192.168.32.208 5555 | dd bs=1 skip=11 of=image.png

This will save a screenshot of whatever your oscilloscope is displaying. A one line script that
– opens a TCP connection to the scope
– sends a SCPI request over LXI for a capture of the oscilloscope’s screen
– receives the oscilloscope’s SCPI answer
– decodes the SCPI answer
– saves the decoded oscilloscope’s screen as a png image on the PC
– drops the TCP connection

All these without any additional drivers or tools.

Not bad for a one line script.
:o)

Of course, you need to replace 192.168.32.208 with your oscilloscope’s IP.
All credit goes to ‘S Clark’, who came up with this beautiful example of scripting Fu.

1. Mike Szczys says:
  
  June 29, 2018 at 8:37 am
  
  Love this tip, thanks!
  
2. Al Williams says:
  
  June 29, 2018 at 8:40 am
  
  https://github.com/wd5gnr/scopesnap — from 2015
  
  1. RoGeorge says:
    
    June 29, 2018 at 12:11 pm
    
    Wow, you put a lot of features there!
    
    1. Elliot Williams says:
      
      July 2, 2018 at 5:48 am
      
      And not just Rigol scopes! SCPI works for most everything modern. I pull down (megabyte) data dumps from my old Agilent all the time.
      
      I think Jenny did something on SCPI once? Maybe we need to write that up again!
      
      1. RoGeorge says:
        
        July 2, 2018 at 7:30 am
        
        Good idea, why not? If so, you may also want to look at lxi-tools, too. The best so far IMO, with a library, a command line, a GUI, Lua scripting (great for automated testing, logging and so on), covers both RAW/TCP or VXI-11/TCP, tested for many instruments, open source, free, collaborative project. Last time I checked it was planned to make ‘lxi-tools’ a standard package in GNU/Linux repositories. So far it was featured by Siglent, maybe some other instruments manufacturers, too.
        
        The author, [lundmar], is an amazing programmer.
      2. RoGeorge says:
        
        July 2, 2018 at 7:31 am
        
        https://www.eevblog.com/forum/testgear/open-source-lxi-tools-and-liblxi-v1-0-released-for-gnulinux/
      3. RoGeorge says:
        
        July 2, 2018 at 7:33 am
        
        Highly recommended:
        https://lxi-tools.github.io/
CityZen says:

June 29, 2018 at 8:35 am

If you scour the man pages for various Linux commands, you’ll often find that they have provisions for working directly with binary data, obviating the need for any conversion/reconversion.

1. Al Williams says:
  
  June 29, 2018 at 8:41 am
  
  Doesn’t help if you are trying to write your own stuff in awk though. I’ve often wished awk had a “binary” mode that did something sensible and this is the closest I’ve gotten.
  
  1. CityZen says:
    
    June 29, 2018 at 12:35 pm
    
    You can do this in gawk if you set BINMODE appropriately and simply don’t use the line-processing features. Or, in this case, don’t use them once you’ve gotten the initial text fields.
    
    1. CityZen says:
      
      June 29, 2018 at 12:44 pm
      
      Okay, I seem to be mistaken; I thought gawk had more input flexibility, but it seems that all the input statements are line-based only.
      
DeadlyDad says:

June 29, 2018 at 11:32 am

I recall back in the day writing a Windows/DOS batch script that piped itself to DEBUG.EXE to generate a small .COM utility program that the script then used.

1. dwywit says:
  
  June 29, 2018 at 10:33 pm
  
  Was that the program that instigated a reboot by initiating a jmp to memory address 0000? Or something like that. I’ve got it somewhere. Five lines of assembly, fed to DEBUG, produced a COM file guaranteed to reboot a DOS machine.
  

I needed to send a 2-byte length, followed by the actual content using a shell script. This is what I came up with:

#!/bin/sh

l16v() { 
        S=$(cat "$1" | wc -c) 
        echo "Sending $S bytes" >&2 
        printf "%b" $(printf "\\%03o" $(( ($S >>  8) & 255)) )
        printf "%b" $(printf "\\%03o" $(( ($S >>  0) & 255)) ) 
        cat "$1"
} 

l16v "$1"

Luke says:

June 29, 2018 at 12:41 pm

I’m still trying to figure out who remembers all this stuff for when they happen to need it, and how long it would take to re-learn it to come up with a suitable script. When I write a script for something, six months later I forget that it even exists.

For this sort of batch processing of images, I have an ancient copy of Photoshop. It records your points and clicks, so you only have to do the first example.

Tim says:

June 29, 2018 at 5:03 pm

While this works, an easier solution is the Python struct library.

Produces infinitely more readable code than bash, and almost as likely to already be installed.

1. Al Williams says:
  
  June 30, 2018 at 5:19 am
  
  Well… assuming you know Python, maybe. Personally? I would do all this in C which I find easier by dint of having done it for a few decades. But it seems harsh to say: Hey people who do bash scripting. Learn my language because it is the One True Language! And I will say if you know awk it is very very productive for the right tasks.
  
Rog Fanther says:

July 1, 2018 at 7:59 am

Thanks, t2b is a very nice little utility.

Now I wonder if there is one to do the inverse. That is, I’d like to script the reading from a binary file and take some actions (namely, split it into parts based on its contents). It would be nice to be able to do it with an equivalent of t2b.

Tom says:

July 1, 2018 at 9:24 pm

Vedit pro binary editor

ben says:

July 4, 2018 at 7:35 pm

Is it possible to do binary data manipulations using a regular-expression tool? Or do those only do text (ex. “TextPipe”)?

Tobe Osakwe says:

December 17, 2019 at 5:16 am

What a small world. I found my way here from a Google search for “awk for binary files,” trying to see if there were an existing tool used to patch binary files, and then found a mention of “t2b,” which I wrote about a year-ish ago.
