Coffee With Kernighan

There was an interesting tidbit buried in a Computerphile video released last week (below the break), featuring professors [David Brailsford] and [Brian Kernighan] having a chat over coffee. Among other topics, they discuss the history and current state of various text processing tools. We learn that [Kernighan] has taken on a summer project of updating the AWK text processing language to handle UTF-8 text, an omission he admits is embarrassing in this day and age. He is also working on a second edition of The AWK Programming Language book, which hasn’t been updated since being first released in 1988.

[Brian Kernighan] is a legend in the world of Unix and computing, having worked at Bell Labs during the 70s, where Unix and C were developed. Among the many accomplishments of his career, he is well known as co-author, with [Dennis Ritchie], of The C Programming Language, first published in 1978 and still in use decades later. He also co-created the AWK mentioned above and made major updates to troff. More recently, he co-authored The Go Programming Language book in 2015.

If an updated UTF-8-capable AWK interests you, keep an eye on the AWK GitHub repository where [Kernighan] anticipates an update, once he wraps his head around git a little better. We’re happy to see [Brian] so active at 80 years old. If you want to learn more about those early days at Bell Labs, we reviewed [Kernighan]’s very interesting UNIX: A History and a Memoir a couple of years ago.

Continue reading “Coffee With Kernighan”

Linux Fu: Mixing Bash And Python

Although bash scripts are regularly maligned, they do have a certain simplicity and ease of creation that makes them hard to resist. But sometimes you really need to do some heavy lifting in another language. I’ll talk about Python, but actually, you can use many different languages with this technique, although you might need a little adaptation, depending on your language of choice.

Of course, you don’t have to do anything special to call another program from a bash script. After all, that’s what bash is mainly used for: calling other programs. However, it isn’t very handy to have your script spread out over multiple files. They can get out of sync, and if you want to send the script to someone or to another machine, you have to remember everything that goes with it. It is nicer to have everything in one file.
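
As a minimal sketch of the general idea (the exact mechanics in the full article may differ), a here-document lets you embed the Python program directly in the bash file and still pass it arguments:

#!/bin/bash
# Gather something in bash first (a hypothetical example)
host=$(hostname)

# Feed an embedded Python program to the interpreter on stdin.
# The quoted EOF keeps bash from expanding anything inside the
# Python code; "$host" arrives as sys.argv[1].
python3 - "$host" <<'EOF'
import sys
print(f"Hello from Python on {sys.argv[1]}")
EOF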

Continue reading “Linux Fu: Mixing Bash And Python”

Linux-Fu: Making AWK A Bit Easier

awk is a kind of Swiss Army knife for text files. However, some of its limitations are often a bit annoying. I’ve used a simple set of functions to make awk a bit better, although I will warn you: it does require GNU extensions to awk. That is, you must use gawk and not other versions. Your system probably maps /usr/bin/awk to something and that something might be gawk. But it could also be mawk or some other flavor. If you use a Debian-based distro, update-alternatives is your friend here. But for the purposes of this post, I’m going to assume you are using gawk.
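
On a Debian-style system, checking and changing what awk resolves to might look like this (paths and names are the usual Debian ones, but check your own distro):

update-alternatives --display awk                  # see which flavor awk points at
sudo update-alternatives --set awk /usr/bin/gawk   # make it gawk explicitly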

By the end of the post, you’ll see how to use my awk add-on functions to split up a line into fields even when there is no single character to separate all fields. In addition, you’ll be able to refer to the fields using names you decide. You won’t have to remember that $2 is the time field. You’ll say Fields_fields["time"] instead.

The Problem

awk does a lot of the common work for you when you use it to process text files. It reads files a record at a time. Normally, a record is a single line. Then it splits the line into fields using whitespace, or some other choice of field separator. You can write code that manipulates the line or the individual fields. This default behavior is great, especially since you can change the end-of-record character and the field separator. A surprising number of files fit this sort of format.

Until, of course, they don’t. If you have data coming from a data logging instrument or some database, it could be formatted in a variety of ways. Some fields might have structured data with a variety of separators. This isn’t a deal-breaker. Since you can get at the whole line, you can do almost anything you want, but the logic is harder, and the whole point of using awk is to make things easier.

For example, suppose you had a file from a data recorder that had an eight-digit serial number, followed by a six-character tag, and then two floating point numbers separated by colons. The pattern might look like

^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$

This would be hard to handle with conventional field splitting, and you’d normally just write code to split everything apart.
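
These aren’t my add-on functions (those come later in the post), but gawk’s built-in match() already hints at the idea: it can capture the parenthesized groups into an array, which you can then copy into named entries. The file name and field names here are hypothetical:

gawk 'match($0, /^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$/, m) {
   f["serial"] = m[1]; f["tag"] = m[2]
   f["v1"] = m[3]; f["v2"] = m[4]
   print f["tag"], f["v1"] + f["v2"]   # refer to fields by name
}' recorder.log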

Continue reading “Linux-Fu: Making AWK A Bit Easier”

Linux-Fu: Your Own Dynamic DNS

It is a problem as old as the Internet. You want to access your computer remotely, but it is behind a router that randomly gets different IP addresses. Or maybe it is your laptop and it winds up in different locations with, again, different IP addresses. There are many ways to solve this problem and some of them are better than others.

A lot of routers can report their IP address to a dynamic DNS server. That used to be great, but now it seems like many of them hound you to upgrade or constantly renew so you can see their ads. Some of them disappear, too. If your router vendor supplies one, that might be a good choice, until you change routers, of course. OpenWRT supports many such services and there are many lists of common services.

However, if you have a single publicly accessible computer, for example a Web server or even a cloud instance, and you are running your own DNS server, you really don’t need one of those services. I’m going to show you how I do it with an accessible Linux server running BIND. This is a common setup, but if you have a different system you might have to adapt a bit.

There are many ways to set up dynamic DNS if you are willing to have a great deal of structure on both sides. Most of these depend on setting up a secret key to allow DNS updates and some sort of script that calls nsupdate, or on having the DHCP server do it. The problem is, I have a lot of client computers, and many are set up differently. I wanted a system where the only thing needed on the client side was ssh. All the infrastructure remains on the DNS server.
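
For reference, the conventional nsupdate route this setup avoids looks roughly like this (the server, zone, host name, key path, and address are all hypothetical):

# Push a new A record using a shared TSIG key
nsupdate -k /etc/bind/ddns.key <<EOF
server ns.example.com
zone example.com
update delete laptop.example.com A
update add laptop.example.com 300 A 203.0.113.10
send
EOF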

Continue reading “Linux-Fu: Your Own Dynamic DNS”

Linux Fu: Scripting For Binary Files

If you ever need to write a binary file from a traditional language like C, it isn’t all that hard to do. About the worst thing you might have to deal with is automatic line-ending translation between Windows and Linux, but there’s usually a way to turn that off if it is on by default. However, if you are using some type of scripting language, binary file support might be a bit more difficult. One answer is to use a tool like xxd or t2b (text-to-binary) to handle the details. You can find the code for t2b on GitHub, including prebuilt binaries for many platforms. You should be able to install xxd from your system repository.

These tools take very different approaches. You might be familiar with tools like od or hexdump for producing readable representations of binary files. The xxd tool can do the same thing, although it is not as flexible. What xxd can do is reverse itself, so it can rebuild a binary file from a hex dump it creates (something the other tools can’t do). The t2b tool takes a much different approach: you issue commands to it that cause it to build a binary file from scratch.

Both of these approaches have some merit. If you are editing a binary file in a scripting language, xxd makes perfect sense. You can convert the file to text, process it, and then roll it back to binary using one program. On the other hand, if you are creating a binary file from scratch, the t2b program has some advantages, too.
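
A minimal sketch of the xxd round trip (the file names are hypothetical):

xxd firmware.bin > firmware.hex     # binary to an editable hex dump
# ... change firmware.hex with your favorite text tools ...
xxd -r firmware.hex > patched.bin   # -r reverses the dump back to binary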

I decided to write a few test scripts using bash to show how it all works. These aren’t production scripts so they won’t be as hardened as they could be, but there is no reason they couldn’t be made as robust as you were willing to make them.

Continue reading “Linux Fu: Scripting For Binary Files”

Gawking Hex Files

Last time I talked about how to use AWK (or, more probably, the GNU version known as GAWK) to process text files. You might be thinking: why should I care? Hardware hackers don’t need text files, right? Maybe they do. I want to talk about a few common cases where AWK can process things that are more up the hardware hacker’s alley.

The Simple: Data Logs

If you look around, a lot of data loggers and test instruments do produce text files. If you have a text file from your scope or a program like SIGROK, it is simple to slice and dice it with AWK. Your machines might not always put out nicely formatted text files. That’s what AWK is for.

AWK makes the default assumption that fields break on whitespace and that records end with line feeds. However, you can change those assumptions in lots of ways. You can set FS and RS to change the field separator and record separator, respectively. Usually, you’ll set these in the BEGIN action, although you can also change them on the command line.

For example, suppose your test file uses semicolons between fields. No problem. Just set FS to “;” and you are ready to go. Setting FS to a newline will treat the entire line as a single field. Instead of delimited fields, you might also run into fixed-width fields. For a file like that, you can set FIELDWIDTHS.
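
Both cases are one-liners. A quick sketch (the file name and the 8/6 column widths are made up for illustration):

# Semicolon-separated fields: print the second field of each line
awk 'BEGIN { FS = ";" } { print $2 }' data.txt

# Fixed-width fields (a gawk extension): an 8-character field, then a 6-character field
gawk 'BEGIN { FIELDWIDTHS = "8 6" } { print $1, $2 }' data.txt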

If the records aren’t delimited, but are a fixed length, things are a bit trickier. Some people use the Linux utility dd to break the file apart into lines based on the number of bytes in each record. You can also set RS to a regular expression that matches a fixed number of any character, and then use the RT variable (see below) to find out what those characters were. There are other options and even ways to read multiple lines. The GAWK manual is your friend if you have these cases.

BEGIN { RS=".{10}"   # records are 10 characters
      }

   {
   $0=RT
   }

   {
   print $0  # do what you want here
   }

Once you have records and fields sorted, it is easy to do things like average values, detect values that are out of limit, and just about anything else you can think of.
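
For instance, averaging a field and flagging out-of-limit readings takes only a couple of lines (the field number, the limit of 100, and the file name are hypothetical):

awk '{ sum += $2; if ($2 > 100) print "out of limit on line " NR }
   END { if (NR) print "average:", sum / NR }' log.txt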

Spreadsheet Data Logs

Some tools output spreadsheets. AWK isn’t great at handling spreadsheets directly. However, a spreadsheet can be saved as a CSV file, and AWK can chew those up easily. CSV is also an easy format to produce from an AWK script, so you can read the results into a spreadsheet and easily produce nice graphs, if you don’t want to use gnuplot.

Simplistically, setting FS to a comma will do the job. If all you have is numbers, this is probably enough. If you have strings, though, some programs put quotes around every string (and a string may contain commas or spaces). Some only put quotes around strings that have commas in them.

To work around this problem cleanly, AWK offers an alternate way to define fields. Normally, FS tells you what characters separate fields. However, you can set FPAT to define what a field looks like. In the case of a CSV file, a field is either a run of characters that aren’t commas, or a double quote, any characters that aren’t double quotes, and a closing double quote.

The manual has a good example:

BEGIN {
   FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
   print "NF = ", NF
   for (i = 1; i <= NF; i++) {
      printf("$%d = <%s>\n", i, $i)
   }
}

This isn’t perfect. For example, escaped quotes don’t work right, and neither does quoted text with newlines in it. The manual has some changes that remove quotes and handle empty fields, but the example above works for most common cases. Often the easiest approach is to change the delimiter in the source program to something unique instead of a comma.

Hex Files

Another text file common in hardware circles is a hex file. That is a text file that represents the hex contents of a programmable memory (perhaps embedded in a microcontroller). There are two common formats: Intel hex files and Motorola S-records. AWK can handle both, but we’ll focus on the Intel variant.

Old versions of AWK didn’t work well with hex input, so you’d have to resort to building arrays to convert hex digits to numbers. You still see that sometimes in old code or in code that strives to be compatible. However, GNU AWK has the strtonum function, which explicitly converts a string to a number and understands the 0x prefix. So a highly compatible two-digit hex function looks like this (not including the code to initialize the hexdigit array):

function hex2dec(x) {
  return (hexdigit[substr(x,1,1)]*16)+hexdigit[substr(x,2,1)]
}
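
The omitted initialization is the only other piece. One plausible way to build it (my sketch, not necessarily the code old programs used) is:

BEGIN {
   digits = "0123456789abcdef"
   for (i = 1; i <= 16; i++) {
      c = substr(digits, i, 1)
      hexdigit[c] = i - 1            # lowercase digits
      hexdigit[toupper(c)] = i - 1   # and their uppercase twins
   }
}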

If you don’t mind requiring GAWK, it can look like this:

function hex2dec(x) {
  return strtonum("0x" x);
}

In fact, the last function is a little better (and misnamed) because it can handle a hex number of any length, not just two digits (up to whatever limit is in GAWK).

Hex output is simple since you have printf and the %x format specifier. Below is an AWK script that chews through a hex file and provides a byte count for the entire file, plus a breakdown of the segments (that is, the non-contiguous memory regions).

BEGIN {
   ct = 0
   adxpt = ""
}

function hex4dec(y) {
   return strtonum("0x" y)
}

function hex2dec(x) {
   return strtonum("0x" x)
}

# Select type-00 (data) records: a colon, a two-digit length,
# a four-digit address, and then the 00 record type
/:[[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]]00/ {
   ad = hex4dec(substr($0, 4, 4))   # load address of this record
   if (ad != adxpt) {               # not where we expected? start a new block
      block[++n] = ad
      adxpt = ad
   }
   l = hex2dec(substr($0, 2, 2))    # byte count of this record
   blockct[n] = blockct[n] + l
   adxpt = adxpt + l                # address where the next record should start
   ct = ct + l
}

END {
   printf("Count=%d (0x%04x) bytes\t%d (0x%04x) words\n\n", ct, ct, ct/2, ct/2)
   for (i = 1; i <= n; i++) {
      printf("%04x: %d (0x%x) bytes\t", block[i], blockct[i], blockct[i])
      printf("%d (0x%x) words\n", blockct[i]/2, blockct[i]/2)
   }
}

This shows a few AWK features: the BEGIN action, user-defined functions, the use of named character classes ([[:xdigit:]] matches a hex digit), and arrays (block and blockct use numeric indices, even though they don’t have to). In the END action, the summary uses printf statements for both decimal and hex output.

Once you can parse a file like this, there are many things you could do with the resulting data. Here’s an example of some similar code that does a sanity check on hex files.

Binary Files

Text files are fine, but real hardware uses binary files that people (and AWK) can’t easily read, right? Well, maybe people can’t, but AWK can read binary files in a few ways. You can use getline in the BEGIN part of the script and control how things are read directly. You can also use the RS/RT trick mentioned above to read a specific number of bytes. There are a few other AWK-only methods you can read about if you are interested.

However, the easiest way to deal with binary files in AWK is to convert them to text using something like the od utility. This is a program available with Linux (or Cygwin, and probably other Windows toolkits) that converts a binary file to different readable formats. You probably want hex, so that’s the -t x1 option for single bytes (or -t x2 for 16-bit words, which is what the example below uses). However, the output is made for humans, not machines, so when a long run of identical output occurs, od omits it, replacing the missing lines with a single asterisk. For AWK use, you want to add the -v option to turn that behavior off. There are other options to change the output radix of the address, swap bytes, and more.

Here are a few lines from a random binary file:

0000000 d8ff e0ff 1000 464a 4649 0100 0001 0100
0000020 0100 0000 dbff 4300 5900 433d 434e 5938
0000040 484e 644e 595e 8569 90de 7a85 857a c2ff
0000060 a1cd ffde ffff ffff ffff ffff ffff ffff
0000100 ffff ffff ffff ffff ffff ffff ffff ffff
0000120 ffff ffff ffff ffff ffff 00db 0143 645e
0000140 8564 8575 90ff ff90 ffff ffff ffff ffff
0000160 ffff ffff ffff ffff ffff ffff ffff ffff
0000200 ffff ffff ffff ffff ffff ffff ffff ffff
0000220 ffff ffff ffff ffff ffff ffff ffff c0ff
0000240 1100 0108 02e0 0380 2201 0200 0111 1103
0000260 ff01 00c4 001f 0100 0105 0101 0101 0001
0000300 0000 0000 0000 0100 0302 0504 0706 0908
0000320 0b0a c4ff b500 0010 0102 0303 0402 0503
0000340 0405 0004 0100 017d 0302 0400 0511 2112
0000360 4131 1306 6151 2207 1471 8132 a191 2308

This is dead simple to parse with AWK. The address will be $1, and each data word will be $2, $3, and so on. You can convert the file yourself, use a pipe in the shell, or, if you want a clean solution, have AWK run od as a subprocess. Since the input is text, all of AWK’s regular expression features still work, which is useful.
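
As a small sketch of the pipe approach, this counts the 16-bit words that are all ones in a dump like the one above (the file name is hypothetical):

od -v -t x2 file.bin | awk '
   { for (i = 2; i <= NF; i++) if ($i == "ffff") n++ }   # skip $1, the address
   END { print n + 0, "words of 0xffff" }'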

Writing binary files is easy, too, since printf can output nearly anything. An alternative is to use xxd instead of od. It can convert binary files to text, but it can also do the reverse.
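
For example, printf with %c can emit raw bytes; forcing the C locale keeps gawk from treating values over 127 as multi-byte characters (the output file name is hypothetical):

# Write the two JPEG signature bytes 0xff 0xd8
LC_ALL=C gawk 'BEGIN { printf("%c%c", 255, 216) }' > sig.bin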

Full Languages

There’s an old saying that if all you have is a hammer, everything looks like a nail. I doubt that AWK is the best tool for building full languages, but it can be a component of some quick and dirty hacks. For example, the universal cross assembler uses AWK to transform assembly language files into an internal format the C preprocessor can handle.

Since AWK can call out to external programs easily, it would be possible to write things that, for example, processed a text file of commands and used them to drive a robot arm. The regular expression matching makes text processing easy and external programs could actually handle the hardware interface.

Think that’s far-fetched? We’ve covered stranger AWK use cases, including a Wolfenstein-like game that uses 600 lines of AWK script.

So, sure, it is software, but it has that Swiss Army knife quality that makes it a useful tool for software and hardware hackers alike. Of course, other tools like Perl, Python, and even C or C++ can do more, but often at a price in complexity and learning curve. AWK isn’t for every job, but when it works, it works well.

Gawking Text Files

Some tools in a toolbox are versatile. You can use a screwdriver as a pry bar to open a paint can, for example. I’ve even hammered a tack in with a screwdriver handle even though you probably shouldn’t. But a chainsaw isn’t that versatile. It only cuts. But man does it cut!

AWK is a chainsaw for processing text files line by line (and the GNU version is known as GAWK). That’s a pretty common case. It is even more common if you produce a text file from a spreadsheet or work with other kinds of text files. AWK has some serious limitations, but so do chainsaws. They are still super useful. Although AWK sounds like the name of a penguin-like bird, that bird is the auk. Sounds the same, but spelled differently. AWK is actually an acronym of its original authors’ names.

If you know C and you grok regular expressions, then you can learn AWK in about five minutes. If you only know C, go read up on regular expressions and come back. Five minutes later, you will know AWK. If you are running Linux, you probably already have GAWK installed and can run it using the name awk. If you are running Windows, you might consider installing Cygwin, although there are pure Windows versions available. If you just want to play in a browser, try webawk.

Continue reading “Gawking Text Files”