awk is a kind of Swiss Army knife for text files. However, some of its limitations are often a bit annoying. I’ve used a simple set of functions to make awk a bit better, although I will warn you: it does require GNU extensions to awk. That is, you must use gawk and not other versions. Your system probably maps /usr/bin/awk to something, and that something might be gawk. But it could also be mawk or some other flavor. If you use a Debian-based distro, update-alternatives is your friend here. But for the purposes of this post, I’m going to assume you are using gawk.
By the end of the post, you’ll see how to use my awk add-on functions to split up a line into fields even when there is no single character to separate all fields. In addition, you’ll be able to refer to the fields using names you decide. You won’t have to remember that $2 is the time field. You’ll say Fields_fields["time"] instead.
The Problem
awk does a lot of common work for you when you use it to process text files. It reads files a record at a time. Normally, a record is a single line. Then it splits the line into fields using whitespace, or some other choice of field separators. You can write code that manipulates the line or individual fields. This default behavior is great, especially since you can change the end-of-record character and the field separator. A surprising number of files fit this sort of format.
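For instance, with no setup at all, this one-liner prints the second whitespace-delimited field of every line (datafile.txt is a hypothetical input file):

awk '{ print $2 }' datafile.txt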
Until, of course, they don’t. If you have data coming from a data logging instrument or some database, it could be formatted in a variety of ways. Some fields might have structured data with a variety of separators. This isn’t a deal-breaker. Since you can get at the whole line, you can do almost anything you want, but the logic is harder and the whole point of using awk is to make things easier.
For example, suppose you had a file from a data recorder that had an eight-digit serial number, followed by a six-character tag, and then two floating point numbers separated by a comma. The pattern might look like:
^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$
This would be hard to handle with the conventional field splitting and you’d normally just write code to split everything apart.
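To be fair, the manual approach is not impossible, just clumsy. A rough sketch (my field names, and no validation of the input) might look like this:

# Manual splitting: fixed-width pieces with substr(), then split() on the comma
{
    serial = substr($0, 1, 8)             # eight-digit serial number
    tag    = substr($0, 9, 6)             # six-character tag
    n = split(substr($0, 15), nums, ",")  # the two floating point numbers
    if (n == 2) printf "%s %s %f %f\n", serial, tag, nums[1], nums[2]
}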
If you have regular fields but don’t know how many, you probably want to set FS or FPAT instead. We talked about FPAT a little before when we were abusing awk to read hex files. This library is a little different. You can use it to pick apart a line totally. For example, you might have part of the line with a fixed field length and then multiple types of separators. That can be hard to handle with the other methods.
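As a refresher, FS says what separates fields, while FPAT describes what a field itself looks like. A short sketch, borrowed roughly from the gawk manual, handles simple CSV with quoted fields:

# gawk only: a field is either a run of non-commas or a double-quoted string
BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
{ print $2 }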
Regular Expressions
To make things easier, I’ll wrap up the gawk match function. That function exists in regular awk, of course, but gawk adds an extension that makes things much easier. Normally, the function performs a regular expression match on a string and tells you where the match starts, if there was a match, and how many characters matched.
With the GNU extensions in gawk, you can provide an extra array argument. That array will get some information about the match. In particular, the zero item of the array will contain the entire match. If the regular expression contains sub-expressions in parentheses, the array will contain those, numbered by the order of the parentheses. It will also contain start and length information.
For example, if your regular expression were "^([0-9]+)([a-z]+)$" and your input string is 123abc, the array would look like this:
array[0] - 123abc
array[1] - 123
array[2] - abc
array[0, "start"] - 1
array[0, "length"] - 6
array[1, "start"] - 1
array[1, "length"] - 3
array[2, "start"] - 4
array[2, "length"] - 3
You can even have nested expressions, so "^(([xyz])[0-9]+)([a-z]+)$" with an input of z1x gives array[1]=z1, array[2]=z, and array[3]=x.
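If you want to see it for yourself, here is a minimal gawk snippet (the variable names are mine) that exercises the nested example:

# gawk only: the optional third argument to match() captures the groups
BEGIN {
    if (match("z1x", /^(([xyz])[0-9]+)([a-z]+)$/, m))
        print m[1], m[2], m[3]    # prints: z1 z x
}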
Theory vs Practice
In theory, that’s all you need. You can write a regular expression to pick apart a line, parse it, and then access the pieces using the array. In practice, it is much nicer to have all of that done for you so you can use plain names to access the data.
As an example data format, consider a line like this:
11/10/2020 07:00 The Best of Bradbury, 14.95 *****
There is a date in US format, a time in 24-hour format, an item name, a price, and a rating from 1 to 5 stars that may not be present. Writing a regular expression to grab each field is a bit complex, but not very hard. Here is one way to do it:
"^(([01][0-9])/([0-3][0-9])/(2[01][0-9][0-9]))[[:space:]]*(([0-2][0-9]):([0-5][0-9]))[[:space:]]+([^,]+),[[:space:]]*([0-9.]+)[[:space:]]*([*]{1,5})?[[:space:]]*$"
That’s a mouthful, but it works. Note that each item is in parentheses and some of those are nested. So the date is one field, but the month, day, and year are also fields.
The Library
Once you grab the files on GitHub, you can put the fields_* functions into your code. You need to do some setup in the BEGIN block. Then you process each line using fields_process. Here’s a small example (with the functions omitted):
BEGIN {
    fields_setup("^(([01][0-9])/([0-3][0-9])/(2[01][0-9][0-9]))[[:space:]]*(([0-2][0-9]):([0-5][0-9]))[[:space:]]+([^,]+),[[:space:]]*([0-9.]+)[[:space:]]*([*]{1,5})?[[:space:]]*$")
    fields_setupN(1,"date")
    fields_setupN(2,"month")
    fields_setupN(3,"day")
    fields_setupN(4,"year")
    fields_setupN(5,"time")
    fields_setupN(6,"hours")
    fields_setupN(7,"minutes")
    fields_setupN(8,"item")
    fields_setupN(9,"price")
    fields_setupN(10,"star")
}

{
    v = fields_process()
    ... your code here ...
}
In your code you can write something like:
cost=Fields_fields["price"] * 3
Simple, right? The fields_process function returns false if there was no match. You can still access the normal awk fields like $0 or $2 if you want.
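Since fields_process returns false on a mismatch, a typical main rule (reusing the names from the example above) can simply skip lines that don’t fit:

{
    if (!fields_process()) next     # ignore lines that don't match the pattern
    printf "%s costs %s\n", Fields_fields["item"], Fields_fields["price"]
}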
Inside
The extra functions rely on two things: the extensions to the gawk match function and awk’s associative array mechanism. In the past, I’ve added the named keys to the existing match array so you could get data out either way. However, I’ve modified it so that the match array is local, because I almost never really want that capability, and then you have to filter out the extra fields if you want to dump the entire array.
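The real code is in the GitHub repository, but a minimal sketch of the idea (my names and simplifications, not the actual library) goes something like this:

# A sketch only: remember the pattern and an index-to-name map, then let
# gawk's three-argument match() do the heavy lifting on each line.
function fields_setup(pattern)
{
    Fields_pattern = pattern
    delete Fields_names
}

function fields_setupN(n, name)
{
    Fields_names[n] = name
}

function fields_process(    m, i)     # m and i are local to the function
{
    delete Fields_fields
    if (!match($0, Fields_pattern, m))
        return 0                      # false: the line didn't match
    for (i in Fields_names)
        Fields_fields[Fields_names[i]] = m[i]
    return 1
}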
It is frequently useful to start the regular expression with ^ and end it with $ to anchor the entire string. Just don’t forget that the regular expression needs to handle whitespace consumption, as the example does. This is often a benefit when you have fields that can contain spaces, but if you wanted spaces to break fields anyway, you are probably better off with the original parsing scheme.
Another trick is to get “the rest of the line” after you’ve parsed off the first fields. You can do that by adding "(.*)$" to the end of the regular expression. Just don’t forget to set up a tag for it using fields_setupN so that you can fetch the value later.
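For instance, a hypothetical setup that grabs just the date from the earlier format and lumps everything after it into a single field might be:

BEGIN {
    fields_setup("^(([01][0-9])/([0-3][0-9])/(2[01][0-9][0-9]))[[:space:]]*(.*)$")
    fields_setupN(1, "date")
    fields_setupN(5, "rest")    # group 5 is everything after the date
}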
An easy extension to this library would be to make the pattern an array. The processing function could try each pattern in turn until one matches. Then it would return the index of the matching pattern or false if there were no matches. This would let you define multiple types of lines if you had a complex file format. You’d probably want to have different sets of field tags for each one, too.
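A sketch of that idea, building on the sketch above (again, my names, not anything in the library today), might look like:

# Try each pattern in turn; return the index of the first match, or 0 (false).
# A fuller version would keep a separate name table for each pattern.
function fields_process_any(patterns, count,    i, j, m)
{
    delete Fields_fields
    for (i = 1; i <= count; i++) {
        if (match($0, patterns[i], m)) {
            for (j in Fields_names)
                Fields_fields[Fields_names[j]] = m[j]
            return i
        }
    }
    return 0
}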
I have a long history of abusing tools like awk to do things, like build cross assemblers. Even so, I’m probably not the worst offender.
After several years of writing one-liners in Perl, it’s superseded awk entirely. This is something I type by heart:
perl -nl -we '$totals{$1}+=$2 if /context (key) (number)/; END { print "$_ $totals{$_}" for sort keys %totals }'
Don’t forget a2p — I sometimes start in awk and finish in Perl, but usually if I am going to get that hardcore, I’ll just switch to C++ or maybe Python if I’m feeling frisky.
Reading your posts [Al] has me gawking at your knowledge of UN*X/Linux.
“the whole point of using awk is to make things easier” – ROFL
I believe that view is shared only by those who’ve spent far too much time with awk and those who’ve never tried it.
Let me explain my process for using awk.
1. Can I solve the problem with stupid simple bash commands? If no, 2.
2. Can I easily program this with some Python scripts? If no, 3.
3. Can I find an awk solution for my problem posted randomly on the internet? If no, 4.
4. Hate life.
There are now even compiled languages with first-class regular expression processing, like Crystal, a descendant of Perl through Ruby. So discussion of awk serves only two purposes: historical interest, and graybeard status contests. I still use sed for bulk editing, but any more complex pattern processing is written in Crystal.
“Abusing AWK”… My, my, yes, indeed. It’s always interesting to see if you can make something do what it was never intended to do, or at least how far past its design you can push it.
I found AWK this year. Personally, I love it because it solves a slice of things I do on a regular basis that I knew someone out there had to have a pre-built tool for. Yes, it’s not Perl (but I think Perl happened because of AWK’s limitations), Python, Ruby, FPC, C++, … Different tools have different strengths and weaknesses suiting them to different jobs. When I start something I usually pick the tool I think will get me there the quickest, with the least amount of effort. And that usually means working with its design, not against it.
However, it’s always fun to push the boundaries and keep one’s faculties agile. Keep it up @Al.
I work with RegEx rarely enough that I always have to look stuff up. That said, this article was interesting and helpful, thank you!
When you have one problem and think “I will use regex for that,” you end up with two problems…
beat me to it.
You have a problem.
So you write a regular expression.
Now you have two problems.
A regex is nothing more than a computerized syntax for describing to another human what info you want out of a file. They are not so hard to write, as your mind sequentially deconstructs the logic into syntax. But they can be hard to read. Luckily there are tools for that, or write it on multiple lines in your script with added comments.
Of course the proper tool to use here is Rexx and its PARSE statement:
parse line date time item ',' price star
or if the date time needs to be separated
parse line date . 1 month '/' day '/' year time . 1 . hours ':' minutes item ',' price star
although it is even easier to read
parse line date time item ',' price star
parse date month '/' day '/' year
parse time hours ':' minutes
Regex has its uses, but parse beats it most of the time in the real world. And is always much easier to read.
And for doing the “read a whole file of these,” use NetRexx’s Pipelines. It has selection stages that use either PARSE or REGEX — your choice — to do the separations. (Pipelines takes the Unix pipe concept and puts it on steroids. Great!)
Great article – thanks!
I’m also impressed with your self-control in not titling the article “Making AWK a bit less awkward” – that must have been soooo tempting…
@Al Williams – One of the things I need to do on a regular basis is deal with CSV files. I’ve been using some fairly complicated AWK code that parses its way through the record. A regex that manages CSV with embedded commas, doubled " characters (to escape), and the ever-annoying use of ' instead of " is pretty ugly. How do you manage CSV files?
@Jeff Hennick – I too miss Rexx and its parse agility. When I worked at an OS/2 shop, I did things in VX-REXX that were quick and easy with Parse. Thanks for the memory!
@Astro — It is still alive, and expanding. NetRexx is a cross with Java and fully connects with it, making Java class files with about 30-40% fewer source characters — easier to write, type, and much easier to read. NetRexx takes care of the boilerplate overhead and the {{{{{}}}}}s. And, ooRexx, especially with BSF4Rexx, also fully integrates with Java. I just gave a presentation on NetRexx Pipelines to an annual Rexx Symposium. (Mike Cowlishaw attended and had quite a nice Q&A on Rexx’s history and future.)
Anyone processing text definitely should consider it. (Or anybody stuck with Java.) Lest we forget, it is a natural born scripting language.
awk still fills the role for me of splitting delimited records (and possibly slightly more complex things) which are ultimately destined to be piped into other tools. If I need to go beyond that, I usually end up just doing it all in Perl or Python now.
ISTM that the “problem” of dealing with the long serial number plus 6 character tag plus 2 floats example is a lack of knowledge of Awk. Writing a regexp to match the whole line doesn’t serve any useful purpose in splitting the fields.
Awk already _automatically_ splits each input line ($0) into fields, available as the variables $1, $2, $3, … $NF, so zero effort is required to grab and use any particular one.
For the cited case where the field separator (FS) is not consistent, i.e. bodgy input data, Awk effortlessly handles that by allowing you to set FS to a regex. In this quick illustration, we accept various input delimiters, excluding the decimal point, let’s say, and rectify the chaos into consistent output:
$ gawk 'BEGIN {FS = "[:;,]"} ; {print $1 " " $2 " " $3 " " $4}' # Set FS to a regex.
11:22;33.3,44 # Feral input, typed in to test.
11 22 33.3 44 # Proof of correct handling of that.
There isn’t any need to do any work then, is there?
The Addison-Wesley book on Awk, written by the language’s authors, is a concise goldmine, if still in print.
The PDF “GAWK: Effective AWK Programming” is also very handy, even for the proficient.
It takes time to become effective in any language, and textbook leveraged practice is the best teacher.