Linux Fu: Globs Vs Regexp

I once asked a software developer at work how many times we called fork() in our code. I’ll admit, it was a very large project, but I expected the answer to be — at most — two digits. The developer came back and read off some number from a piece of paper that was in the millions. I told them there was no way we had millions of calls to fork() and, of course, we didn’t. The problem was the developer wasn’t clear on the difference between a regular expression and a glob.

Tools like grep use regular expressions to create search patterns. I might write [Hh]ack ?a ?[Dd]ay as a regular expression to match things like “HackaDay” and “Hack a day” and even “Hackaday” in grep, awk, or any of the many programming languages that support regular expressions.
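To make the difference concrete, here is a minimal Python sketch. It runs the regular expression above against a few sample strings, then treats the same pattern text, fork*, first as a regular expression and then as a glob. The word list and the fork* pattern are my own illustration; the excerpt doesn't say what the developer actually typed.

```python
import fnmatch
import re

# The regular expression from the text: each " ?" is an optional space.
pattern = re.compile(r"[Hh]ack ?a ?[Dd]ay")
samples = ["HackaDay", "Hack a day", "Hackaday", "Hack-a-Day"]
regex_hits = [s for s in samples if pattern.search(s)]

# Illustrative word list (an assumption, not from the article).
words = ["fork", "for", "forklift", "format"]

# As a regular expression, "fork*" means "for" followed by zero or more "k",
# so anything containing "for" matches.
as_regex = [w for w in words if re.search(r"fork*", w)]

# As a glob, "fork*" means the literal text "fork" followed by anything.
as_glob = [w for w in words if fnmatch.fnmatchcase(w, "fork*")]

print(regex_hits)  # the three spellings from the text; the hyphenated one fails
print(as_regex)    # all four words
print(as_glob)     # only "fork" and "forklift"
```

The same characters give very different answers depending on which kind of pattern the tool expects, and that is how a count can balloon.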

Continue reading “Linux Fu: Globs Vs Regexp”

13,000 Regular Expressions Make An Editor’s Life Easier

Being an editor is a job that seems deceptively easy until you are hauled over the coals for letting a textual howler go to print (or website). Most publications have style guides to ensure that their individual voice is preserved, but even the most eagle-eyed will sometimes slip up in their application. At the Guardian newspaper in the UK they have been struggling with this against an ever-evolving style guide that must adapt to fast-moving world events, to the extent that they had a set of regular expressions to deal with commonly-occurring problems. A lot of regular expressions, in fact around 13,000 of them.

Clearly some form of management was required, and a team of developers set about taming this monster. The result is Typerighter, their server-side document-checker, which can be found in a GitHub repository. Surprisingly, for rule management they started with a Google Sheet, a choice which proved unexpectedly robust when working with such a long list, even though they later replaced it. The back end doing the job of text matching was written in Scala, and for the front end a plugin was created for their ProseMirror text editor.

For a publication, of course, this is extremely interesting, but where’s the interest for hackers? The answer lies in any text-processing engine that uses a lot of regular expressions; those of you who have dabbled in this space will know how unwieldy this work can become. Anyone using computational linguistic techniques for language processing could probably find much of interest here.

If you’re a bit hazy on regular expressions, how about the episode on them from our long-running Linux-fu series?

DIY Regular Expressions

In the Star Wars universe, not everyone uses a lightsaber, and those who do wield them had to build them themselves. There’s something to be said about that strategy. Building a car or a radio is a great way to learn how those things work. That’s what [Low Level JavaScript] points out about regular expressions. Sure, a lot of people think they are scary. So why not write your own regular expression parser and engine? Get that under your belt and you’ll probably never fear another regular expression.

Of course, most of us probably won’t do it ourselves, but you can still watch the process in the video below. The code is surprisingly short, but don’t expect all the bells and whistles you might find in Python or even Perl.
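For a taste of how small such an engine can be, here is a Python sketch in the style of the well-known tiny matcher by Rob Pike. This is not the code from the video, which is in JavaScript and includes a parser. It only understands literal characters, ".", "*", "^", and "$".

```python
def match(regex: str, text: str) -> bool:
    """Return True if regex matches anywhere in text."""
    if regex.startswith("^"):
        return match_here(regex[1:], text)
    # Try every starting position, including the empty tail.
    return any(match_here(regex, text[i:]) for i in range(len(text) + 1))


def match_here(regex: str, text: str) -> bool:
    """Return True if regex matches at the very start of text."""
    if not regex:
        return True
    if len(regex) >= 2 and regex[1] == "*":
        return match_star(regex[0], regex[2:], text)
    if regex == "$":
        return text == ""
    if text and regex[0] in (".", text[0]):
        return match_here(regex[1:], text[1:])
    return False


def match_star(c: str, regex: str, text: str) -> bool:
    """Match zero or more of character c, then the rest of regex."""
    while True:
        if match_here(regex, text):
            return True
        if not text or (c != "." and c != text[0]):
            return False
        text = text[1:]  # consume one more occurrence of c


print(match("^ab*c$", "abbbc"))
print(match("a.c", "xxabcxx"))
print(match("^hack", "a hack"))
```

There are no character classes, groups, or alternation here, which is exactly the kind of bell and whistle a real engine adds on top.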

Continue reading “DIY Regular Expressions”

Linux-Fu: Making AWK A Bit Easier

awk is a kind of Swiss Army knife for text files. However, some of its limitations are often a bit annoying. I’ve used a simple set of functions to make awk a bit better, although I will warn you: it does require GNU extensions to awk. That is, you must use gawk and not other versions. Your system probably maps /usr/bin/awk to something and that something might be gawk. But it could also be mawk or some other flavor. If you use a Debian-based distro, update-alternatives is your friend here. But for the purposes of this post, I’m going to assume you are using gawk.

By the end of the post, you’ll see how to use my awk add-on functions to split up a line into fields even when there is no single character to separate all fields. In addition, you’ll be able to refer to the fields using names you decide. You won’t have to remember that $2 is the time field. You’ll say Fields_fields["time"] instead.

The Problem

awk does a lot of common work for you when you use it to process text files. It reads files a record at a time. Normally, a record is a single line. Then it splits the line into fields using whitespace, or some other choice of field separator. You can write code that manipulates the line or individual fields. This default behavior is great, especially since you can change the end of record character and the field separator. A surprising number of files fit this sort of format.
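If you haven't used awk, the record-and-field model is easy to picture in a few lines of Python. This is only a rough analogy, not how awk is implemented; the function names and the sample log text are made up for illustration, and the rs and fs parameters stand in for awk's RS and FS variables.

```python
def records(text, rs="\n"):
    """Yield non-empty records, split on the record separator (like awk's RS)."""
    for rec in text.split(rs):
        if rec:
            yield rec


def fields(record, fs=None):
    """Split a record into fields.

    fs=None splits on runs of whitespace, similar to awk's default FS.
    """
    return record.split(fs)


# Made-up sample data with irregular spacing.
log = "09:15 pump on\n09:20  pump   off\n"
rows = [fields(r) for r in records(log)]

# A different field separator, as you might set with awk -F,
csv_row = fields("09:15,pump,on", fs=",")

print(rows)
print(csv_row)
```

Note that Python lists start at 0, so rows[0][0] plays the role of $1 on the first record.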

Until, of course, they don’t. If you have data coming from a data logging instrument or some database, it could be formatted in a variety of ways. Some fields might have structured data with a variety of separators. This isn’t a deal-breaker. Since you can get at the whole line, you can do almost anything you want, but the logic is harder and the whole point of using awk is to make things easier.

For example, suppose you had a file from a data recorder that had an eight-digit serial number, followed by a six-character tag, and then two floating point numbers separated by a comma. The pattern might look like

^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$

This would be hard to handle with the conventional field splitting and you’d normally just write code to split everything apart.
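To show the idea of pulling such a record apart by pattern and then referring to the pieces by name, here is a Python sketch. It is not the gawk add-on from the post. It uses the pattern exactly as printed above, and the field names (serial, tag, x, y) and the sample record are hypothetical.

```python
import re

# The pattern as printed above, one capture group per field.
RECORD = re.compile(r"^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$")

# Hypothetical names for the four groups, in order.
NAMES = ("serial", "tag", "x", "y")


def parse(line):
    """Return a dict of named fields, or None if the line doesn't fit."""
    m = RECORD.match(line)
    if m is None:
        return None
    rec = dict(zip(NAMES, m.groups()))
    # The pattern only checks characters, so float() could still reject
    # something odd like "+-.."; a real tool would handle that case.
    rec["x"] = float(rec["x"])
    rec["y"] = float(rec["y"])
    return rec


# Made-up sample record: serial, tag, then two numbers.
print(parse("12345678ABC123+1.5,-2.25"))
print(parse("not a record"))
```

There is no separator between the serial number and the tag, so ordinary field splitting has nothing to split on; the fixed widths in the pattern do that work instead.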

Continue reading “Linux-Fu: Making AWK A Bit Easier”

Linux Fu: Literate Regular Expressions

Regular expressions — the things you feed to programs like grep — are a bit like riding a bike. It seems impossible until you learn to do it, and then it’s easy. Part of their bad reputation is because they use a very concise and abbreviated syntax that alarms people. To help people who don’t use regular expressions every day, I created a tool that lets you write them in something a little closer to plain English. Actually, I’ve written several versions of this over the years, but this incarnation that targets grep is the latest. Unlike some previous versions, this time I did it all using Bash.

Those who don’t know regular expressions might freak out when they see something like:

[0-9]{5}(-[0-9]{4})?

How long does it take to figure out what that does? What if you could write that in a more literate way? For example:

digit repeat 5 \
start_group \
   - digit repeat 4 \
end_group optional

Not as fast to type, sure. But you can probably deduce what it does: it matches US ZIP codes, either five digits or five digits followed by a hyphen and four more.
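For comparison, some languages have their own way to make a pattern more readable. The sketch below is not the Bash tool described here. It writes the same ZIP code expression using Python's re.VERBOSE flag, with the literate words as comments, and tries it on a few made-up inputs.

```python
import re

# Same expression as [0-9]{5}(-[0-9]{4})? but spread out with comments.
ZIP = re.compile(
    r"""
    [0-9]{5}        # digit repeat 5
    (               # start_group
      -[0-9]{4}     #   a hyphen, then digit repeat 4
    )?              # end_group optional
    """,
    re.VERBOSE,
)


def is_zip(text):
    """True only if the whole string is a ZIP or ZIP+4 code."""
    return ZIP.fullmatch(text) is not None


for candidate in ["12345", "12345-6789", "1234", "12345-678"]:
    print(candidate, is_zip(candidate))
```

Using fullmatch matters: grep-style searching would happily find a five-digit run inside a longer string.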

I’ve found that some of the most popular tools I’ve created over the years are ones that I don’t need myself. I’m sure you’ve had that experience, too. You know how to operate a computer, but you create a menu system for people who don’t and they love it. That’s how it is with this tool. You might not need it, but there’s a good chance you know someone who does. Along the way, the code uses some interesting features of Bash, so even if you don’t want to be verbose with your regular expressions, you might pick up a trick or two.

Continue reading “Linux Fu: Literate Regular Expressions”

Program Guesses Your Regular Expression

We aren’t sure how we feel about [pemistahl’s] grex program. On the one hand, we applaud a program that can take some input samples and produce a regular expression. On the other hand, it might be just as hard to gather example data that produces the correct regular expression. Still, it is an interesting piece of code.

Even the author warns against using this as an excuse not to learn regular expressions, since you’ll need to check the program’s output. The result is certain to match your test cases, but there is no guarantee that it won’t also accept things you didn’t expect. Bad regular expressions have been the source of some deeply buried bugs.
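Here is a small Python sketch of that risk. The two patterns are written by hand for illustration and are not output from grex; the sample words and the "unexpected" inputs are made up as well.

```python
import re

# Made-up test cases a pattern is supposed to accept.
samples = ["cat", "car", "can"]

# Two hand-written candidates (not grex output) that both match every sample.
loose = re.compile(r"ca.")
tight = re.compile(r"ca[tnr]")

# Inputs nobody thought to put in the test cases.
unexpected = ["ca!", "ca9"]

loose_ok = all(loose.fullmatch(s) for s in samples)
tight_ok = all(tight.fullmatch(s) for s in samples)
loose_leaks = [s for s in unexpected if loose.fullmatch(s)]
tight_leaks = [s for s in unexpected if tight.fullmatch(s)]

print(loose_ok, tight_ok)  # both patterns pass the test cases
print(loose_leaks)         # the loose one also accepts the surprises
print(tight_leaks)         # the tight one accepts neither
```

Passing every sample tells you nothing about the inputs you didn't list, so it pays to test a generated expression against things that should fail, too.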

Continue reading “Program Guesses Your Regular Expression”

Linux Fu: Regular Expressions

If you consider yourself a good cook, you may or may not know how to make a souffle or baklava. But there are certain things you probably do know how to do that form the basis of many recipes. For example, you can probably boil water, crack an egg, and brown meat. With Linux or Unix systems, you can make the same observation. You might not know how to set up a Wayland server or write a kernel module. But there are certain core skills like file manipulation and editing that will serve you no matter what you do. One of the pervasive skills that often gives people trouble is regular expressions. Many programs use these as a way to specify search patterns, usually to find text in strings or in the contents of files.

If you aren’t comfortable with regular expressions, that’s easy to fix. They aren’t that hard to learn and there are some great tools to help you. Many tools use regular expressions and the core syntax is the same. The source of confusion is that the details beyond core syntax have variations.

Let’s look at the foundation you need to understand regular expressions well.

Continue reading “Linux Fu: Regular Expressions”