Linux Fu: Literate Regular Expressions

Regular expressions — the things you feed to programs like grep — are a bit like riding a bike. It seems impossible until you learn to do it, and then it’s easy. Part of their bad reputation is because they use a very concise and abbreviated syntax that alarms people. To help people who don’t use regular expressions every day, I created a tool that lets you write them in something a little closer to plain English. Actually, I’ve written several versions of this over the years, but this incarnation that targets grep is the latest. Unlike some previous versions, this time I did it all using Bash.

Those who don’t know regular expressions might freak out when they see something like:

[0-9]{5}(-[0-9]{4})?

How long does it take to figure out what that does? What if you could write that in a more literate way? For example:

digit repeat 5 \
start_group \
   - digit repeat 4 \
end_group optional

Not as fast to type, sure. But you can probably deduce what it does: it matches US ZIP codes.
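
In fact, if you feed those same tokens to the regx script described below, you should get back something very close to the original pattern (the exact escaping of the literal dash may vary):

regx digit repeat 5 start_group - digit repeat 4 end_group optional

prints something like:

[0-9]{5}(-[0-9]{4})?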

I’ve found that some of the most popular tools I’ve created over the years are ones that I don’t need myself. I’m sure you’ve had that experience, too. You know how to operate a computer, but you create a menu system for people who don’t and they love it. That’s how it is with this tool. You might not need it, but there’s a good chance you know someone who does. Along the way, the code uses some interesting features of Bash, so even if you don’t want to be verbose with your regular expressions, you might pick up a trick or two.

Tower of Babel

One of the problems is that there isn't a single form of regular expressions. Every tool has a slightly different flavor with different rules and extensions. For the purposes of this tool, I'm targeting egrep, although much of it will work in other systems, too. Once you have the idea, it would be easy to extend this for different flavors of regular expressions.

Even grep has some uncommon regular expression elements, so I’m only going to work with a subset of patterns, but they are the ones you tend to use the most. It’s easy to add more exotic ones or even macros that contain multiple regular expression patterns if you decide you want to extend the program.

Tool Chest

There are a few things that are important in our quest for literate regular expressions. The idea is to have a small program that converts our literate text into a regular expression. We can naturally combine this with grep or any tool that needs a regular expression:

egrep $(regx start space zero_or_more digit repeat 5)

The $(...) construct runs the command inside it, and whatever that command writes out is placed on the command line. So, for example:

for I in $( mount | cut -d ' ' -f 3 )
do
   echo $I
   if [ -f "$I/mountinfo.txt" ]
   then
     cat "$I/mountinfo.txt"
   fi
done

This contrived example selects every mount point from the mount command and tries to locate and display the mountinfo.txt file.

So the key is to build a regx script that can convert our verbose syntax into regular expressions and then use $() to insert the patterns into the command line.

Another odd Bash tool used a bit in these scripts is pattern substitution in parameter expansion. For example, if $1="Hackanight", then ${1/night/day} will give you Hackaday.
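
In isolation, that substitution looks like this (the function name is just for illustration):

show_rename () {
   echo "${1/night/day}"   # replace the first match of "night" with "day"
}
show_rename Hackanight     # prints Hackaday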

Quoting

Another tool isn't really necessary for the regx command, but I wanted to build something you can use instead of employing the $() notation with grep. The problem is that you have a script getting arguments and then passing them to another program. When those arguments can contain spaces, you potentially have a problem.

If script A has $1="Hack A Day", you can assume the command line used quotes or backslashes to keep that together as one string. But passing it to another program could strip the quotes, resulting in the other program seeing three different arguments. In this case, you could pass "$1" and that would be fine. But it isn't always that simple.

To make litgrep work, you need to know about the Bash shell expansion that quotes a value so the shell can read it again:

VAR="${1@Q}"

In our previous example, VAR would now equal 'Hack A Day' (including the single quotes).
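
A quick throwaway function (the name is mine, not part of the scripts) shows the round trip:

quote_demo () {
   VAR="${1@Q}"
   echo "$VAR"          # prints 'Hack A Day', single quotes and all
   eval "set -- $VAR"   # eval re-reads the quoted value...
   echo $#              # ...so this prints 1, not 3
}
quote_demo "Hack A Day"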

Why?

Why is this important? Because litgrep will pick off command line arguments and send them to regx. If you have a space in the middle of an argument, it needs to pass as a whole to regx.

Here’s an example:

litgrep Hack space a space Day space optional -- *.txt
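
Give or take the escaping regx applies to literals, that ends up behaving roughly like:

egrep "$(regx Hack space a space Day space optional)" *.txt

except that litgrep takes care of quoting each token for you.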

The regx Script

The regx script itself is pretty simple. There are two functions to escape characters because so many special characters are present in a regular expression. The reesc function escapes backslash characters along with other metacharacters. Inside a class (that is, square brackets) there isn’t much quoting. You generally have to arrange the expression correctly. For example, to build a character class that has a dash, it needs to come first or last. I didn’t attempt to rearrange your class, but you could do that in the placeholder reescclass function. You could also use it for some other regular expression variants that have more escaping options.
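
To give a flavor of the idea, here's one way such an escaping function could look. This is a sketch in the same spirit, not the actual reesc from the script, and it leans on sed rather than pure Bash:

reesc_sketch () {
   # backslash-escape the usual ERE metacharacters in a literal string
   printf '%s' "$1" | sed 's/[][\\.^$*+?(){}|]/\\&/g'
}

reesc_sketch 'costs $5 (maybe)'   # prints costs \$5 \(maybe\)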

There are three broad groups of patterns. The majority take no arguments, like any_char (.) or end ($). The script uses shift to move these out of the way after processing.

The other groups take one or two arguments, such as repeat or range. Those commands do extra shifts to dispose of their arguments. Once you have the definitions, the script is almost anticlimactic.
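
Boiled down, and with the real escaping and error checking left out, the dispatch looks something like this sketch (the token names mirror the article; everything else is illustrative):

OUT=""
while [ $# -gt 0 ]
do
   case "$1" in
      digit)       OUT+="[0-9]" ;;            # no-argument patterns just append
      any_char)    OUT+="." ;;
      start)       OUT+="^" ;;
      end)         OUT+='$' ;;
      optional)    OUT+="?" ;;
      start_group) OUT+="(" ;;
      end_group)   OUT+=")" ;;
      repeat)      OUT+="{$2}" ; shift ;;      # one argument: one extra shift
      range)       OUT+="{$2,$3}" ; shift 2 ;; # two arguments: two extra shifts
      *)           OUT+="$1" ;;                # a literal (the real script escapes these)
   esac
   shift
done
echo "$OUT"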

The litgrep Script

The litgrep program is a bit more difficult to follow because it has to ensure that spaces are handled correctly. The script pulls arguments out until it reads -- as an argument, and the rest of the command line goes to grep. That is, you can include grep arguments and file names after the --. If you omit the --, then grep will read from standard input, the same as if you put the -- with no file arguments after it.

The ${1@Q} syntax, as described above, makes sure the arguments are quoted properly. Then using eval when setting RELIST puts it back together in the right format to send to egrep.
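
Put together, the core of the idea looks roughly like this simplified sketch; the real litgrep does more bookkeeping, but the ${1@Q} and eval dance is the point:

ARGS=""
while [ $# -gt 0 ] && [ "$1" != "--" ]
do
   ARGS+=" ${1@Q}"              # re-quote each literate token, embedded spaces included
   shift
done
[ "$1" = "--" ] && shift        # drop the separator if it is there
eval "RELIST=\$(regx $ARGS)"    # eval re-reads the quoting so regx sees whole arguments
egrep "$RELIST" "$@"            # whatever is left goes straight to egrep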

Motivation

I have had versions of this tool floating around for years. My original version was in C++, and there's been at least one Python version inspired by it.

A tool like this is certainly handy if you don’t know regular expressions. But, honestly, you should really learn regular expressions. If you want a quick start, there’s a Linux Fu post for that. Or, take your chances and let a program infer your regular expression from a data set.

20 thoughts on “Linux Fu: Literate Regular Expressions”

  1. It happens that I am reading this while taking a short smoke-break from developing some software which heavily uses regex. So I wanted to share https://regex101.com/ which I find very useful for fiddling with regexes. It also “verbalizes” the regex, but it doesn’t let you write them that way, though.

  2. And when you’ve mastered riding the bike on your favorite coding platform or OS and move on to a new platform, you discover the bike now has five wheels of seemingly random orientation, a folding, hinging frame that has to be configured a certain way, and no seat or handlebars.

  3. I wouldn’t see myself using this in shell, but this could be great for object-oriented languages (e.g. Java, .NET, even Node.js, hell, why not…)
    This kind of syntax lends itself quite well to a fluent/builder/monad-style syntax, that would look something like this:
    Regex r = LiteralRegex.builder()
    .any(digit()).repeat(5)
    .group()
    .any(digit()).repeat(4)
    .endGroup()
    .optional()
    .build();
    It would maybe allow developers who never got around to learning regex to use them without knowing the regex syntax.
    They would benefit from completion, and the builder would ensure no syntax error is possible.
    I wonder if anyone has already done something similar?

  4. It took me a couple of years and a few attempts to get my head around regexes, but now I use them daily at work as a developer. I’m still learning new tricks with regexes. Recently I learned how to use negative lookahead. The search program that I wanted to use it with didn’t support it in its free version. Luckily grep did the trick as well.

    Just build a regex up step by step, with a regex tester or Notepad++ for example.

    Example uses:
    – Search in PHP code for SQL injection vulnerabilities in queries by searching for non-quoted, non-escaped $_GET[] user input. Or search for IDs that aren’t converted to int explicitly.
    – Convert logs to csv

    If you’ve got the basics down, it’s fun to play some regex crossword puzzles to keep your newly acquired or even your advanced skills sharp. For example:
    https://regexcrossword.com/

  5. Interesting work no doubt, but it could be an abstraction that is too simple in some ways and not simple enough in others. While it may help decipher a complex regex you didn’t create or one that is too complex to easily grok within a few seconds, it doesn’t help with simpler regexes, as the syntax is not literate per se. If you can understand commands like group() and the order of match operations well enough to read these, then regular regex syntax is no worse for simple patterns. For example, how is this any better than tracking parentheses for nested groups, memory groups, etc.?

    Again, interesting work. Interested to see next version.

  6. These might be easier to understand, but they’re not any easier to write. You will still need to understand the full syntax and the subtle consequences of grouping, greediness, back-referencing, etc… in order to write expressions this way. And at that point, you will almost certainly find the traditional way more pleasant.

    So what we’re talking about here is a regular expression language that is more readable – not completely readable – by people who don’t understand regular expressions. Considering how much trouble you can get in using regular expressions improperly, I can’t help but wonder if this is actually a good thing?

  7. The biggest nuisance when writing a regex translator would be the variety of regex dialects out there.
    My approach to minimising the effort of regexing for various tools is to always use Extended Regular Expressions (EREs), rather than Basic Regular Expressions (BREs), which “$ man 7 regex” will show are obsolete. That is enabled by using egrep or grep -E instead of bare grep. The regexes are then consistent across egrep, pgrep, awk, mutt, lex, etc.

    The one sad laggard is still vim, which is plagued by a weird variety of regex dialects. The one which approximates POSIX ERE isn’t quite right.

    Apart from standardisation, a big advantage of an ERE is that it is much easier to read than a BRE, which is most often plagued by a snowstorm of backslashes.
