Regular Expressions Finally Come To Microsoft Excel

There are two types of people in the world: those who have no idea what a regular expression is, and those who not only know what they are but can compose them on the fly and tend to use them in situations where they’re clearly not called for. And it’s that latter camp, of which we consider ourself a proud member, that is rejoicing with the announcement that Microsoft is adding regular expression support to Excel.

Or perhaps not rejoicing so much as wondering what took so long. Yes, regular expressions have been part of VBA for a while now, but the new functions allow you to use regexes right in the spreadsheet grid. There are plenty of caveats, of course. The big one is that this is still in beta at this time, so you have to do some gymnastics to enable it, if you’re even allowed to in the first place. Second, support appears limited to three functions at the moment: REGEXTEST, which provides a logical test of pattern matching; REGEXEXTRACT, which returns a substring that matches a pattern; and REGEXREPLACE, which substitutes a string for a pattern. The video below walks through how to use these functions within spreadsheets.

What’s also unclear now is what flavor of regular expressions is supported. There are a bewildering number of entities in the regex bestiary — character classes, positional indicators, quantifiers, subexpressions, lazy and greedy matches, and a range of grouping constructs that perplex even regex pros. One hopes these new functions will support one of the existing regex standards, but Microsoft is famous for “extending and enhancing.” Then again, regex support has been in the .NET Framework for years and is pretty close to the Perl standard, so our guess is that it’ll be close to that.

If you fall into the “What’s a regex?” camp but want to change that, why not get your grep on?

Continue reading “Regular Expressions Finally Come To Microsoft Excel”

Linux Fu: Globs Vs Regexp

I once asked a software developer at work how many times we called fork() in our code. I’ll admit, it was a very large project, but I expected the answer to be — at most — two digits. The developer came back and read off some number from a piece of paper that was in the millions. I told them there was no way we had millions of calls to fork() and, of course, we didn’t. The problem was the developer wasn’t clear on the difference between a regular expression and a glob.

Tools like grep use regular expressions to create search patterns. I might write [Hh]ack ?a ?[Dd]ay as a regular expression to match things like “HackaDay” and “Hack a day” and, even, “Hackaday” using a tool like grep, awk, or many programming languages.

Continue reading “Linux Fu: Globs Vs Regexp”

13,000 Regular Expressions Make An Editor’s Life Easier

Being an editor is a job that seems deceptively easy until you are hauled over the coals for letting a textual howler go to print (or website). Most publications have style guides to ensure that their individual voice is preserved, but even the most eagle-eyed will sometimes slip up in their application. At the Guardian newspaper in the UK they have been struggling with this against an ever-evolving style guide that must adapt to fast-moving world events, to the extent that they had a set of regular expressions to deal with commonly-occurring problems. A lot of regular expressions, in fact around 13,000 of them.

Clearly some form of management was required, and  a team of developers set about taming this monster. The result is Typerighter, their server-side document-checker, which can be found in a GitHub repository. Surprisingly for rule management they started with a Google Sheet, a choice which proved unexpectedly robust when working with such a long list even though they later replaced it. The back end doing the job of text matching was written in Scala, and for the front end a plugin was created for their Prosemirror text editor.

For a publication of course this is extremely interesting, but where’s the interest for hackers? The answer lies in any text-processing engine that uses a lot of regular expressions; those of you who have dabbled in this space will know how unwieldy this work can become. Any user of computational linguistic techniques in the pursuit of language processing could probably find much of interest here.

If you’re a bit hazy on regular expressions, how about the episode on them from our long-running Linux-fu series?

DIY Regular Expressions

In the Star Wars universe, not everyone uses a lightsaber, and those who do wield them had to build them themselves. There’s something to be said about that strategy. Building a car or a radio is a great way to learn how those things work. That’s what [Low Level JavaScript] points out about regular expressions. Sure, a lot of people think they are scary. So why not write your own regular expression parser and engine? Get that under your belt and you’ll probably never fear another regular expression.

Of course, most of us probably won’t do it ourselves, but you can still watch the process in the video below. The code is surprisingly short, but don’t expect all the bells and whistles you might find in Python or even Perl.

Continue reading “DIY Regular Expressions”

Linux-Fu: Making AWK A Bit Easier

awk is a kind of Swiss Army knife for text files. However, some of its limitations are often a bit annoying. I’ve used a simple set of functions to make awk a bit better, although I will warn you: it does require GNU extensions to awk. That is, you must use gawk and not other versions. Your system probably maps /usr/bin/awk to something and that something might be gawk. But it could also be mawk or some other flavor. If you use a Debian-based distro, update-alternatives is your friend here. But for the purposes of this post, I’m going to assume you are using gawk.

By the end of the post, you’ll see how to use my awk add-on functions to split up a line into fields even when there is no single character to separate all fields. In addition, you’ll be able to refer to the fields using names you decide. You won’t have to remember that $2 is the time field. You’ll say Fields_fields["time"] instead.

The Problem

awk does a lot of common work for you when you use it to process text files. It reads files a record at a time. Normally, a record is a single line. Then it splits the line on fields using whitespace, or some other choice of field separators. You can write code that manipulates the line or individual fields. This default behavior is great, especially since you can change the end of record character and the field separator. A surprising number of files fit this sort of format.

Until, of course, they don’t. If you have data coming from a data logging instrument or some database, it could be formatted in a variety of ways. Some fields might have structured data with a variety of separators. This isn’t a deal-breaker. Since you can get at the whole line, you can do almost anything you want, but the logic is harder and the whole point to using awk is to make things easier.

For example, suppose you had a file from a data recorder that had an eight-digit serial number, followed by a six-character tag, and then two floating point numbers separated by colons. The pattern might look like

^([0-9]{8})([a-zA-Z0-9]{6})([-+.0-9]+),([-+.0-9]+)$

This would be hard to handle with the conventional field splitting and you’d normally just write code to split everything apart.

Continue reading “Linux-Fu: Making AWK A Bit Easier”

Linux Fu: Literate Regular Expressions

Regular expressions — the things you feed to programs like grep — are a bit like riding a bike. It seems impossible until you learn to do it, and then it’s easy. Part of their bad reputation is because they use a very concise and abbreviated syntax that alarms people. To help people who don’t use regular expressions every day, I created a tool that lets you write them in something a little closer to plain English. Actually, I’ve written several versions of this over the years, but this incarnation that targets grep is the latest. Unlike some previous versions, this time I did it all using Bash.

Those who don’t know regular expressions might freak out when they see something like:

[0-9]{5}(-[0-9]{4})?

How long does it take to figure out what that does? What if you could write that in a more literate way? For example:

digit repeat 5 \

start_group \

   - digit repeat 4 \

end_group optional

Not as fast to type, sure. But you can probably deduce what it does: it reads US Zipcodes.

I’ve found that some of the most popular tools I’ve created over the years are ones that I don’t need myself. I’m sure you’ve had that experience, too. You know how to operate a computer, but you create a menu system for people who don’t and they love it. That’s how it is with this tool. You might not need it, but there’s a good chance you know someone who does. Along the way, the code uses some interesting features of Bash, so even if you don’t want to be verbose with your regular expressions, you might pick up a trick or two.

Continue reading “Linux Fu: Literate Regular Expressions”

Program Guesses Your Regular Expression

We aren’t sure how we feel about [pemistahl’s] grex program. On the one hand, we applaud a program that can take some input samples and produce a regular expression. On the other hand, it might be just as hard to gather example data that produces the correct regular expression. Still, it is an interesting piece of code.

Even the author suggests not to use this as an excuse to not learn regular expressions, since you’ll need to check the program’s output. It is certain that the results will match your test cases, but it isn’t certain that it won’t accept things you didn’t expect. Bad regular expressions have been the source of some deeply buried bugs.

Continue reading “Program Guesses Your Regular Expression”