13,000 Regular Expressions Make An Editor’s Life Easier

Being an editor is a job that seems deceptively easy until you are hauled over the coals for letting a textual howler go to print (or onto a website). Most publications have style guides to ensure that their individual voice is preserved, but even the most eagle-eyed editor will sometimes slip up in applying them. At the Guardian newspaper in the UK, staff have been struggling with this against an ever-evolving style guide that must adapt to fast-moving world events, to the extent that they had a set of regular expressions to deal with commonly occurring problems. A lot of regular expressions, in fact: around 13,000 of them.

Clearly some form of management was required, and a team of developers set about taming this monster. The result is Typerighter, their server-side document-checker, which can be found in a GitHub repository. Surprisingly, for rule management they started with a Google Sheet, a choice which proved unexpectedly robust when working with such a long list, even though they later replaced it. The back end doing the job of text matching was written in Scala, and for the front end a plugin was created for their ProseMirror text editor.

For a publication of course this is extremely interesting, but where’s the interest for hackers? The answer lies in any text-processing engine that uses a lot of regular expressions; those of you who have dabbled in this space will know how unwieldy this work can become. Any user of computational linguistic techniques in the pursuit of language processing could probably find much of interest here.
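To give a feel for what a rule-driven checker involves, here is a minimal sketch in Python. It is not Typerighter's code (their back end is written in Scala), and the `Rule` class, the `check` function and the three example rules are all made up for illustration. The idea is simply to keep the rules as data, run each pattern over the text, and report every hit with its position and a suggested fix.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One style rule: a compiled pattern plus advice for the editor."""
    pattern: re.Pattern
    suggestion: str
    description: str


# Hypothetical rules, invented for this sketch. A real rule set would be
# loaded from wherever the rules are managed rather than hard-coded.
RULES = [
    Rule(re.compile(r"\bteh\b", re.IGNORECASE), "the", "common typo"),
    Rule(re.compile(r"\bcommonly-occurring\b"), "commonly occurring",
         "no hyphen after an -ly adverb"),
    Rule(re.compile(r"\b(\w+) \1\b", re.IGNORECASE), "",
         "repeated word"),
]


def check(text, rules=RULES):
    """Return (start, end, matched_text, rule) tuples, ordered by position."""
    matches = []
    for rule in rules:
        for m in rule.pattern.finditer(text):
            matches.append((m.start(), m.end(), m.group(0), rule))
    # Sort by start offset so results read in document order.
    matches.sort(key=lambda hit: hit[0])
    return matches


if __name__ == "__main__":
    sample = "You want to do do them in teh right way."
    for start, end, found, rule in check(sample):
        print(f"{start}-{end}: '{found}' ({rule.description})")
```

With three rules a plain loop is fine. With thousands, the cost of running every pattern over every document, and of keeping the rules themselves organised, is where the real engineering effort goes.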

If you’re a bit hazy on regular expressions, how about the episode on them from our long-running Linux-fu series?

41 thoughts on “13,000 Regular Expressions Make An Editor’s Life Easier”

    1. Trained on their previous content, yes that is a sound hypothesis. They may not have the skill to pull it off though, using other people’s code is very different to actually understanding the nature of the beast well enough to come up with something that is publication worthy.

    1. What’s the alternative? I do wish that regular expression behavior / features were more standard across implementations, but having a portable implementation of a feature rich search and replace tool set is eternally useful.

      1. After watching The Expanse and hearing the evolution of English into Belter language, I get a small sense of how you Brits must feel every time you see an American television program.

      1. That misses the point entirely; there are too many non-critical style rules and a lack of attention to basics. How did “USD-D” make it into a headline today, instead of the correct “USB-D” in the article below that headline? But, by golly, the maker’s handle was properly bracketed!

  1. The Guardian had a reputation in the UK for spelling mistakes, such that it was known in a satirical magazine as The Grauniad, which makes the concept of 13,000 regular expressions a bit ironic …

    1. It’s pretty rich that despite articles showing how much effort it takes for *major publications* to craft error-free articles on short deadlines that you still think a blog like HaD should be held to the same standard.

  2. Regular Expressions (RegEx, RegExp, or RE) have a reputation for obscurity and complexity. Actually RegEx is a good example of how being powerful usually means being complex. Also, it doesn’t help that adding RE capability to myriad other programming/scripting languages in a lazy way has obfuscated its use. REs used in one language will often not work in a different language. These instances of REs across languages even have a name, PCRE or Perl “Compatible” Regular Expressions. And that brings up Perl.[1] REs predate Perl, but today it is widely understood that REs live natively in the Perl-5 language. There is/was a Perl-6, but it is far removed from Perl-5 and was renamed Raku. Supposedly Perl-5 will live on as Perl-7 in the future. Perl, unofficially an acronym for “Practical Extraction and Report Language”, has been around in Linux/Unix for over 30 years and is considered an essential “glue” language in the Unices. Due to variability and incompatibility between various shells across the Unices, it is recommended to program system scripts in Perl for portability. Perl-5 is an interpreter, but it can be compiled. In my opinion, if you are going to do Regular Expressions properly you want to do them in Perl. Perl has an enormous repository of modules and scripts stored online in the Comprehensive Perl Archive Network (CPAN).[2] Perl, especially when coupled with Regular Expressions, is considered a “noisy” language. What you write and understand today in Perl will look like random “noise” written by an Alien if you come back to it a year later.

    References:

    1. Perl

    https://en.wikipedia.org/wiki/Perl

    https://www.perl.org/

    2. CPAN

    https://en.wikipedia.org/wiki/CPAN

    https://www.cpan.org/

    https://metacpan.org/

  3. Nice! We use regex all day long to help tidy up nasty-a$$ text exports from InDesign, Word etc for eBook production… I’m dreaming that there may be some interesting nuggets in the article (so speaks Andy of posts past, before he actually reads it) – the Ghost of Andy to come may well just visit this post tomorrow to smack myself upside the head in the reply :)
