13,000 Regular Expressions Make An Editor’s Life Easier

February 1, 2021

Being an editor is a job that seems deceptively easy until you are hauled over the coals for letting a textual howler go to print (or onto a website). Most publications have style guides to ensure that their individual voice is preserved, but even the most eagle-eyed editor will sometimes slip up in applying them. At the Guardian newspaper in the UK they have been struggling with this against an ever-evolving style guide that must adapt to fast-moving world events, to the extent that they had a set of regular expressions to deal with commonly occurring problems. A lot of regular expressions, in fact: around 13,000 of them.

Clearly some form of management was required, and a team of developers set about taming this monster. The result is Typerighter, their server-side document checker, which can be found in a GitHub repository. Surprisingly, for rule management they started with a Google Sheet, a choice which proved unexpectedly robust when working with such a long list, even though they later replaced it. The back end doing the job of text matching was written in Scala, and for the front end a plugin was created for their ProseMirror text editor.
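To make the idea concrete, here is a minimal sketch in Python of a rule-table checker of the kind described above. It is not Typerighter's code (the real back end is written in Scala); the `Rule` and `Match` fields, the two sample rules, and the `check` function are all hypothetical, chosen only to show how rows from a spreadsheet might drive a regex matcher.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One row of a hypothetical rule sheet (field names are assumptions)."""
    pattern: str      # regular expression to look for
    suggestion: str   # replacement text offered to the editor
    note: str         # human-readable reason for the rule


@dataclass(frozen=True)
class Match:
    """A single hit: where it was found, what was found, and what to suggest."""
    start: int
    end: int
    text: str
    suggestion: str
    note: str


# Illustrative rules only; they are not taken from the Guardian's style guide.
RULES = [
    Rule(r"\bwhom\s+ever\b", "whoever", "one word"),
    Rule(r"\bcommonly-occurring\b", "commonly occurring", "no hyphen after an -ly adverb"),
]


def check(text: str, rules=RULES) -> list[Match]:
    """Run every rule over the text and return the matches sorted by position."""
    found = []
    for rule in rules:
        for m in re.finditer(rule.pattern, text, flags=re.IGNORECASE):
            found.append(Match(m.start(), m.end(), m.group(0), rule.suggestion, rule.note))
    return sorted(found, key=lambda hit: hit.start)


if __name__ == "__main__":
    for hit in check("Whom ever wrote this made a commonly-occurring slip."):
        print(f"{hit.start}-{hit.end}: {hit.text!r} -> {hit.suggestion!r} ({hit.note})")
```

Keeping the rules as plain data rather than code is what would let non-developers maintain them in something like a spreadsheet; the matcher itself stays a few lines long however many rows there are.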

For a publication this is of course extremely interesting, but where’s the interest for hackers? The answer lies in any text-processing engine that uses a lot of regular expressions; those of you who have dabbled in this space will know how unwieldy this work can become. Anyone applying computational linguistics to language processing could probably find much of interest here.
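One way to keep a large rule set from becoming unwieldy is to have every rule carry an example it is supposed to match, and to check the whole set automatically. The sketch below is a generic Python illustration of that idea, not something taken from Typerighter; the `(rule_id, pattern, example)` layout and the `validate` function are assumptions made for the example.

```python
import re


def validate(rules):
    """Return a list of (rule_id, problem) pairs; an empty list means every rule passed.

    `rules` is an iterable of (rule_id, pattern, example) tuples, where `example`
    is a snippet the pattern is expected to match. This layout is an assumption.
    """
    problems = []
    for rule_id, pattern, example in rules:
        try:
            compiled = re.compile(pattern)
        except re.error as exc:
            # The pattern is not valid regex syntax at all.
            problems.append((rule_id, f"does not compile: {exc}"))
            continue
        if compiled.search(example) is None:
            # The pattern is valid but no longer catches what it was written for.
            problems.append((rule_id, "does not match its own example"))
    return problems
```

Run over the full rule list whenever a rule is added or edited, a check like this catches broken patterns before they reach the editor.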

If you’re a bit hazy on regular expressions, how about the episode on them from our long-running Linux-fu series?

41 thoughts on “13,000 Regular Expressions Make An Editor’s Life Easier”

Puoskari says:

February 1, 2021 at 1:27 am

What does the fu in Linux-fu mean? I always just read it as “Linux, f##ck you.”, but that just sounds wrong.

1. Elliot Williams says:
  
  February 1, 2021 at 1:32 am
  
  It’s the Fu in Kung Fu — the mastery.
  
2. BT says:
  
  February 1, 2021 at 2:12 am
  
  I often find when using regular expressions that your interpretation of fu is most appropriate.
  
3. x14km2d says:
  
  February 1, 2021 at 9:18 am
  
  https://www.quora.com/What-is-Linux-Fu?share=1
  
Stilmant Michael says:

February 1, 2021 at 1:55 am

I don’t understand that article?
is there any regular expression that could correct that?

1. volt-k says:
  
  February 1, 2021 at 3:44 am
  
  > I don’t understand that article?
  
  Is it a question or a statement?
  
  1. Old Guy says:
    
    February 1, 2021 at 3:50 am
    
    Yes.
    
    1. CRJEEA says:
      
      February 1, 2021 at 4:20 am
      
      I see what you did there. +1
      
2. Clara says:
  
  February 1, 2021 at 5:04 am
  
  Probably not. I’d wager that the language of articles you don’t understand isn’t regular.
  
Heidi Shepherd says:

February 1, 2021 at 3:06 am

Why not just use AI?

1. Ostracus says:
  
  February 1, 2021 at 4:10 am
  
  It would be an interesting challenge.
  
2. Jenny List says:
  
  February 1, 2021 at 8:43 am
  
  I’ve played with AI and language corpora. Garbage in, garbage out.
  
3. 𐂀 𐂅 says:
  
  February 1, 2021 at 11:42 am
  
  Trained on their previous content, yes that is a sound hypothesis. They may not have the skill to pull it off though, using other people’s code is very different to actually understanding the nature of the beast well enough to come up with something that is publication worthy.
  
4. pelrun says:
  
  February 1, 2021 at 8:58 pm
  
  Because they already know what the rules are.
  
Elliot Williams says:

February 1, 2021 at 4:03 am

Now you have 13,001 problems.

1. Gravis says:
  
  February 1, 2021 at 7:26 am
  
  The reference:
  https://imgs.xkcd.com/comics/perl_problems.png
  
  1. Elliot Williams says:
    
    February 1, 2021 at 1:42 pm
    
    The parent of both of these nodes: JWZ.
    
    http://regex.info/blog/2006-09-15/247
    
Rpol says:

February 1, 2021 at 4:25 am

Whom ever created regular expressions and those who allowed it to persist this long should be head smacked!

1. a guy says:
  
  February 1, 2021 at 5:31 am
  
  What’s the alternative? I do wish that regular expression behavior / features were more standard across implementations, but having a portable implementation of a feature rich search and replace tool set is eternally useful.
  
  1. pelrun says:
    
    February 1, 2021 at 9:09 pm
    
    Obviously the alternative is to go back to writing state machines in machine code by hand.
    
2. Mark S. says:
  
  February 1, 2021 at 6:18 am
  
  I respectfully disagree. Yes, they’re hard to grok (even after 40 years of coding I still struggle) but sometimes they are exactly what you need to get the job done.
  
3. ryanbarrett says:
  
  February 5, 2021 at 11:10 pm
  
  String.replace(/([W])\w+\s([e])\w+/g, 'Whoever');
  
Ridetheory says:

February 1, 2021 at 4:29 am

This is hilarious, because Private Eye magazine calls The Guardian “The Grauniad” due to its long history of typos. I spotted one just the other day.

1. targetdrone says:
  
  February 1, 2021 at 10:52 am
  
  Just came here to see if anyone else was searching for /Grauniad/ in their regex catalog. :-)
  
2. 𐂀 𐂅 says:
  
  February 1, 2021 at 11:43 am
  
  Here we tend to refer to it as The Garbagian, for obvious reasons.
  
joelfinkle says:

February 1, 2021 at 5:23 am

Aarghh, I wonder how many of the 13,000 would be different for US English than UK English?

1. BT says:
  
  February 1, 2021 at 6:25 am
  
  Ha ha that is a job I would not take on. “Two countries divided by a common language” as someone is supposed to have once said.
  
  1. Ren says:
    
    February 1, 2021 at 8:15 am
    
    Winston Churchill, I believe.
    
2. Jenny List says:
  
  February 1, 2021 at 8:44 am
  
  It’s simple. All that’s needed would be a set to correct the Americans :)
  
  1. targetdrone says:
    
    February 1, 2021 at 10:59 am
    
    After watching The Expanse and hearing the evolution of English into Belter language, I get a small sense of how you Brits must feel every time you see an American television program.
    
    1. John says:
      
      February 1, 2021 at 3:50 pm
      
      Funnily enough, I’m pretty sure belter is based on british english. It sounds funny to americans for that reason.
      
LAK says:

February 1, 2021 at 6:48 am

The Guardian should simplify their style guide so mere humans can do the work.

1. Jenny List says:
  
  February 1, 2021 at 8:45 am
  
  Not always easy when your subject matter is so broad.
  
  1. LAK says:
    
    February 3, 2021 at 7:19 am
    
    That misses the point entirely; there are too many non-critical style rules and a lack of attention to basics. How did “USD-D” make it into a headline today, instead of the correct “USB-D” in the article below that headline? But, by golly, the maker’s handle was properly bracketed!
    
    1. LAK says:
      
      February 3, 2021 at 7:23 am
      
      Clearly, I need a better editor! That’s “USD-PD” not “USD-D” or the proper “USB-PD”. 13,001 and counting up.
      
Alan Fleck says:

February 1, 2021 at 9:25 am

The Guardian had a reputation in the UK for spelling mistakes, such that it was known in a satirical magazine as The Grauniad, which makes the concept of 13,000 regular expressions a bit ironic …

punkdigerati says:

February 1, 2021 at 10:29 am

This is pretty rich on Hackaday, notorious for letting a myriad of mistakes make it to publishing.

1. pelrun says:
  
  February 1, 2021 at 9:07 pm
  
  It’s pretty rich that, despite articles showing how much effort it takes for *major publications* to craft error-free articles on short deadlines, you still think a blog like HaD should be held to the same standard.
  
Drone says:

February 1, 2021 at 3:03 pm

Regular Expressions (RegEx, RegExp, or RE) have a reputation for obscurity and complexity. Actually RegEx is a good example of how being powerful usually means being complex. Also, it doesn’t help that adding RE capability to myriad other programming/scripting languages in a lazy way has obfuscated its use. REs used in one language will often not work in a different language. The Perl-style REs found across languages even have a name, PCRE or Perl “Compatible” Regular Expressions. And that brings up Perl.[1]

REs predate Perl but today it is widely understood that REs live natively in the Perl-5 language. There is/was a Perl-6, but it is far removed from Perl-5 and was renamed Raku. Supposedly Perl-5 will live on as Perl-7 in the future. Perl, unofficially an acronym for “Practical Extraction and Report Language”, has been around in Linux/Unix for over 30 years and is considered an essential “glue” language in the Unices. Due to variability and incompatibility between various shells across the Unices, it is recommended to program system scripts in Perl for portability. Perl-5 is an interpreter, but it can be compiled.

In my opinion, if you are going to do Regular Expressions properly you want to do them in Perl. Perl has an enormous repository of modules and scripts stored online in the Comprehensive Perl Archive Network (CPAN).[2] Perl, especially when coupled with Regular Expressions, is considered a “noisy” language. What you write and understand today in Perl will look like random “noise” written by an Alien if you come back to it a year later.

References:

1. Perl

https://en.wikipedia.org/wiki/Perl

https://www.perl.org/

2. CPAN

https://en.wikipedia.org/wiki/CPAN

https://www.cpan.org/

https://metacpan.org/

&e7 says:

February 1, 2021 at 3:43 pm

Nice! We use regex all day long to help tidy up nasty-a$$ text exports from InDesign, Word etc for eBook production… I’m dreaming that there may be some interesting nuggets in the article (so speaks Andy of posts past, before he actually reads it) – the Ghost of Andy to come may well just visit this post tomorrow to smack myself upside the head in the reply :)

Gösta says:

February 2, 2021 at 10:43 am

I love regular expressions, use them when possible. Always been happy I invested time learning it, it has been the second best language investment I’ve made, after C.

