Being an editor is a job that seems deceptively easy until you are hauled over the coals for letting a textual howler go to print (or website). Most publications have style guides to ensure that their individual voice is preserved, but even the most eagle-eyed editors will sometimes slip up in applying them. At the Guardian newspaper in the UK, they have been struggling to keep up with an ever-evolving style guide that must adapt to fast-moving world events, to the extent that they accumulated a set of regular expressions to deal with commonly occurring problems. A lot of regular expressions, in fact: around 13,000 of them.
Clearly some form of management was required, and a team of developers set about taming this monster. The result is Typerighter, their server-side document-checker, which can be found in a GitHub repository. Surprisingly, for rule management they started with a Google Sheet, a choice which proved unexpectedly robust for such a long list, even though they later replaced it. The back end doing the job of text matching was written in Scala, and for the front end a plugin was created for their ProseMirror text editor.
For a publication of course this is extremely interesting, but where’s the interest for hackers? The answer lies in any text-processing engine that uses a lot of regular expressions; those of you who have dabbled in this space will know how unwieldy this work can become. Any user of computational linguistic techniques in the pursuit of language processing could probably find much of interest here.
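To give a flavour of why keeping the rules as data helps, here is a minimal sketch in Python. It is not Typerighter's implementation (which is written in Scala), and the `Rule` and `Match` classes, the `check` function, and the three example rules are all hypothetical, invented purely for illustration. The idea is simply that each rule is a row (an ID, a pattern, a suggestion), much like a row in a spreadsheet, and a single loop applies every row to the text.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """One style rule: what to look for and what to suggest instead."""
    rule_id: str
    pattern: "re.Pattern[str]"
    suggestion: str  # may contain backreferences such as \1


@dataclass(frozen=True)
class Match:
    """One place in the text where a rule fired."""
    rule_id: str
    start: int
    end: int
    found: str
    suggestion: str


# Hypothetical rules for illustration only; not taken from the Guardian's list.
RULES = [
    Rule("whom-ever", re.compile(r"\bwhom ever\b", re.IGNORECASE), "whoever"),
    Rule("usd-pd", re.compile(r"\bUSD-PD\b"), "USB-PD"),
    Rule("double-word", re.compile(r"\b(\w+) \1\b", re.IGNORECASE), r"\1"),
]


def check(text: str, rules: List[Rule] = RULES) -> List[Match]:
    """Run every rule over the text and return matches in document order."""
    matches = []
    for rule in rules:
        for m in rule.pattern.finditer(text):
            matches.append(
                Match(
                    rule_id=rule.rule_id,
                    start=m.start(),
                    end=m.end(),
                    found=m.group(0),
                    # expand() fills in any backreferences in the suggestion
                    suggestion=m.expand(rule.suggestion),
                )
            )
    return sorted(matches, key=lambda match: match.start)


if __name__ == "__main__":
    sample = "Whom ever wrote USD-PD should do do better."
    for match in check(sample):
        print(match)
```

Because the rules live in a plain list rather than being scattered through the code, adding rule 13,001 means adding a row, not touching the matcher. A real system would also need to deal with overlapping matches, rule priorities, and performance across thousands of patterns, none of which this sketch attempts.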
If you’re a bit hazy on regular expressions, how about the episode on them from our long-running Linux-fu series?
What does the fu in Linux-fu mean? I always just read it as “Linux, f##ck you.”, but that just sounds wrong.
It’s the Fu in Kung Fu — the mastery.
I often find when using regular expressions that your interpretation of fu is most appropriate.
https://www.quora.com/What-is-Linux-Fu?share=1
I don’t understand that article?
is there any regular expression that could correct that?
> I don’t understand that article?
Is it a question or a statement?
Yes.
I see what you did there. +1
Probably not. I’d wager that the language of articles you don’t understand isn’t regular.
Why not just use AI?
It would be an interesting challenge.
I’ve played with AI and language corpora. Garbage in, garbage out.
Trained on their previous content, yes, that is a sound hypothesis. They may not have the skill to pull it off though; using other people’s code is very different to actually understanding the nature of the beast well enough to come up with something that is publication-worthy.
Because they already know what the rules are.
Now you have 13,001 problems.
The reference:
https://imgs.xkcd.com/comics/perl_problems.png
The parent of both of these nodes: JWZ.
http://regex.info/blog/2006-09-15/247
Whom ever created regular expressions and those who allowed it to persist this long should be head smacked!
What’s the alternative? I do wish that regular expression behavior / features were more standard across implementations, but having a portable implementation of a feature rich search and replace tool set is eternally useful.
Obviously the alternative is to go back to writing state machines in machine code by hand.
I respectfully disagree. Yes, they’re hard to grok (even after 40 years of coding I still struggle) but sometimes they are exactly what you need to get the job done.
"Whom ever".replace(/([W])\w+\s([e])\w+/g, 'Whoever');
This is hilarious, because Private Eye magazine calls The Guardian “The Grauniad” due to its long history of typos. I spotted one just the other day.
Just came here to see if anyone else was searching for /Grauniad/ in their regex catalog. :-)
Here we tend to refer to it as The Garbagian, for obvious reasons.
Aarghh, I wonder how many of the 13,000 would be different for US English than UK English?
Ha ha that is a job I would not take on. “Two countries divided by a common language” as someone is supposed to have once said.
Winston Churchill, I believe.
It’s simple. All that’s needed would be a set to correct the Americans :)
After watching The Expanse and hearing the evolution of English into Belter language, I get a small sense of how you Brits must feel every time you see an American television program.
Funnily enough, I’m pretty sure Belter is based on British English. It sounds funny to Americans for that reason.
The Guardian should simplify their style guide so mere humans can do the work.
Not always easy when your subject matter is so broad.
That misses the point entirely; there are too many non-critical style rules and a lack of attention to basics. How did “USD-D” make it into a headline today, instead of the correct “USB-D” in the article below that headline? But, by golly, the maker’s handle was properly bracketed!
Clearly, I need a better editor! That’s “USD-PD” not “USD-D” or the proper “USB-PD”. 13,001 and counting up.
The Guardian had a reputation in the UK for spelling mistakes, such that it was known in a satirical magazine as The Grauniad, which makes the concept of 13,000 regular expressions a bit ironic …
This is pretty rich on Hackaday, notorious for letting a myriad of mistakes make it to publishing.
It’s pretty rich that, despite articles showing how much effort it takes for *major publications* to craft error-free articles on short deadlines, you still think a blog like HaD should be held to the same standard.
Regular Expressions (RegEx, RegExp, or RE) have a reputation for obscurity and complexity. Actually, RegEx is a good example of how being powerful usually means being complex. It also doesn’t help that RE capability has been added to myriad other programming and scripting languages in a lazy way, which has obfuscated its use. REs written for one language will often not work in a different language. This family of RE implementations across languages even has a name: PCRE, or Perl “Compatible” Regular Expressions. And that brings up Perl.[1] REs predate Perl, but today it is widely understood that REs live natively in the Perl-5 language. There is/was a Perl-6, but it is far removed from Perl-5 and was renamed Raku. Supposedly Perl-5 will live on as Perl-7 in the future. Perl, unofficially an acronym for “Practical Extraction and Report Language”, has been around in Linux/Unix for over 30 years and is considered an essential “glue” language in the Unices. Due to variability and incompatibility between various shells across the Unices, it is recommended to program system scripts in Perl for portability. Perl-5 is interpreted, but it can be compiled. In my opinion, if you are going to do Regular Expressions properly, you want to do them in Perl. Perl has an enormous repository of modules and scripts stored online in the Comprehensive Perl Archive Network (CPAN).[2] Perl, especially when coupled with Regular Expressions, is considered a “noisy” language. What you write and understand today in Perl will look like random “noise” written by an alien if you come back to it a year later.
References:
1. Perl
https://en.wikipedia.org/wiki/Perl
https://www.perl.org/
2. CPAN
https://en.wikipedia.org/wiki/CPAN
https://www.cpan.org/
https://metacpan.org/
Nice! We use regex all day long to help tidy up nasty-a$$ text exports from InDesign, Word etc for eBook production… I’m dreaming that there may be some interesting nuggets in the article (so speaks Andy of posts past, before he actually reads it) – the Ghost of Andy to come may well just visit this post tomorrow to smack myself upside the head in the reply :)
I love regular expressions and use them whenever possible. I’ve always been happy that I invested time in learning them; it has been the second-best language investment I’ve made, after C.