13,000 Regular Expressions Make An Editor’s Life Easier

Being an editor is a job that seems deceptively easy until you are hauled over the coals for letting a textual howler go to print (or website). Most publications have style guides to ensure that their individual voice is preserved, but even the most eagle-eyed will sometimes slip up in their application. At the Guardian newspaper in the UK they have been struggling with this against an ever-evolving style guide that must adapt to fast-moving world events, to the extent that they had a set of regular expressions to deal with commonly-occurring problems. A lot of regular expressions, in fact around 13,000 of them.

Clearly some form of management was required, and  a team of developers set about taming this monster. The result is Typerighter, their server-side document-checker, which can be found in a GitHub repository. Surprisingly for rule management they started with a Google Sheet, a choice which proved unexpectedly robust when working with such a long list even though they later replaced it. The back end doing the job of text matching was written in Scala, and for the front end a plugin was created for their Prosemirror text editor.

For a publication of course this is extremely interesting, but where’s the interest for hackers? The answer lies in any text-processing engine that uses a lot of regular expressions; those of you who have dabbled in this space will know how unwieldy this work can become. Any user of computational linguistic techniques in the pursuit of language processing could probably find much of interest here.

If you’re a bit hazy on regular expressions, how about the episode on them from our long-running Linux-fu series?

Rock Out To The Written Word With BookSound

With his latest project, [Roni Bandini] has simultaneously given the world a new type of audiobook and music. Traditional audiobooks are basically the adult equivalent of having somebody read you a bedtime story, but BookSound actually turns the written word into electronic music. You won’t be able to boast to your friends that as a matter of fact, you have read that popular new novel, but at least you might be able to dance to it.

[Roni] says he’s still working on perfecting the word to music mapping, so the results shown in the video after the break are still a bit rough. But even in these early stages there’s no denying this is an exceptionally unique project, and we’re excited to see where it goes from here.

Inside the classy looking 3D printed enclosure is a Raspberry Pi, an OLED display, and the button and switch which make up the extent of the device’s controls. At the end of the arm is a standard Raspberry Pi Camera module, which gives the BookSound a bird’s eye view of the book to be songified.

To turn your favorite book into electronic beats, simply open it up, put it under the gaze of BookSound, and press the button on the front. Because the Raspberry Pi isn’t exactly a powerhouse, it takes about two minutes for it to scan the page, perform optical character recognition (OCR), and compose the track before you start to hear anything.

If you’re wondering what the secret sauce is to turn words into music, [Roni] isn’t ready to share his source code just yet. But he was able to give us a few high-level explanations of what’s going on inside BookSound. For example, to generate the song’s BPM, the software will count how many words per paragraph are on the page: so a book with shorter paragraphs will consequently have a faster tempo to match the speed at which the author is moving through ideas. Similarly, drum kicks are generated based on the number of syllables in each paragraph. In the future, he’s looking at adding “lyrics” by running commonly used words on the page through a text to speech engine and inserting them into the beat.

We’ve seen practical applications of OCR on the Raspberry Pi in the past and even similar looking book scanning arrangements. But nothing quite like BookSound before, which at this point, is really saying something.

Continue reading “Rock Out To The Written Word With BookSound”

Gawking Text Files

Some tools in a toolbox are versatile. You can use a screwdriver as a pry bar to open a paint can, for example. I’ve even hammered a tack in with a screwdriver handle even though you probably shouldn’t. But a chainsaw isn’t that versatile. It only cuts. But man does it cut!

aukAWK is a chainsaw for processing text files line-by-line (and the GNU version is known as GAWK). That’s a pretty common case. It is even more common if you produce a text file from a spreadsheet or work with other kinds of text files. AWK has some serious limitations, but so do chainsaws. They are still super useful. Although AWK sounds like a penguin-like bird (see right), that’s an auk. Sounds the same, but spelled differently. AWK is actually an acronym of the original author’s names.

If you know C and you grok regular expressions, then you can learn AWK in about 5 minutes. If you only know C, go read up on regular expressions and come back. Five minutes later you will know AWK. If you are running Linux, you probably already have GAWK installed and can run it using the alias awk. If you are running Windows, you might consider installing Cygwin, although there are pure Windows versions available. If you just want to play in a browser, try webawk.

Continue reading “Gawking Text Files”