DIY Regular Expressions

In the Star Wars universe, not everyone uses a lightsaber, and those who do wield one had to build it themselves. There’s something to be said for that strategy. Building a car or a radio is a great way to learn how those things work. That’s what [Low Level JavaScript] points out about regular expressions. Sure, a lot of people think they are scary. So why not write your own regular expression parser and engine? Get that under your belt and you’ll probably never fear another regular expression.

Of course, most of us probably won’t do it ourselves, but you can still watch the process in the video below. The code is surprisingly short, but don’t expect all the bells and whistles you might find in Python or even Perl.

In the hands of the skilled, regular expressions are very powerful and offer a quick way to split apart text data. Like a lot of powerful ideas, the basic concept — that of a finite state machine — is really simple. It is the application to real problems that becomes difficult.
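To make the finite state machine idea concrete, here is a minimal sketch in Python. It is not the engine from the video: it is a hand-built automaton for one fixed pattern, `ab*c`, and the state names and the `accepts` helper are made up for illustration.

```python
# Hand-built deterministic finite automaton for the pattern "ab*c".
# State names and accepts() are illustrative assumptions, not code from the video.
TRANSITIONS = {
    ("start", "a"): "seen_a",   # the pattern must begin with 'a'
    ("seen_a", "b"): "seen_a",  # zero or more 'b' characters loop in place
    ("seen_a", "c"): "done",    # a 'c' finishes the match
}
ACCEPTING = {"done"}


def accepts(text: str) -> bool:
    """Return True if the whole string matches ab*c."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined for this input: reject
            return False
    return state in ACCEPTING
```

With this table, `accepts("abbbc")` is true and `accepts("ab")` is false. A real engine would build such a table (or an equivalent NFA) from the pattern text instead of having it written out by hand.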

If you want a primer on regular expressions that doesn’t require you to write your own tools, we have had a few posts that can help. If you just want some practice, try a crossword puzzle.

12 thoughts on “DIY Regular Expressions”

    1. Basic regexes are a direct mapping of finite state automata, which is fundamental to the theory of computation. The extended regexes we use in practice are also incredibly useful. There’s a reason nearly every language has copied Perl’s regexes.

      What they haven’t done is copy its /x modifier, which is a powerful tool for giving you hope of writing clear expressions. A lot of languages need to pick that up.

      1. The Rust ‘regex’ crate supports it too, via ‘(?x)’.
        Python reuses the idea with re.VERBOSE (also named re.X); the inline ‘(?x)’ flag works there as well.

        I don’t know about other languages.

        1. Agree. I actually write GUI utilities in Perl, and often use regexes to validate inputs.
          There was a time, not so long ago, when Python didn’t support quite a lot of things that have been easy in Perl “forever”. In a very long career writing code, I tend to skip over every other new language that comes along, especially if it’s essentially Perl without the “more than one way to do it” feature.
          I find that having more than one way lets me write clearer code, personally. It does take self-discipline, something hard to enforce in large projects, but for one-man cowboy work, it’s fine.

  1. The best thing about using regexes is the tremendous debugging and fine-grained error-handling that they afford… oh wait.

    The problem with regexes is that they work for the test cases you think up. And then when someone in the real world decides that there needs to be a “:” between some of what were previously spaces, your parsing goes all to hell. “Brittle”.

    If you validate all the results of the regex, at least you’ll notice that something is messed up. Of course, I only use regexes when it’s something quick and dirty, so I’m not validating. And the brittleness bites me in the ass.

    “Now you have two problems.”

    1. JavaScript used to be a horrible language because of bad tooling, but it’s a language like any other, and some of its strengths have made it quite popular over time. Tools like Regex101 are really useful as regex’s strengths continue to be useful; regex is also a language (almost) like any other.
      A function is always only as good as its test cases.

      Now I want to see/make a tool that generates test cases for a given regex.

    2. That is what sed is for: making sure your input is within the set of workable inputs. Pipes are your friend, particularly on multicore machines, as you can get parallelism in your code for free.

  2. GREP anyone? The bane of earlier generations of CS students – we all had to write a subset of grep or agrep. Methinks it was a plot to force understanding of NFAs in particular and automata theory in general.

    I have used them for parsing KiCad BoMs and for a project that is studying biological post-grad papers (spoiler alert: many botany dissertations are works of fiction).

  3. “Get that under your belt and you’ll probably never fear another regular expression.”

    I very much doubt that.

    The biggest problem that I have with regular expressions is that between work and home there are multiple dialects I need to use, just often enough to get confused trying to remember what goes with which dialect.
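For readers curious about the verbose mode raised in the comments (Perl’s /x, Python’s re.VERBOSE), here is a small hedged sketch in Python. The date pattern and the `parse_date` helper are hypothetical examples, not something from the post or the video. It also touches on the brittleness complaint: `fullmatch` returns `None` for input in an unexpected shape, so the caller at least notices.

```python
import re

# Hypothetical pattern: an ISO-style date such as 2024-01-31.
# With re.VERBOSE, whitespace in the pattern is ignored and '#' starts a comment.
DATE = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if the text does not match."""
    m = DATE.fullmatch(text)
    if m is None:
        return None  # unexpected input shape: report it instead of guessing
    return int(m["year"]), int(m["month"]), int(m["day"])
```

If someone swaps the dashes for colons, `parse_date("2024:01:31")` returns `None` instead of silently producing garbage, which is about as much error handling as a bare regex can offer.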
