Program Guesses Your Regular Expression

We aren’t sure how we feel about [pemistahl’s] grex program. On the one hand, we applaud a program that can take some input samples and produce a regular expression. On the other hand, it might be just as hard to gather example data that produces the correct regular expression. Still, it is an interesting piece of code.

Even the author suggests not to use this as an excuse to not learn regular expressions, since you’ll need to check the program’s output. It is certain that the results will match your test cases, but it isn’t certain that it won’t accept things you didn’t expect. Bad regular expressions have been the source of some deeply buried bugs.

The code is written in Rust and builds an automaton for the test cases, making assumptions about the characters it sees belonging to certain classes. You can control the class algorithm to some degree using command line options. It is also possible to use the code as a library from another program.

Here are a few examples of grex at work:

$ grex a b c

$ grex a c d e f

$ grex a b x de

We wondered if it would help if you could provide counterexamples, too. For instance, old fashioned US area codes could only have a 1 or 0 in the middle digit. So giving examples like 713 and 212 could benefit from counterexamples such as 173 or 777.

If you want to create your own regular expressions, it isn’t that hard. If you want to practice, crosswords are fun.

12 thoughts on “Program Guesses Your Regular Expression

  1. It’s a nice idea. But I think that the regexp would become more reliable if you would not only be able to pass the positive cases, but also the negative cases to exclude those instances where the regexp runs away.

    Regexps are greedy, and most of the hard part of designing a regexp is actually about limiting the greediness of the parser. Using ^ and $ to anchor the regexp, and using constructions like [^0..9].

    1. Perhaps listing exceptions as a series of awk expression blocks calling next can fit this need. It is a rather composite solution, but potentially cleaner than a single complex expression containing the exceptions.

  2. There was an interesting post on the pihole subreddit recently where someone integrated/optimised regexs into their blocklist.

    Some of the devs replied stating how this was actually more expensive and time consuming than using the optimised list searches.

    People tend to see regexs as this magical solve-all tool, but in some instances, a string function may actually be faster and more appropriate.

    1. This!

      Regex has an overhead cost especially with tread based processing like a web server.

      Apache MOD_REWRITE is the ultimate regex gotcha.

      I see people attempt the – one regex does everything – in a mod rewrite.

      They end up loosing half of their URI namespace and are committed as fixing the problem would destroy their SEO.

      Then after numerous site changes and the inevitable mod rewrite changes they create what I call the regex Rubik’s cube.

      It’s impossible to solve. And often creates security weaknesses for hackers.

      Anywhere else you should consider string manipulation, optimised list searches or even SQL first. These methods are often faster and less resource hungry than regex.

  3. Ha! Funny!
    Imho regexp are _much_ simpler to write than to read, so this program is like the most elaborate regexp related troll ever. Probably the algorithmically generated regexp are borderline write-only for nontrivial cases :D

  4. Hi, I’m the author of this little tool. First of all, thank you very much for covering it here. I’m happy that it gains attention, even though it is discussed quite controversially. But this is a good thing.

    Just a few notes:

    – If you don’t use flags for the shorthand character classes (-d, -D, -s, -S, -w, -W) it is guaranteed that the produced regex matches only the input test cases and nothing else.
    – I plan for the next version to add a functionality that allows to provide counterexamples which should NOT be matched by the produced regex. I agree that this would be a very useful addition so I’ll be working on that next.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.