Program Guesses Your Regular Expression

April 22, 2020

We aren’t sure how we feel about [pemistahl’s] grex program. On the one hand, we applaud a program that can take some input samples and produce a regular expression. On the other hand, it might be just as hard to gather example data that produces the correct regular expression. Still, it is an interesting piece of code.

Even the author warns against using this as an excuse not to learn regular expressions, since you’ll need to check the program’s output. The results are certain to match your test cases, but there is no guarantee that they won’t also accept things you didn’t expect. Bad regular expressions have been the source of some deeply buried bugs.

The code is written in Rust and builds an automaton for the test cases, making assumptions about which character classes the characters it sees belong to. You can control the class algorithm to some degree using command line options. It is also possible to use the code as a library from another program.

Here are a few examples of grex at work:

$ grex a b c
^[a-c]$

$ grex a c d e f
^[ac-f]$

$ grex a b x de
^(?:de|[abx])$
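
The character classes in those outputs can be reproduced with a few lines of Python. This is only a sketch of how single-character samples might collapse into a class like [ac-f], not grex's actual algorithm. The helper name char_class is hypothetical, and the sketch assumes plain letters or digits, so it does no escaping. It also checks that the three expressions printed above really do match their own inputs.

```python
import re

def char_class(chars):
    """Collapse single characters into a compact class such as [ac-f].

    Hypothetical helper for illustration; assumes plain letters or
    digits, so no escaping of ']' or '^' is attempted.
    """
    codes = sorted({ord(c) for c in chars})
    runs, start, prev = [], codes[0], codes[0]
    for code in codes[1:]:
        if code != prev + 1:          # gap found: close the current run
            runs.append((start, prev))
            start = code
        prev = code
    runs.append((start, prev))

    parts = []
    for lo, hi in runs:
        if hi - lo >= 2:              # three or more in a row: use a range
            parts.append(f"{chr(lo)}-{chr(hi)}")
        else:                         # one or two characters: list them
            parts.append("".join(chr(c) for c in range(lo, hi + 1)))
    return "[" + "".join(parts) + "]"

# Expressions copied from the grex examples, with the inputs that produced them.
samples = {
    "^[a-c]$": ["a", "b", "c"],
    "^[ac-f]$": ["a", "c", "d", "e", "f"],
    "^(?:de|[abx])$": ["a", "b", "x", "de"],
}
for pattern, inputs in samples.items():
    assert all(re.fullmatch(pattern, s) for s in inputs)

print(char_class("abc"))    # [a-c]
print(char_class("acdef"))  # [ac-f]
print(char_class("abx"))    # [abx]
```

Note that "b" is not matched by ^[ac-f]$, which is the behavior you would want from samples that skipped it.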

We wondered if it would help if you could provide counterexamples, too. For instance, old-fashioned US area codes could only have a 1 or 0 as the middle digit. So examples like 713 and 212 would benefit from counterexamples such as 173 or 777.
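
Until a tool accepts counterexamples directly, you can check a candidate expression against both lists yourself. The following is a minimal sketch in Python. The check function and both patterns are our own illustrations rather than anything grex produces, and the area code pattern encodes only the middle-digit rule mentioned above.

```python
import re

# Assumption: only the rule from the text is encoded (middle digit 0 or 1).
AREA_CODE = re.compile(r"[0-9][01][0-9]")

# A deliberately loose candidate: any three digits.
LOOSE = re.compile(r"[0-9]{3}")

def check(pattern, positives, negatives):
    """Return (missed, leaked): positives that fail to match and
    negatives that match when they should not."""
    missed = [s for s in positives if not pattern.fullmatch(s)]
    leaked = [s for s in negatives if pattern.fullmatch(s)]
    return missed, leaked

positives = ["713", "212"]
negatives = ["173", "777"]

print(check(AREA_CODE, positives, negatives))  # ([], [])
print(check(LOOSE, positives, negatives))      # ([], ['173', '777'])
```

Both candidates accept every positive example, so only the counterexamples reveal that the loose pattern is wrong.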

If you want to create your own regular expressions, it isn’t that hard. If you want to practice, crosswords are fun.

12 thoughts on “Program Guesses Your Regular Expression”

RetepV says:

April 22, 2020 at 1:17 am

It’s a nice idea. But I think that the regexp would become more reliable if you would not only be able to pass the positive cases, but also the negative cases to exclude those instances where the regexp runs away.

Regexps are greedy, and most of the hard part of designing a regexp is actually about limiting the greediness of the parser. Using ^ and $ to anchor the regexp, and using constructions like [^0-9].

1. RetepV says:
  
  April 22, 2020 at 1:18 am
  
  Do’h, I didn’t read past the image with the examples until I pressed Post. :*)
  
  1. abjq says:
    
    April 22, 2020 at 2:12 am
    
    :s/It\s a nice idea.*9\]\.//
    
    Yeah the lack of an edit option is annoying!
    
    1. abjq says:
      
      April 22, 2020 at 2:12 am
      
      Especially as i missed out the single quote! D’oh!
      
2. Jeffrey Bedard says:
  
  April 24, 2020 at 3:58 am
  
  Perhaps listing exceptions as a series of awk expression blocks calling next can fit this need. It is a rather composite solution, but potentially cleaner than a single complex expression containing the exceptions.
  
Sam Neaves (@samwalrus) says:

April 22, 2020 at 1:37 am

Have a look at some Inductive Logic Programming if this is interesting for you: http://andrewcropper.com/pubs/mlj19-metaho.pdf

You can give positive and negative examples and learn a program (Similar to ‘learning a regular expression’).

Jaxter25 says:

April 22, 2020 at 2:03 am

There was an interesting post on the pihole subreddit recently where someone integrated/optimised regexs into their blocklist.

Some of the devs replied stating how this was actually more expensive and time consuming than using the optimised list searches.

People tend to see regexs as this magical solve-all tool, but in some instances, a string function may actually be faster and more appropriate.

1. ROB says:
  
  April 22, 2020 at 4:14 pm
  
  This!
  
  Regex has an overhead cost, especially with thread-based processing like a web server.
  
  Apache MOD_REWRITE is the ultimate regex gotcha.
  
  I see people attempt the – one regex does everything – in a mod rewrite.
  
  They end up losing half of their URI namespace and are committed, as fixing the problem would destroy their SEO.
  
  Then after numerous site changes and the inevitable mod rewrite changes they create what I call the regex Rubik’s cube.
  
  It’s impossible to solve. And often creates security weaknesses for hackers.
  
  Anywhere else you should consider string manipulation, optimised list searches or even SQL first. These methods are often faster and less resource hungry than regex.
  
Michael C says:

April 22, 2020 at 7:55 am

Given my personal experience with regexs, I wasn’t at all sure this was casually possible.
/s

Alex says:

April 22, 2020 at 12:17 pm

Ha! Funny!
Imho regexp are _much_ simpler to write than to read, so this program is like the most elaborate regexp related troll ever. Probably the algorithmically generated regexp are borderline write-only for nontrivial cases :D

rtyr56y says:

April 22, 2020 at 2:20 pm

Not working on my e-mail filter
grex politics lgbagd idiots etc.

Peter M. Stahl says:

April 23, 2020 at 1:28 am

Hi, I’m the author of this little tool. First of all, thank you very much for covering it here. I’m happy that it gains attention, even though it is discussed quite controversially. But this is a good thing.

Just a few notes:

– If you don’t use the flags for the shorthand character classes (-d, -D, -s, -S, -w, -W), it is guaranteed that the produced regex matches only the input test cases and nothing else.
– For the next version, I plan to add functionality that lets you provide counterexamples which should NOT be matched by the produced regex. I agree that this would be a very useful addition, so I’ll be working on that next.

