Parsing Math In Python

Programming computers used to be harder. Don’t get us wrong — today, people tend to solve harder problems with computers, but the fundamental act of programming is easier. We have high-level languages, toolkits, and even help from our operating systems. Most people never have to figure out how to directly read from a disk drive, deblock the data into records, and perform multiplication using nothing but shifts and adds. While that’s a good thing, sometimes it is good to study the basics. That was [gnebehay’s] thought when his university studies were too high level, so he decided to write an arithmetic expression parser in Python. It came out in about 100 lines of code.

Interpreting math expressions is one of those things that seems simple until you get into it. The first problem is correctly lexing the input — a term that means splitting into tokens. For a human, it seems simple that 5-3 is three tokens, {5, -, and 3} and that’s easy to figure out. But what about 5+-3? That’s also three tokens: {5,+,-3}. Tricky.

Precedence is the other problem. If you look at 5+3*2, you should remember that the answer is 11 or 5+(3*2). However, other than by convention, it is equally valid to consider the answer is 16, or (5+3)*2. Another convention is left association so that 7-4+2 is 5 instead of 1. As it turns out, [gnebehay’s] parser currently spits out 1 for that expression, but he’s promised to fix it soon.

While it is interesting to read the code, the real value is the readme file which documents the creation of the parser and some of the theory behind it. This sort of thing used to be a staple of computer science classes, but today you are less likely to encounter it as classes focus on higher-level constructs.

You probably won’t use this code for anything. You rarely need to parse just a math expression and even if you did, there are many tools to help do that now. But you might just learn something about how interpreters and compilers digest text and garner meaning.

Today, you would probably build a parser with ANTLR or some similar tool. While 100 lines seems small, we’ve seen tiny languages that are smaller.

9 thoughts on “Parsing Math In Python

  1. It is so trivial to parse LISP syntax, that up until 2000 or so, every large program embedded an interpreter. You make trivial built-ins for logical and numeric operators, and you have a perfectly useful language that can be both data and code … something we lost when we started using json and yaml

    1. I always wondered why compiled languages targeting LISP aren’t really a thing. It’s the perfect bytecode for a high level language, I just wouldn’t want to actually write in it myself.

      I do use LISP like structures as the backend for my visual scripting engine in my lighting control software, but it’s stored in YAML and interpreted by Python.

  2. However, other than by convention, it is equally valid to consider…

    If you go down that path, then you may as well use * for addition and = for multiplication and / for logical XOR, maybe also dependent on moon phases and planet alignment.

    Those conventions are a good thing to have. It’s because they add meaning to the gibberish.
    How would you read this text without the common convention of the 26 squiggles we call the letters of the alphabet?

  3. Just so. The “it is only a convention” nonsense was neatly pointed out by a logician over 15p years ago….

    “I don’t know what you mean by ‘glory,’ ” Alice said.
    Humpty Dumpty smiled contemptuously. “Of course you don’t—till I tell you. I meant ‘there’s a nice knock-down argument for you!'”
    “But ‘glory’ doesn’t mean ‘a nice knock-down argument’,” Alice objected.
    “When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.”
    “The question is,” said Alice, “whether you can make words mean so many different things.”
    “The question is,” said Humpty Dumpty, “which is to be master—that’s all.”

  4. Per Python’s precedence rules for operators, which is same for most programming languages, precedence is not a “problem”. Parsing math equations is one of the first things you will do for the mundane undergrad course typically titled Data Structures and Algorithms.

    Programmatic parsing of strings, symbols, etc is a solved problem. Many ways to do it – mostly covered in the compiler theory course. I still have my dragon book.

  5. In my experience there is nothing particularly rare about having to write an infix notation parser. I’ve ended up doing it at a bunch of times in my life. Once or twice for a function graphing app for some PDA. Once for Boolean searching in a PalmOS Bible reading app. Most recently, in OpenSCAD to compensate for the fact that the language doesn’t let you pass functions to functions or modules, so at least I can pass formulas as strings and then parse then myself.
    In one case, I used bison to generate the parser, but in a number of other cases I wrote the parser from scratch, either for efficiency or as I was using a language other than C.
    It’s not that hard, but always a nuisance. And generally I do it by translating into some alternate notation, like tokenized RPN that can be evaluated faster repeatedly.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.