All You’ve Ever Wanted To Know About Compilers

They say that in order to understand recursion, you must first understand recursion. Once you master that concept, you might decide that it’s time to write your own compiler that can compile itself as a fun side project. According to [Warren] aka [DoctorWkt], who documented every step of writing this C compiler from scratch, a true compiler will be able to do that.

Some of the goals for the project included self-compiling, focusing on a real hardware platform, practicality, and simplicity. [Warren] outlines a lot of the theory of compilers as well, including all the lexical, grammar, and semantic analysis and then the final translation into assembly language, but really focuses on making this compiler one for practical use rather than just a theoretical implementation. He focuses on Intel x86-64 and 32-bit ARM platforms too, which are widely available.
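To give a flavor of the first of those stages, here is a toy lexical scanner in C (a hypothetical sketch in the spirit of the project, not code from [Warren]'s compiler) that chops an input string into tokens:

```c
#include <ctype.h>
#include <stdio.h>

/* Token types for a toy scanner -- hypothetical, not the project's actual set */
enum { T_EOF, T_INT, T_PLUS, T_STAR };

static const char *src;   /* input cursor */
static int int_value;     /* value of the last T_INT token scanned */

/* Return the next token type, skipping whitespace */
int scan(void) {
    while (isspace((unsigned char)*src)) src++;
    if (*src == '\0') return T_EOF;
    if (isdigit((unsigned char)*src)) {
        int_value = 0;
        while (isdigit((unsigned char)*src))
            int_value = int_value * 10 + (*src++ - '0');
        return T_INT;
    }
    if (*src == '+') { src++; return T_PLUS; }
    if (*src == '*') { src++; return T_STAR; }
    return T_EOF;   /* anything unrecognized: give up, in this toy version */
}
```

A real scanner also tracks line numbers, recognizes identifiers and keywords, and reports errors, but the character-at-a-time loop is the same basic idea.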

This project is a long read and very thoroughly documented at around 100,000 words, so if you’ve ever been interested in compilers this is a great place to start. There are a lot of other great compiler tools floating around too, like the Compiler Explorer which shows you generated code as you write in a higher-level language.

[via Hackaday.io]

19 thoughts on “All You’ve Ever Wanted To Know About Compilers”

    1. The Dragon book is the college standard and it’s a tough slog. Wirth’s book is a lot more readable and has source code for an Oberon-0 compiler.

      Wirth also wrote Project Oberon, which includes the design and construction of a compiler, an operating system, and a microprocessor to run it on (in Verilog). Source for all of this is at:
      http://www.projectoberon.com/

      1. Of course, Dr. Dobb’s magazine had some articles about Small-C, which were eventually collected in a book. Since it was more aimed at hobbyists, that material might be a gentler introduction. C is of course “simple” since there are few keywords, so it may be easier to grasp the concepts of parsing etc.

        I have the dragon book, bought at a used book sale for a few dollars.

        1. IIRC, the Aho and Sethi book is strongly oriented towards the parser-generator paradigm. If I’m wrong about this, my apologies .. I haven’t looked at the book in a long, long .. long, time.

          In any case, there are many, many ways to skin the compiler cat.

          I’m definitely not dissing the parser-generator approach .. I’ve written my fair share of specialized tools using lex, yacc and such. Whether these were “compilers” or not might be a matter of how you define “compilation”. These were mostly tools to automate the translation of code from various industrial control systems to a particular target system. They did parse formal languages to ASTs, and then (ultimately) converted that AST representation into running code on the target system, using various combinations of declarative and imperative transformation and so forth along the way .. so I guess they would qualify as specialized compilers .. maybe ..

          In any case, a good ol’ LALR(1) parser-generator was definitely the “go to” approach when it was the right tool for the job, but there are plenty of cases where they are inordinately hard to use, or flat out can’t do the job.

          If you search for the term “parser combinator” on the following linked page, it should take you to a short, sweet, insightful review that expounds on this a little.
          https://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho-ebook/dp/B009TGD06W
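The source-to-AST-to-target pipeline described above can be sketched in a few lines of C. This is a hypothetical toy, not any of the tools mentioned:

```c
#include <stdlib.h>

/* A minimal AST for arithmetic -- illustrative only */
typedef struct Node {
    char op;                  /* '+', '*', or 'n' for a number leaf */
    int value;                /* used when op == 'n' */
    struct Node *left, *right;
} Node;

Node *leaf(int v) {
    Node *n = malloc(sizeof *n);
    n->op = 'n'; n->value = v; n->left = n->right = NULL;
    return n;
}

Node *binop(char op, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->op = op; n->value = 0; n->left = l; n->right = r;
    return n;
}

/* Walk the tree bottom-up. Here we just evaluate, but the same
   recursive structure would emit target instructions instead. */
int eval(Node *n) {
    if (n->op == 'n') return n->value;
    int l = eval(n->left), r = eval(n->right);
    return n->op == '+' ? l + r : l * r;
}
```

Swapping the arithmetic in eval() for calls that append target instructions turns this evaluator into a bare-bones code generator, which is essentially how the tree-walking approach works.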

  1. One may think “Why?”, but one important reason to consider writing your own compiler from scratch is described by Ken Thompson in his excellent (and terrifying) 1984 ACM Turing Award lecture, “Reflections on Trusting Trust”: https://www.archive.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf also summarized here: https://wiki.c2.com/?TheKenThompsonHack

    Basically, a maliciously-modified compiler can create executable code with undetectable back doors in it even when used to compile itself. So a bad actor could (or could have) installed this hack in long-ago generations of (say) gcc, and that malicious code could persist to this day.

    The inimitable Bruce Schneier revisits the notion, with a possible countermeasure described by David Wheeler, here https://www.schneier.com/blog/archives/2006/01/countering_trus.html

    1. Or, as Ken has done more recently, you might invent a new language for which a compiler doesn’t yet exist :)

      I remember in college people being dismissive of studying compilers since gcc already existed. This was before Python, Java, Rust, Go, PHP, JavaScript, and Ruby were invented!

  2. “Intel x86-64”? That’s like the one thing that it’s never been called.

    per wikipedia: “x86-64 (also known as x64, x86_64, AMD64 and Intel 64[note 1])”
    The note being that “Intel initially used the names IA-32e and EM64T before finally settling on “Intel 64””

    Tech people call it x86_64 but previously called it AMD64.

  3. Very neat! There’s nothing quite like writing a compiler or interpreter for a general purpose programming language to refresh and deepen your understanding of the bottom couple layers of abstraction that one can otherwise easily be lulled into taking for granted.

    This is one of those great system-level projects (like making a simple CPU on an FPGA with Verilog, or writing a minimal kernel with just enough smarts to schedule and switch tasks, configure page tables, and handle page faults and I/O interrupts) where it is well worth reinventing a wheel or two for the sake of the richer understanding of the technology that comes from tackling the basics in a hands-on way.

    Bravo! And all the much more for taking a practical incremental approach and for debugging in the open thus showing that nobody gets it 100% right the first time. Both the incremental approach and the visible test-debug-fix cycle go a long way to making the process approachable for those who might be inspired to set out on a similar journey themselves.

    1. I was a bit disappointed with my compiler class (in the same timeframe). Lots of stuff about grammars and parsing (that has been very useful over the years), but very, very little about the code generation part. I guess a semester only has so many days…

      1. Time constraints are only a small part of it. To quote Terence Parr from his book ‘Language Implementation Patterns’:

        “Unfortunately, optimizing and generating machine code from an AST (or any other intermediate representation) is really hard.”

        He recommends using LLVM to do the “heavy lifting”.
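To make the difficulty concrete, here is roughly what the easy part looks like: a hypothetical tree-walking generator that emits naive stack-machine x86-64 text for an expression AST. Everything hard (register allocation, instruction selection, optimization) is exactly what this sketch skips and what LLVM would handle:

```c
#include <stdio.h>
#include <string.h>

/* Toy tree-walking code generator. Assumes a tiny AST and emits
   stack-machine style x86-64 text into a buffer. Hypothetical sketch:
   real code generators must deal with registers, types, and calling
   conventions, which is where the difficulty Parr describes comes in. */
typedef struct Node { char op; int value; struct Node *l, *r; } Node;

static char out[512];   /* accumulated assembly text */

static void emit(const char *line) { strcat(out, line); strcat(out, "\n"); }

void gen(Node *n) {
    char buf[64];
    if (n->op == 'n') {                       /* integer literal leaf */
        snprintf(buf, sizeof buf, "\tpushq\t$%d", n->value);
        emit(buf);
        return;
    }
    gen(n->l);                                /* post-order walk: */
    gen(n->r);                                /* operands first, then op */
    emit("\tpopq\t%rbx");
    emit("\tpopq\t%rax");
    emit(n->op == '+' ? "\taddq\t%rbx, %rax" : "\timulq\t%rbx, %rax");
    emit("\tpushq\t%rax");
}
```

Every value takes a round trip through the stack here; turning that into decent register-based code is the “heavy lifting” part.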

  4. “Back in the day…”, says a grey beard… ALL introductory graduate CS curricula included writing assemblers and then compilers. At the time, the “Dragon book” was the best reference, as FOSS wasn’t a thing and Usenet was one resource. That was the heyday of syntax/grammar generators (YACC was popular at that time), if one chose to go that route. I can remember some difficulty coding “null transition states”. I still remember the intro class: the most successful (meaning they completed the assignment) students wrote their assemblers in assembly (IBM 360 assembly). I chose SNOBOL after a brief trial of Fortran. Other options later on included Turbo-C, Turbo-Pascal, Ada (Booch’s book)… all of which I used professionally at the time.

  5. I’m really, really enjoying this. I’m not quite following all the details of the concepts in all the chapters I’ve read so far, but it is surprisingly gripping to see the compiler unfold. Thanks for sharing!
