Learn Compilers Online From Cornell

It sounds like the start of a joke, but what’s the difference between taking Cornell’s CS6120 online and in-person? The instructor, [Adrian Samspon] notes that the real class has deadlines, an end-of-semester project, and a discussion board that is only open to real-life students. He also notes that you only earn “imagination credits.”

Still, this is a great opportunity to essentially audit a PhD-level computer science class on a fascinating topic. The course consists of videos, papers, and open source projects using LLVM and a custom internal representation based on JSON that is made for the class. It is all open source, too. You do however need access to the papers, some of which are behind paywalls. Your local library can help if you can’t otherwise find copies of the papers.

The topics include internal representation of programs, simple optimizations, data flow, global analysis and optimization, and practical topics about LLVM. With the basics out of the way, the class turns to some classic topics like loop optimization and alias analysis, along with some more modern topics like memory management and dynamic compilers.

The final part has some more cutting-edge reading about concurrency and parallelism. We suspect you should probably be pretty fluent in programming languages before attempting this course. Of course, not everyone needs to know how to write a compiler. But if you do, this is great stuff. Even if you don’t, it is a chance to expand your understanding of how things work under the hood.

We’ve looked at [Adrian’s] thoughts on LLVM before and you might find that post a useful adjunct to this course. If you want something more practical and less theoretical, there’s the C compiler we looked at before.

29 thoughts on “Learn Compilers Online From Cornell

    1. Any recommendations for other good resources? I picked up a used copy of the dragon book recently but haven’t made it that far in my reading stack yet. Actually its more of a heap at this point…

  1. Thank you for bringing this to our attention. Self-taught computerfolk and hobbyists seem in general to be fairly comfortable with hardware design and designing with FPGAs, writing code in Python or C/C++, and even writing machine language assemblers and BASIC interpreters, but as soon as you bring up the topic of developing a compiler, they suddenly remember a prior engagement. For those people designing and building their own CPUs out of TTL chips or even transistors and diodes, this pretty much keeps them working in assembler or a dialect of BASIC. But once you understand how compilers and virtual machines work, and in particular separating the front end (language parsing) and the back end (code generation for a specific CPU instruction set), in an intermediate language determined by the virtual machine, the last barrier is lifted, and suddenly you can program your one-of-a-kind CPU using the same high-level languages you have on an x86 or ARM machines. Whch in turn improves your CPU designs, because you learn what instructions are most important to generating efficient code.

    1. Probably because writing a good compiler is much more work than making a CPU.

      And if you actually *need* a CPU for a real project that’s big enough to need a high level language and a compiler, then it’s probably smarter to get a standard CPU for the job.

      1. I know you – you’re the guy who always says “why did you build it when you could have bought it for $20?” Hint: this website is aimed at makers, not consumers.

        1. I challenge your claim that it takes “much more work” to make a compiler than the CPU it targets. Evidence: the relative sizes of the hardware and software engineering staffs at companies that develop CPUs.
        2. Are you saying that nobody should design their own CPU because they wouldn’t have a compiler? That is exactly my whole point. People who design CPUs but don’t have a plan to provide a compiler for it are saying “nothing to see here, this is not the CPU you’re looking for.” to the rest of the world. Which negates all that street cred you were hoping for. But you’re probably right – those guys at Acorn shouldn’t have designed that Acorn RISC Machine chip, since there were already plenty of CPUs out there at the time. Who needs another CPU architecture? And for pete’s sake, who’s going to write a compiler for it, anyway?
        3. Modern modular compilers like gcc and LLVM use an intermediate language that is the instruction set for a low-level virtual machine. The compiler consists of two main modules: a front end that translates source code from any of a number of high-level languages into the intermediate language (which is not much higher-level than assembly language), and the back end, which is all you have to design in order to port your CPU’s architecture to the compiler, can be almost as simple as an assembler.
        4. Once you have a back end for your architecture, you can compile the compiler to run on your sweet, shiny badass thyratron-based CPU, and run it natively there. Depending on the size of your machine, you may need to run optimizations in a separate pass, but that just adds to the fun.
        5. You can’t know how shitty your CPU design is until you see how inefficient it is at implementing the virtual machine, and how much code it takes to do a simple funciton call and return.

        6. If you don’t see any value in developing a compiler, why did you read this article? Trolling much?

        1. I’ve done both. I made simple CPUs in a few hundred lines of code, in a few days time. I’ve made a simple C compiler in a few months, and tens of thousands of lines of code. And those simple CPUs wouldn’t even support high level languages, due to lack of stack relative options. And those simple compilers would generate hugely unoptimized code, which would be a poor match for resource constrained targets.

          You were asking why there were so many CPUs and so few compilers. I explained it.

          1. I’m not sure how you got the idea I was asking why there were so many CPUs and so few compilers. I DO understand why there are so many one-off CPUs that don’t have compilers: most people who design their own CPUs know nothing about writing compilers.

            I would suggest you look at a (freely-available) course called “nand2tetris”, which attempts to cover the basics of every level of computer design from CPU to compiler, except that it has a serious flaw: the CPU they have you design is SO inadequate, it would tend to prove your point! In my implementation, it took something like 35 machine instructions to implement a simple call statement, and almost as many to return!! So yeah, I do see your point, that it is easy to design a CPU, but not one that you could use to write real software for.

            But on the other hand, much of that is because the compiler they describe depends on using a stack to pass parameters and results. Which is the modern way of doing it, but there were a lot of compilers built before stack support became a basic expectation of CPUs, and they produced code that ran almost as fast as code written in the system’s assembler. So maybe what you see as CPUs that are unsuited for high-level languages are just unsuitable for how compilers have been written ever since stacks became ubiquitous in CPU architectures.

            And optimization? We’re talking about machines that are lucky to execute instructions in 10 microseconds. They’re not really expecting high performance out of them, as evidenced by the fact that some of these people DO implement BASIC interpreters for them. And as you say, this should be a poor pairing.

  2. > Of course, not everyone needs to know how to write a compiler.

    To the contrary, *EVERYBODY* absolutely needs to know how to write a compiler. This was by far the most useful and versatile topic of the whole CS study. The class at my university was useless (pure theory, obfuscating the beauty of the concept). But the tiny book by Nikolaus Wirth is mind blowing. 100 pages with little theory and lots of great ideas. After learning about that, you suddenly recognize that almost anything you wrote so far has been a compiler – you just didn’t knew it.

    The concept of a compiler is not limited to computer language translation.
    – There is input of some kind (it doesn’t matter if it source code, a json file to parse, sensor data or user interaction – something from the outside).
    – The program needs to react in a sensible way to this input, influencing the internal state and often generating some kind of output.
    – There are error conditions that need to be handled and you have to define a way to go on with live after an error occured.
    – and finally, there is output. That might be text, a database entry, a server response or a PWM setting that changes the motor speed of something.

    And it is not neccessary to reinvent the wheel every single time you start a new project. There are tools and concepts and if you know about their possiblities (and limitations!) you can easily choose the right approach from the beginning and save yourself a lot of trouble, implementation errors and spaggetti code in order to fix wrong architectural decisions made at the very beginning.

    1. Great comment!

      I once spent a summer analyzing source code for a sorting application for ancient IBM iron.. It was literally a compiler. Based on your command line options for the specifics of the sort, it would generate machine code to do the actual sort, and then execute it.

    2. This is what makes Forth so interesting. You explicitly extend the compiler to create new functionality, and you fit it to specific needs.

      And +10 for Wirth. The big three for me were Wirth, Dykstra, and Hoare. And everyone has a set of The Art of Computer Programming by Knuth I hope.

  3. What about separating instruction and certification? As in having the credit bearing part with class work and projects which count towards certification separate from the actual instruction. This would allow someone with the necessary knowledge gained elsewhere to still get certified without having to sit through the classes and naturally without also having to pay for them.

    Yeah, right.

      1. Something a LITTLE like that, but not MUCH like that – if you know how a compiler generates code, you can make a better CPU. You really, really don’t need to know how an oven works to make good cakes.

      2. Also, gcc isn’t that friendly when you are trying to develop a back end for a new CPU architecture. I looked into this, and found that there are at least three different intermediate languages that gcc uses can use, and none are particularly well documented, at least as far as being easy to learn about.. I would love it if somebody would prove me wrong by saying, “nonsense – go to this website to see how easy it is.” LLVM, on the other hand, appears to have been designed from the low-level virtual machine out. I will be looking into this again.

      1. I’ve looked, just last year. Still a problem. While obviously a number of people have succeeded at writing back-ends for gcc (judging from the number of CPU architectures it supports), gcc looks like it was never intended for people other than its own developers to write back-ends. What I’m hoping is that LLVM is better at this, because as I’ve already said, designing a good CPU requires knowing how a compiler generates code.

    1. Wikipedia says “LLVM was released under the University of Illinois/NCSA Open Source License, a permissive free software licence.” When it comes to software licenses,”permissive” usually means that users of the software are not as restricted as they are under GNU’s GPL license, so I suspect that Clang and LLVM were written so that Apple wouldn’t have to give away the store to use it. This is also why Mac OS is based on BSD Unix instead of GNU/Linux.

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.