Dirty Tricks For 6502 Programming

We know the 6502 isn’t exactly the CPU of choice for today’s high-performance software, but with the little CPU having appeared in so many classic computers (the Apple, the KIM-1, and the Commodores, to name a few), we have a real soft spot for it. [Janne] has a post detailing the eight best entries in the Commodore 64 coding competition. The goal was to draw an X on the screen using the smallest program possible. [Janne] got 56 bytes, but two entrants clocked in at 34 bytes.

In addition to the results, [Janne] also exposes the tricks people used to get these tiny programs done. Just looking at the solution in C and then 6502 assembly is instructive. Naturally, one trick is to use the existing ROM code to do tasks such as clearing the screen. But that’s just the starting point.

Some of the efficiencies are good practice on any CPU. For example, converting multiplication in a loop into a running total is always a good idea unless you have hardware multiplication that is as cheap as an addition. Some of the tricks are a bit more specific. For example, it was more efficient to draw the figure at the bottom of the screen and scroll than it was to draw each part at a specific X and Y position.
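The running-total idea can be sketched roughly as follows. This is a Python illustration rather than 6502 assembly, and it assumes the 40-column by 25-row C64 text screen. The function names and the single-diagonal layout are invented for the example and are not taken from any contest entry.

```python
COLS = 40  # C64 text screen width in characters
ROWS = 25  # C64 text screen height in characters


def diagonal_offsets_multiply(rows=ROWS, cols=COLS):
    """Offset of each cell on the top-left diagonal, one multiply per row."""
    return [y * cols + y for y in range(rows)]


def diagonal_offsets_running_total(rows=ROWS, cols=COLS):
    """Same offsets, but each step only adds a constant to the previous one."""
    offsets = []
    offset = 0
    for _ in range(rows):
        offsets.append(offset)
        offset += cols + 1  # down one row, right one column
    return offsets
```

Both functions produce the same list. The second one needs only an addition per row, which is the property that matters on a CPU with no multiply instruction.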

You might think some of these tricks aren’t really dirty, but then you’ll see self-modifying code. A legitimate hack, but always messy. There are also some special tricks used to get the C64 to load the machine code without going through BASIC first.

You might think 34 bytes would be the smallest possible program. You’d be wrong. After the contest, everyone had a look at all the entries and several people were able to come in even smaller — in one case, 29 bytes.

We wish we could have found the PRG files so you could run them on an emulated C64, but we were too lazy to build them from source. If you don’t have a C64, you could always pull out an FPGA. You can even build a new one. Seriously.

27 thoughts on “Dirty Tricks For 6502 Programming”

  1. Reading about small programs reminds me of the smallest program I’ve ever used.
    It was for x86 DOS: it waited for a key press and returned the key as the errorlevel.
    Its intended use was to enhance batch files, so unlike just writing an X for fun, it was an actually useful utility.

    I don’t remember its exact size, but it was a .com file and I think it was 3 asm instructions.

      1. Looks like int 20h returns without setting an errorlevel, but int 21h also provides function 4Ch, which should work. It was probably:

        mov ah, 1      ; DOS function 01h: read a key, result in AL
        int 21h
        mov ah, 4Ch    ; DOS function 4Ch: exit, AL becomes the errorlevel
        int 21h

        Which is 8 bytes: B4 01 CD 21 B4 4C CD 21

        But I can’t get it working in DOSBox, so I can’t confirm.

  2. The world relies on hyper-optimized code. Video compression, 3D rendering, network protocols, search algorithms. Nothing works without tightly packed bits executed by brutally optimized loops.

    Performance by default. No more bloat.

      1. The compilers for Arm are better too. There are plenty of CPU registers, and they do away with specialized registers and extending them. As a result, the compilers do a lot less register shuffling and spend more time generating better code.

    1. No. Not performance by default. Maintainability by default. Some other idiot has to work with your code as well. Performance is only important for certain bits of code.

      I’ve 15 years of embedded software experience. And I teach everyone the 3 basic rules of code optimization:
      1) Do not optimize.
      2) Do not optimize yet.
      3) Profile the code.

      1. I second this!
        I’ve worked with distributed systems on extremely capable machines, and I’ve worked with resource constrained embedded, real time systems. 99 times out of 100 the best performance gains are not by cleverly hyper-optimising the existing code, they’re from structural or algorithmic changes. Where there has been cleverly optimised code it’s almost always needed to be tweaked eventually which ruins the optimisations, and takes ages to decipher and understand the implications.

        1. Agree with all of the above…but there’s a but.
          Every time I do something tricky, it gets a paragraph of comments, and those are specially marked with something unlikely to be in real code, like @@@.
          If everyone did this, there’d be little or no argument. I’m retired these days, but still, coming back to years-old (or less…) code, someone else may as well have written it, so I do this in self-defense. I did start the practice as a working consultant, however. The customers really liked that style, and the only issues were when their own software guys ignored what I wrote, explaining things like “I don’t have the check ready here or use a semaphore because, unless you change something else in a stupid way, we’ll always be ready when we get here”.

          And in fact, it’s the code that doesn’t need to be there (or the wires/components, if hardware) due to careful design that is the hardest to figure out for the next guy (even if it’s you).

          Those paragraphs are really worth it, and don’t take up space on the target.

          We all seem to have no problem spewing words online, just do some of that in comment blocks!

    2. Smallest code does not mean fastest code…
      It’s always a matter of balance between both parameters… for the skilled ones…
      Day-to-day wannabe programmers are just reusing bricks they do not understand on top of other bricks their partners do not understand… the current result of the software industry is awful ;-)

  3. It’s great to study and consider different possibilities whenever practical. I’ve learned many useful tricks from other talented programmers that help me shrink my code to fit into tighter spaces, microcontrollers especially. It really does take challenges like this to stretch your mind and break out of the box.

  4. I did some work with the 8051 and did some crazy things. I was able to implement JTAG, which is sort of like SPI. Normally it is very painful because the CPU can only do stuff with the accumulator, so handling both MOSI and MISO (TDI, TDO) would require a lot of shuffling, shifting, bit testing and branches. I unrolled the loops and did away with the bit shifting, replacing the accumulator with bit variables. My 8051 assembly code was pretty fast compared to some of the sloppy AVR JTAG C code I have seen. :)

    I wouldn’t bother with 8-bit processors. Their days are long gone. I prefer to start with the cheaper and more capable Arm chips. There is nothing like writing bare-metal code to get to know the modern peripherals well, using multiple DMA channels and IRQs to shift the load. I no longer need to go into assembly. The few times I looked at the Arm compiler-generated code, I was impressed.

    1. 8-bit is still relevant. For quick and small projects they are unbeatable. They are easy to implement both in hardware and software and require a small amount of components to implement.

    2. 8-bit parts are not gone at all. I do a lot with AVR, and if you’re building a product you can go a ways cheaper at 8-bit than 16 or 32.

      8051 is still popular, if you’re going even smaller than AVR.

      Mouser has an 8051 from Silicon Labs in a QFN-20 package. (New product! You don’t see that on things long gone.) For 32 cents. UART, 15 ADC channels, four 16-bit timers. For some use cases that is actually beefier than the cheapest AVRs.

      For about 2 cents you can get an 8051 from China (also recent designs) but don’t ask about documentation. I think Dave at EEVblog did a video about them.

      1. > I do a lot with AVR
        Arm chips are cheaper and have better specs than AVR. :P I played with the Silicon Labs part for a project, but I was not impressed, as the 8051 code density is bad and their Eclipse tool isn’t that hot either.

        I have played with various Arm chips from other vendors and have no problems switching families or vendors. I can’t say the same for the 8-bit world.

        Right now my chips of choice are the $0.20 STM8F003, $0.36 STM32F030 and $1 STM32F103. These are commodity parts that show up on AliExpress, and they all use the same hardware debugger clone that costs $2.

  5. The software world is very bimodal. On one hand, and as another commenter pointed out, we have hyper-optimized loops for realtime conversion (mostly). On the other we have hyper layered code, sometimes running through not one but two interpreters, and whose overhead we simply rely on the massive performance of modern processors to hide. But, in the middle are almost limitless opportunities for competitive advantage by simply optimizing a handful of critical loops in current code.

    One excellent example is an extremely high quality electronic piano. This program uses massive samples for each piano (well over 40 GB per piano) to give not only the played note, but harmonics and resonances of the original instrument. Even on a very fast laptop processor (i7) it required over 35% of the CPU to continually mix and remix an enormous resonance profile. And all this to get a single sample output at 96 kHz. The most critical loop was this resonance creation and mixing loop, and it took 80% of the processing time. By recoding in assembly, once for each processor family supported, the program suddenly took less than 5% of the CPU and was able to run on extremely cheap hardware (as long as the data bandwidth was sufficient). This opened up the possibility of many pianos on the same hardware, dedicated boxes for the piano and other functions (mixers, analysis, video integration, etc.). The final loop was about 60 instructions, unrolled once.

    This is all the legacy of coding in assembly, starting with the 6502. I think it’s just as valuable today, even though the opportunity goes missed in too many cases.

  6. If you want performance, and security, code in JAVA. If you want security holes, poor performance, code in ASSembly or C/C++.

    Java is fast. It’s the fastest, nothing is faster than java, nothing.

    It gets faster with each iteration, at just a small cost of memory and cpu cycles.

    I mean, who doesn’t have a Core i7 with 16 GB of RAM as the minimum nowadays?

    With just a few GB of helper libraries, and few GB for compatibility libraries, and another few GB for versions of the interpreter and byte code compiler, you can have “hello world” compile and run in just minutes!

    Programmer time is expensive, user time is free! and so is user hardware!
