Assembly Required: Subroutine Calls and the 1K Challenge

The first computer I personally owned had 256 bytes of memory. Bytes. The processor in my mouse and keyboard both have more memory than that. Lots more. Granted, 256 bytes was a bit extreme, but even the embedded systems I was building as part of my job back then generally had a small fraction of the 64K bytes of memory they could address.

Some people are probably glad they don’t have to worry about things like that anymore. Me, I kind of miss it. It was often like a puzzle trying to squeeze ten more bytes out of an EPROM to get a bug fix or a new feature put in. I though with the 1K challenge underway, I might share some of the tricks we used in those days to work around the small memory problems.

I Can’t Help You…

Unfortunately, optimization at this level very often requires specific solutions and no one can help you unless they know exactly what you are doing. Often, the best ways to save memory involve either changing your algorithm or using CPU-specific features. For example, if you are storing a lot of strings but don’t need a lot of characters, perhaps you can pack your strings. On the right CPU, you might spend two bytes writing:

MOV B,#0

This presumably clears hypothetical register B and usually takes two bytes of encoding (one byte for the MOV with a bitfield that indicates the B register, and another byte for the literal zero). However, if available, an XOR instruction probably takes one byte:


This also zeros register B since the XOR of anything with itself is zero. A subtract could do the same thing, if that’s available.

These are all specific to your project and platform, though. So for those sort of things you are on your own.

But, Some Things are Universal

One of the easiest ways to change your program involve subroutines. Most CPUs have some facility for call and return (although my 256 byte computer didn’t, oddly). The first rule is to use them. If you have four or five instructions being repeated, that’s a good candidate for a subroutine. Common sense, right? But there are a few other tricks involving subroutines to keep in mind.

The first idea is to look for items you can move into a subroutine, even if it requires a little tweak somewhere in the code. To illustrate this (and some other ideas) I went to this online assembly language simulator. It has a simple assembly language that is similar to real CPUs. The ability to be able to step through the code in your web browser is nice, too. Of course, you’ll have to apply the concepts to your specific CPU.

Below is a simple program to read a four-character ASCII string that represents a decimal number and puts it in a 16-bit register. So the string “255” will turn into 00FF in the register.

; Simple example
; Convert decimal # at istring into 16-bits in B:C
   JMP start

   CALL convert
; more interesting things happen here
   MOV B,0
   MOV C,0  ; answer in B:C
   MOV D, istring    ; Point to var
   CALL decdig  ; convert each digit
   INC D
   CALL decdig
   INC D
   CALL decdig
   INC D
    CALL decdig

   MOV A,0
   SHL C,1
   JNC shift16a
   INC A
   SHL B,1
   ADD B,A   ; output carry is meaningless

   CALL shift16; *2
   CALL shift16; *4
   CALL shift16    ; *8
   POP A
   ADD C,A   ; 8X+2X=10X!
   JNC decdig0
   INC B

   POP A
   ADD B,A
   MOV A,[D]
   SUB A,'0'
   ADD C,A
   JNC xret
   INC B

istring: DB "1924" ; input
         DB 0 ; String terminator


This assembles into 89 bytes. Not much, but we can do better. The first thing to notice is that the D register holds the address of the string. The program loads it at the start and after each call to the digit conversion subroutine, there is an increment to point to the next character. One obvious way to save some space is to factor the increment instruction into the subroutine. You can put it towards the end, or–if it makes more sense–load the address minus one (most assemblers can figure that out at assembly time). In this case, it doesn’t really matter, but either way will get you to 85 bytes. I put the POP D instruction after the xret label and the convert routine now looks like this:

MOV C,0 ; answer in B:C
MOV D, istring ; Point to var
CALL decdig ; convert each digit
CALL decdig
CALL decdig
CALL decdig

Notice those last two lines. You can shave a byte by changing the final CALL into a JMP. The return at the end of decdig will go back to the original caller. You can save even more space by rearranging the code so that decdig appears right at that spot:

CALL decdig
CALL decdig
CALL decdig
decdig: ...

You can apply this even more. Notice that you have 4 calls to decdig which could be considered two groups of two. So you could write:

CALL decdig2
CALL decdig
decdig: ....

You might have to think about that for a minute. The first call executes two operations and the return goes back to decdig2 where you do two more operations. At the end, the return goes back to the original caller.


That’s not very readable, I’ll admit. Comments are free, so you should use them if you aren’t writing a blog post around the code. In particular, any time code depends on being in a particular location, you ought to document it:

CALL decdig2:

This prevents you from accidentally moving it or inserting code and breaking things. Now we are at 80 bytes. You can pull the same trick with the shift16 subroutine, but since it is only called once, there’s no real savings in this case.

Saving 9 bytes doesn’t seem like much, but it is about 10% of the original code. Sometimes that’s all you need. You could probably optimize the algorithm, too.

Return Conversions

On some CPUs, you can do  conditional return (that is, like return on zero or RZ). But on other processors, you can only do a branch or jump on a condition. This leads to code like this:

JNZ nextinst

However, you probably have a return somewhere in your code already (like the xret label in the example program). That means you can write:

JZ xret

There you’ve saved another byte. Like pennies, they all add up.

Your Turn

What’s your favorite space-saving trick? Share it in the comments unless you want to save it as your secret sauce for the 1K challenge. You have until January 5th to squeeze out those extra bytes.


34 thoughts on “Assembly Required: Subroutine Calls and the 1K Challenge

  1. On processor that only shift 1 bit at a time when you want to shift 4 bits 4 instructions are needed

    PIC enhnanced midrange exemple:
    “`lslf WREG
    lslf var1
    lslf var1
    lslf var1“`
    This can be reduce to
    “`swapf WREG
    andlw 0xf0 ; keep high nibble“`
    same trick can be used when right shifting 4 times:
    “`lsrf WREG
    andlw 0xf ; keep low nibble“`

  2. When I code for a microcontroller, I tend to use use c, then look at my list file, I rewrite sections to see how they translate.

    For me, on a uC, it’s two fold, speed up the program and save prog space.

    As I tell people a saving of 1 instruction in a loop that runs 1 million times in a second, means 1 million other instructions get to run in the same amount of time.

    1. In one of the old ‘DTACK Grounded’ newsletters, Hal Hardenberg argued that doing it that way was an incorrect approach, because starting with C (or any other high-level language) imposed constraints on data structures and code organization that using assembly language didn’t impose. Therefore, using assembly from scratch allowed you to come up with more efficient ways of doing things than a high-level language would permit.

      That being said, it’s probably a more time-effective way of doing things than coding everything from scratch in assembly.

      1. It’s a great argument in the 80s – but I’ve found that a lot of modern compilers will in fact essentially guess what you’re doing and optimize based on that rather than language assumptions. Of course, you can still occasionally outsmart a modern compiler on a modern platform. But man, my code is so much more manageable now that I’m not writing everything in assembler for the PIC… I can actually not have to start from (almost) scratch every time I switch a MCU with C.

        Unfortunately, quite a few embedded platforms have compilers that are written like they came straight from the 80s, so there’s quite a few platforms where this advice is still completely true (*glares at PIC8*).

      2. Doing ARM (Cortex) stuff I always just eyeball the assembler output; almost always you can just massage the C slightly to help the compiler – stuff like copying things into local variables is a big one – in fact so big it’s worth mentioning; the issue is that when you have (say) a function that takes a couple of structure pointers and loops over them, the compiler can’t absolutely guarantee that those pointers at runtime won’t actually point to the same thing in memory (although you usually know for sure that they won’t) so the compiler has to be super-paranoid about reloading things from memory rather than caching them in registers; this also happens a lot when making a function call and passing it a pointer in a loop. The fix here is (a) use “const *” pointers wherever possible and copy stuff like loop counts into local variables (which the compiler can then legitimately assume won’t be modified by any pointers). I remember back in the PSX days ranting at the compiler guys about their dumb compiler not making obvious optimizations and having this explained to me. It’s quite subtle and really something of a weakness in the C language.
        void blah(somestruct *s1, somestruct *s2)
        for (int t=0;tsomeCount; t++){
        In this case the compiler will have to reload “s1->someCount” every time around the loop because it’s possible that s2 might overlap s1->someCount. It practically never is, but the compiler can’t assume that. Fix is to copy s1->someCount into a local variable. The other way to do it would be to do the loop in reverse and initialize t with the count and count down. Sometimes on very smart compilers (and if it’s legal) it’ll do the loop in reverse for you because it saves a compare instruction. There’s many more tips but knowing about how potential pointer aliasing can mess with C compiler optimization is a golden one.

          1. Haha, I have to use a shortcut on my phone’s keyboard or else I swap them around the wrong way.
            Lets see if code tags still convert html, I can’t remember.

            italic tag?

        1. This isn’t really a thread about compiler optimizations, but, since you’ve brought it up, you may need to look at the compiler output for a number of reasons.

          I once had to write an entire module for the 68332 in assembly, because it was manipulating TPU registers, and the compiler we were using was smart enough to optimize any operation it could into byte-wide operations. Unfortunately, many of the TPU registers have to be accessed as 16-bit words – writing only one byte caused the other to get trashed.

          Some of the only recent assembly I’ve written for PIC was to get around the fact that the compiler I was using wrote the halves of a 16-bit value to memory in the wrong order for a specific register pair I had to access. I’m very glad I can use C for almost all of my PIC processing these days, but I still have to dip into it from time to time.

  3. One trick I saw used in 6809 assembly was jumping into the middle of an instruction. I don’t remember a good example of it but it was to save one byte of code. Basically if you wanted to subtract 1 from the D register the opcode looked like:
    83 01 SUBD #1
    so the hacky code would actually have
    10 83 01 CMPD #1
    and somewhere else there was code that would jump into the 2nd byte of that instruction to subtract 1 from D, while another code path would “fall through” and execute the CMPD instruction which harmlessly set condition codes/flags. This saved a byte because otherwise they’d need 2 bytes of code to jump over the SUBD instruction.

  4. A big space saving trick is to emulate a more space-efficient virtual machine. This was done in the original 1K BASIC published in DDJ, and again in the Apple-2 (i.e. “SWEET-16”). A huge space saver. And on a smaller scale, judicious use of relative addressing (such as relative jumps and relative calls), ordering your code to fit within that limited relative address-distance. Then to squeeze out every last byte, replacing constants with bytes of the same value that “just happen” to fall somewhere in your code (even as part of an instruction). Then you can do like TI did on their personal computer, and store your BASIC programs in the video RAM (though very slow accessing it through the video controller I/O ports).

    1. I’m writing the firmware for a Propeller based drone, and this is heavily used. My on-board “FPU” processes an instruction stream where each op is 4 bytes: Opcode, SourceA, SourceB, Dest. The 3 “register” values are indices into an array of floats. The instruction stream itself is generated from C by a compiler. This saves me several kilobytes of space, and actually runs faster than the original code because I have some opcodes that C doesn’t have, like a floating point shift operator.

  5. Hey, [Al] (can I call you Al? and am I using brackets right, there…?) — what computer was that, with only 256B of RAM…? You’ve got me curious.

    If I had to guess… something based on the Signetics 2650. Its publicly-available hobbyist ROM was coded for a system using Intel 2112 chips, which are 1024-bit chips, organized as 4 x 256 bits — or 256 nybbles, for those who don’t speak SRAM ;)

    Interestingly enough, that ROM I mentioned is still somewhat available. It’s called PIPBUG, and it seems to have hung around (you can download it out here on the ‘Net if you know where to look) because there were a few 2650-based pinball machines that someone read out the ROM from… and that ROM happened to be PIPBUG. (How do I know all this…? I’m thinking of building a 2650-based retrocomputer using the PIPBUG ROM. I’ve a few other ideas for retrocomps bouncing around, so maybe, maybe not, we’ll see — but it’s kind of a cool idea. The 2650’s got some interesting features — a bit-banged serial terminal interface that’s right on the die, for instance. It’s kind of a rare CPU now, but I have one — $10 on eBay, IIRC.)

    Signetics, however, is long since gone. They became part of Philips Semi in 1975, right around the release of the 2650 CPU, which got spun off as NXP in 2005… which, as [Hackaday] reported back in October, is now part of Qualcomm.

    I should probably also mention… Signetics is notable for one other (rather significant, actually) contribution to electronics. In 1971, a Swiss chip designer by the name of Hans Camenzind, came up with the 555 timer. He was working under contract with Signetics at the time, so it was their chip to bring to market.

      1. Nifty, another rather uncommon CPU… man, the 1802 is weird (well, okay, everything’s relative) — but it’s also really cool because it’s been to spaaaaaaaaaace.

        COSMAC ELF? or another system? Pics would be cool if you have them (and time to post ’em). Sorry if I’m being overly curious — I really dig old gear like this.

        1. Oh, duh. You DID say homebrew! (Sorry, I’m something of the ‘absent-minded professor’ type. Well, except I don’t teach anything.)

          I’d love to see pics, if you can do that. Homebrew gear is awesome gear.

    1. Actually… correction, the PIPBUG ROM expects Intel 2114 SRAMs, which are 4096-bit, organized as 4 x 1024 — aka 1k nybbles on a chip. If anyone wants to build the PIPBUG-based computer, there’s a schematic from Electronics Australia in the May ’78 issue. I’ve uploaded a JPEG of it here –>

  6. On PIC16 processors, I’ve had to organize which subroutines go on specific code pages, so that I don’t have to change bits in the code page register in order to call several subroutines in sequence. I also defined a set of macros so that code for setting the code page bits would only be generated if it was required. Basically, this involved macros for cross-page calls and a ‘sync’ macro for setting the code page register to the current location. It meant that I didn’t have to keep track of that manually.

    Another trick, used in a command interface when I had to compare the W register to several literals (in order to select from one of several commands specified by single letters), was to “chain” the comparisons so that I didn’t have to reload the input character, like this:

    sublw ‘D’ ; check for ‘D’ command
    btfss STATUS,Z
    goto not_D
    ; Handler code for ‘D’ command goes here

    sublw ‘D’ – ‘G’ ; check for ‘G’ command
    btfss STATUS,Z
    goto not_G
    ; Handler code for ‘G’ command goes here

    … and so on.

    For jump tables on the PIC processors, I’ve put the code for the last entry in the table directly following the table, so that I can delete the jump instruction for that entry.

  7. This is well out of my league, programming wise.

    For your entertainment:
    I thought something to be easy then hit a (mental) roadblock or just forgot,
    I have had many projects that I’ve abandoned (or just forgot about),
    Most my C/C++ programs just contain int main(); before being abandoned,
    I completed a program for logging some info requiring more direct access.
    I haven’t completed the temperature to fanspeed control for my Pi2 server.

    I once tried to disassemble a BIOS from a HUDL2, Found a jump into a loop very early. Just after some kind of cache sanitize, FPU-init and SSE2 instructions are executed there is a test that jumps to an infinite clear interrupt-and-loop (all in 16Bit code from f000:fff0).
    So i replaced the first instruction of the test with a jump to the loop.

    Soldered the ROM back and powered. I expected heat from the CPU and had none, I also had no voltage going to the CPU. reflashed the original image and the thing came to life.

    Found out about Intel Boot Guard.

    I put my experiences down to: I suck at programming.

    The following is ideas for those who have time/resources/etc:
    The only chance at this challenge is if I had:
    some vynal records and players with samples,
    a rotary photo-film projector or more.
    i.e. move the code+data out and turn it into an electro/mechanical external automated action instead (i.e. an addressable drumkit using discrete logic tied to GPIO pins) and run a light show.
    Only need to worry about timing the GPIO then (no need for image data or sound samples)
    As long as it is mechanical or analog it can’t be measured in any amounts of bits (until it is converted to digital of course)

  8. God I wish I could remember 1-100th of what I know when I was younger.

    I started programing with a 6809 home brew then ZX81, Grew up a little, to a COCO then a massive IBM with 256m of ram and EGA monitor.
    Now my watch is more powerful.

    I LOVE IT !!!!!!
    And now I’m starting all over again using Arduinos, ESP8266 and Raspberry PI clones. I still love The Orange PI clone evan with its flaws. But the price is right.

    I wish everyone the best of luck. I am sad though because I am no longer in this game anymore.

  9. BST and BLD in AVR are amazing for swapping bits arbitrarily and far more legible than a bunch of rotates and ORs and ANDs not to mention that you don’t need to involve another register or memory address..

    As for Z80 you can put the value of any bit on the carry withour rotating by combining AND and NEG instructions. I have used that in a SPI library for joystick port on MSX computers

    As for PIC assemby my favorite is the sequence of

    BTFSC f,b
    BTFSC f,b

    after a logic operation for testing bit while keeping constant instruction time for serial bitbang. It works on Z80 too by
    CALL C, xxxx
    CALL NC,xxxx
    as used in the serial print in the joystick port:

  10. Increasing recursion depth is also helpful as long as you keep in mind that each level costs two bites on the stack as a return address.

    Another method is look up tables that include a CALL location –
    LD BC, (DE)

    What I found better was to have multiple entry points to a sub routine that did similar things –

    LD C,0 // or XOR C,C
    LD B,0 // or XOR B,B
    DEC B
    JRNZ wait_BC_clocks
    DEC C
    JRNZ wait_C_times_256_clocks
    LD C,1
    JR wait_BC_clocks

    4 routines, 13 bytes

  11. One useful trick for condensing many common integer math calculations to a single instruction on x86/x86_64 targets is to use the LEA instruction (load effective address). There is no hard and fast law that says the calculation it performs _has to_ represent an address.

    In effect you give it an unaigned constant or register base, and a signed register displacement which is scaled by an immediate value of 1, 2, 4, or 8, so in pseudo-code it is:
    dest = X + (Y << Z) where z is a two bit immediate and at most one of X or Y can be an immediate constant, otherwise a register.

  12. Some favorites, used in OKOS on PIC18:

    Chained else-if style checks, similar to what you’d use switch-case for:
    xorlw CHAR_A
    bz cliAssembler
    xorlw CHAR_E ^ CHAR_A
    bz cliEditor
    xorlw CHAR_R ^ CHAR_E
    bz cliRun

    (similar to wheels post)

    A take on the xor swap, repurposed for shifting memory. Fewer instructions than moving copies around
    ;load the first char, point to next
    movf POSTINC1, w
    xorwf INDF1, w ; W := A^B; X = B
    xorwf INDF1, f ; W = A^B, X := B^(A^B) = A
    xorwf POSTINC1, w ; W := (A^B)^A = B; X = A

  13. You’ll have to forgive me. 1k challenge? Had imagined that’s all you get… as in there is no Assembler, no C compiler, no runtime package loading up when you execute. Just hand compiled code that executes directly, perhaps even toggled in by hand the way we all used to have to do it when this hobby got started. Back in the day MS Basic fit into a single EEPROM. What would windoze run like if re-written in machine language instead of a compiler? Thinking the contest is more along those lines. What’s the tightest fastest code YOU can write that fits within 1k… meaning NOT what C can write from your text file. YOU write the code.

  14. On 6809 (and some other 8 bit chips) there were two ways to branch – 16 bit address or 8 bit offset. Of course the 8 bit offset meant the target address had to be +/- 256 of the call. To save one byte per call, I coded a program with the start in the middle of memory (only 750B available) and branched up/down from there.

  15. In the old 8bit days we also never wasted memory on variables. Instead of having two bytes of memory holding some variable you picked the most time critical load of the register and made it an immediate load. The last two bytes of the instruction (on Z80 at least) then become the variable.

    ie instead of

    LD HL, (foo)
    blah blah
    LD (foo),HL
    foo: DEFW 0

    _vfoo: LD HL,#0
    foo .equ _vfoo + 1
    blah blh
    LD (foo),HL

    Smaller and faster

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s