If you like nitpicking around C code and generated assembly language — and we’ll admit that we do — you shouldn’t miss [Scaramanga’s] analysis of what’s known as Kernighan’s trick. If you haven’t heard of the trick, it’s a pretty efficient way of counting bits.
Like the Wheatstone bridge and a lot of other things, the Kernighan trick is misnamed. Brian Kernighan made it famous, but it was actually first published in 1960 and again in 1964 before he wrote about it in 1988. The naive way to count bits would be to scan through each bit position noting how many one bits you encounter, but the problem is, that takes a loop for each bit. A 64-bit word, then, takes 64 loops no matter what it contains. You can do slightly better by removing each bit you find and stopping when the word goes to zero, but that still could take 64 cycles if the last bit you test is set.
Using the trick, you make the observation that X & (X-1) will always clear the least significant bit of a word. Try a few examples:
X X-1 X&(X-1) 0001 0000 0000 0010 0001 0000 0011 0010 0010 1010 1001 1000 1100 1011 1000 1111 1110 1110
You can probably see where this is going. By computing X&(X-1) you clear a bit on each loop iteration and you only have to go through the number of bits that are actually set.
This kind of thing is a common enough type of interview question although as [Scaramanga] points out, compilers will probably optimize this to use specific CPU operations and get even better performance. The POPCNT instruction on the x86 architecture, for example, will do it all in one instruction. He also has a detailed explanation of exactly why this works the way it does.
Of course, most software doesn’t need to run so fast that it is worth using obscure tricks. But sometimes it makes sense. It is also a nice test of logic and problem solving in an interview situation.
If you like this sort of thing, be sure to check out [Sean Anderson’s] extensive list of bit hacks. It shows several different ways to count bits and do other common and uncommon tasks with different tradeoffs. For example, you could dedicate a 256-entry lookup table and do the whole thing with one loop per byte with great speed but bad memory utilization. Always a trade-off.