With modern tools, you have to try very hard to do something stupid, because the tools (rightly) recognize you’re doing something stupid. [Andreas Karlsson] can speak to that first hand as he tried to get four billion if statements to compile.
You may ask what state space requires four billion comparisons to evaluate? The answer is easy: the range of an unsigned 32-bit integer. The whole endeavor started with a simple idea: what if instead of evaluating whether an integer is even or odd with a modulo or bit mask, you just did an if statement for every case? Small ranges like 0-10 are trivial to write out by hand, but you reach for more automated solutions as you pass 8 bits and move towards 16. [Andreas] wrote some Python that outputs a valid C program with all the comparisons. For 16 bits, the source only clocks in at 130k lines with the executable less than 2 MB.
Of course, scaling to 32 bits is a very different problem. The source file balloons to 330 GB, and most compilers barf at that point. Undeterred, [Andreas] modified the Python to output x86_64 assembly instead of C. Of course, the executable format of Windows (PE) only allows executables up to 4 GB, so a helper program mapped the 40 GB generated executable and jumped into it.
What’s incredible about this whole journey is how performant the program is. Even large numbers complete in a few seconds. Considering that it has to thrash 40 GB of an executable through memory, we can’t help but shake our heads at how even terrible solutions can work. We love seeing someone turn a bad idea into an interesting one, like this desoldering setup.
Older class AMD GPUs with DX10 class hardware can only execute loop with 32 bit counter and don’t have actual jump, only rigid loop and if/else constructs in shader instruction set.
So when used for GPGPU they cannot truly implements infinite loop, so the best next thing the compiler does is 4 billion iteration loop. It completes in few second (when empty). So they can finish infinite loops in few seconds ;-)
Isn’t this still the case for all GPUs? The shader units in a group can only execute the same instruction, one can’t branch while the others don’t. Groups are made up of 32-64 shader units or something around that number.
If a single shader unit would be able to branch independently that would mean it needs its own code memory, instruction fetch, decode and program counter. Basically it would be a full processor. If all shaders in a group run the same code then only one code mem, fetch, decode, program counter unit is needed for all the shader units in that group. 16000 independent processors on a single die would eat up a lot more area.
No, a wavefront on modern GPUs can run forever in infinite loop.
It has been possible on nVidia G80 arch from the very beginning.
AMD stopped implementing DirectX spec to the letter (including the rigid shader structure) starting with Southern Islands. Where basically scalar unit (which does flow control) can run any general purpose code.
Infinite loops have nothing to do with how massively parallel GPU is or divergence between pixels, It is just about instruction set.
An important note is that he obviously turned off optimizations even before resorting to machine code: I’m kinda curious as to the compiler behavior as n grows. Would it actually be able to recognize the modulo operation for the elements compared? Or at least convert it into a jump table of some sort?
yeah! the sequential if() on a single variable can be translated to a switch table. the identical code for each of the two possible results can be collapsed into just two destination labels. the switch table branch code can take into account the bitwise pattern. it’s very possible for a compiler to reduce it to the proper if (number&1).
one challenge is to actually chain them all together, to take care of all of the details that inevitably foul up the next step in the optimization. the other challenge is to implement it in a reasonably performant way. for example, searching for bitwise operations that will match a pattern of “case x” constants can easily be exhaustively slow, and isn’t rewarding at all very often. i don’t think i’ve ever seen a compiler that would recognize an even/odd pattern of case labels, though i’ve seen plenty of other neat bitwise tricks in generated code.
if i was spending a lot of time on conditional optimization, i think there’s a lot more low hanging fruit converting … } else if (!strcmp(opt, “-flag”)) { … chains into a sort of lexer, maybe a series of nested switch tables. :)
Can’t wait for the 64bit version!
There must be something wrong with me.
I immediately thought of how to reduce the average execution time, by a nested tree:
if (number == 0x7ffffff) printf(“odd\n”);
else { if (number < 0x7fffffff) {
if (number == 0x3fffffff) printf(“odd\n”);
else { if (number == 0x1fffffff) printf(“odd\n”);
…
} else /* number > 0x7ffff */
if (number = 0xc0000000) printf(“even\n”);
else { if (number > 0xc0000000) printf(“even\n”);
…
Maximum of 63 (I think) comparisons instead of maximum 4 billion.
If that’s the goal, you can get it all the way to one comparison.
Rainbow tables to the rescue. Just one seek.
But that wouldn’t be a Stupid Optimization, would it?
If it makes you feel better, that’s where my head went as I was writing this
Nice, but I prefer the elegant recursive version:
int iseven(int num) {
if (num == 1) return 0;
return ! iseven(num – 1);
}
For positive integers only, naturally; anything else is a luxury. With a q&d test with MSVC (too lazy to ssh to a Linux VM to try under GCC) compiling to a 64-bit Windows executable in debug mode, this can handle values as high as 16106 before overflowing the stack and crashing (which it does silently, in typical awesome Microsoft fashion). That should be large enough for anyone.
Four billion ‘IF’ statements?
Sounds like someone’s building an AI.
This
Can be also asking wife if he can go this weekend fishing with his friend.
Do you mean you would do a if table with all the possible week-end dates? :3
“What if we wrote this software the dumbest and least efficient way possible?”
Okay neat I guess
There is, of course! You could write the program with a for loop instead, like this:
void evenorodd() {
int I;
for (I = 0 ; I < INT_MAX ; I += 2) {
if (number == I) {
printf(“even\n”);
return;
}
}
printf(“odd\n”);
}
This is much shorter and efficient!
If only there was an easier way
There is, of course! You could write the program with a for loop instead, like this:
void evenorodd() {
int I;
for (I = 0 ; I < INT_MAX ; I += 2) {
if (number == I) {
printf(“even\n”);
return;
}
}
printf(“odd\n”);
}
This is much shorter and efficient!
Except they didn’t “compile” it.
They wrote a script to generate intentionally bloated assembly.
A modern compiler, if able to load the dumb source code of that huge file, would optimize it into something reasonable like a binary tree(or something more efficient. But even non coders know what a binary tree is even if they don’t know what it is called).
This is the same kind of trickery as a compiler unrolling a loop into 40k if statements. Or the opposite, changing a variable and functions into a constant, because the output always ends up the same based on other code.
The difference here is the coder intentionally generated bad assembly for… reasons? I guess just to see if they could get it to run?
Perfectly good reason to do it, but not exactly useful.
This is one of those “why did you climb that mountain?” Because it was there” reasons. Nothing gained but the experience.
No “else if” in this silly contraption?
Come on, just upgrade your computer :-p. In the XXI century, 40 GB of RAM wasted on a silly program with 4 billion IFs and not a single else statement (or return, for that matter) is, or will soon become, the norm. Or – wait, there’s a solution: put it in the cloud and pay only when you use it.
Now I wanna see one with a huge ass hash table
“What’s incredible about this whole journey is how performant the program is. Even large numbers complete in a few seconds.”
One shouldn’t be too surprised. Windows seems to contain similar high quality code and it still runs reasonably well, doesn’t it?
This speed is 100% memory bound, so it is just limited to ddr access times and cache fetches, OS has zero impact.
I think Daid was making a funny comparison between the quality of the code in Windows and the quality of code in this program, indicating that they’re not dissimilar.
Nice trigger from the editor including the screenshot of the early version with 8-bit numbers and then talking about 32-bit numbers.
silliest thing is not using a string constant for the output…
It’s very important not to do early optimizations in any software project.
Dear moderators, what did the discussion about GPUs not having jump instruction to realize “IF” statements violates any commenting rules?
Let’s not jump to conclusions here, Lamalas.
Sad to see 330GB source failling in GCC, try minifying it, and macro for printf, and shorter var name.
PS. what happened to HD comment module? It doesn’t remember me and cannot use single login.
Makes me nostalgic for the first text adventure I ever wrote in BASIC. I didn’t know about flow control or subroutines, so it was one enormous loop:
10 INPUT I$
20 IF I$ = “go north” AND POSITION = 1 AND HEALTH > 50 AND SWORD = 1 THEN PRINT “You enter the castle”
…
10000 GOTO 10
we have all been there…thing is, some of us are still there at our day jobs!
Why not use a switch stmt, it’s way more efficient.
I cant believe Im not the only one working on scalable integer size approaches to solutions. I do algorithm checks for fun, and Ive started making sure my code can handle arbitrary sized integers. The research paper I based this idea on only covers 16, 32, 64bit cases, but Ive extended it to any integer with bitmasking.
But for the last 10+ years we have 64 bit numbers and addressing (although that seems to be limited to 56 bits or so). So there is plenty of room left to heat the universe even more.
Or maybe it was 20 years and it was only the raspberry foundation that kept the 32 bitters alive for another 10 years.
Gosh, at the moment it would be quite limiting to only have a 32 bit (4GiBi) address space.
Have I missed something? Or have I indeed missed since else do this exact same comment but…..
Surely if you check the least significant bit of the number, that would instantly tell you whether it’s odd or even? (Ie. If bit 1 is zero, then it’s an even number, likewise, if bit 1 is odd, then it’s an odd number
Or is this the usual method? And even this is slower than trailing through all those if statements?
Feel free to call me a dumbarse if I have missed something 😂
You missed this part:
what if instead of evaluating whether an integer is even or odd with a modulo or bit mask, you just did an if statement for every case?
There was no problem solving – just curiosity.
Once I saw an algorithm to calculate the factorial of a number. But it had some twists in it.
First, it was a recursive algorithm calling itself. This is common enough, as calculating factorials is often used as an example for recursive algorithms.
Second, for calculating the factorial of a number, it did not multiply the factorial of the previous number with the last (next?) number, but it re-calculated the factorial of the previous number, and as any recursive algorithm, it had to recalculate the factorials of the numbers before that to get to that factorial. This resulted in an explosion of excursiveness (is that quadratic or exponential), but it did come with the right answer at the end. And it even had a useful application too.
It was used in an example application for writing multi threaded applications, and in that example a “work thread” was needed that kept itself busy for a while and then reported back to the GUI thread when it had a result. And that factorial work thread sure kept itself busy!
I recall some drama about an indie game that had it’s source doe leaked. and all the uninformed people were ragging on the amount of if statements blaming them instead of all the assets with bloated polygon counts for performance issues. Compilers are smart, it’s more about best practices and legibility than performance.
The examples here seems weird to me, I would just:
if (number % 1) return odd;
else return even;
Oops:
if ( (number % 2) || (number == 0) ) return odd;