[Michael Lynch] encountered a strange situation. Why was compiling then running his program nearly 10x faster than just running the program by itself? [Michael] ran into this issue while benchmarking a programming project, pared it down to its essentials for repeatability and analysis, and discovered it highlighted an incorrect mental model of how bash pipelines worked.
Here’s the situation. The first thing [Michael]’s pared-down program does is start a timer. Then it simply reads and counts some bytes from stdin, and finally it prints out how long all of that took.
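The original is written in Zig, but the logic is simple enough that a rough shell stand-in captures the idea. This is just a sketch, not [Michael]’s actual code, and it leans on GNU date for nanosecond timestamps:

#!/usr/bin/env bash
# Hypothetical stand-in for the test program: start a timer,
# read and count all of stdin, then report the elapsed time.
start=$(date +%s%N)   # nanoseconds since epoch (GNU date only)
bytes=$(wc -c)        # wc -c reads stdin to EOF and counts the bytes
end=$(date +%s%N)
echo "bytes: $bytes"
echo "execution time: $(( (end - start) / 1000 ))µs"

The crucial detail is that the timer starts before the first read from stdin, so any time spent waiting for input is included in the reported figure.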
When running the test program in the following way, it takes about 13 microseconds:

$ echo '00010203040506070809' | xxd -r -p | zig build run -Doptimize=ReleaseFast
bytes: 10
execution time: 13.549µs
When running the (already-compiled) program directly, execution time swells to 162 microseconds:

$ echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes
bytes: 10
execution time: 162.195µs
Again, the only difference between zig build run and ./zig-out/bin/count-bytes is that the first compiles the code and then immediately runs it, while the second simply runs the already-compiled program.
How can adding an extra compile step decrease the execution time? Turns out that [Michael]’s mental model of how bash pipelines work was incorrect, and he does a great job of explaining how they actually work, and why that caused the strange behavior he was seeing.
In short, commands in a bash pipeline are not launched sequentially. They are all launched at the same time and execute in parallel. That meant that when run directly, [Michael]’s byte-counter program launched immediately, then waited around doing nothing much for about 150 microseconds while the echo '00010203040506070809' | xxd -r -p part of the pipeline got around to delivering its data for the program to read. That wait is where the extra execution time comes from when running the already-compiled version.
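A quick way to convince yourself of this (assuming bash or a similar shell):

$ time (sleep 1 | sleep 1 | sleep 1)

If the stages ran one after another, this would take about three seconds; because the shell launches all three at once, it finishes in roughly one.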
So why does compiling first make the run look faster? Same basic reason: when the zig build run command kicks off, it spends a little time compiling the program first. By the time the compiled program is actually launched (and begins its execution timer), the input data from the bash pipeline is already waiting. So the freshly-compiled program executes in less time because it doesn’t sit around waiting for data from earlier in the pipeline to become available.
It’s an interesting look at how bash pipelines actually function under the hood, and we’re delighted with the detail [Michael] puts into the whole journey and explanation. Sooner or later, details like this crop up and raise some eyebrows, like the user who discovered troublesome edge cases regarding spaces in ssh commands.
To me it seems obvious that bash starts everything concurrently – how else could pagers work? – but the general point stands. I’m often reminded how little I know about the details of supposedly basic things, like how argv gets passed, or how input is echoed to the terminal.
Nowadays parallel is closer to truly parallel, thanks to multi-core CPUs. I expect that on a single core processor you would see much more variance in the results, because the time measurement appears to include IO access – and the order of the tasks executing and when they would get pre-empted would vary a lot.
I’ve not tried in a decade but taskset could let you run bash on a single core and might just let you test that idea.
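For anyone who wants to try it, taskset is still around on Linux, and something like this (untested, with paths as in the article) should pin the whole pipeline to CPU 0:

$ taskset -c 0 bash -c "echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes"

The children bash forks for the pipeline inherit the CPU affinity mask, so all three stages would compete for the same core.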
I think the mental model in the explanation is wrong. It’s not that compiling takes the same time as echoing and parsing the hexadecimal stream; it’s that in the former case there is one less blocking read/syscall. When run directly (without compiling), the program tries to read input data and blocks. The other processes then run (or run concurrently, it doesn’t matter here). When the data is available, the kernel’s task scheduler has to detect that the pipe is ready to be read and resume the waiting program. It does so in ~150µs, which is quite efficient in fact.
When compiling and running, the program simply tries to read the data and since there’s already some data in the pipe to be read, there’s no need to run the scheduler and relinquish its timeslice.
You would never see more than one scheduler timeslice of difference between the two programs (easy to prove: add a "sleep 100" in your program and the difference will stay the same).
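If you want to see this directly, strace can time the read for you. Something along these lines (assuming Linux; both flags are standard strace) would show how long the first read() blocks in each case:

$ echo '00010203040506070809' | xxd -r -p | strace -T -e trace=read ./zig-out/bin/count-bytes

The -T flag appends the time spent inside each syscall, so a read() that had to wait for the scheduler to wake the process should stand out from one that found data already sitting in the pipe.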
It would be interesting to see how the results differ with regard to system load, since it seems very plausible that your explanation of the difference in latency is correct, and the timing should therefore closely mimic the behaviour of the scheduler under different load conditions.
Thanks for that. I was thinking that the original explanation doesn’t really make sense, but yours does. If echoing and parsing the data took the same time as the compile then we’d see about the same resulting times.
This is interesting with respect to clock time. Using a UNIX utility like “time” would have shown a whole lot of time not spent in user or sys mode, i.e. sleeping. BTW, who knew that “echo” and “xxd” were so slow?!
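That’s easy to see with any command that mostly waits:

$ time sleep 1

Nearly all of the reported real time is unaccounted for by user and sys, which is exactly what a pipeline stage blocked on a read() looks like.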
The commands themselves are not necessarily slow; the problem is that a new process has to be spawned for every command in the pipeline, which is actually horrendously slow. Makes me cringe every time I see a system() in a program, which is the slowest way to get anything done.
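The fork/exec overhead is easy to measure by comparing an external binary with a shell builtin (a rough demonstration; absolute numbers vary widely between systems):

$ time for i in $(seq 1000); do /bin/true; done   # spawns 1000 processes
$ time for i in $(seq 1000); do true; done        # shell builtin, no new processes

The first loop is typically orders of magnitude slower, and system() is worse still, since it launches a shell on top of the fork/exec for every call.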
well, my mental model was confirmed by this article. *shrug* unix pipes aren’t DOS pipes, film at eleven. in this day and age, why would anyone even have DOS pipes in their intuition anywhere?
when you’re benchmarking a single 10-byte read() syscall, all you’re going to be able to measure is confounders / noise, and imo you shouldn’t be surprised by that. the details of how the measurement works are far more relevant than anything else. i wouldn’t say the program “ran 10x faster”, but rather that the measurement of noise measured 10x less noise.
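One way to get a feel for how much of the figure is noise is simply to repeat the measurement, e.g. with a loop like this (hypothetical, using the article’s binary):

$ for i in $(seq 10); do echo '00010203040506070809' | xxd -r -p | ./zig-out/bin/count-bytes; done

The spread between runs shows how much of any single 13µs or 162µs reading is scheduling jitter rather than the work itself.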
Exactly right. The file descriptor pairs are created before the processes are forked and exec’ed.
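On Linux you can watch that from inside a pipeline (assuming procfs is mounted):

$ ls -l /proc/self/fd | cat

Because the pipe already exists when ls is exec’ed, its fd 1 shows up as a pipe:[inode] entry rather than the terminal, even though nothing has been read from the pipe yet.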
Am I missing something? He is measuring the time inside the final command, not the time to compile. So in one case his code starts at the same time as the other parts of the system (so may have to wait on those other parts to complete), while in the other case his code starts after all other parts are completed. The results would likely be the same if the pipeline was written like:
echo '00010203040506070809' | xxd -r -p | (sleep 2 ; ./zig-out/bin/count-bytes)
Compilation has nothing to do with it.
[All of that said, benchmarking something that takes a few microseconds to run can be challenging.]