Linux-Fu: Parallel Universe

At some point, you simply run out of processing power. Admittedly, that point keeps getting further and further away, but you can still get there. If you run out of CPU time, the answer might be to add more CPUs, although sometimes the real bottleneck is something else, like memory or disk space. Still, it is likely that you have access to multiple computers. Who doesn’t have a few Raspberry Pis sitting around their network? Or maybe a server in the basement? Or even some remote servers “in the cloud”? GNU Parallel is a tool that lets you spread work across multiple jobs, either locally or on remote machines. In some ways, it is simple, since it looks sort of like xargs but with parallel execution. On the other hand, it has myriad options and configurations that can make it a little daunting to use.

About xargs

In case you don’t use xargs, it is a very simple program that among other things lets you do something with a list of files. For example, suppose we want to search all C source files for the string “hackaday” using grep. You could write:

find . -name '*.[ch]' | xargs grep -i hackaday

Here, xargs reads input lines and hands them to grep as arguments, packing as many file names onto each command line as will fit. After grep completes, it repeats the process until it runs out of input. (Note: handling files with spaces is a bit tricky. Using -d '\n' might help, although not all versions of xargs support it.)
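As a rough sketch of the file name problem (an illustration with made-up file names in a throwaway directory, not part of the original test), the script below uses find's -print0 together with xargs -0, which separate names with a NUL byte instead of a newline. It also shows that xargs batches its input: three input lines result in a single run of echo.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
printf 'visit hackaday\n' > "$demo/plain.c"
printf 'Hackaday rocks\n' > "$demo/with space.h"
printf 'nothing here\n'   > "$demo/other.c"

# -print0 and -0 separate names with NUL, so spaces (and even newlines)
# in file names survive. grep -l prints only the names of matching files.
matches=$(find "$demo" -name '*.[ch]' -print0 | xargs -0 grep -il hackaday | sort)
echo "$matches"

# xargs batches arguments: three input lines, but echo runs only once.
runs=$(printf '%s\n' a b c | xargs echo | wc -l | tr -d ' ')
echo "echo ran $runs time(s)"

rm -rf "$demo"
```

Both options are extensions found in the GNU and BSD versions of find and xargs, so this should work on most Linux systems.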

In the simplest case, Parallel does the same thing, but it can execute grep — or whatever you are using — multiple times at once. On a local machine, this allows you to use multiple CPUs to improve timing. However, you can also spread the work among different machines that have passwordless ssh logins.
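Here is a minimal sketch of the earlier grep search done with Parallel instead of xargs. The file names and the throwaway directory are made up for the demonstration. The script first checks that the parallel on the path really is GNU Parallel, since other tools ship a command with the same name. By default Parallel starts one grep per input file, so -H is added to make grep print the file name even though it only sees one file at a time.

```shell
#!/bin/sh
# Throwaway directory with two small source files.
demo=$(mktemp -d)
printf 'visit hackaday\n' > "$demo/a.c"
printf 'Hackaday rocks\n' > "$demo/b.h"

# Only proceed if this is GNU Parallel and not a different "parallel".
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    have_gnu_parallel=yes
    # -0 pairs with find's -print0; -H forces grep to print file names.
    hits=$(find "$demo" -name '*.[ch]' -print0 |
           parallel -0 grep -iH hackaday | sort)
    echo "$hits"
else
    have_gnu_parallel=no
    hits=""
    echo "GNU Parallel not found; skipping the search"
fi

rm -rf "$demo"
```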


The author of GNU Parallel has a multipart video demonstration of the system. You can see the first part, below. The tutorial is also very good, and clears up a number of details that might not be obvious from the man page.

Just for my own amusement, I took a directory with some large mp4 files in it and used both xargs and parallel to gzip each file. I know, I know. The files are already compressed, so gzip isn’t going to do much. But I just wanted some large task to time. Here are the results:

[:~/Videos/movies] $ time find *.mp4 | xargs -d '\n' gzip

real    6m10.796s
user    2m52.828s
sys     0m9.718s
[:~/Videos/movies] $ time find *.mp4 | parallel --jobs 8  -d '\n' gzip

real    5m25.050s
user    2m56.676s
sys     0m7.732s

Admittedly, this wasn’t very scientific, and saving about 45 seconds isn’t a tremendous gain, but still. I picked eight jobs because I have an eight-core processor. You might vary that setting depending on what else you’re doing at the time.
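If you would rather not hard-code the job count, one possible approach (a sketch, not what was used for the timings above) is to ask the system how many cores are online and leave one free for whatever else you are doing. The "leave one core free" rule is just an assumption for illustration. Without any --jobs option, GNU Parallel defaults to one job per CPU core.

```shell
#!/bin/sh
# Number of online CPU cores, as reported by getconf.
cores=$(getconf _NPROCESSORS_ONLN)
# Leave one core free for interactive work, but never drop below one job.
njobs=$(( cores > 1 ? cores - 1 : 1 ))
echo "cores: $cores, jobs: $njobs"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --keep-order prints results in input order even when jobs
    # finish out of order.
    out=$(printf '%s\n' 3 1 2 | parallel --jobs "$njobs" --keep-order echo job {})
    echo "$out"
else
    out=""
    echo "GNU Parallel not found; skipping the demo run"
fi
```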


If you want to use remote computers to process data, you need to have passwordless ssh remote access to the other computer (or computers). Of course, chances are the remote computer won’t have the same files and resources, so it makes sense that — by default — your commands only run on the remote server. You can provide a comma-separated list of servers, and if you use the server name of “:” (just a colon), you’ll include your local machine in handling jobs.
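As a sketch of what a remote run might look like, the function below compresses every mp4 file under the current directory using two remote machines plus the local one. The host names pi1.local and basement-server are hypothetical, and the function assumes both accept passwordless ssh logins and that the tools Parallel uses to copy files are available on them. The --trc option tells Parallel to transfer each input file to the worker, return the named result file, and clean up the remote copies afterwards.

```shell
#!/bin/sh
# Hypothetical host names; ":" adds the local machine to the pool.
servers="pi1.local,basement-server,:"

compress_on_cluster() {
    # --sshlogin is the long form of -S.
    # --trc '{}.gz' is shorthand for:
    #   --transferfile {} --return {}.gz --cleanup
    # The command is quoted so the redirection happens on the worker.
    find . -name '*.mp4' -print0 |
        parallel -0 --sshlogin "$servers" --trc '{}.gz' 'gzip -c {} > {}.gz'
}

# Not called here: run compress_on_cluster only after replacing the
# host names above with machines you can actually reach.
```

Using gzip -c with a redirect, instead of plain gzip, keeps the original file in place on every worker, including the local one.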

This might be very useful if you have a mildly underpowered computer that needs help doing something. For example, we could imagine a Raspberry Pi-based 3D printer asking a remote host to slice a bunch of models in parallel. Even if you think you don’t have any computational heavy lifting, Parallel can do things like process files from a tar archive as they are unpacked without waiting for the rest of the files. It can distribute grep’s work across your CPUs or cores.

Honestly, it would take a lot to explain each feature in detail, but I hope this has encouraged you to read more about GNU Parallel. Between the videos and the tutorial, you should get a good idea of some of the things you could do with this powerful tool.

38 thoughts on “Linux-Fu: Parallel Universe”

  1. Interesting. I could find uses for that. I’m thinking I could maybe lighten my in-progress OTA TV snarfer, knock it back to a lower power dual core, then have it stash work to a shared drive, then overnight fire up a quad core box to help transcode and convert, and put it to sleep again. Provided WOL is an option here.

  2. Tonight I will crack open a cold one and drink to the memory of OpenMosix or whatever they called it in its last iteration before development shut down for the last time.

  3. `xargs` also has a parallel flag… it runs things on the same system (not across systems) but it can effectively utilize all of your cores. use the `-P` and `-n` flags together try it out…

  4. So, I guess most of the examples, if run across multiple machines, assume that all the machines have the same filesystem mounted? Is that right, or is it actually transferring over the contents of any files you are doing operations on?

    1. According to the video, it transfers the files. There’s an explicit flag for removing the transferred files, as well, so you could end up leaving the files for some reason

  5. A couple notes for Debian Buster and Raspbian Buster. If you have the ‘moreutils’ package installed, but not the ‘parallel’ package, you will still have a ‘parallel’ command. This has no ‘--version’ option, so no idea how old it is, but it’s definitely not the same as the GNU Parallel command, and supports a total of 4 options.

    Grep has some smarts so that if multiple file names are passed on the command line, it will print the file name along with the matching line, and if only one name is passed, it does not. grep can be forced to always print file names with the ‘--with-filename’ (-H) option, or never print names with the ‘--no-filename’ (-h) option.

    To have parallel work the same as ‘find . -name \*.[ch] | xargs grep -s main’, use ‘find . -name \*.[ch] | parallel grep -s -H main’ (I’m ignoring the fact that xargs does support the -P option, as mentioned above).

    For whatever it’s worth, on a Raspberry Pi 3, the above commands took 1.137s with ‘parallel’ (specifying the ‘-j 4’ option) and 0.043s with ‘xargs’, searching 47 files with a combined size of 746K. These times were extremely repeatable, with a variation of +/-0.020 seconds for the ‘parallel’ version. No idea why it’s so much slower.

  6. I have a python program that does astronomical gearing calculations that was written for multiple wheel and pinion trains and variable tooth count inputs, but it takes days occasionally to run depending on how I have the inputs configured.

    Granted, C would’ve probably been a computationally speedier language for the approach, but it was a learning exercise in python, and the base was created for me by a friend. Now that I have a multicore i7 9650, versus an old single core 2.3 GHz pentium M chip circa 2005, I’m sure it’ll run faster- but I’d have to dig the program outta storage first and reconfig.

    Scaling from 8 input combinations would take a week to run- I’d love to scale it higher maybe up to 16 inputs. This will undoubtedly still lag a lot-

    So could this command spread the inputs apart and process them simultaneously across all my cores to actually make it finish before the universe dies of heat death?

    1. Interesting. If you already have it written in Python and packaged it should not be difficult (in principle) to parallelize the work for various input combinations using native Python multiprocessing on one machine and Spark for multiple workers on the network (or just one as well).

      1. It was my understanding even at the time of nascency that python was being used as a teaching tool for my interest in python at the time- but it was conceded that another language was probably more efficient at the pure computation.

        I’d learn to code from scratch just to write this in the optimum language (15 year simple linux user), but I’m not really a coder.

        Thought about farming it out to local university supercomputer cluster, but I’d like to see what’s possible on a standard optimization for normal pc first.

    2. Check out Julia if you are interested in performance, as it includes MatLab/Octave/R style human readable syntax. However, Julia also often simplifies scaling tasks, and has local solutions like CUDA wrappers.

      I normally am not that keen on modern languages, but this one is fundamentally different given it often outperforms most user written C/C++ routines, and many core libraries like NumPy at many tasks.


      1. Being not a coder, but bilingual in Japanese by degree, I can appreciate the power of good syntax.

        That seems to be an interesting language indeed, I may try rewriting this in that- as I’d love to scale up in inputs to around 16 gears or so.

        The program was intended to generate extremely precise (past 10 digits if possible) nonstandard gear ratios for creation of astronomical orreries of high accuracy. Computationally it was slow as hell in python, but my friend was super kind to basically write the base of it and I added on and learned python.

        If I could use both this language, or another purely optimized for mathematical computation, and parallel processing simultaneously, I’d work to make it freely available to all horologists.

      1. Brute force as far as I can describe. Not a coding expert, so might be wrong. It basically takes inputs as a range, assigned by user, at upper and lower end for tooth count (I limited it to 5 minimum teeth and 360 upper, even though 3 teeth could be done, and above 360 teeth is doable but impractical), and gave this input to it named as wheel and pinion, and layered the logic for ratio calculations as if they were 1 wheel and pinion per shaft, with allowance for 1 sole wheel and pinion at either end on their own- so it just brute force runs every combination of numbers to get a desired input ratio.

        It was optimized further by creative disregard of similar combinations, and outputs were given with result and deviation from desired input ratio number to manually calculate rotational errors due to ratio error over given amounts of time, generally in centuries.

        The idea was to scale to 10,000 years eventually, as I wanted to make theoretical accurate models for long period accuracy (which is impossible with linear gear ratios, due to cosmic drift and other factors, but was working on non-linear correctional gearing to take care of that eventually too)

        Wanted to make it all freely available for anyone, as a master program to help other horologists.

        1. I see, I’m not a great expert on this but my gut is saying you might look at other methods to get a good speedup, an example of what’s out there…

          Because if I understand right, what you’re doing is taking some big number, the final ratio you want, and trying to break it up into smaller ratios, which multiply together, hence are factors of the bigger number.

          1. The book “Group Theory In The Bedroom” by Brian Hayes discussed the issue perfectly in historical contexts in regards to one chapter explaining the creation of Brocot Tables for gear factorization and how it was essentially a specialized factorization problem to be solved by astronomical clockmakers.

            It was gifted to me when I was already trying to design some stuff, and also found my own copy of “Geared To The Stars” by Henry C. King. I did a few Brocot tables by hand, then quickly turned to programming to solve my niche problem.

            You’re correct- I’ve known since the start it’s essentially a specialized factorization optimization, using gearing to display an answer.

  7. find supports a -print0 argument and xargs supports a -0 (dash zero) argument, and you should use them together. find … -print0 prints out file names separated by ASCII NUL instead of line breaks, and xargs -0 reads file names separated by ASCII NUL instead of line breaks. Why? Simply because line breaks are a legal part of a file name, but ASCII NUL isn’t. Using them won’t confuse xargs when find prints filenames with line breaks and other nasty surprises.

  8. >Who doesn’t have a few Raspberry Pis sitting around their network?

    I don’t. Learn from data center: Consolidate hardware and put systems on vm. The money for a few Pi would be more wisely spent on getting higher end Ryzen type of processors. Let some 8-bitter deal with real time I/O
    8 core Ryzen 3700X is roughly $273/8 = ~$35 per core

    1. No room for absolutes on HackaDay, I’m afraid. While most of my personal infrastructure is on VMs, I do have a number of RPis on my network for certain tasks that I don’t want to hassle with USB passthrough or other crap (environmental, wiring, etc.) to get running on the server rack.

    2. I’ll be “that guy” and point out you’re still $300 short of a minimal system. Much as I think Pis are overhyped by those for whom it’s their first time on a lightweight linux distro rather than the hardware being spectacular.

    3. A $35 raspberry pi has 4 cores. that’s $8ish a core (RAM included). Still you’re missing the point of the pi. They’re far smaller and don’t require as much initial investment.

    4. You also get what you pay for there…

      i’ve already got a stack of dead ryzen hardware in shop taller than the last 3 generations of intel hardware combined…

      don’t get me wrong, love AMD stuff for on the cheap.. but any sort of heavy lifting, you’re gonna be spending on replacement hardware.

      it’s fast and it’s cheap, they chose their two.

      you lose one Pi in a cluster, you’re out $40 and everything stays running.

      you lose your Ryzen Vm server, you’re down, and out quite a bit more.

  9. > Note: handling files with spaces is a bit tricky.

    The -print0 option on find and the -0 option on xargs are meant to deal with that by using null terminated strings.

  10. I’m reminded of the old ray tracing program PovRay. Ray tracing every pixel of a “frame”, one at a time.
    Since the scene was based on a file description, you could render video frames on multiple machines. But (from my dusty memory) you could also split a single frame across multiple cores.

    These days, graphics is typically handled by a GPU but the principle should still hold.
