Linux-Fu: Parallel Universe

June 29, 2020

At some point, you simply run out of processing power. Admittedly, that point keeps getting further and further away, but you can still get there. If you run out of CPU time, the answer might be to add more CPUs, although sometimes the real bottleneck is something else, like memory or disk space. Still, it is likely that you have access to multiple computers. Who doesn’t have a few Raspberry Pis sitting around their network? Or maybe a server in the basement? Or even some remote servers “in the cloud”? GNU Parallel is a tool that lets you spread work across multiple jobs, either locally or on remote machines. In some ways, it is simple, since it looks sort of like xargs but with parallel execution. On the other hand, it has myriad options and configurations that can make it a little daunting to use.

About xargs

In case you don’t use xargs, it is a very simple program that, among other things, lets you do something with a list of files. For example, suppose we want to search all C source and header files for the string “hackaday” using grep. You could write:

find . -name '*.[ch]' | xargs grep -i hackaday

Here, xargs reads file names from its input, packs as many as will fit onto a single grep command line, runs grep, and repeats the process until it runs out of input. (Note: handling files with spaces is a bit tricky. Using -d '\n' might help, although not all versions of xargs support it.)
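
As a rough sketch of the spaces problem, the script below uses NUL-delimited names, which is the approach several commenters recommend. The sample files and the temporary directory are made up for illustration. It assumes a find that supports -print0 and an xargs that supports -0, which is true of the GNU and BSD versions.

```shell
#!/bin/sh
# Sketch: NUL-delimited file names keep xargs safe with spaces in names.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample files, one with a space in its name.
printf 'int main(void) { /* Hackaday */ return 0; }\n' > "$workdir/hello world.c"
printf '/* nothing to see here */\n' > "$workdir/empty.h"

# -print0 ends each name with a NUL byte; -0 tells xargs to split on NUL.
# grep -l prints only the names of the files that contain a match.
matches=$(find "$workdir" -name '*.[ch]' -print0 | xargs -0 grep -il hackaday)
printf '%s\n' "$matches"
```

Without -print0 and -0, xargs would split "hello world.c" into two arguments and grep would complain about two files that do not exist.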

In the simplest case, Parallel does the same thing, but it can execute grep — or whatever you are using — multiple times at once. On a local machine, this allows you to use multiple CPUs to improve timing. However, you can also spread the work among different machines that have passwordless ssh logins.
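
Here is a minimal sketch of the same search running several greps at once. The sample files are invented, and the job count of 4 is an arbitrary choice. Because the moreutils package installs an unrelated command that is also called parallel, the script checks the version banner first and falls back to xargs -P if GNU Parallel is not there.

```shell
#!/bin/sh
# Sketch: run one grep per file, up to four at a time.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample sources; two of the three mention the search term.
printf '// hackaday\n'  > "$workdir/a.c"
printf '// HackADay\n'  > "$workdir/b.h"
printf '// unrelated\n' > "$workdir/c.c"

# Only GNU Parallel prints "GNU parallel" in its version banner.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    have_gnu_parallel=yes
else
    have_gnu_parallel=no
fi

run_greps() {
    if [ "$have_gnu_parallel" = yes ]; then
        # -0 reads NUL-delimited names; each name is appended to the command.
        parallel -0 --jobs 4 grep -il hackaday
    else
        # -P 4 allows four processes at once; -n 1 gives each grep one file.
        xargs -0 -P 4 -n 1 grep -il hackaday
    fi
}

# Jobs finish in any order, so sort the names to get stable output.
found=$(find "$workdir" -name '*.[ch]' -print0 | run_greps | sort)
printf '%s\n' "$found"
```

Either branch should list a.c and b.h. For a job this small, the startup cost of the extra processes will likely outweigh any gain, so treat it as a template for heavier work.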

Demos

The author of GNU Parallel has a multipart video demonstration of the system. You can see the first part, below. The tutorial is also very good, and clears up a number of details that might not be obvious from the man page.

Just for my own amusement, I took a directory with some large mp4 files in it and used both xargs and parallel to gzip each file. I know, I know. The files are already compressed, so gzip isn’t going to do much. But I just wanted some large task to time. Here are the results:

[:~/Videos/movies] $ time find *.mp4 | xargs -d '\n' gzip

real    6m10.796s
user    2m52.828s
sys     0m9.718s
[:~/Videos/movies] $ time find *.mp4 | parallel --jobs 8  -d '\n' gzip

real    5m25.050s
user    2m56.676s
sys     0m7.732s

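Admittedly, this wasn’t very scientific, and saving about 45 seconds isn’t a tremendous gain, but still. In both runs, the real time is well above the user and sys times combined, which suggests the job spent much of its time waiting on the disk rather than the CPU. I picked eight jobs because I have an eight-core processor. You might vary that setting depending on what else you’re doing at the time.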
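
Rather than hard-coding eight jobs, you can ask the system how many cores it has. The sketch below does that and compresses a handful of small stand-in files. The file names and sizes are invented, and it uses xargs -P so it runs even where GNU Parallel is not installed. With GNU Parallel, the equivalent knob is the --jobs option used above.

```shell
#!/bin/sh
# Sketch: size the job count to the machine instead of hard-coding 8.
set -eu

# nproc comes with GNU coreutils; getconf is the fallback elsewhere.
njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical stand-ins for the large video files (random data, like
# already-compressed video, will not shrink much).
for i in 1 2 3 4 5 6 7 8; do
    head -c 100000 /dev/urandom > "$workdir/clip$i.mp4"
done

# One gzip per file, with at most $njobs of them running at once.
find "$workdir" -name '*.mp4' -print0 | xargs -0 -P "$njobs" -n 1 gzip

# Count the results; tr strips the padding some wc versions add.
compressed=$(find "$workdir" -name '*.mp4.gz' | wc -l | tr -d ' ')
echo "jobs=$njobs compressed=$compressed"
```

If the disk is the bottleneck, as the timings above hint, raising the job count past a certain point will not help and may even hurt.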

Remote

If you want to use remote computers to process data, you need to have passwordless ssh access to the other computer (or computers). Of course, chances are the remote computer won’t have the same files and resources as your local machine, so bear in mind that, by default, your commands run only on the remote servers. You can provide a comma-separated list of servers with the -S option, and if you use the server name of “:” (just a colon), you’ll include your local machine in handling jobs.
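
To make the server list concrete, here is a small sketch that builds one and prints the GNU Parallel command it would feed. The host names are placeholders, so the script only prints the command instead of running it. The --trc {}.gz option is GNU Parallel's shorthand for transferring each input file, returning the named result, and cleaning up on the remote side.

```shell
#!/bin/sh
# Sketch: build a server list for GNU Parallel's -S option.
set -eu

# Hypothetical hosts; use machines you can reach with passwordless ssh.
remote_hosts="pi1.local pi2.local basement-server"

# Join the hosts with commas, then append ":" so the local machine takes
# jobs too. $remote_hosts is left unquoted on purpose so it splits on spaces.
sshlogins=$(printf '%s,' $remote_hosts):

# Print, rather than run, the command, because the hosts above are made up.
# ":::" introduces the argument list on the command line instead of stdin.
printf '%s\n' "parallel -S $sshlogins --trc {}.gz gzip {} ::: *.mp4"
```

Once the hosts are real, pasting the printed line into a shell in the directory that holds the files should spread the gzip jobs across all of the listed machines.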

This might be very useful if you have a mildly underpowered computer that needs help doing something. For example, we could imagine a Raspberry Pi-based 3D printer asking a remote host to slice a bunch of models in parallel. Even if you think you don’t have any computational heavy lifting, Parallel can do things like process files from a tar archive as they are unpacked without waiting for the rest of the files. It can distribute grep’s work across your CPUs or cores.

Honestly, it would take a lot to explain each feature in detail, but I hope this has encouraged you to read more about GNU Parallel. Between the videos and the tutorial, you should get a good idea of some of the things you could do with this powerful tool.

38 thoughts on “Linux-Fu: Parallel Universe”

RW ver 0.0.3 says:

June 29, 2020 at 10:21 am

Interesting. I could find uses for that. I’m thinking I could maybe lighten my in-progress OTA TV snarfer, knock it back to a lower power dual core, then have it stash work to a shared drive, then overnight fire up a quad core box to help transcode and convert, and put it to sleep again. Provided WOL is an option here.

Duckula says:

June 29, 2020 at 10:28 am

Tonight I will crack open a cold one and drink to the memory of OpenMosix or whatever they called it in its last iteration before development shut down for the last time.

1. Feinfinger says:
  
  June 29, 2020 at 4:59 pm
  
  Mosix was nice.
  :-( … for the “was”.
  
  1. Ren says:
    
    June 30, 2020 at 6:44 am
    
    Didn’t we go through that whole “Mosi(x)” thing yesterday?
    
2. Geoffrey says:
  
  July 2, 2020 at 6:01 pm
  
  And Kerrighed. And OpenSSI.
  
PaulD says:

June 29, 2020 at 10:30 am

`xargs` also has a parallel flag. It runs things on the same system (not across systems), but it can effectively utilize all of your cores. Use the `-P` and `-n` flags together and try it out…

1. paleogizmo says:
  
  June 30, 2020 at 5:00 pm
  
  I usually use xargs -P -n over parallel as xargs is installed by default on most *nix systems which is not the case for parallel
  
CityZen says:

June 29, 2020 at 10:34 am

As an alternative to:
$ find . -name '*.[ch]' | xargs grep -i hackaday
there is the single command:
$ grep -i hackaday -r . --include='*.[ch]'

CM says:

June 29, 2020 at 10:41 am

Uhh, xargs includes a -P # flag for running multiple processes at a given time. Please read the man pages of the tools you are using before blogging about it.

1. John says:
  
  June 30, 2020 at 8:40 pm
  
  And deny us of your awesome comments?
  
kc8rwr says:

June 29, 2020 at 10:42 am

So, I guess most of the examples, if run across multiple machines, assume that all the machines have the same filesystem mounted? Is that right, or is it actually transferring over the contents of any files you are doing operations on?

1. jbucky1092 says:
  
  June 30, 2020 at 2:19 pm
  
  According to the video, it transfers the files. There’s an explicit flag for removing the transferred files, as well, so you could end up leaving the files for some reason
  
Jonathan Bennett says:

June 29, 2020 at 10:45 am

That’s… That’s a really interesting command.

jcwren says:

June 29, 2020 at 12:13 pm

A couple notes for Debian Buster and Raspbian Buster. If you have the ‘moreutils’ package installed, but not the ‘parallel’ package, you will still have a ‘parallel’ command. This has no ‘--version’ option, so no idea how old it is, but it’s definitely not the same as the GNU Parallel command, and supports a total of 4 options.

Grep has some smarts so that if multiple file names are passed on the command line, it will print the file name along with the matching line, and if only one name is passed, it does not. grep can be forced to always print file names with the ‘--with-filename’ (-H) option, or never print names with the ‘--no-filename’ (-h) option.

To have parallel work the same as ‘find . -name \*.[ch] | xargs grep -s main’, use ‘find . -name \*.[ch] | parallel grep -s -H main’ (I’m ignoring the fact that xargs does support the -P option, as mentioned above).

For whatever it’s worth, on a Raspberry Pi 3, the above commands took 1.137s with ‘parallel’ (specifying the ‘-j 4’ option) and 0.043s with ‘xargs’, searching 47 files with a combined size of 746K. These times were extremely repeatable, with a variation of +/-0.020 seconds for the ‘parallel’ version. No idea why it’s so much slower.

Drew says:

June 29, 2020 at 12:45 pm

I have a python program that does astronomical gearing calculations that was written for multiple wheel and pinion trains and variable tooth count inputs, but it takes days occasionally to run depending on how I have the inputs configured.

Granted, C would’ve probably been a computationally speedier language for the approach, but it was a learning exercise in python, and the base was created for me by a friend. Now that I have a multicore i7 9650, versus an old single core 2.3 GHz pentium M chip circa 2005, I’m sure it’ll run faster, but I’d have to dig the program outta storage first and reconfig.

Scaling from 8 input combinations would take a week to run- I’d love to scale it higher maybe up to 16 inputs. This will undoubtedly still lag a lot-

So could this command spread the inputs apart and process them simultaneously across all my cores to actually make it finish before the universe dies of heat death?

1. tym0tym says:
  
  June 29, 2020 at 1:16 pm
  
  Interesting. If you already have it written in Python and packaged it should not be difficult (in principle) to parallelize the work for various input combinations using native Python multiprocessing on one machine and Spark for multiple workers on the network (or just one as well).
  
  1. Drew says:
    
    June 29, 2020 at 7:38 pm
    
    It was my understanding even at the time of nascency that python was being used as a teaching tool for my interest in python at the time, but it was conceded that another language was probably more efficient at the pure computational syntax.
    
    I’d learn to code from scratch just to write this in the optimum language (15 year simple linux user), but I’m not really a coder.
    
    Thought about farming it out to local university supercomputer cluster, but I’d like to see what’s possible on a standard optimization for normal pc first.
    
2. Joel says:
  
  June 29, 2020 at 3:01 pm
  
  Check out Julia if you are interested in performance, as it includes MatLab/Octave/R style human readable syntax. However, Julia also often simplifies scaling tasks, and has local solutions like CUDA wrappers.
  
  I normally am not that keen on modern languages, but this one is fundamentally different given it often outperforms most user written C/C++ routines, and many core libraries like NumPy at many tasks.
  
  https://julialang.org/learning/
  
  ;-)
  
  1. Drew says:
    
    June 29, 2020 at 7:33 pm
    
    Being not a coder, but bilingual in Japanese by degree, I can appreciate the power of good syntax.
    
    That seems to be an interesting language indeed, I may try rewriting this in that- as I’d love to scale up in inputs to around 16 gears or so.
    
    The program was intended to generate extremely precise (past 10 digits if possible) accuracy in nonstandard gear ratios for creation of astronomical orreries of high accuracy. Computationally it was slow as hell in python, but my friend was super kind to basically write the base of it and I added on and learned python.
    
    If I could use both this language, or another purely optimized for mathematical computation, and parallel processing simultaneously, I’d work to make it freely available to all horologists.
    
3. RW ver 0.0.1 says:
  
  July 1, 2020 at 6:13 am
  
  Is it brute force iterative or does it use a factoring algorithm?
  
  1. Drew says:
    
    July 1, 2020 at 9:19 am
    
    Brute force as far as I can describe. Not a coding expert, so might be wrong. It basically takes inputs as a range, assigned by user, at upper and lower end for tooth count (I limited it to 5 minimum teeth and 360 upper, even though 3 teeth could be done, and above 360 teeth is doable but impractical), and gave this input to it named as wheel and pinion, and layered the logic for ratio calculations as if they were 1 wheel and pinion per shaft, with allowance for 1 sole wheel and pinion at either end on their own- so it just brute force runs every combination of numbers to get a desired input ratio.
    
    It was optimized further by creative disregard of similar combinations, and outputs were given with result and deviation from desired input ratio number to manually calculate rotational errors due to ratio error over given amounts of time, generally in centuries.
    
    The idea was to scale to 10,000 years eventually, as I wanted to make theoretical accurate models for long period accuracy (which is impossible with linear gear ratios, due to cosmic drift and other factors, but was working on non-linear correctional gearing to take care of that eventually too)
    
    Wanted to make it all freely available for anyone, as a master program to help other horologists.
    
    1. RW ver 0.0.3 says:
      
      July 1, 2020 at 9:42 am
      
      I see, I’m not a great expert on this but my gut is saying you might look at other methods to get a good speedup, an example of what’s out there…
      https://stackoverflow.com/questions/6800193/what-is-the-most-efficient-way-of-finding-all-the-factors-of-a-number-in-python
      
      Because if I understand right, what you’re doing is taking some big number, the final ratio you want, and trying to break it up into smaller ratios, which multiply together, hence are factors of the bigger number.
      
      1. Drew says:
        
        July 1, 2020 at 3:23 pm
        
        The book “Group Theory In The Bedroom” by Brian Hayes discussed the issue perfectly in historical contexts in regards to one chapter explaining the creation of Brocot Tables for gear factorization and how it was essentially a specialized factorization problem to be solved by astronomical clockmakers.
        
        It was gifted to me when I was already trying to design some stuff, and also found my own copy of “Geared To The Stars” by Henry C. King. I did a few Brocot tables by hand, then quickly turned to programming to solve my niche problem.
        
        You’re correct- I’ve known since the start it’s essentially a specialized factorization optimization, using gearing to display an answer.
        
      2. Drew says:
        
        July 1, 2020 at 7:08 pm
        
        For the record, I was specifically trying for a ratio of 1 to 1.002737909350795.
        
        For year 2000 Solar To Sidereal conversion.
        
Tux2000 says:

June 29, 2020 at 1:27 pm

find supports a -print0 option and xargs supports a -0 (dash zero) option, and you should use them together. find … -print0 prints out file names separated by ASCII NUL instead of line breaks, and xargs -0 reads file names separated by ASCII NUL instead of line breaks. Why? Simply because line breaks are a legal part of a file name, but ASCII NUL isn’t. Using them won’t confuse xargs when find prints filenames with line breaks and other nasty surprises.

Comedicles says:

June 29, 2020 at 1:41 pm

Pretty interesting. Since you must have 16 threads, can you run it with maybe 14 to see if it makes a difference?

tekkieneet says:

June 29, 2020 at 5:36 pm

>Who doesn’t have a few Raspberry Pis sitting around their network?

I don’t. Learn from data center: Consolidate hardware and put systems on vm. The money for a few Pi would be more wisely spent on getting higher end Ryzen type of processors. Let some 8-bitter deal with real time I/O
8 core Ryzen 3700X is roughly $273/8 = ~$35 per core

1. asheets says:
  
  June 29, 2020 at 6:25 pm
  
  No room for absolutes on HackaDay, I’m afraid. While most of my personal infrastructure is on VMs, I do have a number of RPis on my network for certain tasks that I don’t want to hassle with USB passthrough or other crap (environmental, wiring, etc.) to get running on the server rack.
  
2. RW ver 0.0.3 says:
  
  June 29, 2020 at 6:25 pm
  
  I’ll be “that guy” and point out you’re still $300 short of a minimal system. Much as I think Pis are overhyped by those for whom it’s their first time on a lightweight linux distro rather than the hardware being spectacular.
  
3. John says:
  
  June 30, 2020 at 9:17 pm
  
  A $35 raspberry pi has 4 cores. that’s $8ish a core (RAM included). Still you’re missing the point of the pi. They’re far smaller and don’t require as much initial investment.
  
  1. RW ver 0.0.1 says:
    
    July 1, 2020 at 6:21 am
    
    about 1500Mhz of a single (Post P4) x86 core = 1 whole pi though in computational performance. 8 3700 cores at max turbo = 96 pi cores.
    
4. Hooptie J says:
  
  July 1, 2020 at 5:48 pm
  
  You also get what you pay for there…
  
  I’ve already got a stack of dead ryzen hardware in shop taller than the last 3 generations of intel hardware combined…
  
  dont get me wrong, love AMD stuff for on the cheap.. but any sort of heavy lifting, you’re gonna be spending on replacement hardware.
  
  its fast and its cheap, they chose their two.
  
  you lose one Pi in a cluster, you’re out $40 and everything stays running.
  
  you lose your Ryzen Vm server, you’re down, and out quite a bit more.
  
Old Guy says:

June 29, 2020 at 6:42 pm

I find just using grep usually suffices when I want to use grep.

Allan-H says:

June 29, 2020 at 8:47 pm

> Note: handling files with spaces is a bit tricky.

The -print0 option on find and the -0 option on xargs are meant to deal with that by using null terminated strings.

1. Cory Albrecht says:
  
  July 1, 2020 at 7:15 am
  
  Came here to say that. One’s Linux-fu is weak if one doesn’t know that.
  
  1. Elliot Williams says:
    
    July 1, 2020 at 7:42 am
    
    Meh. I just remove all spaces from filenames. :)
    
Alan says:

July 1, 2020 at 5:52 pm

I’m reminded of the old ray tracing program PovRay. Ray tracing every pixel of a “frame”, one at a time.
Since the scene was based on a file description, you could render video frames on multiple machines. But (from my dusty memory) you could also split a single frame across multiple cores.

These days, graphics is typically handled by a GPU but the principle should still hold.

Dmpalmer says:

July 12, 2020 at 5:08 pm

A bit more Linux-fu: most current versions of xargs take the -P option to run multiple processes in parallel.

You can even tweak the number of jobs up and down while it is running (by sending signals to the xargs pid with kill).

Check the man or info pages

https://www.gnu.org/software/findutils/manual/html_node/find_html/Controlling-Parallelism.html

