The poor man's GNU parallel
Intro
It's not every day that I get to face an interesting shell-scripting problem. Today was such a day at work, and I had the opportunity to improve my bash scripting skills a bit more.
I need to launch a task on a set of remote machines, but for the task to complete, an input dataset must be available on each remote machine.
The dataset is a multi-gigabyte file, so while it's certainly feasible to get it onto every remote host sequentially, one host after the other, one can get a significant speed-up by running the downloads in parallel.
Why not GNU Parallel?
On a happy day, on a machine I (personally) own, I would use the fine GNU parallel. However, since I may have to share the script with other people in the future (and my employer isn't 100% fond of GPLv3 software), I looked into re-implementing the very basics of GNU parallel: launching n tasks (as sub-processes) and waiting for their completion (a rendezvous) before continuing.
In order to do this I used the usual execute-in-background job control feature (basically appending the good old & at the end of a command) and the wait bash built-in command.
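As a minimal sketch of that launch-then-rendezvous idea (the task function and the number of tasks are hypothetical stand-ins, not the real download job), something like this should do:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real long-running task.
task() {
    sleep 1
    echo "task $1 finished"
}

start=$SECONDS
for i in 1 2 3; do
    task "$i" &      # trailing & runs the task in the background
done
wait                 # rendezvous: block until every background child has exited
elapsed=$(( SECONDS - start ))
echo "all tasks done in about ${elapsed}s"
```

Since the three tasks run concurrently, the whole thing should take roughly as long as one task rather than three.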
However the devil lies in the details, as usual...
Interesting findings
I won't bore you with the trial and error, but I will summarize the interesting finding here: by default bash runs every command of a pipeline in a subshell, so a while loop that is fed by a pipe runs in a subshell too.
This is usually a useless piece of trivia, until you start depending on parent-child relationships between processes.
If the code spawning children runs in a subshell, then the spawned children do not belong to the shell you think owns them, and you will get error messages like this when wait-ing on them:
[Thu Jul 4 15:04:22 UTC 2024] Waiting for downloads to finish...
./distribute-snapshot.bash: line 55: wait: pid 26604 is not a child of this shell
./distribute-snapshot.bash: line 55: wait: pid 26610 is not a child of this shell
./distribute-snapshot.bash: line 55: wait: pid 26616 is not a child of this shell
./distribute-snapshot.bash: line 55: wait: pid 26622 is not a child of this shell
./distribute-snapshot.bash: line 55: wait: pid 26628 is not a child of this shell
[Thu Jul 4 15:04:22 UTC 2024] Done
This is essentially the result of me writing shell code that looks like this:
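cat "${WORKERS_FILE}" | while IFS=$'\n' read worker ; do   # the pipe puts this loop in a subshell
    echo "Downloading snapshot on host $worker"
    ssh "$worker" "sleep 60" & # this is the long-running thing
    jobs -p | xargs echo > "$_TMPFILE"   # these PIDs are children of the subshell
done
wait $(cat "$_TMPFILE")   # fails: the PIDs are not children of this shell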
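The subshell effect is easy to see in isolation. The following sketch (with made-up input lines, and assuming bash's default settings, i.e. no shopt -s lastpipe) increments a counter inside a piped loop and then inspects it from the main shell:

```shell
#!/usr/bin/env bash
echo "main shell pid: $BASHPID"
count=0
printf 'a\nb\nc\n' | while read -r line; do
    count=$(( count + 1 ))    # modifies the subshell's copy only
    echo "loop iteration $count runs in pid $BASHPID"
done
echo "count seen by the main shell: $count"
```

The loop reports a different pid than the main shell, and the main shell's count is still 0 afterwards. Background jobs started inside such a loop are lost to the parent in the same way.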
What to do instead?
The problem lies in the cat ... | while read something pattern. Such a lovely pattern I've used so many times...
Basically we need to do something equivalent without the pipe.
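For reference, the simplest pipe-free form I'm aware of is redirecting the file into the loop's standard input. This is only a sketch: the temporary workers file and the sleep command are stand-ins for WORKERS_FILE and the real ssh call.

```shell
#!/usr/bin/env bash
workers_file=$(mktemp)          # hypothetical stand-in for WORKERS_FILE
printf 'worker1\nworker2\nworker3\n' > "$workers_file"

pids=""
while IFS= read -r worker; do
    sleep 1 &                   # stand-in for the real ssh command
    pids="$pids $!"
done < "$workers_file"          # redirection, not a pipe: no subshell

wait $pids                      # works: the jobs are children of this shell
wait_status=$?
rm -f "$workers_file"
echo "wait exit status: $wait_status"
```

One caveat with this form: every command inside the loop inherits the file as its stdin, so a foreground command that reads stdin (ssh does, unless told otherwise) could swallow the remaining lines. Reading from a dedicated file descriptor sidesteps that.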
Reading the documentation of the read builtin, it seems you can ask it to read from a file descriptor rather than stdin (the -u option). Now, opening files and binding them to file descriptors is something I read about many years ago in the Advanced Bash Scripting Guide but never had a chance to use. I guess this is my lucky day!
I don't know how to get the next available file descriptor, so I'll use an arbitrary number (picked while writing the script, not at runtime).
So this seems to be working very well:
BGJOBS=""
exec 555<"${WORKERS_FILE}" # open the workers file for reading on fd 555; don't change this...
while IFS=$'\n' read -u 555 worker ; do
    cyan "Downloading snapshot on host $worker"
    ssh "$worker" "sleep 60" &
    BGJOBS="$(jobs -p | xargs echo)"   # PIDs of all background jobs started so far
done
exec 555<&-   # close the descriptor once the loop is done
echo "[$(date)] Waiting for downloads to finish..."
wait $BGJOBS
echo "[$(date)] Done"
Basically we open (and keep open) WORKERS_FILE as file descriptor 555 and then we tell read to read from it. Since there are no subshells involved, I don't have to write the list of children to a temporary file, and can use a regular variable to keep the list of children to wait on.
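On the question of picking the descriptor number: as far as I can tell, bash 4.1 and later can allocate a free descriptor on its own with the {varname} redirection syntax. A sketch, assuming a recent enough bash (the workers_fd variable name and the temporary file are made up for the example):

```shell
#!/usr/bin/env bash
workers_file=$(mktemp)               # hypothetical stand-in for WORKERS_FILE
printf 'worker1\nworker2\n' > "$workers_file"

exec {workers_fd}< "$workers_file"   # bash picks a free descriptor and stores it in workers_fd
seen=""
while IFS= read -r -u "$workers_fd" worker; do
    seen="$seen $worker"
done
exec {workers_fd}<&-                 # close the descriptor when done
rm -f "$workers_file"
echo "read:$seen"
```

This won't work on older shells (the bash 3.2 shipped with macOS, for instance), in which case a hard-coded number is still the way to go.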
In this script every background job was something I wanted to wait on, so the output of jobs -p was good enough. Otherwise I could have used the $! variable to save the ID of the just-launched process:
BGJOBS=""
exec 555<"${WORKERS_FILE}" # open the workers file for reading on fd 555; don't change this...
while IFS=$'\n' read -u 555 worker ; do
    cyan "Downloading snapshot on host $worker"
    ssh "$worker" "sleep 60" &
    BGJOBS="$BGJOBS $!" # difference HERE: append only the PID of the job just launched
done
exec 555<&-   # close the descriptor once the loop is done
echo "[$(date)] Waiting for downloads to finish..."
wait $BGJOBS
echo "[$(date)] Done"
Both solutions work well enough without noticeable differences:
[ esantoro@workbench: ~/load-tester/workers ] $ ./distribute-snapshot.bash five-workers-f
Workers file: five-workers-f
Correct? [yN]
y
Downloading snapshot on host worker1
Downloading snapshot on host worker2
Downloading snapshot on host worker3
Downloading snapshot on host worker4
Downloading snapshot on host worker5
[Thu Jul 4 15:23:30 UTC 2024] Waiting for downloads to finish...
[Thu Jul 4 15:24:30 UTC 2024] Done
[ esantoro@workbench: ~/load-tester/workers ] $
[ esantoro@workbench: ~/load-tester/workers ] $ ./distribute-snapshot.bash five-workers-f
Workers file: five-workers-f
Correct? [yN]
y
Downloading snapshot on host worker1
Downloading snapshot on host worker2
Downloading snapshot on host worker3
Downloading snapshot on host worker4
Downloading snapshot on host worker5
[Thu Jul 4 15:28:05 UTC 2024] Waiting for downloads to finish...
[Thu Jul 4 15:29:05 UTC 2024] Done
[ esantoro@workbench: ~/load-tester/workers ] $
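One thing the $! variant would make possible, though my script doesn't need it, is waiting on each process individually to find out which ones failed, since wait with a single PID returns that process's exit status. A rough sketch, with subshells and made-up exit codes standing in for the real ssh commands:

```shell
#!/usr/bin/env bash
pids=()
( sleep 1; exit 0 ) & pids+=("$!")   # simulated successful job
( sleep 1; exit 3 ) & pids+=("$!")   # simulated failing job

failed=0
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then           # wait returns the exit status of that PID
        echo "job $pid failed" >&2
        failed=$(( failed + 1 ))
    fi
done
echo "$failed job(s) failed"
```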