While I still write a decent amount of shell, I generally try to avoid it. It's hard for others to read, has a lot of sharp edges, tends to swallow errors, and handles unusual situations poorly. But one thing that keeps me coming back to it is how easily I can set up trees of processes.
Say I have a program that reads two files together in a single pass [1] and writes something out. The inputs are compressed, so they'll need to be decompressed, and the output needs to be compressed before you write it out to storage. You could do:
# download the files
aws s3 cp "$path1" .
aws s3 cp "$path2" .

# decompress the files
gunzip "$file1"
gunzip "$file2"

# run the command
cmd -1 "$file1" -2 "$file2" > "$fileOut"

# compress the output
gzip "$fileOut"

# upload the output
aws s3 cp "$fileOut.gz" "$pathOut"
This works, but for large files it's slow and needs too much space. We're waiting for each step to finish before starting the next, and we're storing some very large intermediate files on the local machine.
Instead, we'd like to stream the inputs down, decompress them as we go, compress the output as it comes out, and stream the output back up. In bash this is reasonably straightforward to write:
cmd -1 <(aws s3 cp "$path1" - | gunzip) \
    -2 <(aws s3 cp "$path2" - | gunzip) \
    | gzip | aws s3 cp - "$pathOut"
This uses almost no disk space and it parallelizes the decompression, command, and recompression. But it's also shell...
I tend to use python for this kind of thing, where I'm gluing things together and want it to be clear what I'm doing. It seems like it should be possible to do this sort of thing with the subprocess module, but while I've played with it a few times I haven't figured it out. I'd like an API like:
pipeline = subprocess.Pipeline()
dl1 = pipeline.process(
    ["aws", "s3", "cp", path1, "-"])
gunzip1 = pipeline.process(
    ["gunzip"], stdin=dl1.stdout)
dl2 = pipeline.process(
    ["aws", "s3", "cp", path2, "-"])
gunzip2 = pipeline.process(
    ["gunzip"], stdin=dl2.stdout)
cmd = pipeline.process(
    ["cmd", "-1", gunzip1.stdout, "-2", gunzip2.stdout])
gzip = pipeline.process(
    ["gzip"], stdin=cmd.stdout)
pipeline.process(
    ["aws", "s3", "cp", "-", pathOut],
    stdin=gzip.stdout)
pipeline.check_call()
Or:
from subprocess import check_call, PIPE, InputFile
check_call([
    "cmd",
    "-1", InputFile([
        "aws", "s3", "cp", path1, "-",
        PIPE, "gunzip"]),
    "-2", InputFile([
        "aws", "s3", "cp", path2, "-",
        PIPE, "gunzip"]),
    PIPE, "gzip",
    PIPE, "aws", "s3", "cp", "-", pathOut])
These are 5x and 3x the length of the bash version, but I'd be willing to put up with that for having something that's more robust. The difference would also be smaller in practice as the commands would typically have a lot of arguments.
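For comparison, here is a rough, untested sketch of the same process tree with the stock subprocess module. It assumes Linux or macOS, since it hands cmd the decompressed streams as /dev/fd paths (the same trick bash's <(...) uses), and it relies on pass_fds to keep those pipes open in the child:

import os
import subprocess

# Download and decompress each input, keeping the decompressed
# stream as a pipe we can hand to cmd.
dl1 = subprocess.Popen(["aws", "s3", "cp", path1, "-"], stdout=subprocess.PIPE)
gunzip1 = subprocess.Popen(["gunzip"], stdin=dl1.stdout, stdout=subprocess.PIPE)
dl1.stdout.close()

dl2 = subprocess.Popen(["aws", "s3", "cp", path2, "-"], stdout=subprocess.PIPE)
gunzip2 = subprocess.Popen(["gunzip"], stdin=dl2.stdout, stdout=subprocess.PIPE)
dl2.stdout.close()

# Pass the pipes to cmd as /dev/fd paths; pass_fds keeps them open
# (at the same descriptor numbers) in the child.
fd1 = gunzip1.stdout.fileno()
fd2 = gunzip2.stdout.fileno()
cmd = subprocess.Popen(
    ["cmd", "-1", "/dev/fd/%d" % fd1, "-2", "/dev/fd/%d" % fd2],
    stdout=subprocess.PIPE, pass_fds=(fd1, fd2))
gunzip1.stdout.close()
gunzip2.stdout.close()

# Compress the output and stream it back up.
gz = subprocess.Popen(["gzip"], stdin=cmd.stdout, stdout=subprocess.PIPE)
cmd.stdout.close()
upload = subprocess.Popen(["aws", "s3", "cp", "-", pathOut], stdin=gz.stdout)
gz.stdout.close()

for p in [dl1, gunzip1, dl2, gunzip2, cmd, gz, upload]:
    if p.wait() != 0:
        raise subprocess.CalledProcessError(p.returncode, p.args)

Every pipe end has to be wired up and then closed by hand; leave a write end open in the parent and the downstream process never sees EOF. That's exactly the bookkeeping a Pipeline-style API would hide.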
I see these stack overflow answers suggesting named pipes, but they seem awkward, hard to read, and easy to get wrong. Is there a better way? Should I just stick with bash when doing something bash is this good a fit for, especially now that people can paste my code into an LLM and get an explanation of what it's doing?
[1] Interleaved fastq files, where the Nth record in file 1 corresponds to the Nth record in file 2.
Yes, so, just like in my example, the first round of conversation is nearly useless. Let's see if the second iteration is any better (it does turn to named pipes):
ME (without triple quotes, but here I am including them for better formatting):
Unfortunately, this is concatenating the two downloads and providing them to cmd on its stdin.
What we'd like to happen instead is to process those two downloads by cmd in parallel.
GPT-4:
To process two downloads in parallel and provide them as separate inputs to a command using Python's subprocess module, you can employ named pipes (FIFOs). Named pipes can be created using the os.mkfifo method, which allows two independent data streams for the two downloads.

Here's a revised version of the script that uses named pipes for parallel processing:
This script creates two named pipes and starts two threads to handle the downloads and the decompression. The outputs of these processes are directed to the named pipes. The main command reads from these pipes. The use of threads ensures that the reading and writing to the command's stdin do not block each other.
This approach is more complex than the initial script but allows the parallel processing of the two data streams as required. As before, error handling and cleanup are simplified for demonstration purposes and should be robustly implemented for production use.
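The generated script itself isn't reproduced above, but a rough, untested sketch of the pattern GPT-4 describes (two FIFOs created with os.mkfifo, each fed by its own thread, with cmd reading from both by path) could look like this, reusing the command names from the original post:

import os
import subprocess
import tempfile
import threading

def download_and_decompress(s3_path, fifo_path):
    # Stream the object down, gunzip it, and write the result into the FIFO.
    # open() blocks here until cmd opens the FIFO for reading.
    with open(fifo_path, "wb") as out:
        dl = subprocess.Popen(["aws", "s3", "cp", s3_path, "-"],
                              stdout=subprocess.PIPE)
        subprocess.check_call(["gunzip"], stdin=dl.stdout, stdout=out)
        dl.stdout.close()
        if dl.wait() != 0:
            raise subprocess.CalledProcessError(dl.returncode, dl.args)

with tempfile.TemporaryDirectory() as tmp:
    fifo1 = os.path.join(tmp, "in1")
    fifo2 = os.path.join(tmp, "in2")
    os.mkfifo(fifo1)
    os.mkfifo(fifo2)

    # One thread per input so the two downloads run in parallel.
    threads = [
        threading.Thread(target=download_and_decompress, args=(path1, fifo1)),
        threading.Thread(target=download_and_decompress, args=(path2, fifo2)),
    ]
    for t in threads:
        t.start()

    # cmd reads the FIFOs by path; its output is gzipped and streamed back up.
    cmd = subprocess.Popen(["cmd", "-1", fifo1, "-2", fifo2],
                           stdout=subprocess.PIPE)
    gz = subprocess.Popen(["gzip"], stdin=cmd.stdout, stdout=subprocess.PIPE)
    cmd.stdout.close()
    subprocess.check_call(["aws", "s3", "cp", "-", pathOut], stdin=gz.stdout)
    gz.stdout.close()

    for t in threads:
        t.join()
    for p in (cmd, gz):
        if p.wait() != 0:
            raise subprocess.CalledProcessError(p.returncode, p.args)

Note that opening a FIFO blocks until the other end is opened, so if cmd fails to start or never opens one of the pipes, the feeding thread hangs; that's part of what makes the named-pipe approach easy to get wrong.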