Unix Pipes in Bioinformatics: How Streaming Data Reduces Memory and Storage
Unix pipes (|) are one of the most powerful yet underutilized features in bioinformatics. They allow you to chain multiple commands together, processing data in a streaming fashion that dramatically reduces memory usage and disk I/O. This post explores why pipes are essential for bioinformatics work and shows how they work under the hood.
The Problem: Data Explosion in Bioinformatics
Modern sequencing generates massive datasets. A single human genome sequencing run can produce:
- Raw reads: 100+ GB of FASTQ files
- Alignments: 50-100 GB of BAM files
- Variants: 1-5 GB of VCF files
Without pipes, traditional bioinformatics workflows create intermediate files at each step:
# Traditional approach (❌ Wasteful)
bwa mem reference.fa reads.fastq > aligned.sam # 200 GB intermediate
samtools view -b aligned.sam > aligned.bam # 100 GB intermediate
samtools sort aligned.bam > sorted.bam # 100 GB intermediate
samtools index sorted.bam # Creates the sorted.bam.bai index
# Total disk usage: 400 GB for processing 100 GB of raw data!
The costs:
- Storage: roughly 4x the original data size in extra files before cleanup
- Time: writing and re-reading intermediate files is slow
- Failure handling: a step that dies midway leaves partial intermediates to track down and remove before rerunning
- Complexity: managing and cleaning up intermediate files is tedious
The Solution: Unix Pipes for Streaming Data
Pipes (|) connect the output of one command directly to the input of the next, processing data in memory as it flows through:
# Pipe approach (✓ Efficient)
bwa mem reference.fa reads.fastq | \
samtools view -b - | \
samtools sort -o sorted.bam -
samtools index sorted.bam
# Total disk usage: ~100 GB of new files (the final sorted BAM only)
# No intermediate files!
The benefits:
- Memory efficient: Data flows through without full copies in RAM
- Fast: No disk I/O for intermediates
- Fail fast: with set -o pipefail, a failure anywhere stops the whole pipeline immediately
- Simple: Clean, readable pipeline syntax
- Storage efficient: Only keep final outputs
How Pipes Work Under the Hood
Understanding how pipes work is key to writing efficient bioinformatics pipelines. Let's explore the mechanics.
1. File Descriptors in Unix
Every Unix process has three standard file descriptors:
| Descriptor | Name | Purpose | Default Target |
|---|---|---|---|
| 0 | stdin | Standard input | Keyboard |
| 1 | stdout | Standard output | Terminal/Screen |
| 2 | stderr | Standard error | Terminal/Screen |
By default, processes read from stdin (file descriptor 0) and write to stdout (file descriptor 1).
# Example: cat command
cat myfile.txt # Reads file, writes to stdout (terminal)
cat < myfile.txt # Explicitly redirect stdin from file
cat myfile.txt 1> output.txt # Redirect stdout to file
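The table above also lists stderr (fd 2), which is what keeps log messages out of your data stream. A minimal sketch of the common redirections (the file names are just placeholders):
# Example: separating data from logs
bwa mem ref.fa reads.fastq > aligned.sam 2> bwa.log   # fd 1 → SAM file, fd 2 → log file
samtools flagstat aligned.bam 2>&1 | less             # merge stderr into stdout for paging
samtools view -b aligned.sam 2> /dev/null > out.bam   # discard stderr entirely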
2. Creating a Pipe with the Pipe System Call
When you type command1 | command2, the shell:
- Creates a pipe: an unnamed, in-memory kernel buffer that connects two processes
- Forks process 1: a child process that will run command1
- Connects stdout of process 1 (file descriptor 1) to the pipe's write end
- Forks process 2: a child process that will run command2
- Connects stdin of process 2 (file descriptor 0) to the pipe's read end
- Executes both commands: they run in parallel
Diagram of a pipe:
[command1] [command2]
| |
stdout (fd 1) stdin (fd 0)
| |
v ^
[================================]
Pipe (kernel buffer)
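A quick way to convince yourself that the two sides really are separate, concurrently running processes is to print $BASHPID on each side (bash-specific; $BASHPID reports the PID of the current subshell):
echo "parent shell PID: $BASHPID"
# Each side of the | runs in its own subshell, so the three PIDs all differ
{ echo "writer PID: $BASHPID"; } | { cat; echo "reader PID: $BASHPID"; }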
3. The Kernel Manages Data Flow
The Unix kernel manages the pipe as a FIFO (First In, First Out) buffer:
- Write side: command1 writes data to the pipe's write end
- Pipe buffer: data sits in kernel memory (typically 64KB-1MB per pipe)
- Read side: command2 reads data from the pipe's read end
Key behaviors:
- If pipe is full: The writing process blocks until space is available
- If pipe is empty: The reading process blocks until data arrives
- If reader closes: Writer gets a SIGPIPE signal (broken pipe error)
- If writer closes: Reader gets EOF and can finish processing
This synchronization happens automatically—you don't need to manage it.
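You can see the broken-pipe rule with two tiny commands (bash-specific, since PIPESTATUS is a bash array):
# `yes` would write forever, but `head` exits after one line and closes the read end,
# so `yes` is killed by SIGPIPE and the pipeline finishes immediately
yes | head -n 1
echo "exit statuses: ${PIPESTATUS[@]}"   # typically "141 0" (141 = 128 + 13, i.e. SIGPIPE)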
4. Example: Tracing a Real Pipe
Let's trace what happens when you run:
cat genome.fasta | grep ">chr1" | wc -l
Step 1: Shell creates the pipeline
User types: cat genome.fasta | grep ">chr1" | wc -l
↓
Shell creates:
- Pipe A (between cat and grep)
- Pipe B (between grep and wc)
Step 2: Processes fork and file descriptors redirect
Process: cat genome.fasta
stdout (fd 1) → Pipe A write end
Process: grep ">chr1"
stdin (fd 0) ← Pipe A read end
stdout (fd 1) → Pipe B write end
Process: wc -l
stdin (fd 0) ← Pipe B read end
stdout (fd 1) → Terminal
Step 3: Execution flows
[cat] → Pipe A buffer → [grep] → Pipe B buffer → [wc] → [count to terminal]
Step 4: Back pressure synchronization
If grep is slow:
- Pipe A fills up
- cat blocks (can't write)
- System automatically waits for grep to catch up
If wc is slow:
- Pipe B fills up
- grep blocks (can't write)
- System waits for wc to catch up
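You can make this backpressure visible by deliberately throttling one stage with pv (a small sketch; pv -L caps throughput and genome.fasta is just a placeholder):
# Cap the middle stage at 1 MB/s; cat blocks whenever the pipe buffer ahead of pv
# is full, so it never runs ahead of the slow stage or fills memory
cat genome.fasta | pv -L 1m | grep ">chr1" | wc -l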
5. Memory Efficiency: Why Pipes Don't Load Everything into RAM
Without pipes (writing to disk):
command1: Read all data → Write 100GB to disk
↓
Disk (100GB) ← Slow, uses storage
↓
command2: Read 100GB from disk → Process → Write results
With pipes (streaming):
command1: Read chunk → Write chunk to pipe buffer (64KB)
↓
Pipe buffer (kernel memory, reused)
↓
command2: Read chunk → Process → Write chunk to next pipe
Only a small amount of data (one buffer, ~64KB) sits in memory at any time. The buffer is reused as data flows through, so memory usage stays constant regardless of total data size.
This is the magic of pipes: Constant memory usage, not linear in data size.
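You can check the constant-memory claim directly (a sketch assuming GNU coreutils and GNU time, i.e. /usr/bin/time -v on Linux):
# Stream 5 GB through gzip; "Maximum resident set size" in the report stays in the
# low-MB range even though far more data than that flows through the pipe
head -c 5G /dev/zero | /usr/bin/time -v gzip -1 > /dev/null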
Practical Example: Processing a Large FASTQ File
Let's apply pipes to a real bioinformatics workflow.
Without Pipes (Wasteful)
# Step 1: Filter low-quality reads
fastqc reads.fastq --outdir=qc_before/
fastq_quality_filter -i reads.fastq -o reads_filtered.fastq -q 20 -p 80
# File created: reads_filtered.fastq (70 GB)
# Step 2: Count remaining reads
wc -l reads_filtered.fastq
# Temporary file: 70 GB on disk
# Step 3: Get sequence length distribution
awk 'NR%4==2 {print length}' reads_filtered.fastq | \
sort -n | uniq -c > length_dist.txt
# Processing reads_filtered.fastq again
# Cleanup
rm reads_filtered.fastq
# Total I/O: Read source (100GB) + Write filtered (70GB) + Read filtered (70GB) = 240GB
# Total disk: 170 GB (100 source + 70 intermediate)
With Pipes (Efficient)
# Single streaming pipeline: Filter → Count → Length distribution
fastq_quality_filter -i reads.fastq -q 20 -p 80 | \
tee >(wc -l > read_count.txt) | \
awk 'NR%4==2 {print length}' | \
sort -n | uniq -c > length_dist.txt
# Total I/O: Read source (100GB) = 100GB
# Total disk: 100 GB (source + final outputs only)
# Memory: modest and constant (pipe buffers plus sort's working memory)
Comparison:
| Metric | Without Pipes | With Pipes |
|---|---|---|
| Total I/O | 240 GB | 100 GB |
| Disk space | 170 GB | 100 GB |
| Processing time | Slow (multiple reads from disk) | Fast (one sequential read) |
| Memory usage | Low (each tool streams) | Low (each tool streams) |
| Failure recovery | Can resume from saved intermediates | Must rerun the whole pipeline |
Advanced Pipe Patterns in Bioinformatics
1. Parallel Processing with GNU Parallel
Process multiple files simultaneously while piping:
# Process 1000 FASTQ files in parallel; each job's gzipped output is appended in turn
find . -name "*.fastq" | \
  parallel 'fastq_quality_filter -q 20 -p 80 -i {} | gzip' > all_filtered.fastq.gz
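Note that --pipe is a different GNU parallel mode: instead of one job per file, it splits a single large stream into blocks and feeds each block to a worker. A hedged sketch of that mode (the -L 4 record size keeps each 4-line FASTQ record intact):
# Compress one huge FASTQ stream with several gzip workers; parallel groups each
# worker's output, so the result is a valid multi-member gzip file
cat reads.fastq | \
  parallel --pipe -L 4 --block 100M gzip > reads.fastq.gz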
2. Tee: Branching a Pipeline
Use tee to send data to multiple streams:
# Simultaneously:
# 1. Count reads
# 2. Filter and output
# 3. Generate statistics
samtools view -h input.bam | \
tee >(samtools flagstat /dev/stdin > flagstat.txt) | \
samtools view -b -F 4 | \
samtools sort -o sorted_aligned.bam -
# The >(command) syntax is "process substitution"
# It creates a named pipe to a subprocess
3. Named Pipes (mkfifo) for Complex Workflows
For workflows requiring multiple inputs/outputs:
# Create named pipes
mkfifo pipe1 pipe2
# Process A writes to pipe1
cat input.txt > pipe1 &
# Process B reads from pipe1, writes to pipe2
sort < pipe1 > pipe2 &
# Main process reads final result
uniq < pipe2
# Cleanup
rm pipe1 pipe2
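A more bioinformatics-flavoured use of a named pipe is feeding a stream to a tool that insists on a filename argument rather than stdin (a sketch reusing the quality-filter command from earlier; the "reader" here is just wc -l):
mkfifo filtered.fastq
# Writer runs in the background, streaming filtered reads into the named pipe
fastq_quality_filter -i reads.fastq -q 20 -p 80 > filtered.fastq &
# Reader opens the pipe as if it were an ordinary file; no 70GB intermediate is written
wc -l filtered.fastq
rm filtered.fastq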
4. Buffering and Backpressure Management
Sometimes a slow downstream command creates a bottleneck. Use buffer to add extra memory:
# Without buffer: cat blocks if samtools is slow
cat large.bam | samtools view -c
# With buffer: Extra memory absorbs the blocking
cat large.bam | buffer -m 500M | samtools view -c
When to use buffer:
- Downstream process is much slower than upstream
- You have spare RAM and want to minimize blocking
- Upstream data is expensive to regenerate
5. Process Substitution for Multiple Outputs
Fan out to multiple processes from a single input:
# Single BAM file → Multiple statistics simultaneously
samtools view input.bam | \
tee >(wc -l > read_count.txt) \
>(awk '{print $3}' | sort | uniq -c > chromosome_dist.txt) \
>(awk '{print $4}' | sort -n > position_stats.txt) \
> /dev/null
# All three statistics generated from a single read of input.bam
Common Pipe Gotchas and Solutions
1. Buffering Issues with Pipes
Problem: Commands that buffer their output make downstream results appear late or all at once
# Output from python may arrive in large chunks, or only when the script finishes
python long_script.py | tee output.log
Solution: Disable or reduce buffering
# Python: use -u (or PYTHONUNBUFFERED=1); stdbuf does not affect Python's own buffering
python -u long_script.py | tee output.log
# For C-stdio tools such as grep, sed, and awk, stdbuf -oL forces line-buffered output
tail -f pipeline.log | stdbuf -oL grep "ERROR" | tee errors.log
2. Error Handling in Pipes
Problem: By default, a pipeline's exit status is that of its last command only
# If cat fails here (e.g., input.bam is missing), the pipeline still exits 0,
# because only samtools' status is reported
cat input.bam | samtools view -b - > output.bam 2>/dev/null
Solution: Set pipefail
# If ANY command fails, the entire pipeline fails
set -o pipefail
cat input.bam | samtools view -b - > output.bam
3. Monitoring Pipeline Progress
Problem: Pipes process data silently—hard to see progress
# No feedback on progress
cat input.fastq | process_cmd | sort > output.txt
Solution: Use pv (pipe viewer) to visualize throughput
# Shows progress, speed, and ETA
cat input.fastq | pv | process_cmd | sort > output.txt
# Or with file size estimation
pv -N "Reading FASTQ" < input.fastq | process_cmd | \
pv -N "Sorting output" | sort > output.txt
4. Debugging Pipe Failures
Problem: Which command in the pipeline failed?
# Unclear where failure occurred
cmd1 | cmd2 | cmd3 | cmd4   # exits non-zero, but which stage failed?
Solution: Use intermediate tee files for debugging
# Save intermediate outputs while piping
cmd1 | tee /tmp/debug1.txt | \
cmd2 | tee /tmp/debug2.txt | \
cmd3 | tee /tmp/debug3.txt | \
cmd4
# Later, inspect intermediates
cat /tmp/debug1.txt | head
cat /tmp/debug2.txt | head
cat /tmp/debug3.txt | head
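In bash you can also pinpoint the failing stage without any temporary files, using the PIPESTATUS array (cmd1 to cmd4 are the same placeholders as above):
cmd1 | cmd2 | cmd3 | cmd4
echo "exit codes: ${PIPESTATUS[@]}"   # e.g. "0 1 0 0" means cmd2 was the one that failed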
Using Pipes in Nextflow Modules
Nextflow pipelines are built from individual processes, and each process can use pipes internally to efficiently chain tools together. This is where pipes truly shine in production bioinformatics workflows.
Real Example: BWA_MEM Process from Sarek
The Sarek variant calling pipeline includes a BWA_MEM process that demonstrates best practices for using pipes in Nextflow. Let's examine how pipes reduce intermediate files:
process BWA_MEM {
tag "$meta.id"
label 'process_high'
conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/bf/bf7890f8d4e38a7586581cb7fa13401b7af1582f21d94eef969df4cea852b6da/data' :
'community.wave.seqera.io/library/bwa_htslib_samtools:56c9f8d5201889a4' }"
input:
tuple val(meta) , path(reads)
tuple val(meta2), path(index)
tuple val(meta3), path(fasta)
val sort_bam
output:
tuple val(meta), path("*.bam") , emit: bam, optional: true
tuple val(meta), path("*.cram") , emit: cram, optional: true
tuple val(meta), path("*.csi") , emit: csi, optional: true
tuple val(meta), path("*.crai") , emit: crai, optional: true
path "versions.yml" , emit: versions
when:
task.ext.when == null || task.ext.when
script:
def args = task.ext.args ?: ''
def args2 = task.ext.args2 ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def samtools_command = sort_bam ? 'sort' : 'view'
def extension = args2.contains("--output-fmt sam") ? "sam" :
args2.contains("--output-fmt cram") ? "cram":
sort_bam && args2.contains("-O cram")? "cram":
!sort_bam && args2.contains("-C") ? "cram":
"bam"
def reference = fasta && extension=="cram" ? "--reference ${fasta}" : ""
if (!fasta && extension=="cram") error "Fasta reference is required for CRAM output"
"""
INDEX=`find -L ./ -name "*.amb" | sed 's/\\.amb\$//'`
bwa mem \\
$args \\
-t $task.cpus \\
\$INDEX \\
$reads \\
| samtools $samtools_command $args2 ${reference} --threads $task.cpus -o ${prefix}.${extension} -
cat <<-END_VERSIONS > versions.yml
"${task.process}":
bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
stub:
def args2 = task.ext.args2 ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def extension = args2.contains("--output-fmt sam") ? "sam" :
args2.contains("--output-fmt cram") ? "cram":
sort_bam && args2.contains("-O cram")? "cram":
!sort_bam && args2.contains("-C") ? "cram":
"bam"
"""
touch ${prefix}.${extension}
touch ${prefix}.csi
touch ${prefix}.crai
cat <<-END_VERSIONS > versions.yml
"${task.process}":
bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
Understanding the Pipe in BWA_MEM
The key line in this process is the pipe connecting bwa and samtools:
bwa mem $args -t $task.cpus $INDEX $reads | samtools $samtools_command $args2 ${reference} --threads $task.cpus -o ${prefix}.${extension} -
Here's what happens:
- bwa mem aligns reads to the reference genome
  - Outputs SAM format to stdout
  - For a large genome this stream can be hundreds of GB
  - Without the pipe, it would be written to disk as an intermediate .sam file
- The pipe (|) connects the output directly to samtools
  - Data streams from bwa to samtools in memory
  - No intermediate SAM file is created
  - Memory usage stays constant regardless of dataset size
- samtools sort/view processes the SAM stream
  - Either sorts (if sort_bam is true) or just converts the format
  - Writes the final output directly as .bam, .cram, or .sam
  - Uses --threads $task.cpus to parallelize efficiently
Benefits in This Real Workflow
For a human whole-genome sequencing run (~100GB raw reads):
Without pipes (hypothetical):
bwa mem $INDEX $reads > aligned.sam # 500+ GB intermediate
samtools sort aligned.sam > sorted.bam # 100+ GB final
rm aligned.sam
- Storage needed: 600+ GB
- Disk I/O: Read 100GB (bwa) + Write 500GB (SAM) + Read 500GB (samtools) = 1.1TB of I/O
- Time: Significantly slower due to I/O contention
With pipes (Sarek approach):
bwa mem $INDEX $reads | samtools sort -o sorted.bam -
- Storage needed: 100 GB (final output only)
- Disk I/O: Read 100GB (bwa input) + Write 100GB (final BAM) = 200GB of I/O
- Time: 2-5x faster due to eliminated intermediate I/O
- Memory: Constant (~2-3GB) regardless of genome size
Flexibility Through Arguments
The Sarek process demonstrates production-grade flexibility:
def samtools_command = sort_bam ? 'sort' : 'view'
This single line allows switching between:
- Sorting in the pipeline (slower but sorted output)
- No sorting (faster, relies on downstream tools)
And the extension can be dynamically chosen:
def extension = args2.contains("--output-fmt cram") ? "cram" : "bam"
This means the same process can output BAM, CRAM, or SAM depending on configuration—all using the same efficient pipe pattern.
Error Handling in Nextflow Processes
Nextflow marks a task as failed whenever its script exits with a non-zero status, and it captures the error logs for you:
bwa mem ... | samtools sort ... # If this pipeline exits non-zero, Nextflow fails the task
For a pipe, though, the exit status is normally that of the last command only. To make a failure anywhere in the chain fail the task, enable pipefail in your configuration (nf-core pipelines such as Sarek set this by default):
process {
shell = ['/bin/bash', '-euo', 'pipefail'] // -o pipefail catches pipe errors
}
With this setting, if bwa fails partway through, the task fails instead of samtools silently succeeding on a truncated stream.
Pipes for Download, Upload, and Cloud Storage
One of the most practical uses of pipes in bioinformatics is combining download/upload tools with data processing, eliminating intermediate files entirely. This is especially valuable when working with cloud storage like AWS S3.
Example 1: Download → Decompress → Process (No Temp Files)
Scenario: You need to download a compressed reference genome from a public server, decompress it, and index it—all without storing the compressed file.
# Traditional approach (❌ Wasteful)
wget https://example.com/reference.fasta.gz # Downloads 5GB
gunzip reference.fasta.gz # Decompresses to 15GB
samtools faidx reference.fasta # Index the reference
rm reference.fasta # Cleanup takes time
# Total disk space needed: 20GB
# Pipe approach (✓ Efficient)
wget -q -O - https://example.com/reference.fasta.gz | \
  gunzip > reference.fasta && \
samtools faidx reference.fasta
# Total disk space needed: 15GB (final reference + index only)
# The 5GB compressed copy never touches disk, and there is nothing to clean up
What happens:
- wget -O - downloads to stdout instead of to a file on disk
- gunzip decompresses the stream on the fly, writing only the final reference.fasta
- samtools faidx then indexes the decompressed reference (faidx needs a real, seekable file, so it runs once the stream has been written)
- The compressed intermediate never exists locally
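A variation worth knowing (a sketch assuming htslib's bgzip is installed): recompress the stream with bgzip instead of leaving it uncompressed, since samtools faidx can index BGZF-compressed FASTA directly.
# Keep the reference compressed on disk (~5GB instead of 15GB) and still index it
wget -q -O - https://example.com/reference.fasta.gz | \
  gunzip | bgzip > reference.fasta.gz && \
samtools faidx reference.fasta.gz   # writes reference.fasta.gz.fai and reference.fasta.gz.gzi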
Example 2: Process → Compress → Upload to S3 (Single Pipeline)
Scenario: Align and sort sequencing data, then ship the compressed result straight to AWS S3 without staging it on local disk.
# Traditional approach (❌ Wasteful)
bwa mem reference.fa reads.fastq | samtools sort -o aligned.bam -
aws s3 cp aligned.bam s3://my-bucket/ # Upload the 100+ GB BAM
rm aligned.bam # Cleanup
# Total disk space: 100+ GB for a local copy that exists only to be uploaded
# Pipe approach (✓ Efficient)
bwa mem reference.fa reads.fastq | \
  samtools sort -O bam - | \
  aws s3 cp - s3://my-bucket/aligned.bam
# Total disk space: 0 GB of temporary files
# The sorted, BGZF-compressed BAM streams directly to S3
What happens:
- bwa mem and samtools sort chain together (as before)
- samtools sort writes compressed BAM to stdout (BAM is already BGZF-compressed, so no extra gzip step is needed; pipe text formats such as SAM or VCF through gzip or bgzip before uploading)
- aws s3 cp - uploads directly from stdin to S3
- The alignment never exists on local disk
Example 3: Download from S3 → Analyze (No Local Copy)
Scenario: Analyze BAM files stored in S3 without ever downloading them to local disk.
# Traditional approach (❌ Wasteful)
aws s3 cp s3://my-bucket/sample.bam . # Download 50GB
samtools flagstat sample.bam # Analyze
rm sample.bam # Cleanup
# Local disk: 50GB used just to compute a small text summary
# Pipe approach (✓ Efficient)
aws s3 cp s3://my-bucket/sample.bam - | \
  samtools flagstat -
# Local disk: 0 GB temporary files
# Stream analysis, no storage overhead
What happens:
- aws s3 cp s3://... - downloads from S3 to stdout
- samtools flagstat reads and analyzes the BAM directly from the pipe
- Only the small flagstat summary needs to be stored locally
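Many htslib-based tools can skip the aws CLI entirely: if your samtools build includes libcurl/S3 support, it accepts s3:// URLs directly, reading credentials from the usual AWS environment variables or ~/.aws/credentials. A hedged sketch, assuming such a build:
# htslib streams the object over HTTPS; nothing is written locally
samtools flagstat s3://my-bucket/sample.bam
# With the index (sample.bam.bai) also in the bucket, region queries work remotely too
samtools view s3://my-bucket/sample.bam chr1:1-100000 | head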
Example 4: Download Tarball → Extract → Index (No Intermediate Files)
Scenario: Download a compressed archive of reference sequences, extract, and index—all in one pipeline.
# Traditional approach (❌ Wasteful)
wget https://example.com/genomes.tar.gz # 10GB download
tar -xzf genomes.tar.gz # Extracts to 50GB
for ref in genomes/*.fasta; do
samtools faidx "$ref"
done
rm -rf genomes.tar.gz genomes/ # Cleanup
# Disk space: 60GB temporary files
# Pipe approach (✓ Efficient)
wget -q -O - https://example.com/genomes.tar.gz | \
  tar -xzf -   # Stream-extract; the 10GB tarball itself never touches disk
for ref in genomes/*.fasta; do
  samtools faidx "$ref"
done
# Disk space: 50GB of extracted references plus their small .fai index files
# (no 10GB compressed copy; with enough RAM, extract to a RAM disk instead: tar -xzf - -C /dev/shm)
Better approach when you only need one file from the archive:
# Extract a single member to stdout; tar still streams through the whole compressed
# archive, but only the requested file is written to disk
wget -q -O - https://example.com/genomes.tar.gz | \
  tar -xzf - genomes/reference.fasta -O > reference.fasta && \
samtools faidx reference.fasta
Example 5: Process Multiple Files from S3 with GNU Parallel
Scenario: Process many files in S3 in parallel using pipes and parallel processing.
# List all BAM files in S3 and process up to 8 of them at a time
aws s3 ls s3://my-bucket/bams/ --recursive | awk '{print $4}' | \
  parallel -j 8 \
    'aws s3 cp "s3://my-bucket/{}" - | \
     samtools view -b -F 4 - | \
     samtools sort -o {/.}.sorted.bam -'
# What happens:
# 1. List all S3 objects
# 2. Process up to N files in parallel
# 3. Each file is downloaded and piped directly to samtools
# 4. Results written back to disk (or piped to another command)
Example 6: Branching Pipes: Upload Results as They're Generated
Scenario: Generate analysis results and upload to S3 incrementally (useful for long-running processes).
# Real-time upload of VCF variants as they're discovered
bcftools mpileup -f reference.fa sample.bam | \
bcftools call -m | \
tee >(gzip | aws s3 cp - s3://my-bucket/variants.vcf.gz) | \
grep -v "^#" | \
wc -l
# What happens:
# 1. bcftools generates variants
# 2. tee splits the stream into two paths:
# - First path: gzip and upload to S3 in real-time
# - Second path: count total variants
# 3. Both operations happen simultaneously from a single bcftools stream
# 4. The upload runs concurrently with the analysis, so there is no separate upload
#    step at the end (the S3 object itself becomes visible once the upload completes)
Real-world use case: Monitoring long-running analyses without waiting for completion:
# Start a long variant-calling run with real-time uploading
bcftools mpileup -f ref.fa *.bam | \
  bcftools call -m -v 2>/tmp/variants.log | \
  tee >(gzip | aws s3 cp - s3://bucket/live-variants.vcf.gz) | \
  stdbuf -oL grep "PASS" > /tmp/latest_variants.txt
# In another terminal, monitor progress:
watch -n 10 "wc -l /tmp/latest_variants.txt"
Tips for Reliable Download/Upload Pipes
| Scenario | Pipe Command | Notes |
|---|---|---|
| Download + decompress | `wget -O - URL \| gunzip \| process` | `-O -` sends the download to stdout |
| Upload + compress | `process \| gzip \| aws s3 cp - s3://...` | `-` tells `aws s3 cp` to read from stdin |
| S3 download + analyze | `aws s3 cp s3://... - \| process` | `-` tells `aws s3 cp` to write to stdout |
| Streaming tar extract | `wget -O - URL \| tar -xzf - -O member` | `-O`/`--to-stdout` extracts the member to stdout |
| Multiple S3 files | `aws s3 ls ... \| parallel '...'` | One download-and-process job per object |
| Real-time monitoring | `tee >(gzip \| aws s3 cp - s3://...)` | Simultaneous upload and local processing |
| Error handling | `set -o pipefail` | Make S3 failures fail the whole pipeline |
Critical Considerations for Cloud Pipes
1. Network vs. Disk Bottleneck
# If the network is slower than local processing:
# download once, keep a local copy, and process it as many times as needed
wget -O - https://example.com/file.tar.gz | tee local.tar.gz | tar -xzf - -O | process1
tar -xzf local.tar.gz -O | process2
tar -xzf local.tar.gz -O | process3
# Tradeoff: local storage vs. network bandwidth
2. Retry Logic for Failed S3 Operations
# Retry the whole download-and-process pipeline: a partially transferred
# stream cannot be resumed mid-pipe, so rerun it from the start on failure
set -o pipefail
for attempt in 1 2 3; do
  aws s3 cp s3://bucket/file.gz - | gunzip | process && break
  echo "Attempt $attempt failed, retrying..." >&2
  sleep $((2 ** attempt))
done
3. Monitoring Upload Progress
# Use pv to monitor upload speed
process_data | \
pv -br | \
gzip | \
aws s3 cp - s3://bucket/file.gz
# Output: [1.2GB/s] or similar
4. Handling Large Files with Multipart Upload
For files larger than a few GB, AWS S3 multipart uploads are more reliable:
# aws s3 cp performs multipart uploads automatically; when the source is stdin,
# --expected-size (in bytes) helps it choose a part size large enough for very big objects
process_data | \
  gzip | \
  aws s3 cp - s3://bucket/large-file.gz \
    --expected-size 107374182400 \
    --sse AES256 \
    --storage-class GLACIER
# --storage-class GLACIER is optional: cheaper archival storage for results you rarely read
Real-World Bioinformatics Pipes
Example 1: RNA-seq Quality Control Pipeline
# Process RNA-seq reads in one streaming pass:
# 1. Filter low-quality reads
# 2. Count the reads that survive
# 3. Extract the read-length distribution
# 4. Clip adapters, then collapse duplicate sequences
fastq_quality_filter -i reads.fastq -q 20 -p 80 | \
  tee >(wc -l | awk '{print "Valid reads: "$1/4}' > read_count.txt) | \
  tee >(awk 'NR%4==2 {print length}' | \
        sort -n | uniq -c > length_dist.txt) | \
  fastx_clipper -a AGATCGGAAGAGC | \
  fastx_collapser -o collapsed.fasta
Example 2: Variant Calling Pipeline
# Align, fix mate info, sort, and mark duplicates in one streaming pass,
# then call variants from the resulting BAM
bwa mem -t 8 reference.fa reads.fastq | \
  samtools fixmate -m - - | \
  samtools sort - | \
  samtools markdup - marked.bam && \
samtools index marked.bam && \
bcftools mpileup -f reference.fa marked.bam | \
  bcftools call -mv -o variants.vcf
# Memory: bounded (pipe buffers plus samtools sort's in-memory buffer)
# Disk: only marked.bam is written, and it is kept because the variant caller reads it by position
Example 3: FASTA Processing with Decompression
# Decompress, merge multi-line sequences onto single lines, and recompress in one pipeline
zcat genome.fasta.gz | \
  awk '/^>/{if(seq)print seq; print; seq=""; next} {seq=seq $0} END{if(seq)print seq}' | \
  gzip > processed.fasta.gz
# No intermediate uncompressed files
# The entire genome is processed with minimal disk space
Summary: Why Pipes Matter in Bioinformatics
| Problem | Pipe Solution |
|---|---|
| Intermediate files | Stream directly between tools |
| Disk space | No temporary storage needed |
| Memory usage | Constant, independent of data size |
| Processing speed | Single sequential read of data |
| Failure recovery | Failures caught immediately |
| Code readability | Clear, linear data flow |
Key Takeaways
1. Pipes Enable Streaming Data Processing
- Data flows through memory buffers, not disk
- Only ~64KB per pipe sits in RAM at any time
- Processing 1GB or 1TB uses the same constant memory
- This is the fundamental advantage over traditional file-based workflows
2. Kernel-Level Synchronization (Automatic)
- No manual management needed
- Backpressure prevents one slow command from overwhelming others
- If a command fails, the entire pipeline fails cleanly
- Use set -o pipefail in bash to ensure this behavior
3. I/O Reduction is Massive in Bioinformatics
- Traditional alignment: 100GB reads → 500GB SAM → 100GB BAM (1.1TB I/O)
- Pipe alignment: 100GB reads → 100GB BAM (200GB I/O total)
- Savings: 5.5x reduction in disk I/O
- Practical benefit: 2-5x faster execution on disk-bound operations
4. Pipes Work Seamlessly with Cloud Storage
- Download with wget -O - or aws s3 cp s3://... -
- Stream directly from S3 without local copies
- Upload results on-the-fly during processing
- Combine with tee for simultaneous processing and uploading
5. Production Workflows (Nextflow, Sarek) Use Pipes Extensively
- Real example: Sarek's BWA_MEM process pipes bwa mem | samtools sort
- This is the standard pattern for large-scale bioinformatics
- Nextflow adds reliability and error handling on top
- No need to choose between pipes and Nextflow—they work together
6. Practical Patterns You'll Use
| Pattern | Use Case | Command |
|---|---|---|
| Simple chain | Sequential processing | `cmd1 \| cmd2 \| cmd3` |
| Branching | Multiple outputs from one input | `cmd1 \| tee >(cmd2) \| cmd3` |
| Download + process | Remote files without local storage | `wget -O - url \| gunzip \| process` |
| Upload + compress | Direct to cloud storage | `process \| gzip \| aws s3 cp - s3://...` |
| Error safety | Catch failures in pipes | `set -o pipefail` |
| Parallel processing | Scale across cores/machines | `input \| parallel 'process'` |
7. When Pipes Are Most Valuable
✅ Use pipes for:
- Massive datasets (100GB+) where I/O is the bottleneck
- Cloud storage workflows where local disk is expensive
- Real-time monitoring of long-running analyses
- Linear processing chains (one output → next input)
- Development and rapid prototyping
- Nextflow process scripts (internal tool chaining)
❌ Avoid pipes for:
- Highly branching workflows (many different paths)
- Complex error recovery (need to restart from middle)
- Data that needs multiple passes through the same file
- When you need to monitor intermediate results extensively
8. Master These Tools for Production Workflows
# Essential for pipe mastery:
set -o pipefail          # Error handling: fail the pipeline if any stage fails
tee                      # Branching: send one stream to several consumers
pv                       # Progress monitoring
>(cmd)                   # Process substitution: stream into another command
aws s3 cp - s3://...     # Cloud integration via stdin/stdout
parallel                 # GNU parallel: scale across cores and machines
Final Thoughts
Unix pipes are a cornerstone of efficient bioinformatics. They're not just a convenient syntax—they're a fundamental architecture for processing data that would otherwise overwhelm available disk space and I/O capacity.
The real power becomes apparent at scale:
- Small datasets: Pipes save time and code complexity
- Large datasets: Pipes are often the only practical solution
- Cloud workflows: Pipes enable streaming to/from S3 without local copies
- Production pipelines: Pipes are embedded in every major workflow (Nextflow, Snakemake, etc.)
By understanding how pipes work under the hood—how the kernel manages file descriptors, buffers data, and synchronizes backpressure—you can write bioinformatics workflows that are not just faster, but fundamentally more efficient.
Start small: Use pipes in your next single-command analysis. Progress to chaining tools. Eventually, you'll write entire bioinformatics workflows as elegant streaming pipelines—just like Sarek does.
Your future self (and your disk quota administrator) will thank you.