
Unix Pipes in Bioinformatics: How Streaming Data Reduces Memory and Storage

· 22 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Unix pipes (|) are one of the most powerful yet underutilized features in bioinformatics. They allow you to chain multiple commands together, processing data in a streaming fashion that dramatically reduces memory usage and disk I/O. This post explores why pipes are essential for bioinformatics work and shows how they work under the hood.

The Problem: Data Explosion in Bioinformatics

Modern sequencing generates massive datasets. A single human genome sequencing run can produce:

  • Raw reads: 100+ GB of FASTQ files
  • Alignments: 50-100 GB of BAM files
  • Variants: 1-5 GB of VCF files

Without pipes, traditional bioinformatics workflows create intermediate files at each step:

# Traditional approach (❌ Wasteful)
bwa mem reference.fa reads.fastq > aligned.sam # 200 GB intermediate
samtools view -b aligned.sam > aligned.bam # 100 GB intermediate
samtools sort aligned.bam > sorted.bam # 100 GB intermediate
samtools index sorted.bam # Creates sorted.bam.bai (index)
# Total disk usage: 400 GB for processing 100 GB of raw data!

The costs:

  • Storage: roughly 4x the original data size during processing
  • Time: Writing and re-reading intermediate files is slow
  • Failure: If a step fails midway, you're left with partial files to clean up before rerunning
  • Complexity: Managing and cleaning up intermediate files is tedious

The Solution: Unix Pipes for Streaming Data

Pipes (|) connect the output of one command directly to the input of the next, processing data in memory as it flows through:

# Pipe approach (✓ Efficient)
bwa mem reference.fa reads.fastq | \
samtools view -b - | \
samtools sort -o sorted.bam -
samtools index sorted.bam
# Total disk usage: 100 GB (original data + final output)
# No intermediate files!

The benefits:

  • Memory efficient: Data flows through without full copies in RAM
  • Fast: No disk I/O for intermediates
  • Resilient: With set -o pipefail, a failure anywhere stops the whole pipeline immediately and visibly
  • Simple: Clean, readable pipeline syntax
  • Storage efficient: Only keep final outputs

How Pipes Work Under the Hood

Understanding how pipes work is key to writing efficient bioinformatics pipelines. Let's explore the mechanics.

1. File Descriptors in Unix

Every Unix process has three standard file descriptors:

Descriptor | Name   | Purpose         | Default Target
-----------|--------|-----------------|----------------
0          | stdin  | Standard input  | Keyboard
1          | stdout | Standard output | Terminal/Screen
2          | stderr | Standard error  | Terminal/Screen

By default, processes read from stdin (file descriptor 0) and write to stdout (file descriptor 1).

# Example: cat command
cat myfile.txt # Reads file, writes to stdout (terminal)
cat < myfile.txt # Explicitly redirect stdin from file
cat myfile.txt 1> output.txt # Redirect stdout to file
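One wrinkle worth remembering: a pipe only carries stdout. stderr (file descriptor 2) still goes to the terminal unless you redirect it yourself, which is usually what you want for tools that print progress messages. A quick illustration (the file names are just placeholders):

# stderr is separate from stdout, so only fd 1 travels down a pipe
samtools view missing.bam | wc -l                  # the error message still prints to the terminal
samtools view missing.bam 2> errors.log | wc -l    # send errors to a log file instead
samtools view missing.bam 2>&1 | wc -l             # or merge stderr into the pipe itself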

2. Creating a Pipe with the Pipe System Call

When you type command1 | command2, the shell:

  1. Creates a pipe - an unnamed, in-memory kernel buffer that connects two processes
  2. Forks a child process for command1
  3. Connects that child's stdout (file descriptor 1) to the pipe's write end
  4. Forks a child process for command2
  5. Connects that child's stdin (file descriptor 0) to the pipe's read end
  6. Execs both commands - they run concurrently, not one after the other

Diagram of a pipe:

[command1]                         [command2]
     |                                  ^
 stdout (fd 1)                     stdin (fd 0)
     |                                  |
     v                                  |
 [=========== Pipe (kernel buffer) ===========]
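You can mimic this wiring by hand with a named pipe (mkfifo). The sketch below is only for illustration; the shell does the same thing with an anonymous pipe and dup2 calls instead of a path on disk, and genome.fasta is just a placeholder file:

# Manually wiring two commands together the way the shell does for `|`
mkfifo /tmp/manual_pipe                  # a pipe that happens to have a name
cat genome.fasta > /tmp/manual_pipe &    # writer: stdout (fd 1) -> write end
grep -c ">chr1" < /tmp/manual_pipe       # reader: stdin (fd 0) <- read end
rm /tmp/manual_pipe                      # an anonymous pipe needs no cleanup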

3. The Kernel Manages Data Flow

The Unix kernel manages the pipe as a FIFO (First In, First Out) buffer:

  • Write side: command1 writes data to the pipe's write end
  • Pipe buffer: Data sits in kernel memory (typically 64KB-1MB per pipe)
  • Read side: command2 reads data from the pipe's read end

Key behaviors:

  1. If pipe is full: The writing process blocks until space is available
  2. If pipe is empty: The reading process blocks until data arrives
  3. If reader closes: Writer gets a SIGPIPE signal (broken pipe error)
  4. If writer closes: Reader gets EOF and can finish processing

This synchronization happens automatically—you don't need to manage it.
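You can watch two of these behaviors from any bash prompt. The sketch below uses yes (an endless writer), sleep (a reader that never reads), and head (a reader that quits early) to trigger a full buffer and then a broken pipe:

# Behavior 1: a full pipe blocks the writer
yes | sleep 5           # yes fills the ~64KB buffer almost instantly, then blocks

# Behavior 3: when the reader exits, the writer receives SIGPIPE
yes | head -n 1         # prints "y"; head exits, yes is killed by SIGPIPE
echo "${PIPESTATUS[@]}" # bash prints "141 0" (141 = 128 + signal 13, SIGPIPE)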

4. Example: Tracing a Real Pipe

Let's trace what happens when you run:

cat genome.fasta | grep ">chr1" | wc -l

Step 1: Shell creates the pipeline

User types: cat genome.fasta | grep ">chr1" | wc -l

Shell creates:
- Pipe A (between cat and grep)
- Pipe B (between grep and wc)

Step 2: Processes fork and file descriptors redirect

Process: cat genome.fasta
stdout (fd 1) → Pipe A write end

Process: grep ">chr1"
stdin (fd 0) ← Pipe A read end
stdout (fd 1) → Pipe B write end

Process: wc -l
stdin (fd 0) ← Pipe B read end
stdout (fd 1) → Terminal

Step 3: Execution flows

cat -> [Pipe A buffer] -> grep -> [Pipe B buffer] -> wc -> count printed to terminal

Step 4: Back pressure synchronization

If grep is slow:
- Pipe A fills up
- cat blocks (can't write)
- System automatically waits for grep to catch up

If wc is slow:
- Pipe B fills up
- grep blocks (can't write)
- System waits for wc to catch up
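You can make this backpressure visible with pv (pipe viewer), assuming it is installed: rate-limit one stage and everything upstream slows down to match, because the kernel blocks writers whenever the buffer is full.

# yes can write hundreds of MB/s, but the rate-limited middle stage holds it to 1 MB/s
yes | pv -q -L 1m | pv -r > /dev/null
# the second pv reports ~1 MiB/s: the writer is being blocked, not buffering gigabytes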

5. Memory Efficiency: Why Pipes Don't Load Everything into RAM

Without pipes (writing to disk):

command1: Read all data → Write 100GB to disk

Disk (100GB) ← Slow, uses storage

command2: Read 100GB from disk → Process → Write results

With pipes (streaming):

command1: Read chunk → Write chunk to pipe buffer (64KB)

Pipe buffer (kernel memory, reused)

command2: Read chunk → Process → Write chunk to next pipe

Only a small amount of data (one buffer, ~64KB) sits in memory at any time. The buffer is reused as data flows through, so memory usage stays constant regardless of total data size.

This is the magic of pipes: Constant memory usage, not linear in data size.
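If you want to verify this yourself, GNU time reports peak memory use. The sketch below assumes /usr/bin/time is GNU time and that some large reads.fastq.gz exists; the maximum resident set size stays small no matter how big the file is:

# Peak memory of a streaming pipeline, measured with GNU time's verbose report
/usr/bin/time -v sh -c 'zcat reads.fastq.gz | awk "NR%4==2" | wc -c' 2>&1 | \
grep "Maximum resident set size"
# Typically a few MB, whether the FASTQ is 1 GB or 100 GB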


Practical Example: Processing a Large FASTQ File

Let's apply pipes to a real bioinformatics workflow.

Without Pipes (Wasteful)

# Step 1: Filter low-quality reads
fastqc reads.fastq --outdir=qc_before/
fastq_quality_filter -i reads.fastq -o reads_filtered.fastq -q 20 -p 80
# File created: reads_filtered.fastq (70 GB)

# Step 2: Count remaining reads
wc -l reads_filtered.fastq
# Reads the 70 GB intermediate back from disk

# Step 3: Get sequence length distribution
awk 'NR%4==2 {print length}' reads_filtered.fastq | \
sort -n | uniq -c > length_dist.txt
# Processing reads_filtered.fastq again

# Cleanup
rm reads_filtered.fastq

# Total I/O: Read source (100GB) + Write filtered (70GB) + Read filtered twice (140GB) = 310GB
# Total disk: 170 GB (100 source + 70 intermediate)

With Pipes (Efficient)

# Single streaming pipeline: Filter → Count → Length distribution
fastq_quality_filter -i reads.fastq -q 20 -p 80 | \
tee >(wc -l > read_count.txt) | \
awk 'NR%4==2 {print length}' | \
sort -n | uniq -c > length_dist.txt

# Total I/O: Read source (100GB) = 100GB
# Total disk: 100 GB (source + final outputs only)
# Memory: small and roughly constant (64KB pipe buffers plus sort's working memory)

Comparison:

Metric           | Without Pipes                        | With Pipes
-----------------|--------------------------------------|----------------------------------
Total I/O        | 310 GB                               | 100 GB
Disk space       | 170 GB                               | 100 GB
Processing time  | Slow (multiple passes over the data) | Fast (one sequential read)
Memory usage     | Low (each tool streams)              | Low (each tool streams)
Failure recovery | Can resume from intermediate files   | Must rerun the whole pipeline

Advanced Pipe Patterns in Bioinformatics

1. Parallel Processing with GNU Parallel

Process multiple files simultaneously while piping:

# Filter 1000 FASTQ files in parallel, one job per file
find . -name "*.fastq" | \
parallel -j 8 'fastq_quality_filter -i {} -q 20 -p 80 | gzip' \
> all_filtered.fastq.gz

2. Tee: Branching a Pipeline

Use tee to send data to multiple streams:

# Simultaneously:
# 1. Count reads
# 2. Filter and output
# 3. Generate statistics
samtools view -h input.bam | \
tee >(samtools flagstat /dev/stdin > flagstat.txt) | \
samtools view -b -F 4 | \
samtools sort -o sorted_aligned.bam -

# The >(command) syntax is "process substitution"
# It creates a named pipe to a subprocess
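Under the hood, bash replaces each >(command) with a path (usually under /dev/fd) that is connected to that command's stdin, so the outer command simply sees one more file name. A quick sketch you can run at any bash prompt:

# See what >(cmd) expands to
echo >(cat)                            # prints something like /dev/fd/63
# Writing to that path feeds the subprocess
echo "chr1 12345" > >(tr 'a-z' 'A-Z')  # prints CHR1 12345 (possibly after the prompt returns)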

3. Named Pipes (mkfifo) for Complex Workflows

For workflows requiring multiple inputs/outputs:

# Create named pipes
mkfifo pipe1 pipe2

# Process A writes to pipe1, reads from pipe2
cat input.txt > pipe1 &

# Process B reads from pipe1, writes to pipe2
sort < pipe1 > pipe2 &

# Main process reads final result
uniq < pipe2

# Cleanup
rm pipe1 pipe2

4. Buffering and Backpressure Management

Sometimes a slow downstream command creates a bottleneck. A user-space buffering tool such as buffer (or mbuffer) adds extra memory between the two stages to smooth out bursts:

# Without buffer: cat blocks if samtools is slow
cat large.bam | samtools view -c

# With buffer: Extra memory absorbs the blocking
cat large.bam | buffer -m 500M | samtools view -c

When to use buffer:

  • Downstream process is much slower than upstream
  • You have spare RAM and want to minimize blocking
  • Upstream data is expensive to regenerate

5. Process Substitution for Multiple Outputs

Fan out to multiple processes from a single input:

# Single BAM file → Multiple statistics simultaneously
samtools view input.bam | \
tee >(wc -l > read_count.txt) \
>(awk '{print $3}' | sort | uniq -c > chromosome_dist.txt) \
>(awk '{print $4}' | sort -n > position_stats.txt) \
> /dev/null

# All three statistics generated from a single read of input.bam

Common Pipe Gotchas and Solutions

1. Buffering Issues with Pipes

Problem: Many programs switch to block buffering when stdout is a pipe, so output appears late or only in large chunks

# Output shows up in big, delayed chunks (Python block-buffers stdout in a pipe)
python long_script.py | tee output.log

Solution: Disable buffering

# Python: use the -u flag or set PYTHONUNBUFFERED=1
python -u long_script.py | tee output.log
PYTHONUNBUFFERED=1 python long_script.py | tee output.log
# stdbuf -oL forces line buffering for C tools that use libc stdio
# (it does not affect Python, which manages its own I/O buffers)
stdbuf -oL cut -f1-5 input.vcf | tee output.log

2. Error Handling in Pipes

Problem: By default, a pipeline's exit status is that of its last command, so failures earlier in the chain are silent

# If cat fails (e.g. missing file), the pipeline still reports success
# because only samtools' exit status is checked
cat input.bam | samtools view -b - > output.bam 2>/dev/null

Solution: Set pipefail

# If ANY command fails, the entire pipeline fails
set -o pipefail

cat input.bam | samtools view -b - > output.bam
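Bash also exposes each stage's exit status through the PIPESTATUS array, which tells you which command failed rather than just that something did:

cat input.bam | samtools view -b - > output.bam
echo "exit codes: ${PIPESTATUS[@]}"  # e.g. "1 0" if cat failed while samtools exited cleanly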

3. Monitoring Pipeline Progress

Problem: Pipes process data silently—hard to see progress

# No feedback on progress
cat input.fastq | process_cmd | sort > output.txt

Solution: Use pv (pipe viewer) to visualize throughput

# Shows progress, speed, and ETA
cat input.fastq | pv | process_cmd | sort > output.txt

# Or with file size estimation
pv -N "Reading FASTQ" < input.fastq | process_cmd | \
pv -N "Sorting output" | sort > output.txt

4. Debugging Pipe Failures

Problem: Which command in the pipeline failed?

# Unclear where the failure occurred
cmd1 | cmd2 | cmd3 | cmd4   # the pipeline failed, but which stage?

Solution: Use intermediate tee files for debugging

# Save intermediate outputs while piping
cmd1 | tee /tmp/debug1.txt | \
cmd2 | tee /tmp/debug2.txt | \
cmd3 | tee /tmp/debug3.txt | \
cmd4

# Later, inspect intermediates
cat /tmp/debug1.txt | head
cat /tmp/debug2.txt | head
cat /tmp/debug3.txt | head

Using Pipes in Nextflow Modules

Nextflow pipelines are built from individual processes, and each process can use pipes internally to efficiently chain tools together. This is where pipes truly shine in production bioinformatics workflows.

Real Example: BWA_MEM Process from Sarek

The Sarek variant calling pipeline includes a BWA_MEM process that demonstrates best practices for using pipes in Nextflow. Let's examine how pipes reduce intermediate files:

process BWA_MEM {
    tag "$meta.id"
    label 'process_high'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/bf/bf7890f8d4e38a7586581cb7fa13401b7af1582f21d94eef969df4cea852b6da/data' :
        'community.wave.seqera.io/library/bwa_htslib_samtools:56c9f8d5201889a4' }"

    input:
    tuple val(meta) , path(reads)
    tuple val(meta2), path(index)
    tuple val(meta3), path(fasta)
    val sort_bam

    output:
    tuple val(meta), path("*.bam") , emit: bam, optional: true
    tuple val(meta), path("*.cram"), emit: cram, optional: true
    tuple val(meta), path("*.csi") , emit: csi, optional: true
    tuple val(meta), path("*.crai"), emit: crai, optional: true
    path "versions.yml"            , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def args2 = task.ext.args2 ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    def samtools_command = sort_bam ? 'sort' : 'view'
    def extension = args2.contains("--output-fmt sam") ? "sam" :
                    args2.contains("--output-fmt cram") ? "cram" :
                    sort_bam && args2.contains("-O cram") ? "cram" :
                    !sort_bam && args2.contains("-C") ? "cram" :
                    "bam"
    def reference = fasta && extension=="cram" ? "--reference ${fasta}" : ""
    if (!fasta && extension=="cram") error "Fasta reference is required for CRAM output"
    """
    INDEX=`find -L ./ -name "*.amb" | sed 's/\\.amb\$//'`

    bwa mem \\
        $args \\
        -t $task.cpus \\
        \$INDEX \\
        $reads \\
        | samtools $samtools_command $args2 ${reference} --threads $task.cpus -o ${prefix}.${extension} -

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    END_VERSIONS
    """

    stub:
    def args2 = task.ext.args2 ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    def extension = args2.contains("--output-fmt sam") ? "sam" :
                    args2.contains("--output-fmt cram") ? "cram" :
                    sort_bam && args2.contains("-O cram") ? "cram" :
                    !sort_bam && args2.contains("-C") ? "cram" :
                    "bam"
    """
    touch ${prefix}.${extension}
    touch ${prefix}.csi
    touch ${prefix}.crai

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
    END_VERSIONS
    """
}

Understanding the Pipe in BWA_MEM

The key line in this process is the pipe connecting bwa and samtools:

bwa mem $args -t $task.cpus $INDEX $reads | samtools $samtools_command $args2 ${reference} --threads $task.cpus -o ${prefix}.${extension} -

Here's what happens:

  1. bwa mem aligns reads to reference genome

    • Outputs SAM format to stdout
    • This is hundreds of GB for large genomes
    • Without the pipe, this would be written to disk as an intermediate .sam file
  2. Pipe (|) connects output directly to samtools

    • Data streams from bwa to samtools in memory
    • No intermediate SAM file created
    • Constant memory usage regardless of dataset size
  3. samtools sort/view processes the SAM stream

    • Either sorts (if sort_bam=true) or just converts format
    • Final output directly to .bam, .cram, or .sam
    • Uses --threads $task.cpus to parallelize efficiently
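Concretely, once Nextflow has substituted the variables, the command that runs inside the task is an ordinary bash pipeline. The values below (index prefix, file names, thread count) are hypothetical, just to show the expanded shape:

# Hypothetical expansion of the templated command for one sample
bwa mem -t 16 genome.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
| samtools sort --threads 16 -o sample1.bam -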

Benefits in This Real Workflow

For a human whole-genome sequencing run (~100GB raw reads):

Without pipes (hypothetical):

bwa mem $INDEX $reads > aligned.sam      # 500+ GB intermediate
samtools sort aligned.sam > sorted.bam # 100+ GB final
rm aligned.sam
  • Storage needed: 600+ GB
  • Disk I/O: Read 100GB (bwa) + Write 500GB (SAM) + Read 500GB (samtools) = 1.1TB of I/O
  • Time: Significantly slower due to I/O contention

With pipes (Sarek approach):

bwa mem $INDEX $reads | samtools sort -o sorted.bam -
  • Storage needed: 100 GB (final output only)
  • Disk I/O: Read 100GB (bwa input) + Write 100GB (final BAM) = 200GB of I/O
  • Time: 2-5x faster due to eliminated intermediate I/O
  • Memory: Constant (~2-3GB) regardless of genome size

Flexibility Through Arguments

The Sarek process demonstrates production-grade flexibility:

def samtools_command = sort_bam ? 'sort' : 'view'

This single line allows switching between:

  • Sorting in the pipeline (slower but sorted output)
  • No sorting (faster, relies on downstream tools)

And the extension can be dynamically chosen:

def extension = args2.contains("--output-fmt cram") ? "cram" : "bam"

This means the same process can output BAM, CRAM, or SAM depending on configuration—all using the same efficient pipe pattern.

Error Handling in Nextflow Processes

Nextflow captures each task's exit status and log files. When the task shell is configured with pipefail (as nf-core pipelines do), a failure anywhere in the chain fails the whole task:

bwa mem ... | samtools sort ...  # If either command fails, the task fails

The task is marked as failed and its stdout/stderr are preserved in the work directory, which is far easier to debug than ad-hoc bash scripts where mid-pipeline errors can pass silently.

To get this behavior, set the shell options in your Nextflow configuration (e.g. nextflow.config):

process {
    shell = ['/bin/bash', '-euo', 'pipefail'] // -o pipefail catches pipe errors
}

This ensures that if bwa fails partway through, the task's exit status reflects that failure instead of reporting success just because samtools exited cleanly on a truncated stream.


Pipes for Download, Upload, and Cloud Storage

One of the most practical uses of pipes in bioinformatics is combining download/upload tools with data processing, eliminating intermediate files entirely. This is especially valuable when working with cloud storage like AWS S3.

Example 1: Download → Decompress → Process (No Temp Files)

Scenario: You need to download a compressed reference genome from a public server, decompress it, and index it—all without storing the compressed file.

# Traditional approach (❌ Wasteful)
wget https://example.com/reference.fasta.gz # Downloads 5GB
gunzip reference.fasta.gz # Decompresses to 15GB
samtools faidx reference.fasta # Index the reference
rm reference.fasta # Cleanup takes time
# Total disk space needed: 20GB

# Pipe approach (✓ Efficient)
wget -q -O - https://example.com/reference.fasta.gz | \
gunzip > reference.fasta
samtools faidx reference.fasta
# Total disk space needed: 15GB (decompressed reference + its tiny .fai index)
# The 5GB compressed file never touches disk, and there is no cleanup step

What happens:

  1. wget -O - downloads to stdout (not to disk)
  2. gunzip decompresses the stream on-the-fly and writes reference.fasta directly
  3. samtools faidx then indexes the decompressed reference (the .fai index records byte offsets in a real file, so it runs after the pipe)
  4. The reference and its index are produced without ever storing the compressed download

Example 2: Process → Compress → Upload to S3 (Single Pipeline)

Scenario: Process sequencing data and upload the compressed results directly to AWS S3 without creating local compressed files.

# Traditional approach (❌ Wasteful)
bwa mem reference.fa reads.fastq | samtools sort -o aligned.bam -
gzip aligned.bam # Creates aligned.bam.gz (100+ GB)
aws s3 cp aligned.bam.gz s3://my-bucket/ # Upload
rm aligned.bam.gz # Cleanup
# Total disk space: 200+ GB (original BAM + compressed file)

# Pipe approach (✓ Efficient)
bwa mem reference.fa reads.fastq | \
samtools sort -O bam - | \
gzip | \
aws s3 cp - s3://my-bucket/aligned.bam.gz
# Total disk space: 0 GB temporary files
# Compressed data uploaded directly to S3
# Only stores the final index or metadata files locally

What happens:

  1. bwa mem and samtools sort chain together (as before)
  2. gzip compresses the BAM stream on-the-fly
  3. aws s3 cp - uploads directly from stdin to S3
  4. Data never exists uncompressed on disk

Example 3: Download from S3 → Decompress → Analyze (No Local Copy)

Scenario: Analyze BAM files stored in S3 without downloading the entire uncompressed file locally.

# Traditional approach (❌ Wasteful)
aws s3 cp s3://my-bucket/sample.bam.gz . # Download 50GB
gunzip sample.bam.gz # Decompress to 150GB
samtools flagstat sample.bam # Analyze
# Local disk: 200GB, slow downloads

# Pipe approach (✓ Efficient)
aws s3 cp s3://my-bucket/sample.bam.gz - | \
gunzip | \
samtools flagstat /dev/stdin
# Local disk: 0 GB temporary files
# Stream analysis, no storage overhead

What happens:

  1. aws s3 cp - s3://... downloads from S3 to stdout
  2. gunzip decompresses the stream
  3. samtools flagstat reads and analyzes directly from the pipe
  4. Only metadata (flagstat results) stored locally

Example 4: Download Tarball → Extract → Index (No Intermediate Files)

Scenario: Download a compressed archive of reference sequences, extract, and index—all in one pipeline.

# Traditional approach (❌ Wasteful)
wget https://example.com/genomes.tar.gz # 10GB download
tar -xzf genomes.tar.gz # Extracts to 50GB
for ref in genomes/*.fasta; do
samtools faidx "$ref"
done
rm -rf genomes.tar.gz genomes/ # Cleanup
# Disk space: 60GB temporary files

# Pipe approach (✓ Efficient)
wget -q -O - https://example.com/genomes.tar.gz | \
tar -xzf - # Extract directly from the download stream
for ref in genomes/*.fasta; do
samtools faidx "$ref"
done
# Or extract onto a RAM disk (if /dev/shm has room) to avoid disk I/O:
wget -q -O - https://example.com/genomes.tar.gz | \
tar -xzf - -C /dev/shm
# Disk space: only the extracted FASTAs and their index files; the 10GB tarball never lands on disk

Better approach with GNU tar's streaming:

# Extract a single file from the tarball without ever storing the archive
wget -q -O - https://example.com/genomes.tar.gz | \
tar -xzf - --to-stdout genomes/reference.fasta > reference.fasta
samtools faidx reference.fasta

Example 5: Process Multiple Files from S3 with GNU Parallel

Scenario: Process many files in S3 in parallel using pipes and parallel processing.

# List all BAM files in S3 and process in parallel
aws s3 ls s3://my-bucket/bams/ --recursive | awk '{print $4}' | \
parallel -j 4 \
'aws s3 cp "s3://my-bucket/{}" - | \
samtools view -b -F 4 | \
samtools sort -o {/.}.sorted.bam -'

# What happens:
# 1. List all S3 objects
# 2. Process up to N files in parallel
# 3. Each file is downloaded and piped directly to samtools
# 4. Results written back to disk (or piped to another command)

Example 6: Bidirectional Piping: Upload Results as They're Generated

Scenario: Generate analysis results and upload to S3 incrementally (useful for long-running processes).

# Real-time upload of VCF variants as they're discovered
bcftools mpileup -f reference.fa sample.bam | \
bcftools call -m | \
tee >(gzip | aws s3 cp - s3://my-bucket/variants.vcf.gz) | \
grep -v "^#" | \
wc -l

# What happens:
# 1. bcftools generates variants
# 2. tee splits the stream into two paths:
# - First path: gzip and upload to S3 in real-time
# - Second path: count total variants
# 3. Both operations happen simultaneously from a single bcftools stream
# 4. The upload runs concurrently with the analysis; no separate compress-then-upload step afterwards

Real-world use case: Monitoring long-running analyses without waiting for completion:

# Start a 24-hour variant calling run with real-time uploading
bcftools mpileup -f ref.fa *.bam | \
bcftools call -m -v 2>/tmp/variants.log | \
tee >(gzip | aws s3 cp - s3://bucket/live-variants.vcf.gz) | \
grep --line-buffered "PASS" > /tmp/latest_variants.txt

# In another terminal, monitor progress:
watch -n 10 "wc -l /tmp/latest_variants.txt && aws s3 ls s3://bucket/live-variants.vcf.gz"

Tips for Reliable Download/Upload Pipes

  • Download + decompress: wget -O - URL | gunzip | process (use -O - to send the download to stdout)
  • Upload + compress: process | gzip | aws s3 cp - s3://bucket/key (a lone - means stdin/stdout)
  • S3 download + analyze: aws s3 cp s3://bucket/key - | process (the trailing - streams the object to stdout)
  • Streaming tar extraction: wget -O - URL.tar.gz | tar -xzf - --to-stdout member (-O/--to-stdout extracts to stdout)
  • Multiple S3 files: list with aws s3 ls, then fan out with parallel, each job running aws s3 cp s3://... -
  • Real-time monitoring: branch the stream with tee for simultaneous upload and local processing
  • Error handling: set -o pipefail so a failed S3 transfer fails the whole pipeline

Critical Considerations for Cloud Pipes

1. Network vs. Disk Bottleneck

# If the network is slower than local processing and you need several passes:
# keep a local copy while processing the first pass straight from the stream
wget -O - file.tar.gz | tee local.tar.gz | tar -xzO | process1
tar -xzO < local.tar.gz | process2
tar -xzO < local.tar.gz | process3
# Tradeoff: local storage vs. repeated network transfers

2. Retry Logic for Failed S3 Operations

# Simple retry with exponential backoff
# (caveat: a failed attempt may already have written partial bytes into the
#  pipe; for critical data, retry into a temp file and then stream that file)
retry_count=0
while [ $retry_count -lt 3 ]; do
    aws s3 cp s3://bucket/file - 2>/dev/null && break
    retry_count=$((retry_count + 1))
    sleep $((2 ** retry_count))
done | gunzip | process

3. Monitoring Upload Progress

# Use pv to monitor upload speed
process_data | \
pv -br | \
gzip | \
aws s3 cp - s3://bucket/file.gz

# Output: [1.2GB/s] or similar

4. Handling Large Files with Multipart Upload

For large uploads, aws s3 cp performs multipart uploads automatically. When reading from stdin it cannot know the total size in advance, so for very large streams (roughly 50 GB and up) give it a size hint:

# aws s3 cp already handles multipart; --expected-size (in bytes) helps it
# choose sensible part sizes when uploading from a stream
process_data | \
gzip | \
aws s3 cp - s3://bucket/large-file.gz \
--expected-size 107374182400 \
--sse AES256 \
--storage-class GLACIER # Optional: cheaper storage class

Real-World Bioinformatics Pipes

Example 1: RNA-seq Quality Control Pipeline

# Process RNA-seq reads:
# 1. Filter low quality
# 2. Count valid reads
# 3. Extract length distribution
# 4. Trim adapters, then collapse duplicate reads

fastq_quality_filter -i reads.fastq -q 20 -p 80 | \
tee >(wc -l | awk '{print "Valid reads: "$1/4}' > read_count.txt) | \
tee >(awk 'NR%4==2 {print length}' | \
sort -n | uniq -c > length_dist.txt) | \
fastx_clipper -a AGATCGGAAGAGC | \
fastx_collapser -o collapsed.fasta

Example 2: Variant Calling Pipeline

# Align paired reads, fix mate tags, mark duplicates, and call variants
bwa mem -t 8 reference.fa reads_R1.fastq reads_R2.fastq | \
samtools fixmate -m - - | \
samtools sort -o sorted.bam - && \
samtools markdup sorted.bam marked.bam && \
samtools index marked.bam && \
bcftools mpileup -f reference.fa marked.bam | \
bcftools call -mv -o variants.vcf

# Memory: Constant (~1-2 GB for buffers)
# Disk: only sorted.bam and marked.bam persist (markdup and indexing need real, seekable files)

Example 3: FASTA Processing with Decompression

# Decompress, collapse each record's wrapped sequence onto one line, and recompress
zcat genome.fasta.gz | \
awk '/^>/{if (seq) print seq; print; seq=""; next} {seq=seq $0} END{if (seq) print seq}' | \
gzip > processed.fasta.gz

# No intermediate uncompressed files
# Entire genome processed with minimal disk space

Summary: Why Pipes Matter in Bioinformatics

Problem            | Pipe Solution
-------------------|-----------------------------------------------------
Intermediate files | Stream directly between tools
Disk space         | No temporary storage needed
Memory usage       | Constant, independent of data size
Processing speed   | Single sequential read of the data
Failure recovery   | Failures surface immediately (with set -o pipefail)
Code readability   | Clear, linear data flow

Key Takeaways

1. Pipes Enable Streaming Data Processing

  • Data flows through memory buffers, not disk
  • Only ~64KB per pipe sits in RAM at any time
  • Processing 1GB or 1TB uses the same constant memory
  • This is the fundamental advantage over traditional file-based workflows

2. Kernel-Level Synchronization (Automatic)

  • No manual management needed
  • Backpressure prevents one slow command from overwhelming others
  • If a command fails, the entire pipeline fails cleanly
  • Use set -o pipefail in bash to ensure this behavior

3. I/O Reduction is Massive in Bioinformatics

  • Traditional alignment: 100GB reads → 500GB SAM → 100GB BAM (1.1TB I/O)
  • Pipe alignment: 100GB reads → 100GB BAM (200GB I/O total)
  • Savings: 5.5x reduction in disk I/O
  • Practical benefit: 2-5x faster execution on disk-bound operations

4. Pipes Work Seamlessly with Cloud Storage

  • Download with wget -O - or aws s3 cp s3://... -
  • Stream directly from S3 without local copies
  • Upload results on-the-fly during processing
  • Combine with tee for simultaneous processing and uploading

5. Production Workflows (Nextflow, Sarek) Use Pipes Extensively

  • Real example: Sarek's BWA_MEM process pipes bwa mem | samtools sort
  • This is the standard pattern for large-scale bioinformatics
  • Nextflow adds reliability and error handling on top
  • No need to choose between pipes and Nextflow—they work together

6. Practical Patterns You'll Use

  • Simple chain - sequential processing: cmd1 | cmd2 | cmd3
  • Branching - multiple outputs from one input: cmd1 | tee >(cmd2) | cmd3
  • Download + process - remote files without local storage: wget -O - url | gunzip | process
  • Upload + compress - direct to cloud storage: process | gzip | aws s3 cp - s3://...
  • Error safety - catch failures in pipes: set -o pipefail
  • Parallel processing - scale across cores/machines: input | parallel 'process'

7. When Pipes Are Most Valuable

Use pipes for:

  • Massive datasets (100GB+) where I/O is the bottleneck
  • Cloud storage workflows where local disk is expensive
  • Real-time monitoring of long-running analyses
  • Linear processing chains (one output → next input)
  • Development and rapid prototyping
  • Nextflow process scripts (internal tool chaining)

Avoid pipes for:

  • Highly branching workflows (many different paths)
  • Complex error recovery (need to restart from middle)
  • Data that needs multiple passes through the same file
  • When you need to monitor intermediate results extensively

8. Master These Tools for Production Workflows

# Essential for pipe mastery:
set -o pipefail                      # Error handling
tee                                  # Branching a stream
pv                                   # Progress monitoring
>(cmd)                               # Process substitution: stream into another command
aws s3 cp - s3://...                 # Cloud integration: pipe to/from S3
parallel                             # GNU parallel: scale across cores

Final Thoughts

Unix pipes are a cornerstone of efficient bioinformatics. They're not just a convenient syntax—they're a fundamental architecture for processing data that would otherwise overwhelm available disk space and I/O capacity.

The real power becomes apparent at scale:

  • Small datasets: Pipes save time and code complexity
  • Large datasets: Pipes are often the only practical solution
  • Cloud workflows: Pipes enable streaming to/from S3 without local copies
  • Production pipelines: Pipes are embedded in every major workflow (Nextflow, Snakemake, etc.)

By understanding how pipes work under the hood—how the kernel manages file descriptors, buffers data, and synchronizes backpressure—you can write bioinformatics workflows that are not just faster, but fundamentally more efficient.

Start small: add a single pipe to your next analysis. Progress to chaining several tools. Eventually, you'll write entire bioinformatics workflows as elegant streaming pipelines, just like Sarek does.

Your future self (and your disk quota administrator) will thank you.