Unix Pipes in Bioinformatics: How Streaming Data Reduces Memory and Storage
Unix pipes (|) are one of the most powerful yet underutilized features in bioinformatics. They allow you to chain multiple commands together, processing data in a streaming fashion that dramatically reduces memory usage and disk I/O. This post explores why pipes are essential for bioinformatics work and shows how they work under the hood.
The Problem: Data Explosion in Bioinformatics
Modern sequencing generates massive datasets. A single human genome sequencing run can produce:
- Raw reads: 100+ GB of FASTQ files
- Alignments: 50-100 GB of BAM files
- Variants: 1-5 GB of VCF files
Without pipes, traditional bioinformatics workflows create intermediate files at each step:
# Traditional approach (❌ Wasteful)
bwa mem reference.fa reads.fastq > aligned.sam # 200 GB intermediate
samtools view -b aligned.sam > aligned.bam # 100 GB intermediate
samtools sort aligned.bam > sorted.bam # 100 GB intermediate
samtools index sorted.bam # Creates the sorted.bam.bai index
# Total disk usage: 400 GB for processing 100 GB of raw data!
The costs:
- Storage: roughly 4x the original data size in extra files before cleanup
- Time: writing and re-reading intermediate files is slow
- Failure handling: a step that dies midway leaves partial intermediates to track down and remove before rerunning
- Complexity: managing and cleaning up intermediate files is tedious
The Solution: Unix Pipes for Streaming Data
Pipes (|) connect the output of one command directly to the input of the next, processing data in memory as it flows through:
# Pipe approach (✓ Efficient)
bwa mem reference.fa reads.fastq | \
samtools view -b - | \
samtools sort -o sorted.bam -
samtools index sorted.bam
# Total disk usage: ~100 GB of new files (the final sorted BAM only)
# No intermediate files!
The benefits:
- Memory efficient: Data flows through without full copies in RAM
- Fast: No disk I/O for intermediates
- Fail fast: with set -o pipefail, a failure anywhere stops the whole pipeline immediately
- Simple: Clean, readable pipeline syntax
- Storage efficient: Only keep final outputs
How Pipes Work Under the Hood
Understanding how pipes work is key to writing efficient bioinformatics pipelines. Let's explore the mechanics.
1. File Descriptors in Unix
Every Unix process has three standard file descriptors:
| Descriptor | Name | Purpose | Default Target |
|---|---|---|---|
| 0 | stdin | Standard input | Keyboard |
| 1 | stdout | Standard output | Terminal/Screen |
| 2 | stderr | Standard error | Terminal/Screen |
By default, processes read from stdin (file descriptor 0) and write to stdout (file descriptor 1).
# Example: cat command
cat myfile.txt # Reads file, writes to stdout (terminal)
cat < myfile.txt # Explicitly redirect stdin from file
cat myfile.txt 1> output.txt # Redirect stdout to file
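The table above also lists stderr (fd 2), which is what keeps log messages out of your data stream. A minimal sketch of the common redirections (the file names are just placeholders):
# Example: separating data from logs
bwa mem ref.fa reads.fastq > aligned.sam 2> bwa.log   # fd 1 → SAM file, fd 2 → log file
samtools flagstat aligned.bam 2>&1 | less             # merge stderr into stdout for paging
samtools view -b aligned.sam 2> /dev/null > out.bam   # discard stderr entirely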
2. Creating a Pipe with the Pipe System Call
When you type command1 | command2, the shell:
- Creates a pipe: an unnamed, in-memory kernel buffer that connects two processes
- Forks process 1: a child process that will run command1
- Connects stdout of process 1 (file descriptor 1) to the pipe's write end
- Forks process 2: a child process that will run command2
- Connects stdin of process 2 (file descriptor 0) to the pipe's read end
- Executes both commands: they run in parallel
Diagram of a pipe:
[command1] [command2]
| |
stdout (fd 1) stdin (fd 0)
| |
v ^
[================================]
Pipe (kernel buffer)
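A quick way to convince yourself that the two sides really are separate, concurrently running processes is to print $BASHPID on each side (bash-specific; $BASHPID reports the PID of the current subshell):
echo "parent shell PID: $BASHPID"
# Each side of the | runs in its own subshell, so the three PIDs all differ
{ echo "writer PID: $BASHPID"; } | { cat; echo "reader PID: $BASHPID"; }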
3. The Kernel Manages Data Flow
The Unix kernel manages the pipe as a FIFO (First In, First Out) buffer:
- Write side: command1 writes data to the pipe's write end
- Pipe buffer: data sits in kernel memory (typically 64KB-1MB per pipe)
- Read side: command2 reads data from the pipe's read end
Key behaviors:
- If pipe is full: The writing process blocks until space is available
- If pipe is empty: The reading process blocks until data arrives
- If reader closes: Writer gets a SIGPIPE signal (broken pipe error)
- If writer closes: Reader gets EOF and can finish processing
This synchronization happens automatically—you don't need to manage it.
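You can see the broken-pipe rule with two tiny commands (bash-specific, since PIPESTATUS is a bash array):
# `yes` would write forever, but `head` exits after one line and closes the read end,
# so `yes` is killed by SIGPIPE and the pipeline finishes immediately
yes | head -n 1
echo "exit statuses: ${PIPESTATUS[@]}"   # typically "141 0" (141 = 128 + 13, i.e. SIGPIPE)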
4. Example: Tracing a Real Pipe
Let's trace what happens when you run:
cat genome.fasta | grep ">chr1" | wc -l
Step 1: Shell creates the pipeline
User types: cat genome.fasta | grep ">chr1" | wc -l
↓
Shell creates:
- Pipe A (between cat and grep)
- Pipe B (between grep and wc)
Step 2: Processes fork and file descriptors redirect
Process: cat genome.fasta
stdout (fd 1) → Pipe A write end
Process: grep ">chr1"
stdin (fd 0) ← Pipe A read end
stdout (fd 1) → Pipe B write end
Process: wc -l
stdin (fd 0) ← Pipe B read end
stdout (fd 1) → Terminal
Step 3: Execution flows
[cat] → Pipe A buffer → [grep] → Pipe B buffer → [wc] → [count to terminal]
Step 4: Back pressure synchronization
If grep is slow:
- Pipe A fills up
- cat blocks (can't write)
- System automatically waits for grep to catch up
If wc is slow:
- Pipe B fills up
- grep blocks (can't write)
- System waits for wc to catch up
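You can make this backpressure visible by deliberately throttling one stage with pv (a small sketch; pv -L caps throughput and genome.fasta is just a placeholder):
# Cap the middle stage at 1 MB/s; cat blocks whenever the pipe buffer ahead of pv
# is full, so it never runs ahead of the slow stage or fills memory
cat genome.fasta | pv -L 1m | grep ">chr1" | wc -l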
5. Memory Efficiency: Why Pipes Don't Load Everything into RAM
Without pipes (writing to disk):
command1: Read all data → Write 100GB to disk
↓
Disk (100GB) ← Slow, uses storage
↓
command2: Read 100GB from disk → Process → Write results
With pipes (streaming):
command1: Read chunk → Write chunk to pipe buffer (64KB)
↓
Pipe buffer (kernel memory, reused)
↓
command2: Read chunk → Process → Write chunk to next pipe
Only a small amount of data (one buffer, ~64KB) sits in memory at any time. The buffer is reused as data flows through, so memory usage stays constant regardless of total data size.
This is the magic of pipes: Constant memory usage, not linear in data size.
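You can check the constant-memory claim directly (a sketch assuming GNU coreutils and GNU time, i.e. /usr/bin/time -v on Linux):
# Stream 5 GB through gzip; "Maximum resident set size" in the report stays in the
# low-MB range even though far more data than that flows through the pipe
head -c 5G /dev/zero | /usr/bin/time -v gzip -1 > /dev/null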
Practical Example: Processing a Large FASTQ File
Let's apply pipes to a real bioinformatics workflow.
Without Pipes (Wasteful)
# Step 1: Filter low-quality reads
fastqc reads.fastq --outdir=qc_before/
fastq_quality_filter -i reads.fastq -o reads_filtered.fastq -q 20 -p 80
# File created: reads_filtered.fastq (70 GB)
# Step 2: Count remaining reads
wc -l reads_filtered.fastq
# Temporary file: 70 GB on disk
# Step 3: Get sequence length distribution
awk 'NR%4==2 {print length}' reads_filtered.fastq | \
sort -n | uniq -c > length_dist.txt
# Processing reads_filtered.fastq again
# Cleanup
rm reads_filtered.fastq
# Total I/O: Read source (100GB) + Write filtered (70GB) + Read filtered (70GB) = 240GB
# Total disk: 170 GB (100 source + 70 intermediate)
With Pipes (Efficient)
# Single streaming pipeline: Filter → Count → Length distribution
fastq_quality_filter -i reads.fastq -q 20 -p 80 | \
tee >(wc -l > read_count.txt) | \
awk 'NR%4==2 {print length}' | \
sort -n | uniq -c > length_dist.txt
# Total I/O: Read source (100GB) = 100GB
# Total disk: 100 GB (source + final outputs only)
# Memory: modest and constant (pipe buffers plus sort's working memory)
Comparison:
| Metric | Without Pipes | With Pipes |
|---|---|---|
| Total I/O | 240 GB | 100 GB |
| Disk space | 170 GB | 100 GB |
| Processing time | Slow (multiple reads from disk) | Fast (one sequential read) |
| Memory usage | Low (each tool streams) | Low (each tool streams) |
| Failure recovery | Can resume from saved intermediates | Must rerun the whole pipeline |
Advanced Pipe Patterns in Bioinformatics
1. Parallel Processing with GNU Parallel
Process multiple files simultaneously while piping:
# Process 1000 FASTQ files in parallel; each job's gzipped output is appended in turn
find . -name "*.fastq" | \
  parallel 'fastq_quality_filter -q 20 -p 80 -i {} | gzip' > all_filtered.fastq.gz
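Note that --pipe is a different GNU parallel mode: instead of one job per file, it splits a single large stream into blocks and feeds each block to a worker. A hedged sketch of that mode (the -L 4 record size keeps each 4-line FASTQ record intact):
# Compress one huge FASTQ stream with several gzip workers; parallel groups each
# worker's output, so the result is a valid multi-member gzip file
cat reads.fastq | \
  parallel --pipe -L 4 --block 100M gzip > reads.fastq.gz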
2. Tee: Branching a Pipeline
Use tee to send data to multiple streams:
# Simultaneously:
# 1. Count reads
# 2. Filter and output
# 3. Generate statistics
samtools view -h input.bam | \
tee >(samtools flagstat /dev/stdin > flagstat.txt) | \
samtools view -b -F 4 | \
samtools sort -o sorted_aligned.bam -
# The >(command) syntax is "process substitution"
# It creates a named pipe to a subprocess
3. Named Pipes (mkfifo) for Complex Workflows
For workflows requiring multiple inputs/outputs:
# Create named pipes
mkfifo pipe1 pipe2
# Process A writes to pipe1
cat input.txt > pipe1 &
# Process B reads from pipe1, writes to pipe2
sort < pipe1 > pipe2 &
# Main process reads final result
uniq < pipe2
# Cleanup
rm pipe1 pipe2
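A more bioinformatics-flavoured use of a named pipe is feeding a stream to a tool that insists on a filename argument rather than stdin (a sketch reusing the quality-filter command from earlier; the "reader" here is just wc -l):
mkfifo filtered.fastq
# Writer runs in the background, streaming filtered reads into the named pipe
fastq_quality_filter -i reads.fastq -q 20 -p 80 > filtered.fastq &
# Reader opens the pipe as if it were an ordinary file; no 70GB intermediate is written
wc -l filtered.fastq
rm filtered.fastq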
4. Buffering and Backpressure Management
Sometimes a slow downstream command creates a bottleneck. Use buffer to add extra memory:
# Without buffer: cat blocks if samtools is slow
cat large.bam | samtools view -c
# With buffer: Extra memory absorbs the blocking
cat large.bam | buffer -m 500M | samtools view -c
When to use buffer:
- Downstream process is much slower than upstream
- You have spare RAM and want to minimize blocking
- Upstream data is expensive to regenerate
5. Process Substitution for Multiple Outputs
Fan out to multiple processes from a single input:
# Single BAM file → Multiple statistics simultaneously
samtools view input.bam | \
tee >(wc -l > read_count.txt) \
>(awk '{print $3}' | sort | uniq -c > chromosome_dist.txt) \
>(awk '{print $4}' | sort -n > position_stats.txt) \
> /dev/null
# All three statistics generated from a single read of input.bam
Common Pipe Gotchas and Solutions
1. Buffering Issues with Pipes
Problem: Commands that buffer their output make downstream results appear late or all at once
# Output from python may arrive in large chunks, or only when the script finishes
python long_script.py | tee output.log
Solution: Disable or reduce buffering
# Python: use -u (or PYTHONUNBUFFERED=1); stdbuf does not affect Python's own buffering
python -u long_script.py | tee output.log
# For C-stdio tools such as grep, sed, and awk, stdbuf -oL forces line-buffered output
tail -f pipeline.log | stdbuf -oL grep "ERROR" | tee errors.log
2. Error Handling in Pipes
Problem: By default, a pipeline's exit status is that of its last command only
# If cat fails here (e.g., input.bam is missing), the pipeline still exits 0,
# because only samtools' status is reported
cat input.bam | samtools view -b - > output.bam 2>/dev/null
Solution: Set pipefail
# If ANY command fails, the entire pipeline fails
set -o pipefail
cat input.bam | samtools view -b - > output.bam
3. Monitoring Pipeline Progress
Problem: Pipes process data silently—hard to see progress
# No feedback on progress
cat input.fastq | process_cmd | sort > output.txt
Solution: Use pv (pipe viewer) to visualize throughput
# Shows progress, speed, and ETA
cat input.fastq | pv | process_cmd | sort > output.txt
# Or with file size estimation
pv -N "Reading FASTQ" < input.fastq | process_cmd | \
pv -N "Sorting output" | sort > output.txt
4. Debugging Pipe Failures
Problem: Which command in the pipeline failed?
# Unclear where failure occurred
cmd1 | cmd2 | cmd3 | cmd4   # exits non-zero, but which stage failed?
Solution: Use intermediate tee files for debugging
# Save intermediate outputs while piping
cmd1 | tee /tmp/debug1.txt | \
cmd2 | tee /tmp/debug2.txt | \
cmd3 | tee /tmp/debug3.txt | \
cmd4
# Later, inspect intermediates
cat /tmp/debug1.txt | head
cat /tmp/debug2.txt | head
cat /tmp/debug3.txt | head
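In bash you can also pinpoint the failing stage without any temporary files, using the PIPESTATUS array (cmd1 to cmd4 are the same placeholders as above):
cmd1 | cmd2 | cmd3 | cmd4
echo "exit codes: ${PIPESTATUS[@]}"   # e.g. "0 1 0 0" means cmd2 was the one that failed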
Using Pipes in Nextflow Modules
Nextflow pipelines are built from individual processes, and each process can use pipes internally to efficiently chain tools together. This is where pipes truly shine in production bioinformatics workflows.
Real Example: BWA_MEM Process from Sarek
The Sarek variant calling pipeline includes a BWA_MEM process that demonstrates best practices for using pipes in Nextflow. Let's examine how pipes reduce intermediate files:
process BWA_MEM {
tag "$meta.id"
label 'process_high'
conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/bf/bf7890f8d4e38a7586581cb7fa13401b7af1582f21d94eef969df4cea852b6da/data' :
'community.wave.seqera.io/library/bwa_htslib_samtools:56c9f8d5201889a4' }"
input:
tuple val(meta) , path(reads)
tuple val(meta2), path(index)
tuple val(meta3), path(fasta)
val sort_bam
output:
tuple val(meta), path("*.bam") , emit: bam, optional: true
tuple val(meta), path("*.cram") , emit: cram, optional: true
tuple val(meta), path("*.csi") , emit: csi, optional: true
tuple val(meta), path("*.crai") , emit: crai, optional: true
path "versions.yml" , emit: versions
when:
task.ext.when == null || task.ext.when
script:
def args = task.ext.args ?: ''
def args2 = task.ext.args2 ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def samtools_command = sort_bam ? 'sort' : 'view'
def extension = args2.contains("--output-fmt sam") ? "sam" :
args2.contains("--output-fmt cram") ? "cram":
sort_bam && args2.contains("-O cram")? "cram":
!sort_bam && args2.contains("-C") ? "cram":
"bam"
def reference = fasta && extension=="cram" ? "--reference ${fasta}" : ""
if (!fasta && extension=="cram") error "Fasta reference is required for CRAM output"
"""
INDEX=`find -L ./ -name "*.amb" | sed 's/\\.amb\$//'`
bwa mem \\
$args \\
-t $task.cpus \\
\$INDEX \\
$reads \\
| samtools $samtools_command $args2 ${reference} --threads $task.cpus -o ${prefix}.${extension} -
cat <<-END_VERSIONS > versions.yml
"${task.process}":
bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
stub:
def args2 = task.ext.args2 ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def extension = args2.contains("--output-fmt sam") ? "sam" :
args2.contains("--output-fmt cram") ? "cram":
sort_bam && args2.contains("-O cram")? "cram":
!sort_bam && args2.contains("-C") ? "cram":
"bam"
"""
touch ${prefix}.${extension}
touch ${prefix}.csi
touch ${prefix}.crai
cat <<-END_VERSIONS > versions.yml
"${task.process}":
bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
Understanding the Pipe in BWA_MEM
The key line in this process is the pipe connecting bwa and samtools:
bwa mem $args -t $task.cpus $INDEX $reads | samtools $samtools_command $args2 ${reference} --threads $task.cpus -o ${prefix}.${extension} -
Here's what happens:
- bwa mem aligns reads to the reference genome
  - Outputs SAM format to stdout
  - For a large genome this stream can be hundreds of GB
  - Without the pipe, it would be written to disk as an intermediate .sam file
- The pipe (|) connects the output directly to samtools
  - Data streams from bwa to samtools in memory
  - No intermediate SAM file is created
  - Memory usage stays constant regardless of dataset size
- samtools sort/view processes the SAM stream
  - Either sorts (if sort_bam is true) or just converts the format
  - Writes the final output directly as .bam, .cram, or .sam
  - Uses --threads $task.cpus to parallelize efficiently
Benefits in This Real Workflow
For a human whole-genome sequencing run (~100GB raw reads):
Without pipes (hypothetical):
bwa mem $INDEX $reads > aligned.sam # 500+ GB intermediate
samtools sort aligned.sam > sorted.bam # 100+ GB final
rm aligned.sam
- Storage needed: 600+ GB
- Disk I/O: Read 100GB (bwa) + Write 500GB (SAM) + Read 500GB (samtools) = 1.1TB of I/O
- Time: Significantly slower due to I/O contention
With pipes (Sarek approach):
bwa mem $INDEX $reads | samtools sort -o sorted.bam -
- Storage needed: 100 GB (final output only)
- Disk I/O: Read 100GB (bwa input) + Write 100GB (final BAM) = 200GB of I/O
- Time: 2-5x faster due to eliminated intermediate I/O
- Memory: Constant (~2-3GB) regardless of genome size
Flexibility Through Arguments
The Sarek process demonstrates production-grade flexibility:
def samtools_command = sort_bam ? 'sort' : 'view'
This single line allows switching between:
- Sorting in the pipeline (slower but sorted output)
- No sorting (faster, relies on downstream tools)
And the extension can be dynamically chosen:
def extension = args2.contains("--output-fmt cram") ? "cram" : "bam"
This means the same process can output BAM, CRAM, or SAM depending on configuration—all using the same efficient pipe pattern.
Error Handling in Nextflow Processes
Nextflow marks a task as failed whenever its script exits with a non-zero status, and it captures the error logs for you:
bwa mem ... | samtools sort ... # If this pipeline exits non-zero, Nextflow fails the task
For a pipe, though, the exit status is normally that of the last command only. To make a failure anywhere in the chain fail the task, enable pipefail in your configuration (nf-core pipelines such as Sarek set this by default):
process {
shell = ['/bin/bash', '-euo', 'pipefail'] // -o pipefail catches pipe errors
}
With this setting, if bwa fails partway through, the task fails instead of samtools silently succeeding on a truncated stream.
Pipes for Download, Upload, and Cloud Storage
One of the most practical uses of pipes in bioinformatics is combining download/upload tools with data processing, eliminating intermediate files entirely. This is especially valuable when working with cloud storage like AWS S3.
Example 1: Download → Decompress → Process (No Temp Files)
Scenario: You need to download a compressed reference genome from a public server, decompress it, and index it—all without storing the compressed file.
# Traditional approach (❌ Wasteful)
wget https://example.com/reference.fasta.gz # Downloads 5GB
gunzip reference.fasta.gz # Decompresses to 15GB
samtools faidx reference.fasta # Index the reference
rm reference.fasta # Cleanup takes time
# Total disk space needed: 20GB
# Pipe approach (✓ Efficient)
wget -q -O - https://example.com/reference.fasta.gz | \
  gunzip > reference.fasta && \
samtools faidx reference.fasta
# Total disk space needed: 15GB (final reference + index only)
# The 5GB compressed copy never touches disk, and there is nothing to clean up
What happens:
- wget -O - downloads to stdout instead of to a file on disk
- gunzip decompresses the stream on the fly, writing only the final reference.fasta
- samtools faidx then indexes the decompressed reference (faidx needs a real, seekable file, so it runs once the stream has been written)
- The compressed intermediate never exists locally
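A variation worth knowing (a sketch assuming htslib's bgzip is installed): recompress the stream with bgzip instead of leaving it uncompressed, since samtools faidx can index BGZF-compressed FASTA directly.
# Keep the reference compressed on disk (~5GB instead of 15GB) and still index it
wget -q -O - https://example.com/reference.fasta.gz | \
  gunzip | bgzip > reference.fasta.gz && \
samtools faidx reference.fasta.gz   # writes reference.fasta.gz.fai and reference.fasta.gz.gzi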
Example 2: Process → Compress → Upload to S3 (Single Pipeline)
Scenario: Align and sort sequencing data, then ship the compressed result straight to AWS S3 without staging it on local disk.
# Traditional approach (❌ Wasteful)
bwa mem reference.fa reads.fastq | samtools sort -o aligned.bam -
aws s3 cp aligned.bam s3://my-bucket/ # Upload the 100+ GB BAM
rm aligned.bam # Cleanup
# Total disk space: 100+ GB for a local copy that exists only to be uploaded
# Pipe approach (✓ Efficient)
bwa mem reference.fa reads.fastq | \
  samtools sort -O bam - | \
  aws s3 cp - s3://my-bucket/aligned.bam
# Total disk space: 0 GB of temporary files
# The sorted, BGZF-compressed BAM streams directly to S3
What happens:
- bwa mem and samtools sort chain together (as before)
- samtools sort writes compressed BAM to stdout (BAM is already BGZF-compressed, so no extra gzip step is needed; pipe text formats such as SAM or VCF through gzip or bgzip before uploading)
- aws s3 cp - uploads directly from stdin to S3
- The alignment never exists on local disk
Example 3: Download from S3 → Analyze (No Local Copy)
Scenario: Analyze BAM files stored in S3 without ever downloading them to local disk.
# Traditional approach (❌ Wasteful)
aws s3 cp s3://my-bucket/sample.bam . # Download 50GB
samtools flagstat sample.bam # Analyze
rm sample.bam # Cleanup
# Local disk: 50GB used just to compute a small text summary
# Pipe approach (✓ Efficient)
aws s3 cp s3://my-bucket/sample.bam - | \
  samtools flagstat -
# Local disk: 0 GB temporary files
# Stream analysis, no storage overhead
What happens:
- aws s3 cp s3://... - downloads from S3 to stdout
- samtools flagstat reads and analyzes the BAM directly from the pipe
- Only the small flagstat summary needs to be stored locally
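Many htslib-based tools can skip the aws CLI entirely: if your samtools build includes libcurl/S3 support, it accepts s3:// URLs directly, reading credentials from the usual AWS environment variables or ~/.aws/credentials. A hedged sketch, assuming such a build:
# htslib streams the object over HTTPS; nothing is written locally
samtools flagstat s3://my-bucket/sample.bam
# With the index (sample.bam.bai) also in the bucket, region queries work remotely too
samtools view s3://my-bucket/sample.bam chr1:1-100000 | head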
Example 4: Download Tarball → Extract → Index (No Intermediate Files)
Scenario: Download a compressed archive of reference sequences, extract, and index—all in one pipeline.
# Traditional approach (❌ Wasteful)
wget https://example.com/genomes.tar.gz # 10GB download
tar -xzf genomes.tar.gz # Extracts to 50GB
for ref in genomes/*.fasta; do
samtools faidx "$ref"
done
rm -rf genomes.tar.gz genomes/ # Cleanup
# Disk space: 60GB temporary files
# Pipe approach (✓ Efficient)
wget -q -O - https://example.com/genomes.tar.gz | \
  tar -xzf -   # Stream-extract; the 10GB tarball itself never touches disk
for ref in genomes/*.fasta; do
  samtools faidx "$ref"
done
# Disk space: 50GB of extracted references plus their small .fai index files
# (no 10GB compressed copy; with enough RAM, extract to a RAM disk instead: tar -xzf - -C /dev/shm)
Better approach when you only need one file from the archive:
# Extract a single member to stdout; tar still streams through the whole compressed
# archive, but only the requested file is written to disk
wget -q -O - https://example.com/genomes.tar.gz | \
  tar -xzf - genomes/reference.fasta -O > reference.fasta && \
samtools faidx reference.fasta
Example 5: Process Multiple Files from S3 with GNU Parallel
Scenario: Process many files in S3 in parallel using pipes and parallel processing.
# List all BAM files in S3 and process up to 8 of them at a time
aws s3 ls s3://my-bucket/bams/ --recursive | awk '{print $4}' | \
  parallel -j 8 \
    'aws s3 cp "s3://my-bucket/{}" - | \
     samtools view -b -F 4 - | \
     samtools sort -o {/.}.sorted.bam -'
# What happens:
# 1. List all S3 objects
# 2. Process up to N files in parallel
# 3. Each file is downloaded and piped directly to samtools
# 4. Results written back to disk (or piped to another command)
Example 6: Branching Pipes: Upload Results as They're Generated
Scenario: Generate analysis results and upload to S3 incrementally (useful for long-running processes).
# Real-time upload of VCF variants as they're discovered
bcftools mpileup -f reference.fa sample.bam | \
bcftools call -m | \
tee >(gzip | aws s3 cp - s3://my-bucket/variants.vcf.gz) | \
grep -v "^#" | \
wc -l
# What happens:
# 1. bcftools generates variants
# 2. tee splits the stream into two paths:
# - First path: gzip and upload to S3 in real-time
# - Second path: count total variants
# 3. Both operations happen simultaneously from a single bcftools stream
# 4. The upload runs concurrently with the analysis, so there is no separate upload
#    step at the end (the S3 object itself becomes visible once the upload completes)
Real-world use case: Monitoring long-running analyses without waiting for completion:
# Start a long variant-calling run with real-time uploading
bcftools mpileup -f ref.fa *.bam | \
  bcftools call -m -v 2>/tmp/variants.log | \
  tee >(gzip | aws s3 cp - s3://bucket/live-variants.vcf.gz) | \
  stdbuf -oL grep "PASS" > /tmp/latest_variants.txt
# In another terminal, monitor progress:
watch -n 10 "wc -l /tmp/latest_variants.txt"
Tips for Reliable Download/Upload Pipes
| Scenario | Pipe Command | Notes |
|---|---|---|
| Download + decompress | `wget -O - URL \| gunzip \| process` | `-O -` sends the download to stdout |
| Upload + compress | `process \| gzip \| aws s3 cp - s3://...` | `-` tells `aws s3 cp` to read from stdin |
| S3 download + analyze | `aws s3 cp s3://... - \| process` | `-` tells `aws s3 cp` to write to stdout |
| Streaming tar extract | `wget -O - URL \| tar -xzf - -O member` | `-O`/`--to-stdout` extracts the member to stdout |
| Multiple S3 files | `aws s3 ls ... \| parallel '...'` | One download-and-process job per object |
| Real-time monitoring | `tee >(gzip \| aws s3 cp - s3://...)` | Simultaneous upload and local processing |
| Error handling | `set -o pipefail` | Make S3 failures fail the whole pipeline |
Critical Considerations for Cloud Pipes
1. Network vs. Disk Bottleneck
# If the network is slower than local processing:
# download once, keep a local copy, and process it as many times as needed
wget -O - https://example.com/file.tar.gz | tee local.tar.gz | tar -xzf - -O | process1
tar -xzf local.tar.gz -O | process2
tar -xzf local.tar.gz -O | process3
# Tradeoff: local storage vs. network bandwidth
2. Retry Logic for Failed S3 Operations
# Retry the whole download-and-process pipeline: a partially transferred
# stream cannot be resumed mid-pipe, so rerun it from the start on failure
set -o pipefail
for attempt in 1 2 3; do
  aws s3 cp s3://bucket/file.gz - | gunzip | process && break
  echo "Attempt $attempt failed, retrying..." >&2
  sleep $((2 ** attempt))
done
3. Monitoring Upload Progress
# Use pv to monitor upload speed
process_data | \
pv -br | \
gzip | \
aws s3 cp - s3://bucket/file.gz
# Output: [1.2GB/s] or similar
4. Handling Large Files with Multipart Upload
For files larger than a few GB, AWS S3 multipart uploads are more reliable:
# aws s3 cp performs multipart uploads automatically; when the source is stdin,
# --expected-size (in bytes) helps it choose a part size large enough for very big objects
process_data | \
  gzip | \
  aws s3 cp - s3://bucket/large-file.gz \
    --expected-size 107374182400 \
    --sse AES256 \
    --storage-class GLACIER
# --storage-class GLACIER is optional: cheaper archival storage for results you rarely read
Real-World Bioinformatics Pipes
Example 1: RNA-seq Quality Control Pipeline
# Process RNA-seq reads in one streaming pass:
# 1. Filter low-quality reads
# 2. Count the reads that survive
# 3. Extract the read-length distribution
# 4. Clip adapters, then collapse duplicate sequences
fastq_quality_filter -i reads.fastq -q 20 -p 80 | \
  tee >(wc -l | awk '{print "Valid reads: "$1/4}' > read_count.txt) | \
  tee >(awk 'NR%4==2 {print length}' | \
        sort -n | uniq -c > length_dist.txt) | \
  fastx_clipper -a AGATCGGAAGAGC | \
  fastx_collapser -o collapsed.fasta
Example 2: Variant Calling Pipeline
# Align, fix mate info, sort, and mark duplicates in one streaming pass,
# then call variants from the resulting BAM
bwa mem -t 8 reference.fa reads.fastq | \
  samtools fixmate -m - - | \
  samtools sort - | \
  samtools markdup - marked.bam && \
samtools index marked.bam && \
bcftools mpileup -f reference.fa marked.bam | \
  bcftools call -mv -o variants.vcf
# Memory: bounded (pipe buffers plus samtools sort's in-memory buffer)
# Disk: only marked.bam is written, and it is kept because the variant caller reads it by position
Example 3: FASTA Processing with Decompression
# Decompress, merge multi-line sequences onto single lines, and recompress in one pipeline
zcat genome.fasta.gz | \
  awk '/^>/{if(seq)print seq; print; seq=""; next} {seq=seq $0} END{if(seq)print seq}' | \
  gzip > processed.fasta.gz
# No intermediate uncompressed files
# The entire genome is processed with minimal disk space
Summary: Why Pipes Matter in Bioinformatics
| Problem | Pipe Solution |
|---|---|
| Intermediate files | Stream directly between tools |
| Disk space | No temporary storage needed |
| Memory usage | Constant, independent of data size |
| Processing speed | Single sequential read of data |
| Failure recovery | Failures caught immediately |
| Code readability | Clear, linear data flow |
Key Takeaways
1. Pipes Enable Streaming Data Processing
- Data flows through memory buffers, not disk
- Only ~64KB per pipe sits in RAM at any time
- Processing 1GB or 1TB uses the same constant memory
- This is the fundamental advantage over traditional file-based workflows
2. Kernel-Level Synchronization (Automatic)
- No manual management needed
- Backpressure prevents one slow command from overwhelming others
- If a command fails, the entire pipeline fails cleanly
- Use set -o pipefail in bash to ensure this behavior
3. I/O Reduction is Massive in Bioinformatics
- Traditional alignment: 100GB reads → 500GB SAM → 100GB BAM (1.1TB I/O)
- Pipe alignment: 100GB reads → 100GB BAM (200GB I/O total)
- Savings: 5.5x reduction in disk I/O
- Practical benefit: 2-5x faster execution on disk-bound operations
4. Pipes Work Seamlessly with Cloud Storage
- Download with wget -O - or aws s3 cp s3://... -
- Stream directly from S3 without local copies
- Upload results on-the-fly during processing
- Combine with tee for simultaneous processing and uploading
5. Production Workflows (Nextflow, Sarek) Use Pipes Extensively
- Real example: Sarek's BWA_MEM process pipes bwa mem | samtools sort
- This is the standard pattern for large-scale bioinformatics
- Nextflow adds reliability and error handling on top
- No need to choose between pipes and Nextflow—they work together
6. Practical Patterns You'll Use
| Pattern | Use Case | Command |
|---|---|---|
| Simple chain | Sequential processing | `cmd1 \| cmd2 \| cmd3` |
| Branching | Multiple outputs from one input | `cmd1 \| tee >(cmd2) \| cmd3` |
| Download + process | Remote files without local storage | `wget -O - url \| gunzip \| process` |
| Upload + compress | Direct to cloud storage | `process \| gzip \| aws s3 cp - s3://...` |
| Error safety | Catch failures in pipes | `set -o pipefail` |
| Parallel processing | Scale across cores/machines | `input \| parallel 'process'` |
7. When Pipes Are Most Valuable
✅ Use pipes for:
- Massive datasets (100GB+) where I/O is the bottleneck
- Cloud storage workflows where local disk is expensive
- Real-time monitoring of long-running analyses
- Linear processing chains (one output → next input)
- Development and rapid prototyping
- Nextflow process scripts (internal tool chaining)
❌ Avoid pipes for:
- Highly branching workflows (many different paths)
- Complex error recovery (need to restart from middle)
- Data that needs multiple passes through the same file
- When you need to monitor intermediate results extensively
8. Master These Tools for Production Workflows
# Essential for pipe mastery:
set -o pipefail          # Error handling: fail the pipeline if any stage fails
tee                      # Branching: send one stream to several consumers
pv                       # Progress monitoring
>(cmd)                   # Process substitution: stream into another command
aws s3 cp - s3://...     # Cloud integration via stdin/stdout
parallel                 # GNU parallel: scale across cores and machines
Final Thoughts
Unix pipes are a cornerstone of efficient bioinformatics. They're not just a convenient syntax—they're a fundamental architecture for processing data that would otherwise overwhelm available disk space and I/O capacity.
The real power becomes apparent at scale:
- Small datasets: Pipes save time and code complexity
- Large datasets: Pipes are often the only practical solution
- Cloud workflows: Pipes enable streaming to/from S3 without local copies
- Production pipelines: Pipes are embedded in every major workflow (Nextflow, Snakemake, etc.)
By understanding how pipes work under the hood—how the kernel manages file descriptors, buffers data, and synchronizes backpressure—you can write bioinformatics workflows that are not just faster, but fundamentally more efficient.
Start small: Use pipes in your next single-command analysis. Progress to chaining tools. Eventually, you'll write entire bioinformatics workflows as elegant streaming pipelines—just like Sarek does.
Your future self (and your disk quota administrator) will thank you.