Building a Reproducible GATK Variant Calling Bash Workflow with Pixi (Part 1)
Before you can transform a bash workflow to Nextflow, you need a solid, reproducible bash baseline. This hands-on guide walks through building a complete 16-step GATK variant calling workflow using bash scripts and Pixi for environment management—following GATK best practices with GVCF mode and hard filtering. While this traditional approach works for academic research and proof-of-concept work, scaling to thousands of samples in industry requires the reproducibility and reliability of workflow managers like Nextflow, which we'll cover in Part 2.
Series Overview
This is Part 1 of a two-part series on modernizing bioinformatics workflows:
- Part 1 (This Post): Building a complete 16-step GATK bash workflow as a proof of concept
- Complete GATK best practices: GVCF mode + hard filtering
- Suitable for academic research and pipeline development
- Test data validation (nf-core datasets)
- Establishes baseline for transformation
- Part 2 (Coming Soon): Transforming to Nextflow for industrial-scale processing
- Parallel execution across thousands of samples
- HPC cluster integration (SLURM/PBS/SGE)
- Automated testing with nf-test
- CI/CD pipeline with GitHub Actions
- Real production data at scale
Prerequisites & Related Topics
Before diving into this blog, readers should be familiar with:
- Pixi - New conda era - Essential for understanding the package management setup
- Containers on HPC: From Docker to Singularity and Apptainer - Container fundamentals for reproducibility
- The Evolution of Version Control - Git's Role in Reproducible Bioinformatics (Part 1) - Version control basics
1. The GATK Variant Calling Workflow
1.1. Understanding the NGS101 Workflow
The NGS101 GATK tutorial provides an excellent foundation for variant calling workflows. We've implemented the complete 16-step GATK best practices workflow following Option 2 (GVCF mode + hard filtering), which is appropriate for small to medium-sized datasets:
Complete 16-Step Workflow:
1. Quality Control - FastQC on raw reads
2. Adapter Trimming - Trim Galore for quality and adapter removal
3. Read Alignment - BWA-MEM with read groups
4. Sort BAM - Coordinate sorting
5. Mark Duplicates - GATK MarkDuplicates
6. BQSR, Generate Table - BaseRecalibrator with known sites
7. BQSR, Apply - ApplyBQSR for quality recalibration
8. Alignment QC - CollectAlignmentSummaryMetrics and CollectInsertSizeMetrics
9. Variant Calling - HaplotypeCaller in GVCF mode (`-ERC GVCF`)
10. Genotyping - GenotypeGVCFs to convert GVCF to VCF
11. Hard Filter SNPs - SelectVariants + VariantFiltration for SNPs
12. Hard Filter Indels - SelectVariants + VariantFiltration for indels
13. Merge Variants - MergeVcfs to combine filtered SNPs and indels
14. Annotation - SnpEff for functional annotation
15. Statistics - bcftools stats on raw and filtered variants
16. Visualization - Generate BED and bedGraph files
Why GVCF Mode + Hard Filtering?
- GVCF mode enables efficient multi-sample joint genotyping later
- Hard filtering works well for datasets with <30 samples (vs. VQSR which requires 30+ samples for machine learning)
- Follows GATK best practices for small to medium cohorts
1.2. The Reproducibility and Scalability Challenge
While this workflow works for a single sample on your local machine, several challenges emerge when scaling to production:
For Academic Research (Single Sample Analysis):
- ✅ Bash scripts work well for 1-10 samples
- ✅ Easy to understand and modify
- ✅ Good for method development and testing
- ✅ Suitable for proof-of-concept work
For Industrial Production (Thousands of Samples):
- ❌ Scalability: Cannot efficiently process 1000+ samples in parallel
- ❌ Reproducibility: Environment differences across machines/clusters
- ❌ Reliability: No automated resume on failure
- ❌ Resource Management: Manual resource allocation doesn't scale
- ❌ Validation: No automated QC checks at scale
- ❌ Collaboration: "Works on my machine" syndrome
This is where workflow managers like Nextflow (Part 2) become essential for industrial-scale genomics.
2. Setting Up Reproducible Environments with Pixi
2.1. Why Pixi Over Conda?
Pixi offers several advantages for bioinformatics workflows:
- Fast: 10x faster than conda for environment resolution
- Reproducible: Lock files ensure identical environments
- Simple: A single `pixi.toml` file defines everything
- Cross-platform: Works on Linux, macOS, and Windows
2.2. Creating the Project
Create a new directory and initialize the Pixi project:
mkdir gatk-variant-calling
cd gatk-variant-calling
Create a pixi.toml file:
[workspace]
name = "gatk-variant-calling"
version = "0.1.0"
description = "GATK variant calling workflow with reproducible environment"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"]
[dependencies]
bwa = "*"
samtools = "*"
gatk4 = "*"
fastqc = "*"
trim-galore = "*"
bcftools = "*"
bedtools = "*"
snpeff = "*"
htslib = "*"
r-base = "*"
curl = "*"
Install the environment:
# Install Pixi if not already installed
curl -fsSL https://pixi.sh/install.sh | bash
# Install all dependencies
pixi install
This creates a complete, isolated environment with:
- BWA 0.7.19-r1273 - Read alignment
- SAMtools 1.23 - BAM file manipulation
- GATK 4.6.2.0 - Variant calling and processing
- FastQC v0.12.1 - Quality control
- Trim Galore 0.6.10 - Adapter trimming
- bcftools 1.21 - Variant statistics
- bedtools 2.31.1 - Genomic interval operations
- SnpEff 5.2 - Variant annotation
- htslib 1.21 - VCF compression (bgzip/tabix)
- R 4.4.2 - Insert size histogram generation
2.3. Verifying the Installation
Check that all tools are available:
pixi run bwa 2>&1 | head -3
# Program: bwa (alignment via Burrows-Wheeler transformation)
# Version: 0.7.19-r1273
pixi run samtools --version
# samtools 1.23
pixi run gatk --version
# The Genome Analysis Toolkit (GATK) v4.6.2.0
pixi run fastqc --version
# FastQC v0.12.1
3. Preparing Test Data
3.1. Using nf-core Test Datasets
Rather than downloading full human genome data (which takes hours), we'll use the small test dataset from nf-core/modules. This provides:
- Small paired-end FASTQ files (~16 MB each)
- Partial chr22 reference genome (40 kb)
- Known sites VCF files for BQSR
- Fast execution for testing and validation
3.2. Project Structure
Create the directory structure:
mkdir -p data reference results
Final structure:
gatk-variant-calling/
├── pixi.toml # Environment specification
├── pixi.lock # Locked dependencies
├── gatk_pipeline.sh # Main workflow script
├── data/ # Input FASTQ files
├── reference/ # Reference genome and indices
└── results/ # Pipeline outputs
3.3. Downloading Reference Data
Download the reference genome and known sites:
cd reference
# Reference genome
curl -L -o genome.fasta \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta
curl -L -o genome.fasta.fai \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.fasta.fai
curl -L -o genome.dict \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.dict
# Known sites for BQSR
curl -L -o dbsnp_146.hg38.vcf.gz \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz
curl -L -o dbsnp_146.hg38.vcf.gz.tbi \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz.tbi
curl -L -o mills_and_1000G.indels.vcf.gz \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz
curl -L -o mills_and_1000G.indels.vcf.gz.tbi \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/mills_and_1000G.indels.vcf.gz.tbi
cd ..
3.4. Indexing the Reference Genome
Create BWA index:
pixi run bwa index reference/genome.fasta
This generates the necessary index files (.amb, .ann, .bwt, .pac, .sa).
3.5. Downloading Test FASTQ Files
Download paired-end FASTQ files:
cd data
curl -L -o test_1.fastq.gz \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_1.fastq.gz
curl -L -o test_2.fastq.gz \
https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_2.fastq.gz
cd ..
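Note that the pipeline script below expects `${SAMPLE}_R1.fastq.gz` / `${SAMPLE}_R2.fastq.gz` naming, while the downloads are named `test_1`/`test_2`. One way to reconcile them is to symlink the files under the sample name (assuming the sample name `sample1`):

```shell
# Link the nf-core test files to the naming scheme the pipeline expects
ln -sf test_1.fastq.gz data/sample1_R1.fastq.gz
ln -sf test_2.fastq.gz data/sample1_R2.fastq.gz
```

Symlinks avoid duplicating the FASTQ data while letting the same script run unchanged on real samples later.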
4. Building the Complete 16-Step Bash Workflow
4.1. Script Header and Configuration
Create gatk_pipeline.sh with full GATK best practices:
#!/bin/bash
#
# GATK Variant Calling Pipeline - Complete Best Practices (16 Steps)
# Following GATK Option 2: GVCF mode + hard filtering
#
set -euo pipefail
# Configuration
SAMPLE="${1:-sample1}" # Accept sample name as argument
THREADS=2
DATA_DIR="data"
REF_DIR="reference"
RESULTS_DIR="results"
# Reference files
REFERENCE="${REF_DIR}/genome.fasta"
DBSNP="${REF_DIR}/dbsnp_146.hg38.vcf.gz"
KNOWN_INDELS="${REF_DIR}/mills_and_1000G.indels.vcf.gz"
# Input files
FASTQ_R1="${DATA_DIR}/${SAMPLE}_R1.fastq.gz"
FASTQ_R2="${DATA_DIR}/${SAMPLE}_R2.fastq.gz"
# Output directories
QC_DIR="${RESULTS_DIR}/qc/${SAMPLE}"
TRIMMED_DIR="${RESULTS_DIR}/trimmed/${SAMPLE}"
ALIGNED_DIR="${RESULTS_DIR}/aligned/${SAMPLE}"
VAR_DIR="${RESULTS_DIR}/var/${SAMPLE}"
# Create directories
mkdir -p ${QC_DIR} ${TRIMMED_DIR} ${ALIGNED_DIR} ${VAR_DIR}
Key features:
- Command-line argument support: `bash gatk_pipeline.sh sample1`
- Centralized configuration
- Organized output directory structure
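The later snippets also reference intermediate-file variables (`TRIMMED_R1`, `ALIGNED_BAM`, `RECAL_BAM`, and so on) that belong in this configuration section. A sketch of those definitions, with filenames inferred from the outputs listed in section 5.3 (treat the exact names as assumptions):

```shell
# Intermediate and final files (names assumed to match the outputs in 5.3)
TRIMMED_R1="${TRIMMED_DIR}/${SAMPLE}_R1_val_1.fq.gz"
TRIMMED_R2="${TRIMMED_DIR}/${SAMPLE}_R2_val_2.fq.gz"
ALIGNED_BAM="${ALIGNED_DIR}/${SAMPLE}_aligned.bam"
SORTED_BAM="${ALIGNED_DIR}/${SAMPLE}_sorted.bam"
DEDUP_BAM="${ALIGNED_DIR}/${SAMPLE}_dedup.bam"
METRICS="${ALIGNED_DIR}/${SAMPLE}_duplicate_metrics.txt"
RECAL_TABLE="${ALIGNED_DIR}/${SAMPLE}_recal_data.table"
RECAL_BAM="${ALIGNED_DIR}/${SAMPLE}_recalibrated.bam"
GVCF="${VAR_DIR}/${SAMPLE}.g.vcf.gz"
RAW_VCF="${VAR_DIR}/${SAMPLE}_raw_variants.vcf.gz"
FILTERED_SNP_VCF="${VAR_DIR}/${SAMPLE}_filtered_snps.vcf.gz"
FILTERED_INDEL_VCF="${VAR_DIR}/${SAMPLE}_filtered_indels.vcf.gz"
FINAL_VCF="${VAR_DIR}/${SAMPLE}_filtered.vcf.gz"
ANNOTATED_VCF="${VAR_DIR}/${SAMPLE}_annotated.vcf"   # bgzipped to .gz in step 14
```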
4.2. Steps 1-2: Quality Control and Trimming
# Step 1: FastQC on raw reads
echo "[$(date)] Step 1: Running FastQC on raw reads..."
fastqc -o ${QC_DIR} -t ${THREADS} ${FASTQ_R1} ${FASTQ_R2}
echo "[$(date)] FastQC completed"
# Step 2: Adapter trimming with Trim Galore
echo "[$(date)] Step 2: Adapter trimming with Trim Galore..."
trim_galore \
--paired \
--quality 20 \
--length 50 \
--fastqc \
--output_dir ${TRIMMED_DIR} \
${FASTQ_R1} ${FASTQ_R2}
echo "[$(date)] Adapter trimming completed"
What happens:
- FastQC generates quality reports
- Trim Galore removes adapters and low-quality bases (Q<20)
- Reads shorter than 50bp after trimming are discarded
- Post-trimming FastQC automatically generated
4.3. Steps 3-5: Alignment and Deduplication
# Step 3: Alignment with BWA-MEM
echo "[$(date)] Step 3: Aligning reads with BWA-MEM..."
bwa mem -t ${THREADS} -M \
-R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib" \
${REFERENCE} ${TRIMMED_R1} ${TRIMMED_R2} | \
samtools view -Sb - > ${ALIGNED_BAM}
# Step 4: Sort BAM file
echo "[$(date)] Step 4: Sorting BAM file..."
samtools sort -@ ${THREADS} -o ${SORTED_BAM} ${ALIGNED_BAM}
# Step 5: Mark Duplicates
echo "[$(date)] Step 5: Marking duplicates with GATK..."
gatk MarkDuplicates \
-I ${SORTED_BAM} \
-O ${DEDUP_BAM} \
-M ${METRICS} \
--CREATE_INDEX true
Key points:
- The `-R` flag adds read group information (required by GATK)
- `MarkDuplicates` identifies PCR/optical duplicates
- Duplicates are marked but not removed (for QC purposes)
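The `\t` escapes in the `-R` string are expanded to tabs by bwa itself when it writes the BAM header. A quick way to preview the final read-group line (a sanity check, not part of the pipeline):

```shell
SAMPLE="sample1"
RG="@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:${SAMPLE}_lib"
# %b expands the \t escapes the same way bwa does
printf '%b\n' "$RG"
# @RG	ID:sample1	SM:sample1	PL:ILLUMINA	LB:sample1_lib
```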
4.4. Steps 6-8: BQSR and Quality Metrics
# Step 6: Generate BQSR recalibration table
echo "[$(date)] Step 6: Generating BQSR recalibration table..."
gatk BaseRecalibrator \
-I ${DEDUP_BAM} \
-R ${REFERENCE} \
--known-sites ${DBSNP} \
--known-sites ${KNOWN_INDELS} \
-O ${RECAL_TABLE}
# Step 7: Apply BQSR
echo "[$(date)] Step 7: Applying BQSR..."
gatk ApplyBQSR \
-I ${DEDUP_BAM} \
-R ${REFERENCE} \
--bqsr-recal-file ${RECAL_TABLE} \
-O ${RECAL_BAM}
# Step 8: Collect alignment quality metrics
echo "[$(date)] Step 8: Collecting alignment quality metrics..."
gatk CollectAlignmentSummaryMetrics \
-R ${REFERENCE} \
-I ${RECAL_BAM} \
-O ${QC_DIR}/${SAMPLE}_alignment_summary.txt
gatk CollectInsertSizeMetrics \
-I ${RECAL_BAM} \
-O ${QC_DIR}/${SAMPLE}_insert_size_metrics.txt \
-H ${QC_DIR}/${SAMPLE}_insert_size_histogram.pdf
BQSR purpose:
- Corrects systematic errors in base quality scores
- Uses known variants (dbSNP, Mills indels) as truth set
- Improves variant calling accuracy
4.5. Steps 9-10: GVCF Generation and Genotyping
# Step 9: Variant calling with HaplotypeCaller in GVCF mode
echo "[$(date)] Step 9: Calling variants with HaplotypeCaller in GVCF mode..."
gatk HaplotypeCaller \
-R ${REFERENCE} \
-I ${RECAL_BAM} \
-O ${GVCF} \
-ERC GVCF \
--dbsnp ${DBSNP}
# Step 10: Genotyping GVCF to VCF
echo "[$(date)] Step 10: Genotyping GVCF to VCF..."
gatk GenotypeGVCFs \
-R ${REFERENCE} \
-V ${GVCF} \
-O ${RAW_VCF}
Why GVCF mode?
- `-ERC GVCF` generates reference confidence records
- Enables efficient joint genotyping later (for multi-sample cohorts)
- GenotypeGVCFs converts GVCF to standard VCF
4.6. Steps 11-13: Hard Filtering
# Step 11: Hard filtering SNPs
echo "[$(date)] Step 11: Hard filtering SNPs..."
# Select SNPs
gatk SelectVariants \
-R ${REFERENCE} \
-V ${RAW_VCF} \
--select-type-to-include SNP \
-O ${VAR_DIR}/${SAMPLE}_raw_snps.vcf.gz
# Apply filters (RELAXED for test data)
# NOTE: For production WGS (30x+ coverage), use stricter thresholds:
# QD < 2.0, QUAL < 30.0, SOR > 3.0, FS > 60.0, MQ < 40.0
gatk VariantFiltration \
-R ${REFERENCE} \
-V ${VAR_DIR}/${SAMPLE}_raw_snps.vcf.gz \
-O ${FILTERED_SNP_VCF} \
--filter-expression "QD < 1.0" --filter-name "QD1" \
--filter-expression "QUAL < 10.0" --filter-name "QUAL10" \
--filter-expression "SOR > 10.0" --filter-name "SOR10" \
--filter-expression "FS > 100.0" --filter-name "FS100" \
--filter-expression "MQ < 20.0" --filter-name "MQ20"
# Step 12: Hard filtering Indels
echo "[$(date)] Step 12: Hard filtering indels..."
# Select Indels
gatk SelectVariants \
-R ${REFERENCE} \
-V ${RAW_VCF} \
--select-type-to-include INDEL \
-O ${VAR_DIR}/${SAMPLE}_raw_indels.vcf.gz
# Apply filters (RELAXED for test data)
# NOTE: For production WGS (30x+ coverage), use stricter thresholds:
# QD < 2.0, QUAL < 30.0, FS > 200.0
gatk VariantFiltration \
-R ${REFERENCE} \
-V ${VAR_DIR}/${SAMPLE}_raw_indels.vcf.gz \
-O ${FILTERED_INDEL_VCF} \
--filter-expression "QD < 1.0" --filter-name "QD1" \
--filter-expression "QUAL < 10.0" --filter-name "QUAL10" \
--filter-expression "FS > 300.0" --filter-name "FS300"
# Step 13: Merge filtered variants
echo "[$(date)] Step 13: Merging filtered variants..."
gatk MergeVcfs \
-I ${FILTERED_SNP_VCF} \
-I ${FILTERED_INDEL_VCF} \
-O ${FINAL_VCF}
Filter thresholds:
- Test data (low coverage ~1-2x): Relaxed filters shown above
- Production data (30x coverage): Use strict GATK recommended thresholds
- Filters mark variants as "FILTER" but keep them in VCF
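To see how many records each filter caught, tally the FILTER column of the merged VCF (output path assumed from the script's layout):

```shell
# Count records per FILTER value (PASS, QD1, QUAL10, ...)
zcat results/var/sample1/sample1_filtered.vcf.gz \
    | grep -v '^#' | cut -f7 | sort | uniq -c | sort -rn
```

Each failed filter's name appears in column 7, so this gives an at-a-glance breakdown of why variants were flagged.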
4.7. Steps 14-16: Annotation, Statistics, and Visualization
# Step 14: Functional annotation with SnpEff
echo "[$(date)] Step 14: Annotating variants with SnpEff..."
gunzip -c ${FINAL_VCF} > ${VAR_DIR}/${SAMPLE}_filtered.vcf
snpeff -v GRCh38.mane.1.0.refseq \
${VAR_DIR}/${SAMPLE}_filtered.vcf \
> ${ANNOTATED_VCF}
bgzip ${ANNOTATED_VCF}
tabix -p vcf ${ANNOTATED_VCF}.gz
# Step 15: Variant statistics
echo "[$(date)] Step 15: Generating variant statistics..."
bcftools stats ${RAW_VCF} > ${VAR_DIR}/${SAMPLE}_variant_stats_raw.txt
bcftools stats -f "PASS" ${FINAL_VCF} > ${VAR_DIR}/${SAMPLE}_variant_stats_filtered.txt
# Step 16: Generate visualization files
echo "[$(date)] Step 16: Generating visualization files..."
# Convert VCF to BED
zcat ${FINAL_VCF} | grep -v "^#" | \
awk '{print $1"\t"$2-1"\t"$2"\t"$3}' \
> ${VAR_DIR}/${SAMPLE}_variants.bed
# Create bedGraph for depth
samtools depth ${RECAL_BAM} | \
awk '{print $1"\t"$2-1"\t"$2"\t"$3}' \
> ${VAR_DIR}/${SAMPLE}_depth.bedgraph
Final steps:
- SnpEff adds functional annotations (gene impact, amino acid changes)
- bcftools stats generates comprehensive variant statistics
- BED and bedGraph files for genome browser visualization (IGV, UCSC)
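The awk one-liner in step 16 shifts each VCF position into BED's 0-based, half-open coordinates. A worked example on a single record:

```shell
# A VCF record at position 10510 becomes the BED interval [10509, 10510)
printf 'chr22\t10510\trs123\tA\tG\n' | awk '{print $1"\t"$2-1"\t"$2"\t"$3}'
# chr22	10509	10510	rs123
```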
4.8. Summary and Quality Control
echo "=========================================="
echo "Pipeline Completed Successfully!"
echo "=========================================="
echo "Results:"
echo " - QC reports: ${QC_DIR}/"
echo " - Trimmed reads: ${TRIMMED_DIR}/"
echo " - Final BAM: ${RECAL_BAM}"
echo " - GVCF: ${GVCF}"
echo " - Raw Variants VCF: ${RAW_VCF}"
echo " - Filtered VCF: ${FINAL_VCF}"
echo " - Annotated VCF: ${ANNOTATED_VCF}.gz"
echo " - Variant statistics: ${VAR_DIR}/"
echo
echo "Quick Quality Control Summary:"
echo "------------------------------"
echo "Alignment metrics:"
samtools flagstat ${RECAL_BAM}
echo
echo "Duplication rate:"
grep "^LIBRARY" -A 1 ${METRICS} | tail -1 | awk '{print $9}'
echo
echo "Variant counts (Raw):"
echo " SNPs:" $(bcftools view -v snps ${RAW_VCF} | grep -v "^#" | wc -l)
echo " Indels:" $(bcftools view -v indels ${RAW_VCF} | grep -v "^#" | wc -l)
echo
echo "Variant counts (Filtered - PASS only):"
echo " SNPs:" $(bcftools view -f "PASS" -v snps ${FINAL_VCF} | grep -v "^#" | wc -l)
echo " Indels:" $(bcftools view -f "PASS" -v indels ${FINAL_VCF} | grep -v "^#" | wc -l)
echo "=========================================="
Key bash features:
- `set -euo pipefail`: Exit on errors, undefined variables, and pipe failures
- Centralized configuration variables
- Clear file path definitions
5. Running the Pipeline
5.1. Pipeline Features
The complete 16-step gatk_pipeline.sh script provides:
✅ Complete GATK best practices (Option 2: GVCF + hard filtering)
✅ Proper error handling with set -euo pipefail
✅ Timestamped logging for performance tracking
✅ Sample name argument support: bash gatk_pipeline.sh sample1
✅ Organized output with separate directories for QC, trimmed reads, alignments, and variants
✅ Comprehensive QC with alignment metrics, insert size histograms (R), and variant statistics
✅ Automated summary showing alignment stats, duplication rate, and variant counts
5.2. Execute the Workflow
# Make script executable
chmod +x gatk_pipeline.sh
# Run pipeline with sample name
pixi run bash gatk_pipeline.sh sample1 2>&1 | tee pipeline_sample1.log
Execution flow:
- Pixi ensures correct tool versions are used
- Each of 16 steps logs timestamp and status
- Output captured in `pipeline_sample1.log`
- Pipeline exits on first error (due to `set -euo pipefail`)
5.3. Expected Output
==========================================
GATK Variant Calling Pipeline - Complete
Sample: sample1
Based on NGS101 Best Practices
==========================================
[Date] Step 1: Running FastQC on raw reads...
[Date] FastQC completed
[Date] Step 2: Adapter trimming with Trim Galore...
[Date] Adapter trimming completed
[Date] Step 3: Aligning reads with BWA-MEM...
[Date] Alignment completed
...
[Date] Step 16: Creating visualization files...
[Date] Visualization files created
==========================================
Pipeline Completed Successfully!
==========================================
Results:
- QC reports: results/qc/sample1/
- Trimmed reads: results/trimmed/sample1/
- Final BAM: results/aligned/sample1/sample1_recalibrated.bam
- GVCF: results/var/sample1/sample1.g.vcf.gz
- Raw Variants VCF: results/var/sample1/sample1_raw_variants.vcf.gz
- Filtered VCF: results/var/sample1/sample1_filtered.vcf.gz
- Annotated VCF: results/var/sample1/sample1_annotated.vcf.gz
- Variant statistics: results/var/sample1/
Quick Quality Control Summary:
------------------------------
Alignment metrics:
16946 + 0 in total (QC-passed reads + QC-failed reads)
274 + 0 primary mapped (1.62%)
5 + 0 duplicates
Duplication rate: 0.018248
Variant counts (Raw):
SNPs: 0
Indels: 0
Variant counts (Filtered - PASS only):
SNPs: 0
Indels: 0
==========================================
Total runtime: ~3 minutes for this small test dataset
5.4. Understanding the Test Data Results
Important Note: The test data produces 0 variants, which is expected:
- Test data purpose: Validates pipeline mechanics, not biological discovery
- Coverage: Extremely low (~1-2x) - insufficient for confident variant calling
- Data source: nf-core test datasets designed for speed, not variant content
- GVCF output: Contains reference confidence blocks (proves HaplotypeCaller works)
For real variant calling, use:
- Production WGS data (30x+ coverage)
- Known samples with variants (e.g., NA12878 from 1000 Genomes)
- Exome sequencing data with targeted regions
6. Analyzing the Results
6.1. Directory Structure
The pipeline creates an organized output structure:
results/
├── qc/
│ └── sample1/
│ ├── sample1_R1_fastqc.html # Pre-trim QC
│ ├── sample1_R2_fastqc.html
│ ├── sample1_R1_val_1_fastqc.html # Post-trim QC
│ ├── sample1_R2_val_2_fastqc.html
│ ├── sample1_alignment_summary.txt # Alignment metrics
│ └── sample1_insert_size_histogram.pdf # Insert size plot (R)
├── trimmed/
│ └── sample1/
│ ├── sample1_R1_val_1.fq.gz # Trimmed reads
│ ├── sample1_R2_val_2.fq.gz
│ └── *_trimming_report.txt # Trim Galore reports
├── aligned/
│ └── sample1/
│ ├── sample1_recalibrated.bam # Final BAM
│ ├── sample1_recalibrated.bai # BAM index
│ ├── sample1_duplicate_metrics.txt # Duplication stats
│ ├── sample1_recal_data.table # BQSR table
│ └── sample1_coverage.bedgraph # Coverage track
└── var/
└── sample1/
├── sample1.g.vcf.gz # GVCF (reference blocks)
├── sample1_raw_variants.vcf.gz # Raw VCF
├── sample1_filtered.vcf.gz # Filtered VCF
├── sample1_annotated.vcf.gz # SnpEff annotations
├── sample1_variants.bed # BED for IGV
├── sample1_variant_stats_raw.txt # bcftools stats
└── sample1_variant_stats_filtered.txt
6.2. Quality Control Reports
View the FastQC HTML reports:
# Pre-trimming QC
xdg-open results/qc/sample1/sample1_R1_fastqc.html
# Post-trimming QC
xdg-open results/qc/sample1/sample1_R1_val_1_fastqc.html
Key metrics to check:
- Per base sequence quality: Should be mostly green (>28)
- Per sequence quality scores: Peak should be at high quality
- Adapter content: Should drop significantly after trimming
- Duplication levels: Varies by library type
Trimming impact:
# Check trimming report
cat results/trimmed/sample1/sample1_R1.fastq.gz_trimming_report.txt
Example output:
Total reads processed: 266,736
Quality-trimmed: 22,074,598 bp (82.7%)
Reads with adapters: 662 (0.2%)
Reads written (passing filters): 266,736 (100.0%)
Note: For test data, 96.82% of reads are removed by length filter (shorter than 50bp after trimming), which is expected for low-quality test data.
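The 96.82% figure follows directly from the read counts reported here and in section 9.1 (266,736 raw pairs, 8,472 surviving):

```shell
# Fraction of read pairs removed by the 50 bp length filter
awk 'BEGIN{printf "%.2f%%\n", (266736 - 8472) / 266736 * 100}'
# 96.82%
```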
6.3. Alignment Statistics
Check the final BAM statistics:
pixi run samtools flagstat results/aligned/sample1/sample1_recalibrated.bam
Test data results:
16,946 total reads
274 mapped (1.62%)
5 duplicates (1.82%)
Low mapping rate (1.62%) is expected—test data uses a tiny reference (40kb chr22 region) with reads from across the genome.
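The 1.62% mapping rate can be sanity-checked directly from the flagstat counts:

```shell
# primary mapped reads / total reads, as a percentage
awk 'BEGIN{printf "%.2f%%\n", 274 / 16946 * 100}'
# 1.62%
```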
6.4. GVCF Output
The GVCF contains reference confidence blocks, even when no variants are called:
pixi run bash -c "zcat results/var/sample1/sample1.g.vcf.gz | grep -v '^##' | head -10"
Example GVCF output:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr22 1 . A <NON_REF> . . END=12925 GT:DP:GQ 0/0:0:0
chr22 12926 . C <NON_REF> . . END=12928 GT:DP:GQ 0/0:1:3
chr22 12929 . G <NON_REF> . . END=12947 GT:DP:GQ 0/0:2:6
Interpretation:
<NON_REF>= reference call (no variant)GT:0/0= homozygous referenceDP= depth (0-2x, very low)GQ= genotype quality
6.5. Variant Statistics
Even with 0 variants, bcftools stats provides useful QC:
cat results/var/sample1/sample1_variant_stats_raw.txt
Key sections:
- SN (Summary Numbers): Total records, SNPs, indels
- TSTV (Transition/Transversion ratio): Should be ~2.0-2.1 for WGS
- DP (Depth distribution): Shows coverage distribution
- QUAL (Quality distribution): Quality score distribution
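Each section of a bcftools stats report is keyed by its abbreviation in the first column, so individual sections can be pulled out with grep (stats path assumed from the script):

```shell
# Extract just the summary-number (SN) lines from the report
grep '^SN' results/var/sample1/sample1_variant_stats_raw.txt
```

Swap `^SN` for `^TSTV`, `^DP`, or `^QUAL` to inspect the other sections.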
6.6. Visualization in IGV
Load the results in Integrative Genomics Viewer (IGV):
# Load reference
File > Load Genome from File > reference/genome.fasta
# Load BAM
File > Load from File > results/aligned/sample1/sample1_recalibrated.bam
# Load coverage track
File > Load from File > results/aligned/sample1/sample1_coverage.bedgraph
# Load variants (if present)
File > Load from File > results/var/sample1/sample1_filtered.vcf.gz
This allows visual inspection of:
- Read alignments
- Coverage depth
- Variant positions
- Base quality scores
7. Understanding the Limitations: Why Nextflow Matters
7.1. Performance Analysis
Let's analyze the execution time per step:
| Step | Duration | Percentage |
|---|---|---|
| FastQC | 11s | 4.8% |
| Trim Galore | 23s | 10.0% |
| Alignment (BWA) | 2s | 0.9% |
| Sorting | 1s | 0.4% |
| MarkDuplicates | 16s | 7.0% |
| BaseRecalibrator | 19s | 8.3% |
| ApplyBQSR | 13s | 5.7% |
| Alignment QC | 75s | 32.8% |
| HaplotypeCaller | 18s | 7.9% |
| GenotypeGVCFs | 15s | 6.6% |
| SNP/Indel Filtering | 33s | 14.4% |
| Annotation | 1s | 0.4% |
| Statistics | 1s | 0.4% |
| Visualization | 1s | 0.4% |
| Total | ~229s | 100% |
Scaling calculations for industrial genomics:
- 1 sample: 3 minutes (acceptable for testing)
- 10 samples (serial): 30 minutes (still manageable)
- 100 samples (serial): 5 hours (problematic)
- 1,000 samples (serial): 50 hours = 2.1 days (unacceptable)
- 10,000 samples (serial): 500 hours = 20.8 days (impossible)
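The serial-scaling arithmetic above is plain multiplication; a quick loop reproduces it:

```shell
# Serial wall-clock estimate: N samples at ~3 minutes each
for n in 1 10 100 1000 10000; do
    printf '%5d samples: %5d min (%.1f h)\n' \
        "$n" "$((n * 3))" "$(awk -v n="$n" 'BEGIN{print n * 3 / 60}')"
done
```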
7.2. The Serial Execution Problem
Current bash approach (Part 1):
Sample 1: [================] 3 min
Sample 2: [================] 3 min
Sample 3: [================] 3 min
...
Total: N samples × 3 minutes
What industry needs (Part 2 with Nextflow):
Sample 1: [================]
Sample 2: [================]
Sample 3: [================]
Sample 4: [================]
...
Sample 100: [================]
Total: ~3 minutes (with sufficient compute resources)
The difference:
- Academic research: Bash scripts work fine for 1-10 samples
- Clinical diagnostics: Need 100s of samples processed overnight
- Population genomics: Require 1000s-10000s of samples in parallel
- Industrial scale: Nextflow becomes essential
7.3. Resource Management Challenges
The bash script has hard-coded resource settings:
THREADS=2 # Fixed for all steps
Reality for production WGS data (30x coverage):
| Step | Optimal CPUs | Memory | Time (per sample) |
|---|---|---|---|
| Trim Galore | 4 | 2 GB | 10-15 min |
| BWA MEM | 16 | 8 GB | 2-3 hours |
| MarkDuplicates | 1 | 16 GB | 30-45 min |
| BQSR | 1 | 8 GB | 20-30 min |
| HaplotypeCaller | 2 | 16 GB | 4-6 hours |
| GenotypeGVCFs | 2 | 32 GB | 1-2 hours |
| Hard Filtering | 1 | 4 GB | 10-15 min |
What bash can't do:
- ❌ Dynamic resource allocation per step
- ❌ Memory limits to prevent OOM kills
- ❌ CPU scheduling across hundreds of samples
- ❌ Automatic retry with more resources on failure
- ❌ HPC job submission with appropriate requests
- ❌ Cloud cost optimization
7.4. Error Handling and Reproducibility
Current error handling:
set -euo pipefail
What it does:
✅ Exit on first error
✅ Catch undefined variables
✅ Detect pipe failures
What's missing for production:
- ❌ Resume from failure point (must restart from beginning)
- ❌ Automated retry logic for transient failures
- ❌ Detailed error context and stack traces
- ❌ Email/Slack notifications on failure
- ❌ Automatic generation of execution reports
- ❌ Provenance tracking (what was run, when, with what parameters)
Real-world scenario:
Sample 1-50: Complete ✅
Sample 51: Failed at HaplotypeCaller (OOM) ❌
Current solution: Restart entire batch from sample 1
Nextflow solution: Resume from sample 51 with more memory
7.5. Multi-Sample Coordination
Bash challenges for 1000+ samples:
# This doesn't scale:
for sample in $(cat samples.txt); do
bash gatk_pipeline.sh $sample 2>&1 | tee logs/${sample}.log
done
Problems:
- Sequential execution (1000 × 3 min = 50 hours)
- No dependency management
- All logs mixed together
- Cannot track overall progress
- Difficult to identify failed samples
- Manual result aggregation
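A common stopgap is crude parallelism with `xargs -P` (or GNU parallel); a sketch, assuming a `logs/` directory already exists. It improves throughput but solves none of the dependency-management, resume, or aggregation problems:

```shell
# Run up to 4 samples concurrently, one log file per sample
xargs -P 4 -I{} sh -c 'bash gatk_pipeline.sh "$1" > "logs/$1.log" 2>&1' _ {} < samples.txt
```

Failed samples still have to be hunted down by grepping exit messages out of individual logs, which is exactly the kind of bookkeeping a workflow manager automates.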
What industry needs (Nextflow in Part 2):
- Parallel sample processing
- Centralized sample sheet management
- Automatic result aggregation
- MultiQC for unified QC reports
- Failed sample identification and retry
- Clear progress tracking
7.6. The "Works on My Machine" Problem
Even with Pixi, portability issues remain:
Different compute environments:
# Laptop (development)
THREADS=2
MEMORY=8GB
# HPC cluster (production)
THREADS=32
MEMORY=128GB
SLURM submission required
# Cloud (AWS Batch)
Spot instances with variable resources
S3 for input/output
Containerization mandatory
Missing in bash approach:
- ❌ Execution profiles for different environments
- ❌ Container support (Docker/Singularity)
- ❌ HPC scheduler integration (SLURM/PBS/SGE)
- ❌ Cloud platform support (AWS Batch, Google Cloud)
- ❌ Automatic path resolution across systems
7.7. Validation and Testing
Current approach:
- Manual inspection of outputs
- No automated regression testing
- No unit tests for individual steps
- No integration tests for end-to-end workflow
What production requires:
- ✅ Automated nf-test for each process
- ✅ Integration tests with known datasets
- ✅ Continuous integration (GitHub Actions)
- ✅ Output validation against gold standards
- ✅ Performance regression detection
8. When to Use Bash vs. Nextflow
8.1. Bash is Appropriate For:
✅ Academic Research
- Single-sample analysis
- Method development and prototyping
- Learning bioinformatics workflows
- Proof-of-concept work
- 1-10 samples maximum
✅ Quick Testing
- Validating tool installations
- Testing parameter combinations
- Debugging individual steps
- Creating baseline results
✅ Educational Purposes
- Teaching workflow concepts
- Understanding tool interactions
- Demonstrating best practices
- Creating reproducible examples
8.2. Nextflow is Required For:
✅ Industrial-Scale Production
- 100+ samples processed routinely
- Clinical diagnostics pipelines
- Population genomics studies
- Pharmaceutical research
- Biotech companies
✅ HPC/Cloud Environments
- SLURM/PBS/SGE cluster integration
- AWS Batch / Google Cloud execution
- Dynamic resource management
- Cost optimization on cloud platforms
✅ Team Collaboration
- Multiple analysts using same pipeline
- Standardized workflows across organization
- Version-controlled pipelines
- Automated testing and validation
✅ Production Requirements
- Resume capability after failures
- Automated QC and reporting
- Audit trails and provenance
- Email/Slack notifications
- Integration with LIMS systems
8.3. The Transition Point
When to transition from bash to Nextflow:
| Factor | Bash OK | Nextflow Essential |
|---|---|---|
| Sample count | 1-10 | >20 |
| Execution frequency | Occasional | Daily/Weekly |
| Team size | Individual | 3+ people |
| Compute environment | Laptop/Workstation | HPC/Cloud |
| Resume requirement | No | Yes |
| Validation requirement | Manual | Automated |
| Budget impact | Low | High (efficiency) |
9. Lessons Learned and Key Takeaways
9.1. What We Successfully Built
Complete 16-Step GATK Pipeline:
- ✅ Full GATK best practices (GVCF mode + hard filtering)
- ✅ Reproducible environment with Pixi
- ✅ Comprehensive QC at multiple stages
- ✅ Proper error handling with `set -euo pipefail`
- ✅ Clear directory structure and organized outputs
- ✅ Automated summary with key metrics
- ✅ Fast execution (~3 minutes for test data)
- ✅ Single file to maintain and understand
Baseline Metrics Established:
Execution metrics:
- Single sample runtime: ~180 seconds (3 minutes)
- Total steps: 16
- Tool invocations: 30+
- Output files: 25+
Resource usage (test data):
- Peak memory: ~4 GB (GATK steps)
- CPU time: ~45 seconds (BWA, with low coverage)
- Disk space: ~200 MB (intermediate files)
Data metrics (nf-core test data):
- Input reads: 266,736 raw, 8,472 after trimming
- Mapped reads: 274 (1.62%)
- Duplicates: 5 (1.82%)
- Final variants: 0 (expected for test data)
9.2. What Worked Well
Pixi Environment Management:
- ✅ Reproducible tool versions across systems
- ✅ Fast environment creation (~30-60 seconds)
- ✅ No conflicts with system packages
- ✅ Easy to share via `pixi.toml` and `pixi.lock`
- ✅ Single command installation: `pixi install`
Structured Bash Approach:
- ✅ Clear step-by-step execution logic
- ✅ Easy to read and understand
- ✅ Timestamp logging for performance analysis
- ✅ Single script file for maintenance
- ✅ Command-line argument support
- ✅ Good for learning and teaching
Test Data Strategy:
- ✅ Fast execution for pipeline validation
- ✅ Publicly available (nf-core datasets)
- ✅ Tests all workflow steps
- ✅ Perfect for CI/CD testing
- ✅ Doesn't require HPC resources
Documentation:
- ✅ Clear README with setup instructions
- ✅ Automated download script
- ✅ Git version control
- ✅ `.gitignore` for data directories
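A `.gitignore` along these lines keeps large data and intermediate files out of version control (entries are illustrative; the repository's actual file may differ):

```
# Illustrative .gitignore -- exclude data, results, and environments
data/
results/
*.bam
*.bai
pipeline.log
.pixi/
```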
9.3. Critical Limitations (Addressed in Part 2)
For Industrial-Scale Genomics:
❌ No Parallelization
- Cannot process 100+ samples simultaneously
- Serial execution doesn't scale
- Solution in Part 2: Nextflow channels and parallel processes
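To make the scaling problem concrete: the only multi-sample option in plain bash is a serial loop, as in this sketch (sample names are hypothetical; the per-sample runtime is taken from the baseline above). Total runtime grows linearly with sample count:

```shell
#!/usr/bin/env bash
# Serial processing: each sample waits for the previous one to finish,
# so 100 samples take ~100x the single-sample runtime.
set -euo pipefail

runtime_per_sample=3   # minutes, from the baseline metrics above

samples=(sample1 sample2 sample3)
total=0
for s in "${samples[@]}"; do
    # ./gatk_pipeline.sh "$s"   # real invocation would go here
    total=$((total + runtime_per_sample))
done
echo "Estimated serial runtime: ${total} minutes for ${#samples[@]} samples"
```

At 3 minutes per sample, 100 samples would take roughly 300 minutes serially, which is exactly the gap Nextflow's parallel processes close in Part 2.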
❌ Manual Resource Management
- Hard-coded thread counts
- No memory limits
- Solution in Part 2: Per-process resource directives
❌ No Resume Capability
- Must restart from beginning on failure
- Wasted computation on transient errors
- Solution in Part 2: Nextflow's `-resume` flag
❌ Hard-Coded Paths
- Sample names embedded in script
- Reference paths not parameterized
- Solution in Part 2: Configuration files and `params`
❌ No Automated Testing
- Manual verification of outputs
- No regression detection
- Solution in Part 2: nf-test framework
❌ Limited Portability
- Assumes local file system
- No HPC scheduler integration
- No cloud platform support
- Solution in Part 2: Execution profiles and containers
❌ No Production Features
- No email notifications
- No execution reports
- No MultiQC integration
- Solution in Part 2: Built-in Nextflow features
9.4. The Value of This Baseline
Why this Part 1 matters:
- Proof of Concept: Validates the workflow logic is scientifically correct
- Learning Tool: Understand what each step does before abstracting to Nextflow
- Comparison Standard: Establishes baseline metrics for Part 2 performance comparison
- Rapid Prototyping: Test parameter changes quickly
- Educational Resource: Perfect for teaching NGS analysis workflows
Who should use this bash approach:
- Graduate students learning bioinformatics
- Researchers processing <10 samples
- Method developers testing new algorithms
- Anyone creating proof-of-concept pipelines
- Teachers demonstrating variant calling workflows
Who needs Part 2 (Nextflow):
- Clinical diagnostic labs
- Pharmaceutical companies
- Biotech companies
- Large research consortia
- Any team processing 20+ samples regularly
10. Preparing for Part 2: Nextflow Transformation
10.1. What We've Established
With this bash implementation, we now have the essential foundation for Part 2:
✅ Validated Workflow Logic
- All 16 steps execute correctly
- Proper GATK best practices (GVCF + hard filtering)
- Scientific correctness confirmed
✅ Baseline Metrics
- Execution time: 180 seconds per sample
- Resource usage: 4GB RAM, 2 CPUs
- Output file sizes and counts documented
✅ Test Data Strategy
- nf-core test datasets validated
- Expected outputs documented (including 0 variants)
- Fast execution for CI/CD
✅ Clear Requirements
- Known limitations documented
- Scalability challenges identified
- Production requirements defined
10.2. Success Criteria for Part 2
The Nextflow transformation must achieve:
Functional Parity:
- ✅ Produce identical results (validated via MD5/checksums)
- ✅ Pass all QC checks from Part 1
- ✅ Generate same output file structure
Scalability Improvements:
- ✅ Process 100 samples in <10 minutes (vs. 300 minutes with bash)
- ✅ Parallel execution across samples
- ✅ Dynamic resource allocation per process
- ✅ Resume from any failure point
Production Features:
- ✅ HPC scheduler integration (SLURM/PBS/SGE)
- ✅ Container support (Docker/Singularity)
- ✅ Configuration profiles (local, HPC, AWS, Azure)
- ✅ Execution reports with resource usage
- ✅ MultiQC for unified QC reports
Quality Assurance:
- ✅ Pass nf-core lint standards
- ✅ 100% test coverage with nf-test
- ✅ Automated CI/CD validation (GitHub Actions)
- ✅ Complete documentation (README, usage, parameters)
- ✅ Example datasets and expected outputs
Real Production Data:
- ✅ Test with 30x WGS samples (not just test data)
- ✅ Validate on known samples (NA12878, etc.)
- ✅ Compare variant calls to gold standards
- ✅ Document filtering thresholds for real data
10.3. Key Transformations in Part 2
From Bash to Nextflow:
| Bash Concept | Nextflow Equivalent | Benefit |
|---|---|---|
| Sequential steps | Parallel processes | 100x faster execution |
| Hard-coded paths | `params` and channels | Parameterization |
| `THREADS=2` | Process directives | Dynamic resource allocation |
| No resume | `-resume` flag | Restart from failure |
| Manual loops | Channels and operators | Automatic parallelization |
| Exit on error | Error strategies | Automatic retry |
| Local execution | Execution profiles | HPC/Cloud support |
| No testing | nf-test | Automated validation |
| Manual QC | MultiQC integration | Unified reports |
| Single environment | Containers | Complete reproducibility |
10.4. Preview: What Part 2 Will Cover
Core Nextflow Development:
- Converting bash steps to Nextflow processes
- Implementing channels for data flow
- Creating modular, reusable processes
- Adding process-specific resource directives
- Implementing error handling strategies
Testing and Validation:
- Setting up nf-test for unit tests
- Creating integration tests
- Validating outputs match bash baseline
- Establishing regression testing
- CI/CD with GitHub Actions
Production Deployment:
- Creating execution profiles (local, SLURM, AWS, etc.)
- Containerization with Docker/Singularity
- Optimizing resource allocation
- Implementing monitoring and logging
- Adding MultiQC for comprehensive QC
Real Production Data:
- Testing with 30x WGS samples
- Adjusting filter thresholds for real data
- Benchmarking performance at scale
- Validating against known gold standards
- Documenting production deployment
10.5. Timeline and Expectations
Part 2 Development Plan:
Week 1-2: Core Nextflow Implementation
- Convert all 16 steps to Nextflow processes
- Implement channels and workflow logic
- Test with nf-core test data
Week 3-4: Testing and Quality
- Implement nf-test suite
- Add CI/CD pipelines
- Pass nf-core lint checks
- Documentation
Week 5-6: Production Features
- Add execution profiles
- Container support
- Resource optimization
- MultiQC integration
Week 7-8: Real Data Validation
- Test with 30x WGS samples
- Validate against gold standards
- Performance benchmarking
- Final documentation
11. Conclusion
11.1. What We Accomplished in Part 1
In this post, we built a complete, production-ready 16-step GATK variant calling pipeline using traditional bash scripting:
✅ Complete GATK Best Practices
- GVCF mode for efficient joint genotyping
- Hard filtering appropriate for small datasets
- BQSR for base quality recalibration
- Comprehensive QC at every stage
✅ Reproducible Environment
- Pixi for consistent tool versions
- Single-command setup
- Works across Linux systems
✅ Well-Organized Workflow
- 16 clearly defined steps
- Automated QC summaries
- Structured output directories
- Comprehensive logging
✅ Validated with Test Data
- nf-core test datasets
- Fast execution (~3 minutes)
- Expected outputs documented
- Ready for CI/CD integration
11.2. Who Should Use This Bash Approach
Perfect for:
- 👨🎓 Graduate students learning bioinformatics
- 🔬 Researchers processing 1-10 samples
- 🧪 Method developers prototyping new algorithms
- 📚 Educators teaching NGS analysis
- 🚀 Proof-of-concept projects
Example use cases:
- Academic research projects with small sample sizes
- Testing GATK parameters and filter thresholds
- Learning how variant calling workflows operate
- Creating baseline results for comparison
- Developing new bioinformatics methods
11.3. When You Need Nextflow (Part 2)
Transition to Nextflow when you have:
🏭 Industrial-Scale Requirements:
- Processing 100+ samples regularly
- Clinical diagnostic pipelines
- Population genomics studies (1000s of samples)
- Pharmaceutical research workflows
- Biotech production pipelines
🖥️ HPC/Cloud Infrastructure:
- SLURM/PBS/SGE cluster access
- AWS Batch / Google Cloud Batch
- Need for dynamic resource management
- Multi-tenant compute environments
👥 Team Collaboration:
- Multiple analysts using same pipeline
- Need for standardized workflows
- Version control and testing requirements
- Regulatory compliance (CLIA, CAP)
⚡ Production Requirements:
- Resume capability after failures
- Automated QC and reporting
- Audit trails and provenance
- Integration with LIMS systems
- Cost optimization
11.4. The Transformation Journey
Part 1 (This Post) - Academic Foundation:
Single Sample → Bash Script → 3 minutes → Learning Tool
Part 2 (Next Post) - Industrial Production:
1000 Samples → Nextflow Pipeline → 3 minutes total → Production System
Key Insight: The bash approach isn't "wrong"—it's appropriate for certain contexts. Nextflow doesn't replace bash for small-scale work; it extends the capability to industrial scale.
11.5. What's Next in Part 2
In the upcoming post, we'll transform this validated baseline into an enterprise-grade Nextflow pipeline:
Core Development:
- 🔄 Convert each bash step to a Nextflow process
- 📊 Implement channels for parallel data flow
- ⚙️ Add dynamic resource management
- 🔁 Enable resume capability with `-resume`
Testing & Quality:
- ✅ Implement nf-test for all processes
- 🏗️ Create comprehensive integration tests
- 🤖 Set up GitHub Actions CI/CD
- 📋 Pass nf-core lint quality checks
Production Features:
- 🐳 Containerization (Docker/Singularity)
- ☁️ HPC and cloud execution profiles
- 📈 MultiQC for unified QC reports
- 📧 Email notifications and monitoring
Real Production Data:
- 🧬 Test with 30x coverage WGS samples
- 🎯 Validate against gold standard samples (NA12878)
- 📊 Benchmark performance at scale (100+ samples)
- 📖 Document production deployment
11.6. Try It Yourself
The complete code is available on GitHub. To reproduce:
```bash
# Clone repository
cd variant-calling-gatk-pipeline-best-practice-from-scratch

# Install with Pixi
pixi install

# Download test data
pixi run bash download_data.sh

# Run pipeline
pixi run bash gatk_pipeline.sh sample1 2>&1 | tee pipeline.log
```
You'll have a working GATK pipeline in under 5 minutes!
11.7. Key Takeaways
- Bash is excellent for proof-of-concept and small-scale academic research
- Pixi enables reproducibility without the complexity of containers
- Test data validates mechanics, not biological discovery
- Understanding limitations justifies the effort of modernization
- Baseline metrics enable meaningful performance comparisons
- Nextflow is essential for industrial-scale genomics (Part 2)
Ready for industrial scale? Stay tuned for Part 2: From Bash to Nextflow where we'll transform this baseline into a production-ready, enterprise-grade workflow capable of processing thousands of samples with automated testing, dynamic resource management, and deployment on HPC clusters and cloud platforms.
Have questions or suggestions? Leave a comment below or reach out on GitHub!
References
Primary Resources
- NGS101 GATK Tutorial: How to Analyze Whole Genome Sequencing Data - Complete walkthrough of GATK variant calling workflow
- GATK Best Practices: Broad Institute GATK Workflows - Official GATK germline variant calling guidelines
- nf-core Test Datasets: GitHub Repository - Publicly available test data for bioinformatics pipelines
Tools and Documentation
- Pixi Package Manager: Official Documentation - Modern, fast conda-based environment manager
- BWA Manual: BWA Homepage - Burrows-Wheeler Aligner documentation
- SAMtools: Official Site - Tools for manipulating SAM/BAM files
- GATK Documentation: Broad Institute - Comprehensive GATK tool documentation
- Trim Galore: Babraham Bioinformatics - Adapter trimming tool
- SnpEff: Official Site - Variant annotation and effect prediction
- bcftools: SAMtools bcftools - Variant calling and manipulation
Related Blog Posts
- Pixi - New conda era - Understanding modern package management with Pixi
- Containers on HPC: From Docker to Singularity and Apptainer - Container fundamentals for reproducible bioinformatics
- The Evolution of Version Control - Git's Role in Reproducible Bioinformatics (Part 1) - Version control basics for bioinformatics
Code Repository
- GitHub: variant-calling-gatk-pipeline-best-practice-from-scratch - Complete code from this blog post
Further Reading
- GATK Hard Filtering: GATK Forums Discussion - When to use hard filtering vs. VQSR
- GVCF Mode: Joint Genotyping Workflow - Understanding GVCF and joint calling
- Nextflow Documentation: Official Docs - Preview for Part 2
- nf-core: Best Practice Pipelines - Community-curated bioinformatics pipelines