Bioinformatics Workflow Template: Standardizing Python Pipelines with Modular Design
Building reproducible bioinformatics pipelines is hard. Every project starts from scratch with its own testing, CI/CD, and deployment strategy. What if you could clone a template, add your analysis tools, and be ready to go?
This post introduces a standardized bioinformatics workflow template featuring consistent testing, CI/CD, and project structure. Developed from real production experience, the bioinfor-wf-template reduces setup time from days to minutes, ensures research reproducibility, and promotes modular, reusable code. It is Python-based and ideal for proof-of-concept projects. Support for more advanced and widely adopted bioinformatics frameworks (such as Snakemake and Nextflow) is planned, applying the same core principles while leveraging their native testing systems.
The Problem: Bioinformatics Projects Start From Zero
Most bioinformatics projects face similar challenges:
No Standard Structure
- Where do I put my scripts? src/? bin/? scripts/?
- How do I organize apps vs. workflows?
- Where do tests go?
Testing Nightmare
- Unit tests for data validation
- End-to-end (E2E) tests with real data
- Different frameworks need different test approaches
- Docker container testing
CI/CD Inconsistency
- Each project has its own GitHub Actions workflow
- No standard for running tests on changes
- Hard to scale to 50+ apps and workflows
Onboarding Friction
- New team members spend days setting up
- "How do I run the tests?" → No clear answer
- "Where's the documentation?" → Scattered or missing
The Solution: Standardized Template
Introducing a production-ready bioinformatics workflow template with:
- ✅ Clear project structure for apps, workflows, and tests
- ✅ Unified testing framework (pytest) for all modules
- ✅ Smart CI/CD that only runs tests for changed files
- ✅ Docker integration for reproducible environments
- ✅ One-command local testing with act
- ✅ Type hints and validation for data integrity
- ✅ Scales from 5 apps to 50+ production pipelines
Part 1: Understanding the Template Structure
Repository Layout
bioinfor-wf-template/
├── apps/ # Individual bioinformatics tools
│ ├── fastqc/ # Quality control
│ │ ├── main.py # Implementation
│ │ ├── tests/ # Unit + E2E tests
│ │ │ ├── test_fastq_in_pairs.py # Unit tests
│ │ │ └── test_e2e.py # Integration tests
│ │ └── Makefile # Local testing commands
│ ├── multiqc/ # Aggregation
│ │ ├── main.py
│ │ ├── tests/
│ │ └── Makefile
│ └── [your-tool]/ # Add more tools
│
├── workflows/ # Composed pipelines
│ ├── qc/ # Quality control workflow
│ │ ├── main.py # Orchestrates apps
│ │ ├── tests/
│ │ └── Makefile
│ └── [your-pipeline]/ # Add more workflows
│
├── .github/
│ └── workflows/
│ └── tests.yaml # GitHub Actions CI/CD
│
├── conftest.py # Shared pytest configuration
├── pytest.ini # Pytest settings
├── requirements.txt # Dependencies
├── pixi.toml # (Optional) Reproducible environment
└── README.md
Key insight: Apps are reusable building blocks; workflows compose them into pipelines.
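The shared conftest.py at the repository root (listed in the layout above) is where cross-app fixtures live, such as the data_dir fixture used by the E2E tests below. Here is a minimal sketch, assuming a scratch directory under the repository so Docker-mounted tools can see it; the actual fixture contents in the template may differ.
# conftest.py - minimal sketch (assumed contents)
import shutil
from pathlib import Path
import pytest

@pytest.fixture
def data_dir():
    """Scratch directory under the repo root so Docker-mounted tools can read and write it."""
    d = Path.cwd() / ".test_data"
    d.mkdir(parents=True, exist_ok=True)
    yield d
    shutil.rmtree(d, ignore_errors=True)  # clean up after each test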
Example App: FastQC Quality Control
apps/fastqc/main.py
import os
import subprocess

def validate_fastq_files(fastq1: str, fastq2: str):
    """Validate that paired FASTQ files match the naming convention."""
    base1 = os.path.basename(fastq1)
    base2 = os.path.basename(fastq2)
    expected_base2 = base1.replace("_R1_", "_R2_")
    if base2 != expected_base2:
        raise ValueError(f"Expected '{expected_base2}', got '{base2}'")

def run_fastqc(fastq1: str, fastq2: str, output_dir: str):
    """Run FastQC on paired FASTQ files using Docker."""
    validate_fastq_files(fastq1, fastq2)
    os.makedirs(output_dir, exist_ok=True)  # FastQC expects the output directory to exist
    cwd = os.getcwd()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{cwd}:{cwd}",  # mount the working directory at the same path inside the container
        "biocontainers/fastqc:v0.11.9_cv8",
        "fastqc", fastq1, fastq2, "--outdir", output_dir
    ]
    subprocess.run(cmd, check=True)
Key patterns:
- ✅ Validation before execution
- ✅ Docker isolation (reproducible)
- ✅ Clear function signatures
- ✅ Type hints for IDE support
Testing Strategy: Unit + E2E
apps/fastqc/tests/test_fastq_in_pairs.py (Unit tests)
import pytest

from apps.fastqc.main import validate_fastq_files

def test_validate_fastq_files_valid_pair():
    """Test that valid paired files pass validation."""
    fastq1 = "/path/sample_R1_001.fastq.gz"
    fastq2 = "/path/sample_R2_001.fastq.gz"
    validate_fastq_files(fastq1, fastq2)  # Should not raise

def test_validate_fastq_files_invalid_pair():
    """Test that mismatched files raise an error."""
    fastq1 = "/path/sample_R1_001.fastq.gz"
    fastq2 = "/path/sample_R3_001.fastq.gz"
    with pytest.raises(ValueError):
        validate_fastq_files(fastq1, fastq2)
apps/fastqc/tests/test_e2e.py (End-to-end tests)
import shutil

import pytest

from apps.fastqc.main import run_fastqc

@pytest.fixture
def dummy_fastq_files(data_dir):
    """Create minimal paired FASTQ files for testing."""
    fastq1 = data_dir / "sample_R1_001.fastq"
    fastq2 = data_dir / "sample_R2_001.fastq"
    fastq1.write_text("@SEQ_ID\nGATTT...\n+\nIIIII...\n")
    fastq2.write_text("@SEQ_ID\nGATTT...\n+\nIIIII...\n")
    return str(fastq1), str(fastq2)

@pytest.mark.skipif(shutil.which("docker") is None, reason="Docker is not available")
def test_run_fastqc_e2e(dummy_fastq_files, data_dir):
    """Test actual FastQC execution with real (dummy) data."""
    fastq1, fastq2 = dummy_fastq_files
    run_fastqc(fastq1, fastq2, str(data_dir / "results"))
    # Assert output files exist
Test classification:
- Unit tests (test_fastq_in_pairs.py): always fast, no Docker required
- E2E tests (test_e2e.py): slower, require Docker, exercise the actual tools
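This split works because the -k filters in the Makefiles match substrings of test IDs, so keeping "e2e" in the E2E file and function names is what separates the two groups. Discovery settings live in pytest.ini (also listed in the layout); a minimal sketch, with contents assumed rather than taken from the template:
# pytest.ini - minimal sketch (assumed contents)
[pytest]
testpaths = apps workflows
python_files = test_*.py
addopts = -ra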
Part 2: Local Testing Workflow
Quick Setup
# Clone the template
git clone git@github.com:riverxdata/bioinfor-wf-template.git my-pipeline
cd my-pipeline
# Install dependencies
pip install -r requirements.txt
# Run all tests locally
pytest -v
# Or with Pixi (recommended)
pixi global install act
pixi run test
Testing Commands with Make
Each app and workflow has a Makefile for consistent commands:
apps/fastqc/Makefile
.PHONY: unittest e2e

unittest:
	pytest -v -k "not e2e" tests/

e2e:
	pytest -v -k "e2e" tests/
Run tests locally:
# Unit tests only (fast - 10 seconds)
make -C apps/fastqc unittest
# End-to-end tests (slower - 2 minutes)
make -C apps/fastqc e2e
# All tests
make -C apps/fastqc unittest e2e
Run tests for entire workflows:
# All unit tests in all apps
for app in apps/*; do make -C $app unittest; done
# All E2E tests in all workflows
for workflow in workflows/*; do make -C $workflow e2e; done
Part 3: Smart CI/CD with GitHub Actions
GitHub Actions Workflow Analysis
.github/workflows/tests.yaml intelligently runs tests only for changed files:
jobs:
  unittest_and_e2e:
    runs-on: ubuntu-latest
    steps:
      # 0. Check out the repository (required before detecting changes)
      - uses: actions/checkout@v4

      # 1. Detect changed files
      - uses: tj-actions/changed-files@v46.0.5
        id: changed-files
        with:
          files: apps/**
          base: ${{ github.event.pull_request.base.ref }}

      # 2. Run unit tests for changed apps
      - name: Unit tests
        if: steps.changed-files.outputs.all_changed_files != ''
        run: |
          # Changed paths look like apps/fastqc/main.py; map them to unique app directories
          for app in $(echo "${{ steps.changed-files.outputs.all_changed_files }}" | tr ' ' '\n' | cut -d/ -f1-2 | sort -u); do
            make -C "$app" unittest
          done

      # 3. Run E2E tests for changed apps
      - name: E2E tests
        if: steps.changed-files.outputs.all_changed_files != ''
        run: |
          for app in $(echo "${{ steps.changed-files.outputs.all_changed_files }}" | tr ' ' '\n' | cut -d/ -f1-2 | sort -u); do
            make -C "$app" e2e
          done

      # 4. Run all workflow E2E tests
      - name: Workflow tests
        run: |
          for workflow in workflows/*; do
            make -C "$workflow" e2e
          done
How It Works
Scenario 1: Edit apps/fastqc/main.py
GitHub detects change → Run only fastqc tests → ~30 seconds
Scenario 2: Edit workflows/qc/main.py
GitHub detects change → Run qc workflow E2E → ~2 minutes
Includes fastqc + multiqc + orchestration
Scenario 3: Edit README.md
GitHub detects no app changes → app tests are skipped (add a paths filter to the workflow trigger if you also want to skip the workflow E2E step for doc-only changes)
Performance benefit:
- Without optimization: All tests run (~10 min) for every PR
- With optimization: Only affected tests run (~30 sec - 2 min)
- Result: 5-10x faster feedback 🚀
Part 4: Example Apps and Patterns
The template works for any bioinformatics tool wrapped in Python. Here are common patterns:
Pattern 1: Docker-Based Tool Wrapper
Most bioinformatics tools are containerized. Wrap them in Python:
# apps/bowtie2/main.py
import subprocess
from pathlib import Path

def run_bowtie2_alignment(reference, fastq1, fastq2, output_sam):
    """Align paired reads with Bowtie2 via Docker: build the index, then align."""
    cwd = Path.cwd()
    docker = [
        "docker", "run", "--rm",
        "-v", f"{cwd}:{cwd}", "-w", str(cwd),
        "biocontainers/bowtie2:v2.5.1_cv1",
    ]
    # Step 1: build the index from the reference FASTA
    subprocess.run(docker + ["bowtie2-build", reference, "index"], check=True)
    # Step 2: align the paired reads against the index (SAM output via -S)
    subprocess.run(
        docker + ["bowtie2", "-x", "index", "-1", fastq1, "-2", fastq2, "-S", output_sam],
        check=True,
    )

# apps/bowtie2/tests/test_e2e.py
from pathlib import Path

from apps.bowtie2.main import run_bowtie2_alignment

def test_bowtie2_alignment_e2e(reference_fasta, fastq_files):
    """Test alignment with real (dummy) data."""
    run_bowtie2_alignment(reference_fasta, fastq_files[0], fastq_files[1], "output.sam")
    assert Path("output.sam").exists()
Pattern 2: Python Library Wrapper
Some tools ship as Python packages and can be run without Docker:
# apps/cutadapt/main.py
import subprocess
import sys
from pathlib import Path

def trim_adapters(fastq1, fastq2, output_dir):
    """Trim adapters from paired FASTQ files using the cutadapt package."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    adapter = "AGATCGGAAGAGC"  # Common Illumina adapter
    # cutadapt's importable entry point varies between versions,
    # so invoke the installed package via its module entry point (python -m cutadapt)
    subprocess.run([
        sys.executable, "-m", "cutadapt",
        "-a", adapter,
        "-A", adapter,
        "-o", f"{output_dir}/trimmed_R1.fastq.gz",
        "-p", f"{output_dir}/trimmed_R2.fastq.gz",
        fastq1, fastq2,
    ], check=True)

# apps/cutadapt/tests/test_e2e.py
from apps.cutadapt.main import trim_adapters

def test_trim_adapters_e2e(fastq_files, tmp_path):
    """Test adapter trimming."""
    trim_adapters(fastq_files[0], fastq_files[1], str(tmp_path))
    assert (tmp_path / "trimmed_R1.fastq.gz").exists()
    assert (tmp_path / "trimmed_R2.fastq.gz").exists()
Pattern 3: Custom Analysis
Write your own analysis tools:
# apps/gene_counter/main.py
import pandas as pd
import pysam

def count_genes(bam_file, annotation_gtf, output_csv):
    """Count reads per gene from a BAM file."""
    # Load the annotation (available for filtering/joining; not used in this minimal example)
    gtf = pd.read_csv(annotation_gtf, sep="\t", comment="#", header=None)
    # Count reads per gene using pysam
    bam = pysam.AlignmentFile(bam_file)
    counts = {}
    for read in bam:
        gene = read.get_tag("XG") if read.has_tag("XG") else "unknown"
        counts[gene] = counts.get(gene, 0) + 1
    # Save results
    df = pd.DataFrame(list(counts.items()), columns=["gene", "count"])
    df.to_csv(output_csv, index=False)
    return df

# apps/gene_counter/tests/test_e2e.py
import pandas as pd

from apps.gene_counter.main import count_genes

def test_gene_counting_e2e(bam_file, annotation_gtf, tmp_path):
    """Test gene counting."""
    output_csv = tmp_path / "counts.csv"
    count_genes(bam_file, annotation_gtf, str(output_csv))
    results = pd.read_csv(output_csv)
    assert len(results) > 0
    assert "count" in results.columns
Part 5: Adding New Apps and Workflows
Template for New Python App
Create the structure:
mkdir -p apps/myapp/tests
touch apps/myapp/__init__.py
touch apps/myapp/main.py
touch apps/myapp/tests/test_unit.py
touch apps/myapp/tests/test_e2e.py
touch apps/myapp/Makefile
apps/myapp/main.py
import subprocess
from pathlib import Path

def validate_input(input_file: str):
    """Validate input data format."""
    if not Path(input_file).exists():
        raise FileNotFoundError(f"Input file not found: {input_file}")

def run_myapp(input_file: str, output_dir: str):
    """Run my bioinformatics tool."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{Path.cwd()}:{Path.cwd()}",
        "myregistry/myapp:latest",
        "myapp", input_file, "--output", output_dir
    ]
    subprocess.run(cmd, check=True)
apps/myapp/tests/test_unit.py
from apps.myapp.main import validate_input
import pytest

def test_validate_input_exists(tmp_path):
    """Test validation of an existing file."""
    input_file = tmp_path / "test_input.txt"
    input_file.write_text("test data")
    # Should not raise
    validate_input(str(input_file))

def test_validate_input_missing():
    """Test validation of a missing file."""
    with pytest.raises(FileNotFoundError):
        validate_input("/nonexistent/file.txt")
apps/myapp/tests/test_e2e.py
import shutil

import pytest

from apps.myapp.main import run_myapp

@pytest.mark.skipif(shutil.which("docker") is None, reason="Docker not available")
def test_run_myapp_e2e(tmp_path):
    """Test complete myapp execution."""
    input_file = tmp_path / "input.txt"
    input_file.write_text("test data")
    output_dir = tmp_path / "output"
    run_myapp(str(input_file), str(output_dir))
    # Verify output
    assert output_dir.exists()
    assert (output_dir / "result.txt").exists()
apps/myapp/Makefile
.PHONY: unittest e2e

unittest:
	pytest -v -k "not e2e" tests/

e2e:
	pytest -v -k "e2e" tests/
Template for New Workflow
Create the structure:
mkdir -p workflows/mypipeline/tests
touch workflows/mypipeline/main.py
touch workflows/mypipeline/tests/test_e2e.py
touch workflows/mypipeline/Makefile
workflows/mypipeline/main.py
import argparse

from apps.myapp.main import run_myapp
from apps.anotherapp.main import run_anotherapp

def run_mypipeline(input_file, output_dir):
    """Compose myapp and anotherapp into a pipeline."""
    # Step 1: Run the first app
    intermediate_dir = f"{output_dir}/intermediate"
    run_myapp(input_file, intermediate_dir)
    # Step 2: Run the second app on the output of the first
    final_dir = f"{output_dir}/final"
    run_anotherapp(f"{intermediate_dir}/result.txt", final_dir)
    print(f"Pipeline complete. Results in: {final_dir}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="My bioinformatics pipeline")
    parser.add_argument("input_file")
    parser.add_argument("output_dir")
    args = parser.parse_args()
    run_mypipeline(args.input_file, args.output_dir)
workflows/mypipeline/tests/test_e2e.py
from workflows.mypipeline.main import run_mypipeline

def test_mypipeline_e2e(tmp_path):
    """Test complete pipeline execution."""
    input_file = tmp_path / "input.txt"
    input_file.write_text("test data")
    output_dir = tmp_path / "output"
    run_mypipeline(str(input_file), str(output_dir))
    # Verify all outputs exist
    assert (output_dir / "intermediate").exists()
    assert (output_dir / "final").exists()
workflows/mypipeline/Makefile
.PHONY: e2e

e2e:
	pytest -v -k "e2e" tests/
Part 6: Reproducible Environment with Pixi
For maximum reproducibility, add Pixi support:
pixi.toml
[project]
name = "bioinfor-wf"
version = "0.1.0"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64", "osx-arm64"]  # pixi requires target platforms; adjust to yours

[dependencies]
python = "3.12"
pytest = ">=7.0"
numpy = ">=1.24"
pandas = ">=1.5"

[tasks]
test = "pytest -v"
unittest = "pytest -v -k 'not e2e'"
e2e = "pytest -v -k 'e2e'"
ci = "act push"

# Optional environments; these assume matching [feature.gpu] and [feature.dev] tables are defined
[environments]
test = { channels = ["conda-forge", "bioconda", "nvidia"], features = ["gpu"] }
dev = { features = ["dev"] }
Usage:
# Install dependencies
pixi install
# Run tests
pixi run test
# Run GitHub Actions locally
pixi run ci
Benefits and Impact
| Aspect | Before | After |
|---|---|---|
| Onboarding time | 2-3 days | 30 minutes |
| Test setup | Different per project | Standardized pytest |
| CI/CD setup | Write from scratch | Clone workflow |
| Running tests locally | "How do I test this?" | make -C apps/fastqc unittest |
| Code consistency | All over the place | Type hints, validation enforced |
| New app creation | Copy-paste existing | Use template structure |
| App reusability | One-off scripts | Modular, reusable components |
| Team onboarding | Weeks of confusion | 30 min + documentation |
Best Practices for Template Adoption
1. One Function Per Purpose
Each function should do one thing well:
# ❌ Bad: Multiple responsibilities
def process_fastq_and_align(fastq1, fastq2, ref, output):
    validate_fastq(fastq1, fastq2)
    align(fastq1, fastq2, ref, output)
    quality_check(output)

# ✅ Good: Single responsibility
def validate_fastq(fastq1, fastq2): ...
def run_alignment(fastq1, fastq2, ref, output): ...
def quality_check(bam_file): ...
2. Always Validate Input
from pathlib import Path

def run_analysis(input_file: str):
    """Validate before processing."""
    input_path = Path(input_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_file}")
    if input_path.stat().st_size == 0:
        raise ValueError("Input file is empty")
3. Use Type Hints
from pathlib import Path
from typing import List, Tuple

def process_samples(samples: List[str], output_dir: str) -> Tuple[Path, Path]:
    """Type hints help catch errors early."""
    ...
4. Separate Unit and E2E Tests
# Unit: Fast, no Docker needed
def test_validation():
    assert validate(valid_input)

# E2E: Slow, includes Docker
@pytest.mark.skipif(no_docker, reason="Docker required")
def test_full_workflow_e2e():
    ...
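The no_docker flag used above can be defined once and reused by every E2E module; a minimal sketch, with the name and location (conftest.py) assumed:
# conftest.py - shared Docker availability check (the name no_docker is an assumption)
import shutil

# True when the docker CLI is not on PATH; E2E tests skip themselves in that case
no_docker = shutil.which("docker") is None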
5. Document Dependencies
Keep requirements.txt updated:
pytest>=7.0
docker>=6.0
numpy>=1.24
pandas>=1.5
Common Patterns
Error Handling
import subprocess
import sys
from typing import List

def run_command(cmd: List[str]) -> bool:
    """Run a shell command with proper error handling."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        return True
    except subprocess.CalledProcessError as e:
        print(f"Error: Command failed with exit code {e.returncode}")
        print(f"stderr: {e.stderr}")
        sys.exit(1)
Logging and Progress
import logging
from typing import List

logger = logging.getLogger(__name__)

def process_pipeline(samples: List[str]):
    """Process samples with progress logging."""
    for i, sample in enumerate(samples, 1):
        logger.info(f"Processing sample {i}/{len(samples)}: {sample}")
        # Process...
        logger.info(f"✓ Completed {sample}")
Temporary File Management
from pathlib import Path
import tempfile
def run_with_temp_files(input_file: str) -> str:
    """Create and clean up temporary files automatically."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        # Work with temp files
        intermediate = tmp_path / "intermediate.txt"
        # ... process ...
        # Copy final result to output
        final_output = "final_result.txt"
        # ... copy ...
    # Temp dir automatically cleaned up when exiting the with block
    return final_output
Scaling to 50+ Apps and Workflows
As your template grows, follow these patterns:
Directory Organization
apps/
├── alignment/ # Group by domain
│ ├── bowtie2/
│ ├── bwa/
│ └── hisat2/
├── quality_control/
│ ├── fastqc/
│ ├── multiqc/
│ └── qualimap/
└── variant_calling/
├── gatk/
└── samtools/
workflows/
├── rnaseq/
├── wgs/
├── amplicon/
└── metagenomics/
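One practical consequence of grouping by domain is that import paths gain a level, so workflows reference apps as apps.<category>.<tool>. A short example following the tree above (the specific function names are taken from earlier sections):
# Workflow imports follow the nested apps/<category>/<tool>/ layout
from apps.alignment.bowtie2.main import run_bowtie2_alignment
from apps.quality_control.fastqc.main import run_fastqc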
Automated Testing
#!/bin/bash
# test_all.sh - Run all tests
failed_tests=0
total_tests=0
for app in apps/*/*/; do
    if [ -f "$app/Makefile" ]; then
        total_tests=$((total_tests + 1))
        if ! make -C "$app" unittest; then
            failed_tests=$((failed_tests + 1))
            echo "❌ Failed: $app"
        else
            echo "✓ Passed: $app"
        fi
    fi
done
echo "Tests: $((total_tests - failed_tests))/$total_tests passed"
exit $failed_tests
Documentation
Keep a central CONTRIBUTING.md:
## Adding a New App
1. Create directory: `mkdir -p apps/category/myapp/tests`
2. Copy template: `cp -r apps/fastqc/* apps/category/myapp/`
3. Edit `main.py` with your logic
4. Write tests in `tests/`
5. Update `README.md`
6. Create PR
## Adding a New Workflow
1. Create directory: `mkdir -p workflows/mypipeline/tests`
2. Create `main.py` that composes apps
3. Write E2E test in `tests/`
4. Add to `.github/workflows/tests.yaml`
Key Takeaways
- Template consistency → New projects setup in 30 minutes
- Modular structure → Apps are reusable, workflows compose them
- Testing discipline → Unit + E2E tests, both automated
- Smart CI/CD → Only test changed files (5-10x faster)
- Multi-framework → Python today; Nextflow and Snakemake support planned
- Reproducibility → Docker + Pixi guarantees same environment everywhere
- Scaling → Pattern works for 5 apps or 50+ apps
Getting Started
# Clone the template
git clone git@github.com:riverxdata/bioinfor-wf-template.git my-analysis
# Install dependencies
cd my-analysis
pip install -r requirements.txt
# Run tests
pytest -v
# Or with Pixi
pixi global install act
pixi run test
# Or locally simulate GitHub Actions
act push
Start building standardized bioinformatics pipelines today! 🧬🚀