Bioinformatics Workflow Template: Standardizing Python Pipelines with Modular Design
Building reproducible bioinformatics pipelines is hard. Every project starts from scratch with its own testing, CI/CD, and deployment strategy. What if you could clone a template, add your analysis tools, and be ready to go?
This post introduces a standardized bioinformatics workflow template featuring consistent testing, CI/CD, and project structure. Developed from real production experience, the bioinfor-wf-template reduces setup time from days to minutes, ensures research reproducibility, and promotes modular, reusable code. It is Python-based and ideal for proof-of-concept projects. Support for more advanced and widely adopted bioinformatics frameworks (such as Snakemake and Nextflow) is planned, applying the same core principles while leveraging their native testing systems.
The Problem: Bioinformatics Projects Start From Zero
Most bioinformatics projects face similar challenges:
No Standard Structure
- Where do I put my scripts? src/? bin/? scripts/?
- How do I organize apps vs. workflows?
- Where do tests go?
Testing Nightmare
- Unit tests for data validation
- End-to-end (E2E) tests with real data
- Different frameworks need different test approaches
- Docker container testing
CI/CD Inconsistency
- Each project has its own GitHub Actions workflow
- No standard for running tests on changes
- Hard to scale to 50+ apps and workflows
Onboarding Friction
- New team members spend days setting up
- "How do I run the tests?" → No clear answer
- "Where's the documentation?" → Scattered or missing
The Solution: Standardized Template
Introducing a production-ready bioinformatics workflow template with:
- ✅ Clear project structure for apps, workflows, and tests
- ✅ Unified testing framework (pytest) for all modules
- ✅ Smart CI/CD that only runs tests for changed files
- ✅ Docker integration for reproducible environments
- ✅ One-command local testing with act
- ✅ Type hints and validation for data integrity
- ✅ Scales from 5 apps to 50+ production pipelines
Part 1: Understanding the Template Structure
Repository Layout
bioinfor-wf-template/
├── apps/ # Individual bioinformatics tools
│ ├── fastqc/ # Quality control
│ │ ├── main.py # Implementation
│ │ ├── tests/ # Unit + E2E tests
│ │ │ ├── test_fastq_in_pairs.py # Unit tests
│ │ │ └── test_e2e.py # Integration tests
│ │ └── Makefile # Local testing commands
│ ├── multiqc/ # Aggregation
│ │ ├── main.py
│ │ ├── tests/
│ │ └── Makefile
│ └── [your-tool]/ # Add more tools
│
├── workflows/ # Composed pipelines
│ ├── qc/ # Quality control workflow
│ │ ├── main.py # Orchestrates apps
│ │ ├── tests/
│ │ └── Makefile
│ └── [your-pipeline]/ # Add more workflows
│
├── .github/
│ └── workflows/
│ └── tests.yaml # GitHub Actions CI/CD
│
├── conftest.py # Shared pytest configuration
├── pytest.ini # Pytest settings
├── requirements.txt # Dependencies
├── pixi.toml # (Optional) Reproducible environment
└── README.md
Key insight: Apps are reusable building blocks; workflows compose them into pipelines.
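The shared conftest.py at the repository root (listed in the layout above) is where cross-app fixtures live, such as the data_dir fixture used by the E2E tests below. Here is a minimal sketch, assuming a scratch directory under the repository so Docker-mounted tools can see it; the actual fixture contents in the template may differ.
# conftest.py - minimal sketch (assumed contents)
import shutil
from pathlib import Path
import pytest

@pytest.fixture
def data_dir():
    """Scratch directory under the repo root so Docker-mounted tools can read and write it."""
    d = Path.cwd() / ".test_data"
    d.mkdir(parents=True, exist_ok=True)
    yield d
    shutil.rmtree(d, ignore_errors=True)  # clean up after each test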
Example App: FastQC Quality Control
apps/fastqc/main.py
import os
import subprocess

def validate_fastq_files(fastq1: str, fastq2: str):
    """Validate that paired FASTQ files match the naming convention."""
    base1 = os.path.basename(fastq1)
    base2 = os.path.basename(fastq2)
    expected_base2 = base1.replace("_R1_", "_R2_")
    if base2 != expected_base2:
        raise ValueError(f"Expected '{expected_base2}', got '{base2}'")

def run_fastqc(fastq1: str, fastq2: str, output_dir: str):
    """Run FastQC on paired FASTQ files using Docker."""
    validate_fastq_files(fastq1, fastq2)
    os.makedirs(output_dir, exist_ok=True)  # FastQC expects the output directory to exist
    cwd = os.getcwd()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{cwd}:{cwd}",  # mount the working directory at the same path inside the container
        "biocontainers/fastqc:v0.11.9_cv8",
        "fastqc", fastq1, fastq2, "--outdir", output_dir
    ]
    subprocess.run(cmd, check=True)
Key patterns:
- ✅ Validation before execution
- ✅ Docker isolation (reproducible)
- ✅ Clear function signatures
- ✅ Type hints for IDE support
Testing Strategy: Unit + E2E
apps/fastqc/tests/test_fastq_in_pairs.py (Unit tests)
import pytest

from apps.fastqc.main import validate_fastq_files

def test_validate_fastq_files_valid_pair():
    """Test that valid paired files pass validation."""
    fastq1 = "/path/sample_R1_001.fastq.gz"
    fastq2 = "/path/sample_R2_001.fastq.gz"
    validate_fastq_files(fastq1, fastq2)  # Should not raise

def test_validate_fastq_files_invalid_pair():
    """Test that mismatched files raise an error."""
    fastq1 = "/path/sample_R1_001.fastq.gz"
    fastq2 = "/path/sample_R3_001.fastq.gz"
    with pytest.raises(ValueError):
        validate_fastq_files(fastq1, fastq2)
apps/fastqc/tests/test_e2e.py (End-to-end tests)
import shutil

import pytest

from apps.fastqc.main import run_fastqc

@pytest.fixture
def dummy_fastq_files(data_dir):
    """Create minimal paired FASTQ files for testing."""
    fastq1 = data_dir / "sample_R1_001.fastq"
    fastq2 = data_dir / "sample_R2_001.fastq"
    fastq1.write_text("@SEQ_ID\nGATTT...\n+\nIIIII...\n")
    fastq2.write_text("@SEQ_ID\nGATTT...\n+\nIIIII...\n")
    return str(fastq1), str(fastq2)

@pytest.mark.skipif(shutil.which("docker") is None, reason="Docker is not available")
def test_run_fastqc_e2e(dummy_fastq_files, data_dir):
    """Test actual FastQC execution with real (dummy) data."""
    fastq1, fastq2 = dummy_fastq_files
    run_fastqc(fastq1, fastq2, str(data_dir / "results"))
    # Assert output files exist
Test classification:
- Unit tests (test_fastq_in_pairs.py): always fast, no Docker required
- E2E tests (test_e2e.py): slower, require Docker, exercise the actual tools
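This split works because the -k filters in the Makefiles match substrings of test IDs, so keeping "e2e" in the E2E file and function names is what separates the two groups. Discovery settings live in pytest.ini (also listed in the layout); a minimal sketch, with contents assumed rather than taken from the template:
# pytest.ini - minimal sketch (assumed contents)
[pytest]
testpaths = apps workflows
python_files = test_*.py
addopts = -ra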
Part 2: Local Testing Workflow
Quick Setup
# Clone the template
git clone git@github.com:riverxdata/bioinfor-wf-template.git my-pipeline
cd my-pipeline
# Install dependencies
pip install -r requirements.txt
# Run all tests locally
pytest -v
# Or with Pixi (recommended)
pixi global install act
pixi run test
Testing Commands with Make
Each app and workflow has a Makefile for consistent commands:
apps/fastqc/Makefile
.PHONY: unittest e2e

unittest:
	pytest -v -k "not e2e" tests/

e2e:
	pytest -v -k "e2e" tests/
Run tests locally:
# Unit tests only (fast - 10 seconds)
make -C apps/fastqc unittest
# End-to-end tests (slower - 2 minutes)
make -C apps/fastqc e2e
# All tests
make -C apps/fastqc unittest e2e
Run tests for entire workflows:
# All unit tests in all apps
for app in apps/*; do make -C $app unittest; done
# All E2E tests in all workflows
for workflow in workflows/*; do make -C $workflow e2e; done
Part 3: Smart CI/CD with GitHub Actions
GitHub Actions Workflow Analysis
.github/workflows/tests.yaml intelligently runs tests only for changed files:
jobs:
  unittest_and_e2e:
    runs-on: ubuntu-latest
    steps:
      # 0. Check out the repository (required before detecting changes)
      - uses: actions/checkout@v4

      # 1. Detect changed files
      - uses: tj-actions/changed-files@v46.0.5
        id: changed-files
        with:
          files: apps/**
          base: ${{ github.event.pull_request.base.ref }}

      # 2. Run unit tests for changed apps
      - name: Unit tests
        if: steps.changed-files.outputs.all_changed_files != ''
        run: |
          # Changed paths look like apps/fastqc/main.py; map them to unique app directories
          for app in $(echo "${{ steps.changed-files.outputs.all_changed_files }}" | tr ' ' '\n' | cut -d/ -f1-2 | sort -u); do
            make -C "$app" unittest
          done

      # 3. Run E2E tests for changed apps
      - name: E2E tests
        if: steps.changed-files.outputs.all_changed_files != ''
        run: |
          for app in $(echo "${{ steps.changed-files.outputs.all_changed_files }}" | tr ' ' '\n' | cut -d/ -f1-2 | sort -u); do
            make -C "$app" e2e
          done

      # 4. Run all workflow E2E tests
      - name: Workflow tests
        run: |
          for workflow in workflows/*; do
            make -C "$workflow" e2e
          done
How It Works
Scenario 1: Edit apps/fastqc/main.py
GitHub detects change → Run only fastqc tests → ~30 seconds
Scenario 2: Edit workflows/qc/main.py
GitHub detects change → Run qc workflow E2E → ~2 minutes
Includes fastqc + multiqc + orchestration
Scenario 3: Edit README.md
GitHub detects no app changes → app tests are skipped (add a paths filter to the workflow trigger if you also want to skip the workflow E2E step for doc-only changes)
Performance benefit:
- Without optimization: All tests run (~10 min) for every PR
- With optimization: Only affected tests run (~30 sec - 2 min)
- Result: 5-10x faster feedback 🚀
Part 4: Example Apps and Patterns
The template works for any bioinformatics tool wrapped in Python. Here are common patterns:
Pattern 1: Docker-Based Tool Wrapper
Most bioinformatics tools are containerized. Wrap them in Python:
# apps/bowtie2/main.py
import subprocess
from pathlib import Path

def run_bowtie2_alignment(reference, fastq1, fastq2, output_sam):
    """Align paired reads with Bowtie2 via Docker: build the index, then align."""
    cwd = Path.cwd()
    docker = [
        "docker", "run", "--rm",
        "-v", f"{cwd}:{cwd}", "-w", str(cwd),
        "biocontainers/bowtie2:v2.5.1_cv1",
    ]
    # Step 1: build the index from the reference FASTA
    subprocess.run(docker + ["bowtie2-build", reference, "index"], check=True)
    # Step 2: align the paired reads against the index (SAM output via -S)
    subprocess.run(
        docker + ["bowtie2", "-x", "index", "-1", fastq1, "-2", fastq2, "-S", output_sam],
        check=True,
    )

# apps/bowtie2/tests/test_e2e.py
from pathlib import Path

from apps.bowtie2.main import run_bowtie2_alignment

def test_bowtie2_alignment_e2e(reference_fasta, fastq_files):
    """Test alignment with real (dummy) data."""
    run_bowtie2_alignment(reference_fasta, fastq_files[0], fastq_files[1], "output.sam")
    assert Path("output.sam").exists()
Pattern 2: Python Library Wrapper
Some tools ship as Python packages and can be run without Docker:
# apps/cutadapt/main.py
import subprocess
import sys
from pathlib import Path

def trim_adapters(fastq1, fastq2, output_dir):
    """Trim adapters from paired FASTQ files using the cutadapt package."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    adapter = "AGATCGGAAGAGC"  # Common Illumina adapter
    # cutadapt's importable entry point varies between versions,
    # so invoke the installed package via its module entry point (python -m cutadapt)
    subprocess.run([
        sys.executable, "-m", "cutadapt",
        "-a", adapter,
        "-A", adapter,
        "-o", f"{output_dir}/trimmed_R1.fastq.gz",
        "-p", f"{output_dir}/trimmed_R2.fastq.gz",
        fastq1, fastq2,
    ], check=True)

# apps/cutadapt/tests/test_e2e.py
from apps.cutadapt.main import trim_adapters

def test_trim_adapters_e2e(fastq_files, tmp_path):
    """Test adapter trimming."""
    trim_adapters(fastq_files[0], fastq_files[1], str(tmp_path))
    assert (tmp_path / "trimmed_R1.fastq.gz").exists()
    assert (tmp_path / "trimmed_R2.fastq.gz").exists()
Pattern 3: Custom Analysis
Write your own analysis tools:
# apps/gene_counter/main.py
import pandas as pd
import pysam

def count_genes(bam_file, annotation_gtf, output_csv):
    """Count reads per gene from a BAM file."""
    # Load the annotation (available for filtering/joining; not used in this minimal example)
    gtf = pd.read_csv(annotation_gtf, sep="\t", comment="#", header=None)
    # Count reads per gene using pysam
    bam = pysam.AlignmentFile(bam_file)
    counts = {}
    for read in bam:
        gene = read.get_tag("XG") if read.has_tag("XG") else "unknown"
        counts[gene] = counts.get(gene, 0) + 1
    # Save results
    df = pd.DataFrame(list(counts.items()), columns=["gene", "count"])
    df.to_csv(output_csv, index=False)
    return df

# apps/gene_counter/tests/test_e2e.py
import pandas as pd

from apps.gene_counter.main import count_genes

def test_gene_counting_e2e(bam_file, annotation_gtf, tmp_path):
    """Test gene counting."""
    output_csv = tmp_path / "counts.csv"
    count_genes(bam_file, annotation_gtf, str(output_csv))
    results = pd.read_csv(output_csv)
    assert len(results) > 0
    assert "count" in results.columns
Part 5: Adding New Apps and Workflows
Template for New Python App
Create the structure:
mkdir -p apps/myapp/tests
touch apps/myapp/__init__.py
touch apps/myapp/main.py
touch apps/myapp/tests/test_unit.py
touch apps/myapp/tests/test_e2e.py
touch apps/myapp/Makefile
apps/myapp/main.py
import subprocess
from pathlib import Path

def validate_input(input_file: str):
    """Validate input data format."""
    if not Path(input_file).exists():
        raise FileNotFoundError(f"Input file not found: {input_file}")

def run_myapp(input_file: str, output_dir: str):
    """Run my bioinformatics tool."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{Path.cwd()}:{Path.cwd()}",
        "myregistry/myapp:latest",
        "myapp", input_file, "--output", output_dir
    ]
    subprocess.run(cmd, check=True)
apps/myapp/tests/test_unit.py
from apps.myapp.main import validate_input
import pytest

def test_validate_input_exists(tmp_path):
    """Test validation of an existing file."""
    input_file = tmp_path / "test_input.txt"
    input_file.write_text("test data")
    # Should not raise
    validate_input(str(input_file))

def test_validate_input_missing():
    """Test validation of a missing file."""
    with pytest.raises(FileNotFoundError):
        validate_input("/nonexistent/file.txt")
apps/myapp/tests/test_e2e.py
import shutil

import pytest

from apps.myapp.main import run_myapp

@pytest.mark.skipif(shutil.which("docker") is None, reason="Docker not available")
def test_run_myapp_e2e(tmp_path):
    """Test complete myapp execution."""
    input_file = tmp_path / "input.txt"
    input_file.write_text("test data")
    output_dir = tmp_path / "output"
    run_myapp(str(input_file), str(output_dir))
    # Verify output
    assert output_dir.exists()
    assert (output_dir / "result.txt").exists()
apps/myapp/Makefile
.PHONY: unittest e2e

unittest:
	pytest -v -k "not e2e" tests/

e2e:
	pytest -v -k "e2e" tests/
Template for New Workflow
Create the structure:
mkdir -p workflows/mypipeline/tests
touch workflows/mypipeline/main.py
touch workflows/mypipeline/tests/test_e2e.py
touch workflows/mypipeline/Makefile
workflows/mypipeline/main.py
import argparse

from apps.myapp.main import run_myapp
from apps.anotherapp.main import run_anotherapp

def run_mypipeline(input_file, output_dir):
    """Compose myapp and anotherapp into a pipeline."""
    # Step 1: Run the first app
    intermediate_dir = f"{output_dir}/intermediate"
    run_myapp(input_file, intermediate_dir)
    # Step 2: Run the second app on the output of the first
    final_dir = f"{output_dir}/final"
    run_anotherapp(f"{intermediate_dir}/result.txt", final_dir)
    print(f"Pipeline complete. Results in: {final_dir}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="My bioinformatics pipeline")
    parser.add_argument("input_file")
    parser.add_argument("output_dir")
    args = parser.parse_args()
    run_mypipeline(args.input_file, args.output_dir)
workflows/mypipeline/tests/test_e2e.py
from workflows.mypipeline.main import run_mypipeline

def test_mypipeline_e2e(tmp_path):
    """Test complete pipeline execution."""
    input_file = tmp_path / "input.txt"
    input_file.write_text("test data")
    output_dir = tmp_path / "output"
    run_mypipeline(str(input_file), str(output_dir))
    # Verify all outputs exist
    assert (output_dir / "intermediate").exists()
    assert (output_dir / "final").exists()
workflows/mypipeline/Makefile
.PHONY: e2e

e2e:
	pytest -v -k "e2e" tests/
Part 6: Reproducible Environment with Pixi
For maximum reproducibility, add Pixi support:
pixi.toml
[project]
name = "bioinfor-wf"
version = "0.1.0"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64", "osx-arm64"]  # pixi requires target platforms; adjust to yours

[dependencies]
python = "3.12"
pytest = ">=7.0"
numpy = ">=1.24"
pandas = ">=1.5"

[tasks]
test = "pytest -v"
unittest = "pytest -v -k 'not e2e'"
e2e = "pytest -v -k 'e2e'"
ci = "act push"

# Optional environments; these assume matching [feature.gpu] and [feature.dev] tables are defined
[environments]
test = { channels = ["conda-forge", "bioconda", "nvidia"], features = ["gpu"] }
dev = { features = ["dev"] }
Usage:
# Install dependencies
pixi install
# Run tests
pixi run test
# Run GitHub Actions locally
pixi run ci
Benefits and Impact
| Aspect | Before | After |
|---|---|---|
| Onboarding time | 2-3 days | 30 minutes |
| Test setup | Different per project | Standardized pytest |
| CI/CD setup | Write from scratch | Clone workflow |
| Running tests locally | "How do I test this?" | make -C apps/fastqc unittest |
| Code consistency | All over the place | Type hints, validation enforced |
| New app creation | Copy-paste existing | Use template structure |
| App reusability | One-off scripts | Modular, reusable components |
| Team onboarding | Weeks of confusion | 30 min + documentation |
Best Practices for Template Adoption
1. One Function Per Purpose
Each function should do one thing well:
# ❌ Bad: Multiple responsibilities
def process_fastq_and_align(fastq1, fastq2, ref, output):
    validate_fastq(fastq1, fastq2)
    align(fastq1, fastq2, ref, output)
    quality_check(output)

# ✅ Good: Single responsibility
def validate_fastq(fastq1, fastq2): ...
def run_alignment(fastq1, fastq2, ref, output): ...
def quality_check(bam_file): ...
2. Always Validate Input
from pathlib import Path

def run_analysis(input_file: str):
    """Validate before processing."""
    input_path = Path(input_file)
    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_file}")
    if input_path.stat().st_size == 0:
        raise ValueError("Input file is empty")
3. Use Type Hints
from pathlib import Path
from typing import List, Tuple

def process_samples(samples: List[str], output_dir: str) -> Tuple[Path, Path]:
    """Type hints help catch errors early."""
    ...
4. Separate Unit and E2E Tests
# Unit: Fast, no Docker needed
def test_validation():
    assert validate(valid_input)

# E2E: Slow, includes Docker
@pytest.mark.skipif(no_docker, reason="Docker required")
def test_full_workflow_e2e():
    ...
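The no_docker flag used above can be defined once and reused by every E2E module; a minimal sketch, with the name and location (conftest.py) assumed:
# conftest.py - shared Docker availability check (the name no_docker is an assumption)
import shutil

# True when the docker CLI is not on PATH; E2E tests skip themselves in that case
no_docker = shutil.which("docker") is None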
5. Document Dependencies
Keep requirements.txt updated:
pytest>=7.0
docker>=6.0
numpy>=1.24
pandas>=1.5
Common Patterns
Error Handling
import subprocess
import sys
from typing import List

def run_command(cmd: List[str]) -> bool:
    """Run a shell command with proper error handling."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        return True
    except subprocess.CalledProcessError as e:
        print(f"Error: Command failed with exit code {e.returncode}")
        print(f"stderr: {e.stderr}")
        sys.exit(1)
Logging and Progress
import logging
from typing import List

logger = logging.getLogger(__name__)

def process_pipeline(samples: List[str]):
    """Process samples with progress logging."""
    for i, sample in enumerate(samples, 1):
        logger.info(f"Processing sample {i}/{len(samples)}: {sample}")
        # Process...
        logger.info(f"✓ Completed {sample}")
Temporary File Management
from pathlib import Path
import tempfile
def run_with_temp_files(input_file: str) -> str:
    """Create and clean up temporary files automatically."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir)
        # Work with temp files
        intermediate = tmp_path / "intermediate.txt"
        # ... process ...
        # Copy final result to output
        final_output = "final_result.txt"
        # ... copy ...
    # Temp dir automatically cleaned up when exiting the with block
    return final_output
Scaling to 50+ Apps and Workflows
As your template grows, follow these patterns:
Directory Organization
apps/
├── alignment/ # Group by domain
│ ├── bowtie2/
│ ├── bwa/
│ └── hisat2/
├── quality_control/
│ ├── fastqc/
│ ├── multiqc/
│ └── qualimap/
└── variant_calling/
├── gatk/
└── samtools/
workflows/
├── rnaseq/
├── wgs/
├── amplicon/
└── metagenomics/
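One practical consequence of grouping by domain is that import paths gain a level, so workflows reference apps as apps.<category>.<tool>. A short example following the tree above (the specific function names are taken from earlier sections):
# Workflow imports follow the nested apps/<category>/<tool>/ layout
from apps.alignment.bowtie2.main import run_bowtie2_alignment
from apps.quality_control.fastqc.main import run_fastqc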
Automated Testing
#!/bin/bash
# test_all.sh - Run all tests
failed_tests=0
total_tests=0
for app in apps/*/*/; do
    if [ -f "$app/Makefile" ]; then
        total_tests=$((total_tests + 1))
        if ! make -C "$app" unittest; then
            failed_tests=$((failed_tests + 1))
            echo "❌ Failed: $app"
        else
            echo "✓ Passed: $app"
        fi
    fi
done
echo "Tests: $((total_tests - failed_tests))/$total_tests passed"
exit $failed_tests
Documentation
Keep a central CONTRIBUTING.md:
## Adding a New App
1. Create directory: `mkdir -p apps/category/myapp/tests`
2. Copy template: `cp -r apps/fastqc/* apps/category/myapp/`
3. Edit `main.py` with your logic
4. Write tests in `tests/`
5. Update `README.md`
6. Create PR
## Adding a New Workflow
1. Create directory: `mkdir -p workflows/mypipeline/tests`
2. Create `main.py` that composes apps
3. Write E2E test in `tests/`
4. Add to `.github/workflows/tests.yaml`
Key Takeaways
- Template consistency → New projects setup in 30 minutes
- Modular structure → Apps are reusable, workflows compose them
- Testing discipline → Unit + E2E tests, both automated
- Smart CI/CD → Only test changed files (5-10x faster)
- Multi-framework → Python today; Nextflow and Snakemake support planned
- Reproducibility → Docker + Pixi guarantees same environment everywhere
- Scaling → Pattern works for 5 apps or 50+ apps
Getting Started
# Clone the template
git clone git@github.com:riverxdata/bioinfor-wf-template.git my-analysis
# Install dependencies
cd my-analysis
pip install -r requirements.txt
# Run tests
pytest -v
# Or with Pixi
pixi global install act
pixi run test
# Or locally simulate GitHub Actions
act push
Start building standardized bioinformatics pipelines today! 🧬🚀