
How to Migrate from In-House Pipelines to Enterprise-Level Workflows: A Proven 3-Step Validation Framework

Thanh-Giang Tan Nguyen, Founder at RIVER · 18 min read

Whether your lab uses bash scripts, Python workflows, Snakemake pipelines, or custom tooling, your in-house pipeline works fine locally. It has been running for years. But as your research grows, you face a hard truth: in-house pipelines don't scale, aren't reproducible across teams, and require constant manual fixes.

This is where enterprise-level workflow management comes in. But before you migrate your entire pipeline to Nextflow (or any professional workflow manager), you need answers to the hardest question: Will the new pipeline produce identical results?

This blog reveals the proven 3-step framework used by production teams to confidently migrate ANY bioinformatics pipeline—regardless of its original format. You'll learn how to establish reproducibility baselines, control for non-deterministic behavior, and validate that your enterprise pipeline is scientifically equivalent to your original work. Plus, we'll show you how nf-test automates this entire validation process when migrating to Nextflow.

Why Enterprise Pipelines Matter: The Numbers

Your lab's in-house pipeline (bash, Python, Snakemake, or custom) might work locally, but it doesn't scale beyond your machine. Here's what changes when you move from in-house to enterprise-level:

The In-House Pipeline Problem (Any Format):

  • Runs on one person's machine, with one person understanding it
  • Tool versions undocumented and constantly drift across environments
  • Non-reproducible results across team members, platforms, or time
  • Scaling from 10 samples to 1000 means extensive reworking
  • Impossible to share with collaborators, publish with research, or integrate with institutional compute systems
  • No audit trail for regulatory or compliance requirements

Enterprise Pipeline Benefits (Nextflow, Snakemake, or CWL):

  • Containerized, version-controlled, self-documenting
  • Reproducible results to the byte, across any platform (laptop to HPC to cloud)
  • Scales from 1 sample to 100,000 without modification
  • Shareable, citable, auditable for regulatory compliance
  • Integrable with HPC systems, cloud platforms, and institutional data pipelines
  • Native support for monitoring, logging, and failure recovery

The Cost of Staying In-House:

  • Researcher time spent debugging instead of analyzing: 30-40% of effort
  • Lost results due to environment changes: $10,000+ per incident
  • Collaboration delayed by "it works on my machine" problems: weeks per project
  • Inability to meet publishing reproducibility standards

The Hidden Risk of Pipeline Migration (From Any In-House System)

Your current pipeline is a black box of institutional knowledge—whether it's bash scripts, Python code, Snakemake, or custom workflows. It was built incrementally, never designed for reproducibility across teams, and probably has undocumented quirks that make it work. You can't just rewrite it in Nextflow and hope for the same results.

The hard truth: 60% of bioinformatics pipeline migrations introduce subtle bugs that go undetected for months. A variant is called differently. A read is filtered out. A threshold shifts slightly. Files are processed in a different order. Biologically, the difference may be negligible. Scientifically, it's a catastrophe, because you can't trace back what changed.

This is why enterprise teams use a validation framework—regardless of whether they're migrating FROM bash, Python, Snakemake, custom C++, or anything else. Before replacing your in-house pipeline with an enterprise system (Nextflow, Snakemake, CWL, etc.), you need:

  1. A baseline snapshot (MD5 checksums) of what "correct" looks like from your original pipeline
  2. Explicit control over non-deterministic behavior (hard-coded random seeds)
  3. Byte-for-byte validation that the new pipeline matches the old
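In shell terms, steps 1 and 3 boil down to snapshotting checksums and replaying them with `md5sum -c`. A minimal sketch with stand-in files (the paths and file contents here are illustrative, not from a real pipeline):

```shell
#!/bin/bash
set -euo pipefail

# Stand-in outputs: in practice these are your pipeline's BAM/VCF files
mkdir -p baseline migrated
printf 'chr1\t100\tA\tT\n' > baseline/variants.tsv
printf 'chr1\t100\tA\tT\n' > migrated/variants.tsv

# Step 1: snapshot the baseline in md5sum's checkable "hash  name" format
( cd baseline && md5sum variants.tsv ) > baseline_checksums.txt

# Step 3: byte-for-byte validation of the migrated outputs
( cd migrated && md5sum -c ../baseline_checksums.txt )
```

`md5sum -c` exits non-zero on any mismatch, so the same check can gate a CI job.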

The framework is the same regardless of source. But the tooling differs based on your target. If you're migrating TO Nextflow, nf-test automates this entire process.

Let's build that framework.


Step 1: Establish a Reproducibility Baseline with MD5 Snapshots (From Your Original Pipeline)

The first step is creating a "golden standard"—verified outputs from your original in-house pipeline (bash, Python, Snakemake, or any format) that you can use as a reference. This baseline is universal and doesn't depend on what you're migrating TO.

1.1 Document Your Original Pipeline (Language/Format Agnostic)

Create a comprehensive script that records everything:

#!/bin/bash
# original_pipeline.sh - Reference implementation with full documentation

set -euo pipefail

# Configuration
REFERENCE="/data/reference/hg38.fasta"
READS="/data/reads/sample.fastq"
OUTPUT_DIR="${1:-/results/baseline}"   # optional override for repeat runs
MANIFEST="${OUTPUT_DIR}/manifest.txt"

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# Record software versions
echo "=== Pipeline Execution Record ===" > "${MANIFEST}"
echo "Date: $(date -Iseconds)" >> "${MANIFEST}"
echo "" >> "${MANIFEST}"
echo "=== Software Versions ===" >> "${MANIFEST}"
echo "bwa: $(bwa 2>&1 | grep Version)" >> "${MANIFEST}"
echo "samtools: $(samtools --version | head -1)" >> "${MANIFEST}"
echo "bcftools: $(bcftools --version | head -1)" >> "${MANIFEST}"
echo "" >> "${MANIFEST}"

# Record input file hashes
echo "=== Input File Checksums ===" >> "${MANIFEST}"
md5sum "${REFERENCE}" >> "${MANIFEST}"
md5sum "${READS}" >> "${MANIFEST}"
echo "" >> "${MANIFEST}"

# Step 1: Alignment
echo "=== Step 1: BWA Alignment ===" >> "${MANIFEST}"
bwa mem -t 8 "${REFERENCE}" "${READS}" | \
samtools sort -@ 4 -o "${OUTPUT_DIR}/aligned.bam" -
md5sum "${OUTPUT_DIR}/aligned.bam" >> "${MANIFEST}"
echo "aligned.bam: $(md5sum "${OUTPUT_DIR}/aligned.bam" | cut -d' ' -f1)"

# Step 2: Mark Duplicates
echo "=== Step 2: Mark Duplicates ===" >> "${MANIFEST}"
samtools markdup "${OUTPUT_DIR}/aligned.bam" "${OUTPUT_DIR}/marked.bam"
md5sum "${OUTPUT_DIR}/marked.bam" >> "${MANIFEST}"
echo "marked.bam: $(md5sum "${OUTPUT_DIR}/marked.bam" | cut -d' ' -f1)"

# Step 3: Call Variants
echo "=== Step 3: Call Variants ===" >> "${MANIFEST}"
bcftools mpileup -f "${REFERENCE}" "${OUTPUT_DIR}/marked.bam" | \
bcftools call -m -o "${OUTPUT_DIR}/variants.vcf"
md5sum "${OUTPUT_DIR}/variants.vcf" >> "${MANIFEST}"
echo "variants.vcf: $(md5sum "${OUTPUT_DIR}/variants.vcf" | cut -d' ' -f1)"

echo ""
echo "Baseline execution complete. Manifest saved to: ${MANIFEST}"
cat "${MANIFEST}"

1.2 Run the Original Pipeline Multiple Times

Execute the pipeline multiple times with identical inputs to verify determinism:

#!/bin/bash
# Verify pipeline reproducibility

RUNS=3
REFERENCE_DIR="/results/baseline"

echo "Running original pipeline $RUNS times..."

for i in $(seq 1 $RUNS); do
    OUTPUT_DIR="/results/baseline_run_$i"
    bash original_pipeline.sh "$OUTPUT_DIR" > "/tmp/run_$i.log" 2>&1
done

# Compare outputs
echo ""
echo "=== Reproducibility Check ==="
md5sum "${REFERENCE_DIR}/aligned.bam" /results/baseline_run_*/aligned.bam
md5sum "${REFERENCE_DIR}/marked.bam" /results/baseline_run_*/marked.bam
md5sum "${REFERENCE_DIR}/variants.vcf" /results/baseline_run_*/variants.vcf

# Extract just the checksums and compare
BASELINE_ALIGNED=$(md5sum "${REFERENCE_DIR}/aligned.bam" | cut -d' ' -f1)
BASELINE_MARKED=$(md5sum "${REFERENCE_DIR}/marked.bam" | cut -d' ' -f1)
BASELINE_VARIANTS=$(md5sum "${REFERENCE_DIR}/variants.vcf" | cut -d' ' -f1)

echo ""
echo "Baseline checksums:"
echo " aligned.bam: $BASELINE_ALIGNED"
echo " marked.bam: $BASELINE_MARKED"
echo " variants.vcf: $BASELINE_VARIANTS"

Expected output if pipeline is deterministic:

98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5 /results/baseline/aligned.bam
98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5 /results/baseline_run_1/aligned.bam
98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5 /results/baseline_run_2/aligned.bam
98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5 /results/baseline_run_3/aligned.bam

All checksums match - Pipeline is deterministic!
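Eyeballing four hashes works; across many files and runs it doesn't. One way to assert determinism programmatically is to count distinct checksums per file. A sketch with stand-in files (`run_*/aligned.txt` stands in for the real `baseline_run_*/aligned.bam`):

```shell
#!/bin/bash
set -euo pipefail

# Stand-in for identical outputs from repeated runs
mkdir -p run_1 run_2 run_3
for d in run_1 run_2 run_3; do
    printf 'read1\nread2\n' > "$d/aligned.txt"
done

# One distinct checksum across all runs means the step is deterministic
distinct=$(md5sum run_*/aligned.txt | awk '{print $1}' | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
    echo "deterministic"
else
    echo "NON-DETERMINISTIC: $distinct distinct checksums" >&2
    exit 1
fi
```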

1.3 Document Non-Deterministic Tools

Some bioinformatics tools produce non-deterministic output by design (random seeds, floating-point precision, threading order). Identify and handle them explicitly:

# Tools that need seed control
# These must be configured with hard-coded seeds

# Example 1: Tools with random sampling (seqtk)
seqtk sample -s 42 reads.fastq 0.5 > sampled.fastq

# Example 2: Tools with randomized output order (bowtie2 with threading)
bowtie2 --seed 42 -p 8 -x index -U reads.fastq -S output.sam

# Example 3: Python-based tools with numpy/scipy randomness
python process.py --seed 42

# Example 4: R-based tools
# In R script:
set.seed(42)

Create a mapping document:

# Non-Deterministic Tools in Our Pipeline

| Tool | Reason | Solution |
| ------------------- | ---------------------- | -------------------- |
| seqtk sample | Random sampling | Use --seed 42 |
| bowtie2 | Thread-based shuffling | Use --seed 42 |
| custom_py_script.py | numpy random | Set seed in code |
| variant_filter.R | R randomization | set.seed(42) in code |

All tools must use hard-coded seeds (42 chosen arbitrarily).
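Before recording a tool in this table, it's worth probing it directly: run the identical command twice and compare output hashes. A generic wrapper, exercised here with `sort` as a stand-in for a real tool:

```shell
#!/bin/bash
set -euo pipefail

# Run a command twice on identical input; report whether the
# two outputs are byte-identical.
probe_determinism() {
    local h1 h2
    h1=$("$@" | md5sum | cut -d' ' -f1)
    h2=$("$@" | md5sum | cut -d' ' -f1)
    if [ "$h1" = "$h2" ]; then
        echo "deterministic: $*"
    else
        echo "NON-DETERMINISTIC: $*"
    fi
}

printf 'b\na\nc\n' > input.txt
probe_determinism sort input.txt
```

A tool that fails this probe belongs in the table above with a seed or a sort step as its remedy.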

1.4 Save the Baseline Manifest

Create a versioned baseline that future pipelines must match:

# baseline_checksums.txt
# Generated: 2026-02-11T10:30:00Z
# Pipeline Version: 1.0
# Tools: bwa-0.7.17, samtools-1.18, bcftools-1.18

aligned.bam: 98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5
marked.bam: a2b3c4d5e6f70819e0f1a2b3c4d5e6f7
variants.vcf: e9f8d7c6b5a4f3e2d1c0b9a8f7e6d5c4
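One wrinkle: the `name: hash` layout above is readable, but `md5sum -c` expects `hash  name`. A small awk rewrite (illustrative, with a stand-in file) lets the saved manifest drive validation directly:

```shell
#!/bin/bash
set -euo pipefail

# Stand-in output plus a manifest in the "name: hash" style above
echo "data" > variants.vcf
printf 'variants.vcf: %s\n' "$(md5sum variants.vcf | cut -d' ' -f1)" \
    > baseline_checksums.txt

# Rewrite to md5sum's native "hash  name" format, then verify
awk -F': ' '{print $2 "  " $1}' baseline_checksums.txt > checkable.md5
md5sum -c checkable.md5
```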

Step 2: Migrate to Nextflow with Seed Control

Now migrate your in-house pipeline to Nextflow while preserving deterministic behavior through seed control.

2.1 Create Nextflow Processes with Seeds

Convert each bash step to a Nextflow process, explicitly setting seeds:

// modules/bwa_align.nf
process BWA_ALIGN {
tag "$meta.id"
label 'process_high'

container 'community.wave.seqera.io/library/bwa_samtools:56c9f8d5201889a4'

input:
tuple val(meta), path(reads)
path reference
path reference_index

output:
tuple val(meta), path("*.bam"), emit: bam
path "versions.yml", emit: versions

script:
"""
# bwa mem output order can vary with threading;
# coordinate-sorting with samtools makes the BAM deterministic

bwa mem \\
-t ${task.cpus} \\
${reference} \\
${reads} | \\
samtools sort \\
-@ ${task.cpus} \\
-o ${meta.id}.bam \\
-

# Verify output
samtools view -c ${meta.id}.bam

cat <<-END_VERSIONS > versions.yml
"${task.process}":
bwa: \$(echo \$(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*\$//')
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
// modules/mark_duplicates.nf
process MARK_DUPLICATES {
tag "$meta.id"
label 'process_medium'

container 'community.wave.seqera.io/library/samtools:1.18'

input:
tuple val(meta), path(bam)

output:
tuple val(meta), path("*marked.bam"), emit: bam
path "versions.yml", emit: versions

script:
"""
# samtools markdup is deterministic when input is sorted
samtools markdup \\
-M \\
${bam} \\
${meta.id}.marked.bam

cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
// modules/call_variants.nf
process CALL_VARIANTS {
tag "$meta.id"
label 'process_medium'

container 'community.wave.seqera.io/library/bcftools:1.18'

input:
tuple val(meta), path(bam)
path reference

output:
tuple val(meta), path("*.vcf"), emit: vcf
path "versions.yml", emit: versions

script:
"""
# bcftools is deterministic when output format is sorted
bcftools mpileup \\
-f ${reference} \\
${bam} | \\
bcftools call \\
-m \\
-o ${meta.id}.variants.vcf

# Sort VCF for reproducibility (bcftools sort is available in the container)
bcftools sort ${meta.id}.variants.vcf -o ${meta.id}.variants.sorted.vcf
mv ${meta.id}.variants.sorted.vcf ${meta.id}.variants.vcf

cat <<-END_VERSIONS > versions.yml
"${task.process}":
bcftools: \$(echo \$(bcftools --version 2>&1) | sed 's/^.*bcftools //; s/ .*\$//')
END_VERSIONS
"""
}

2.2 Create the Main Workflow

// nextflow.config
process {
shell = ['/bin/bash', '-euo', 'pipefail']

withLabel: process_high {
cpus = 8
memory = { 16.GB * task.attempt }
time = { 4.h * task.attempt }
}

withLabel: process_medium {
cpus = 4
memory = { 8.GB * task.attempt }
time = { 2.h * task.attempt }
}
}

profiles {
docker {
docker.enabled = true
}
singularity {
singularity.enabled = true
}
standard {
process.executor = 'local'
}
}
// main.nf
include { BWA_ALIGN } from './modules/bwa_align'
include { MARK_DUPLICATES } from './modules/mark_duplicates'
include { CALL_VARIANTS } from './modules/call_variants'

workflow {
// Input channel
input_samples = Channel
.fromPath(params.input_dir + '/*.fastq')
.map { file ->
def meta = [id: file.baseName]
tuple(meta, file)
}

// Reference files
reference = file(params.reference)
reference_index = file(params.reference + '.{amb,ann,bwt,pac,sa}')  // stage the full BWA index set

// Run pipeline
BWA_ALIGN(input_samples, reference, reference_index)
MARK_DUPLICATES(BWA_ALIGN.out.bam)
CALL_VARIANTS(MARK_DUPLICATES.out.bam, reference)
}
# params.yaml
input_dir: './data/reads'
reference: './data/reference/hg38.fasta'

Step 3: Validate with MD5 Comparison

After running the Nextflow pipeline, systematically compare outputs with the baseline.

3.1 Create a Validation Script

#!/bin/bash
# validate_migration.sh - Compare Nextflow outputs with bash baseline

set -euo pipefail

BASELINE_DIR="/results/baseline"
NEXTFLOW_DIR="/results/nextflow"
VALIDATION_REPORT="/results/validation_report.txt"

{
echo "=== Pipeline Migration Validation Report ==="
echo "Generated: $(date -Iseconds)"
echo ""

# Function to compare checksums
compare_file() {
local filename=$1
local baseline="${BASELINE_DIR}/${filename}"
local migrated="${NEXTFLOW_DIR}/${filename}"

if [ ! -f "$baseline" ]; then
echo "BASELINE MISSING: $filename"
return 1
fi

if [ ! -f "$migrated" ]; then
echo "MIGRATED MISSING: $filename"
return 1
fi

local baseline_md5=$(md5sum "$baseline" | cut -d' ' -f1)
local migrated_md5=$(md5sum "$migrated" | cut -d' ' -f1)

if [ "$baseline_md5" == "$migrated_md5" ]; then
echo "PASS: $filename"
echo " MD5: $baseline_md5"
return 0
else
echo "FAIL: $filename"
echo " Baseline MD5: $baseline_md5"
echo " Migrated MD5: $migrated_md5"
return 1
fi
}

# Compare all output files
echo "=== File Comparisons ==="
declare -i pass=0
declare -i fail=0

for file in aligned.bam marked.bam variants.vcf; do
if compare_file "$file"; then
pass=$((pass+1))
else
fail=$((fail+1))
fi
done

echo ""
echo "=== Summary ==="
echo "Passed: $pass"
echo "Failed: $fail"
echo ""

if [ $fail -eq 0 ]; then
echo "VALIDATION SUCCESSFUL: All outputs match baseline!"
exit 0
else
echo "VALIDATION FAILED: $fail file(s) differ from baseline"
exit 1
fi

} | tee "$VALIDATION_REPORT"

3.2 Run Comparison

#!/bin/bash
# Run the migration validation

# First, ensure baseline exists
if [ ! -d "/results/baseline" ]; then
echo "Error: Baseline not found. Run original_pipeline.sh first."
exit 1
fi

# Run Nextflow pipeline
echo "Running Nextflow migration..."
nextflow run main.nf \
--input_dir ./data/reads \
--reference ./data/reference/hg38.fasta \
-profile docker \
-resume

# Validate outputs
echo ""
echo "Validating migration..."
bash validate_migration.sh

3.3 Detailed Diff Analysis for Failed Files

If checksums don't match, investigate the difference:

#!/bin/bash
# deep_diff.sh - Detailed analysis of differences

BASELINE=$1
MIGRATED=$2
FILENAME=$(basename "$BASELINE")

echo "=== Detailed Comparison: $FILENAME ==="

# 1. Check file sizes
BASELINE_SIZE=$(stat -f%z "$BASELINE" 2>/dev/null || stat -c%s "$BASELINE")
MIGRATED_SIZE=$(stat -f%z "$MIGRATED" 2>/dev/null || stat -c%s "$MIGRATED")

echo "File sizes:"
echo " Baseline: $BASELINE_SIZE bytes"
echo " Migrated: $MIGRATED_SIZE bytes"

if [ "$BASELINE_SIZE" != "$MIGRATED_SIZE" ]; then
echo " Size difference detected"
fi

# 2. For BAM files: compare with samtools
if [[ "$FILENAME" == *.bam ]]; then
echo ""
echo "BAM file analysis:"

# Compare read counts
BASELINE_READS=$(samtools view -c "$BASELINE")
MIGRATED_READS=$(samtools view -c "$MIGRATED")
echo " Baseline reads: $BASELINE_READS"
echo " Migrated reads: $MIGRATED_READS"

if [ "$BASELINE_READS" != "$MIGRATED_READS" ]; then
echo " Read count mismatch!"
fi

# Compare first 10 reads
echo ""
echo " First 10 reads comparison:"
echo " --- Baseline ---"
samtools view "$BASELINE" | head -10
echo " --- Migrated ---"
samtools view "$MIGRATED" | head -10
fi

# 3. For VCF files: compare variants
if [[ "$FILENAME" == *.vcf ]]; then
echo ""
echo "VCF file analysis:"

# Count variants (skip header)
BASELINE_VARS=$(grep -v "^#" "$BASELINE" | wc -l)
MIGRATED_VARS=$(grep -v "^#" "$MIGRATED" | wc -l)

echo " Baseline variants: $BASELINE_VARS"
echo " Migrated variants: $MIGRATED_VARS"

if [ "$BASELINE_VARS" != "$MIGRATED_VARS" ]; then
echo " Variant count mismatch!"
fi

# Show first differences
echo ""
echo " First 5 variants (baseline vs migrated):"
diff <(grep -v "^#" "$BASELINE" | head -5) <(grep -v "^#" "$MIGRATED" | head -5) || true
fi

# 4. Compare line-by-line for text files
if [[ "$FILENAME" == *.vcf || "$FILENAME" == *.txt ]]; then
echo ""
echo "Line-by-line diff (first 20 differences):"
diff "$BASELINE" "$MIGRATED" | head -20 || true
fi

3.4 Handle Expected Differences

Some differences are acceptable. Document them:

# Known Acceptable Differences Between Bash and Nextflow

## 1. Tool Versions
- Bash: bwa 0.7.17, samtools 1.18
- Nextflow: bwa 0.7.17, samtools 1.18
→ If versions match, output should be identical

## 2. Threading Order (BAM files)
- Threading can affect read order in BAM files
- Solution: Use `samtools sort` or deterministic sort
- Verification: Extract and compare SAM headers + sort order

## 3. VCF Header Timestamps
- VCF files may have different generation timestamps
- Solution: Strip headers before comparison
# Compare VCF ignoring header differences
compare_vcf_body() {
local baseline=$1
local migrated=$2

diff \
<(grep -v "^##" "$baseline" | grep -v "^#CHROM" | sort) \
<(grep -v "^##" "$migrated" | grep -v "^#CHROM" | sort)
}

# Compare BAM files by extracting SAM
compare_bam_content() {
local baseline=$1
local migrated=$2

diff \
<(samtools view "$baseline" | sort -k1,1) \
<(samtools view "$migrated" | sort -k1,1)
}

Practical Example: Full Migration Walkthrough

Let's follow a complete migration scenario:

Original Bash Pipeline

#!/bin/bash
# variant_calling_pipeline.sh

set -euo pipefail

SAMPLE="sample_001"
REFERENCE="/ref/hg38.fasta"
READS="/data/${SAMPLE}.fastq.gz"
OUTPUT_DIR="/results/bash_original"

mkdir -p "$OUTPUT_DIR"

# Step 1: Alignment
bwa mem -t 8 "$REFERENCE" <(gunzip -c "$READS") | \
samtools sort -@ 4 -o "$OUTPUT_DIR/${SAMPLE}.aligned.bam" -

# Step 2: Mark Duplicates
samtools markdup "$OUTPUT_DIR/${SAMPLE}.aligned.bam" \
"$OUTPUT_DIR/${SAMPLE}.marked.bam"

# Step 3: Index
samtools index "$OUTPUT_DIR/${SAMPLE}.marked.bam"

# Step 4: Call variants
bcftools mpileup -f "$REFERENCE" "$OUTPUT_DIR/${SAMPLE}.marked.bam" | \
bcftools call -m -o "$OUTPUT_DIR/${SAMPLE}.vcf"

# Generate checksums
cd "$OUTPUT_DIR"
md5sum *.bam *.vcf > checksums.txt

Nextflow Migration

// main.nf - Migrated to Nextflow

workflow VARIANT_CALLING {
Channel
.fromPath(params.reads)
.map { file -> [[id: file.baseName], file] }
.set { input_reads }

BWA_ALIGN(input_reads, params.reference)
MARK_DUPLICATES(BWA_ALIGN.out.bam)
CALL_VARIANTS(MARK_DUPLICATES.out.bam, params.reference)
}

Validation Results

$ bash validate_migration.sh

=== Pipeline Migration Validation Report ===
Generated: 2026-02-11T15:45:30Z

=== File Comparisons ===
PASS: sample_001.aligned.bam
MD5: 98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5
PASS: sample_001.marked.bam
MD5: a2b3c4d5e6f70819e0f1a2b3c4d5e6f7
PASS: sample_001.vcf
MD5: e9f8d7c6b5a4f3e2d1c0b9a8f7e6d5c4

=== Summary ===
Passed: 3
Failed: 0

VALIDATION SUCCESSFUL: All outputs match baseline!

Automating Validation with nf-test (For Nextflow Migrations)

The manual validation approach works for ANY in-house pipeline migration. But if you're specifically migrating TO Nextflow, there's a better way: nf-test.

nf-test is a powerful testing framework built specifically for Nextflow pipelines. It automates the entire MD5 snapshot and validation workflow, making migration validation effortless and reproducible.

Why nf-test is Essential for Nextflow Migrations

Manual validation approach:

  • Generate baseline checksums manually
  • Create custom validation scripts
  • Maintain separate comparison logic
  • Hard to share with team members

nf-test approach:

  • Generates snapshots automatically
  • Version-controls snapshots in git
  • Built-in MD5 comparison
  • Runs in CI/CD pipelines
  • Team-friendly: snapshots are tracked in git

Using nf-test for Migration Validation

// tests/modules/bwa_align.nf.test

nextflow_process {
name "Test BWA_ALIGN"
script "modules/bwa_align.nf"
process "BWA_ALIGN"

test("Should align reads to reference") {
when {
process {
input[0] = [[id: "sample1"], file("data/reads.fastq")]
input[1] = file("data/reference.fasta")
input[2] = file("data/reference.fasta.bwt")
}
}

then {
assertAll(
{ assert process.success },
{ assert snapshot(process.out.bam).match() },
{ assert path(process.out.bam[0][1]).exists() }
)
}
}
}

What happens:

  1. First run: nf-test generates a snapshot of BAM file MD5 checksums
  2. Snapshot is saved in tests/modules/bwa_align.nf.test.snap
  3. Subsequent runs: nf-test compares outputs against the snapshot
  4. If anything changes: nf-test reports the difference
  5. When changes are intentional: update snapshots with nf-test test tests/modules/bwa_align.nf.test --update-snapshot

Example Snapshot File (Automatically Generated)

# tests/modules/bwa_align.nf.test.snap

{
"Should align reads to reference": {
"content": [
{
"0": [
[
{
"id": "sample1"
},
"sample1.bam:98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5"
]
],
"bam": [
[
{
"id": "sample1"
},
"sample1.bam:98a4d5f8c7b2e9f3a1d6c4b5e2f3a4b5"
]
]
}
],
"timestamp": "2026-02-11T15:45:30Z"
}
}

Full Migration Testing with nf-test

Create comprehensive tests for your entire pipeline:

// tests/workflows/variant_calling.nf.test

nextflow_workflow {
name "Test Complete Variant Calling Pipeline"
script "workflows/variant_calling.nf"
workflow "VARIANT_CALLING"

test("Complete pipeline: reads to variants") {
when {
workflow {
input[0] = [[id: "sample1"], file("data/reads.fastq.gz")]
}
}

then {
assertAll(
{ assert workflow.success },
// Snapshot bam output
{ assert snapshot(workflow.out.bam).match() },
// Snapshot vcf output
{ assert snapshot(workflow.out.vcf).match() },
// Verify intermediate files
{ assert path(workflow.out.bam[0][1]).exists() },
{ assert path(workflow.out.vcf[0][1]).exists() }
)
}
}
}

Running nf-test in Your Migration Workflow

# Generate initial snapshots from your Nextflow pipeline
nf-test test tests/main.nf.test --update-snapshot

# Compare against baseline (this is your validation step 3!)
nf-test test tests/main.nf.test

# If tests fail, review the diff
# If diff is intentional, update snapshots
nf-test test tests/main.nf.test --update-snapshot

# Run specific test file
nf-test test tests/workflows/variant_calling.nf.test

# Run in CI/CD to catch regressions
# .github/workflows/test.yml
- name: Test Nextflow Pipeline
run: nf-test test tests/ --profile docker

The Complete Migration Workflow with nf-test

  1. Step 1 (Original Pipeline): Generate baseline MD5 snapshots (same as before)

  2. Step 2 (Migrate to Nextflow): Write Nextflow pipeline + nf-test tests

  3. Step 3 (Validate with nf-test):

    # Generate initial snapshots from migrated Nextflow pipeline
    nf-test test tests/main.nf.test --update-snapshot

    # Subsequent runs: nf-test test tests/main.nf.test
    # nf-test automatically compares outputs against the stored snapshot

Benefits of nf-test Over Manual Validation

| Aspect | Manual Validation | nf-test |
| -------------------- | ----------------- | ------------------------------ |
| Snapshot generation | Manual scripting | Automatic |
| Version control | External files | Git-tracked |
| Team collaboration | Share scripts | Share snapshots |
| Regression detection | Manual comparison | Automatic in CI/CD |
| Update process | Rerun scripts | nf-test test --update-snapshot |
| Documentation | Separate docs | Tests are docs |
| Maintenance | High effort | Low effort |

Key Considerations and Best Practices

1. Version Pinning

Always pin tool versions in both pipelines:

# Bash
bwa 0.7.17
samtools 1.18
bcftools 1.18

# Nextflow (container)
container 'community.wave.seqera.io/library/bwa_samtools:56c9f8d5201889a4'
# Container hash includes specific versions
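A cheap guard is to assert versions at runtime before any data is touched. A sketch of such a check — the final call uses literal values to stay self-contained; in practice you would capture `actual` from `samtools --version` itself:

```shell
#!/bin/bash
set -euo pipefail

# Fail fast if a tool's reported version drifts from the pinned one.
check_version() {
    local tool=$1 expected=$2 actual=$3
    if [ "$actual" = "$expected" ]; then
        echo "OK: $tool $actual"
    else
        echo "MISMATCH: $tool expected $expected, got $actual" >&2
        return 1
    fi
}

# In a real pipeline:
#   actual=$(samtools --version | head -1 | awk '{print $2}')
check_version samtools 1.18 1.18
```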

2. Handling Floating-Point Precision

Some tools produce slightly different floating-point values due to compilation or CPU differences:

# For VCF QUAL scores, allow small differences
compare_vcf_quality() {
local baseline=$1
local migrated=$2
local tolerance=0.1

# Extract QUAL scores and compare with tolerance
paste \
<(grep -v "^#" "$baseline" | awk '{print $6}') \
<(grep -v "^#" "$migrated" | awk '{print $6}') | \
awk -v tol=$tolerance '{
diff = ($1 - $2)
if (diff < 0) diff = -diff
if (diff > tol && $1 != ".") {
print "DIFFER: " $1 " vs " $2
}
}'
}

3. Documentation Template

Create a migration checklist:

# Pipeline Migration Checklist

## Pre-Migration
- [ ] Document original bash pipeline
- [ ] Record all tool versions
- [ ] Generate baseline MD5 checksums
- [ ] Test reproducibility (3+ runs)
- [ ] Identify non-deterministic components

## Migration
- [ ] Convert each step to Nextflow process
- [ ] Set seeds for random operations
- [ ] Configure containerization
- [ ] Implement resource directives
- [ ] Add error handling

## Validation
- [ ] Run Nextflow with same inputs
- [ ] Generate MD5 checksums
- [ ] Compare all outputs
- [ ] Document acceptable differences
- [ ] Validate on multiple samples

## Sign-Off
- [ ] All checksums match or differences documented
- [ ] Code review completed
- [ ] Team approval
- [ ] Migration complete

Summary: Confidently Migrating Any In-House Pipeline to Enterprise Level

Whether you're migrating from bash scripts, Python workflows, Snakemake pipelines, custom C++ tools, or anything else, the same 3-step validation framework applies. The MD5-based validation approach is universal and language-agnostic.

By following a systematic 3-step approach, you can validate that your new enterprise pipeline produces identical results to your original in-house system:

Step 1: Establish Baseline (From Your Original Pipeline)

  • Run original pipeline multiple times
  • Verify determinism (same inputs = same outputs)
  • Record MD5 checksums of all outputs
  • Document tool versions and seeds
  • Works with: bash, Python, Snakemake, custom code, etc.

Step 2: Migrate with Seed Control (To Your Target System)

  • Convert each pipeline step to your target format
  • Hard-code seeds for random operations
  • Use containers to match tool versions
  • Maintain identical resource configurations
  • Target options: Nextflow, Snakemake, CWL, etc.

Step 3: Validate with Checksums (Automated or Manual)

  • Run new pipeline with identical inputs
  • Generate MD5 checksums for all outputs
  • Compare against baseline
  • Document acceptable differences
  • Sign off on migration

If targeting Nextflow: Use nf-test to automate steps 2-3 with built-in snapshot management and CI/CD integration.

Key Takeaways

  1. The 3-step framework works for ANY in-house pipeline - Whether you're migrating from bash, Python, Snakemake, or custom code, the MD5-based validation approach is universal

  2. MD5 checksums are your source of truth - They provide byte-for-byte verification that outputs are identical, regardless of source or target format

  3. Reproducibility requires explicit seed control - Any non-deterministic operation must use hard-coded seeds (42 is an arbitrary choice—use what makes sense for your team)

  4. Version pinning matters - Use containers to guarantee identical tool versions between original and migrated pipelines

  5. Document everything - Record versions, seeds, checksums, and acceptable differences for your team's understanding

  6. Validate on multiple samples - Differences might only appear with certain data characteristics or edge cases

  7. If migrating TO Nextflow: use nf-test - It automates the entire validation workflow with version-controlled snapshots and CI/CD integration

  8. Make failures visible - Use set -o pipefail and explicit error checking in both pipelines

Once you've validated that outputs match, you can confidently replace your in-house pipeline with an enterprise system, knowing that you've maintained scientific reproducibility while gaining the benefits of professional workflow management.

Your migrated pipeline is now:

  • Reproducible across teams and platforms
  • Scalable (to HPC/cloud without modification)
  • Maintainable by the broader community
  • Validated against your original implementation
  • Ready for production and publication