13 posts tagged with "bioinformatics"

From Bash to Nextflow: GATK Best Practice With Nextflow (Part 2)

41 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1, we built a complete 16-step GATK variant calling pipeline in bash, perfect for academic research and 1-10 samples. But what happens when you need to scale to 100+ samples? This is where Nextflow becomes essential.

Repository: All code from this tutorial is organized in the variant-calling-gatk-pipeline-best-practice-from-scratch repository. The structure follows best practices, with separate directories for the bash (workflows/bash/) and Nextflow (workflows/nextflow/) implementations.

Building a Reproducible GATK Variant Calling Bash Workflow with Pixi (Part 1)

30 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Before you can transform a bash workflow to Nextflow, you need a solid, reproducible bash baseline. This hands-on guide walks through building a complete 16-step GATK variant calling workflow using bash scripts and Pixi for environment management, following GATK best practices with GVCF mode and hard filtering. While this traditional approach works for academic research and proof-of-concept work, scaling to thousands of samples in industry requires the reproducibility and reliability of workflow managers like Nextflow, which we'll cover in Part 2.
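To give a flavor of the environment-management side, here is a minimal sketch of how Pixi might pin the toolchain for such a workflow; the project name, channels, package list, and script path are illustrative, not the exact setup used in the post:

    # Create a project-local, locked environment (bioconda channel added for genomics tools)
    pixi init gatk-workflow -c conda-forge -c bioconda
    cd gatk-workflow
    pixi add bwa samtools bcftools gatk4

    # Run a workflow step inside the locked environment
    pixi run bash scripts/01_align.sh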

Working with Remote Files using bcftools and samtools (HTSlib)

18 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

HTSlib-based tools like bcftools and samtools provide powerful capabilities for working with genomic data stored on remote servers. Whether your data is in AWS S3, accessible via FTP, or hosted on HTTPS endpoints, these tools allow you to efficiently query and subset remote files without downloading entire datasets. This guide covers authentication, remote file access patterns, and practical workflows.
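For a taste of what that looks like in practice, here is a minimal sketch; the URLs and region are placeholders, the remote files must be bgzipped and indexed with the index available alongside them, and S3 access assumes an HTSlib build with libcurl/S3 support and credentials configured:

    # Stream only one region of a remote, tabix-indexed VCF over HTTPS
    bcftools view -r chr1:1000000-2000000 \
        https://example.com/cohort.vcf.gz -Oz -o chr1_slice.vcf.gz

    # Extract reads from one region of a remote BAM on S3 without downloading the whole file
    samtools view -b s3://my-bucket/sample.bam chr1:1000000-2000000 > region.bam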

Docker Out of Docker: Running Interactive Web Applications for Data Analysis

10 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Running interactive web applications like RStudio, JupyterLab, and Code Server in containers is a powerful way to provide reproducible analysis environments. However, users often need to spawn additional containerized tools from within these applications. Docker out of Docker (DooD) elegantly solves this by allowing containers to access the host's Docker daemon. This post explains how to set up DooD for interactive web applications and why it's the right approach for bioinformatics workflows.
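The core of the setup is simply bind-mounting the host's Docker socket into the interactive container. A minimal sketch, assuming the image has the docker CLI available (or you install it) and with the image, port, and password chosen purely as examples:

    # Launch RStudio Server with access to the host's Docker daemon (DooD)
    docker run -d --name rstudio \
        -p 8787:8787 \
        -e PASSWORD=changeme \
        -v /var/run/docker.sock:/var/run/docker.sock \
        -v "$PWD":/home/rstudio/project \
        rocker/rstudio

    # docker commands run inside this container now talk to the host daemon,
    # so any containers they start are siblings on the host, not nested children.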

Unix Pipes in Bioinformatics: How Streaming Data Reduces Memory and Storage

22 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Unix pipes (|) are one of the most powerful yet underutilized features in bioinformatics. They allow you to chain multiple commands together, processing data in a streaming fashion that dramatically reduces memory usage and disk I/O. This post explores why pipes are essential for bioinformatics work and shows how they work under the hood.
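A classic example of the pattern: streaming alignments straight into a sorted BAM so the large intermediate SAM never touches disk (file names and thread counts are placeholders):

    # Align, convert, and sort in one streaming pass; no intermediate SAM is written
    bwa mem -t 8 ref.fa reads_R1.fastq.gz reads_R2.fastq.gz \
        | samtools sort -@ 4 -o sample.sorted.bam -
    samtools index sample.sorted.bam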

Containers in Bioinformatics: Community Tooling and Efficient Docker Building

21 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Docker containers are revolutionizing bioinformatics by making analyses reproducible and portable across platforms. But what problems can they actually solve? This post shows real-world applications of containers in bioinformatics workflows, then guides you through the simplest possible ways to use, build, and debug them.
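As a quick taste of the community-tooling angle, a sketch of running a prebuilt BioContainers image against local data; the exact image tag below is illustrative, so check quay.io/biocontainers for a current one:

    # Run a community-built samtools image without installing anything locally
    docker run --rm -v "$PWD":/data -w /data \
        quay.io/biocontainers/samtools:1.19--h50ea8bc_0 \
        samtools flagstat sample.bam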

Bioinformatics Workflow Template: Standardizing Python Pipelines with Modular Design

13 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Building reproducible bioinformatics pipelines is hard. Every project starts from scratch with its own testing, CI/CD, and deployment strategy. What if you could clone a template, add your analysis tools, and be ready to go?

This post introduces a standardized bioinformatics workflow template featuring consistent testing, CI/CD, and project structure. Developed from real production experience, bioinfor-wf-template reduces setup time from days to minutes, ensures research reproducibility, and promotes modular, reusable code. It is Python-based and ideal for proof-of-concept projects. Support for more advanced and widely adopted bioinformatics frameworks (such as Snakemake and Nextflow) is planned, applying the same core principles while leveraging their native testing systems.

Running GitHub Actions Locally with act: 5x Faster Development

12 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

GitHub Actions are powerful for automating bioinformatics pipelines, but waiting 5-10 minutes for each cloud run is painful during development. act lets you run GitHub Actions workflows locally on your machine in seconds, slashing feedback time by 5x.

In this post, we'll explore act, a command-line tool that runs GitHub Actions locally using Docker. Perfect for testing ML pipelines, gene expression analysis, and CI/CD workflows before pushing to GitHub.
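Typical usage is a single command from the repository root; the job name below is a placeholder for whatever your workflow defines:

    # List the jobs act can see in .github/workflows/
    act -l

    # Run the workflows triggered by a push event
    act push

    # Run just one job, e.g. a fast lint/test job
    act -j test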

Machine Learning in Bioinformatics Part 1: Building KNN from Scratch

12 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Machine learning is transforming bioinformatics, enabling us to discover patterns in biological data. In this first part, we'll build a K-Nearest Neighbors (KNN) classifier from scratch using only Python, then apply it to simulated gene expression data. This post is designed for anyone who knows basic Python and biology; no advanced ML experience required!

Introduction to AI/ML in Bioinformatics: Classification Models & Evaluation

12 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Machine learning is transforming bioinformatics by automating pattern discovery from biological data. But what problems can it actually solve? This post shows real-world applications of classification models, then builds the simplest possible classifiers to understand how they work and how to evaluate them. This is Part 0: the practical foundation before diving into complex algorithms like KNN.

The Evolution of Version Control - CI/CD in bioinformatics (Part 2)

14 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

Welcome to Part 2 of our series on version control in bioinformatics. In Part 1, we introduced Git fundamentals, branching strategies, and collaborative workflows. In this post, we'll dive into how Continuous Integration and Continuous Deployment (CI/CD) can transform your bioinformatics projects. If these concepts are new to you, don't worry; this guide will walk you through managing your bioinformatics repository to ensure your work is easily reproducible on any machine. Whether your server is wiped or you need to spin up a new virtual machine, you'll be able to quickly rerun your pipeline. With CI/CD, every code update can automatically trigger tests on a small dataset to verify everything works before scaling up, ensuring that new changes don't break your results or workflows.

The Evolution of Version Control - Git's Role in Reproducible Bioinformatics (Part 1)

13 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1 (this post), we explore the history of Git, its integration with GitHub, and basic hands-on tutorials. Part 2 (coming soon) will cover real-world bioinformatics examples and advanced workflows with best practices.

The hands-on tutorials in this part focus on practical applications, including NGS quality control using fastqc and multiqc.