
Working with Remote Files using bcftools and samtools (HTSlib)

18 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

HTSlib-based tools like bcftools and samtools provide powerful capabilities for working with genomic data stored on remote servers. Whether your data is in AWS S3, accessible via FTP, or hosted on HTTPS endpoints, these tools allow you to efficiently query and subset remote files without downloading entire datasets. This guide covers authentication, remote file access patterns, and practical workflows.

1. How HTSlib Handles Remote Files with Random Access

HTSlib is a C library that powers many genomic tools. Tools built on HTSlib—including bcftools, samtools, tabix, and others—inherit native support for remote file access via random access queries. This section explains how HTSlib works with remote files and which protocols are supported.

1.1 Random Access Mechanism: The Foundation

HTSlib enables efficient remote file access through a byte-range request mechanism. Instead of downloading entire files, it uses file indexes to determine exact byte positions and requests only those bytes from the remote server.

How it works:

User Request:
bcftools view s3://bucket/variants.vcf.gz chr1:1000-2000

HTSlib Process:
1. Reads index file (.vcf.gz.tbi) to find byte ranges
2. Index lookup: "chr1:1000-2000 is at bytes 5000-15000"
3. Sends byte-range request to remote server
4. Remote server returns ONLY requested bytes (10KB, not 1GB!)
5. HTSlib decompresses partial data
6. Returns matching variants to user

Result: 10KB downloaded instead of 1GB
Time: 1-5 seconds instead of 5-30 minutes

Critical requirement: Indexed files (.tbi, .bai, .csi)

  • Without index: Must download entire file from start (fails on large files)
  • With index: Downloads only necessary bytes (efficient and fast)
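
You can reproduce the core of this mechanism by hand with curl. A minimal sketch, assuming a hypothetical server (example.org) that supports Range requests, using the illustrative byte offsets from the walkthrough above:

# Fetch the small index first, as HTSlib would (hypothetical URL)
curl -s -O https://example.org/data/variants.vcf.gz.tbi

# Then request only the bytes the index points at (offsets illustrative)
curl -s -H "Range: bytes=5000-15000" \
https://example.org/data/variants.vcf.gz -o slice.bin

# The server answers 206 Partial Content with ~10KB, not the whole file
ls -lh slice.bin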

1.2 All Supported URL Schemes

HTSlib supports a comprehensive list of protocols via libcurl. Check your installation:

# View all supported protocols (uses the Pixi environment set up in section 1.6)
pixi run samtools --version 2>&1 | grep -A 20 "HTSlib URL"

Complete output example:

HTSlib URL scheme handlers present:
built-in: file, preload, data
Google Cloud: gs+http, gs+https, gs
libcurl: gophers, smtp, wss, smb, rtsp, tftp, pop3, smbs, imaps, pop3s, ws, ftps, ftp, gopher, imap, http, https, sftp, smtps, scp, dict, mqtt, telnet
Amazon S3: s3+https, s3, s3+http
crypt4gh: crypt4gh
mem: mem

Protocol categories:

Category      | Protocols                       | Typical Use
------------- | ------------------------------- | -------------------------
Built-in      | file, preload, data             | Local files, pipes
Cloud Storage | s3://, gs://                    | AWS, Google Cloud
HTTP-based    | http://, https://, ftps://      | Web servers, FTP over SSL
Other         | ftp://, sftp://, scp://, smb:// | Traditional file servers
Specialized   | mqtt, dict, imap, etc.          | Not typical for genomics

Note on protocol coverage: This blog focuses on the most common protocols for bioinformatics: S3, HTTPS, FTP, and GCS. Many other protocols are available but less commonly used for genomic data.

1.3 Typical Bioinformatics Protocols (Covered in This Blog)

The following protocols are widely used for storing and accessing genomic data:

S3 (Amazon S3 and compatible)

  • URL scheme: s3://bucket-name/path/file.vcf.gz
  • Use case: Cloud-based genomic repositories, AWS-hosted data
  • Random access: Via byte-range requests, AWS SDK for auth
  • Speed: Fastest (typically <1 second for indexed queries)
  • See section: 2 - S3 Remote File Access

HTTPS (Web servers)

  • URL scheme: https://server.com/path/file.vcf.gz
  • Use case: Public databases (NCBI, Ensembl), institutional servers
  • Random access: Via HTTP Range header (206 Partial Content)
  • Speed: Fast (1-2 seconds for indexed queries)
  • See section: 4 - HTTPS Remote File Access

FTP (File Transfer Protocol)

  • URL scheme: ftp://server.com/path/file.vcf.gz
  • Use case: Legacy databases, some public repositories
  • Random access: Via FTP REST command for seeking
  • Speed: Slower (3-5 seconds for indexed queries)
  • See section: 3 - FTP Remote File Access

GCS (Google Cloud Storage)

  • URL scheme: gs://bucket-name/path/file.vcf.gz
  • Use case: Google Cloud-hosted genomic data
  • Random access: Similar to S3 (byte-range requests)
  • Speed: Fast (typically <1 second for indexed queries)
  • Note: Requires Google Cloud SDK credentials
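
Because HTSlib abstracts the transport behind the URL scheme, the same command works across all four protocols; only the URL changes. A sketch with hypothetical bucket and server paths:

# Identical region query over four transports (all paths hypothetical)
bcftools view s3://my-bucket/cohort.vcf.gz chr1:1000000-2000000
bcftools view gs://my-bucket/cohort.vcf.gz chr1:1000000-2000000
bcftools view https://example.org/data/cohort.vcf.gz chr1:1000000-2000000
bcftools view ftp://example.org/data/cohort.vcf.gz chr1:1000000-2000000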

1.4 Key Advantages of Remote Access

  • Avoid large downloads: Query specific regions without transferring entire files
  • Cost efficiency: Pay egress only for the bytes you actually access, not for full-file downloads
  • Scalability: Work with datasets larger than local disk space
  • Collaboration: Reference shared datasets directly without copying
  • Real-time access: Always work with latest version of data

1.5 Performance Considerations: Why Indexes Matter

Remote file access requires indexed files for efficient region-based queries:

  • VCF files: Must have .vcf.gz.tbi index
  • BAM files: Must have .bam.bai index
  • BCF files: Must have .bcf.csi index

Bandwidth comparison for 1GB VCF file querying chr1:1-100000:

Scenario           | Bytes Transferred | Time         | Method
------------------ | ----------------- | ------------ | --------------------
Indexed S3         | ~700KB            | <1 second    | Byte-range request
Indexed HTTPS      | ~700KB            | 1-2 seconds  | HTTP Range header
Indexed FTP        | ~700KB            | 3-5 seconds  | FTP REST seek
Unindexed remote   | 1000MB+           | 5-30 minutes | Sequential download
Local indexed file | ~700KB            | <0.5 seconds | Direct file seek

Critical insight: Indexed remote access downloads only 0.07% of file size!

1.6 Setting Up bcftools and samtools with Pixi

The easiest way to install bcftools and samtools with full remote file support is using Pixi with the conda-forge and bioconda channels. These channels include pre-compiled binaries with S3, FTP, and HTTPS support enabled in HTSlib.

Installation Steps

Create a new directory for your project and initialize a Pixi environment:

# Create project directory
mkdir -p ~/htslib-project && cd ~/htslib-project

# Initialize Pixi with conda-forge and bioconda channels
pixi init --channel conda-forge --channel bioconda .

# Add bcftools and samtools
pixi add bcftools samtools

This creates a pixi.toml file:

[workspace]
channels = ["conda-forge", "bioconda"]
name = "htslib-project"
platforms = ["linux-64"]
version = "0.1.0"

[dependencies]
bcftools = ">=1.23,<2"
samtools = ">=1.23,<2"

Verify Installation

# Run bcftools version
pixi run bcftools --version

# Run samtools version
pixi run samtools --version

Both tools will report HTSlib features including S3, GCS, and libcurl support:

bcftools 1.23
Using htslib 1.23
...

Samtools 1.23
...
HTSlib compilation details:
Features: build=configure libcurl=yes S3=yes GCS=yes ...

The key line is S3=yes indicating S3 support is enabled.

Why Pixi for Remote File Access?

  • Pre-compiled with remote support: Bioconda packages include HTSlib compiled with libcurl, S3, and GCS support
  • Easy dependency management: All required libraries included automatically
  • Reproducible environments: pixi.lock ensures the same versions across machines
  • No system-wide installation: Keep tools isolated in your project environment
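
If you prefer not to prefix every command with pixi run, activate the environment once per shell session:

# Activate the project environment in the current shell
pixi shell

# Tools are now on PATH directly
bcftools --version | head -n 2
samtools --version | head -n 2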

1.7 Protocol Verification: HTTP Range Requests

For HTTPS/HTTP remote access to work efficiently, the server must support HTTP Range Requests. Verify this before relying on remote queries:

# Test if server supports Range Requests
curl -I -H "Range: bytes=0-500" https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all_papu.vcf.gz

# Expected response: HTTP/1.1 206 Partial Content
# This means byte-range requests work! ✓

Response headers will show:

HTTP/1.1 206 Partial Content   ← supports Range requests
Accept-Ranges: bytes           ← confirms support
Content-Length: 501            ← returns partial content

2. S3 Remote File Access

AWS S3 is a common repository for genomic datasets. HTSlib supports S3 through the s3:// protocol.

2.1 S3 Authentication

Using AWS Credentials

Set AWS credentials as environment variables:

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="us-east-1"

Using AWS IAM Roles (EC2 Instances)

If running on EC2 with an IAM role attached, HTSlib automatically uses the role's credentials—no environment variables needed.

Using AWS CLI Configuration

HTSlib respects ~/.aws/credentials and ~/.aws/config:

aws configure  # Interactive setup

2.2 Querying S3-hosted VCF Files

Query a specific genomic region from a remote VCF file:

# Query chromosome 1, positions 1-1000000
bcftools view s3://my-bucket/data/variants.vcf.gz chr1:1-1000000

# Count variants in a region
bcftools view -H s3://my-bucket/data/variants.vcf.gz chr1:1-1000000 | wc -l

# Extract samples and subset to region
bcftools view -s sample1,sample2 s3://my-bucket/data/variants.vcf.gz chr1:1-1000000 -O v

2.3 Querying S3-hosted BAM Files

Extract reads from specific genomic regions:

# Query region from remote BAM
samtools view s3://my-bucket/data/aligned.bam chr1:1-1000000

# Count reads in region
samtools view -c s3://my-bucket/data/aligned.bam chr1:1-1000000

# Extract to local BAM file
samtools view -b s3://my-bucket/data/aligned.bam chr1:1-1000000 -o region.bam
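
The same region queries work for remote CRAM files, with one extra requirement: HTSlib must be able to fetch the reference sequences. The REF_CACHE/REF_PATH mechanism below is standard HTSlib behavior; the bucket path is hypothetical:

# Cache reference sequences locally, falling back to the EBI MD5 server
mkdir -p ~/.cache/hts-ref
export REF_CACHE=~/.cache/hts-ref/%2s/%2s/%s
export REF_PATH="$REF_CACHE:http://www.ebi.ac.uk/ena/cram/md5/%s"

# Region query on a remote CRAM (hypothetical path)
samtools view s3://my-bucket/data/aligned.cram chr1:1-1000000 | head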

2.4 Region-based Subsetting Workflow

Efficient workflow for extracting subsets from S3:

# 1. List available S3 files (using AWS CLI)
aws s3 ls s3://my-bucket/data/ --recursive

# 2. Query specific region from VCF
bcftools view \
s3://my-bucket/data/chr1.vcf.gz \
chr1:10000000-20000000 \
-O v -o local_subset.vcf

# 3. Validate and compress
bgzip -f local_subset.vcf
tabix -p vcf local_subset.vcf.gz

# 4. Annotate or further process locally
bcftools annotate local_subset.vcf.gz -a annotation.vcf.gz -c ID

2.5 Working with Public Genomic Data

Public genomic repositories are excellent for testing remote file access. While S3 is available, HTTPS endpoints are often more reliable for quick testing.

Query NCBI dbSNP data via HTTPS:

NCBI provides public access to common variants. Test with real data:

# Query NCBI dbSNP common variants (via HTTPS)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
| pixi run bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' \
| head -20

Expected output showing real variants:

1	10177	A	AC
1	10352	T	TA
1	10352	T	TA
1	10616	CCGCCGTTGCAAAGGCGCGCCG	C
1	10642	G	A
1	11008	C	G
1	11012	C	G

Discover available files:

# List NCBI FTP VCF files
curl -s https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/ \
| grep -i ".vcf.gz\"" | grep -o "href=\"[^\"]*\"" | cut -d'"' -f2

# Or for 1000 Genomes in S3:
aws s3 ls s3://1000genomes/phase3/data/ --no-sign-request | head -20

Region-based queries (with indexed files):

# Query specific chromosome region
# Notes: requires a .tbi index at the remote location, and this dbSNP build
# names chromosomes without the "chr" prefix (see the output above)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
1:1000000-2000000

Note on S3 vs HTTPS:

  • HTTPS (like NCBI): Reliable, requires no credentials for public data
  • S3: Fastest for large datasets, requires AWS credentials or public bucket setup
  • FTP: Alternative for some repositories, generally slower than HTTPS
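
Public S3 buckets can also be reached through their HTTPS endpoints, which sidesteps AWS credential setup entirely. A sketch; the exact 1000 Genomes key below is an assumption, so confirm it with a bucket listing first:

# Public buckets expose HTTPS endpoints: https://<bucket>.s3.amazonaws.com/<key>
# Verify the key first: aws s3 ls s3://1000genomes/release/20130502/ --no-sign-request
pixi run bcftools view \
"https://1000genomes.s3.amazonaws.com/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" \
22:16050000-16060000 | head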

3. FTP Remote File Access

FTP servers host many public genomic datasets. HTSlib supports the ftp:// protocol.

3.1 FTP Authentication

Anonymous Access

Many public repositories allow anonymous FTP:

# No credentials needed
bcftools view ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/VCF/common_all.vcf.gz chr1:1-1000000

Authenticated FTP

For private FTP servers, include credentials in the URL:

bcftools view ftp://username:password@ftp.example.com/data/variants.vcf.gz chr1:1-1000000

Security Note: URL-embedded credentials are visible in process listings. A safer option is a ~/.netrc file, which HTSlib's libcurl handler lets libcurl consult:

# Create ~/.netrc with the FTP credentials and lock it down
cat > ~/.netrc << 'EOF'
machine ftp.example.com
login your_username
password your_password
EOF
chmod 600 ~/.netrc

# Credentials are then picked up without appearing on the command line
bcftools view ftp://ftp.example.com/data/variants.vcf.gz chr1:1-1000000

3.2 Querying FTP-hosted Files

Common public genomic repositories accessible via FTP:

NCBI FTP Repository

# Query NCBI's common variants file
bcftools view \
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all.vcf.gz \
chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%AC\t%AF\n' \
| head -10

Ensembl FTP Repository

# Query Ensembl VCF (example path)
bcftools view \
ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/VEP/Homo_sapiens_phenotypes_associated_variants-GRCh38.vcf.gz \
1:1-1000000

# List available Ensembl files
ftp ftp.ensembl.org # Interactive FTP session

3.3 FTP Performance Tips

  • Network stability: FTP can be slower than HTTPS for single queries. Consider batch processing.
  • Index availability: Verify .tbi or .csi indexes are present on the FTP server for efficient region queries.
  • Timeout handling: Set timeout environment variables for unreliable connections:
export CURL_CONNECT_TIMEOUT=30
export CURL_TIMEOUT=60
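
Since there is no reliable command-line knob for FTP timeouts, a small retry wrapper is often the practical fix. A sketch with a hypothetical URL:

# Retry a flaky FTP query up to 3 times before giving up (URL hypothetical)
for attempt in 1 2 3; do
    if bcftools view ftp://ftp.example.org/data/variants.vcf.gz chr1:1-1000000 \
        -O z -o subset.vcf.gz; then
        break
    fi
    echo "attempt $attempt failed, retrying in 10s..." >&2
    sleep 10
done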

4. HTTPS Remote File Access

HTTPS endpoints often require token-based authentication and are increasingly common for cloud-hosted data.

4.1 HTTPS Authentication Methods

Bearer Token Authentication

HTSlib's command-line tools provide no general switch for sending an Authorization: Bearer header, so token support depends on the endpoint. If the server also accepts HTTP basic authentication, credentials can be embedded in the URL (visible in process listings, so use with care):

# Basic-auth fallback for servers that accept it (less secure)
bcftools view https://token:your_bearer_token@api.example.com/data/variants.vcf.gz chr1:1-1000000

For endpoints that strictly require a Bearer header, stream through curl as shown next.

API Keys in Headers

For services requiring custom headers, use curl with a named pipe:

# Create a wrapper using curl and process substitution
bcftools view \
<(curl -H "Authorization: Bearer TOKEN" \
https://api.example.com/data/variants.vcf.gz \
--output - ) \
chr1:1-1000000

Basic HTTP Authentication

# URL-embedded credentials (not recommended for security)
bcftools view https://username:password@secure.example.com/data/variants.vcf.gz chr1:1-1000000

4.2 Practical HTTPS Workflows

Query Public HTTPS-hosted Data

Many databases provide HTTPS access to genomic data:

# Example: Query genome annotation from HTTPS endpoint
bcftools view https://example.genomics.org/public/variants.vcf.gz chr1:1-1000000

# Extract specific information
bcftools view https://example.genomics.org/public/variants.vcf.gz chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\n'

Download Indexed Files Locally

For repeated queries, cache the index locally and point HTSlib at it explicitly with the filename##idx##indexname syntax:

# Download index file (small, fast)
curl -o variants.vcf.gz.tbi https://api.example.com/data/variants.vcf.gz.tbi

# Query the remote file using the local index
bcftools view "https://api.example.com/data/variants.vcf.gz##idx##variants.vcf.gz.tbi" chr1:1-1000000

4.3 SSL/TLS Certificate Issues

If you encounter certificate validation errors, first confirm the diagnosis with curl, then point HTSlib at the correct CA bundle rather than disabling verification:

# Diagnose: -k skips verification, so success here isolates a certificate problem
curl -k -I https://api.example.com/data/variants.vcf.gz

# Point libcurl (and HTSlib's libcurl handler) at your CA bundle
export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000

Security Note: Skip certificate verification only for trusted internal endpoints or testing; HTSlib itself has no documented switch for it.

5. Indexing and Efficient Region Queries

Efficient remote file access depends on proper indexing. HTSlib tools use byte-range requests to fetch only necessary data.

5.1 Creating Indexes

Create Tabix Index for VCF Files

# For bgzip-compressed VCF files
tabix -p vcf variants.vcf.gz

# Verify index
ls -lh variants.vcf.gz*

Create CSI Index for BCF Files

# CSI indexes support chromosomes longer than the 512Mb limit of TBI/BAI
bcftools index -c variants.bcf

# Verify index
ls -lh variants.bcf.csi

Create BAI Index for BAM Files

# Standard BAM indexing
samtools index aligned.bam

# Creates aligned.bam.bai
ls -lh aligned.bam.bai
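
When you host files yourself, keep each index next to its data file under the same name plus the index suffix; HTSlib looks for the index alongside the remote file by default. For S3, for example:

# Upload data and index side by side so remote region queries can find the index
aws s3 cp variants.vcf.gz s3://my-bucket/data/
aws s3 cp variants.vcf.gz.tbi s3://my-bucket/data/
aws s3 cp aligned.bam s3://my-bucket/data/
aws s3 cp aligned.bam.bai s3://my-bucket/data/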

5.2 Verifying Index Integrity

# Check if VCF index is valid
tabix -l variants.vcf.gz | head -10

# Query a specific region (tests index)
bcftools view variants.vcf.gz chr1:1-1000 -H | wc -l

5.3 Performance: Indexed vs Unindexed Remote Access

Indexed remote access (with region query):

  • Seeks to file index
  • Downloads byte-range containing target region
  • Returns ~100KB-1MB of data

Unindexed remote access:

  • Must scan entire file sequentially
  • Downloads entire file
  • Returns all data (can be GB-sized)

Example performance comparison:

# Indexed query (fast, ~1-5 seconds)
time bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 > /dev/null

# Without an index the same query must stream the whole file (minutes, or it
# fails outright); HTSlib uses the index automatically whenever one is present

5.4 Testing Remote File Access with Local Files

Before working with remote files, it's helpful to test bcftools and samtools locally. Use this workflow to verify your installation and practice region queries.

Create test VCF file:

# Create test directory
mkdir -p ~/genomics-test && cd ~/genomics-test

# Create test VCF (columns written with spaces here, converted to tabs below)
cat > test.vcf << 'EOF'
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
chr1 10000 . A G 60 . DP=100 GT 0/1 1/1
chr1 20000 . C T 50 . DP=80 GT 0/0 0/1
chr1 30000 . G A 70 . DP=120 GT 1/1 0/1
chr1 100000 . T C 55 . DP=90 GT 0/1 0/0
EOF

# VCF columns must be TAB-separated; convert spaces on non-## lines
sed -i '/^##/!s/ /\t/g' test.vcf

# Compress and index (requires bgzip and tabix from pixi)
pixi run bgzip -f test.vcf
pixi run tabix -p vcf test.vcf.gz

Test bcftools region queries:

# Query specific region
pixi run bcftools view test.vcf.gz chr1:1-50000

# Extract specific samples
pixi run bcftools view -s sample1 test.vcf.gz chr1:10000-50000

# Count variants in region
pixi run bcftools view -H test.vcf.gz chr1:1-100000 | wc -l

Create and test BAM file:

# Create test SAM file (fields written with spaces, converted to tabs below)
cat > test.sam << 'EOF'
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:1000000
read1 0 chr1 100 60 10M * 0 0 ACGTACGTAC IIIIIIIIII
read2 0 chr1 120 60 10M * 0 0 TGCATGCATG IIIIIIIIII
read3 0 chr1 50000 60 10M * 0 0 AAAAAAAAAA IIIIIIIIII
read4 0 chr1 100000 60 10M * 0 0 TTTTTTTTTT IIIIIIIIII
EOF

# SAM is TAB-delimited; no field above contains a space, so convert them all
sed -i 's/ /\t/g' test.sam

# Convert to BAM and index
pixi run samtools view -b test.sam -o test.bam
pixi run samtools index test.bam

Test samtools region queries:

# View reads in region
pixi run samtools view test.bam chr1:1-60000

# Count reads in region
pixi run samtools view -c test.bam chr1:1-60000

# Extract region to new file
pixi run samtools view -b test.bam chr1:50000-100000 -o region.bam

Once comfortable with local files, the same commands work seamlessly with remote S3, FTP, or HTTPS URLs—just replace local filenames with remote URLs.

6. Common Workflows and Recipes

6.1 Extract Variants for Specific Genes

# Define gene regions (DDX11L1, MIR6859-1); BED files must be TAB-delimited
cat > genes.bed << 'EOF'
1 11869 14409 DDX11L1
1 17369 17436 MIR6859-1
EOF
sed -i 's/ /\t/g' genes.bed

# Extract variants from remote file for gene regions
bcftools view \
s3://my-bucket/variants.vcf.gz \
-R genes.bed \
-O v -o local_genes.vcf

# Compress and index
bgzip -f local_genes.vcf
tabix -p vcf local_genes.vcf.gz

6.2 Compare Samples Across Remote Files

# bcftools isec needs bgzipped, indexed inputs, so extract the remote subsets first
bcftools view -s sample1 s3://bucket1/variants.vcf.gz -O z -o sample1.vcf.gz
bcftools index -t sample1.vcf.gz
bcftools view -s sample2 s3://bucket2/variants.vcf.gz -O z -o sample2.vcf.gz
bcftools index -t sample2.vcf.gz

# Compare the two subsets
bcftools isec sample1.vcf.gz sample2.vcf.gz -p comparison_output

# Shows shared and unique variants
ls comparison_output/

6.3 Annotate Remote VCF with Remote Annotations

# Annotate S3-hosted VCF using FTP-hosted annotation file
# (both files need tabix indexes available at their remote locations)
bcftools annotate \
s3://my-bucket/variants.vcf.gz \
-a ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all.vcf.gz \
-c ID \
-O v -o annotated.vcf

6.4 Extract BAM Regions for Multiple Samples

# Create list of regions
cat > regions.txt << 'EOF'
chr1:1-1000000
chr2:2000000-3000000
chr3:5000000-6000000
EOF

# Extract each region from remote BAM, save locally
while read region; do
samtools view \
s3://my-bucket/aligned.bam \
$region \
-b -o ${region//:/_}.bam
samtools index ${region//:/_}.bam
done < regions.txt
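
The loop above runs serially; for many regions, the extractions can run in parallel, for example with xargs. A sketch (tune -P to your bandwidth):

# Extract regions in parallel, four at a time
xargs -P 4 -I {} sh -c '
    region="$1"
    out="$(printf "%s" "$region" | tr ":" "_").bam"
    samtools view -b s3://my-bucket/aligned.bam "$region" -o "$out" \
        && samtools index "$out"
' _ {} < regions.txt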

6.5 Stream Processing Without Local Storage

# Pipe remote data through multiple tools
bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AF\n' \
| awk '$6 > 0.05' \
| tee filtered_variants.tsv \
| wc -l

7. Troubleshooting and Optimization

7.1 Common Issues

Issue: "File not found" or 404 errors

  • Solution: Verify S3 bucket name, path, and file permissions. Test with AWS CLI:
    aws s3 ls s3://bucket-name/path/file.vcf.gz

Issue: Authentication failures with S3

  • Solution: Verify AWS credentials:
    aws sts get-caller-identity

Issue: Index not found (.tbi or .bai missing)

  • Solution: Create indexes or verify they exist on remote server:
    aws s3 ls s3://bucket-name/file.vcf.gz.tbi

Issue: "Connection timeout" on FTP

  • Solution: Retry the transfer (see the retry sketch in section 3.3) or switch to an HTTPS mirror of the repository; HTSlib does not document environment variables for curl timeouts.

Issue: HTTPS SSL certificate errors

  • Solution: Verify certificate validity or temporarily disable verification for testing:
    curl -k https://endpoint.com/file.vcf.gz.tbi  # -k disables cert verification
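
Many of these failures can be caught up front with a quick preflight check that both the data file and its index are reachable. A sketch for HTTPS-hosted files (URL hypothetical):

# Check that the file and its index both exist before querying
url="https://example.org/data/variants.vcf.gz"
for f in "$url" "$url.tbi"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' -I "$f")
    echo "$f -> HTTP $code" # expect 200 (or 206 when sending a Range header)
done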

7.2 Optimization Tips

For S3 Access:

  • Use S3 Transfer Acceleration if available for faster downloads
  • Place compute resources in the same AWS region as your data
  • Use a VPC gateway endpoint for S3 so traffic stays inside AWS and avoids NAT/Internet Gateway charges; the s3:// URLs in your commands stay the same

For Large Batch Queries:

  • Sort region requests by chromosome for better caching
  • Consider downloading entire chromosome subset once, then querying locally:
    bcftools view s3://bucket/variants.vcf.gz chr1 -O z -o chr1.vcf.gz
    bcftools index chr1.vcf.gz
    # Then query locally multiple times

For FTP Performance:

  • Batch multiple queries when possible
  • Check server-side compression settings
  • Consider mirror/rsync services for frequently accessed data

7.3 Debugging with Verbose Output

# Raise HTSlib verbosity via the global samtools option
samtools view --verbosity 8 s3://bucket/aligned.bam chr1:1-1000000 > /dev/null

# Recent bcftools releases accept --verbosity on most commands (check --help)
bcftools view --verbosity 4 s3://bucket/variants.vcf.gz chr1:1-1000000 > /dev/null

# At verbosity 8, HTSlib's libcurl handler also emits curl-level debug output

8. Recap and Key Takeaways

8.1 What We've Covered

  • Protocol Support: HTSlib tools support S3, FTP, and HTTPS remote file access natively
  • Authentication: Each protocol has different auth mechanisms (AWS credentials, FTP user:pass, Bearer tokens)
  • Efficient Queries: Indexed files enable byte-range requests, downloading only necessary data
  • Practical Workflows: Common tasks like region extraction, sample comparison, and annotation using remote files
  • Troubleshooting: How to diagnose and resolve common remote file access issues

8.2 Key Takeaways

  1. Always use indexes for remote files to avoid downloading entire datasets
  2. Choose the right protocol: S3/GCS for cloud buckets, HTTPS for public repositories and APIs, FTP as a legacy fallback
  3. Set up authentication correctly to avoid repeated connection failures
  4. Stream processing reduces local storage requirements for large datasets
  5. Verify indexes exist on remote servers before querying
  6. Optimize by region: Place compute near data, batch similar queries, use appropriate timeouts

8.3 Quick Reference Workflow

Setup (one time):

# Create project directory
mkdir -p ~/genomics-work && cd ~/genomics-work

# Initialize Pixi with conda-forge and bioconda
pixi init --channel conda-forge --channel bioconda .

# Add tools
pixi add bcftools samtools

# Verify installation
pixi run bcftools --version
pixi run samtools --version

Using remote files:

# 1. Set up authentication (optional, for private buckets)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"

# 2. Query real remote file (NCBI dbSNP via HTTPS - no auth needed)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
| pixi run bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' \
| head -20

# 3. Or for your own S3 data:
pixi run bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 -O v -o chr1_subset.vcf

# 4. Create local index for repeated use
pixi run bgzip -f chr1_subset.vcf
pixi run tabix -p vcf chr1_subset.vcf.gz

8.4 Next Steps

  • Explore your data: Start with small region queries to test setup
  • Automate workflows: Create scripts for common extraction and annotation tasks
  • Scale up: Use Nextflow or similar workflow managers for processing multiple files
  • Monitor costs: Track S3 data transfer and request costs if using cloud storage

8.5 Additional Resources