Working with Remote Files using bcftools and samtools (HTSlib)
HTSlib-based tools like bcftools and samtools provide powerful capabilities for working with genomic data stored on remote servers. Whether your data is in AWS S3, accessible via FTP, or hosted on HTTPS endpoints, these tools allow you to efficiently query and subset remote files without downloading entire datasets. This guide covers authentication, remote file access patterns, and practical workflows.
1. How HTSlib Handles Remote Files with Random Access
HTSlib is a C library that powers many genomic tools. Tools built on HTSlib—including bcftools, samtools, tabix, and others—inherit native support for remote file access via random access queries. This section explains how HTSlib works with remote files and which protocols are supported.
1.1 Random Access Mechanism: The Foundation
HTSlib enables efficient remote file access through a byte-range request mechanism. Instead of downloading entire files, it uses file indexes to determine exact byte positions and requests only those bytes from the remote server.
How it works:
User Request:
bcftools view s3://bucket/variants.vcf.gz chr1:1000-2000
HTSlib Process:
1. Reads index file (.vcf.gz.tbi) to find byte ranges
2. Index lookup: "chr1:1000-2000 is at bytes 5000-15000"
3. Sends byte-range request to remote server
4. Remote server returns ONLY requested bytes (10KB, not 1GB!)
5. HTSlib decompresses partial data
6. Returns matching variants to user
Result: 10KB downloaded instead of 1GB
Time: 1-5 seconds instead of 5-30 minutes
Critical requirement: Indexed files (.tbi, .bai, .csi)
- Without index: region queries fail outright, and reading anything means streaming the file from the beginning (impractical for large files)
- With index: downloads only the necessary bytes (efficient and fast)
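To see why the index matters, you can compare the sizes of a data file and its index directly. A quick sketch using HTTP HEAD requests against the public NCBI dbSNP file referenced later in this guide (exact sizes will vary, and it assumes the .tbi is published alongside the data, as it is for this dataset):
#!/usr/bin/env bash
# Compare the size of a remote VCF and its tabix index via HTTP HEAD requests.
BASE="https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF"

for f in 00-common_all.vcf.gz 00-common_all.vcf.gz.tbi; do
  size=$(curl -sI "$BASE/$f" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')
  echo "$f: $size bytes"
done
# The .tbi is a tiny fraction of the data file; HTSlib fetches it first,
# then requests only the byte ranges that cover your region.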
1.2 All Supported URL Schemes
HTSlib supports a comprehensive list of protocols via libcurl. Check your installation:
# View all supported protocols
pixi run samtools --version 2>&1 | grep -A 20 "HTSlib URL"
Complete output example:
HTSlib URL scheme handlers present:
built-in: file, preload, data
Google Cloud: gs+http, gs+https, gs
libcurl: gophers, smtp, wss, smb, rtsp, tftp, pop3, smbs, imaps, pop3s, ws, ftps, ftp, gopher, imap, http, https, sftp, smtps, scp, dict, mqtt, telnet
Amazon S3: s3+https, s3, s3+http
crypt4gh: crypt4gh
mem: mem
Protocol categories:
| Category | Protocols | Typical Use |
|---|---|---|
| Built-in | file, preload, data | Local files, pipes |
| Cloud Storage | s3://, gs:// | AWS, Google Cloud |
| HTTP-based | http://, https://, ftps:// | Web servers, FTP over SSL |
| Other | ftp://, sftp://, scp://, smb:// | Traditional file servers |
| Specialized | mqtt, dict, imap, etc. | Not typical for genomics |
Note on protocol coverage: This blog focuses on the most common protocols for bioinformatics: S3, HTTPS, FTP, and GCS. Many other protocols are available but less commonly used for genomic data.
1.3 Typical Bioinformatics Protocols (Covered in This Blog)
The following protocols are widely used for storing and accessing genomic data:
S3 (Amazon S3 and compatible)
- URL scheme: s3://bucket-name/path/file.vcf.gz
- Use case: Cloud-based genomic repositories, AWS-hosted data
- Random access: Via byte-range requests; HTSlib signs requests with your AWS credentials
- Speed: Fastest (typically <1 second for indexed queries)
- See section: 2 - S3 Remote File Access
HTTPS (Web servers)
- URL scheme: https://server.com/path/file.vcf.gz
- Use case: Public databases (NCBI, Ensembl), institutional servers
- Random access: Via HTTP Range header (206 Partial Content)
- Speed: Fast (1-2 seconds for indexed queries)
- See section: 4 - HTTPS Remote File Access
FTP (File Transfer Protocol)
- URL scheme: ftp://server.com/path/file.vcf.gz
- Use case: Legacy databases, some public repositories
- Random access: Via the FTP REST command for seeking
- Speed: Slower (3-5 seconds for indexed queries)
- See section: 3 - FTP Remote File Access
GCS (Google Cloud Storage)
- URL scheme: gs://bucket-name/path/file.vcf.gz
- Use case: Google Cloud-hosted genomic data
- Random access: Similar to S3 (byte-range requests)
- Speed: Fast (typically <1 second for indexed queries)
- Note: Requires Google Cloud credentials (see the token sketch below)
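A minimal sketch of a GCS query, assuming you have the gcloud CLI installed and read access to a bucket (gs://my-bucket/... is a placeholder); HTSlib picks the token up from the GCS_OAUTH_TOKEN environment variable:
# Obtain a short-lived OAuth token from gcloud and expose it to HTSlib
export GCS_OAUTH_TOKEN=$(gcloud auth print-access-token)

# A gs:// region query then works exactly like s3:// or https://
bcftools view gs://my-bucket/data/variants.vcf.gz chr1:1-1000000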
1.4 Key Advantages of Remote Access
- Avoid large downloads: Query specific regions without transferring entire files
- Cost efficiency: Pay only for data you access, not storage
- Scalability: Work with datasets larger than local disk space
- Collaboration: Reference shared datasets directly without copying
- Real-time access: Always work with latest version of data
1.5 Performance Considerations: Why Indexes Matter
Remote file access requires indexed files for efficient region-based queries:
- VCF files: must have a .vcf.gz.tbi index
- BAM files: must have a .bam.bai index
- BCF files: must have a .bcf.csi index
Bandwidth comparison for 1GB VCF file querying chr1:1-100000:
| Scenario | Bytes Transferred | Time | Method |
|---|---|---|---|
| Indexed S3 | ~700KB | <1 second | Byte-range request |
| Indexed HTTPS | ~700KB | 1-2 seconds | HTTP Range header |
| Indexed FTP | ~700KB | 3-5 seconds | FTP REST seek |
| Unindexed remote | 1000MB+ | 5-30 minutes | Sequential download |
| Local indexed file | ~700KB | <0.5 seconds | Direct file seek |
Critical insight: Indexed remote access downloads only 0.07% of file size!
1.6 Setting Up bcftools and samtools with Pixi
The easiest way to install bcftools and samtools with full remote file support is using Pixi with the conda-forge and bioconda channels. These channels include pre-compiled binaries with S3, FTP, and HTTPS support enabled in HTSlib.
Installation Steps
Create a new directory for your project and initialize a Pixi environment:
# Create project directory
mkdir -p ~/htslib-project && cd ~/htslib-project
# Initialize Pixi with conda-forge and bioconda channels
pixi init --channel conda-forge --channel bioconda .
# Add bcftools and samtools
pixi add bcftools samtools
This creates a pixi.toml file:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "htslib-project"
platforms = ["linux-64"]
version = "0.1.0"
[dependencies]
bcftools = ">=1.23,<2"
samtools = ">=1.23,<2"
Verify Installation
# Run bcftools version
pixi run bcftools --version
# Run samtools version
pixi run samtools --version
Both tools will report HTSlib features including S3, GCS, and libcurl support:
bcftools 1.23
Using htslib 1.23
...
Samtools 1.23
...
HTSlib compilation details:
Features: build=configure libcurl=yes S3=yes GCS=yes ...
The key entry is S3=yes, indicating S3 support is enabled.
Why Pixi for Remote File Access?
- Pre-compiled with remote support: Bioconda packages include HTSlib compiled with libcurl, S3, and GCS support
- Easy dependency management: All required libraries included automatically
- Reproducible environments: pixi.lock ensures the same versions across machines
- No system-wide installation: keep tools isolated in your project environment
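As a convenience, you can also record common commands as Pixi tasks in pixi.toml so collaborators run them the same way (task names and the example URL here are illustrative):
[tasks]
# Print version and compiled-in features (check for libcurl=yes, S3=yes)
check-features = "samtools --version"
# Example region query task; replace the URL with your own data
query-example = "bcftools view https://example.org/data/variants.vcf.gz chr1:1-1000000"
Then run them with pixi run check-features or pixi run query-example.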
1.7 Protocol Verification: HTTP Range Requests
For HTTPS/HTTP remote access to work efficiently, the server must support HTTP Range Requests. Verify this before relying on remote queries:
# Test if server supports Range Requests
curl -I -H "Range: bytes=0-500" https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all_papu.vcf.gz
# Expected response: HTTP/1.1 206 Partial Content
# This means byte-range requests work! ✓
Response headers will show:
HTTP/1.1 206 Partial Content ← supports Range requests
Accept-Ranges: bytes ← confirms support
Content-Length: 501 ← returns partial content
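To make this check scriptable, here is a small sketch that fails loudly when a server does not honor Range requests (the URL is the NCBI file used above; substitute your own):
#!/usr/bin/env bash
# Exit non-zero unless the server answers a Range request with 206 Partial Content.
url="https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz"

status=$(curl -s -o /dev/null -w '%{http_code}' -H "Range: bytes=0-500" "$url")
if [ "$status" = "206" ]; then
  echo "OK: server supports byte-range requests"
else
  echo "WARNING: got HTTP $status; region queries may fall back to full downloads" >&2
  exit 1
fi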
2. S3 Remote File Access
AWS S3 is a common repository for genomic datasets. HTSlib supports S3 through the s3:// protocol.
2.1 S3 Authentication
Using AWS Credentials
Set AWS credentials as environment variables:
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="us-east-1"
Using AWS IAM Roles (EC2 Instances)
If running on EC2 with an IAM role attached, HTSlib automatically uses the role's credentials—no environment variables needed.
Using AWS CLI Configuration
HTSlib respects ~/.aws/credentials and ~/.aws/config:
aws configure # Interactive setup
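For reference, the files written by aws configure look like the sketch below; HTSlib reads the default profile, and named profiles can typically be selected via the AWS_PROFILE environment variable (all values are placeholders):
# ~/.aws/credentials
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key

# ~/.aws/config
[default]
region = us-east-1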
2.2 Querying S3-hosted VCF Files
Query a specific genomic region from a remote VCF file:
# Query chromosome 1, positions 1-1000000
bcftools view s3://my-bucket/data/variants.vcf.gz chr1:1-1000000
# Count variants in a region
bcftools view -H s3://my-bucket/data/variants.vcf.gz chr1:1-1000000 | wc -l
# Extract samples and subset to region
bcftools view -s sample1,sample2 s3://my-bucket/data/variants.vcf.gz chr1:1-1000000 -O v
2.3 Querying S3-hosted BAM Files
Extract reads from specific genomic regions:
# Query region from remote BAM
samtools view s3://my-bucket/data/aligned.bam chr1:1-1000000
# Count reads in region
samtools view -c s3://my-bucket/data/aligned.bam chr1:1-1000000
# Extract to local BAM file
samtools view -b s3://my-bucket/data/aligned.bam chr1:1-1000000 -o region.bam
2.4 Region-based Subsetting Workflow
Efficient workflow for extracting subsets from S3:
# 1. List available S3 files (using AWS CLI)
aws s3 ls s3://my-bucket/data/ --recursive
# 2. Query specific region from VCF
bcftools view \
s3://my-bucket/data/chr1.vcf.gz \
chr1:10000000-20000000 \
-O v -o local_subset.vcf
# 3. Validate and compress
bgzip -f local_subset.vcf
tabix -p vcf local_subset.vcf.gz
# 4. Annotate or further process locally
bcftools annotate local_subset.vcf.gz -a annotation.vcf.gz -c ID
2.5 Working with Public Genomic Data
Public genomic repositories are excellent for testing remote file access. While S3 is available, HTTPS endpoints are often more reliable for quick testing.
Query NCBI dbSNP data via HTTPS:
NCBI provides public access to common variants. Test with real data:
# Query NCBI dbSNP common variants (via HTTPS)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
| pixi run bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' - \
| head -20
Expected output showing real variants:
1 10177 A AC
1 10352 T TA
1 10352 T TA
1 10616 CCGCCGTTGCAAAGGCGCGCCG C
1 10642 G A
1 11008 C G
1 11012 C G
Discover available files:
# List NCBI FTP VCF files
curl -s https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/ \
| grep -i ".vcf.gz\"" | grep -o "href=\"[^\"]*\"" | cut -d'"' -f2
# Or for 1000 Genomes in S3:
aws s3 ls s3://1000genomes/phase3/data/ --no-sign-request | head -20
Region-based queries (with indexed files):
# Query a specific chromosome region
# Notes: requires the .tbi index at the remote location, and this file
# names contigs without a "chr" prefix (1, not chr1)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
1:1000000-2000000
Note on S3 vs HTTPS:
- HTTPS (like NCBI): Reliable, requires no credentials for public data
- S3: Fastest for large datasets, requires AWS credentials or public bucket setup
- FTP: Alternative for some repositories, generally slower than HTTPS
3. FTP Remote File Access
FTP servers host many public genomic datasets. HTSlib supports the ftp:// protocol.
3.1 FTP Authentication
Anonymous Access
Many public repositories allow anonymous FTP:
# No credentials needed
bcftools view ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/VCF/common_all.vcf.gz 1:1-1000000
Authenticated FTP
For private FTP servers, include credentials in the URL:
bcftools view ftp://username:password@ftp.example.com/data/variants.vcf.gz chr1:1-1000000
Security Note: URL-embedded credentials are visible in process listings. Since HTSlib's FTP support goes through libcurl, a safer option is a ~/.netrc file, which libcurl can read:
# ~/.netrc (keep it private: chmod 600 ~/.netrc)
machine ftp.example.com
login your_username
password your_password
# Credentials are then picked up without appearing on the command line
bcftools view ftp://ftp.example.com/data/variants.vcf.gz chr1:1-1000000
3.2 Querying FTP-hosted Files
Common public genomic repositories accessible via FTP:
NCBI FTP Repository
# Query NCBI's common variants file
bcftools view \
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all.vcf.gz \
1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' - \
| head -10
Ensembl FTP Repository
# Query Ensembl VCF (example path)
bcftools view \
ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/VEP/Homo_sapiens_phenotypes_associated_variants-GRCh38.vcf.gz \
1:1-1000000
# List available Ensembl files
ftp ftp.ensembl.org # Interactive FTP session
3.3 FTP Performance Tips
- Network stability: FTP can be slower than HTTPS for single queries. Consider batch processing.
- Index availability: Verify .tbi or .csi indexes are present on the FTP server for efficient region queries.
- Timeout handling: Set timeout environment variables for unreliable connections (see also the retry sketch below):
export CURL_CONNECT_TIMEOUT=30
export CURL_TIMEOUT=60
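For flaky FTP servers, a small retry wrapper saves babysitting long jobs. A sketch (the function name is ours; adjust attempts and delay to taste):
#!/usr/bin/env bash
# Retry a bcftools region query a few times before giving up.
retry_query() {
  local attempts=3 delay=10
  for i in $(seq 1 "$attempts"); do
    if bcftools view "$1" "$2" -O z -o "$3"; then
      return 0
    fi
    echo "Attempt $i failed; retrying in ${delay}s..." >&2
    sleep "$delay"
  done
  return 1
}

retry_query ftp://ftp.example.com/data/variants.vcf.gz chr1:1-1000000 subset.vcf.gz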
4. HTTPS Remote File Access
HTTPS endpoints often require token-based authentication and are increasingly common for cloud-hosted data.
4.1 HTTPS Authentication Methods
Bearer Token Authentication
For endpoints that accept Bearer tokens (common in cloud platforms):
# Option 1: Set environment variable
export HTTPS_AUTH_BEARER="your_bearer_token"
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000
# Option 2: Pass credentials in the URL (sent as HTTP Basic auth; less secure)
bcftools view https://token:your_bearer_token@api.example.com/data/variants.vcf.gz chr1:1-1000000
API Keys in Headers
For services requiring custom headers, stream the file through curl. A streamed file cannot be used with an index, so use the streaming -t/--targets option instead of a positional region (which requires seeking):
# Stream via curl with a custom header; -t filters positions while streaming
curl -s -H "Authorization: Bearer TOKEN" \
  https://api.example.com/data/variants.vcf.gz \
  | bcftools view -t chr1:1-1000000 -
Basic HTTP Authentication
# URL-embedded credentials (not recommended for security)
bcftools view https://username:password@secure.example.com/data/variants.vcf.gz chr1:1-1000000
4.2 Practical HTTPS Workflows
Query Public HTTPS-hosted Data
Many databases provide HTTPS access to genomic data:
# Example: Query genome annotation from HTTPS endpoint
bcftools view https://example.genomics.org/public/variants.vcf.gz chr1:1-1000000
# Extract specific information
bcftools view https://example.genomics.org/public/variants.vcf.gz chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\n' -
Use a Local Index with a Remote File
For repeated queries, cache the small index locally and point HTSlib at it explicitly with the ##idx## suffix:
# Download the index file (small, fast)
curl -o variants.vcf.gz.tbi https://api.example.com/data/variants.vcf.gz.tbi
# Query the remote data file using the local index
bcftools view "https://api.example.com/data/variants.vcf.gz##idx##variants.vcf.gz.tbi" chr1:1-1000000
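samtools offers the same idea through its -X flag, which takes the data file and index as separate arguments. A sketch against the same hypothetical endpoint:
# Cache the small BAM index locally
curl -o aligned.bam.bai https://api.example.com/data/aligned.bam.bai

# -X pairs the remote BAM with the local index for the region query
samtools view -X https://api.example.com/data/aligned.bam aligned.bam.bai chr1:1-1000000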
4.3 SSL/TLS Certificate Issues
If you encounter certificate validation errors:
# Disable certificate verification (use cautiously)
export SSL_NO_VERIFY=1
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000
# Or use curl environment variable
export CURL_CA_BUNDLE=""
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000
Security Note: Only disable certificate verification for trusted internal endpoints or testing.
5. Indexing and Efficient Region Queries
Efficient remote file access depends on proper indexing. HTSlib tools use byte-range requests to fetch only necessary data.
5.1 Creating Indexes
Create Tabix Index for VCF Files
# For bgzip-compressed VCF files
tabix -p vcf variants.vcf.gz
# Verify index
ls -lh variants.vcf.gz*
Create CSI Index for BCF Files
# CSI indexes support larger files and datasets
bcftools index -c variants.bcf
# Verify index
ls -lh variants.bcf.csi
Create BAI Index for BAM Files
# Standard BAM indexing
samtools index aligned.bam
# Creates aligned.bam.bai
ls -lh aligned.bam.bai
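Remember that for remote queries the index must live next to the data file on the server. When publishing your own data to S3, upload both (the bucket name is a placeholder):
# Index locally, then upload data and index side by side
tabix -p vcf variants.vcf.gz
aws s3 cp variants.vcf.gz     s3://my-bucket/data/variants.vcf.gz
aws s3 cp variants.vcf.gz.tbi s3://my-bucket/data/variants.vcf.gz.tbi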
5.2 Verifying Index Integrity
# Check if VCF index is valid
tabix -l variants.vcf.gz | head -10
# Query a specific region (tests index)
bcftools view variants.vcf.gz chr1:1-1000 -H | wc -l
5.3 Performance: Indexed vs Unindexed Remote Access
Indexed remote access (with region query):
- Seeks to file index
- Downloads byte-range containing target region
- Returns ~100KB-1MB of data
Unindexed remote access:
- Must scan entire file sequentially
- Downloads entire file
- Returns all data (can be GB-sized)
Example performance comparison:
# Indexed query (fast, ~1-5 seconds)
time bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 > /dev/null
# Without an index alongside the remote file, the same region query fails,
# and reading the file at all means streaming it from the start (minutes, not seconds)
5.4 Testing Remote File Access with Local Files
Before working with remote files, it's helpful to test bcftools and samtools locally. Use this workflow to verify your installation and practice region queries.
Create test VCF file:
# Create test directory
mkdir -p ~/genomics-test && cd ~/genomics-test
# Create test VCF (columns must be TAB-separated)
cat > test.vcf << 'EOF'
##fileformat=VCFv4.2
##contig=<ID=chr1,length=1000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
chr1 10000 . A G 60 . DP=100 GT 0/1 1/1
chr1 20000 . C T 50 . DP=80 GT 0/0 0/1
chr1 30000 . G A 70 . DP=120 GT 1/1 0/1
chr1 100000 . T C 55 . DP=90 GT 0/1 0/0
EOF
# Compress and index (requires bgzip and tabix from pixi)
pixi run bgzip -f test.vcf
pixi run tabix -p vcf test.vcf.gz
Test bcftools region queries:
# Query specific region
pixi run bcftools view test.vcf.gz chr1:1-50000
# Extract specific samples
pixi run bcftools view -s sample1 test.vcf.gz chr1:10000-50000
# Count variants in region
pixi run bcftools view -H test.vcf.gz chr1:1-100000 | wc -l
Create and test BAM file:
# Create test SAM file (fields must be TAB-separated)
cat > test.sam << 'EOF'
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:1000000
read1 0 chr1 100 60 10M * 0 0 ACGTACGTAC IIIIIIIIII
read2 0 chr1 120 60 10M * 0 0 TGCATGCATG IIIIIIIIII
read3 0 chr1 50000 60 10M * 0 0 AAAAAAAAAA IIIIIIIIII
read4 0 chr1 100000 60 10M * 0 0 TTTTTTTTTT IIIIIIIIII
EOF
# Convert to BAM and index
pixi run samtools view -b test.sam -o test.bam
pixi run samtools index test.bam
Test samtools region queries:
# View reads in region
pixi run samtools view test.bam chr1:1-60000
# Count reads in region
pixi run samtools view -c test.bam chr1:1-60000
# Extract region to new file
pixi run samtools view -b test.bam chr1:50000-100000 -o region.bam
Once comfortable with local files, the same commands work seamlessly with remote S3, FTP, or HTTPS URLs—just replace local filenames with remote URLs.
6. Common Workflows and Recipes
6.1 Extract Variants for Specific Genes
# Define gene regions in BED format (chrom, start, end, name); contig names must match the VCF
cat > genes.bed << 'EOF'
1 11869 14409 DDX11L1
1 17369 17436 MIR6859-1
EOF
# Extract variants from remote file for gene regions
bcftools view \
s3://my-bucket/variants.vcf.gz \
-R genes.bed \
-O v -o local_genes.vcf
# Compress and index
bgzip -f local_genes.vcf
tabix -p vcf local_genes.vcf.gz
6.2 Compare Samples Across Remote Files
# Compare variants for two samples in different S3 files.
# bcftools isec needs indexed inputs, so extract each sample to a local,
# indexed subset first rather than piping streams into isec.
bcftools view -s sample1 s3://bucket1/variants.vcf.gz -O z -o s1.vcf.gz
bcftools view -s sample2 s3://bucket2/variants.vcf.gz -O z -o s2.vcf.gz
bcftools index -t s1.vcf.gz && bcftools index -t s2.vcf.gz
bcftools isec s1.vcf.gz s2.vcf.gz -p comparison_output
# Shows shared and unique variants
ls comparison_output/
6.3 Annotate Remote VCF with Remote Annotations
# Annotate S3-hosted VCF using FTP-hosted annotation file
bcftools annotate \
s3://my-bucket/variants.vcf.gz \
-a ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all.vcf.gz \
-c ID \
-O v -o annotated.vcf
6.4 Extract BAM Regions for Multiple Samples
# Create list of regions
cat > regions.txt << 'EOF'
chr1:1-1000000
chr2:2000000-3000000
chr3:5000000-6000000
EOF
# Extract each region from remote BAM, save locally
while read region; do
samtools view \
s3://my-bucket/aligned.bam \
$region \
-b -o ${region//:/_}.bam
samtools index ${region//:/_}.bam
done < regions.txt
6.5 Stream Processing Without Local Storage
# Pipe remote data through multiple tools
bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AF\n' - \
| awk '$6 > 0.05' \
| tee filtered_variants.tsv \
| wc -l
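When the filter is on allele frequency, bcftools can apply it directly with -q/--min-af, which avoids shipping the discarded records through awk at all. An equivalent sketch:
# Filter by minor allele frequency inside bcftools, then format
bcftools view -q 0.05:minor s3://my-bucket/variants.vcf.gz chr1:1-1000000 \
  | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AF\n' - \
  | tee filtered_variants.tsv \
  | wc -l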
7. Troubleshooting and Optimization
7.1 Common Issues
Issue: "File not found" or 404 errors
- Solution: Verify S3 bucket name, path, and file permissions. Test with AWS CLI:
aws s3 ls s3://bucket-name/path/file.vcf.gz
Issue: Authentication failures with S3
- Solution: Verify AWS credentials:
aws sts get-caller-identity
Issue: Index not found (.tbi or .bai missing)
- Solution: Create indexes or verify they exist on remote server:
aws s3 ls s3://bucket-name/file.vcf.gz.tbi
Issue: "Connection timeout" on FTP
- Solution: Increase timeout and try again:
export CURL_CONNECT_TIMEOUT=60
export CURL_TIMEOUT=120
Issue: HTTPS SSL certificate errors
- Solution: Verify certificate validity or temporarily disable verification for testing:
curl -k https://endpoint.com/file.vcf.gz.tbi # -k disables cert verification
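These checks can be bundled into one quick diagnostic. A sketch for HTTPS-hosted files (pass your own URL as the first argument):
#!/usr/bin/env bash
# Quick health check for a remote indexed VCF over HTTPS.
url="$1"  # e.g. https://server.com/data/variants.vcf.gz

echo "1. Data file reachable?"
curl -s -o /dev/null -w '  HTTP %{http_code}\n' -I "$url"

echo "2. Index present next to it?"
curl -s -o /dev/null -w '  HTTP %{http_code}\n' -I "${url}.tbi"

echo "3. Range requests supported?"
curl -s -o /dev/null -w '  HTTP %{http_code} (expect 206)\n' -H "Range: bytes=0-100" "$url"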
7.2 Optimization Tips
For S3 Access:
- Use S3 Transfer Acceleration if available for faster downloads
- Place compute resources in the same AWS region as your data
- Use VPC gateway endpoints for S3 to avoid Internet Gateway charges; this is network-level configuration, so the URL is unchanged:
# With a VPC endpoint in place (no data transfer charges within the VPC), the same URL form applies
s3://bucket-name/key
For Large Batch Queries:
- Sort region requests by chromosome for better caching
- Consider downloading entire chromosome subset once, then querying locally:
bcftools view s3://bucket/variants.vcf.gz chr1 -O z -o chr1.vcf.gz
bcftools index chr1.vcf.gz
# Then query locally multiple times
For FTP Performance:
- Batch multiple queries when possible
- Check server-side compression settings
- Consider mirror/rsync services for frequently accessed data
7.3 Debugging with Verbose Output
# Increase verbosity with the global --verbosity option (recent bcftools/samtools)
bcftools view --verbosity 4 s3://bucket/variants.vcf.gz chr1:1-1000000
samtools view --verbosity 4 s3://bucket/aligned.bam chr1:1-1000000
# Reproduce the transfer manually with curl to inspect headers and byte ranges
curl -v -H "Range: bytes=0-500" https://server.com/file.vcf.gz -o /dev/null
8. Recap and Key Takeaways
8.1 What We've Covered
- Protocol Support: HTSlib tools support S3, FTP, and HTTPS remote file access natively
- Authentication: Each protocol has different auth mechanisms (AWS credentials, FTP user:pass, Bearer tokens)
- Efficient Queries: Indexed files enable byte-range requests, downloading only necessary data
- Practical Workflows: Common tasks like region extraction, sample comparison, and annotation using remote files
- Troubleshooting: How to diagnose and resolve common remote file access issues
8.2 Key Takeaways
- Always use indexes for remote files to avoid downloading entire datasets
- Choose the right protocol: S3 for AWS-hosted data, HTTPS for public databases and APIs, FTP for legacy repositories
- Set up authentication correctly to avoid repeated connection failures
- Stream processing reduces local storage requirements for large datasets
- Verify indexes exist on remote servers before querying
- Optimize by region: Place compute near data, batch similar queries, use appropriate timeouts
8.3 Quick Reference Workflow
Setup (one time):
# Create project directory
mkdir -p ~/genomics-work && cd ~/genomics-work
# Initialize Pixi with conda-forge and bioconda
pixi init --channel conda-forge --channel bioconda .
# Add tools
pixi add bcftools samtools
# Verify installation
pixi run bcftools --version
pixi run samtools --version
Using remote files:
# 1. Set up authentication (optional, for private buckets)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
# 2. Query real remote file (NCBI dbSNP via HTTPS - no auth needed)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
| pixi run bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' - \
| head -20
# 3. Or for your own S3 data:
pixi run bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 -O v -o chr1_subset.vcf
# 4. Create local index for repeated use
pixi run bgzip -f chr1_subset.vcf
pixi run tabix -p vcf chr1_subset.vcf.gz
8.4 Next Steps
- Explore your data: Start with small region queries to test setup
- Automate workflows: Create scripts for common extraction and annotation tasks
- Scale up: Use Nextflow or similar workflow managers for processing multiple files
- Monitor costs: Track S3 data transfer and request costs if using cloud storage