Working with Remote Files using bcftools and samtools (HTSlib)
HTSlib-based tools like bcftools and samtools provide powerful capabilities for working with genomic data stored on remote servers. Whether your data is in AWS S3, accessible via FTP, or hosted on HTTPS endpoints, these tools allow you to efficiently query and subset remote files without downloading entire datasets. This guide covers authentication, remote file access patterns, and practical workflows.
1. How HTSlib Handles Remote Files with Random Access
HTSlib is a C library that powers many genomic tools. Tools built on HTSlib—including bcftools, samtools, tabix, and others—inherit native support for remote file access via random access queries. This section explains how HTSlib works with remote files and which protocols are supported.
1.1 Random Access Mechanism: The Foundation
HTSlib enables efficient remote file access through a byte-range request mechanism. Instead of downloading entire files, it uses file indexes to determine exact byte positions and requests only those bytes from the remote server.
How it works:
User Request:
bcftools view s3://bucket/variants.vcf.gz chr1:1000-2000
HTSlib Process:
1. Reads index file (.vcf.gz.tbi) to find byte ranges
2. Index lookup: "chr1:1000-2000 is at bytes 5000-15000"
3. Sends byte-range request to remote server
4. Remote server returns ONLY requested bytes (10KB, not 1GB!)
5. HTSlib decompresses partial data
6. Returns matching variants to user
Result: 10KB downloaded instead of 1GB
Time: 1-5 seconds instead of 5-30 minutes
Critical requirement: Indexed files (.tbi, .bai, .csi)
- Without index: region queries fail outright, and reading anything means streaming the file from the beginning (impractical for large files)
- With index: downloads only the necessary bytes (efficient and fast)
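To see why the index matters, you can compare the sizes of a data file and its index directly. A quick sketch using HTTP HEAD requests against the public NCBI dbSNP file referenced later in this guide (exact sizes will vary, and it assumes the .tbi is published alongside the data, as it is for this dataset):
#!/usr/bin/env bash
# Compare the size of a remote VCF and its tabix index via HTTP HEAD requests.
BASE="https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF"

for f in 00-common_all.vcf.gz 00-common_all.vcf.gz.tbi; do
  size=$(curl -sI "$BASE/$f" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')
  echo "$f: $size bytes"
done
# The .tbi is a tiny fraction of the data file; HTSlib fetches it first,
# then requests only the byte ranges that cover your region.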
1.2 All Supported URL Schemes
HTSlib supports a comprehensive list of protocols via libcurl. Check your installation:
# View all supported protocols
pixi run samtools --version 2>&1 | grep -A 20 "HTSlib URL"
Complete output example:
HTSlib URL scheme handlers present:
built-in: file, preload, data
Google Cloud: gs+http, gs+https, gs
libcurl: gophers, smtp, wss, smb, rtsp, tftp, pop3, smbs, imaps, pop3s, ws, ftps, ftp, gopher, imap, http, https, sftp, smtps, scp, dict, mqtt, telnet
Amazon S3: s3+https, s3, s3+http
crypt4gh: crypt4gh
mem: mem
Protocol categories:
| Category | Protocols | Typical Use |
|---|---|---|
| Built-in | file, preload, data | Local files, pipes |
| Cloud Storage | s3://, gs:// | AWS, Google Cloud |
| HTTP-based | http://, https://, ftps:// | Web servers, FTP over SSL |
| Other | ftp://, sftp://, scp://, smb:// | Traditional file servers |
| Specialized | mqtt, dict, imap, etc. | Not typical for genomics |
Note on protocol coverage: This blog focuses on the most common protocols for bioinformatics: S3, HTTPS, FTP, and GCS. Many other protocols are available but less commonly used for genomic data.
1.3 Typical Bioinformatics Protocols (Covered in This Blog)
The following protocols are widely used for storing and accessing genomic data:
S3 (Amazon S3 and compatible)
- URL scheme: s3://bucket-name/path/file.vcf.gz
- Use case: Cloud-based genomic repositories, AWS-hosted data
- Random access: Via byte-range requests; HTSlib signs requests with your AWS credentials
- Speed: Fastest (typically <1 second for indexed queries)
- See section: 2 - S3 Remote File Access
HTTPS (Web servers)
- URL scheme: https://server.com/path/file.vcf.gz
- Use case: Public databases (NCBI, Ensembl), institutional servers
- Random access: Via HTTP Range header (206 Partial Content)
- Speed: Fast (1-2 seconds for indexed queries)
- See section: 4 - HTTPS Remote File Access
FTP (File Transfer Protocol)
- URL scheme: ftp://server.com/path/file.vcf.gz
- Use case: Legacy databases, some public repositories
- Random access: Via the FTP REST command for seeking
- Speed: Slower (3-5 seconds for indexed queries)
- See section: 3 - FTP Remote File Access
GCS (Google Cloud Storage)
- URL scheme: gs://bucket-name/path/file.vcf.gz
- Use case: Google Cloud-hosted genomic data
- Random access: Similar to S3 (byte-range requests)
- Speed: Fast (typically <1 second for indexed queries)
- Note: Requires Google Cloud credentials (see the token sketch below)
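A minimal sketch of a GCS query, assuming you have the gcloud CLI installed and read access to a bucket (gs://my-bucket/... is a placeholder); HTSlib picks the token up from the GCS_OAUTH_TOKEN environment variable:
# Obtain a short-lived OAuth token from gcloud and expose it to HTSlib
export GCS_OAUTH_TOKEN=$(gcloud auth print-access-token)

# A gs:// region query then works exactly like s3:// or https://
bcftools view gs://my-bucket/data/variants.vcf.gz chr1:1-1000000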
1.4 Key Advantages of Remote Access
- Avoid large downloads: Query specific regions without transferring entire files
- Cost efficiency: Pay only for data you access, not storage
- Scalability: Work with datasets larger than local disk space
- Collaboration: Reference shared datasets directly without copying
- Real-time access: Always work with latest version of data
1.5 Performance Considerations: Why Indexes Matter
Remote file access requires indexed files for efficient region-based queries:
- VCF files: must have a .vcf.gz.tbi index
- BAM files: must have a .bam.bai index
- BCF files: must have a .bcf.csi index
Bandwidth comparison for 1GB VCF file querying chr1:1-100000:
| Scenario | Bytes Transferred | Time | Method |
|---|---|---|---|
| Indexed S3 | ~700KB | <1 second | Byte-range request |
| Indexed HTTPS | ~700KB | 1-2 seconds | HTTP Range header |
| Indexed FTP | ~700KB | 3-5 seconds | FTP REST seek |
| Unindexed remote | 1000MB+ | 5-30 minutes | Sequential download |
| Local indexed file | ~700KB | <0.5 seconds | Direct file seek |
Critical insight: Indexed remote access downloads only 0.07% of file size!
1.6 Setting Up bcftools and samtools with Pixi
The easiest way to install bcftools and samtools with full remote file support is using Pixi with the conda-forge and bioconda channels. These channels include pre-compiled binaries with S3, FTP, and HTTPS support enabled in HTSlib.
Installation Steps
Create a new directory for your project and initialize a Pixi environment:
# Create project directory
mkdir -p ~/htslib-project && cd ~/htslib-project
# Initialize Pixi with conda-forge and bioconda channels
pixi init --channel conda-forge --channel bioconda .
# Add bcftools and samtools
pixi add bcftools samtools
This creates a pixi.toml file:
[workspace]
channels = ["conda-forge", "bioconda"]
name = "htslib-project"
platforms = ["linux-64"]
version = "0.1.0"
[dependencies]
bcftools = ">=1.23,<2"
samtools = ">=1.23,<2"
Verify Installation
# Run bcftools version
pixi run bcftools --version
# Run samtools version
pixi run samtools --version
Both tools will report HTSlib features including S3, GCS, and libcurl support:
bcftools 1.23
Using htslib 1.23
...
Samtools 1.23
...
HTSlib compilation details:
Features: build=configure libcurl=yes S3=yes GCS=yes ...
The key entry is S3=yes, indicating S3 support is enabled.
Why Pixi for Remote File Access?
- Pre-compiled with remote support: Bioconda packages include HTSlib compiled with libcurl, S3, and GCS support
- Easy dependency management: All required libraries included automatically
- Reproducible environments: pixi.lock ensures the same versions across machines
- No system-wide installation: keep tools isolated in your project environment
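As a convenience, you can also record common commands as Pixi tasks in pixi.toml so collaborators run them the same way (task names and the example URL here are illustrative):
[tasks]
# Print version and compiled-in features (check for libcurl=yes, S3=yes)
check-features = "samtools --version"
# Example region query task; replace the URL with your own data
query-example = "bcftools view https://example.org/data/variants.vcf.gz chr1:1-1000000"
Then run them with pixi run check-features or pixi run query-example.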
1.7 Protocol Verification: HTTP Range Requests
For HTTPS/HTTP remote access to work efficiently, the server must support HTTP Range Requests. Verify this before relying on remote queries:
# Test if server supports Range Requests
curl -I -H "Range: bytes=0-500" https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all_papu.vcf.gz
# Expected response: HTTP/1.1 206 Partial Content
# This means byte-range requests work! ✓
Response headers will show:
HTTP/1.1 206 Partial Content ← supports Range requests
Accept-Ranges: bytes ← confirms support
Content-Length: 501 ← returns partial content
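To make this check scriptable, here is a small sketch that fails loudly when a server does not honor Range requests (the URL is the NCBI file used above; substitute your own):
#!/usr/bin/env bash
# Exit non-zero unless the server answers a Range request with 206 Partial Content.
url="https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz"

status=$(curl -s -o /dev/null -w '%{http_code}' -H "Range: bytes=0-500" "$url")
if [ "$status" = "206" ]; then
  echo "OK: server supports byte-range requests"
else
  echo "WARNING: got HTTP $status; region queries may fall back to full downloads" >&2
  exit 1
fi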
2. S3 Remote File Access
AWS S3 is a common repository for genomic datasets. HTSlib supports S3 through the s3:// protocol.
2.1 S3 Authentication
Using AWS Credentials
Set AWS credentials as environment variables:
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="us-east-1"
Using AWS IAM Roles (EC2 Instances)
If running on EC2 with an IAM role attached, HTSlib automatically uses the role's credentials—no environment variables needed.
Using AWS CLI Configuration
HTSlib respects ~/.aws/credentials and ~/.aws/config:
aws configure # Interactive setup
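For reference, the files written by aws configure look like the sketch below; HTSlib reads the default profile, and named profiles can typically be selected via the AWS_PROFILE environment variable (all values are placeholders):
# ~/.aws/credentials
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key

# ~/.aws/config
[default]
region = us-east-1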
2.2 Querying S3-hosted VCF Files
Query a specific genomic region from a remote VCF file:
# Query chromosome 1, positions 1-1000000
bcftools view s3://my-bucket/data/variants.vcf.gz chr1:1-1000000
# Count variants in a region
bcftools view -H s3://my-bucket/data/variants.vcf.gz chr1:1-1000000 | wc -l
# Extract samples and subset to region
bcftools view -s sample1,sample2 s3://my-bucket/data/variants.vcf.gz chr1:1-1000000 -O v
2.3 Querying S3-hosted BAM Files
Extract reads from specific genomic regions:
# Query region from remote BAM
samtools view s3://my-bucket/data/aligned.bam chr1:1-1000000
# Count reads in region
samtools view -c s3://my-bucket/data/aligned.bam chr1:1-1000000
# Extract to local BAM file
samtools view -b s3://my-bucket/data/aligned.bam chr1:1-1000000 -o region.bam
2.4 Region-based Subsetting Workflow
Efficient workflow for extracting subsets from S3:
# 1. List available S3 files (using AWS CLI)
aws s3 ls s3://my-bucket/data/ --recursive
# 2. Query specific region from VCF
bcftools view \
s3://my-bucket/data/chr1.vcf.gz \
chr1:10000000-20000000 \
-O v -o local_subset.vcf
# 3. Validate and compress
bgzip -f local_subset.vcf
tabix -p vcf local_subset.vcf.gz
# 4. Annotate or further process locally
bcftools annotate local_subset.vcf.gz -a annotation.vcf.gz -c ID
2.5 Working with Public Genomic Data
Public genomic repositories are excellent for testing remote file access. While S3 is available, HTTPS endpoints are often more reliable for quick testing.
Query NCBI dbSNP data via HTTPS:
NCBI provides public access to common variants. Test with real data:
# Query NCBI dbSNP common variants (via HTTPS)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
| pixi run bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' - \
| head -20
Expected output showing real variants:
1 10177 A AC
1 10352 T TA
1 10352 T TA
1 10616 CCGCCGTTGCAAAGGCGCGCCG C
1 10642 G A
1 11008 C G
1 11012 C G
Discover available files:
# List NCBI FTP VCF files
curl -s https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/ \
| grep -i ".vcf.gz\"" | grep -o "href=\"[^\"]*\"" | cut -d'"' -f2
# Or for 1000 Genomes in S3:
aws s3 ls s3://1000genomes/phase3/data/ --no-sign-request | head -20
Region-based queries (with indexed files):
# Query a specific chromosome region
# Notes: requires the .tbi index at the remote location, and this file
# names contigs without a "chr" prefix (1, not chr1)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
1:1000000-2000000
Note on S3 vs HTTPS:
- HTTPS (like NCBI): Reliable, requires no credentials for public data
- S3: Fastest for large datasets, requires AWS credentials or public bucket setup
- FTP: Alternative for some repositories, generally slower than HTTPS
3. FTP Remote File Access
FTP servers host many public genomic datasets. HTSlib supports the ftp:// protocol.
3.1 FTP Authentication
Anonymous Access
Many public repositories allow anonymous FTP:
# No credentials needed
bcftools view ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/VCF/common_all.vcf.gz 1:1-1000000
Authenticated FTP
For private FTP servers, include credentials in the URL:
bcftools view ftp://username:password@ftp.example.com/data/variants.vcf.gz chr1:1-1000000
Security Note: URL-embedded credentials are visible in process listings. Since HTSlib's FTP support goes through libcurl, a safer option is a ~/.netrc file, which libcurl can read:
# ~/.netrc (keep it private: chmod 600 ~/.netrc)
machine ftp.example.com
login your_username
password your_password
# Credentials are then picked up without appearing on the command line
bcftools view ftp://ftp.example.com/data/variants.vcf.gz chr1:1-1000000
3.2 Querying FTP-hosted Files
Common public genomic repositories accessible via FTP:
NCBI FTP Repository
# Query NCBI's common variants file
bcftools view \
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all.vcf.gz \
1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' - \
| head -10
Ensembl FTP Repository
# Query Ensembl VCF (example path)
bcftools view \
ftp://ftp.ensembl.org/pub/data_files/homo_sapiens/VEP/Homo_sapiens_phenotypes_associated_variants-GRCh38.vcf.gz \
1:1-1000000
# List available Ensembl files
ftp ftp.ensembl.org # Interactive FTP session
3.3 FTP Performance Tips
- Network stability: FTP can be slower than HTTPS for single queries. Consider batch processing.
- Index availability: Verify .tbi or .csi indexes are present on the FTP server for efficient region queries.
- Timeout handling: Set timeout environment variables for unreliable connections (see also the retry sketch below):
export CURL_CONNECT_TIMEOUT=30
export CURL_TIMEOUT=60
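For flaky FTP servers, a small retry wrapper saves babysitting long jobs. A sketch (the function name is ours; adjust attempts and delay to taste):
#!/usr/bin/env bash
# Retry a bcftools region query a few times before giving up.
retry_query() {
  local attempts=3 delay=10
  for i in $(seq 1 "$attempts"); do
    if bcftools view "$1" "$2" -O z -o "$3"; then
      return 0
    fi
    echo "Attempt $i failed; retrying in ${delay}s..." >&2
    sleep "$delay"
  done
  return 1
}

retry_query ftp://ftp.example.com/data/variants.vcf.gz chr1:1-1000000 subset.vcf.gz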
4. HTTPS Remote File Access
HTTPS endpoints often require token-based authentication and are increasingly common for cloud-hosted data.
4.1 HTTPS Authentication Methods
Bearer Token Authentication
For endpoints that accept Bearer tokens (common in cloud platforms):
# Option 1: Set environment variable
export HTTPS_AUTH_BEARER="your_bearer_token"
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000
# Option 2: Pass credentials in the URL (sent as HTTP Basic auth; less secure)
bcftools view https://token:your_bearer_token@api.example.com/data/variants.vcf.gz chr1:1-1000000
API Keys in Headers
For services requiring custom headers, stream the file through curl. A streamed file cannot be used with an index, so use the streaming -t/--targets option instead of a positional region (which requires seeking):
# Stream via curl with a custom header; -t filters positions while streaming
curl -s -H "Authorization: Bearer TOKEN" \
  https://api.example.com/data/variants.vcf.gz \
  | bcftools view -t chr1:1-1000000 -
Basic HTTP Authentication
# URL-embedded credentials (not recommended for security)
bcftools view https://username:password@secure.example.com/data/variants.vcf.gz chr1:1-1000000
4.2 Practical HTTPS Workflows
Query Public HTTPS-hosted Data
Many databases provide HTTPS access to genomic data:
# Example: Query genome annotation from HTTPS endpoint
bcftools view https://example.genomics.org/public/variants.vcf.gz chr1:1-1000000
# Extract specific information
bcftools view https://example.genomics.org/public/variants.vcf.gz chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\n' -
Use a Local Index with a Remote File
For repeated queries, cache the small index locally and point HTSlib at it explicitly with the ##idx## suffix:
# Download the index file (small, fast)
curl -o variants.vcf.gz.tbi https://api.example.com/data/variants.vcf.gz.tbi
# Query the remote data file using the local index
bcftools view "https://api.example.com/data/variants.vcf.gz##idx##variants.vcf.gz.tbi" chr1:1-1000000
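samtools offers the same idea through its -X flag, which takes the data file and index as separate arguments. A sketch against the same hypothetical endpoint:
# Cache the small BAM index locally
curl -o aligned.bam.bai https://api.example.com/data/aligned.bam.bai

# -X pairs the remote BAM with the local index for the region query
samtools view -X https://api.example.com/data/aligned.bam aligned.bam.bai chr1:1-1000000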
4.3 SSL/TLS Certificate Issues
If you encounter certificate validation errors:
# Disable certificate verification (use cautiously)
export SSL_NO_VERIFY=1
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000
# Or use curl environment variable
export CURL_CA_BUNDLE=""
bcftools view https://api.example.com/data/variants.vcf.gz chr1:1-1000000
Security Note: Only disable certificate verification for trusted internal endpoints or testing.
5. Indexing and Efficient Region Queries
Efficient remote file access depends on proper indexing. HTSlib tools use byte-range requests to fetch only necessary data.
5.1 Creating Indexes
Create Tabix Index for VCF Files
# For bgzip-compressed VCF files
tabix -p vcf variants.vcf.gz
# Verify index
ls -lh variants.vcf.gz*
Create CSI Index for BCF Files
# CSI indexes support larger files and datasets
bcftools index -c variants.bcf
# Verify index
ls -lh variants.bcf.csi
Create BAI Index for BAM Files
# Standard BAM indexing
samtools index aligned.bam
# Creates aligned.bam.bai
ls -lh aligned.bam.bai
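Remember that for remote queries the index must live next to the data file on the server. When publishing your own data to S3, upload both (the bucket name is a placeholder):
# Index locally, then upload data and index side by side
tabix -p vcf variants.vcf.gz
aws s3 cp variants.vcf.gz     s3://my-bucket/data/variants.vcf.gz
aws s3 cp variants.vcf.gz.tbi s3://my-bucket/data/variants.vcf.gz.tbi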
5.2 Verifying Index Integrity
# Check if VCF index is valid
tabix -l variants.vcf.gz | head -10
# Query a specific region (tests index)
bcftools view variants.vcf.gz chr1:1-1000 -H | wc -l
5.3 Performance: Indexed vs Unindexed Remote Access
Indexed remote access (with region query):
- Seeks to file index
- Downloads byte-range containing target region
- Returns ~100KB-1MB of data
Unindexed remote access:
- Must scan entire file sequentially
- Downloads entire file
- Returns all data (can be GB-sized)
Example performance comparison:
# Indexed query (fast, ~1-5 seconds)
time bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 > /dev/null
# Without an index alongside the remote file, the same region query fails,
# and reading the file at all means streaming it from the start (minutes, not seconds)
5.4 Testing Remote File Access with Local Files
Before working with remote files, it's helpful to test bcftools and samtools locally. Use this workflow to verify your installation and practice region queries.
Create test VCF file:
# Create test directory
mkdir -p ~/genomics-test && cd ~/genomics-test
# Create test VCF (columns must be TAB-separated)
cat > test.vcf << 'EOF'
##fileformat=VCFv4.2
##contig=<ID=chr1,length=1000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
chr1 10000 . A G 60 . DP=100 GT 0/1 1/1
chr1 20000 . C T 50 . DP=80 GT 0/0 0/1
chr1 30000 . G A 70 . DP=120 GT 1/1 0/1
chr1 100000 . T C 55 . DP=90 GT 0/1 0/0
EOF
# Compress and index (requires bgzip and tabix from pixi)
pixi run bgzip -f test.vcf
pixi run tabix -p vcf test.vcf.gz
Test bcftools region queries:
# Query specific region
pixi run bcftools view test.vcf.gz chr1:1-50000
# Extract specific samples
pixi run bcftools view -s sample1 test.vcf.gz chr1:10000-50000
# Count variants in region
pixi run bcftools view -H test.vcf.gz chr1:1-100000 | wc -l
Create and test BAM file:
# Create test SAM file (fields must be TAB-separated)
cat > test.sam << 'EOF'
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:1000000
read1 0 chr1 100 60 10M * 0 0 ACGTACGTAC IIIIIIIIII
read2 0 chr1 120 60 10M * 0 0 TGCATGCATG IIIIIIIIII
read3 0 chr1 50000 60 10M * 0 0 AAAAAAAAAA IIIIIIIIII
read4 0 chr1 100000 60 10M * 0 0 TTTTTTTTTT IIIIIIIIII
EOF
# Convert to BAM and index
pixi run samtools view -b test.sam -o test.bam
pixi run samtools index test.bam
Test samtools region queries:
# View reads in region
pixi run samtools view test.bam chr1:1-60000
# Count reads in region
pixi run samtools view -c test.bam chr1:1-60000
# Extract region to new file
pixi run samtools view -b test.bam chr1:50000-100000 -o region.bam
Once comfortable with local files, the same commands work seamlessly with remote S3, FTP, or HTTPS URLs—just replace local filenames with remote URLs.
6. Common Workflows and Recipes
6.1 Extract Variants for Specific Genes
# Define gene regions in BED format (chrom, start, end, name); contig names must match the VCF
cat > genes.bed << 'EOF'
1 11869 14409 DDX11L1
1 17369 17436 MIR6859-1
EOF
# Extract variants from remote file for gene regions
bcftools view \
s3://my-bucket/variants.vcf.gz \
-R genes.bed \
-O v -o local_genes.vcf
# Compress and index
bgzip -f local_genes.vcf
tabix -p vcf local_genes.vcf.gz
6.2 Compare Samples Across Remote Files
# Compare variants for two samples in different S3 files.
# bcftools isec needs indexed inputs, so extract each sample to a local,
# indexed subset first rather than piping streams into isec.
bcftools view -s sample1 s3://bucket1/variants.vcf.gz -O z -o s1.vcf.gz
bcftools view -s sample2 s3://bucket2/variants.vcf.gz -O z -o s2.vcf.gz
bcftools index -t s1.vcf.gz && bcftools index -t s2.vcf.gz
bcftools isec s1.vcf.gz s2.vcf.gz -p comparison_output
# Shows shared and unique variants
ls comparison_output/
6.3 Annotate Remote VCF with Remote Annotations
# Annotate S3-hosted VCF using FTP-hosted annotation file
bcftools annotate \
s3://my-bucket/variants.vcf.gz \
-a ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/common_all.vcf.gz \
-c ID \
-O v -o annotated.vcf
6.4 Extract BAM Regions for Multiple Samples
# Create list of regions
cat > regions.txt << 'EOF'
chr1:1-1000000
chr2:2000000-3000000
chr3:5000000-6000000
EOF
# Extract each region from remote BAM, save locally
while read region; do
samtools view \
s3://my-bucket/aligned.bam \
$region \
-b -o ${region//:/_}.bam
samtools index ${region//:/_}.bam
done < regions.txt
6.5 Stream Processing Without Local Storage
# Pipe remote data through multiple tools
bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AF\n' - \
| awk '$6 > 0.05' \
| tee filtered_variants.tsv \
| wc -l
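When the filter is on allele frequency, bcftools can apply it directly with -q/--min-af, which avoids shipping the discarded records through awk at all. An equivalent sketch:
# Filter by minor allele frequency inside bcftools, then format
bcftools view -q 0.05:minor s3://my-bucket/variants.vcf.gz chr1:1-1000000 \
  | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AF\n' - \
  | tee filtered_variants.tsv \
  | wc -l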
7. Troubleshooting and Optimization
7.1 Common Issues
Issue: "File not found" or 404 errors
- Solution: Verify S3 bucket name, path, and file permissions. Test with AWS CLI:
aws s3 ls s3://bucket-name/path/file.vcf.gz
Issue: Authentication failures with S3
- Solution: Verify AWS credentials:
aws sts get-caller-identity
Issue: Index not found (.tbi or .bai missing)
- Solution: Create indexes or verify they exist on remote server:
aws s3 ls s3://bucket-name/file.vcf.gz.tbi
Issue: "Connection timeout" on FTP
- Solution: Increase timeout and try again:
export CURL_CONNECT_TIMEOUT=60
export CURL_TIMEOUT=120
Issue: HTTPS SSL certificate errors
- Solution: Verify certificate validity or temporarily disable verification for testing:
curl -k https://endpoint.com/file.vcf.gz.tbi # -k disables cert verification
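These checks can be bundled into one quick diagnostic. A sketch for HTTPS-hosted files (pass your own URL as the first argument):
#!/usr/bin/env bash
# Quick health check for a remote indexed VCF over HTTPS.
url="$1"  # e.g. https://server.com/data/variants.vcf.gz

echo "1. Data file reachable?"
curl -s -o /dev/null -w '  HTTP %{http_code}\n' -I "$url"

echo "2. Index present next to it?"
curl -s -o /dev/null -w '  HTTP %{http_code}\n' -I "${url}.tbi"

echo "3. Range requests supported?"
curl -s -o /dev/null -w '  HTTP %{http_code} (expect 206)\n' -H "Range: bytes=0-100" "$url"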
7.2 Optimization Tips
For S3 Access:
- Use S3 Transfer Acceleration if available for faster downloads
- Place compute resources in the same AWS region as your data
- Use VPC gateway endpoints for S3 to avoid Internet Gateway charges; this is network-level configuration, so the URL is unchanged:
# With a VPC endpoint in place (no data transfer charges within the VPC), the same URL form applies
s3://bucket-name/key
For Large Batch Queries:
- Sort region requests by chromosome for better caching
- Consider downloading entire chromosome subset once, then querying locally:
bcftools view s3://bucket/variants.vcf.gz chr1 -O z -o chr1.vcf.gz
bcftools index chr1.vcf.gz
# Then query locally multiple times
For FTP Performance:
- Batch multiple queries when possible
- Check server-side compression settings
- Consider mirror/rsync services for frequently accessed data
7.3 Debugging with Verbose Output
# Increase verbosity with the global --verbosity option (recent bcftools/samtools)
bcftools view --verbosity 4 s3://bucket/variants.vcf.gz chr1:1-1000000
samtools view --verbosity 4 s3://bucket/aligned.bam chr1:1-1000000
# Reproduce the transfer manually with curl to inspect headers and byte ranges
curl -v -H "Range: bytes=0-500" https://server.com/file.vcf.gz -o /dev/null
8. Recap and Key Takeaways
8.1 What We've Covered
- Protocol Support: HTSlib tools support S3, FTP, and HTTPS remote file access natively
- Authentication: Each protocol has different auth mechanisms (AWS credentials, FTP user:pass, Bearer tokens)
- Efficient Queries: Indexed files enable byte-range requests, downloading only necessary data
- Practical Workflows: Common tasks like region extraction, sample comparison, and annotation using remote files
- Troubleshooting: How to diagnose and resolve common remote file access issues
8.2 Key Takeaways
- Always use indexes for remote files to avoid downloading entire datasets
- Choose the right protocol: S3 for AWS-hosted data, HTTPS for public databases and APIs, FTP for legacy repositories
- Set up authentication correctly to avoid repeated connection failures
- Stream processing reduces local storage requirements for large datasets
- Verify indexes exist on remote servers before querying
- Optimize by region: Place compute near data, batch similar queries, use appropriate timeouts
8.3 Quick Reference Workflow
Setup (one time):
# Create project directory
mkdir -p ~/genomics-work && cd ~/genomics-work
# Initialize Pixi with conda-forge and bioconda
pixi init --channel conda-forge --channel bioconda .
# Add tools
pixi add bcftools samtools
# Verify installation
pixi run bcftools --version
pixi run samtools --version
Using remote files:
# 1. Set up authentication (optional, for private buckets)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
# 2. Query real remote file (NCBI dbSNP via HTTPS - no auth needed)
pixi run bcftools view \
https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-common_all.vcf.gz \
| pixi run bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' - \
| head -20
# 3. Or for your own S3 data:
pixi run bcftools view s3://my-bucket/variants.vcf.gz chr1:1-1000000 -O v -o chr1_subset.vcf
# 4. Create local index for repeated use
pixi run bgzip -f chr1_subset.vcf
pixi run tabix -p vcf chr1_subset.vcf.gz
8.4 Next Steps
- Explore your data: Start with small region queries to test setup
- Automate workflows: Create scripts for common extraction and annotation tasks
- Scale up: Use Nextflow or similar workflow managers for processing multiple files
- Monitor costs: Track S3 data transfer and request costs if using cloud storage