Recently I started developing a set of utilities called seq-collection (sc) written in nim and using the fantastic hts-nim package.
The first utility I added was a tool to convert a VCF to JSON. This tool is useful for building out an API that reads genotype data directly from the VCF format. It is possible to read specific variants or intervals of VCF files when they are indexed, allowing for fast and efficient querying of genetic data without the need for a database. Furthermore, these queries can be made over http connections making it possible to use a VCF file as a database.
For example, as a graduate student I developed a genome browser for C. elegans wild isolates. Queries are made directly on specific genomic intervals of VCF files using bcftools
. However, a large amount of python code is used to convert VCF output ot JSON format. The sc
utility could replace this underlying python code on the server with a simple binary and a few command line arguments to accomplish the same task. Below I have a few examples illustrating how to use sc json
.
Installation
You can download the MAC OSX binary here.
Or build from source at danielecook/seq-collection
Usage
json
Convert a VCF to JSON
Usage:
json [options] vcf [region ...]
Arguments:
vcf VCF to convert to JSON
[region ...] List of regions
Options:
-i, --info=INFO comma-delimited INFO fields; Use 'ALL' for everything
-f, --format=FORMAT comma-delimited FORMAT fields; Use 'ALL' for everything
-s, --samples=SAMPLES Set Samples (default: ALL)
-p, --pretty Prettify result
-a, --array Output as a JSON array instead of individual JSON lines
-z, --zip Zip sample names with FORMAT fields (e.g. {'sample1': 25, 'sample2': 34})
-n, --annotation Parse ANN Fields
--pass Only output variants where FILTER=PASS
--debug Debug
-h, --help Show this help
Examples
List all sites
sc json tests/data/test.vcf.gz
Output
{"CHROM":"I","POS":41947,"ID":".","REF":"A","ALT":["T"],"QUAL":999.0,"FILTER":["PASS"]}
{"CHROM":"I","POS":105133,"ID":".","REF":"G","ALT":["A"],"QUAL":999.0,"FILTER":["PASS"]}
{"CHROM":"I","POS":176422,"ID":".","REF":"A","ALT":["G"],"QUAL":999.0,"FILTER":["PASS"]}
pretty output
We can “prettify” this output using the --pretty
flag:
sc json --pretty tests/data/test.vcf.gz
{
"CHROM": "I",
"POS": 41947,
"ID": ".",
"REF": "A",
"ALT": [
"T"
],
"QUAL": 999.0,
"FILTER": [
"PASS"
]
}
Fetch Genotypes
Next we can output genotype calls by specifying the --FORMAT
flag with a GT
argument:
> sc json --FORMAT=GT tests/data/test.vcf.gz
{"CHROM":"I","POS":41947,"ID":null,"REF":"A","ALT":["T"],"QUAL":999.0,"FILTER":["PASS"],"FORMAT":{"GT":[[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0]]}}
{"CHROM":"I","POS":105133,"ID":null,"REF":"G","ALT":["A"],"QUAL":999.0,"FILTER":["PASS"],"FORMAT":{"GT":[[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[1,1]]}}
{"CHROM":"I","POS":176422,"ID":null,"REF":"A","ALT":["G"],"QUAL":999.0,"FILTER":["PASS"],"FORMAT":{"GT":[[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],[1,1],[0,0],[0,0],[0,0],[0,0],[1,1],[1,1],[null,null]]}}
The genotypes are ordered by sample, and the numbers correspond as follows as follows:
-1
→ Missing0
→ Reference1
→ First ALT allele2
→ Second ALT allele3
→ etc.
You can also use SGT
to outut a string representation of genotypes (e.g. “0/1”). It is also possible to use TGT
to output the actual bases:
sc json --format=TGT tests/data/test.vcf.gz
{"CHROM":"I","POS":41947,"ID":null,"REF":"A","ALT":["T"],"QUAL":999.0,"FILTER":["PASS"],"FORMAT":{"TGT":["A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A","A/A"]}}
{"CHROM":"I","POS":105133,"ID":null,"REF":"G","ALT":["A"],"QUAL":999.0,"FILTER":["PASS"],"FORMAT":{"TGT":["G/G","G/G","G/G","G/G","G/G","G/G","G/G","G/G","G/G","G/G","G/G","G/G","G/G","A/A"]}}
{"CHROM":"I","POS":176422,"ID":null,"REF":"A","ALT":["G"],"QUAL":999.0,"FILTER":["PASS"],"FORMAT":{"TGT":["A/A","A/A","A/A","A/A","A/A","A/A","G/G","A/A","A/A","A/A","A/A","G/G","G/G","./."]}}