Parallelize bcftools functions

bcftools is a great for working with variant call files. In general, it is fast. However, I have found that the process of merging VCF files (using bcftools merge) and performing concordance checking (using bcftools gtcheck) can be a little bit slow. That is why I wrote two functions that take advantage of GNU Parallel to parallelize them.

# ~/.bashrc: executed by bash(1) for non-login shells.
# see /usr/share/doc/bash/examples/startup-files (in the package bash-doc)
# for examples

function bam_chromosomes() {
    # Fetch chromosomes from a bam file
    samtools view -H $1 | \
    grep -Po 'SN:(.*)\t' | \
    cut -c 4-1000
}

function vcf_chromosomes() {
    # Fetch contigs from a vcf file.
    bcftools view -h $1 | \
    grep 'contig' | \
    egrep -o "ID=([^,]+)" | \
    sed 's/ID=//g' | \
    tr '\n' ' '
}


PARALLEL_CORES=6
function parallel_bcftools_merge() {
    file_set=`echo $@ | egrep -o '(\-l|\-\-file-list)(=|[ ]+)[^ ]+' | tr '=' ' ' | cut -f 2 -d ' '`
    if [ -n "${file_set}" ]
        then
            find_vcf=`cat ${file_set} | head -n 1`
        else
            find_vcf=`echo $@ | tr '\t' '\n' | egrep -o '[^ ]+.vcf.gz' | awk 'NR == 1 { print }' - `
    fi
    contigs=`vcf_chromosomes ${find_vcf}`
    current_dir=$(dirname ${find_vcf})
    hash_merge=`echo "$@" | md5sum | cut -c 1-5`
    output_prefix="${current_dir}/parallel_merge.${hash_merge}"
    
    parallel --gnu --workdir ${current_dir} \
    --env args -j ${PARALLEL_CORES} \
    'bcftools merge -r {1} -O u ' $@ ' > ' ${output_prefix}'.{1}.bcf' ::: ${contigs} 
    
    order=`echo $contigs | tr ' ' '\n' | awk -v "prefix=${output_prefix}" '{ print prefix "." $0 ".bcf" }'`
    bcftools concat -O v ${order} | grep -v 'parallel_merge' | sed 's/##bcftools_mergeCommand=merge -r I -O u /##bcftools_mergeCommand=merge /g' | bcftools view -O u
    rm ${order}
}


PARALLEL_CORES=6
function parallel_bcftools_gtcheck() {
    find_vcf=`echo $@ | tr '\t' '\n' | egrep -o '[^ ]+.vcf.gz' | awk 'NR == 1 { print }' - `
    contigs=`vcf_chromosomes ${find_vcf}`
    current_dir=$(dirname ${find_vcf})
    hash_merge=`echo "$@" | md5sum | cut -c 1-5`
    output_prefix="${current_dir}/parallel_concordance.${hash_merge}"
    gtcheck_options=`echo $@ | awk -v vcf=${find_vcf} '{ gsub(vcf,"",$0); print $0; }'`
    parallel --gnu  -j ${PARALLEL_CORES} --workdir ${current_dir} 'bcftools view ' ${find_vcf} ' {} | \
    bcftools gtcheck ' ${gtcheck_options} ' - | \
    awk -v chrom={} "/^CN/ { print \$0 \"\t\" chrom } \$0 !~ /.*CN.*/ { print } \$0 ~ /^# \[1\]CN/ { print \$0 \"\tchrom\"}" - > ' ${output_prefix}'.{}.tsv' ::: ${contigs}

    order=`echo $contigs | tr ' ' '\n' | awk -v "prefix=${output_prefix}" '{ print prefix "." $0 ".tsv" }'`
    cat ${order} | \
    grep 'CN' | \
    awk 'NR == 1 && /Discordance/ { print } NR > 1 && $0 !~ /Discordance/ { print }' | \
    awk '{ gsub("(# |\\[[0-9]+\\])","", $0); gsub(" ","_", $0); print }' | \
    cut -f 2-7 | datamash --header-in --sort --group=Sample_i,Sample_j sum Discordance  sum Number_of_sites | \
    cat <(echo -e "sample_i\tsample_j\tdiscordance\tnumber_of_sites") - | \
    awk 'NR == 1 { print $0 "\tconcordance" } NR > 1 && $4 == 0 { print $0 "\t" } NR > 1 && $4 > 0 { print $0 "\t" ($4-$3)/$4 }'
    rm ${order}
}

Usage

The function vcf_chromosomes extracts chromosomes names from a VCF file using bcftools. Parallelization occurs across chromosomes.

`parallel_bcftools_merge`

parallel_bcftools_merge is run very similar to bcftools merge. The only difference is that you have to pipe it into bcftools to change it to the appropriate output. For example:

parallel_bcftools_merge -m all `ls *list_of_bcffiles` | bcftools view -O z > merged_vcf.vcf.gz

The parallel_bcftools_merge function will generate a temporary vcf for every chromosome. You can use all flags except for -O with this function.

`parallel_bcftools_gtcheck`

parallel_bcftools_gtcheck should not be used with --all-sites, or --plot. I recommend using this function with -H and -G 1 to calculate the absolute number of differences in terms of homozygous calls between samples. Also, this function requires datamash (on OSX, install with brew install datamash)

The output file is slightly different than what bcftools normally outputs. In general, I use this function specifically to calculate conocordance between individual fastq runs - like this:

parallel_bcftools_gtchek -H -G 1 union_samples.vcf.gz > concordance.tsv

This parallelized version generates concordances for each chromosome and then merges the results together using datamash. Output looks like this:

sample_i	sample_j	discordance	number_of_sites	concordance
BGI2-RET1-ED3049	BGI1-RET1-ED3049	927	2344043	0.999605
BGI1-RET1-CB4856	BGI1-RET1-CB4852	144484	2171694	0.933469
BGI1-RET1-CX11315	BGI1-RET1-CB4852	106964	2721950	0.960703
BGI1-RET1-CX11315	BGI1-RET1-CB4856	137200	2059983	0.933398
BGI1-RET1-DL238	BGI1-RET1-CB4852	148217	2097343	0.929331
BGI1-RET1-DL238	BGI1-RET1-CB4856	124132	1803664	0.931178
BGI1-RET1-DL238	BGI1-RET1-CX11315	146580	1996802	0.926593