How many pathogenic ClinVar variants are in intergenic regions? I'll define genomic regions as per this old post. To get started, download the latest ClinVar variants:
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20170104.vcf.gz # index tabix -p vcf clinvar_20170104.vcf.gz # how many variants? zcat clinvar_20170104.vcf.gz | grep -v "^#" | wc -l 232624
There are nine clinical significance codes associated to ClinVar variants.
0 - Uncertain significance
1 - not provided
2 - Benign
3 - Likely benign
4 - Likely pathogenic
5 - Pathogenic
6 - drug response
7 - histocompatibility
255 - other
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 7 - histocompatibility, 255 - other"> bcftools query -f '%CLNSIG\n' clinvar_20170104.vcf.gz | sort | uniq -c | sort -k1rn | head 77815 0 29289 5 27305 3 15385 2 11675 0|0 10659 1 6531 4 4572 255 4552 3|3 3392 5|5 # there are actually no histocompatibility variants bcftools query -f '%CLNSIG\n' clinvar_20170104.vcf.gz | grep 7 # returns nothing
As you can see above, some variants have more than one significance value. I also found commas.
bcftools query -f '%CLNSIG\n' clinvar_20170104.vcf.gz | grep , | head 0|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5,5|5|5|5|5|5|5|5|5|5|5|5|5 5|5,5|5 5,5 3,2|2|2 1,1 0,0 5,0 0,0 0|0|0,0|0|0,0|0|0 0,0
I wrote a Perl script to skip variants that have both benign and pathogenic codes.
#!/usr/bin/env perl use strict; use warnings; my $usage = "Usage: $0 <clinvar.vcf> <significance code>\n"; my $infile = shift or die $usage; my $code = shift or die $usage; if ($infile =~ /\.gz$/){ open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n"; } else { open(IN, '<', $infile) || die "Could not open $infile: $!\n"; } VARIANT: while(<IN>){ chomp; my $current_line = $_; if ($current_line =~ /^#/){ print "$current_line\n"; } else { if (/CLNSIG=(.*?);/){ my $clnsig = $1; # split by pipes or commas my @clnsig = split(/\||,/, $clnsig); # has only one significance code if (scalar(@clnsig) == 1){ if ($clnsig == $code){ print "$current_line\n"; next VARIANT; } } # $match if there is a match to input code my ($benign, $pathogenic, $match) = (0, 0, 0); foreach my $c (@clnsig){ if ($c == 2 || $c == 3){ $benign = 1; } if ($c == 4 || $c == 5){ $pathogenic = 1; } if ($c == $code){ $match = 1; } } # check and skip conflicting codes if ($benign == 1 && $pathogenic == 1){ next VARIANT; } if ($match == 1){ print "$current_line\n"; } } } } close(IN); exit(0);
Now to use the script to create a subset of ClinVar variants that are pathogenic and to convert the VCF file into a BED file.
./stratify.pl Usage: ./stratify.pl <clinvar.vcf> <significance code> ./stratify.pl clinvar_20170104.vcf.gz 5 | grep -v "^#" | wc -l 40651 ./stratify.pl clinvar_20170104.vcf.gz 5 | bgzip > clinvar_20170104_class_5.vcf.gz ~/github/learning_vcf_file/script/vcf_to_bed.pl clinvar_20170104_class_5.vcf.gz | gzip > clinvar_20170104_class_5.bed.gz zcat clinvar_20170104_class_5.bed.gz | wc -l 40651
Using my defining genomic regions repository, I have generated BED files that correspond to merged exonic, intronic, and intergenic regions of hg19.
# the ClinVar VCF file does not use chr to name the chromosomes, which I actually prefer zcat gencode_v19_exon_merged.bed.gz | sed 's/chr//' | gzip > exon.bed.gz zcat gencode_v19_intron.bed.gz | sed 's/chr//' | gzip > intron.bed.gz zcat gencode_v19_intergenic.bed.gz | sed 's/chr//' | gzip > intergenic.bed.gz bedtools intersect -a clinvar_20170104_class_5.bed.gz -b exon.bed.gz -u | wc -l 37418 bedtools intersect -a clinvar_20170104_class_5.bed.gz -b intron.bed.gz -u | wc -l 2994 bedtools intersect -a clinvar_20170104_class_5.bed.gz -b intergenic.bed.gz -u | wc -l 32
There are only 32 pathogenic variants in intergenic regions.
This work is licensed under a Creative Commons
Attribution 4.0 International License.