Genomic location of pathogenic ClinVar variants

How many pathogenic ClinVar variants are in intergenic regions? I'll define genomic regions as per this old post. To get started, download the latest ClinVar variants:

wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20170104.vcf.gz

# index
tabix -p vcf clinvar_20170104.vcf.gz

# how many variants?
zcat clinvar_20170104.vcf.gz | grep -v "^#" | wc -l
232624

There are nine clinical significance codes associated to ClinVar variants.

0 - Uncertain significance
1 - not provided
2 - Benign
3 - Likely benign
4 - Likely pathogenic
5 - Pathogenic
6 - drug response
7 - histocompatibility
255 - other

##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - Uncertain significance, 1 - not provided, 2 - Benign, 3 - Likely benign, 4 - Likely pathogenic, 5 - Pathogenic, 6 - drug response, 7 - histocompatibility, 255 - other">
bcftools query -f '%CLNSIG\n' clinvar_20170104.vcf.gz | sort | uniq -c | sort -k1rn | head
  77815 0
  29289 5
  27305 3
  15385 2
  11675 0|0
  10659 1
   6531 4
   4572 255
   4552 3|3
   3392 5|5

# there are actually no histocompatibility variants
bcftools query -f '%CLNSIG\n' clinvar_20170104.vcf.gz | grep 7
# returns nothing

As you can see above, some variants have more than one significance value. I also found commas.

bcftools query -f '%CLNSIG\n' clinvar_20170104.vcf.gz  | grep , | head
0|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5|5,5|5|5|5|5|5|5|5|5|5|5|5|5
5|5,5|5
5,5
3,2|2|2
1,1
0,0
5,0
0,0
0|0|0,0|0|0,0|0|0
0,0

I wrote a Perl script to skip variants that have both benign and pathogenic codes.

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: $0 <clinvar.vcf> <significance code>\n";
my $infile = shift or die $usage;
my $code = shift or die $usage;

if ($infile =~ /\.gz$/){
   open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN, '<', $infile) || die "Could not open $infile: $!\n";
}

VARIANT: while(<IN>){
   chomp;
   my $current_line = $_;
   if ($current_line =~ /^#/){
      print "$current_line\n";
   } else {
      if (/CLNSIG=(.*?);/){
         my $clnsig = $1;
         # split by pipes or commas
         my @clnsig = split(/\||,/, $clnsig);

         # has only one significance code
         if (scalar(@clnsig) == 1){
            if ($clnsig == $code){
               print "$current_line\n";
               next VARIANT;
            }
         }

         # $match if there is a match to input code
         my ($benign, $pathogenic, $match) = (0, 0, 0);
         foreach my $c (@clnsig){
            if ($c == 2 || $c == 3){
               $benign = 1;
            }
            if ($c == 4 || $c == 5){
               $pathogenic = 1;
            }
            if ($c == $code){
               $match = 1;
            }
         }

         # check and skip conflicting codes
         if ($benign == 1 && $pathogenic == 1){
            next VARIANT;
         }

         if ($match == 1){
            print "$current_line\n";
         }
      }
   }
}
close(IN);

exit(0);

Now to use the script to create a subset of ClinVar variants that are pathogenic and to convert the VCF file into a BED file.

./stratify.pl 
Usage: ./stratify.pl <clinvar.vcf> <significance code>

./stratify.pl clinvar_20170104.vcf.gz 5  | grep -v "^#" | wc -l
40651

./stratify.pl clinvar_20170104.vcf.gz 5 | bgzip > clinvar_20170104_class_5.vcf.gz

~/github/learning_vcf_file/script/vcf_to_bed.pl clinvar_20170104_class_5.vcf.gz | gzip  > clinvar_20170104_class_5.bed.gz

zcat clinvar_20170104_class_5.bed.gz | wc -l
40651

Using my defining genomic regions repository, I have generated BED files that correspond to merged exonic, intronic, and intergenic regions of hg19.

# the ClinVar VCF file does not use chr to name the chromosomes, which I actually prefer
zcat gencode_v19_exon_merged.bed.gz | sed 's/chr//' | gzip > exon.bed.gz
zcat gencode_v19_intron.bed.gz | sed 's/chr//' | gzip > intron.bed.gz
zcat gencode_v19_intergenic.bed.gz | sed 's/chr//' | gzip > intergenic.bed.gz

bedtools intersect -a clinvar_20170104_class_5.bed.gz -b exon.bed.gz -u | wc -l
37418

bedtools intersect -a clinvar_20170104_class_5.bed.gz -b intron.bed.gz -u | wc -l
2994

bedtools intersect -a clinvar_20170104_class_5.bed.gz -b intergenic.bed.gz -u | wc -l
32

There are only 32 pathogenic variants in intergenic regions.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

Leave a Reply Cancel reply