Exploring the UK10K variants

It has been almost two months since my last post; I have been occupied with preparing a fellowship application (which has been sent off!) and now with preparing and writing papers. Sadly, I've pushed blogging right down the priority list, even though it's one of the things I enjoy doing the most. This post is on exploring the variants that were discovered as part of the UK10K project. For the uninitiated, the UK10K project was a massive undertaking that aimed to characterise human genetic variation within the UK population using whole-exome (WES) and whole-genome sequencing (WGS). The WGS arm sequenced healthy individuals (n=3,781) who were part of longitudinal studies, and the WES arm sequenced individuals (n=5,294, with 5,182 passing QC) with rare diseases, severe obesity, and neurodevelopmental disorders. It's not quite 10K, but it's still an impressive number for now; the 100,000 Genomes Project has already reached 7,306 genomes.

Why are we sequencing more and more genomes? Large numbers of genomes are necessary for capturing rare variants, which are largely private to a single individual (as you will see in this post). Because rare variants are rare, they have been difficult to study, and various groups are interested in understanding their contribution to phenotypic traits. Besides the contribution of rare variants, I am also interested in natural genetic variation, i.e. background mutations. I've decided to explore the UK10K list of variants because it is one of the largest datasets available; the final call set contains ~42M single nucleotide variants (SNVs) and ~3.5M insertions/deletions (INDELs). Other large-scale projects include the Genome of the Netherlands (n=750), the Icelandic genome sequencing project (n=2,636), and the 1000 Genomes Project (n=2,504). For exploring the UK10K variants I will use GEMINI; please refer to my last post if you want a brief introduction to the tool.

To begin, we need to download the VCF file that contains the variants called from whole-genome sequencing. You can get it at https://www.uk10k.org/data.html.
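The commands below rely on two shell variables: the downloaded VCF and a reference genome FASTA. The VCF filename follows from the file names used later in this post; the FASTA filename is an assumption (the project used GRCh37), so adjust both to your local paths:

```shell
# assumed paths; change these to wherever you saved the files
VCF=UK10K_COHORT.20140722.sites.vcf.gz
GENOME=human_g1k_v37.fasta
```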

Now to create a GEMINI database.


# split multi-allelic sites into separate records and
# normalise (left-align and trim) the variants using vt
BASE=`basename $VCF .vcf.gz`

vt decompose -s $VCF | vt normalize -r $GENOME - | gzip > $BASE.vt.vcf.gz

# annotate using VEP
# instructions for setting up VEP, if you haven't already
# http://davetang.org/wiki2/index.php?title=VEP
perl ~/src/ensembl-tools-release-82/scripts/variant_effect_predictor/variant_effect_predictor.pl -i $BASE.vt.vcf.gz -o $BASE.vt.vep.vcf --vcf \
--offline --cache --sift b --polyphen b --symbol --numbers --biotype --total_length \
--fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE

# gemini
gemini load -v $BASE.vt.vep.vcf -t VEP --cores 8 $BASE.vt.vep.db

The VCF file contains several useful INFO fields, which we need to add to the database manually using the gemini annotate command.

# index using tabix
tabix -p vcf UK10K_COHORT.20140722.sites.vt.vep.vcf.gz

# add various INFO fields in the database
# these steps take some time; I should have timed it :(
gemini annotate -f UK10K_COHORT.20140722.sites.vt.vep.vcf.gz -o list -e DP -t integer UK10K_COHORT.20140722.sites.vt.vep.db
gemini annotate -f UK10K_COHORT.20140722.sites.vt.vep.vcf.gz -o first -e VQSLOD -t float UK10K_COHORT.20140722.sites.vt.vep.db
gemini annotate -f UK10K_COHORT.20140722.sites.vt.vep.vcf.gz -o list -e AC -t integer UK10K_COHORT.20140722.sites.vt.vep.db
gemini annotate -f UK10K_COHORT.20140722.sites.vt.vep.vcf.gz -o list -e AN -t integer UK10K_COHORT.20140722.sites.vt.vep.db
gemini annotate -f UK10K_COHORT.20140722.sites.vt.vep.vcf.gz -o first -e AF -t float UK10K_COHORT.20140722.sites.vt.vep.db

Now let's perform some simple queries. First, how many SNVs and INDELs are in the database?

time gemini query -q 'select type,count(*) from variants group by type' --header *.db
type    count(*)
indel   4207224
snp     42424188

real    0m9.012s
user    0m6.788s
sys     0m0.476s

# we can use BCFtools to cross-check
bcftools view -v snps UK10K_COHORT.20140722.sites.vt.vep.vcf.gz | grep -v "^#" | wc -l
bcftools view -v indels UK10K_COHORT.20140722.sites.vt.vep.vcf.gz | grep -v "^#" | wc -l 

Figure 1 of the UK10K paper illustrates that 18-19M of the SNVs are only present once, i.e. allele count = 1, and a lot of the UK10K rare variants (MAF < 1%) have not been recorded in the 1000 Genomes Project. (A variant with an allele count of one indicates that only one individual is heterozygous for the variant; an allele count of two could mean two heterozygous individuals or one individual homozygous for the variant.) Let's check out the allele frequencies using GEMINI.

# how many variants with an allele count of 1
time gemini query -q 'select count(*) from variants where AC = 1 and type == "snp"' --header *.db

real    0m31.968s
user    0m14.856s
sys     0m17.076s

# how many SNVs are rare?
# rare is usually defined as MAF of < 1%
time gemini query -q 'select count(*) from variants where AF<0.01 and type =="snp"' --header *.db

real    1m10.993s
user    0m44.700s
sys     0m26.212s

# how many of the rare variants are missing from
# the 1000 Genomes Project?
# the paper reported ~24M novel SNVs; the count here
# differs because we are only counting AF<0.01 variants
time gemini query -q 'select count(*) from variants where AF<0.01 and type =="snp" and in_1kg == 0' --header *.db

real    1m11.759s
user    0m45.928s
sys     0m25.752s

It has been established that the predicted functional consequences of common variants (MAF > 5%) are typically less severe than those of rare variants. (The specific wording in the paper was "variants predicted to have the greatest phenotypic impact [LoF and missense variants, and variants mapping to conserved regions], were depleted at the common end of the derived allele spectrum.") This makes sense because if a variant has a negative effect on an individual, i.e. decreases an individual's fitness, it will eventually be weeded out.

GEMINI annotates variants by their predicted impact and also indicates whether a variant causes a loss of function (typically defined as stop-gained, splice-disrupting, and frameshift variants). Let's compare the severity of rare and common variants.

# all impact severity
time gemini query -q 'select impact_severity,count(*) from variants where type == "snp" group by impact_severity' --header *.db
impact_severity count(*)
HIGH    16377
LOW     42064948
MED     342863

real    1m39.200s
user    1m11.484s
sys     0m27.604s

# rare variants
time gemini query -q 'select impact_severity,count(*) from variants where type == "snp" and AF<0.01 group by impact_severity' --header *.db
impact_severity count(*)
HIGH    14876
LOW     33914231
MED     305248

real    1m44.235s
user    1m17.480s
sys     0m26.628s

# common variants
# 14876 + 1005 != 16377 because variants with
# 0.01 <= AF <= 0.05 fall in neither query
time gemini query -q 'select impact_severity,count(*) from variants where type == "snp" and AF>0.05 group by impact_severity' --header *.db

impact_severity count(*)
HIGH    1005
LOW     5852528
MED     24471

real    1m25.799s
user    0m52.060s
sys     0m33.640s

# percentage of SNVs that are HIGH impact
# rare: ~0.043%
bc -l<<<'14876*100/(14876+33914231+305248)'
# common: ~0.017%
bc -l<<<'1005*100/(1005+5852528+24471)'

# loss of function counts for rare variants
time gemini query -q 'select is_lof,count(*) from variants where type == "snp" and AF<0.01 group by is_lof' --header *.db
is_lof  count(*)
0       34222378
1       11977

real    1m28.447s
user    1m0.684s
sys     0m27.656s

# loss of function counts for common variants
time gemini query -q 'select is_lof,count(*) from variants where type == "snp" and AF>0.05 group by is_lof' --header *.db

is_lof  count(*)
0       5877488
1       516

real    1m14.944s
user    0m47.636s
sys     0m27.196s

# percentage of SNVs that are loss of function
# rare: ~0.035%
bc -l<<<'11977*100/(34222378+11977)'
# common: ~0.0088%
bc -l<<<'516*100/(5877488+516)'

Another analysis I have been interested in is finding genes that contain no variants at all, on the premise that these genes are so crucial to life that they cannot tolerate any mutations. This is probably a bit extreme, since I would imagine that most genes can tolerate synonymous mutations, though some synonymous mutations can affect translation rates. Nevertheless, here's what I did.

# how many genes contain variants?
time gemini query -q 'select count(gene) from variants group by gene' *.db | wc -l

real    0m9.017s
user    0m8.416s
sys     0m0.588s

# how many genes contain exonic variants
time gemini query -q 'select gene,count(*)  from variants where is_exonic == 1 group by gene' --header *.db | wc -l

real    0m4.364s
user    0m2.320s
sys     0m1.088s

# genes grouped by exonic and non-exonic variants
# for example the gene A1BG has 23 exonic variants
# and 122 non-exonic variants
time gemini query -q 'select gene,is_exonic,count(is_exonic) from variants group by gene,is_exonic' --header *.db | head
gene    is_exonic       count(is_exonic)
None    0       17389131
5S_rRNA 0       641
7SK     0       1500
A1BG    0       122
A1BG    1       23
A1BG-AS1        0       2
A1CF    0       1281
A1CF    1       131
A2M     0       753

# genes grouped by exonic, non-exonic, and impact of variants
time gemini query -q 'select gene,is_exonic,impact_severity,count(is_exonic) from variants where gene =="CCL2" group by gene,is_exonic,impact_severity' --header *.db 
gene    is_exonic       impact_severity count(is_exonic)
CCL2    0       LOW     122
CCL2    1       LOW     8
CCL2    1       MED     2

real    0m0.124s
user    0m0.100s
sys     0m0.020s

# write this to a file
time gemini query -q 'select gene,is_exonic,impact_severity,count(is_exonic) from variants group by gene,is_exonic,impact_severity' --header *.db | gzip > gene_exonic_impact.tsv.gz 

real    2m14.104s
user    1m46.760s
sys     0m27.160s

I have now produced a file that shows the number of variants within each gene, stratified by whether the variants lie in an exonic region and by their predicted impact.

# the FGFR2 gene has been implicated in Pfeiffer syndrome
zcat gene_exonic_impact.tsv.gz | grep FGFR2
FGFR2   0       HIGH    1
FGFR2   0       LOW     2184
FGFR2   0       MED     5
FGFR2   1       LOW     85
FGFR2   1       MED     14

To find genes that are almost devoid of exonic variants, I wrote a script that outputs genes with fewer than two exonic variants but more than 3,000 non-exonic variants; the latter condition serves as a proxy for gene length, since shorter genes will naturally contain fewer variants.

#!/usr/bin/env perl

use strict;
use warnings;

my $infile = 'gene_exonic_impact.tsv.gz';

=head1 INFILE

5S_rRNA 0       LOW     641
7SK     0       HIGH    1
7SK     0       LOW     1496
7SK     0       MED     3
A1BG    0       LOW     120
A1BG    0       MED     2
A1BG    1       LOW     10
A1BG    1       MED     13
A1BG-AS1        0       MED     2
A1CF    0       LOW     1281
A1CF    1       LOW     120
A1CF    1       MED     11
A2M     0       HIGH    2
A2M     0       LOW     745
A2M     0       MED     6
A2M     1       HIGH    1
A2M     1       LOW     45
A2M     1       MED     52
A2ML1   0       HIGH    1

=cut

my $previous_gene = 'blah';
my $exon = 0;
my $not_exon = 0;

open(IN, '-|', "gunzip -c $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   # skip the header line and variants not assigned to a gene
   next if /^gene|^None/;
   my ($gene, $is_exonic, $impact, $count) = split(/\t/);

   # on reaching a new gene, report the previous gene
   # if it passes the filtering criteria, then reset
   if ($gene ne $previous_gene){
      if ($not_exon > 3000 && $exon < 2){
         print join("\t", $previous_gene, $exon, $not_exon), "\n";
      }
      ($exon, $not_exon) = (0, 0);
      $previous_gene = $gene;
   }

   if ($is_exonic == 0){
      $not_exon += $count;
   } elsif ($is_exonic == 1){
      $exon += $count;
   } else {
      die "The column is_exonic is not 0 or 1\n";
   }
}
close(IN);

# don't forget the last gene in the file
if ($not_exon > 3000 && $exon < 2){
   print join("\t", $previous_gene, $exon, $not_exon), "\n";
}

exit(0);


Now to run the script.

# how many genes have more than 3,000 non-exonic variants
# and fewer than two exonic variants?
parse_gene_exonic_impact.pl | wc -l

# what are some of these genes
# first column gene name, second column number
# of exonic variants, and third column number
# of non-exonic variants
parse_gene_exonic_impact.pl | head
AC003051.1      0       5376
AC004158.2      0       3889
AC004538.3      0       4180
AC005062.2      0       4486
AC005307.3      0       3700
AC005592.2      0       4378
AC007091.1      0       5170
AC007131.2      0       4919
AC007246.3      0       3848
AC007319.1      0       5522

I looked through the list of genes by eye; most of them are non-coding RNAs (rRNAs, snoRNAs, and lincRNAs) or pseudogenes, which I guess were annotated as not having exons, perhaps due to the protein-centric definition of exons. I did manage to find two protein-coding genes:

SYN2    0       3080
SPON1   0       5178

The SYN2 gene encodes synapsin II and the SPON1 gene encodes spondin 1; by chance, both are implicated in neurological functions. Synapsins are neuronal phosphoproteins that associate with the cytoplasmic surface of synaptic vesicles, and SPON1 is a cell adhesion protein that promotes the attachment of spinal cord and sensory neuron cells and the outgrowth of neurites in vitro. I will need to find a way to filter out non-coding RNAs and perform some sort of gene ontology enrichment analysis on the remaining genes to get a better idea of what types of genes are less tolerant of mutations.


In this post, I demonstrated how one can explore the UK10K variants by using the GEMINI tool. All of the queries are simply SQL statements (which can be piped to various other tools) but they do provide a nice way of implementing variant filtration strategies. I often find myself exploring the GEMINI database schema to come up with new ways of finding interesting variants.

Throughout the post I have timed each command to show how fast a query can be performed; the range is from seconds to around two minutes, which isn't bad when there are ~46M variants, though the queries slow down as I add more conditions. The size of the database file is ~100G, since I have included CADD and GERP scores as well as some VCF INFO fields. It would have been nice to have the genotypes of each individual, though this would substantially increase the size of the database.
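Since a GEMINI database is a regular SQLite file, one idea for speeding up repeated filtering (a sketch I haven't benchmarked; the index name is my own) is to add an index on a frequently queried column using the sqlite3 command line tool. Here it is demonstrated on a toy database with the same column names:

```shell
# build a toy variants table, index the AF column, and
# run a rare-SNV count against it; the same CREATE INDEX
# statement would work on a real GEMINI database file
sqlite3 toy.db <<'SQL'
CREATE TABLE variants (type TEXT, AF REAL);
INSERT INTO variants VALUES ('snp', 0.005), ('snp', 0.2), ('indel', 0.0001);
CREATE INDEX idx_variants_af ON variants(AF);
SELECT count(*) FROM variants WHERE AF < 0.01 AND type = 'snp';
SQL
```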

My next post will be on identifying causative variants, which hopefully won't take me another two months to publish.


This work is licensed under a Creative Commons Attribution 4.0 International License.