Getting started with analysing DNA sequencing data

I have recently entered new territory and started working on analysing DNA sequencing data (as opposed to analysing RNA sequencing, i.e. transcriptomics). While it is still the analysis of lots of sequencing reads, one of the goals is to identify disease causing mutations; this is in contrast to identifying differentially expressed genes or inferring gene networks. The new playground includes an entirely new repertoire of bioinformatic tools, file formats, and genetics that I was unaware of. While it is always fun to learn about new things, time is always a constraint. In this post, I write about some of the tools and file formats I learned about, and hopefully they will be useful for those who are getting started with analysing DNA sequencing data.

One of the tools I first got familiar with was the Genome Analysis Toolkit (GATK), which seems like one of the more popular toolkits for determining genetic variants in sequencing data. I spent a lot of time reading various GATK guides. Luckily there is a best practices guide that one can follow to get started with the GATK. The guides are very well written and I found the various GATK presentations to be equally informative. I used Bpipe to build a pipeline with the GATK and I would highly recommend using Bpipe, not just with the GATK, but with various bioinformatic pipelines.

At the end of the GATK pipeline, a VCF file is produced that contains all of the inferred variants. The VCF file produced by the GATK contains A LOT of information. I spent some time going through each annotation, which are mostly well documented (annoyingly some aren’t documented nearly as well as the others). In addition, I created a GitHub repository to help me learn a bit more about VCF files; I will be continually updating the repo as I learn more about VCF files and aim to include more “recipes” for performing various tasks with VCF files.

Reading VCF files into R

I use R almost exclusively to produce all my plots, so naturally I wanted to find a way to load VCF files into R (without writing my own terrible parser). I came across the VariantAnnotation package, which as the package name suggests, is a tool for annotating variants, and comes with functions that process VCF files into R. Below is a simple demonstration of producing a histogram of alternate allele frequencies from the sample file provided by the package.

#install package
source("http://bioconductor.org/biocLite.R")
biocLite("VariantAnnotation")

#load library
library(VariantAnnotation)

#location of example file
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
fl
[1] "/Users/dtang/Library/R/3.2/library/VariantAnnotation/extdata/chr22.vcf.gz"

#read into vcf object
vcf <- readVcf(fl, "hg19")
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 22 ‘info’ fields: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, CIPOS, END, HOMLEN, HOMSEQ, SVLEN, SVTYPE, AC, AN, AA, AF, AMR_AF, ASN_AF, AFR_AF, EUR_AF, VT, SNPSOURCE 
found header lines for 3 ‘geno’ fields: GT, DS, GL

This VCF file contains a small subset of variants from the 1000 Genomes project. What does this VCF file look like?

gunzip -c /Users/dtang/Library/R/3.2/library/VariantAnnotation/extdata/chr22.vcf.gz | grep -v "^#" | head -2
22      50300078        rs7410291       A       G       100     PASS    AN=2184;AVGPOST=0.9890;THETA=0.0005;VT=SNP;RSQ=0.9856;ERATE=0.0020;SNPSOURCE=LOWCOV;AA=N;AC=751;LDAF=0.3431;AF=0.34;ASN_AF=0.19;AMR_AF=0.20;AFR_AF=0.83;EUR_AF=0.22  GT:DS:GL        0|0:0.000:-0.00,-2.77,-5.00     0|0:0.000:-0.12,-0.62,-2.40     1|0:1.000:-0.48,-0.48,-0.48     0|0:0.000:-0.17,-0.50,-2.79  0|0:0.000:-0.03,-1.13,-5.00
22      50300086        rs147922003     C       T       100     PASS    ERATE=0.0005;AVGPOST=0.9963;AN=2184;AC=20;VT=SNP;THETA=0.0011;RSQ=0.8398;AA=c;SNPSOURCE=LOWCOV;LDAF=0.0091;AF=0.01;AMR_AF=0.01;AFR_AF=0.02;EUR_AF=0.01       GT:DS:GL        0|0:0.000:-0.00,-2.75,-5.00     0|0:0.000:-0.12,-0.62,-2.41     0|0:0.000:-0.48,-0.48,-0.48     0|0:0.000:-0.10,-0.70,-4.70  0|0:0.000:-0.03,-1.16,-5.00

Have a look at this tutorial, if you’re unfamiliar with the VCF. Using the info() function, we can obtain all the values for various INFO key/value pairs in the VCF file. We will use this function to produce a simple histogram of the alternate allele frequency (alternate with respect to the reference genome).

AF <- info(vcf)$AF
class(AF)
[1] "numeric"
summary(AF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00090 0.00460 0.07347 0.05000 1.00000
hist(AF, breaks=50, xlab='Alternate Allele Frequency')

hist_af

Data analysis

One of the most popular and useful data analysis method is the Principal Component Analysis (PCA). The SNPRelate package has functions to perform a PCA on variant data, as well as providing functions for calculating the fixation index, and for performing a Identity By State (IBS) and Identity By Descent analysis. It also has a brilliantly written tutorial. To install the package, just perform the usual Bioconductor steps:

#install SNPRelate
source("http://bioconductor.org/biocLite.R")
biocLite("SNPRelate")

Processing a VCF file using SNPRelate

The SNPRelate package uses the Genomic Data Structure (GDS) format, so we need to convert a VCF file into a GDS file before we can analyse it. As an example, I downloaded a VCF file (ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz) from the 1000 Genomes project, which is located at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/. The file is around 300M in size.

#load the package
library(SNPRelate)
Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)

my_vcf <- "ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz"

#convert VCF to GDS
snpgdsVCF2GDS(my_vcf, "ALL.chr22.phase1.projectConsensus.genotypes.gds", method="biallelic.only")
VCF Format --> SNP GDS Format
Method: exacting biallelic SNPs
Number of samples: 1094
Parsing "ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz" ...
        import 494975 variants.
+ genotype   { Bit2 1094x494975 } *
Optimize the access efficiency ...
Clean up the fragments of GDS file:
        open the file "ALL.chr22.phase1.projectConsensus.genotypes.gds" (size: 138172578).
        # of fragments in total: 143.
        save it to "ALL.chr22.phase1.projectConsensus.genotypes.gds.tmp".
        rename "ALL.chr22.phase1.projectConsensus.genotypes.gds.tmp" (size: 138171102).
        # of fragments in total: 20.

# Open the GDS file
(genofile <- snpgdsOpen('ALL.chr22.phase1.projectConsensus.genotypes.gds'))
File: /GROUP/ANALYSIS/Tang/muse/snprelate/ALL.chr22.phase1.projectConsensus.genotypes.gds
+    [  ] *
|--+ sample.id   { VStr8 1094 ZIP_RA(26.22%) }
|--+ snp.id   { Int32 494975 ZIP_RA(34.58%) }
|--+ snp.rs.id   { VStr8 494975 ZIP_RA(35.07%) }
|--+ snp.position   { Int32 494975 ZIP_RA(45.92%) }
|--+ snp.chromosome   { VStr8 494975 ZIP_RA(0.10%) }
|--+ snp.allele   { VStr8 494975 ZIP_RA(14.29%) }
|--+ genotype   { Bit2 1094x494975 } *
|--+ snp.annot   [  ]
|  |--+ qual   { Float32 494975 ZIP_RA(0.10%) }
|  |--+ filter   { VStr8 494975 ZIP_RA(0.15%) }

#tally the genotypes
table(read.gdsn( index.gdsn(genofile, "genotype")))

        0         1         2 
 19802443  38544219 483155988

19802443 + 38544219 + 483155988
[1] 541502650

494975 * 1094
[1] 541502650

Data analysis using SNPRelate

I will use the sample file (hapmap_geno.gds) provided by the SNPRelate package to perform a simple analysis.

#load the package
library(SNPRelate)
Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)

#summary of the sample file
snpgdsSummary(snpgdsExampleFileName())
The file name: /Users/dtang/Library/R/3.2/library/SNPRelate/extdata/hapmap_geno.gds 
The total number of samples: 279 
The total number of SNPs: 9088 
SNP genotypes are stored in SNP-major mode (Sample X SNP).

# Open the sample GDS file
(genofile <- snpgdsOpen(snpgdsExampleFileName()))
File: /Users/dtang/Library/R/3.2/library/SNPRelate/extdata/hapmap_geno.gds
+    [  ] *
|--+ sample.id   { FStr8 279 ZIP(23.10%) }
|--+ snp.id   { Int32 9088 ZIP(34.76%) }
|--+ snp.rs.id   { FStr8 9088 ZIP(42.66%) }
|--+ snp.position   { Int32 9088 ZIP(94.73%) }
|--+ snp.chromosome   { UInt8 9088 ZIP(0.94%) } *
|--+ snp.allele   { FStr8 9088 ZIP(14.45%) }
|--+ genotype   { Bit2 279x9088 } *
|--+ sample.annot   [ data.frame ] *
|  |--+ sample.id   { FStr8 279 ZIP(23.10%) }
|  |--+ family.id   { FStr8 279 ZIP(28.37%) }
|  |--+ father.id   { FStr8 279 ZIP(12.98%) }
|  |--+ mother.id   { FStr8 279 ZIP(12.86%) }
|  |--+ sex   { FStr8 279 ZIP(28.32%) }
|  |--+ pop.group   { FStr8 279 ZIP(7.89%) }

What exactly is the GDF format? From the gdsfmt vignette:

The GDS file can contain a hierarchical structure to store multiple GDS variables (or GDS nodes) in the file, and various data types are allowed (see the document of add.gdsn()) including integer, floating-point number and character.

Let’s take a look inside the genofile object:

# use the functions read.gdsn() and index.gdsn()
# to access the data in genofile and store these
# into a data.frame for viewing purposes
df <- data.frame(
                 snp_chromosome = read.gdsn(index.gdsn(genofile, "snp.chromosome")),
                 snp_position = read.gdsn(index.gdsn(genofile, "snp.position")),
                 snp_rs_id = read.gdsn(index.gdsn(genofile, "snp.rs.id")),
                 snp_allele = read.gdsn(index.gdsn(genofile, "snp.allele"))
                 )

head(df)
  snp_chromosome snp_position  snp_rs_id snp_allele
1              1      1355433  rs1695824        G/T
2              1      3114015 rs13328662        C/T
3              1      4124418  rs4654497        A/G
4              1      4221327 rs10915489        A/G
5              1      4271809 rs12132314        C/T
6              1      4358909 rs12042555        C/T

dim(read.gdsn( index.gdsn(genofile, "sample.annot") ))
[1] 279   6

head(read.gdsn( index.gdsn(genofile, "sample.annot") ))
  sample.id family.id father.id mother.id sex pop.group
1   NA19152        72                       F       YRI
2   NA19139        43 200171931 200061729   M       YRI
3   NA18912        28                       F       YRI
4   NA19160        56                       M       YRI
5   NA07034      1341                       M       CEU
6   NA07055      1341                       F       CEU

table(read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group")))

CEU HCB JPT YRI 
 92  47  47  93

dim(read.gdsn( index.gdsn(genofile, "genotype") ))
[1]  279 9088

read.gdsn( index.gdsn(genofile, "genotype"))[1:5,1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    1    0    1    2
[2,]    1    1    0    1    2
[3,]    2    1    1    2    2
[4,]    2    1    1    2    2
[5,]    0    0    0    2    2

table(read.gdsn(index.gdsn(genofile, "genotype")))

     0      1      2      3 
920248 657267 947899  10138

The SNP genotypic matrix is stored in a SNP-major mode, n_{sample} \times n_{snp} , in contrast to an individual-major mode, n_{snp} \times n_{sample} . From the table above we can see that there are four possible values stored for the genotypes: 0, 1, 2, and 3. For bi-allelic SNP sites, “0” indicates two B alleles, “1” indicates one A allele and one B allele, “2” indicates two A alleles, and “3” is a missing genotype. For multi-allelic sites, it is a count of the reference allele (3 meaning no call). The format of snp.allele is “A allele/B allele”, like “T/G” where T is A allele and G is B allele.

Performing an Identity By State analysis

The Identity By State (IBS) analysis is a straightforward measure of genetic similarity; it simply tallies the number of alleles that are shared between two individuals across a set of given loci. When comparing bi-allelic alleles, there are three possible scenarios at each site: no shared alleles, one shared allele, and two shared alleles; IBS assigns a score of 0, 1, and 2 for the three scenarios, respectively. To obtain a similarity score between two individuals, we just tally up the number of shared alleles and divide this by the maximum possible score, i.e. two times the total number of loci.

#perform the IBS analysis
ibs <- snpgdsIBS(genofile, num.thread=2)
Identity-By-State (IBS) analysis on SNP genotypes:
Excluding 365 SNPs on non-autosomes
Excluding 1 SNP (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
Working space: 279 samples, 8722 SNPs
	Using 2 (CPU) cores
IBS:	the sum of all working genotypes (0, 1 and 2) = 2446510
IBS:	Mon Jul 20 12:56:45 2015	0%
IBS:	Mon Jul 20 12:56:45 2015	100%

length(ibs$ibs)
[1] 77841

#the ibs() function actually calculates all pairwise
#instead of calculating just the upper or lower matrix
choose(279, 2)
[1] 38781

38781*2 + 279
[1] 77841

summary(as.vector(ibs$ibs))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6988  0.7150  0.7508  0.7472  0.7701  1.0000

We can perform hierarchical clustering on the IBS pairwise scores:

#for colour dendrograms
install.packages("sparcl")

#load package
library(sparcl)

#convert to distance
hc <- hclust(as.dist(1 - ibs$ibs))

#obtain the different populations
pop <- read.gdsn(index.gdsn(genofile, path="sample.annot/pop.group"))
pop <- unclass(factor(pop))

#define some colours
my_colour <- rainbow(4)

#plot the dendrogram
ColorDendrogram(hc,
                y=my_colour[pop],
                branchlength=0.06)

#add a legend
legend(x = 260, y = 0.05, fill = my_colour, legend = levels(pop), cex = 0.5)

hclust_ibsBy tallying the number of shared alleles, we form almost homogenous clusters that contain individuals from one ethnicity. YRI = Yoruba in Ibadan, Nigeria, JPT = Japanese in Tokyo, CHB = Han Chinese in Beijing, and CEU = Utah residents with ancestry from northern and western Europe.

PLINK file formats

One day I was trying to use a tool called The Exomiser (true story) and I couldn’t get it to work because it requires that a PED file must be be provided for multi-sample VCF files. I later learned that the PED format was defined by another popular toolkit called PLINK. To learn what PED files are, I had a look at some files in various PLINK formats that are provided with SNPRelate.

system.file("extdata", "plinkhapmap.bed.gz", package="SNPRelate")
[1] "/Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.bed.gz"
system.file("extdata", "plinkhapmap.fam.gz", package="SNPRelate")
[1] "/Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.fam.gz"
system.file("extdata", "plinkhapmap.bim.gz", package="SNPRelate")
[1] "/Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.bim.gz"

As you may have diligently spotted, there aren’t any PED files above. The BED file, plinkhapmap.bed.gz, is (confusingly for me, not the UCSC Genome Browser defined BED file) actually a binary PED file (I learned this after I tried to open the file). Here’s how it looks:

gunzip -c /Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.bed.gz | xxd -b | head -5
00000000: 01101100 00011011 00000001 10000010 11101011 10001011  l.....
00000006: 00101010 10111111 11101110 10000011 11101111 11101010  *.....
0000000c: 00100010 11111010 00101010 10001110 11101111 00100010  ".*.."
00000012: 11111111 10101110 10001010 10101111 00001010 10100000  ......
00000018: 11101011 10001010 10101100 10111111 10001010 11101000  ......

If you are interested in what those binary numbers represent, have a look at the description at the PLINK page. The other two files, plinkhapmap.fam.gz and plinkhapmap.bim.gz, are not binary files and can be viewed directly:

gunzip -c /Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.fam.gz | head -5
0 NA19152 0 0 0 -9
0 NA19139 0 0 0 -9
0 NA18912 0 0 0 -9
0 NA19160 0 0 0 -9
0 NA07034 0 0 0 -9

The six columns of the FAM file are:

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype (‘1’ = control, ‘2’ = case, ‘-9’/’0’/non-numeric = missing data if case/control)

And here’s the BIM file:

gunzip -c /Users/dtang/Library/R/3.2/library/SNPRelate/extdata/plinkhapmap.bim.gz | head -5
1       9       0       2444790 G       C
1       18      0       3314897 C       T
1       20      0       3644455 C       T
1       32      0       4221327 A       G
1       48      0       4541602 A       G

The BIM format contains extended variant information and the columns are:

Chromosome code (either an integer, or ‘X’/’Y’/’XY’/’MT’; ‘0’ indicates unknown) or name
Variant identifier
Position in morgans or centimorgans (safe to use dummy value of ‘0’)
Base-pair coordinate (normally 1-based, but 0 ok; limited to 231-2)
Allele 1 (corresponding to clear bits in .bed; usually minor)
Allele 2 (corresponding to set bits in .bed; usually major)

To process PLINK files using SNPRelate, you would first have to convert them into a GDS file, and then load the GDS file.

Creating a PED file from a VCF file

In order to use The Exomiser with my VCF file, I need a PED file. SNPRelate also provides a function called snpgdsGDS2PED that performs the conversion from VCF to PED:

# using the VCF file I downloaded from the 1000 Genomes project
# which I converted into a GDS file
(genofile <- snpgdsOpen('ALL.chr22.phase1.projectConsensus.genotypes.gds'))

#perform the conversion
snpgdsGDS2PED(genofile, 'ALL.chr22.phase1.projectConsensus.genotypes')
Converting from GDS to PLINK PED:
        Output a MAP file DONE.
        Output a PED file ...
                Output:         Wed Jul 22 10:05:27 2015        0%
                Output:         Wed Jul 22 10:05:57 2015        37%
                Output:         Wed Jul 22 10:06:27 2015        75%
                Output:         Wed Jul 22 10:06:47 2015        100%

Two files are created in the conversion; the MAP files contain 3-4 columns, which are:

Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
Variant identifier
Position in morgans or centimorgans (optional; also safe to use dummy value of ‘0’)
Base-pair coordinate

Here’s how it looks:

cat ALL.chr22.phase1.projectConsensus.genotypes.map | wc -l
494975

cat ALL.chr22.phase1.projectConsensus.genotypes.map | head -5
22              0       16050036
22              0       16050053
22              0       16050069
22              0       16050115
22              0       16050116

The PED format is similar to a FAM file (see above), where the first six fields are the same. The remaining columns contain the genotype information and each row is for one individual. Here’s how it looks:

cat ALL.chr22.phase1.projectConsensus.genotypes.ped | wc -l
1094

cat ALL.chr22.phase1.projectConsensus.genotypes.ped | cut -f1-8 | head -5
0       HG00096 0       0       0       -9      C C     C C
0       HG00097 0       0       0       -9      A C     C C
0       HG00099 0       0       0       -9      A C     C C
0       HG00100 0       0       0       -9      C C     C C
0       HG00101 0       0       0       -9      A C     C C

Finally if you want to convert a PED file into BED, FAM, and BIM files use PLINK.

Summary

Well, that was a fairly wordy post. It is also a rather patchy post, which is reflective of my current knowhow in this area. The point of the post was to outline some of the tools that I found useful while I was getting started with analysing DNA sequencing data, in the hopes that it will be useful to others getting started in this area. I have also provided various links through the posts, so please check those out when in doubt.

Finally, here are a handful of papers to be read:

And if you made it this far…

This is my 200th post! I guess it’s safe to say that I enjoy blogging. As I have mentioned in this post, I am now working in the area of DNA sequencing, so I foresee more posts in that area. The traffic on this blog has actually started to dwindle, so either I’m becoming less interesting or hopefully people have become more adept in bioinformatics and no longer need to visit. Either way, I’ll keep the posts coming.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. Congrats with the 200th post! As always, it is one of many very practical and helpful posts. Enjoying using them as short tutorials and always discover some new elegant solutions. The references to the original papers are well selected!

  2. Congrats with the 200th post!
    GATK is one of the most effective variant callers, though it is mem costive.
    During the data analysis, I found that the DP or AD in the final VCF file is something far from the sam or bam file in the first mapping stage. As posted in GATK forum, it might be due to the HC caller in GATK, removing bad reads or others with bad mapping quality. However, this might results in strong bias for “DP” or “AD” against using IGV.
    Is there any good suggestion?

    1. Hi Junfeng,

      I’m not sure I understand what you mean by a strong bias against using IGV. Either way I think you’re better off posting your questions to the GATK forum; I’m very much a beginner.

      Cheers,

      Dave

    1. Hi Bianca,

      I’m doing well and I hope you’re well! Glad you’re still reading my blog 🙂

      I haven’t tried GEMINI; it’ll definitely be a future post 😉

      Cheers,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.