The 1000 Genomes Project

The 1000 Genomes Project started as an endeavour to capture as much human genetic variation as possible. The results of the pilot phase are published in Nature. To sequence a person’s genome, many copies of the DNA are broken into short pieces; each piece is sequenced and mapped to a reference genome, and the mapped reads are stored in alignment files. Here’s some information I gathered from the 1000 Genomes Project page regarding the alignment files.

Data generated by the 1000 Genomes Project are available from the project's FTP site. All alignment data are in BAM format, and the alignments are found under data/XXXXXXXX/alignment, where XXXXXXXX is the sample name. The spreadsheet available at http://www.1000genomes.org/about#ProjectSamples provides information on the samples.
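As a rough sketch of that directory layout, the snippet below builds the alignment directory URL for a given sample. The FTP root is an assumption, taken from the location of the README referenced later in this post, and `alignment_dir_url` is just a hypothetical helper name; check the FTP site itself for the current layout.

```python
# Assumed FTP root, inferred from the README URL quoted in this post.
FTP_ROOT = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp"


def alignment_dir_url(sample: str, root: str = FTP_ROOT) -> str:
    """Return the URL of the alignment directory for one sample.

    Follows the data/<sample>/alignment pattern described above.
    """
    if not sample or "/" in sample:
        raise ValueError(f"not a valid sample name: {sample!r}")
    return f"{root.rstrip('/')}/data/{sample}/alignment"


if __name__ == "__main__":
    print(alignment_dir_url("NA20502"))
```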

The BAM filenames themselves contain a lot of information, e.g. NA12878.chrom1.LS454.ssaha.CEU.high_coverage.20091216.bam, where the dot-separated parts are, in order, “Sample name”, “Chromosome”, “Sequencing platform”, “Mapping algorithm”, “Population”, “Analysis group” and the date in the format YYYYMMDD. Sequence reads were aligned to the GRCh37 (hg19) build of the human reference. For more information on the alignment files, please refer to ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.alignment_data.
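Since the filename fields are positional, they are easy to pull apart in code. Here is a minimal Python sketch that splits a filename into the seven fields listed above; the `AlignmentName` type and `parse_bam_name` function are names I made up for illustration, and the sketch assumes every filename follows exactly this pattern, which may not hold for all files on the FTP site.

```python
from datetime import date, datetime
from typing import NamedTuple


class AlignmentName(NamedTuple):
    """Fields encoded in an alignment filename (hypothetical container)."""
    sample: str
    chromosome: str
    platform: str
    mapper: str
    population: str
    analysis_group: str
    date: date


def parse_bam_name(filename: str) -> AlignmentName:
    """Split a dot-separated BAM filename into its seven fields."""
    parts = filename.split(".")
    # Seven fields plus the "bam" extension.
    if len(parts) != 8 or parts[-1] != "bam":
        raise ValueError(f"unexpected filename layout: {filename!r}")
    sample, chrom, platform, mapper, population, group, stamp, _ = parts
    # The date field is documented as YYYYMMDD.
    parsed_date = datetime.strptime(stamp, "%Y%m%d").date()
    return AlignmentName(sample, chrom, platform, mapper,
                         population, group, parsed_date)


if __name__ == "__main__":
    info = parse_bam_name(
        "NA12878.chrom1.LS454.ssaha.CEU.high_coverage.20091216.bam")
    print(info.sample, info.platform, info.date.isoformat())
```

Returning a named tuple keeps the fields readable at the call site, e.g. filtering a directory listing for `info.platform == "ILLUMINA"` (the exact platform strings used on the FTP site are not something I've checked).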

At the moment I’m only interested in the Illumina data, which were all aligned by BWA. The BWA parameters used for alignment:

bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file

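The -q parameter sets the quality threshold for read trimming: BWA trims low-quality bases from the 3′ end of each read, cutting at the position where the accumulated “badness” (the threshold minus the base quality, summed from the 3′ end) is greatest. For more information see these SEQanswers threads and the links within: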

http://seqanswers.com/forums/showthread.php?t=5628
http://seqanswers.com/forums/showthread.php?t=6251
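To make the idea of accumulated “badness” concrete, here is a small Python sketch of the trimming rule as I understand it from the BWA manual: walk in from the 3′ end, keep a running sum of (threshold minus base quality), and cut where that sum peaks. This is my own illustration, not BWA's actual code. The function name is hypothetical, base qualities are assumed to be given as Phred integers, and the minimum kept length of 35 is an assumption about BWA's behaviour that I have not verified.

```python
from typing import Sequence


def trimmed_length(quals: Sequence[int], threshold: int = 15,
                   min_len: int = 35) -> int:
    """Return how many bases to keep after 3'-end quality trimming.

    quals     -- Phred base qualities, 5' to 3'
    threshold -- the value passed to -q (15 in the command above)
    min_len   -- assumed lower bound on the trimmed read length
    """
    running = 0          # accumulated badness from the 3' end
    best = 0             # largest accumulated badness seen so far
    keep = len(quals)    # default: no trimming
    for pos in range(len(quals) - 1, min_len - 1, -1):
        running += threshold - quals[pos]
        if running < 0:
            # Good bases now outweigh the bad ones; stop scanning.
            break
        if running > best:
            best = running
            keep = pos   # cut the read just before this base
    return keep


if __name__ == "__main__":
    # 40 good bases followed by 5 poor ones: the poor tail is removed.
    print(trimmed_length([30] * 40 + [2] * 5))
    # A read whose last base is above the threshold is left alone.
    print(trimmed_length([30] * 45))
```

Note that a read is only trimmed when its final bases fall below the threshold; a single good base at the very end makes the running sum negative straight away, so nothing is cut.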

As an example, say I’m interested in the NA20502 sample, which comes from a female from Tuscany, Italy. I can download the alignment file for chromosome 20 here. For possibly faster downloads, have a look at the Aspera Connect software (instructions are available at the 1000 Genomes Project data page).

Using IGV, here’s a snapshot of a region on chromosome 20, where by eye there seem to be a couple of SNPs (and some sequencing errors).

Next thing on the agenda is to look at SNPs and INDELs using SAMtools and BCFtools (and to find my own storage for all the alignment files).

This work is licensed under a Creative Commons Attribution 4.0 International License.