Finding junctions with TopHat

For setting up TopHat, see my previous post. Here, I wanted to test whether TopHat can find junctions with single-end 27 bp reads. The reference sequence I used was the test_ref.fa provided by the TopHat authors (see my previous post for the link), where the A’s mark the intron regions:

>test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT…


Annotating RNA-Seq data

After mapping the reads from an RNA-Seq experiment, the next task is usually to identify the transcripts that the reads came from (i.e. annotating the RNA-Seq data), and there are many ways of doing so. Here I describe a rather crude method whereby I download sequence coordinates of hg19 RefSeqs as a BED12 file from the…


Encyclopedia of DNA elements (ENCODE)

ENCODE stands for “Encyclopedia of DNA Elements”, a project that aims to discover and define the functional elements encoded in the human genome using multiple technologies. In this context, the term “functional element” denotes a discrete region of the genome that encodes a defined product (e.g., protein) or a…


The 1000 Genomes Project

The 1000 Genomes Project started as an endeavour to capture as much human genetic variation as possible. The results of the pilot phase are published in Nature. To sequence a person’s genome, many copies of the DNA are broken into short pieces, and each piece is sequenced, mapped, and stored in alignment files….


intersectBed

Updated 2014 June 25th. The tool intersectBed is part of the BEDTools suite and reports the intersection between two BED files. For example, given two BED files, you may be interested in finding the entries that overlap. To install the latest version of BEDTools, download the source code from GitHub and compile:

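To make the idea of overlapping BED entries concrete, here is a minimal Python sketch. It is not how intersectBed is implemented; it is a naive all-against-all comparison, and the names BedEntry, parse_bed and overlapping_pairs are made up for this illustration. It assumes BED coordinates are 0-based and half-open, so two entries that merely touch (one ends where the other starts) do not count as overlapping.

```python
from typing import List, NamedTuple, Tuple


class BedEntry(NamedTuple):
    """One BED line (hypothetical helper type for this sketch)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def parse_bed(text: str) -> List[BedEntry]:
    """Parse BED-formatted text, keeping only the first four columns."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else "."
        entries.append(BedEntry(fields[0], int(fields[1]), int(fields[2]), name))
    return entries


def overlapping_pairs(
    a: List[BedEntry], b: List[BedEntry]
) -> List[Tuple[BedEntry, BedEntry]]:
    """Return every (a_entry, b_entry) pair sharing at least one base.

    Naive O(len(a) * len(b)) comparison; fine for toy inputs only.
    """
    pairs = []
    for x in a:
        for y in b:
            # Half-open intervals overlap when each starts before the other ends.
            if x.chrom == y.chrom and x.start < y.end and y.start < x.end:
                pairs.append((x, y))
    return pairs
```

For real data sets, intersectBed itself is the better choice; the sketch is only meant to show what "overlap" means for two BED entries.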

Sequence conservation in vertebrates

The UCSC Genome Browser provides multiple alignments of 46 vertebrate species and makes them available for download. The multiple alignments show regions of sequence conservation among vertebrates. For more information see http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=cons46way. The multiple alignments are stored in Multiple Alignment Format (MAF) files, and there are Perl and Python packages that parse them. The MAF format is…
