Using bins when comparing genomic features

Comparing two files containing genomic features is a common task e.g. finding out whether the coordinates of your tags intersect with genes. Of course you could use intersectBed (as part of the BEDTools suite) for this purpose but here's how to do it anyway using Perl. NOTE: I hard code the length of my tags in this script as 27 bp.

The key to binning is this line of code:

my $which_bin = int($start / $bin); #where $bin = 100,000

Any genomic feature in the bed file that is less than 100,000 will be in bin 0. Anything in 100,000 to 199,999 is in bin 1 and so on. These bins are then used as hash keys.

This assumes a non-redundant bed file but can be changed by removing the last CHECK line of code.


#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.bam> <infile.bed>\n";
my $infile = shift or die $usage;
my $bed = shift or die $usage;

my %bed = ();
my $counter = '0';
my $bin = 100_000;

warn "Processing $bed\n";

open(IN,'<',$bed) || die "Could not open $bed: $!\n";
while(<IN>){
   chomp;
   ++$counter;
   #chr1    134938   134939   name  0       -
   my ($chr,$start,$end,$name,$score,$strand) = split(/\t/);
   my $which_bin = int($start / $bin);
   $bed{$chr}{$strand}{$which_bin}{$counter}->{'START'} = $start;
   $bed{$chr}{$strand}{$which_bin}{$counter}->{'END'} = $end;
}
close(IN);

warn "Finished storing $bed\n";

my $bam_entry = '0';
open(IN,'-|',"samtools view $infile") || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($qname,$flag,$rname,$pos,$mapq,$cigar,$mrnm,$mpos,$isize,$seq,$qual,$tag,$vtype,$value) = split;
   next if $flag == '4';
   my $strand = '+';
   my $start = $pos;
   if ($flag == '16'){
      $strand = '-';
      $start = $pos + '27';
   }
   ++$bam_entry;
   my $which_bin = int($start/$bin);
   CHECK: foreach my $entry (sort {$a <=> $b} keys %{$bed{$rname}{$strand}{$which_bin}}){
      my $s_start = $bed{$rname}{$strand}{$which_bin}{$entry}->{'START'};
      my $s_end = $bed{$rname}{$strand}{$which_bin}{$entry}->{'END'};
      #print "$entry\t$rname\t$s_start\t$s_end\n";
      if ($start >= $s_start && $start <= $s_end){
         print "$_\n";
         last CHECK;
      }
   }
   warn "Processed $bam_entry bam entries\n";
}
close(IN);

exit(0);

Analysis of the complement and molecular evolution of tRNA genes in cow

Step by step analysis of identifying the bovine tRNA repertoire (using the most recent version of the bovine genome) based on the procedure published in this article. This was my first first-authored paper (published as part of the bovine genome companion papers), and in retrospect there could have been many improvements, however the paper was done on what I knew back then.

1. Download bosTau5.fa.gz from http://hgdownload-test.cse.ucsc.edu/goldenPath/bosTau5/bigZips/ and gunzip

2. Download and install tRNAscan-SE from http://lowelab.ucsc.edu/software/tRNAscan-SE.tar.gz

Continue reading

Bidirectional genes

Download 5' UTR for all RefSeq genes using the UCSC Table Browser.

Separate features according to strand

cat hg19_refgene_five_utr_110914.bed | perl -nle '@a = split; print if $a[5] eq "+";' > hg19_refgene_five_utr_110914_plus.bed
cat hg19_refgene_five_utr_110914.bed | perl -nle '@a = split; print if $a[5] eq "-";' > hg19_refgene_five_utr_110914_neg.bed

Use intersectBed to find overlapping features

#Force strandedness as a test, should have no output
intersectBed -s -a hg19_refgene_five_utr_110914_neg.bed -b hg19_refgene_five_utr_110914_plus.bed
intersectBed -wo -a hg19_refgene_five_utr_110914_neg.bed -b hg19_refgene_five_utr_110914_plus.bed > overlap
cat overlap | perl -nle '@a = split; $t{$a[3]} = '1'; $t{$a[9]} = '1'; END {print join("\n",keys %t)};' | cut -f1,2 -d'_' > unique

Performing a GO enrichment analysis on the unique list of bidirectional genes and using all the genes as the universe list:

R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library("GO.db")
> library("GOstats")
> entrezUniverse=scan("universe2")
> selectedEntrezIds=scan("entrez")
> hgCutoff = 0.001
> #Biological Process
> params = new("GOHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,ontology="BP",pvalueCutoff=hgCutoff,conditional=TRUE,testDirection="
over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
       GOBPID       Pvalue OddsRatio    ExpCount Count Size
1  GO:0034641 2.845531e-06  1.675853 116.7780905   157 5140
2  GO:0090304 4.877145e-06  1.688666  90.8551719   128 3999
3  GO:0044260 5.050034e-05  1.551424 137.7708834   173 6064
4  GO:0022403 1.093607e-04  2.174322  16.4261789    33  723
5  GO:0016568 2.753699e-04  2.582012   8.3153271    20  366
6  GO:0006996 2.867515e-04  1.923650  22.5838527    40 1037
7  GO:0006281 3.059896e-04  2.631828   7.7473402    19  341
8  GO:0071841 3.514806e-04  1.571534  61.7969662    87 2720
9  GO:0044238 3.720664e-04  1.481915 184.6411559   215 8127
10 GO:0000075 3.912858e-04  3.099874   4.8619672    14  214
11 GO:0051329 4.542075e-04  2.618305   7.3611092    18  324
12 GO:0034622 4.721711e-04  2.353943   9.9965681    22  440
13 GO:0044248 4.733327e-04  1.759037  30.1941794    49 1329
14 GO:0006399 4.960020e-04  3.894081   2.7944952    10  123
15 GO:0006839 4.964491e-04  4.805971   1.8402773     8   81
16 GO:0010564 5.842741e-04  2.558462   7.5201455    18  331
17 GO:0007049 5.963847e-04  1.790438  26.5136248    44 1167
18 GO:0000387 6.138504e-04  8.383672   0.7043037     5   31
19 GO:0006368 7.234803e-04  5.192143   1.4994852     7   66
20 GO:0051276 7.455445e-04  2.080239  13.8588784    27  610
21 GO:0006974 8.086037e-04  3.376597   3.5085746    11  160
22 GO:0050434 8.996583e-04  5.955524   1.1359736     6   50
23 GO:0051028 9.584968e-04  3.873584   2.5218615     9  111
24 GO:0010467 9.885920e-04  1.455638  89.2420894   115 3928
                                                              Term
1                     cellular nitrogen compound metabolic process
2                                   nucleic acid metabolic process
3                         cellular macromolecule metabolic process
4                                                 cell cycle phase
5                                           chromatin modification
6                                           organelle organization
7                                                       DNA repair
8  cellular component organization or biogenesis at cellular level
9                                        primary metabolic process
10                                           cell cycle checkpoint
11                                interphase of mitotic cell cycle
12                        cellular macromolecular complex assembly
13                                      cellular catabolic process
14                                          tRNA metabolic process
15                                         mitochondrial transport
16                                regulation of cell cycle process
17                                                      cell cycle
18                                     spliceosomal snRNP assembly
19        transcription elongation from RNA polymerase II promoter
20                                         chromosome organization
21                                 response to DNA damage stimulus
22                      positive regulation of viral transcription
23                                                  mRNA transport
24                                                 gene expression
> #Molecular Function
> params = new("GOHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,ontology="MF",pvalueCutoff=hgCutoff,conditional=TRUE,testDirection="over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
      GOMFID       Pvalue OddsRatio    ExpCount Count Size
1 GO:0003676 0.0001312477  1.664829 53.28987613    79 2437
2 GO:0016206 0.0005217889       Inf  0.04574949     2    2
3 GO:0003723 0.0008795796  1.909603 18.43704529    33  806
4 GO:0050662 0.0009527279  3.105508  4.14032903    12  181
                                   Term
1                  nucleic acid binding
2 catechol O-methyltransferase activity
3                           RNA binding
4                      coenzyme binding
> #Cellular component
> params = new("GOHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,ontology="CC",pvalueCutoff=hgCutoff,conditional=TRUE,testDirection="over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
       GOCCID       Pvalue OddsRatio     ExpCount Count  Size
1  GO:0043227 8.095989e-15  2.357756 193.81366272   266  8649
2  GO:0043229 1.078643e-14  2.445344 215.07960860   285  9598
3  GO:0005622 1.268144e-13  2.727612 259.96442377   320 11601
4  GO:0031974 7.622271e-13  2.476594  50.44219618   102  2251
5  GO:0070013 1.297462e-12  2.479110  48.64949263    99  2171
6  GO:0005634 9.443707e-12  2.053406 120.73858420   183  5388
7  GO:0044422 3.286750e-11  2.013652 121.41084803   182  5418
8  GO:0005654 6.334903e-08  2.336166  28.48157768    59  1271
9  GO:0043228 4.001227e-06  1.784747  58.62140614    92  2616
10 GO:0005730 1.096997e-05  2.682435  11.33884996    28   506
11 GO:0005694 1.166634e-05  2.622132  12.01111380    29   536
12 GO:0043234 5.725214e-05  1.662474  59.30486691    88  2712
13 GO:0000151 6.952506e-05  4.226438   3.11482242    12   139
14 GO:0005739 8.158783e-05  1.881356  29.46756463    51  1315
15 GO:0005737 1.050799e-04  1.488375 182.27313361   218  8134
16 GO:0015630 3.948281e-04  2.107557  14.67776033    29   655
17 GO:0000775 4.285894e-04  3.660054   3.24927519    11   145
18 GO:0005684 5.008308e-04       Inf   0.04481759     2     2
19 GO:0080008 7.364930e-04 11.749319   0.42576709     4    19
                                 Term
1          membrane-bounded organelle
2             intracellular organelle
3                       intracellular
4             membrane-enclosed lumen
5       intracellular organelle lumen
6                             nucleus
7                      organelle part
8                         nucleoplasm
9      non-membrane-bounded organelle
10                          nucleolus
11                         chromosome
12                    protein complex
13           ubiquitin ligase complex
14                      mitochondrion
15                          cytoplasm
16           microtubule cytoskeleton
17     chromosome, centromeric region
18       U2-type spliceosomal complex
19 CUL4 RING ubiquitin ligase complex
> #KEGG
> params = new("KEGGHyperGParams",geneIds=selectedEntrezIds,universeGeneIds=entrezUniverse,pvalueCutoff=hgCutoff,testDirection="over",annotation="org.Hs.eg.db")
> hgOver=hyperGTest(params)
> summary(hgOver)
  KEGGID       Pvalue OddsRatio ExpCount Count Size          Term
1  03013 0.0003025115  3.918123 3.174733    11  152 RNA transport
>

Although this was a brief analysis, the results are somewhat similar to the findings in the paper Trinklein et al., 2004 (An Abundance of Bidirectional Promoters in the Human Genome). Findings from the paper include:

1. DNA-repair genes are more than fivefold overrepresented in the bidirectional class.
2. Chaperone proteins are almost threefold overrepresented
3. Mitochondrial genes are more than twofold overrepresented

Mapping qualities

Updated 2014 December 17th

Current high throughput sequencers produces reads that are short; for example the HiSeq2000 produces millions of reads that are 50 and 100 bp long. To align such short reads with high speed and accuracy, many short read alignment programs have been developed, such as BWA. The major limitation is the length of the sequenced reads because many eukaryotic genomes are repetitive and therefore it is difficult to accurately map these reads. Because of this, alignment programs have mapping qualities for each read that is mapped to the reference genome. A mapping quality is basically the probability that a read is aligned in the wrong place (i.e. phred-scaled posterior probability that the mapping position of this read is incorrect). The probability is calculated as:

p = 10^{-q/10}

where q is the quality. For example a mapping quality of 40 = 10 to the power of -4, which is 0.0001, which means there is a 0.01 percent chance that the read is aligned incorrectly.

p = 10^{-40/10} = 0.0001

Base calling errors with respect to mapping qualities

Sequencers make base calling mistakes and this complicates matters. To illustrate how this affects the mapping qualities using BWA, I will use an example I came across in SEQanswers. First let's examine mapping qualities when a read maps to a specific region without suboptimal hits:

#what version of BWA
bwa 2>&1 | head -4

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.10-r789
Contact: Heng Li <lh3@sanger.ac.uk>

cat ref.fa 
>ref
ACGTACGTACGTACGTACGTACGTAGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCC

cat read.fa 
>read
ACGTACGTACGTACGTACGTACGTA

Now to do the mapping:

bwa index ref.fa
bwa aln ref.fa read.fa > read.sai
bwa samse ref.fa read.sai read.fa > read.sam

#scroll to see the rest of the SAM file
cat read.sam 
@SQ	SN:ref	LN:196
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa samse ref.fa read.sai read.fa
read	0	ref	1	37	25M	*	0	0	ACGTACGTACGTACGTACGTACGTA	*	XT:A:U	NM:i:0	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:25

Mapping the read to our reference, BWA returns a mapping quality of 37 (which is actually the highest mapping quality BWA returns).

Next let's create an example with suboptimal hits. Below is a reference that contains five identical stretches of 28 mers and one 28 mer with a single mismatch (in red) compared to the other five:

>ref2
ACGTACGTACGTACGTACGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

Let's map a read from the single mismatch stretch to this reference:

cat ref2.fa 
>ref2
ACGTACGTACGTACGTACGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG
ACGTACGTACGTACGTAGGTACGTAGGG

cat read2.fa 
>read2
ACGTACGTACGTACGTACGTACGTA

bwa index ref2.fa
bwa aln ref2.fa read2.fa > read2.sai
bwa samse ref2.fa read2.sai read2.fa > read2.sam

#scroll to see the rest of the SAM file
cat read2.sam 
@SQ	SN:ref2	LN:168
@PG	ID:bwa	PN:bwa	VN:0.7.10-r789	CL:bwa samse ref2.fa read2.sai read2.fa
read2	0	ref2	1	16	25M	*	0	0	ACGTACGTACGTACGTACGTACGTA	*	XT:A:U	NM:i:0	X0:i:1	X1:i:5	XM:i:0	XO:i:0	XG:i:0	MD:Z:25

The mapping quality of the read in the second example is 16, which has a probability of  10^{-16/10} = 0.025119 of mapping to the wrong place. Even though the read maps uniquely in the reference, its mapping quality is 16 and not 37. The BWA specific tags in the SAM file provides some nice additional information:

XT Type: Unique/Repeat/N/Mate-sw
X0 Number of best hits
X1 Number of suboptimal hits found by BWA

From the BWA tag information we can quickly deduce whether a read is aligned uniquely; in this case the XT:A:U indicates that it was aligned uniquely. In addition, the X1:i:5 tag indicates that there were 5 suboptimal hits.

Mapping qualities when considering base calling errors

To model base calling errors we can use the Binomial distribution; if I expect there to be 1 base calling error in 100 bps, I can calculate the probability of an error for a read of 25 nt as such using R

#Probability of success (1 error in 100 bases) = 0.99
#Number of trials (each base is a trial) = 25

#no base calling errors, i.e. 25 successes
dbinom(x = 25, size=25, prob=0.99) 
#[1] 0.7778214

#one base calling error, i.e. 24 successes
dbinom(x = 24, size=25, prob=0.99) 
#[1] 0.1964195

#two base calling errors, i.e. 23 successes
dbinom(x = 23, size=25, prob=0.99) 
#[1] 0.02380843

If we expect 1 base calling error in 100 bps, the probability of making two base calling errors in 25 bps is quite low. Using the formula from the SEQanswers post that calculates the posterior probability that the best alignment is actually correct:

 \frac{P_{zero\ base\ errors}}{P_{zero\ bases\ errors} + 5 \times P_{one\ base\ error}}

Calculating this in R:

#the posterior probability that the best alignment is correct
p = 0.99
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
#[1] 0.4419643

In reality base calling is much more accurate than 1 error in 100 bases, which is a Phred quality score of 20. If we changed the base calling error rate to 1 in 1000 (Phred score of 30):

p = 0.999
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
#[1] 0.88879

then the posterior probability that the best alignment is correct improves to 0.88879. Using a base calling error rate of 1 in 10000 (Phred score of 40):

p = 0.9999
dbinom(x = 25, size=25, prob=p) / (dbinom(x = 25, size=25, prob=p) + (5 * dbinom(x = 24, size=25, prob=p)))
#[1] 0.9876531

improves the probability to 0.9876531, which is a ~0.012 probability that the alignment is incorrect, which is around the same ball park to the BWA mapping quality of 16, which is a 0.025 probability that the alignment is incorrect.

Does BWA make use of base calling qualities?

When I included base calling qualities to the read

@read
ACGTACGTACGTACGTACGTACGTA
+
!!!!!!!!!!!!!!!!!!!!!!!!!

I still get the same mapping quality of 16 with BWA, indicating that mapping qualities are not used by BWA:

tag 0 artificial 1 16 25M * 0 0 ACGTACGTACGTACGTACGTACGTA !!!!!!!!!!!!!!!!!!!!!!!!! XT:A:U NM:i:0 X0:i:1 X1:i:5 XM:i:0 XO:i:0 XG:i:0 MD:Z:25

This was confirmed when I examined the BWA manual, which mentioned that "Base quality is NOT considered in evaluating hits."

Using Velvet

wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.1.05.tgz
tar -xzf velvet_1.1.05.tgz
sudo apt-get install texlive-latex-base
make
cd ~/bin/
ln -s ~/src/velvet_1.1.05/velvet* .

Write script for generating random tags from a longer piece of DNA

#!/usr/bin/perl

use strict;
use warnings;

my $dna = 'GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACTTTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCACACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT';

my $dna_length = length($dna);

my $tag_length = '31';
my $end_limit = $dna_length - $tag_length + 1;

for (1 .. 5000){
   my $random = int(rand($end_limit));
   my $random_tag = substr($dna,$random,$tag_length);
   my $end = $random + $tag_length;
   my $id = $_;
   $id .= "_${random}_$end";
   print ">$id\n$random_tag\n";
}

exit(0);

Generate random tags and use as input for velvet

generate_random_tag.pl > test.fa
mkdir velvet_test
velveth velvet_test/ 21 -short test.fa
velvetg velvet_test/
cd velvet_test/
cat contigs.fa
>NODE_1_length_480_cov_114.583336
GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAG
AGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAA
AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC
TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACT
TTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC
ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA
TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC
AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC
TCTACAATAGTAGTTGGGGT

I don't know why in the definition line reads length = 480 (NODE_1_length_480) when the contig length is 500.

BLAST the contig back to the original sequence

makeblastdb -in original.fa -dbtype nucl -title original -out original
blastn -db original -query velvet_test/contigs.fa
Score =  924 bits (500),  Expect = 0.0
Identities = 500/500 (100%), Gaps = 0/500 (0%)
Strand=Plus/Plus

Query  1    GATGTTTAAAGGCATTTCCTGTaaaaaaaaaaaaaaaaagagagcgagagaaaaagagag  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1    GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAG  60

Query  61   agagagagagagaATTTCTGATGATTaaaaaaaaaaaGTGAGGAATGCTGAGTTAAACAA  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61   AGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAA  120

Query  121  AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC  180
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121  AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC  180

Query  181  TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACt  240
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  181  TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACT  240

Query  241  ttttttttttttttATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC  300
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  241  TTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC  300

Query  301  ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA  360
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  301  ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA  360

Query  361  TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC  420
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  361  TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC  420

Query  421  AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC  480
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  421  AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC  480

Query  481  TCTACAATAGTAGTTGGGGT  500
            ||||||||||||||||||||
Sbjct  481  TCTACAATAGTAGTTGGGGT  500

Visualise where the tags came from by first using a short read mapper (bwa)

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fast[aq]> <indexed_genome_to_map_to>\n";
my $file = shift or die $usage;
my $genome = shift or die $usage;

die "Genome file $genome does not exist\n" unless -e $genome;

my $base = $file;
$base =~ s/\.f[aq]$//;
my $aln = "bwa aln -n 2 -t 60 $genome $file > $base.sai";
system($aln);
my $samse = "nice bwa samse $genome $base.sai $file > $base.sam";
system($samse);
my $bam = "nice samtools view -S -b $base.sam > $base.bam";
system($bam);
my $bam_sort = "nice samtools sort $base.bam ${base}_sorted";
system($bam_sort);
my $bam_index = "nice samtools index ${base}_sorted.bam";
system($bam_index);
unlink("$base.sai");
unlink("$base.sam");
unlink("$base.bam");

exit(0);

__END__
bwa.pl test.fa original.fa

Visualise mapped tags using IGV

Now let's try introducing some sequencing errors (1 error in 31 bases):

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

my $dna = 'GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACTTTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCACACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT';

my $dna_length = length($dna);

my $tag_length = '31';
my $end_limit = $dna_length - $tag_length + 1;

for (1 .. 5000){
   my $random = int(rand($end_limit));
   my $random_tag = substr($dna,$random,$tag_length);
   my $end = $random + $tag_length;
   my $id = $_;
   $id .= "_${random}_$end";
   #print ">$id\n$random_tag\n";
   my $random_mutated_tag = random_mutation($random_tag);
   print ">$id\n$random_mutated_tag\n";
}

exit(0);

sub random_mutation {
   my ($seq) = @_;
   my $seq_length = length($seq);
   my $random = int(rand($seq_length));
   my $base = substr($seq,$random,1);
   my @base = ('A','C','G','T');
   my $random_base = $base;
   while ($base eq $random_base){
      my $random_base_index = int(rand(scalar(@base)));
      #print "$random_base_index\n";
      $random_base = $base[$random_base_index];
   }
   my $mutate_seq = substr($seq,$random,1,$random_base);
   return($seq);
}

The contigs.fa file is shown below:

>NODE_13_length_21_cov_1.619048
AAATAGATTTTAACTAAAAATTTATTTCGAGACAAAAATAA
>NODE_19_length_143_cov_57.832169
ATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAA
TTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAG
GATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT
>NODE_24_length_190_cov_57.431580
AGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTT
AAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCT
TATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAA
CCCACTTTTTTTTTTTTTTTATTCATGGTC
>NODE_25_length_21_cov_1.809524
CAATATCAGACAAAATAGATGTTAACTAAAAAATTATTTCG
>NODE_33_length_21_cov_2.285714
CAATATCAGACAAAATAGATCTTAACTAAAAAATTATTTCG
>NODE_34_length_21_cov_1.619048
ATAGATTTTAACTAAAAAATCATTTCGAGACAAAAATAACA
>NODE_35_length_21_cov_3.523809
TTTTTTTATTCATGGTCGAAAATTATTTATTGTCAGAAAGG
>NODE_36_length_21_cov_1.666667
TCATGGTCGAATATTATTTACTGTCAGAAAGGTACAGCATT
>NODE_37_length_21_cov_2.380952
GACAAAATAGATTTTAACTAGAAAATTATTTCGAGACAAAA
>NODE_43_length_21_cov_2.047619
AGACAAAATAGATTTTAACTCAAAAATTATTTCGAGACAAA
>NODE_44_length_21_cov_3.000000
ACCAATATCAGACAAAATAGGTTTTAACTAAAAAATTATTT
>NODE_45_length_21_cov_2.095238
CAGACAAAATAGATTTTAACGAAAAAATTATTTCGAGACAA
>NODE_46_length_21_cov_2.666667
AATATTATTTATTGTCAGAACGGTACAGCATTCACACCAAT
>NODE_47_length_21_cov_1.619048
TTCACACCAATATCAGACAAGATAGATTTTAACTAAAAAAT
>NODE_48_length_21_cov_2.190476
ATTCATGGTCGAATATTATTCATTGTCAGAAAGGTACAGCA
>NODE_49_length_21_cov_2.476191
TTCACACCAATATCAGACAATATAGATTTTAACTAAAAAAT
>NODE_50_length_21_cov_1.809524
ATATCAGACAAAATAGATTTCAACTAAAAAATTATTTCGAG
>NODE_51_length_21_cov_2.571429
AATATCAGACAAAATAGATTCTAACTAAAAAATTATTTCGA
>NODE_52_length_21_cov_1.714286
CGAATATTATTTATTGTCAGGAAGGTACAGCATTCACACCA
>NODE_53_length_21_cov_1.428571
TTTTTTTTTTATTCATGGTCCAATATTATTTATTGTCAGAA
>NODE_54_length_21_cov_2.380952
TTTTTTATTCATGGTCGAATTTTATTTATTGTCAGAAAGGT
>NODE_55_length_21_cov_1.238095
GAATATTATTTATTGTCAGATAGGTACAGCATTCACACCAA

I am unable to generate the full sequence even with greater than 300X coverage (using the default settings). Of course in reality you would get much better sequencing results.

IGV plot of the tags with errors introduced at a frequency of 1 in 31.