Using Velvet

wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.1.05.tgz
tar -xzf velvet_1.1.05.tgz
sudo apt-get install texlive-latex-base
make
cd ~/bin/
ln -s ~/src/velvet_1.1.05/velvet* .

Write script for generating random tags from a longer piece of DNA

#!/usr/bin/perl

use strict;
use warnings;

my $dna = 'GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACTTTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCACACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT';

my $dna_length = length($dna);

my $tag_length = '31';
my $end_limit = $dna_length - $tag_length + 1;

for (1 .. 5000){
   my $random = int(rand($end_limit));
   my $random_tag = substr($dna,$random,$tag_length);
   my $end = $random + $tag_length;
   my $id = $_;
   $id .= "_${random}_$end";
   print ">$id\n$random_tag\n";
}

exit(0);

Generate random tags and use as input for velvet

generate_random_tag.pl > test.fa
mkdir velvet_test
velveth velvet_test/ 21 -short test.fa
velvetg velvet_test/
cd velvet_test/
cat contigs.fa
>NODE_1_length_480_cov_114.583336
GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAG
AGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAA
AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC
TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACT
TTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC
ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA
TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC
AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC
TCTACAATAGTAGTTGGGGT

I don’t know why in the definition line reads length = 480 (NODE_1_length_480) when the contig length is 500.

BLAST the contig back to the original sequence

makeblastdb -in original.fa -dbtype nucl -title original -out original
blastn -db original -query velvet_test/contigs.fa
Score =  924 bits (500),  Expect = 0.0
Identities = 500/500 (100%), Gaps = 0/500 (0%)
Strand=Plus/Plus

Query  1    GATGTTTAAAGGCATTTCCTGTaaaaaaaaaaaaaaaaagagagcgagagaaaaagagag  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1    GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAG  60

Query  61   agagagagagagaATTTCTGATGATTaaaaaaaaaaaGTGAGGAATGCTGAGTTAAACAA  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61   AGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAA  120

Query  121  AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC  180
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121  AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC  180

Query  181  TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACt  240
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  181  TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACT  240

Query  241  ttttttttttttttATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC  300
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  241  TTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC  300

Query  301  ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA  360
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  301  ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA  360

Query  361  TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC  420
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  361  TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC  420

Query  421  AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC  480
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  421  AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC  480

Query  481  TCTACAATAGTAGTTGGGGT  500
            ||||||||||||||||||||
Sbjct  481  TCTACAATAGTAGTTGGGGT  500

Visualise where the tags came from by first using a short read mapper (bwa)

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.fast[aq]> <indexed_genome_to_map_to>\n";
my $file = shift or die $usage;
my $genome = shift or die $usage;

die "Genome file $genome does not exist\n" unless -e $genome;

my $base = $file;
$base =~ s/\.f[aq]$//;
my $aln = "bwa aln -n 2 -t 60 $genome $file > $base.sai";
system($aln);
my $samse = "nice bwa samse $genome $base.sai $file > $base.sam";
system($samse);
my $bam = "nice samtools view -S -b $base.sam > $base.bam";
system($bam);
my $bam_sort = "nice samtools sort $base.bam ${base}_sorted";
system($bam_sort);
my $bam_index = "nice samtools index ${base}_sorted.bam";
system($bam_index);
unlink("$base.sai");
unlink("$base.sam");
unlink("$base.bam");

exit(0);

__END__
bwa.pl test.fa original.fa

Visualise mapped tags using IGV

Now let’s try introducing some sequencing errors (1 error in 31 bases):

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

my $dna = 'GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACTTTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCACACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT';

my $dna_length = length($dna);

my $tag_length = '31';
my $end_limit = $dna_length - $tag_length + 1;

for (1 .. 5000){
   my $random = int(rand($end_limit));
   my $random_tag = substr($dna,$random,$tag_length);
   my $end = $random + $tag_length;
   my $id = $_;
   $id .= "_${random}_$end";
   #print ">$id\n$random_tag\n";
   my $random_mutated_tag = random_mutation($random_tag);
   print ">$id\n$random_mutated_tag\n";
}

exit(0);

sub random_mutation {
   my ($seq) = @_;
   my $seq_length = length($seq);
   my $random = int(rand($seq_length));
   my $base = substr($seq,$random,1);
   my @base = ('A','C','G','T');
   my $random_base = $base;
   while ($base eq $random_base){
      my $random_base_index = int(rand(scalar(@base)));
      #print "$random_base_index\n";
      $random_base = $base[$random_base_index];
   }
   my $mutate_seq = substr($seq,$random,1,$random_base);
   return($seq);
}

The contigs.fa file is shown below:

>NODE_13_length_21_cov_1.619048
AAATAGATTTTAACTAAAAATTTATTTCGAGACAAAAATAA
>NODE_19_length_143_cov_57.832169
ATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAA
TTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAG
GATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT
>NODE_24_length_190_cov_57.431580
AGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTT
AAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCT
TATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAA
CCCACTTTTTTTTTTTTTTTATTCATGGTC
>NODE_25_length_21_cov_1.809524
CAATATCAGACAAAATAGATGTTAACTAAAAAATTATTTCG
>NODE_33_length_21_cov_2.285714
CAATATCAGACAAAATAGATCTTAACTAAAAAATTATTTCG
>NODE_34_length_21_cov_1.619048
ATAGATTTTAACTAAAAAATCATTTCGAGACAAAAATAACA
>NODE_35_length_21_cov_3.523809
TTTTTTTATTCATGGTCGAAAATTATTTATTGTCAGAAAGG
>NODE_36_length_21_cov_1.666667
TCATGGTCGAATATTATTTACTGTCAGAAAGGTACAGCATT
>NODE_37_length_21_cov_2.380952
GACAAAATAGATTTTAACTAGAAAATTATTTCGAGACAAAA
>NODE_43_length_21_cov_2.047619
AGACAAAATAGATTTTAACTCAAAAATTATTTCGAGACAAA
>NODE_44_length_21_cov_3.000000
ACCAATATCAGACAAAATAGGTTTTAACTAAAAAATTATTT
>NODE_45_length_21_cov_2.095238
CAGACAAAATAGATTTTAACGAAAAAATTATTTCGAGACAA
>NODE_46_length_21_cov_2.666667
AATATTATTTATTGTCAGAACGGTACAGCATTCACACCAAT
>NODE_47_length_21_cov_1.619048
TTCACACCAATATCAGACAAGATAGATTTTAACTAAAAAAT
>NODE_48_length_21_cov_2.190476
ATTCATGGTCGAATATTATTCATTGTCAGAAAGGTACAGCA
>NODE_49_length_21_cov_2.476191
TTCACACCAATATCAGACAATATAGATTTTAACTAAAAAAT
>NODE_50_length_21_cov_1.809524
ATATCAGACAAAATAGATTTCAACTAAAAAATTATTTCGAG
>NODE_51_length_21_cov_2.571429
AATATCAGACAAAATAGATTCTAACTAAAAAATTATTTCGA
>NODE_52_length_21_cov_1.714286
CGAATATTATTTATTGTCAGGAAGGTACAGCATTCACACCA
>NODE_53_length_21_cov_1.428571
TTTTTTTTTTATTCATGGTCCAATATTATTTATTGTCAGAA
>NODE_54_length_21_cov_2.380952
TTTTTTATTCATGGTCGAATTTTATTTATTGTCAGAAAGGT
>NODE_55_length_21_cov_1.238095
GAATATTATTTATTGTCAGATAGGTACAGCATTCACACCAA

I am unable to generate the full sequence even with greater than 300X coverage (using the default settings). Of course in reality you would get much better sequencing results.

IGV plot of the tags with errors introduced at a frequency of 1 in 31.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
7 comments Add yours
  1. Thanks for your helpful page.

    Regarding the discrepancy of length=480 vs. 500, the velvet docs say this:

    “Note that the length and coverage information provided in the header of
    each contig should therefore be understood in k-mers and in k-mer coverage (cf.
    5.2) respectively. E.g. for a 500bp contig and a k-mer length of 21, the length
    in the header will be 480.”

    Rich

  2. “Note that the length and coverage information provided in the header of
    each contig should therefore be understood in k-mers and in k-mer coverage (cf.
    5.2) respectively. E.g. for a 500bp contig and a k-mer length of 21, the length
    in the header will be 480” in velvet manual. 🙂
    sorry for the late reply.

  3. Hello Davo,

    “I am unable to generate the full sequence even with greater than 300X coverage (using the default settings).” In this statement, 300x Coverage indicates coverage of every position/nucleotide in the original genome sequence?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.