wget http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.1.05.tgz tar -xzf velvet_1.1.05.tgz sudo apt-get install texlive-latex-base make cd ~/bin/ ln -s ~/src/velvet_1.1.05/velvet* .
Write script for generating random tags from a longer piece of DNA
#!/usr/bin/perl
use strict;
use warnings;
my $dna = 'GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACTTTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCACACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT';
my $dna_length = length($dna);
my $tag_length = '31';
my $end_limit = $dna_length - $tag_length + 1;
for (1 .. 5000){
my $random = int(rand($end_limit));
my $random_tag = substr($dna,$random,$tag_length);
my $end = $random + $tag_length;
my $id = $_;
$id .= "_${random}_$end";
print ">$id\n$random_tag\n";
}
exit(0);
Generate random tags and use as input for velvet
generate_random_tag.pl > test.fa mkdir velvet_test velveth velvet_test/ 21 -short test.fa velvetg velvet_test/ cd velvet_test/ cat contigs.fa >NODE_1_length_480_cov_114.583336 GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAG AGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAA AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACT TTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC TCTACAATAGTAGTTGGGGT
I don't know why in the definition line reads length = 480 (NODE_1_length_480) when the contig length is 500.
BLAST the contig back to the original sequence
makeblastdb -in original.fa -dbtype nucl -title original -out original blastn -db original -query velvet_test/contigs.fa
Score = 924 bits (500), Expect = 0.0
Identities = 500/500 (100%), Gaps = 0/500 (0%)
Strand=Plus/Plus
Query 1 GATGTTTAAAGGCATTTCCTGTaaaaaaaaaaaaaaaaagagagcgagagaaaaagagag 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1 GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAG 60
Query 61 agagagagagagaATTTCTGATGATTaaaaaaaaaaaGTGAGGAATGCTGAGTTAAACAA 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 61 AGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAA 120
Query 121 AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC 180
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 121 AGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGC 180
Query 181 TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACt 240
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 181 TCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACT 240
Query 241 ttttttttttttttATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC 300
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 241 TTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCAC 300
Query 301 ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA 360
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 301 ACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATA 360
Query 361 TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC 420
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 361 TATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAAC 420
Query 421 AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC 480
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 421 AACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCAC 480
Query 481 TCTACAATAGTAGTTGGGGT 500
||||||||||||||||||||
Sbjct 481 TCTACAATAGTAGTTGGGGT 500
Visualise where the tags came from by first using a short read mapper (bwa)
#!/usr/bin/perl
use strict;
use warnings;
my $usage = "Usage: $0 <infile.fast[aq]> <indexed_genome_to_map_to>\n";
my $file = shift or die $usage;
my $genome = shift or die $usage;
die "Genome file $genome does not exist\n" unless -e $genome;
my $base = $file;
$base =~ s/\.f[aq]$//;
my $aln = "bwa aln -n 2 -t 60 $genome $file > $base.sai";
system($aln);
my $samse = "nice bwa samse $genome $base.sai $file > $base.sam";
system($samse);
my $bam = "nice samtools view -S -b $base.sam > $base.bam";
system($bam);
my $bam_sort = "nice samtools sort $base.bam ${base}_sorted";
system($bam_sort);
my $bam_index = "nice samtools index ${base}_sorted.bam";
system($bam_index);
unlink("$base.sai");
unlink("$base.sam");
unlink("$base.bam");
exit(0);
__END__
bwa.pl test.fa original.fa
Visualise mapped tags using IGV
Now let's try introducing some sequencing errors (1 error in 31 bases):
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $dna = 'GATGTTTAAAGGCATTTCCTGTAAAAAAAAAAAAAAAAAGAGAGCGAGAGAAAAAGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTTAAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCTTATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAACCCACTTTTTTTTTTTTTTTATTCATGGTCGAATATTATTTATTGTCAGAAAGGTACAGCATTCACACCAATATCAGACAAAATAGATTTTAACTAAAAAATTATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAATTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAGGATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT';
my $dna_length = length($dna);
my $tag_length = '31';
my $end_limit = $dna_length - $tag_length + 1;
for (1 .. 5000){
my $random = int(rand($end_limit));
my $random_tag = substr($dna,$random,$tag_length);
my $end = $random + $tag_length;
my $id = $_;
$id .= "_${random}_$end";
#print ">$id\n$random_tag\n";
my $random_mutated_tag = random_mutation($random_tag);
print ">$id\n$random_mutated_tag\n";
}
exit(0);
sub random_mutation {
my ($seq) = @_;
my $seq_length = length($seq);
my $random = int(rand($seq_length));
my $base = substr($seq,$random,1);
my @base = ('A','C','G','T');
my $random_base = $base;
while ($base eq $random_base){
my $random_base_index = int(rand(scalar(@base)));
#print "$random_base_index\n";
$random_base = $base[$random_base_index];
}
my $mutate_seq = substr($seq,$random,1,$random_base);
return($seq);
}
The contigs.fa file is shown below:
>NODE_13_length_21_cov_1.619048 AAATAGATTTTAACTAAAAATTTATTTCGAGACAAAAATAA >NODE_19_length_143_cov_57.832169 ATTTCGAGACAAAAATAACAATATATGTTAATAAAAGGCTCAATTAAAAATGTATAACAA TTATAAACACATACACATCAAACAACAGTTCCCCAAAATACGTAAAGCAAACATTGACAG GATTGAAGGGAGAAATAGACCACTCTACAATAGTAGTTGGGGT >NODE_24_length_190_cov_57.431580 AGAGAGAGAGAGAGAGAGAATTTCTGATGATTAAAAAAAAAAAGTGAGGAATGCTGAGTT AAACAAAGTTAACCACATTCTCTCAGAGCCTTGAATGTGCTAATATGTGCTAATGTGTCT TATGGCTCTCTAAGGAGGGTGTAGTCAAAATCATCTTCTACACTGCTTAGTTCCCGGGAA CCCACTTTTTTTTTTTTTTTATTCATGGTC >NODE_25_length_21_cov_1.809524 CAATATCAGACAAAATAGATGTTAACTAAAAAATTATTTCG >NODE_33_length_21_cov_2.285714 CAATATCAGACAAAATAGATCTTAACTAAAAAATTATTTCG >NODE_34_length_21_cov_1.619048 ATAGATTTTAACTAAAAAATCATTTCGAGACAAAAATAACA >NODE_35_length_21_cov_3.523809 TTTTTTTATTCATGGTCGAAAATTATTTATTGTCAGAAAGG >NODE_36_length_21_cov_1.666667 TCATGGTCGAATATTATTTACTGTCAGAAAGGTACAGCATT >NODE_37_length_21_cov_2.380952 GACAAAATAGATTTTAACTAGAAAATTATTTCGAGACAAAA >NODE_43_length_21_cov_2.047619 AGACAAAATAGATTTTAACTCAAAAATTATTTCGAGACAAA >NODE_44_length_21_cov_3.000000 ACCAATATCAGACAAAATAGGTTTTAACTAAAAAATTATTT >NODE_45_length_21_cov_2.095238 CAGACAAAATAGATTTTAACGAAAAAATTATTTCGAGACAA >NODE_46_length_21_cov_2.666667 AATATTATTTATTGTCAGAACGGTACAGCATTCACACCAAT >NODE_47_length_21_cov_1.619048 TTCACACCAATATCAGACAAGATAGATTTTAACTAAAAAAT >NODE_48_length_21_cov_2.190476 ATTCATGGTCGAATATTATTCATTGTCAGAAAGGTACAGCA >NODE_49_length_21_cov_2.476191 TTCACACCAATATCAGACAATATAGATTTTAACTAAAAAAT >NODE_50_length_21_cov_1.809524 ATATCAGACAAAATAGATTTCAACTAAAAAATTATTTCGAG >NODE_51_length_21_cov_2.571429 AATATCAGACAAAATAGATTCTAACTAAAAAATTATTTCGA >NODE_52_length_21_cov_1.714286 CGAATATTATTTATTGTCAGGAAGGTACAGCATTCACACCA >NODE_53_length_21_cov_1.428571 TTTTTTTTTTATTCATGGTCCAATATTATTTATTGTCAGAA >NODE_54_length_21_cov_2.380952 TTTTTTATTCATGGTCGAATTTTATTTATTGTCAGAAAGGT >NODE_55_length_21_cov_1.238095 GAATATTATTTATTGTCAGATAGGTACAGCATTCACACCAA
I am unable to generate the full sequence even with greater than 300X coverage (using the default settings). Of course in reality you would get much better sequencing results.
IGV plot of the tags with errors introduced at a frequency of 1 in 31.

This work is licensed under a Creative Commons
Attribution 4.0 International License.


Thanks for your helpful page.
Regarding the discrepancy of length=480 vs. 500, the velvet docs say this:
“Note that the length and coverage information provided in the header of
each contig should therefore be understood in k-mers and in k-mer coverage (cf.
5.2) respectively. E.g. for a 500bp contig and a k-mer length of 21, the length
in the header will be 480.”
Rich
Thanks for the comment and for the answer. As per usual, the answer lies in the documentation 🙂
i’d noticed that line, 🙂
Which line were you referring to?
“Note that the length and coverage information provided in the header of
each contig should therefore be understood in k-mers and in k-mer coverage (cf.
5.2) respectively. E.g. for a 500bp contig and a k-mer length of 21, the length
in the header will be 480” in velvet manual. 🙂
sorry for the late reply.
Hello Davo,
May i know that why you have used 21 for k value?
Hello Davo,
“I am unable to generate the full sequence even with greater than 300X coverage (using the default settings).” In this statement, 300x Coverage indicates coverage of every position/nucleotide in the original genome sequence?