RepeatMasker

From Dave's wiki

My old instructions are at http://davetang.org/wiki/tiki-index.php?page=RepeatMasker

Installing

Below are the steps I carried out to install RepeatMasker on my MacBook Air. First, get a copy of cross_match by following the instructions at http://www.phrap.org/consed/consed.html#howToGet. Then install the GNU compilers on Mac OS X by following https://wiki.helsinki.fi/display/HUGG/Installing+the+GNU+compilers+on+Mac+OS+X.

#as per the instructions at wiki.helsinki.fi
#download and install Xcode; start Xcode and update packages
#download gcc-4.9-bin.tar.gz and unzip
gunzip gcc-4.9-bin.tar.gz
sudo tar xvf gcc-4.9-bin.tar -C /

To install cross_match

mkdir cross_match
mv distrib.tar.Z cross_match/
cd cross_match
tar -xzf distrib.tar.Z
#then change the compiler to gcc in the makefile
#CC= gcc
vi makefile
make

Now download the binaries for the Tandem Repeats Finder at http://tandem.bu.edu/trf/trf.download.html

#rename the binary to trf and make sure it is executable
mv trf407b.macos64 trf
chmod +x trf

Download annotations from RepBase Update at http://www.girinst.org/server/RepBase/index.php

#download this file
#http://www.girinst.org/server/RepBase/protected/repeatmaskerlibraries/repeatmaskerlibraries-20140131.tar.gz
tar -xzf repeatmaskerlibraries-20140131.tar.gz
cd Libraries
#../RepeatMasker/Libraries/ must already exist, so unpack RepeatMasker (next step) before copying
cp * ../RepeatMasker/Libraries/

Download the RepeatMasker software from http://www.repeatmasker.org/RMDownload.html, unpack it, and answer the prompts of the configure script.

tar -xzf RepeatMasker-open-4-0-5.tar.gz
#the tarball unpacks into a directory called RepeatMasker
cd RepeatMasker
perl ./configure

Done!
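
Before moving on, it can help to confirm that the pieces installed above are actually reachable. The following is only a minimal sketch and not part of the RepeatMasker distribution: check_tools is a hypothetical helper name, and it relies on nothing more than the shell builtin command -v, which accepts both command names and paths.

```shell
# Hypothetical helper: report which of the given commands or paths can be run.
# Returns 0 if everything was found and 1 if anything is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_tools RepeatMasker perl checks the two commands by name. cross_match and trf can be checked the same way by passing the paths where you built or saved them, which depend on your own setup.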

RepeatMasker --help
######################################################################
RepeatMasker
Developed by Arian Smit and Robert Hubley
Please refer to: Smit, AFA, Hubley, R. & Green, P "RepeatMasker" at
http://www.repeatmasker.org
                                                                      
The interspersed repeat databases are modified versions of 
those found in "RepBase Update" (http://www.girinst.org/)
######################################################################


RepeatMasker is a program that screens DNA sequences for interspersed
repeats and low complexity DNA sequences. The output of the program is
a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which
all the annotated repeats have been masked (default: replaced by
Ns). Sequence comparisons in RepeatMasker are performed by the program
cross_match, an efficient implementation of the Smith-Waterman-Gotoh
algorithm developed by Phil Green, or by WU-Blast developed by Warren
Gish.

Test run

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz
gunzip chrM.fa.gz
RepeatMasker -e crossmatch -species human -s -xsmall chrM.fa
RepeatMasker version open-4.0.5
Search Engine: Crossmatch [ 0.990329 ]
Master RepeatMasker Database: /Users/davetang/src/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: 20140131 )


Building general libraries in: /Users/davetang/src/RepeatMasker/Libraries/20140131/general
Building species libraries in: /Users/davetang/src/RepeatMasker/Libraries/20140131/homo_sapiens
   - 1508 ancestral and ubiquitous sequence(s) for homo sapiens
   - 8 lineage specific sequence(s) for homo sapiens

analyzing file chrM.fa

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying long interspersed repeats in batch 1 of 1
identifying ancient repeats in batch 1 of 1
identifying retrovirus-like sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done
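
A quick look at the input can make the reports easier to read, because the length of the query sequence shows up again in the summary table and in the (left) columns. A minimal sketch, assuming a plain, uncompressed FASTA file; fasta_lengths is a hypothetical helper, not a RepeatMasker command.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file.
# Reads the files given as arguments, or standard input if there are none.
fasta_lengths() {
  awk '
    /^>/ {
      # a new header line: flush the previous record, then start counting again
      if (seen) print name "\t" len
      name = substr($1, 2); len = 0; seen = 1
      next
    }
    { len += length($0) }
    END { if (seen) print name "\t" len }
  ' "$@"
}
```

Running fasta_lengths chrM.fa should give one line for chrM, and its length should agree with the total length that RepeatMasker reports.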

Interpreting the results

First, have a read of http://www.repeatmasker.org/webrepeatmaskerhelp.html

RepeatMasker outputs five files:

ls -1 chrM.fa.*
chrM.fa.cat
chrM.fa.log
chrM.fa.masked
chrM.fa.out
chrM.fa.tbl
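
The test run used -xsmall, which makes RepeatMasker write repeats in lowercase instead of replacing them with Ns, so the masked bases in chrM.fa.masked can be counted directly. A rough sketch, assuming that lowercase letters in the masked file mark only the repeats found in this run; count_softmasked is a hypothetical helper.

```shell
# Hypothetical helper: count lowercase (soft-masked) bases in a FASTA file.
# Prints "masked<TAB>total"; header lines starting with ">" are ignored.
count_softmasked() {
  awk '
    !/^>/ {
      total += length($0)
      # gsub returns the number of characters it replaced on this line
      masked += gsub(/[acgtn]/, "")
    }
    END { printf "%d\t%d\n", masked, total }
  ' "$@"
}
```

Calling count_softmasked chrM.fa.masked gives the two numbers to compare against the bases masked and total length lines of chrM.fa.tbl.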

cat chrM.fa.tbl
==================================================
file name: chrM.fa                  
sequences:             1
total length:      16571 bp  (16571 bp excl N/X-runs) 
GC level:         44.49 %
bases masked:        422 bp ( 2.55 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:                0            0 bp    0.00 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:                0            0 bp    0.00 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:         0            0 bp    0.00 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:         0            0 bp    0.00 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:         0            0 bp    0.00 %

Total interspersed repeats:        0 bp    0.00 %


Small RNA:            4          373 bp    2.25 %

Satellites:           0            0 bp    0.00 %
Simple repeats:       1           49 bp    0.30 %
Low complexity:       0            0 bp    0.00 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
                                                      

The query species was assumed to be homo sapiens  
RepeatMasker version open-4.0.5 , sensitive mode
                                 
run with cross_match version 0.990329
RepBase Update 20140131, RM database version 20140131
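
The headline figure in this table can be cross-checked against chrM.fa.out, which lists one annotated interval per line with its begin and end position in the query. A minimal sketch, assuming the intervals do not overlap (overlapping annotations would be counted twice) and taking the sequence length from the total length line; masked_bp is a hypothetical helper.

```shell
# Hypothetical helper: sum the interval lengths in a RepeatMasker .out file.
# usage: masked_bp SEQUENCE_LENGTH [file.out]
masked_bp() {
  seq_len=$1
  shift
  awk -v len="$seq_len" '
    # data lines start with an integer score; header and blank lines do not
    $1 ~ /^[0-9]+$/ { bp += $7 - $6 + 1 }
    END { printf "%d bp (%.2f %%)\n", bp, 100 * bp / len }
  ' "$@"
}
```

It would be called as masked_bp 16571 chrM.fa.out. For this run the five interval lengths are 156 + 78 + 72 + 67 + 49 = 422 bp, which agrees with the bases masked line, and 422 / 16571 is 2.55 %. The first four intervals (the rRNA and tRNA hits) add up to the 373 bp reported as Small RNA.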

cat chrM.fa.out
   SW   perc perc perc  query     position in query     matching        repeat          position in repeat
score   div. del. ins.  sequence  begin end    (left)   repeat          class/family  begin  end    (left)  ID

  324   32.0  3.2  3.9  chrM       2592  2747 (13824) + LSU-rRNA_Hsa    rRNA            3753   3907 (1128)   1  
  650    5.1  0.0  0.0  chrM       3231  3308 (13263) + tRNA-Leu-TTA(m) tRNA               1     78    (0)   2  
  429   16.7  0.0  0.0  chrM       4330  4401 (12170) C tRNA-Gln-CAA_   tRNA             (3)     72      1   3  
  432   13.4  1.5  0.0  chrM       7449  7515  (9056) C tRNA-Ser-TCA(m) tRNA             (5)     68      1   4  
   14   10.3  0.0  8.8  chrM      16207 16255   (316) + (TCAACT)n       Simple_repeat      1     44    (0)   5
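
Many downstream tools expect BED rather than this layout. A hedged sketch of a conversion, assuming the column order shown above (score, the three percentages, query sequence, begin, end, left, strand, repeat, class/family) and that data lines are the ones starting with an integer score; rmout_to_bed is a hypothetical helper. BED starts are 0-based, so begin is decreased by one, and C (complement) is written as -.

```shell
# Hypothetical helper: convert a RepeatMasker .out file to 6-column BED.
# Columns written: query, start (0-based), end, repeat#class, SW score, strand.
rmout_to_bed() {
  awk '
    BEGIN { OFS = "\t" }
    # skip the two header lines and the blank line: their first field is not an integer
    $1 ~ /^[0-9]+$/ {
      strand = ($9 == "C") ? "-" : "+"
      print $5, $6 - 1, $7, $10 "#" $11, $1, strand
    }
  ' "$@"
}
```

A typical call would be rmout_to_bed chrM.fa.out > chrM.fa.bed, where the output name is only an example. The SW score is copied into the BED score column unchanged, so tools that enforce the 0 to 1000 range would need it capped first.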

cat chrM.fa.cat
324 32.00 3.23 3.85 chrM 2592 2747 (13676) LSU-rRNA_Hsa#rRNA 3753 3907 (1128) m_b1s551i0

  chrM                2592 AAGGTAGCATAATCACTTGTTCCTTAAATAGGGACCTGTATGAATGGCTC 2641
                                   vv   vv  i  ivii   v   v   vi i        v v
  LSU-rRNA_Hsa#       3753 AAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATG 3802

  chrM                2642 CACGAGGGTTCAGCTGTCTCTTACTTTTAACCAGTGAAATTGACCTGCCC 2691
                           v     iv i vi     i  vv  vi  v    i    iii-- v   -
  LSU-rRNA_Hsa#       3803 AACGAGATTCCCACTGTCCCTACCTACTATCCAGCGAAACCA--CAGCC- 3849

  chrM                2692 GTGAAGAGG-CGGGCATGACA----CAGCAAGACGAGAAGACCCTATGGA 2736
                           ---   i i-     v  i i----    ii ivi          i v  
  LSU-rRNA_Hsa#       3850 ---AAGGGAACGGGCTTGGCGGAATCAGCGGGGAAAGAAGACCCTGTTGA 3896

  chrM                2737 GCTTTAATTTA 2747
                               v v i  
  LSU-rRNA_Hsa#       3897 GCTTGACTCTA 3907

Matrix = 20p43g.matrix
Kimura (with divCpGMod) = 36.97
Transitions / transversions = 1.09 (25/23)
Gap_init rate = 0.05 (8 / 155), avg. gap size = 1.38 (11 / 8)

650 5.13 0.00 0.00 chrM 3231 3308 (13263) tRNA-Leu-TTA(m)#tRNA 1 78 (0) c_b1s401i0

  chrM                3231 GTTAAGATGGCAGAGCCCGGTAATCGCATAAAACTTAAAACTTTACAGTC 3280
                                            i                      i         
  tRNA-Leu-TTA(          1 GTTAAGATGGCAGAGCCTGGTAATCGCATAAAACTTAAAATTTTACAGTC 50

  chrM                3281 AGAGGTTCAATTCCTCTTCTTAACAACA 3308
                                     i              v  
  tRNA-Leu-TTA(         51 AGAGGTTCAACTCCTCTTCTTAACACCA 78

Matrix = 18p43g.matrix
Kimura (with divCpGMod) = 5.35
Transitions / transversions = 3.00 (3/1)
Gap_init rate = 0.00 (0 / 77), avg. gap size = 0.0 (0 / 0)

429 16.67 0.00 0.00 chrM 4330 4401 (12170) C tRNA-Gln-CAA_#tRNA (3) 72 1 c_b1s401i1

  chrM                4330 CTAGGACTATGAGAATCGAACCCATCCCTGAGAATCCAAAATTCTCCGTG 4379
                               i     ii    i     i i          i              
C tRNA-Gln-CAA_         72 CTAGAACTATAGGAATTGAACCTACCCCTGAGAATTCAAAATTCTCCGTG 23

  chrM                4380 CCACCTATCACACCCCATCCTA 4401
                            i      i     vii     
C tRNA-Gln-CAA_         22 CTACCTATTACACCATGTCCTA 1

Matrix = 18p43g.matrix
Kimura (with divCpGMod) = 19.95
Transitions / transversions = 11.00 (11/1)
Gap_init rate = 0.00 (0 / 71), avg. gap size = 0.0 (0 / 0)

432 13.43 1.47 0.00 chrM 7449 7515 (9056) C tRNA-Ser-TCA(m)#tRNA (5) 68 1 m_b1s357i0

  chrM                7449 AAAAAGGAAGGAATCGAACCCCCCAAAG-CTGGTTTCAAGCCAACCCCAT 7497
                                   i                v  -               i     
C tRNA-Ser-TCA(         68 AAAAAGGAGGGAATCGAACCCCCCACAGACTGGTTTCAAGCCAATCCCAT 19

  chrM                7498 GGCCTCCATGACTTTTTC 7515
                           ii    ii    i  i  
C tRNA-Ser-TCA(         18 AACCTCTGTGACCTTCTC 1

Matrix = 18p43g.matrix
Kimura (with divCpGMod) = 15.39
Transitions / transversions = 8.00 (8/1)
Gap_init rate = 0.02 (1 / 66), avg. gap size = 1.00 (1 / 1)

13 15.38 6.67 2.13 chrM 16207 16251 (320) (CAACTAT)n#Simple_repeat 1 47 (0) m_b1s252i0

  chrM               16207 CAAGTA-CAGCAATCAACCTTCAACTATCACAC-ATCAACT-GCAACT 16251
                              v  -  i v      iv          -  -       -v     
  (CAACTAT)n#Si          1 CAACTATCAACTATCAACTATCAACTATCA-ACTATCAACTATCAACT 47

Matrix = Unknown
Transitions / transversions = 0.50 (2/4)
Gap_init rate = 0.09 (4 / 44), avg. gap size = 1.00 (4 / 4)

14 10.30 0.00 8.82 chrM 16219 16255 (316) (TCAACT)n#Simple_repeat 1 34 (0) m_b1s252i1

  chrM               16219 TCAACCTTCAACTATCACACATCAACTGCAACTCCAA 16255
                               -        -   -  v      v     i   
  (TCAACT)n#Sim          1 TCAA-CTTCAACT-TCA-ACTTCAACTTCAACTTCAA 34

Matrix = Unknown
Transitions / transversions = 0.50 (1/2)
Gap_init rate = 0.08 (3 / 36), avg. gap size = 1.00 (3 / 3)

## Total Sequences: 1
## Total Length: 16571
## Total NonMask ( excluding >20bp runs of N/X bases ): 16571
## Total NonSub ( excluding all non ACGT bases ):16571
RepeatMasker version open-4.0.5 , sensitive mode
run with cross_match version 0.990329
RepBase Update 20140131, RM database version 20140131
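
Each alignment in the .cat file starts with a one-line summary of the hit and, where a scoring matrix was used, ends with a line starting with Kimura that gives a divergence estimate. A minimal sketch for pulling the two together, assuming the layout shown above (an optional C before the repeat name marks a complement match); cat_kimura is a hypothetical helper.

```shell
# Hypothetical helper: list query, begin, end, repeat and Kimura divergence
# for every alignment in a RepeatMasker .cat file that has a Kimura line.
cat_kimura() {
  awk '
    BEGIN { OFS = "\t" }
    # summary lines start with an integer score and have at least 12 fields
    $1 ~ /^[0-9]+$/ && NF >= 12 {
      hit = $5 OFS $6 OFS $7 OFS (($9 == "C") ? $10 : $9)
    }
    # the divergence is the last field of the Kimura line
    /^Kimura/ && hit != "" { print hit, $NF; hit = "" }
  ' "$@"
}
```

It would be run as cat_kimura chrM.fa.cat. Alignments without a Kimura line, such as the two simple repeat matches above, are skipped.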

Scripts

Scripts to merge TEs (transposable elements) are available from http://doua.prabi.fr/software/one-code-to-find-them-all; refer to the publication for more details: http://www.mobilednajournal.com/content/5/1/13

Aurelie Kapusta has written scripts for parsing RepeatMasker .out files, which can be obtained by running:

#the SSH form (git@github.com:...) only works if you have SSH keys set up on GitHub
git clone https://github.com/4ureliek/Parsing-RepeatMasker-Outputs.git