RepeatMasker
My old instructions http://davetang.org/wiki/tiki-index.php?page=RepeatMasker
Installing
These are the steps I carried out to install RepeatMasker on my MacBook Air. First, get a copy of cross_match by following the instructions at http://www.phrap.org/consed/consed.html#howToGet. Then install the GNU compilers on Mac OS X, as described at https://wiki.helsinki.fi/display/HUGG/Installing+the+GNU+compilers+on+Mac+OS+X.

#as per the instructions at wiki.helsinki.fi
#download and install Xcode; start Xcode and update packages
#download gcc-4.9-bin.tar.gz and unzip
gunzip gcc-4.9-bin.tar.gz
sudo tar xvf gcc-4.9-bin.tar -C /
To install cross_match
mkdir cross_match
mv distrib.tar.Z cross_match/
cd cross_match
tar -xzf distrib.tar.Z
#then change the compiler to gcc in the makefile
#CC= gcc
vi makefile
make
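The makefile edit above can also be made without opening an editor. The following is only a sketch: it assumes the makefile sets the compiler on a line beginning with "CC=", and set_makefile_cc is a hypothetical helper name, not something shipped with cross_match.

```shell
# Hypothetical helper: point the CC variable of a makefile at a given compiler.
# Assumes the makefile sets the compiler on a line that starts with "CC=".
set_makefile_cc() {
    makefile=$1
    compiler=$2
    # -i.bak edits in place and keeps a backup (makefile.bak); this form
    # is accepted by both the BSD sed on Mac OS X and GNU sed.
    sed -i.bak "s|^CC[[:space:]]*=.*|CC= ${compiler}|" "$makefile"
}

# Only touch the makefile if one exists in the current directory.
if [ -f makefile ]; then
    set_makefile_cc makefile gcc
    grep '^CC' makefile   # confirm the change before running make
fi
```

The backup file makes it easy to restore the original makefile if the build fails with gcc.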
Now download the binaries for the Tandem Repeats Finder at http://tandem.bu.edu/trf/trf.download.html
#rename the binary to trf and make sure it is executable
mv trf407b.macos64 trf
chmod +x trf
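Before configuring RepeatMasker it may be worth confirming that both helper programs are in place and executable. This is a minimal sketch; check_executables is a hypothetical helper, and the two paths in the usage line are assumptions about where the steps above left the binaries.

```shell
# Hypothetical helper: report whether each given path is an executable file.
check_executables() {
    status=0
    for path in "$@"; do
        if [ -f "$path" ] && [ -x "$path" ]; then
            echo "ok: $path"
        else
            echo "missing or not executable: $path"
            status=1
        fi
    done
    return "$status"
}

# These paths are assumptions; adjust them to your own directory layout.
check_executables cross_match/cross_match trf || echo "fix the paths above before configuring RepeatMasker"
```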
Download annotations from RepBase Update at http://www.girinst.org/server/RepBase/index.php
#download this file
#http://www.girinst.org/server/RepBase/protected/repeatmaskerlibraries/repeatmaskerlibraries-20140131.tar.gz
tar -xzf repeatmaskerlibraries-20140131.tar.gz
#the copy below assumes RepeatMasker (next step) is already unpacked alongside
cd Libraries
cp * ../RepeatMasker/Libraries/
Download the RepeatMasker software from http://www.repeatmasker.org/RMDownload.html, unpack it, and follow the instructions at the prompt of the configure script.

tar -xzf RepeatMasker-open-4-0-5.tar.gz
cd RepeatMasker
perl ./configure
Done!
RepeatMasker --help

######################################################################
RepeatMasker
Developed by Arian Smit and Robert Hubley
Please refer to: Smit, AFA, Hubley, R. & Green, P "RepeatMasker" at
http://www.repeatmasker.org

The interspersed repeat databases are modified versions of those found
in "RepBase Update" (http://www.girinst.org/)
######################################################################

RepeatMasker is a program that screens DNA sequences for interspersed
repeats and low complexity DNA sequences. The output of the program is
a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which
all the annotated repeats have been masked (default: replaced by
Ns). Sequence comparisons in RepeatMasker are performed by the program
cross_match, an efficient implementation of the Smith-Waterman-Gotoh
algorithm developed by Phil Green, or by WU-Blast developed by Warren
Gish.
Test run
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chrM.fa.gz
gunzip chrM.fa.gz
RepeatMasker -e crossmatch -species human -s -xsmall chrM.fa

RepeatMasker version open-4.0.5
Search Engine: Crossmatch [ 0.990329 ]
Master RepeatMasker Database: /Users/davetang/src/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: 20140131 )

Building general libraries in: /Users/davetang/src/RepeatMasker/Libraries/20140131/general
Building species libraries in: /Users/davetang/src/RepeatMasker/Libraries/20140131/homo_sapiens
   - 1508 ancestral and ubiquitous sequence(s) for homo sapiens
   - 8 lineage specific sequence(s) for homo sapiens

analyzing file chrM.fa
Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying full-length ALUs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying remaining ALUs in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying long interspersed repeats in batch 1 of 1
identifying ancient repeats in batch 1 of 1
identifying retrovirus-like sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output:
cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 cycle 9 cycle 10
Generating output...
masking
done
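Because the run above uses -xsmall, repeats in chrM.fa.masked are written in lowercase rather than replaced by Ns. A quick way to see how much of the sequence was masked is to count lowercase bases. The sketch below is illustrative only; count_softmasked is a hypothetical helper name.

```shell
# Hypothetical helper: count lowercase (soft-masked) bases in a FASTA file.
# Prints "<masked bases><TAB><total bases>"; header lines (">") are skipped.
count_softmasked() {
    LC_ALL=C awk '!/^>/ {
             total += length($0)
             # gsub returns the number of lowercase characters it removed
             masked += gsub(/[a-z]/, "")
         }
         END { printf "%d\t%d\n", masked, total }' "$1"
}

# Run it on the masked output if the test run above has been carried out.
if [ -f chrM.fa.masked ]; then
    count_softmasked chrM.fa.masked
fi
```

The first number can be compared against the "bases masked" line of the .tbl summary file.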
Interpreting the results
First, have a read of http://www.repeatmasker.org/webrepeatmaskerhelp.html
RepeatMasker outputs five different files:
ls -1 chrM.fa.*
chrM.fa.cat
chrM.fa.log
chrM.fa.masked
chrM.fa.out
chrM.fa.tbl

cat chrM.fa.tbl
==================================================
file name: chrM.fa
sequences:             1
total length:      16571 bp  (16571 bp excl N/X-runs)
GC level:         44.49 %
bases masked:        422 bp ( 2.55 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:                0            0 bp    0.00 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:                0            0 bp    0.00 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:         0            0 bp    0.00 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:         0            0 bp    0.00 %
      hAT-Charlie     0            0 bp    0.00 %
      TcMar-Tigger    0            0 bp    0.00 %

Unclassified:         0            0 bp    0.00 %

Total interspersed repeats:        0 bp    0.00 %

Small RNA:            4          373 bp    2.25 %

Satellites:           0            0 bp    0.00 %
Simple repeats:       1           49 bp    0.30 %
Low complexity:       0            0 bp    0.00 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element

The query species was assumed to be homo sapiens
RepeatMasker version open-4.0.5 , sensitive mode

run with cross_match version 0.990329
RepBase Update 20140131, RM database version 20140131

cat chrM.fa.out
   SW   perc perc perc  query    position in query        matching         repeat         position in repeat
score   div. del. ins.  sequence  begin    end   (left)   repeat           class/family    begin    end  (left)  ID

  324   32.0  3.2  3.9  chrM       2592   2747  (13824) + LSU-rRNA_Hsa     rRNA             3753   3907  (1128)   1
  650    5.1  0.0  0.0  chrM       3231   3308  (13263) + tRNA-Leu-TTA(m)  tRNA                1     78     (0)   2
  429   16.7  0.0  0.0  chrM       4330   4401  (12170) C tRNA-Gln-CAA_    tRNA              (3)     72       1   3
  432   13.4  1.5  0.0  chrM       7449   7515   (9056) C tRNA-Ser-TCA(m)  tRNA              (5)     68       1   4
   14   10.3  0.0  8.8  chrM      16207  16255    (316) + (TCAACT)n        Simple_repeat       1     44     (0)   5

cat chrM.fa.cat
324 32.00 3.23 3.85 chrM 2592 2747 (13676) LSU-rRNA_Hsa#rRNA 3753 3907 (1128) m_b1s551i0

  chrM           2592 AAGGTAGCATAATCACTTGTTCCTTAAATAGGGACCTGTATGAATGGCTC 2641
                              vv   vv  i  ivii   v   v   vi i        v v
  LSU-rRNA_Hsa#  3753 AAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATG 3802

  chrM           2642 CACGAGGGTTCAGCTGTCTCTTACTTTTAACCAGTGAAATTGACCTGCCC 2691
                      v     iv i vi     i  vv  vi  v    i    iii-- v   -
  LSU-rRNA_Hsa#  3803 AACGAGATTCCCACTGTCCCTACCTACTATCCAGCGAAACCA--CAGCC- 3849

  chrM           2692 GTGAAGAGG-CGGGCATGACA----CAGCAAGACGAGAAGACCCTATGGA 2736
                      ---   i i-     v  i i----    ii ivi          i v
  LSU-rRNA_Hsa#  3850 ---AAGGGAACGGGCTTGGCGGAATCAGCGGGGAAAGAAGACCCTGTTGA 3896

  chrM           2737 GCTTTAATTTA 2747
                          v v i
  LSU-rRNA_Hsa#  3897 GCTTGACTCTA 3907

Matrix = 20p43g.matrix
Kimura (with divCpGMod) = 36.97
Transitions / transversions = 1.09 (25/23)
Gap_init rate = 0.05 (8 / 155), avg. gap size = 1.38 (11 / 8)

650 5.13 0.00 0.00 chrM 3231 3308 (13263) tRNA-Leu-TTA(m)#tRNA 1 78 (0) c_b1s401i0

  chrM           3231 GTTAAGATGGCAGAGCCCGGTAATCGCATAAAACTTAAAACTTTACAGTC 3280
                                       i                      i
  tRNA-Leu-TTA(     1 GTTAAGATGGCAGAGCCTGGTAATCGCATAAAACTTAAAATTTTACAGTC 50

  chrM           3281 AGAGGTTCAATTCCTCTTCTTAACAACA 3308
                                i              v
  tRNA-Leu-TTA(    51 AGAGGTTCAACTCCTCTTCTTAACACCA 78

Matrix = 18p43g.matrix
Kimura (with divCpGMod) = 5.35
Transitions / transversions = 3.00 (3/1)
Gap_init rate = 0.00 (0 / 77), avg. gap size = 0.0 (0 / 0)

429 16.67 0.00 0.00 chrM 4330 4401 (12170) C tRNA-Gln-CAA_#tRNA (3) 72 1 c_b1s401i1

  chrM           4330 CTAGGACTATGAGAATCGAACCCATCCCTGAGAATCCAAAATTCTCCGTG 4379
                          i     ii    i     i i          i
C tRNA-Gln-CAA_    72 CTAGAACTATAGGAATTGAACCTACCCCTGAGAATTCAAAATTCTCCGTG 23

  chrM           4380 CCACCTATCACACCCCATCCTA 4401
                       i      i     vii
C tRNA-Gln-CAA_    22 CTACCTATTACACCATGTCCTA 1

Matrix = 18p43g.matrix
Kimura (with divCpGMod) = 19.95
Transitions / transversions = 11.00 (11/1)
Gap_init rate = 0.00 (0 / 71), avg. gap size = 0.0 (0 / 0)

432 13.43 1.47 0.00 chrM 7449 7515 (9056) C tRNA-Ser-TCA(m)#tRNA (5) 68 1 m_b1s357i0

  chrM           7449 AAAAAGGAAGGAATCGAACCCCCCAAAG-CTGGTTTCAAGCCAACCCCAT 7497
                              i                v  -               i
C tRNA-Ser-TCA(    68 AAAAAGGAGGGAATCGAACCCCCCACAGACTGGTTTCAAGCCAATCCCAT 19

  chrM           7498 GGCCTCCATGACTTTTTC 7515
                      ii    ii    i  i
C tRNA-Ser-TCA(    18 AACCTCTGTGACCTTCTC 1

Matrix = 18p43g.matrix
Kimura (with divCpGMod) = 15.39
Transitions / transversions = 8.00 (8/1)
Gap_init rate = 0.02 (1 / 66), avg. gap size = 1.00 (1 / 1)

13 15.38 6.67 2.13 chrM 16207 16251 (320) (CAACTAT)n#Simple_repeat 1 47 (0) m_b1s252i0

  chrM          16207 CAAGTA-CAGCAATCAACCTTCAACTATCACAC-ATCAACT-GCAACT 16251
                         v  -  i v      iv          -  -       -v
  (CAACTAT)n#Si     1 CAACTATCAACTATCAACTATCAACTATCA-ACTATCAACTATCAACT 47

Matrix = Unknown
Transitions / transversions = 0.50 (2/4)
Gap_init rate = 0.09 (4 / 44), avg. gap size = 1.00 (4 / 4)

14 10.30 0.00 8.82 chrM 16219 16255 (316) (TCAACT)n#Simple_repeat 1 34 (0) m_b1s252i1

  chrM          16219 TCAACCTTCAACTATCACACATCAACTGCAACTCCAA 16255
                          -        -   -  v      v     i
  (TCAACT)n#Sim     1 TCAA-CTTCAACT-TCA-ACTTCAACTTCAACTTCAA 34

Matrix = Unknown
Transitions / transversions = 0.50 (1/2)
Gap_init rate = 0.08 (3 / 36), avg. gap size = 1.00 (3 / 3)

## Total Sequences: 1
## Total Length: 16571
## Total NonMask ( excluding >20bp runs of N/X bases ): 16571
## Total NonSub ( excluding all non ACGT bases ):16571
RepeatMasker version open-4.0.5 , sensitive mode
run with cross_match version 0.990329
RepBase Update 20140131, RM database version 20140131
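To see how the .out table relates to the .tbl summary, the annotated intervals can be added up per repeat class. This is a rough sketch, not how RepeatMasker itself builds the summary: rmout_class_bp is a hypothetical helper, it assumes the column order shown in the .out listing above (query begin and end in columns 6 and 7, class/family in column 11), and it would count overlapping annotations twice.

```shell
# Hypothetical helper: per repeat class, print
# "<class/family><TAB><number of rows><TAB><bp covered>" from a .out file.
rmout_class_bp() {
    awk '$1 ~ /^[0-9]+$/ {       # data rows start with the numeric SW score
             count[$11]++
             bp[$11] += $7 - $6 + 1   # query end - query begin + 1
         }
         END {
             for (c in bp) printf "%s\t%d\t%d\n", c, count[c], bp[c]
         }' "$1" | LC_ALL=C sort
}

# Run it on the test output if it is present in the current directory.
if [ -f chrM.fa.out ]; then
    rmout_class_bp chrM.fa.out
fi
```

For the five chrM rows listed above, the intervals add up to 156 bp of rRNA, 217 bp of tRNA and 49 bp of simple repeat, which agrees with the 373 bp of small RNA and 49 bp of simple repeats reported in chrM.fa.tbl.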
Scripts
Scripts for merging TEs are available from http://doua.prabi.fr/software/one-code-to-find-them-all; refer to the publication for more details: http://www.mobilednajournal.com/content/5/1/13.
Aurelie Kapusta has written scripts for parsing RepeatMasker .out files, which can be obtained by running:
git clone git@github.com:4ureliek/Parsing-RepeatMasker-Outputs.git
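For a quick look at a .out file without any extra scripts, the table can be turned into BED with awk. This is only a minimal sketch and not a replacement for the scripts above: rmout_to_bed is a hypothetical helper, it assumes the standard .out column order, and it writes the raw Smith-Waterman score into the BED score column even though that value can exceed the 0 to 1000 range some BED consumers expect.

```shell
# Hypothetical helper: convert a RepeatMasker .out file to 6-column BED.
# Columns used: 5 = query, 6 = begin, 7 = end, 9 = strand (+ or C),
# 10 = repeat name, 11 = class/family, 1 = SW score.
rmout_to_bed() {
    awk 'BEGIN { OFS = "\t" }
         $1 ~ /^[0-9]+$/ {       # skip the header and blank lines
             strand = ($9 == "C") ? "-" : "+"
             # BED starts are 0-based, so subtract 1 from the begin position
             print $5, $6 - 1, $7, $10 "#" $11, $1, strand
         }' "$1"
}

# Convert the test output if it is present in the current directory.
if [ -f chrM.fa.out ]; then
    rmout_to_bed chrM.fa.out > chrM.fa.bed
fi
```

Matches on the complementary strand, marked with C in the .out file, are written as "-" so that the result follows the usual BED strand convention.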