Eukaryotic genomes are repetitive in nature i.e. the sequence content is not unique. When mapping high throughput sequencing reads back to the genome, whether for de novo assembly or for RNA sequencing, a subset of reads will map to more than 1 location. Some people refer to these reads as multi-reads for multi mapping reads. One way of dealing with multi mapping reads is to discard them, however the results may be biased and even worst, there may be something interesting that is being discarded.
One strategy for mapping multi mapping reads from an RNA-seq experiment, was to examine the mapping locations of all reads (single and multi mapping reads) and using the information from single mapping reads to probabilistically assign where a multi mapping read maps. To illustrate using a simple example, imagine that 100 reads were uniquely mapped to chr1:100-126 and another 100 reads multi mapped to chr1:102-128 AND chrX:102-128. The idea is that the 100 multi mapping reads "probably" arose from the genomic entity at chr1:100.
Below is a demonstration of the software, which is written in Python.
#Usage python MuMRescueLite.py <input file> <output file> <window>
The format of the input file is ID, locations, chromosome, strand, start, end, count. Below is the format of the input file of the example I discussed above:
#ID locations chromosome strand start end count read_x 1 chr1 + 100 126 100 read_y 2 chr1 + 102 128 100 read_y 2 chrX + 102 128 100
Now running the script:
python MuMRescueLite.py test.tsv out.tsv 4 cat out.tsv tagID mapPositions chromosome strand start stop count weight read_x 1 chr1 + 100 126 100 1.0 read_y 2 chr1 + 102 128 100 1.0 read_y 2 chrX + 102 128 100 0.0
For the window size, I used 4; here's a detailed description of the window parameter from the documentation:
window: the number of bases around each MuM location to seek single mapped tags at each multi mapping location; MuMRescueLite will search a length of window/2 upstream and downstream of a given MuM location.
Notice in our results, the weight column, which from the documentation:
weight as probability for this sequence of this mapped position; 1.0 for the single mapped sequences, from 0.0 to 1.0 for the multi mapped tags
So read_y (the multi mapping read) has been given a weight of 1 to the location chr1:102-128.
Here's one more example:
#ID locations chromosome strand start end count read_x 1 chr1 + 100 126 100 read_y 2 chr1 + 102 128 100 read_y 2 chrX + 102 128 100 read_z 1 chrX + 100 126 50
python MuMRescueLite.py test.tsv out.tsv 4 tagID mapPositions chromosome strand start stop count weight read_x 1 chr1 + 100 126 100 1.0 read_y 2 chr1 + 102 128 100 0.666666666667 read_y 2 chrX + 102 128 100 0.333333333333 read_z 1 chrX + 100 126 50 1.0
read_y is split based on the number of single mapping tags from read_x and read_z. There are more reads at chr1:100-126, hence more weight is given to this loci; the weight is in the proportion of 100:50 or 2:1.
The fundamental goal of MuMRescueLite is to calculate the probability that from a set of possible loci, a given locus is the true source of a MuM. This is achieved by counting the SiMs that occur in a nominal window around each locus occupied by a MuM and dividing this count by the total number of SiMs proximal to all loci associated with that MuM. In this way, MuMs are assigned proportionately to each of their loci and are therefore ‘rescued’. MuMs that do not coincide with at least one SiM are not rescued.