How to deal with multi mapping reads

Eukaryotic genomes are repetitive in nature i.e. the sequence content is not unique. When mapping high throughput sequencing reads back to the genome, whether for de novo assembly or for RNA sequencing, a subset of reads will map to more than 1 location. Some people refer to these reads as multi-reads for multi mapping reads. One way of dealing with multi mapping reads is to discard them, however the results may be biased and even worst, there may be something interesting that is being discarded.

One strategy for mapping multi mapping reads from an RNA-seq experiment, was to examine the mapping locations of all reads (single and multi mapping reads) and using the information from single mapping reads to probabilistically assign where a multi mapping read maps. To illustrate using a simple example, imagine that 100 reads were uniquely mapped to chr1:100-126 and another 100 reads multi mapped to chr1:102-128 AND chrX:102-128. The idea is that the 100 multi mapping reads "probably" arose from the genomic entity at chr1:100.

Below is a demonstration of the software, which is written in Python.

#Usage
python MuMRescueLite.py <input file> <output file> <window>

The format of the input file is ID, locations, chromosome, strand, start, end, count. Below is the format of the input file of the example I discussed above:

#ID	locations	chromosome	strand	start	end	count
read_x	1	chr1	+	100	126	100
read_y	2	chr1	+	102	128	100
read_y	2	chrX	+	102	128	100

Now running the script:

python MuMRescueLite.py test.tsv out.tsv 4
cat out.tsv
tagID	mapPositions	chromosome	strand	start	stop	count	weight
read_x	1	chr1	+	100	126	100	1.0
read_y	2	chr1	+	102	128	100	1.0
read_y	2	chrX	+	102	128	100	0.0

For the window size, I used 4; here's a detailed description of the window parameter from the documentation:

window: the number of bases around each MuM location to seek single mapped tags at each multi mapping location; MuMRescueLite will search a length of window/2 upstream and downstream of a given MuM location.

Notice in our results, the weight column, which from the documentation:

weight as probability for this sequence of this mapped position; 1.0 for the single mapped sequences, from 0.0 to 1.0 for the multi mapped tags

So read_y (the multi mapping read) has been given a weight of 1 to the location chr1:102-128.

Here's one more example:

#ID	locations	chromosome	strand	start	end	count
read_x	1	chr1	+	100	126	100
read_y	2	chr1	+	102	128	100
read_y	2	chrX	+	102	128	100
read_z	1	chrX	+	100	126	50
python MuMRescueLite.py test.tsv out.tsv 4
tagID	mapPositions	chromosome	strand	start	stop	count	weight
read_x	1	chr1	+	100	126	100	1.0
read_y	2	chr1	+	102	128	100	0.666666666667
read_y	2	chrX	+	102	128	100	0.333333333333
read_z	1	chrX	+	100	126	50	1.0

read_y is split based on the number of single mapping tags from read_x and read_z. There are more reads at chr1:100-126, hence more weight is given to this loci; the weight is in the proportion of 100:50 or 2:1.

Conclusion

The fundamental goal of MuMRescueLite is to calculate the probability that from a set of possible loci, a given locus is the true source of a MuM. This is achieved by counting the SiMs that occur in a nominal window around each locus occupied by a MuM and dividing this count by the total number of SiMs proximal to all loci associated with that MuM. In this way, MuMs are assigned proportionately to each of their loci and are therefore ‘rescued’. MuMs that do not coincide with at least one SiM are not rescued.

Equivalents in R, Python and Perl

Last update 2015 September 9th

I've been using Perl heavily for several years until I started my PhD back in 2010 (I still use it for many tasks but much more sparingly). Perl was widely used back in the early days when the human genome was yet to be sequenced and this famous article explained some of the reasons. However, the general trend is that Perl is on the decline, while Python on the rise. There are many reasons for this trend and at least for the biologists/bioinformaticians who program in Python that I have spoken to, they all claim that syntactically, Perl is ugly when compared to Python. Perl was what I learned first when entering bioinformatics, so I got used to looking at Perl code. However, given this trend, I will need to learn some Python to be able to read the growing number of scripts/applications written in Python. Surely it's not necessary, but I have a penchant for looking into how things are done.

Continue reading