How to deal with multi mapping reads

Eukaryotic genomes are repetitive in nature i.e. the sequence content is not unique. When mapping high throughput sequencing reads back to the genome, whether for de novo assembly or for RNA sequencing, a subset of reads will map to more than 1 location. Some people refer to these reads as multi-reads for multi mapping reads. One way of dealing with multi mapping reads is to discard them, however the results may be biased and even worst, there may be something interesting that is being discarded.

One strategy for mapping multi mapping reads from an RNA-seq experiment, was to examine the mapping locations of all reads (single and multi mapping reads) and using the information from single mapping reads to probabilistically assign where a multi mapping read maps. To illustrate using a simple example, imagine that 100 reads were uniquely mapped to chr1:100-126 and another 100 reads multi mapped to chr1:102-128 AND chrX:102-128. The idea is that the 100 multi mapping reads "probably" arose from the genomic entity at chr1:100.

Below is a demonstration of the software, which is written in Python.

#Usage
python MuMRescueLite.py <input file> <output file> <window>

The format of the input file is ID, locations, chromosome, strand, start, end, count. Below is the format of the input file of the example I discussed above:

#ID	locations	chromosome	strand	start	end	count
read_x	1	chr1	+	100	126	100
read_y	2	chr1	+	102	128	100
read_y	2	chrX	+	102	128	100

Now running the script:

python MuMRescueLite.py test.tsv out.tsv 4
cat out.tsv
tagID	mapPositions	chromosome	strand	start	stop	count	weight
read_x	1	chr1	+	100	126	100	1.0
read_y	2	chr1	+	102	128	100	1.0
read_y	2	chrX	+	102	128	100	0.0

For the window size, I used 4; here's a detailed description of the window parameter from the documentation:

window: the number of bases around each MuM location to seek single mapped tags at each multi mapping location; MuMRescueLite will search a length of window/2 upstream and downstream of a given MuM location.

Notice in our results, the weight column, which from the documentation:

weight as probability for this sequence of this mapped position; 1.0 for the single mapped sequences, from 0.0 to 1.0 for the multi mapped tags

So read_y (the multi mapping read) has been given a weight of 1 to the location chr1:102-128.

Here's one more example:

#ID	locations	chromosome	strand	start	end	count
read_x	1	chr1	+	100	126	100
read_y	2	chr1	+	102	128	100
read_y	2	chrX	+	102	128	100
read_z	1	chrX	+	100	126	50
python MuMRescueLite.py test.tsv out.tsv 4
tagID	mapPositions	chromosome	strand	start	stop	count	weight
read_x	1	chr1	+	100	126	100	1.0
read_y	2	chr1	+	102	128	100	0.666666666667
read_y	2	chrX	+	102	128	100	0.333333333333
read_z	1	chrX	+	100	126	50	1.0

read_y is split based on the number of single mapping tags from read_x and read_z. There are more reads at chr1:100-126, hence more weight is given to this loci; the weight is in the proportion of 100:50 or 2:1.

Conclusion

The fundamental goal of MuMRescueLite is to calculate the probability that from a set of possible loci, a given locus is the true source of a MuM. This is achieved by counting the SiMs that occur in a nominal window around each locus occupied by a MuM and dividing this count by the total number of SiMs proximal to all loci associated with that MuM. In this way, MuMs are assigned proportionately to each of their loci and are therefore ‘rescued’. MuMs that do not coincide with at least one SiM are not rescued.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
3 comments Add yours
  1. MuMRescueLite is a really interesting approach that i am trying to apply to my data. And your posts are always really helpful and understandable.
    May i ask how can u manage to transform .bed format after mapping (e.g with bowtie/bwa) to the format required as input from MuMRescueLite.py?
    Meaning from this:
    chr6 135135525 135135550 HWI-ST897:159:C1ACKACXX:1:1101:1127:2068 255 +
    chr5 140027416 140027441 HWI-ST897:159:C1ACKACXX:1:1101:1280:2087 255 +
    To this:
    #ID locations chromosome strand start end count
    read_x 1 chr1 + 100 126 100
    I am trying through awk but don't seem to make it work! Have you used MumRescueLite with real data?

    1. You can use awk (or any other scripting language) but it's not that straight-forward. For a given genomic range, you need to know how many reads support that range and how many of those reads support other genomic ranges. This rescue approach was used in the FANTOM3 project.

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.