Mapping full-length mRNA sequences

I have used BLAT to align full-length mRNA sequences a long time ago. Since BLAT has been out for over 20 years, I was wondering what modern day alignment tool I could use as a replacement. Minimap2 came to mind and in this post I use it to map some known transcript sequences to the genome and compare the mapping results with reported coordinates.

I'll use Docker for the sake of reproducibility; specifically I'm using a general purpose image I have previously prepared. The Dockerfile provides all the details about the image. If you want to follow the post and do not want to use Docker, most up to date Linux distributions should work fine.

# pull image
docker pull davetang/build:1.2.1

# start a container
docker run --rm -it -v $(pwd):/work -u $(stat -c "%u:%g" ${HOME}) davetang/build:1.2.1 /bin/bash

If you see "I have no name!" on the prompt after starting a Docker container, just ignore it. There is no entry in /etc/passwd for your user ID and group. We want to run the container using the same user ID and group as the host operating system, so that we have permission to read and write any files generated inside the Docker container.

Now to get started, we need the hg38 genome for mapping our transcripts and we can download it from UCSC.

wget -c https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

We will use GENCODE transcript sequences.

wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.transcripts.fa.gz

We will download and extract minimap2 from GitHub.

wget https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2
tar -xjf minimap2-2.24_x64-linux.tar.bz2

We will use bedtools to convert the SAM file generated by minimap2 to a BED file.

wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary -O bedtools
chmod 755 bedtools

We will also need SAMtools to perform some filtering on the SAM file.

wget https://github.com/samtools/samtools/releases/download/1.15.1/samtools-1.15.1.tar.bz2
tar -xjf samtools-1.15.1.tar.bz2
mkdir tools
my_prefix=$(pwd)/tools
cd samtools-1.15.1
./configure --prefix=${my_prefix}
make && make install
cd ..

Now that we have all the required files, we will run minimap2 using a specific preset for long spliced alignments. The -a option specifies SAM output and the -t option specifies the number of threads to use.

minimap2-2.24_x64-linux/minimap2 -ax splice -t 8 hg38.fa.gz gencode.v41.transcripts.fa.gz > gencode.v41.transcripts.sam

We will convert the SAM output to BED12 format. We want this format because we want to easily compare the minimap2 alignments with the alignments reported on the UCSC Genome Browser.

Minimap2 returns secondary alignments, so we will use samtools to keep only the primary alignment/s.

tools/bin/samtools view -F 256 -b gencode.v41.transcripts.sam | ./bedtools bamtobed -splitD -bed12 -i - | gzip > gencode.v41.transcripts.minimap2.bed.gz

To download the GENCODE transcripts BED file on the UCSC Genome Browser, use the Table Browser. Select the following on the Table Browser:

  • clade: Mammal
  • genome: Human
  • assembly: Dec. 2013 (GRCh38/hg38)
  • group: Genes and Gene Predictions
  • track: GENCODE V41
  • table: knownGene
  • region: genome
  • output format: BED - browser extensible data
  • output filename: gencode.v41.transcripts.bed.gz
  • file type returned: gzip compressed

Then click get output and get BED. (You can download the BED file from my server if you do not feel like using the Table Browser.)

To compare the results, we need to match the BED entry for the same transcript and compare the following columns:

  • chrom
  • chromStart
  • chromEnd
  • blockCount
  • blockSizes
  • blockStarts

For example, if we examine the minimap2 alignment for the ENST00000413465.6 transcript, we can see that it maps to chr17 starting at position 7661778 and ending at 7676594. The alignment contains 7 "blocks", which correspond to the exons. The exon sizes are 236, 110, 113, 184, 279, 22, and 74 and the starting position of the exons are 0, 12402, 13080, 13274, 14215, 14603, and 14742. To get the genomic coordinates of the exons, add the blockStarts to the chromStart to get the start and add the blockStarts, blockSizes, and chromStart to get the end.

zcat gencode.v41.transcripts.minimap2.bed.gz | grep ENST00000413465.6 | cut -f1-3,10-12
chr17   7661778 7676594 7       236,110,113,184,279,22,74       0,12402,13080,13274,14215,14603,14742

Here's the reported alignment from the UCSC Genome Browser, which is identical to the alignment reported by minimap2 (apart from the trailing comma at the end of blockStarts and blockSizes).

zcat gencode.v41.transcripts.bed.gz | grep ENST00000413465.6 | cut -f1-3,10-12
chr17   7661778 7676594 7       236,110,113,184,279,22,74,      0,12402,13080,13274,14215,14603,14742,

I was going to write a Perl script to compare all the alignments but since most people do not use Perl anymore, I wrote a Python script instead. I'm a Python amateur so please let me know if I made any mistakes or how I can improve my code. The code below will compare all non-multimapping alignments between minimap2 and those reported on the UCSC Genome Browser.

#!/usr/bin/env python3

import pandas as pd

# BED12 format
bed_col_names = [
   'chrom',
   'chrom_start',
   'chrom_end',
   'name',
   'score',
   'strand',
   'thick_start',
   'thick_end',
   'item_rgb',
   'block_count',
   'block_sizes',
   'block_starts'
]

# tab-delimited files are loaded with read_csv but with sep = "\t"
minimap2 = pd.read_csv("gencode.v41.transcripts.minimap2.bed.gz", sep = "\t", names = bed_col_names)
ucsc = pd.read_csv("gencode.v41.transcripts.bed.gz", sep = "\t", names = bed_col_names)

# remove some columns that will not be used
# whether to drop labels from the index (axis = 0) or columns (axis = 1)
minimap2 = minimap2.drop(['score', 'thick_start', 'thick_end', 'item_rgb'], axis = 1)
ucsc = ucsc.drop(['score', 'thick_start', 'thick_end', 'item_rgb'], axis = 1)

# keep only the transcript ID as the transcript identifier in the minimap2 results
minimap2["name"] = minimap2["name"].str.split("|").str[0]

# set index column, like setting row names
minimap2 = minimap2.set_index(minimap2['name'])
ucsc = ucsc.set_index(ucsc['name'])

# remove the trailing commas in block sizes and starts
ucsc["block_sizes"] = ucsc["block_sizes"].str.replace(",$", "", regex = True)
ucsc["block_starts"] = ucsc["block_starts"].str.replace(",$", "", regex = True)

# transcripts that map to more than one loci
mm_multimappers = minimap2["name"].duplicated()
print("Number of transcripts that minimap2 mapped to one or more loci:")
print(mm_multimappers.value_counts(), "\n")
ucsc_multimappers = ucsc["name"].duplicated()
print("Number of transcripts that UCSC mapped to one or more loci:")
print(ucsc_multimappers.value_counts(), "\n")

# remove multimappers
# use keep = False to store duplicated transcripts including the first occurrence
mm_multimappers = minimap2["name"].duplicated(keep = False)
ucsc_multimappers = ucsc["name"].duplicated(keep = False)
# the tilde switches the booleans, like !
minimap2 = minimap2[~mm_multimappers]
ucsc = ucsc[~ucsc_multimappers]

# test case using TP53
tp53 = 'ENST00000413465.6'
tp53_minimap2 = minimap2[minimap2['name'] == tp53]
tp53_ucsc = ucsc[ucsc['name'] == tp53]
print("Comparing", tp53, "between minimap2 and UCSC")
print(tp53_minimap2 == tp53_ucsc, "\n")

# reorder rows of UCSC to match minimap2
transcripts = minimap2['name']
ucsc = ucsc.reindex(transcripts)
print("Number of transcripts that were mapped identically between minimap2 and UCSC:")
mm_vs_ucsc = (minimap2 == ucsc)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html
all_cols_true = mm_vs_ucsc.all(axis = 'columns')
print(all_cols_true.value_counts(), "\n")

# sanity check by manually checking some results
identical_cases = all_cols_true[all_cols_true].head()
nonidentical_cases = all_cols_true[all_cols_true == False].head()

print("Some transcripts that were mapped identically between minimap2 and UCSC:")
print(minimap2[minimap2['name'].isin(identical_cases.index)])
print(ucsc[ucsc['name'].isin(identical_cases.index)], "\n")

print("Some transcripts that were mapped differently between minimap2 and UCSC:")
print(minimap2[minimap2['name'].isin(nonidentical_cases.index)])
print(ucsc[ucsc['name'].isin(nonidentical_cases.index)],"\n")

quit()

The initial output will indicate the number of multimappers.

Number of transcripts that minimap2 mapped to one or more loci:
False    249612
True       2429
Name: name, dtype: int64

Number of transcripts that UCSC mapped to one or more loci:
False    272178
True        174
Name: name, dtype: int64

The next set of output will show results comparing ENST00000413465.6, which we previously saw was identical.

Comparing ENST00000413465.6 between minimap2 and UCSC
                   chrom  chrom_start  chrom_end  name  strand  block_count  block_sizes  block_starts
name
ENST00000413465.6   True         True       True  True    True         True         True          True

The tally of all comparisons is shown next and indicates that around 91% of alignments were exactly the same.

Number of transcripts that were mapped identically between minimap2 and UCSC:
True     225261
False     22179
dtype: int64

Finally, the last set of output shows some examples of transcripts mapped identically and mapped differently.

Some transcripts that were mapped identically between minimap2 and UCSC:
                  chrom  chrom_start  chrom_end               name strand  block_count   block_sizes block_starts
name
ENST00000456328.2  chr1        11868      14409  ENST00000456328.2      +            3  359,109,1189   0,744,1352
ENST00000473358.1  chr1        29553      31097  ENST00000473358.1      +            3   486,104,122  0,1010,1422
ENST00000461467.1  chr1        35244      36073  ENST00000461467.1      -            2       237,353        0,476
ENST00000606857.1  chr1        52472      53312  ENST00000606857.1      +            1           840            0
ENST00000642116.1  chr1        57597      64116  ENST00000642116.1      +            3   56,157,1201  0,1102,5318
                  chrom  chrom_start  chrom_end               name strand  block_count   block_sizes block_starts
name
ENST00000456328.2  chr1      11868.0    14409.0  ENST00000456328.2      +          3.0  359,109,1189   0,744,1352
ENST00000473358.1  chr1      29553.0    31097.0  ENST00000473358.1      +          3.0   486,104,122  0,1010,1422
ENST00000461467.1  chr1      35244.0    36073.0  ENST00000461467.1      -          2.0       237,353        0,476
ENST00000606857.1  chr1      52472.0    53312.0  ENST00000606857.1      +          1.0           840            0
ENST00000642116.1  chr1      57597.0    64116.0  ENST00000642116.1      +          3.0   56,157,1201  0,1102,5318

Some transcripts that were mapped differently between minimap2 and UCSC:
                   chrom  chrom_start  chrom_end               name strand  block_count                               block_sizes                                       block_starts
name
ENST00000450305.2   chr1        12009      13670  ENST00000450305.2      +            8                   48,5,43,85,80,2,150,218                    0,50,175,603,965,1046,1215,1443
ENST00000488147.1   chr1        14403      29570  ENST00000488147.1      -           11  100,31,152,159,198,136,137,147,99,154,37  0,604,1392,2203,2454,2829,3202,3511,3864,10334...
ENST00000469289.1  chr19        71873      72718  ENST00000469289.1      +            2                                   401,134                                              0,711
ENST00000607096.1   chr9        30143      30281  ENST00000607096.1      +            1                                       138                                                  0
ENST00000417324.1  chr19        76162      77690  ENST00000417324.1      -            3                               621,205,361                                         0,723,1167
                  chrom  chrom_start  chrom_end               name strand  block_count                              block_sizes                                       block_starts
name
ENST00000450305.2  chr1      12009.0    13670.0  ENST00000450305.2      +          6.0                      48,49,85,78,154,218                            0,169,603,965,1211,1443
ENST00000488147.1  chr1      14403.0    29570.0  ENST00000488147.1      -         11.0  98,34,152,159,198,136,137,147,99,154,37  0,601,1392,2203,2454,2829,3202,3511,3864,10334...
ENST00000469289.1  chr1      30266.0    31109.0  ENST00000469289.1      +          2.0                                  401,134                                              0,709
ENST00000607096.1  chr1      30365.0    30503.0  ENST00000607096.1      +          1.0                                      138                                                  0
ENST00000417324.1  chr1      34553.0    36081.0  ENST00000417324.1      -          3.0                              621,205,361                                         0,723,1167

Summary

Minimap2 (using the spliced long read preset) works fairly well with mapping full-length mRNA sequences. 91% of comparable alignments was identical between minimap2 and alignments reported on the UCSC Genome Browser.

Further work should break down how each alignment is different. Minimap2 could also be run with different parameters, since the following alignment by minimap2 does seem dubious with two blocks that are 5 and 2 nucleotides long.

ENST00000450305.2   chr1        12009      13670  ENST00000450305.2      +            8                   48,5,43,85,80,2,150,218                    0,50,175,603,965,1046,1215,1443

An additional aligner could also be used since it is not a given that the alignment reported on the UCSC Genome Browser is the definitive alignment.

Finally, I will be spending more time learning about minimap2 since I tihnk it can be used as my new multi-purpose alignment tool.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.