What fraction of 10x reads map to ribosomal proteins?

As you may have learned in school, ribosomes are cell organelles that can be thought of as protein factories because they aid in the synthesis of proteins. The eukaryotic ribosome is made up of ribosomal proteins and ribosomal RNA (rRNA). This post is about ribosomal protein gene expression and not non-coding rRNAs. I was interested in what percentage of the total expression comes from ribosomal protein genes in 10x scRNA-seq experiments.

I found an article from 10x Genomics with some statistics:

Cell type Percentage of reads mapping to ribosomal proteins
Barnyard Cells (1:1 HEK293T:3T3) 25-35
Neuronal Cells 15-20
PBMCs 35-40
Isolated Pan T Cells 40-45

As noted in the article:

Single Cell 3' libraries do contain reads mapping to ribosomal protein transcripts (Rps, Rpl). The fraction of reads varies based on cell type and overall cell health.

Although both ribosomal proteins and ribosomal RNA (rRNA) make up the ribosome complex, ribosomal protein transcripts are not equivalent to ribosomal RNA (rRNA). Ribosomal protein transcript detection will not necessarily correlate with either rRNA or mitochondrial transcripts.

It's nice that we have some statistics from the article but as usual, I trust them but I'd like to verify their claims.

First things first, what are the gene symbols (the 10x gene by barcode matrix uses gene symbols) of ribosomal protein genes? I found a database that listed some ribosomal protein genes for humans that I listed below as vectors in R. The total number of ribosomal protein genes (small plus the large subunit) from this database is 81.

small_subunit <- c("RPSA", "RPS2", "RPS3", "RPS3A", "RPS4X", "RPS4Y", "RPS5", "RPS6", "RPS7", "RPS8", "RPS9", "RPS10", "RPS11", "RPS12", "RPS13", "RPS14", "RPS15", "RPS15A", "RPS16", "RPS17", "RPS18", "RPS19", "RPS20", "RPS21", "RPS23", "RPS24", "RPS25", "RPS26", "RPS27", "RPS27A", "RPS28", "RPS29", "RPS30")
large_subunit <- c("RPL3", "RPL4", "RPL5", "RPL6", "RPL7", "RPL7A", "RPL8", "RPL9", "RPL10", "RPL10A", "RPL11", "RPL12", "RPL13", "RPL13A", "RPL14", "RPL15", "RPL17", "RPL18", "RPL18A", "RPL19", "RPL21", "RPL22", "RPL23", "RPL23A", "RPL24", "RPL26", "RPL27", "RPL27A", "RPL28", "RPL29", "RPL30", "RPL31", "RPL32", "RPL34", "RPL35", "RPL35A", "RPL36", "RPL36A", "RPL37", "RPL37A", "RPL38", "RPL39", "RPL40", "RPL41", "RPLP0", "RPLP1", "RPLP2", "RPLP3")
ribosomal_protein_genes <- c(small_subunit, large_subunit)
length(ribosomal_protein_genes)
[1] 81

A typical regular expression (in Python) used to capture ribosomal protein gene symbols is:

adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

The regex finds genes that begin with RPS or RPL. Let's check whether this captures all the genes from the ribosome database.

length(ribosomal_protein_genes) == length(grep(pattern = "^RP[SL]", x = ribosomal_protein_genes))
[1] TRUE

Another claim I'd like to verify is that "Single Cell 3' libraries do contain reads mapping to ribosomal protein transcripts (Rps, Rpl)". For this we will need the Cell Ranger reference data. Again as noted by 10x Genomics:

The 10x 2020-A references do not annotate the rRNA transcripts, and therefore rRNA transcripts will not show up as counts.

After downloading the reference data, we can check the GTF file used by Cell Ranger, to see if it does contain ribosomal protein genes. The one liner below does the following:

  1. Reads genes.gtf
  2. Filters for gene entries
  3. Prints out the gene symbol for ribosomal protein genes
  4. Counts the number of lines, i.e. the number of genes
cat genes.gtf | awk -F "\t" '$3 == "gene" {print $9}' | perl -lne 'if(/gene_name\s"(RP[SL].*?)";/){ print $1 }' | wc -l
103

There's more "ribosomal protein genes" in the GTF, so let's see what they are by creating another vector in R with the 103 gene symbols and getting the difference between this vector and the vector of ribosomal protein genes from the database.

gtf_ribo <- c("RPL22", "RPL11", "RPS6KA1", "RPS8", "RPL5", "RPS27", "RPS6KC1", "RPS7", "RPS27A", "RPL31", "RPL37A", "RPL32", "RPL15", "RPSA", "RPL14", "RPL29", "RPL24", "RPL22L1", "RPL39L", "RPL35A", "RPL9", "RPL34-AS1", "RPL34", "RPS3A", "RPL37", "RPS23", "RPS14", "RPL26L1", "RPS18", "RPS10-NUDT3", "RPS10", "RPL10A", "RPL7L1", "RPS12", "RPS6KA2", "RPS6KA2-IT1", "RPS6KA2-AS1", "RPS20", "RPL7", "RPL30", "RPL8", "RPS6", "RPL35", "RPL12", "RPL7A", "RPS24", "RPLP2", "RPL27A", "RPS13", "RPS6KA4", "RPS6KB2", "RPS6KB2-AS1", "RPS3", "RPS25", "RPS26", "RPL41", "RPL6", "RPLP0", "RPL21", "RPL10L", "RPS29", "RPL36AL", "RPS6KL1", "RPS6KA5", "RPS27L", "RPL4", "RPLP1", "RPS17", "RPL3L", "RPS2", "RPS15A", "RPL13", "RPL26", "RPL23A", "RPL23", "RPL19", "RPL27", "RPS6KB1", "RPL38", "RPL17", "RPS15", "RPL36", "RPS28", "RPL18A", "RPS16", "RPS19", "RPL18", "RPL13A", "RPS11", "RPS9", "RPL28", "RPS5", "RPS21", "RPL3", "RPS19BP1", "RPS6KA3", "RPS4X", "RPS6KA6", "RPL36A", "RPL39", "RPL10", "RPS4Y1", "RPS4Y2")

dplyr::setdiff(gtf_ribo, ribosomal_protein_genes)
 [1] "RPS6KA1"     "RPS6KC1"     "RPL22L1"     "RPL39L"      "RPL34-AS1"   "RPL26L1"     "RPS10-NUDT3" "RPL7L1"     
 [9] "RPS6KA2"     "RPS6KA2-IT1" "RPS6KA2-AS1" "RPS6KA4"     "RPS6KB2"     "RPS6KB2-AS1" "RPL10L"      "RPL36AL"    
[17] "RPS6KL1"     "RPS6KA5"     "RPS27L"      "RPL3L"       "RPS6KB1"     "RPS19BP1"    "RPS6KA3"     "RPS6KA6"    
[25] "RPS4Y1"      "RPS4Y2"

If we look up RPS6KA1, we see that it encodes for ribosomal protein S6 kinase A1 in Homo sapiens, However, it doesn't seem like RPS6KA2-AS1 is a ribosomal protein gene as it's a non-coding RNA. Therefore it may be better to be explicit in listing ribosomal protein genes than relying on a regular expression.

Now let's download some 10x Genomics scRNA-seq data and calculate some ribosomal protein gene percentages! I wrote a script to download some datasets that tries to correspond to the table provided by 10x that I showed at the start of the post.

I also prepared a Docker image that was used for the analysis below; you can run it using the script I prepared. It will start a container running RStudio Server and you can access RStudio using a web browser and by entering the port.

library("Seurat")

small_subunit <- c("RPSA", "RPS2", "RPS3", "RPS3A", "RPS4X", "RPS4Y", "RPS5", "RPS6", "RPS7", "RPS8", "RPS9", "RPS10", "RPS11", "RPS12", "RPS13", "RPS14", "RPS15", "RPS15A", "RPS16", "RPS17", "RPS18", "RPS19", "RPS20", "RPS21", "RPS23", "RPS24", "RPS25", "RPS26", "RPS27", "RPS27A", "RPS28", "RPS29", "RPS30")
large_subunit <- c("RPL3", "RPL4", "RPL5", "RPL6", "RPL7", "RPL7A", "RPL8", "RPL9", "RPL10", "RPL10A", "RPL11", "RPL12", "RPL13", "RPL13A", "RPL14", "RPL15", "RPL17", "RPL18", "RPL18A", "RPL19", "RPL21", "RPL22", "RPL23", "RPL23A", "RPL24", "RPL26", "RPL27", "RPL27A", "RPL28", "RPL29", "RPL30", "RPL31", "RPL32", "RPL34", "RPL35", "RPL35A", "RPL36", "RPL36A", "RPL37", "RPL37A", "RPL38", "RPL39", "RPL40", "RPL41", "RPLP0", "RPLP1", "RPLP2", "RPLP3")
ribosomal_protein_genes <- c(small_subunit, large_subunit)

create_seurat_object <- function(path, proj, min_cells = 3, min_features = 200){
  CreateSeuratObject(
    counts = Read10X(data.dir = path),
    min.cells = min_cells,
    min.features = min_features,
    project = proj
  )
}

pbmc  <- create_seurat_object(path = "../data/10x/pbmc4k/filtered_gene_bc_matrices/GRCh38/", proj = 'pbmc4k')
pan_t <- create_seurat_object(path = "../data/10x/t_3k/filtered_gene_bc_matrices/GRCh38/", proj = 'pan_t_3k')
gb <- create_seurat_object(path = "../data/10x/glioblastoma/filtered_feature_bc_matrix/", proj = 'glioblastoma')

add_ribo_prot_pc <- function(obj){
  gs <- rownames(obj@assays$RNA$counts)
  feats <- gs[gs %in% ribosomal_protein_genes]
  stopifnot(length(feats)>0)
  obj[["percent.ribo"]] <- PercentageFeatureSet(obj, features = feats)
  obj
}

pbmc <- add_ribo_prot_pc(pbmc)
pan_t <- add_ribo_prot_pc(pan_t)
gb <- add_ribo_prot_pc(gb)

library(ggplot2)
rbind(
  pbmc@meta.data,
  pan_t@meta.data,
  gb@meta.data
) |>
  ggplot(aes(orig.ident, percent.ribo)) +
  geom_boxplot() +
  theme_minimal() +
  labs(x = '', y = 'Percentage ribosomal protein genes', title = 'What percentage of 10x reads map to ribosomal protein genes?')

As a table.

library(dplyr)
rbind(
  pbmc@meta.data,
  pan_t@meta.data,
  gb@meta.data
) |>
  group_by(orig.ident) |>
  summarise(
    mean = mean(percent.ribo),
    median = median(percent.ribo),
    pt25 = quantile(percent.ribo, probs = 0.25),
    pt75 = quantile(percent.ribo, probs = 0.75)
  )
orig.ident mean median pt25 pt75
pbmc4k 36.599582 37.877574 28.359555 44.879348
pan_t_3k 43.238583 45.388574 39.962031 48.676976
glioblastoma 8.130479 7.142857 5.342492 9.528511

The percentages are similar to the 10x statistics for the PBMCs and Isolated Pan T Cells because I used the same dataset that was supposedly used in the article. It wasn't clear, which dataset was used for the Neuronal Cells, so I used a glioblastoma dataset and thus the stats are slightly different.

As you can see, the fraction of 10x reads that map to ribosomal protein genes can differ by quite a bit! It's good to have some expectation of the proportion by looking at other datasets!


This post was sponsored by Logos Bio:

Logos Biosystems develops innovative life science imaging solutions designed to help researchers "see beyond the cell". Their scientist-led team creates practical and affordable tools used in a wide range of research areas, from drug discovery to agriculture. Logos Biosystems offers automated cell counting, digital cell imaging, and tissue clearing 3D imaging technologies designed to bolster your studies.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.