Rett syndrome and piRNAs

Two years ago we were invited to write an article in a Special Issue of the Disease Markers journal, which was dedicated to autism spectrum disorders (ASD) and biomarkers. From Wikipedia, ASD are

disorders characterised by social deficits and communication difficulties, stereotyped or repetitive behaviours and interests, and in some cases, cognitive delays.

In the spirit of reproducibility and open science, I will write about the analysis we performed in our article on Rett syndrome and piRNAs. It’s a little late given that the paper was published almost 2 years ago, but better late than never.

Firstly a bit of background; MECP2 is an abbreviation for the gene methyl CpG binding protein 2 and mutations in this gene are the cause of most cases of Rett syndrome. The MECP2 protein contains a methyl-CpG binding domain (MBD) and is capable of binding specifically to methylated DNA. MECP2 can activate transcription as well as repress transcription. MECP2 also seems to have a role in silencing repetitive elements. Have a look at the introduction of our paper for more background on MECP2.

Four years ago, there was a paper published by the Gage lab, that showed an increase in long interspersed nuclear elements-1 (L1) transcription and retrotransposition in rodents without the MeCP2 gene, which supports the hypothesis by Bird. Furthermore, their work revealed that patients with Rett syndrome (caused by a MeCP2 mutation), have an increased susceptibility for L1 retrotransposition. L1 is a class I transposon known as a retrotransposon and are usually silenced since their retrotransposition can cause disease. Work by Kazazian and colleagues showed how a case of haemophilia A was caused by an insertion of L1.

Now comes piwi-interacting RNAs (piRNA), which are a class of non-coding RNA that are between 24 and 32 nucleotides long. They are thought to be involved in gene silencing, especially in silencing transposons, by forming the piRNA-induced silencing complex (piRISC). The sequence of many piRNAs are antisense to transposon sequences because they are actually derived from transposons. In fact when they were first discovered in common fruit flies, they were termed repeat-associated small interfering RNAs (rasiRNA). One theory of piRNA biogenesis states that piRNAs are processed from single-stranded RNA precursors coming from particular intergenic repetitive elements (such as transposons). Here is a nice summary of piRNA biogenesis and function.

Taken together, this formed our hypothesis in our paper:

Since transcripts from repetitive elements contribute to piRNA biogenesis, we propose that piRNA levels may be higher in the absence of MeCP2 and that increased piRNA levels may contribute to the mis-regulation of some genes seen in the Mecp2 knockout mouse brain.

Our analysis

I’ve made available all the scripts that were used in the analysis on GitHub. There are two very minor differences between what was published and the pipeline available on GitHub: 1) I used SHRiMP_2_2_3 instead of SHRiMP_2_2_2 and 2) I rounded the expression values to 2 decimal points. These differences do not change any of the conclusions we previously made in the paper.

We used the small RNA sequencing data from this paper:

Genome-wide analysis reveals methyl-CpG binding protein 2-dependent regulation of microRNAs in a mouse model of Rett syndrome

Two pooled total RNA samples (isolated from 4 pairs of postnatal 6 week old wild-type and Mecp2-null mice) were sequenced using the SOLiD v2 sequencing system. The sequence provided online has actually been pre-processed and only “good” and “best” reads are available (a total of 3,660,124 reads in WT and 2,789,136 reads in the KO). Initially, I naively thought about converting the colour-space reads into base-space, but colleagues I worked with previously advised otherwise.

For our set of piRNAs, we used the NONCODE version 3 database (version 4 is available now) and extracted 75,814 mouse piRNAs. We then mapped the colour-space reads directly to this set of mouse piRNAs using SHRiMP version 2.2.2 (SHRiMP version 2.2.3 in the GitHub repository) and the default settings (as this setting mapped the most reads). For reads that cross-mapped to more than one piRNA, we divided the read by the number of places it multi-mapped to (similar to this strategy) and assigned this divided number to all piRNAs the read mapped to. Reads mapping to more than 5 piRNA were discarded and finally expression numbers were normalised by counts per million using the total library sizes.

Analysis using R

If you’ve successfully ran the entire analysis available from the GitHub repository (by running “”), you should end up with a file called SRR089647_SRR089648_exp.tsv, which contains the expression values of all the piRNAs for the WT and KO samples. Here I will create a new plot, which I think is a bit more informative that the plot we showed in the paper. I will also create summaries in R, recapitulating what we reported.

#load final data outputted from the pipeline
data <- read.table("SRR089647_SRR089648_exp.tsv", header=F, stringsAsFactors=F)

#name the columns
names(data) <- c('piRNA', 'length', 'WT', 'KO')

#the library sizes, used for normalisation
wt_lib_size <- 3660124
ko_lib_size <- 2789136

#normalise by counts per million
data$WT_tpm <- data$WT*1000000/sum(wt_lib_size)
data$KO_tpm <- data$KO*1000000/sum(ko_lib_size)

#calculate the fold changes
data$FC <- data$KO_tpm/data$WT_tpm

#calculate the mean expression of a piRNA
data$mean_exp <- mapply(mean, data$WT_tpm, data$KO_tpm)

#     piRNA length        WT       KO     WT_tpm     KO_tpm        FC
#1 DQ546606     29 147865.67 229912.2 40399.0876 82431.3228 2.0404254
#2 DQ540966     30 146001.83 228493.2 39889.8589 81922.5631 2.0537190
#3 DQ540188     25  29252.92  24557.0  7992.3303  8804.5187 1.1016210
#4 DQ558990     30   6166.50  11048.0  1684.7790  3961.0833 2.3510997
#5 DQ541777     30   1995.50   2411.0   545.2001   864.4254 1.5855195
#6 DQ703900     32   4252.00   1621.5  1161.7093   581.3628 0.5004375

Now in the paper, we filtered out piRNAs that had a raw expression of less than 5 in the KO sample. This was done to filter out lowly expressed piRNAs. This step can be improved since we shouldn’t filter out piRNAs based on the expression from just one sample. In this case, it didn’t make much of a difference to the results (i.e., piRNAs lowly expressed in the KO were also lowly expressed in the WT) but I thought I point it out because now that I look back, it can be improved.

#how many piRNAs had an expression of
#less than 5 raw tags in the KO sample?
#  357  8368

#what was the mean expression of these piRNAs?
#as I said above, piRNAs lowly expressed in the KO
#were also lowly expressed in the WT
summary(apply( data[data$KO<5,3:4], 1, mean))
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.1000  0.2250  0.5000  0.5647  0.5000 25.5000

#create a subset of piRNAs that have
#expression > 5 in the KO
data_subset <- data[data$KO>5,]
#[1] 327   8

#   61   266

#as we reported, 81% of piRNAs
#have higher expression in the KO
#[1] 0.8134557

#in my opinion a more informative plot than what we published
     xlab="Log2 mean expression",
     ylab="Fold change")
abline(h=1, col='dodgerblue')

pirna_mean_exp_vs_fcThe majority of piRNAs are expressed higher in the Mecp2 KO as we previously reported.


The scientific conclusion of this work on Rett syndrome and piRNAs can be read in our open access paper, piRNAs Warrant Investigation in Rett Syndrome: An Omics Perspective; here I just wanted to briefly comment about open research. I’m extremely new to the concept of open research but by observing the work of colleagues and people I follow on Twitter, I want to take part in this movement. As a researcher, it’s extremely frustrating to read someone’s methods section and realise that I would have to directly email the person to get an idea of what they did because what they described is insufficient for reproducing their work. So from now on I will provide all my scripts/code for any papers I publish on my GitHub account. I have realised that just by having the mindset that I’m going to share my work, made me more diligent and less sloppy.

Perhaps people are afraid to share their work because it will go through another layer scrutiny beyond that of your supervisor and peer-reviewers, but now I realise that publication doesn’t mean the end of a project. Besides, careful scrutiny from others is the only way I can learn and improve, so I will embrace it.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
2 comments Add yours
  1. Thanks for the useful post. Did you include quantile normalization or something similar in this workflow? Is normalizing by read depth typically sufficient for differential expression analysis? I guess performing quantile normalization on the piRNAs alone would eliminate the skew toward one sample, so you’d have to include transcripts or other noncoding RNAs at that step?

    1. Hi Adam,

      thanks for the comments. The dataset that we used was unfortunately a pool of 4 pairs of WT and KO mouse; so instead of having 4 versus 4, it is a 1 versus 1 dataset. Usually, we normalise the samples by taking into account the variability of the biological replicates and the read depth. So for this work, we only normalised by read depth and I would say that it’s not ideal.

      I just looked into quantile normalisation (I always thought it was a method only for microarray data) and as you commented, it would eliminate the skew towards one sample. I’ll read a bit more about it and will update the post with some more analyses when I have some more time.

      Thanks again for the useful comments.



Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.