disorders characterised by social deficits and communication difficulties, stereotyped or repetitive behaviours and interests, and in some cases, cognitive delays.
In the spirit of reproducibility and open science, I will write about the analysis we performed in our article on Rett syndrome and piRNAs. It’s a little late given that the paper was published almost 2 years ago, but better late than never.
Firstly a bit of background; MECP2 is an abbreviation for the gene methyl CpG binding protein 2 and mutations in this gene are the cause of most cases of Rett syndrome. The MECP2 protein contains a methyl-CpG binding domain (MBD) and is capable of binding specifically to methylated DNA. MECP2 can activate transcription as well as repress transcription. MECP2 also seems to have a role in silencing repetitive elements. Have a look at the introduction of our paper for more background on MECP2.
Four years ago, there was a paper published by the Gage lab, that showed an increase in long interspersed nuclear elements-1 (L1) transcription and retrotransposition in rodents without the MeCP2 gene, which supports the hypothesis by Bird. Furthermore, their work revealed that patients with Rett syndrome (caused by a MeCP2 mutation), have an increased susceptibility for L1 retrotransposition. L1 is a class I transposon known as a retrotransposon and are usually silenced since their retrotransposition can cause disease. Work by Kazazian and colleagues showed how a case of haemophilia A was caused by an insertion of L1.
Now comes piwi-interacting RNAs (piRNA), which are a class of non-coding RNA that are between 24 and 32 nucleotides long. They are thought to be involved in gene silencing, especially in silencing transposons, by forming the piRNA-induced silencing complex (piRISC). The sequence of many piRNAs are antisense to transposon sequences because they are actually derived from transposons. In fact when they were first discovered in common fruit flies, they were termed repeat-associated small interfering RNAs (rasiRNA). One theory of piRNA biogenesis states that piRNAs are processed from single-stranded RNA precursors coming from particular intergenic repetitive elements (such as transposons). Here is a nice summary of piRNA biogenesis and function.
Taken together, this formed our hypothesis in our paper:
Since transcripts from repetitive elements contribute to piRNA biogenesis, we propose that piRNA levels may be higher in the absence of MeCP2 and that increased piRNA levels may contribute to the mis-regulation of some genes seen in the Mecp2 knockout mouse brain.
I’ve made available all the scripts that were used in the analysis on GitHub. There are two very minor differences between what was published and the pipeline available on GitHub: 1) I used SHRiMP_2_2_3 instead of SHRiMP_2_2_2 and 2) I rounded the expression values to 2 decimal points. These differences do not change any of the conclusions we previously made in the paper.
We used the small RNA sequencing data from this paper:
Two pooled total RNA samples (isolated from 4 pairs of postnatal 6 week old wild-type and Mecp2-null mice) were sequenced using the SOLiD v2 sequencing system. The sequence provided online has actually been pre-processed and only “good” and “best” reads are available (a total of 3,660,124 reads in WT and 2,789,136 reads in the KO). Initially, I naively thought about converting the colour-space reads into base-space, but colleagues I worked with previously advised otherwise.
For our set of piRNAs, we used the NONCODE version 3 database (version 4 is available now) and extracted 75,814 mouse piRNAs. We then mapped the colour-space reads directly to this set of mouse piRNAs using SHRiMP version 2.2.2 (SHRiMP version 2.2.3 in the GitHub repository) and the default settings (as this setting mapped the most reads). For reads that cross-mapped to more than one piRNA, we divided the read by the number of places it multi-mapped to (similar to this strategy) and assigned this divided number to all piRNAs the read mapped to. Reads mapping to more than 5 piRNA were discarded and finally expression numbers were normalised by counts per million using the total library sizes.
Analysis using R
If you’ve successfully ran the entire analysis available from the GitHub repository (by running “run.sh”), you should end up with a file called SRR089647_SRR089648_exp.tsv, which contains the expression values of all the piRNAs for the WT and KO samples. Here I will create a new plot, which I think is a bit more informative that the plot we showed in the paper. I will also create summaries in R, recapitulating what we reported.
#load final data outputted from the pipeline data <- read.table("SRR089647_SRR089648_exp.tsv", header=F, stringsAsFactors=F) #name the columns names(data) <- c('piRNA', 'length', 'WT', 'KO') #the library sizes, used for normalisation wt_lib_size <- 3660124 ko_lib_size <- 2789136 #normalise by counts per million data$WT_tpm <- data$WT*1000000/sum(wt_lib_size) data$KO_tpm <- data$KO*1000000/sum(ko_lib_size) #calculate the fold changes data$FC <- data$KO_tpm/data$WT_tpm #calculate the mean expression of a piRNA data$mean_exp <- mapply(mean, data$WT_tpm, data$KO_tpm) head(data) # piRNA length WT KO WT_tpm KO_tpm FC #1 DQ546606 29 147865.67 229912.2 40399.0876 82431.3228 2.0404254 #2 DQ540966 30 146001.83 228493.2 39889.8589 81922.5631 2.0537190 #3 DQ540188 25 29252.92 24557.0 7992.3303 8804.5187 1.1016210 #4 DQ558990 30 6166.50 11048.0 1684.7790 3961.0833 2.3510997 #5 DQ541777 30 1995.50 2411.0 545.2001 864.4254 1.5855195 #6 DQ703900 32 4252.00 1621.5 1161.7093 581.3628 0.5004375
Now in the paper, we filtered out piRNAs that had a raw expression of less than 5 in the KO sample. This was done to filter out lowly expressed piRNAs. This step can be improved since we shouldn’t filter out piRNAs based on the expression from just one sample. In this case, it didn’t make much of a difference to the results (i.e., piRNAs lowly expressed in the KO were also lowly expressed in the WT) but I thought I point it out because now that I look back, it can be improved.
#how many piRNAs had an expression of #less than 5 raw tags in the KO sample? table(data$KO<5) # #FALSE TRUE # 357 8368 #what was the mean expression of these piRNAs? #as I said above, piRNAs lowly expressed in the KO #were also lowly expressed in the WT summary(apply( data[data$KO<5,3:4], 1, mean)) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 0.1000 0.2250 0.5000 0.5647 0.5000 25.5000 #create a subset of piRNAs that have #expression > 5 in the KO data_subset <- data[data$KO>5,] dim(data_subset) # 327 8 table(data_subset$FC>1) # #FALSE TRUE # 61 266 #as we reported, 81% of piRNAs #have higher expression in the KO 266/327 # 0.8134557 #in my opinion a more informative plot than what we published plot(log2(data_subset$mean_exp), data_subset$FC, pch=19, cex=0.5, xlab="Log2 mean expression", ylab="Fold change") abline(h=1, col='dodgerblue')
The scientific conclusion of this work on Rett syndrome and piRNAs can be read in our open access paper, piRNAs Warrant Investigation in Rett Syndrome: An Omics Perspective; here I just wanted to briefly comment about open research. I’m extremely new to the concept of open research but by observing the work of colleagues and people I follow on Twitter, I want to take part in this movement. As a researcher, it’s extremely frustrating to read someone’s methods section and realise that I would have to directly email the person to get an idea of what they did because what they described is insufficient for reproducing their work. So from now on I will provide all my scripts/code for any papers I publish on my GitHub account. I have realised that just by having the mindset that I’m going to share my work, made me more diligent and less sloppy.
Perhaps people are afraid to share their work because it will go through another layer scrutiny beyond that of your supervisor and peer-reviewers, but now I realise that publication doesn’t mean the end of a project. Besides, careful scrutiny from others is the only way I can learn and improve, so I will embrace it.
This work is licensed under a Creative Commons
Attribution 4.0 International License.