Using the R SeqinR package

Just a quick demonstration of the SeqinR package in R using sequences available from the NONCODE database version 3. First download the fasta file, which is available at http://noncode.org/datadownload/ncrna_NONCODE[v3.0].fasta.tar.gz, then install the necessary packages in R and load the fasta file (note I have extracted and renamed the fasta file into ncrna_noncode_v3.fa).

wget -c http://noncode.org/datadownload/ncrna_NONCODE[v3.0].fasta.tar.gz
tar -xzf *.tar.gz 
mv *.fasta ncrna_noncode_v3.fa
cat ncrna_noncode_v3.fa | grep "^>" | wc -l
411553

Now with the R analysis.

install.packages("seqinr")
library("seqinr")
ncrna <- read.fasta(file = "ncrna_noncode_v3.fa")
#how many fasta sequences
length(ncrna)
[1] 411553

n1 <- ncrna[[2]]
n1
  [1] "a" "c" "c" "t" "c" "g" "a" "c" "c" "a" "c" "c" "c" "t" "t" "a" "a" "c" "t" "t" "g" "g" "g" "t" "g" "c" "a" "g" "g"
 [30] "t" "a" "t" "t" "c" "a" "a" "c" "a" "a" "a" "a" "g" "c" "a" "a" "t" "g" "a" "a" "t" "c" "a" "a" "g" "g" "a" "a" "t"
 [59] "g" "a" "a" "t" "c" "a" "a" "t" "g" "g" "a" "t" "t" "t" "t" "c" "a" "a" "t" "g" "g" "a" "t" "t" "t" "a" "t" "g" "g"
 [88] "a" "t" "t" "t" "t" "a" "a" "a" "a" "a" "c" "a" "g" "a" "g" "a" "a" "c" "t" "c" "a" "g" "a" "a" "a" "t" "c" "t" "a"
[117] "a" "c" "a" "g" "a" "a" "a" "t" "t" "t" "a" "a" "c" "a" "g" "a" "a" "a" "t" "t" "t" "a" "a" "a" "t" "t" "t" "g" "t"
[146] "c" "g" "a" "t" "c" "t" "a" "c" "a" "a" "a" "t" "t" "g" "c" "c" "c" "t" "t" "a" "t" "c" "t" "t" "t" "t" "t" "c" "c"
[175] "a" "t" "c" "t" "t" "a" "a" "a" "c" "t" "a" "a" "a" "c" "g" "t" "t" "a" "a" "t" "a" "a" "c" "t" "t" "a" "t" "t" "g"
[204] "t" "t" "g" "t" "t" "g" "a" "a" "t" "a" "c" "a" "g" "c" "t" "t" "g" "t" "g" "g" "a" "a" "t" "g" "t" "c" "g" "g" "g"
[233] "g" "t" "a" "c" "a" "a" "t" "g" "t" "c" "g" "g" "g" "g"
attr(,"name")
[1] "n1"
attr(,"Annot")
[1] ">n1 | AB002583 | tmRNA | chloroplast Cyanidioschyzon merolae | ssrA | NONCODE v2.0 | NULL | NULL | -1.0577600 | -0.3460597"
attr(,"class")
[1] "SeqFastadna"

#count number of nucleotides
count(n1,1)

 a  c  g  t 
86 40 44 76

#count number of dinucleotides
count(n1,2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
38 15  9 24 16  7  5 12 15  4 14 10 16 14 16 30

#GC content
GC(n1)
[1] 0.3414634

#Store the fasta header
annotation <- getAnnot(ncrna)
#count the number of piRNAs
pirna_index <- grep("piRNA",annotation,ignore.case=T)
length(pirna_index)
[1] 174724
pirna <- ncrna[pirna_index]

#count the number of human piRNAs
pirna_annotation <- getAnnot(pirna)
pirna_human_index <- grep("Homo",pirna_annotation,ignore.case=T)
length(pirna_human_index)
[1] 32152
pirna_human <- pirna[pirna_human_index]
length(pirna_human)
[1] 32152

#store human pirna annotation
pirna_human_annotation <- getAnnot(pirna_human)

length(pirna_human_annotation)
[1] 32152

#write out as fasta
write.fasta(pirna_human_sequence, pirna_human_annotation, 'human_pirna.fa')

pirna_human_sequence <- getSequence(pirna_human)
#write a function to return the tenth base
tenth_base <- function(x){
   return(x[10])
}

table(mapply(tenth_base,pirna_human_sequence))

    a     c     g     n     t
10456  6153  8030     6  7507

#write a function to return the first base
first_base <- function(x){
   return(x[1])
}
table(mapply(first_base,pirna_human_sequence))

    a     c     g     t
 1924  2134  2699 25395

#length distribution of all human piRNAs
table(getLength(pirna_human))


  16   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32
   2    1    3    6    1    6    3    4    8 4188 3792 3956 6749 7846 4359 1226
  33
   2

lengths <- table(getLength(pirna_human))
barplot(lengths, xlab="piRNA lengths", ylab="Frequency")

human_pirna_length

Conclusions

Using the SeqinR package in R and piRNA sequences provided by the NONCODE database, we could obtain basic sequence statistics on a particular class of noncoding RNA called piwi-interacting RNAs (piRNAs). piRNAs are a class of noncoding RNAs that are roughly 26-31nt long, which we could observe in our subset of human piRNAs. They are also known to have a U/T at the first base, which we observed, and an A at the tenth base, which we could observe slightly. We could perform some statistics to see if there were an enrichment.

For further information on the capabilities of this package, you should refer to the reference manual, which is linked below. The package provides three pages worth of functions.

More info

The SeqinR man page.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

2 thoughts on “Using the R SeqinR package

  1. Hi there,

    This is really cool and currently I am trying to do the same with a GenBank file. I am able to have it use FASTA, but when I go to do the count, my numbers always come up to 0. Any suggestions?

    Thanks!

    • Hi Jennifer,

      you'll have to show me the code (and the FASTA file) you used before I can try to figure out what went wrong.

      Cheers,

      Dave

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.