Using the R SeqinR package

Just a quick demonstration of the SeqinR package in R using sequences available from the NONCODE database version 3. First download the fasta file, which is available at[v3.0].fasta.tar.gz, then install the necessary packages in R and load the fasta file (note I have extracted and renamed the fasta file into ncrna_noncode_v3.fa).

wget -c[v3.0].fasta.tar.gz
tar -xzf *.tar.gz 
mv *.fasta ncrna_noncode_v3.fa
cat ncrna_noncode_v3.fa | grep "^>" | wc -l

Now with the R analysis.

ncrna <- read.fasta(file = "ncrna_noncode_v3.fa")
#how many fasta sequences
[1] 411553

n1 <- ncrna[[2]]
  [1] "a" "c" "c" "t" "c" "g" "a" "c" "c" "a" "c" "c" "c" "t" "t" "a" "a" "c" "t" "t" "g" "g" "g" "t" "g" "c" "a" "g" "g"
 [30] "t" "a" "t" "t" "c" "a" "a" "c" "a" "a" "a" "a" "g" "c" "a" "a" "t" "g" "a" "a" "t" "c" "a" "a" "g" "g" "a" "a" "t"
 [59] "g" "a" "a" "t" "c" "a" "a" "t" "g" "g" "a" "t" "t" "t" "t" "c" "a" "a" "t" "g" "g" "a" "t" "t" "t" "a" "t" "g" "g"
 [88] "a" "t" "t" "t" "t" "a" "a" "a" "a" "a" "c" "a" "g" "a" "g" "a" "a" "c" "t" "c" "a" "g" "a" "a" "a" "t" "c" "t" "a"
[117] "a" "c" "a" "g" "a" "a" "a" "t" "t" "t" "a" "a" "c" "a" "g" "a" "a" "a" "t" "t" "t" "a" "a" "a" "t" "t" "t" "g" "t"
[146] "c" "g" "a" "t" "c" "t" "a" "c" "a" "a" "a" "t" "t" "g" "c" "c" "c" "t" "t" "a" "t" "c" "t" "t" "t" "t" "t" "c" "c"
[175] "a" "t" "c" "t" "t" "a" "a" "a" "c" "t" "a" "a" "a" "c" "g" "t" "t" "a" "a" "t" "a" "a" "c" "t" "t" "a" "t" "t" "g"
[204] "t" "t" "g" "t" "t" "g" "a" "a" "t" "a" "c" "a" "g" "c" "t" "t" "g" "t" "g" "g" "a" "a" "t" "g" "t" "c" "g" "g" "g"
[233] "g" "t" "a" "c" "a" "a" "t" "g" "t" "c" "g" "g" "g" "g"
[1] "n1"
[1] ">n1 | AB002583 | tmRNA | chloroplast Cyanidioschyzon merolae | ssrA | NONCODE v2.0 | NULL | NULL | -1.0577600 | -0.3460597"
[1] "SeqFastadna"

#count number of nucleotides

 a  c  g  t 
86 40 44 76

#count number of dinucleotides
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
38 15  9 24 16  7  5 12 15  4 14 10 16 14 16 30

#GC content
[1] 0.3414634

#Store the fasta header
annotation <- getAnnot(ncrna)
#count the number of piRNAs
pirna_index <- grep("piRNA",annotation,
[1] 174724
pirna <- ncrna[pirna_index]

#count the number of human piRNAs
pirna_annotation <- getAnnot(pirna)
pirna_human_index <- grep("Homo",pirna_annotation,
[1] 32152
pirna_human <- pirna[pirna_human_index]
[1] 32152

#store human pirna annotation
pirna_human_annotation <- getAnnot(pirna_human)

[1] 32152

#write out as fasta
write.fasta(pirna_human_sequence, pirna_human_annotation, 'human_pirna.fa')

pirna_human_sequence <- getSequence(pirna_human)
#write a function to return the tenth base
tenth_base <- function(x){


    a     c     g     n     t
10456  6153  8030     6  7507

#write a function to return the first base
first_base <- function(x){

    a     c     g     t
 1924  2134  2699 25395

#length distribution of all human piRNAs

  16   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32
   2    1    3    6    1    6    3    4    8 4188 3792 3956 6749 7846 4359 1226

lengths <- table(getLength(pirna_human))
barplot(lengths, xlab="piRNA lengths", ylab="Frequency")



Using the SeqinR package in R and piRNA sequences provided by the NONCODE database, we could obtain basic sequence statistics on a particular class of noncoding RNA called piwi-interacting RNAs (piRNAs). piRNAs are a class of noncoding RNAs that are roughly 26-31nt long, which we could observe in our subset of human piRNAs. They are also known to have a U/T at the first base, which we observed, and an A at the tenth base, which we could observe slightly. We could perform some statistics to see if there were an enrichment.

For further information on the capabilities of this package, you should refer to the reference manual, which is linked below. The package provides three pages worth of functions.

More info

The SeqinR man page.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
5 comments Add yours
  1. Hi there,

    This is really cool and currently I am trying to do the same with a GenBank file. I am able to have it use FASTA, but when I go to do the count, my numbers always come up to 0. Any suggestions?


    1. Hi Jennifer,

      you’ll have to show me the code (and the FASTA file) you used before I can try to figure out what went wrong.



  2. Hi there,

    I am trying to use the translate function of seqinr to carry out 3-frame translation.
    My plan is to translate in all three frames and then concatenate the three files
    I loaded the non code fasta like you did and wrote the following
    translate(n1, frame = 0, sens = “F”, numcode = 1, NAstring = “X”, ambiguous = FALSE)

    I get the following errors
    Error in s2n(seq, levels = s2c(“tcag”)) :
    sequence is not a vector of chars

    Since I am very new to R, I am doing something evidently wrong. Can you please help me?


  3. Hi ,
    I want to use CNOGpro for my study which depends on seqin to execute genbank flat file. Below is my code. I would be thankful if you can help me to resolve the error generated.

    gb_file <- reutils::efetch("CP002806", "nuccore", rettype = "gbwithparts", retmode = "text")
    gbkfile <- biofiles::gbRecord(gb_file)

    CNOGpro(hitsfile, gbkfile, windowlength = 100, name = "C. psittaci")

    Error Reading contents of GenBank file… Error in readLines(con = gbkfile) : 'con' is not a connection

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.