Getting started with Circos

Once you've installed Circos, you can get started with Circos by following the Circos quick guide. Bascially this post was done just by following the guide, which is obviously what you should be looking at; this page just acts as a reminder for myself of what I've done.

According to the Hello World guide, the bare minimum that is required in the configuration file is this:

# circos.conf

karyotype = /home/davetang/src/circos-0.62-1/data/karyotype/karyotype.human.txt

<ideogram>

<spacing>
default = 0.005r
</spacing>

radius    = 0.9r
thickness = 20p
fill      = yes

</ideogram>

################################################################
# The remaining content is standard and required. It is imported
# from default files in the Circos distribution.
#
# These should be present in every Circos configuration file and
# overridden as required. To see the content of these files,
# look in etc/ in the Circos distribution.

<image>
# Included from Circos distribution.
<<include /home/davetang/src/circos-0.62-1/etc/image.conf>>
</image>

# RGB/HSV color definitions, color lists, location of fonts, fill patterns.
# Included from Circos distribution.
<<include /home/davetang/src/circos-0.62-1/etc/colors_fonts_patterns.conf>>

# Debugging, I/O an dother system parameters
# Included from Circos distribution.
<<include /home/davetang/src/circos-0.62-1/etc/housekeeping.conf>>

Continue reading

My 100th post!

Just over two years ago I started this blog and now I'm up to my 100th post! I didn't have any particular aims or goals, I just wanted to document and share some of the things I was learning during my PhD and this still holds. Thanks to a flawed Google PageRank algorithm, my blog (and wiki site) receives several thousands of unique visitors each month. I don't monitor the traffic for monetary gain or have any plans to (what a feeble idea anyway), but just purely out of interest. Several colleagues (past and present) have told me that they have independently discovered my site and have found it useful, which helps justify the (limited) time I spend writing things up. I do receive emails, from time to time, from other researchers asking for advice but I usually reply that I'm not an authority in this area and suggest that they either email the Bioconductor mailing list or the appropriate authority. Some of the nicer stories is that I still remember a nice email I got from a reader who just emailed me to say hello and gave me a nice pat on the back 🙂 And I found out that I have at least one fan of my blog 🙂 They made me smile (hence the appropriate use of the emoticons).

I enjoy learning and sharing what I learn. I think it's quite obvious that in the years to come there will be an even greater demand for bioinformaticians and statisticans. So if you come from a biological background (like me), keep learning! I think the most important quality towards learning is patience, so keep at it!

To end I'd like to share something fun (your opinion of fun may differ from mine); it's a brainteaser I found last week. I don't need to post the answer because you'll know the correct answer once you work it out. Enjoy:

There are 5 houses of 5 different colours. Each house is occupied by a man of different nationality. The 5 owners each drink a different type of beverage, smoke a different brand of cigar, and keep a different pet. The question is “who owns the fish?”.

The clues are as follows:

  1. The Brit lives in the red house.
  2. The Swede keeps dogs as pets.
  3. The Dane drinks tea.
  4. The green house is on the left of the white house.
  5. The green house’s owner drinks coffee.
  6. The person who smokes Pall Mall rears birds.
  7. The owner of the yellow house smokes Dunhill.
  8. The man living in the centre house drinks milk.
  9. The Norwegian lives in the first house.
  10. The man who smokes Blends lives next to the one who keeps cats.
  11. The man who keeps the horse lives next to the man who smokes Dunhill.
  12. The owner who smokes Bluemasters drinks beer.
  13. The German smokes Prince.
  14. The Norwegian lives next to the blue house.
  15. The man who smokes Blends has a neighbour who drinks water.

Warning spoilers in the comments.

Visualising hierarchical clustering results

I've written about hierarchical clustering before as an attempt to understand it better. Within R, you can plot the hierarchical clustering results however when working with a large dataset you may produce plots like these where all the labels are overlapping:

and

As you can see you can't see any of the labels. During my honours year, I worked in a molecular evolution lab so I had learned about the Newick format, so I was wondering if there was a way to export the hierarchical clustering results into Newick format, so that I could use TreeView or some other tree visualising program to visualise or to manipulate the results. After some searching, I found this post from the Getting Genetics Done blog, which was exactly what I wanted.

So in almost the exact same instructions as the post I linked above:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("ctc")

library(ctc)
#hierarchical clustering of my rows using average linkage
hc <- hclust(d <- as.dist(1 - cor(t(data), method="pearson")), method="average")
write.table(hc2Newick(hc), file="hc.newick")

Now open up hc.newick using your text editor of choice (e.g. Vim), and delete the first line, which should be "x" (with the double quotation marks). On the second line it should start as "1" " (with the double quotation marks), remove those so that the beginning of the line is now an open parenthesis i.e. (. Lastly go to the end of this line and remove the double quotation mark i.e. ", so that the end of the line is a semicolon i.e. ;.

Open your tree viewing program of choice, for example Dendroscope and open the hc.newick file. I prefer this program over TreeView because it allows you to zoom right into the tree (mouse scrolling) and to copy node information to the clipboard (as well as many other features!). Within Dendroscope, you can try different types of layouts, such as a circular dendrogram:

Colouring the branches

I learned of the sparcl package from this Stack Overflow post. Here using the iris dataset, which comes with R, I colour the branches according to the species.

#install sparcl package
install.packages("sparcl")
#load package
library(sparcl)
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
hh <- hclust(dist(iris[,c(1:4)]))
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50
species_as_numeric = as.numeric(iris$Species)
ColorDendrogram(hh,y=species_as_numeric,branchlength=2)

colour_hierarchical_clustering
The three species, setosa, versicolor and virginica, are mostly clustered together based on their sepal length, sepal width, petal length and petal width.

goseq using pnas_expression_filtered.tsv

goseq is a Bioconductor package for performing a gene ontology analysis for RNA-seq and other length biased data. For more information on how to use the package, please refer to the vignette. For more information on the pnas_expression file please refer to my older post.

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("goseq")

library(goseq)
data <- read.table("pnas_expression_filtered.tsv",header=T,row.names=1)
nrow(data)
[1] 16494
head(data)
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000124201   180   218   293   275   373   301    88
ENSG00000124207    76    80    85    97    80    81    37
ENSG00000125835   132   200   200   228   280   204    52
ENSG00000125834    42    60    72    86   131    99    30

d <- data[,1:7]
rownames(d) <- row.names(data)

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

library(edgeR)
group <- c(rep("Control",4),rep("Test",3))
d <- DGEList(counts = d, group=group)
dim(d)
[1] 16494     7
d <- calcNormFactors(d)
d
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000124208   478   619   628   744   483   716   240
ENSG00000182463    27    20    27    26    48    55    24
ENSG00000124201   180   218   293   275   373   301    88
ENSG00000124207    76    80    85    97    80    81    37
ENSG00000125835   132   200   200   228   280   204    52
16489 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   976847    1.0296636
lane2 Control  1154746    1.0372521
lane3 Control  1439393    1.0362662
lane4 Control  1482652    1.0378383
lane5    Test  1820628    0.9537095
lane6    Test  1831553    0.9525624
lane8    Test   680798    0.9583181

d <- estimateCommonDisp(d)
d$common.dispersion
[1] 0.02001965
sqrt(d$common.dispersion)
[1] 0.1414908
de.com <- exactTest(d)

d <- estimateTagwiseDisp(d)
de.tgw <- exactTest(d)
summary(decideTestsDGE(de.tgw, p.value=0.01))
   [,1] 
-1  1404
0  13372
1   1718

genes=as.integer(p.adjust(de.tgw$table$PValue[de.tgw$table$logFC!=0],method="BH")<.01)
names(genes)=row.names(de.tgw$table[de.tgw$table$logFC!=0,])
table(genes)
genes
    0     1 
13372  3122
head(genes)
ENSG00000124208 ENSG00000182463 ENSG00000124201 ENSG00000124207 ENSG00000125835 
              0               0               0               0               0 
ENSG00000125834 
              0
pwf=nullp(genes,"hg19","ensGene")

The x-axis of the plot above are binned lengths of Ensembl genes and the y-axis is the ratio of differentially detected genes. We can see a clear bias in the detection of differential expression with longer genes.


GO.wall=goseq(pwf,"hg19","ensGene")
head(GO.wall)
       category over_represented_pvalue under_represented_pvalue
2269 GO:0005623            1.741568e-52                        1
9863 GO:0044464            1.741568e-52                        1
9830 GO:0044424            2.533098e-49                        1
2268 GO:0005622            2.088365e-48                        1
2238 GO:0005575            6.005558e-48                        1
2321 GO:0005737            2.845073e-47                        1

table(GO.wall$over_represented_pvalue<0.01)

FALSE  TRUE 
14092   844

GO.samp=goseq(pwf,"hg19","ensGene",method="Sampling",repcnt=1000)
head(GO.samp)
     category over_represented_pvalue under_represented_pvalue
36 GO:0000075             0.000999001                        1
41 GO:0000082             0.000999001                        1
46 GO:0000087             0.000999001                        1
67 GO:0000122             0.000999001                        1
77 GO:0000139             0.000999001                        1
92 GO:0000166             0.000999001                        1

table(GO.samp$over_represented_pvalue<0.05)

FALSE  TRUE 
13413  1523

head(GO.nobias)
       category over_represented_pvalue under_represented_pvalue
2269 GO:0005623            1.038052e-67                        1
9863 GO:0044464            1.038052e-67                        1
2238 GO:0005575            3.014341e-64                        1
9830 GO:0044424            3.255426e-60                        1
2268 GO:0005622            6.342680e-60                        1
3482 GO:0008150            1.916257e-59                        1

table(GO.nobias$over_represented_pvalue<0.01)

FALSE  TRUE 
13961   975

summary(GO.wall$category[p.adjust(GO.wall$over_represented_pvalue,method="BH")<.01])
   Length     Class      Mode 
      404 character character

library(GO.db)
for(go in wall.enriched.GO[1:5]){
   print(GOTERM[[go]])
   cat("--------------------------------------\n")
}

GOID: GO:0005623
Term: cell
Ontology: CC
Definition: The basic structural and functional unit of all organisms.
    Includes the plasma membrane and any external encapsulating
    structures such as the cell wall and cell envelope.
--------------------------------------
GOID: GO:0044464
Term: cell part
Ontology: CC
Definition: Any constituent part of a cell, the basic structural and
    functional unit of all organisms.
Synonym: protoplast
--------------------------------------
GOID: GO:0044424
Term: intracellular part
Ontology: CC
Definition: Any constituent part of the living contents of a cell; the
    matter contained within (but not including) the plasma membrane,
    usually taken to exclude large vacuoles and masses of secretory or
    ingested material. In eukaryotes it includes the nucleus and
    cytoplasm.
--------------------------------------
GOID: GO:0005622
Term: intracellular
Ontology: CC
Definition: The living contents of a cell; the matter contained within
    (but not including) the plasma membrane, usually taken to exclude
    large vacuoles and masses of secretory or ingested material. In
    eukaryotes it includes the nucleus and cytoplasm.
Synonym: internal to cell
Synonym: nucleocytoplasm
Synonym: protoplasm
Synonym: protoplast
--------------------------------------
GOID: GO:0005575
Term: cellular_component
Ontology: CC
Definition: The part of a cell or its extracellular environment in
    which a gene product is located. A gene product may be located in
    one or more parts of a cell and its location may be as specific as
    a particular macromolecular complex, that is, a stable, persistent
    association of macromolecules that function together.
Synonym: cellular component
Synonym: cellular component unknown
Synonym: GO:0008372
Secondary: GO:0008372
--------------------------------------

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GO.db_2.8.0             org.Hs.eg.db_2.8.0      RSQLite_0.11.2         
 [4] DBI_0.2-5               AnnotationDbi_1.20.2    Biobase_2.18.0         
 [7] BiocGenerics_0.4.0      edgeR_3.0.2             limma_3.14.1           
[10] goseq_1.10.0            geneLenDataBase_0.99.10 BiasedUrn_1.04         
[13] BiocInstaller_1.8.3    

loaded via a namespace (and not attached):
 [1] biomaRt_2.14.0         Biostrings_2.26.2      bitops_1.0-4.2        
 [4] BSgenome_1.26.1        GenomicFeatures_1.10.0 GenomicRanges_1.10.5  
 [7] grid_2.15.2            IRanges_1.16.4         lattice_0.20-10       
[10] Matrix_1.0-10          mgcv_1.7-22            nlme_3.1-105          
[13] parallel_2.15.2        RCurl_1.95-3           Rsamtools_1.10.2      
[16] rtracklayer_1.18.0     stats4_2.15.2          tools_2.15.2          
[19] XML_3.95-0.1           zlibbioc_1.4.0