<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Musings from a PhD candidate</title>
	<atom:link href="http://davetang.org/muse/feed/" rel="self" type="application/rss+xml" />
	<link>http://davetang.org/muse</link>
	<description>The best way to have a good idea is to have lots of ideas-- Linus Pauling</description>
	<lastBuildDate>Tue, 15 May 2012 14:41:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Using blat</title>
		<link>http://davetang.org/muse/2012/05/15/using-blat/</link>
		<comments>http://davetang.org/muse/2012/05/15/using-blat/#comments</comments>
		<pubDate>Tue, 15 May 2012 14:33:47 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[mapping]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1105</guid>
		<description><![CDATA[My multipurpose sequence aligner tool of choice for many years has been blat. This is just a short post on the very basics of blat. Below is a slide I made a couple of years ago: First blat splits the reference &#8230; <a href="http://davetang.org/muse/2012/05/15/using-blat/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>My multipurpose sequence aligner tool of choice for many years has been <a href="http://genome.cshlp.org/content/12/4/656.abstract">blat</a>. This is just a short post on the very basics of blat. Below is a slide I made a couple of years ago:</p>
<p><a href="http://davetang.org/muse/2012/05/15/using-blat/blat_example/" rel="attachment wp-att-1106"><img src="http://davetang.org/muse/wp-content/uploads/2012/05/blat_example.png" alt="" title="blat_example" width="964" height="722" class="aligncenter size-full wp-image-1106" /></a></p>
<p>First blat splits the reference sequence up into &#8220;tiles&#8221;. How it is split depends on two parameters: -tileSize, the size of each tile, and -stepSize, the distance between the starts of consecutive tiles. For DNA sequences both default to 11, which means the tiles do not overlap.</p>
<p>For blat to report an alignment, your query sequence must match at least two tiles (set via -minMatch) with no mismatches (you can allow up to one mismatch in a tile by using -oneOff). So if you&#8217;re trying to align a 21 bp sequence to a reference using the default settings, blat will never report an alignment, because a 21 bp query cannot span two complete non-overlapping 11 bp tiles.</p>
<p>To illustrate, imagine this reference sequence (44bp):</p>
<p>&gt;database<br />
AAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGTTTTTTTTTTT</p>
<p>and this query sequence (12bp):</p>
<p>&gt;test<br />
GGGGGGGGGGGT</p>
<p>With the default -tileSize and -minMatch, an alignment will only be reported if -stepSize is set to 1, so that the reference tiles overlap and two of them match the query, and -minScore is set to less than 12:</p>
<pre class="brush: bash; title: ; notranslate">
#returns no hit
blat -minScore=0 -stepSize=2 database.fa test.fa output.psl
#returns 2 hits
blat -minScore=0 -stepSize=1 database.fa test.fa output.psl
</pre>
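<p>To see why the step size matters here, below is a minimal Python sketch of the tiling idea. It is a simplified model for illustration, not how blat is actually implemented: the function names are made up, it only counts exact tile matches, and it ignores scoring. It indexes the reference every step_size bases and counts how many 11-mers of the query hit an indexed tile; blat needs at least two such hits (-minMatch) before it reports anything.</p>

```python
def index_tiles(reference, tile_size=11, step_size=11):
    """Return the set of tiles taken from the reference every step_size bases."""
    last_start = len(reference) - tile_size
    return {reference[i:i + tile_size] for i in range(0, last_start + 1, step_size)}


def count_tile_hits(query, tiles, tile_size=11):
    """Count the query k-mers (one per query position) that exactly match a tile."""
    kmers = [query[i:i + tile_size] for i in range(len(query) - tile_size + 1)]
    return sum(1 for kmer in kmers if kmer in tiles)


reference = "A" * 11 + "C" * 11 + "G" * 11 + "T" * 11  # the 44 bp database sequence
query = "G" * 11 + "T"                                  # the 12 bp test sequence

for step in (11, 2, 1):
    hits = count_tile_hits(query, index_tiles(reference, step_size=step))
    print(step, hits)  # step sizes 11 and 2 give 1 hit, step size 1 gives 2 hits
```

<p>Only a step size of 1 indexes both GGGGGGGGGGG and GGGGGGGGGGT, which is consistent with the two blat commands above.</p>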
<p>Here&#8217;s an <a href="http://davetang.org/muse/2010/11/16/can-we-use-blat-to-map-mirnas/">old post</a> showing how the blat parameters affect the output.</p>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/05/15/using-blat/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting started with TopHat</title>
		<link>http://davetang.org/muse/2012/05/09/getting-started-with-tophat/</link>
		<comments>http://davetang.org/muse/2012/05/09/getting-started-with-tophat/#comments</comments>
		<pubDate>Wed, 09 May 2012 15:17:37 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[rnaseq]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1078</guid>
		<description><![CDATA[I will use RNA-Seq data from Marioni et al., 2008 Genome Research to test TopHat. I found it funny that the submission title for their dataset was &#8220;RNASeq: the death Knell of expression arrays?&#8221;; I guess they decided to go &#8230; <a href="http://davetang.org/muse/2012/05/09/getting-started-with-tophat/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I will use RNA-Seq data from <a href="http://genome.cshlp.org/content/18/9/1509.full">Marioni et al., 2008 Genome Research</a> to test TopHat. I found it funny that the submission title for their dataset was &#8220;RNASeq: the death Knell of expression arrays?&#8221;; I guess they decided to go with something much less morbid when they finally published their paper. Their sequence data was downloaded from DDBJ under the accession number SRA000299.</p>
<pre class="brush: bash; title: ; notranslate">
#!/bin/bash
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002321.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000571/SRR002323.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000604/SRR002322.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002320.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000605/SRR002325.fastq.bz2
wget ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA000/SRA000299/SRX000606/SRR002324.fastq.bz2
</pre>
<p><strong>Sample summary</strong></p>
<table>
<tr>
<td>SRR002320</td>
<td>080226_CMKIDNEY_0007_3pM</td>
</tr>
<tr>
<td>SRR002321</td>
<td>080226_CMLIVER_0007_3pM</td>
</tr>
<tr>
<td>SRR002322</td>
<td>080317_CM-LIV-2-REPEAT_0003_1.5pM</td>
</tr>
<tr>
<td>SRR002323</td>
<td>080317_CM-LIV-2-REPEAT_0003_3pM</td>
</tr>
<tr>
<td>SRR002324</td>
<td>080317_CM-KID-2-REPEAT_0003_1.5pM</td>
</tr>
<tr>
<td>SRR002325</td>
<td>080317_CM-KID-2-REPEAT_0003_3pM</td>
</tr>
</table>
<p><strong>To set up TopHat</strong></p>
<p>Download binaries for bowtie2 at <a href="http://sourceforge.net/projects/bowtie-bio/files/bowtie2/">http://sourceforge.net/projects/bowtie-bio/files/bowtie2/</a></p>
<p>Download binaries for tophat2 at <a href="http://tophat.cbcb.umd.edu/downloads/tophat-2.0.0.Linux_x86_64.tar.gz">http://tophat.cbcb.umd.edu/downloads/tophat-2.0.0.Linux_x86_64.tar.gz</a></p>
<p>Download test data at <a href="http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz">http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz</a></p>
<p><strong>Run a test job with the test_data</strong></p>
<pre class="brush: bash; title: ; notranslate">
tophat -r 20 test_ref reads_1.fq reads_2.fq
</pre>
<p>The -r parameter:</p>
<p>-r/--mate-inner-dist &lt;int&gt;</p>
<p>This is the expected (mean) inner distance between mate pairs. For example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired end runs.</p>
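<p>The arithmetic behind that example is just the fragment size minus the two sequenced ends. A tiny Python sketch (the function name is my own, not part of TopHat):</p>

```python
def mate_inner_distance(fragment_size, read_length):
    """Inner distance = fragment size minus the two sequenced ends."""
    return fragment_size - 2 * read_length


print(mate_inner_distance(300, 50))  # 200, the value from the example above
```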
<p><strong>Interpreting the test results</strong></p>
<p>The reference sequence, in which the runs of A&#8217;s serve as introns:</p>
<pre class="brush: bash; title: ; notranslate">
cat test_ref.fa
&gt;test_chromosome
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ACTACTATCTGACTAGACTGGAGGCGCTTGCGACTGAGCTAGGACGTGCC
ACTACGGGGATGACGACTAGGACTACGGACGGACTTAGAGCGTCAGATGC
AGCGACTGGACTATTTAGGACGATCGGACTGAGGAGGGCAGTAGGACGCT
ACGTATTTGGCGCGCGGCGCTACGGCTGAGCGTCGAGCTTGCGATACGCC
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
ACTATTACTTTATTATCTTACTCGGACGTAGACGGATCGGCAACGGGACT
GTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
TTTTCTACTTGAGACTGGGATCGAGGCGGACTTTTTAGGACGGGACTTGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
</pre>
<p>There are 100 paired end reads in reads_1.fq and reads_2.fq.</p>
<pre class="brush: bash; title: ; notranslate">
head -4 reads_1.fq
@test_mRNA_150_290_0/1
TCCTAAAAAGTCCGCCTCGGTCTCAGTCTCAAGTAGAAAAAGTCCCGTTGGCGATCCGTCTACGTCCGAGTAAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

head -4 reads_2.fq
@test_mRNA_150_290_0/2
TACGTATTTGTCGCGCGGCCCTACGGCTGAGCGTCGAGCTTGCGATCCGCCACTATTACTTTATTATCTTACTCG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
</pre>
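<p>If you want to check the number of reads yourself, a FASTQ record normally takes four lines, so the read count is the line count divided by four. Below is a minimal Python sketch under that assumption (it would miscount files with wrapped sequence lines); the function name and the two example records are made up for illustration.</p>

```python
import io


def count_fastq_records(handle):
    """Count FASTQ records, assuming exactly four lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of four")
    return n_lines // 4


# Two made-up records standing in for a real file such as reads_1.fq
example = io.StringIO("@read_a/1\nACGT\n+\nIIII\n@read_b/1\nTTGA\n+\nIIII\n")
print(count_fastq_records(example))  # 2
```

<p>For a real file, pass an open file handle instead, e.g. open("reads_1.fq").</p>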
<p>If TopHat ran properly, your output would look something like this using IGV and viewing the accepted_hits.bam file:</p>
<p><a href="http://davetang.org/muse/2012/05/09/getting-started-with-tophat/test_data_igv/" rel="attachment wp-att-1096"><img src="http://davetang.org/muse/wp-content/uploads/2012/05/test_data_igv-1024x640.png" alt="" title="test_data_igv" width="640" height="400" class="aligncenter size-large wp-image-1096" /></a></p>
<p><strong>Continuing with the analysis</strong></p>
<p>Build index or download from <a href="ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg19.ebwt.zip">ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/hg19.ebwt.zip</a></p>
<pre class="brush: bash; title: ; notranslate">
bowtie2-build /path/to/hg19 hg19
</pre>
<p><strong>Running TopHat</strong></p>
<pre class="brush: bash; title: ; notranslate">
tophat /path/to/hg19 reads1.fq,reads2.fq,reads3.fq
</pre>
<p>To be updated; currently indexing hg19.</p>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/05/09/getting-started-with-tophat/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting started with Circos</title>
		<link>http://davetang.org/muse/2012/05/08/getting-started-with-circos/</link>
		<comments>http://davetang.org/muse/2012/05/08/getting-started-with-circos/#comments</comments>
		<pubDate>Tue, 08 May 2012 13:20:34 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1066</guid>
		<description><![CDATA[Getting Circos working on Ubuntu. For more information, see http://circos.ca/software/download/circos/.]]></description>
			<content:encoded><![CDATA[<p>Getting Circos working on Ubuntu. For more information, see <a href="http://circos.ca/software/download/circos/">http://circos.ca/software/download/circos/</a>.</p>
<pre class="brush: bash; title: ; notranslate">
cat /etc/lsb-release
#DISTRIB_ID=Ubuntu
#DISTRIB_RELEASE=12.04
#DISTRIB_CODENAME=precise
#DISTRIB_DESCRIPTION=&quot;Ubuntu 12.04 LTS&quot;
wget http://circos.ca/distribution/circos-0.60.tgz
tar -xzf circos-0.60.tgz
sudo cpan App::cpanminus
#change directory into where you unzipped circos and then the bin directory
cd circos-0.60/bin
test.modules &gt; toget
cat toget | grep -v &quot;^ok&quot;
#fail Config::General is not usable (it or a sub-module is missing)
#fail Font::TTF::Font is not usable (it or a sub-module is missing)
#fail GD is not usable (it or a sub-module is missing)
#fail GD::Image is not usable (it or a sub-module is missing)
#fail GD::Polyline is not usable (it or a sub-module is missing)
#fail Math::Bezier is not usable (it or a sub-module is missing)
#fail Math::VecStat is not usable (it or a sub-module is missing)
#fail Readonly is not usable (it or a sub-module is missing)
#fail Regexp::Common is not usable (it or a sub-module is missing)
#fail Set::IntSpan is not usable (it or a sub-module is missing)
#fail Text::Format is not usable (it or a sub-module is missing)

sudo cpanm Config::General
sudo cpanm Font::TTF::Font
sudo cpanm GD
sudo cpanm GD::Image
sudo cpanm GD::Polyline
sudo cpanm Math::Bezier
sudo cpanm Math::VecStat
sudo cpanm Readonly
sudo cpanm Regexp::Common
sudo cpanm Set::IntSpan
sudo cpanm Text::Format

test.modules  | grep ^fail
#fail GD is not usable (it or a sub-module is missing)
#fail GD::Image is not usable (it or a sub-module is missing)
#fail GD::Polyline is not usable (it or a sub-module is missing)

sudo apt-get -y install libgd2-xpm-dev build-essential

sudo cpanm GD
sudo cpanm GD::Image
sudo cpanm GD::Polyline

#all modules should be installed now
test.modules | grep &quot;^fail&quot;

#test GD
gddiag
#see if gddiag.png looks the same as the image at http://www.circos.ca/tutorials/lessons/configuration/png_output/images

#But I was still getting an error about error/configuration.missing.txt
circos 

#debuggroup conf 0.09s welcome to circos v0.60 4 May 2012
#debuggroup conf 0.09s guessing configuration file
#
#  *** CIRCOS ERROR ***
#
#  CONFIGURATION FILE ERROR

#  ...error text from [error/configuration.missing.txt] could not be read...

#  If you are having trouble debugging this error, use this tutorial to learn how
#  to use the debugging facility

#      http://www.circos.ca/tutorials/lessons/configuration/debugging

#  If you're still stumped, get support in the Circos Google Group

#      http://groups.google.com/group/circos-data-visualization

#  Stack trace:
# at /home/tan118/src/circos-0.60/bin/../lib/Circos/Error.pm line 325
#	Circos::Error::fatal_error('configuration', 'missing') called at /home/tan118/src/circos-0.60/bin/../lib/Circos.pm line 152
#	Circos::run('Circos') called at bin/circos line 232

#I wrote to the author of Circos and it turns out that
#the configuration.missing.txt file is missing from the tarball
#download the file and put it inside the error directory
wget http://davetang.org/file/configuration.missing.txt
mv configuration.missing.txt error

#In retrospect, the missing file above wouldn't have been a problem
#i.e. you could still run Circos if you gave it the right parameters
#However I wasn't sure what the error message:
#...error text from [error/configuration.missing.txt] could not be read...
#meant
</pre>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/05/08/getting-started-with-circos/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Equivalents in R, Python and Perl</title>
		<link>http://davetang.org/muse/2012/05/03/equivalents-in-r-python-and-perl/</link>
		<comments>http://davetang.org/muse/2012/05/03/equivalents-in-r-python-and-perl/#comments</comments>
		<pubDate>Thu, 03 May 2012 14:01:02 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=55</guid>
		<description><![CDATA[I&#8217;ve used Perl the most and find myself using R more and more due to the statistical packages. It seems that more and more people are switching from Perl to Python, at least in bioinformatics, thus I&#8217;ve started this page &#8230; <a href="http://davetang.org/muse/2012/05/03/equivalents-in-r-python-and-perl/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve used Perl the most and find myself using R more and more due to the statistical packages. It seems that more and more people are switching from Perl to Python, at least in bioinformatics, thus I&#8217;ve started this page to help me learn Python (and R).</p>
<p>For loops</p>
<p>In R:</p>
<pre class="brush: r; title: ; notranslate">
for (x in c(0:9))
   print(x)
</pre>
<p>In Python (indentation is required as part of the language):</p>
<pre class="brush: python; title: ; notranslate">
#!/usr/bin/python
for num in range(0, 10):
   print(num)
</pre>
<p>In Perl:</p>
<pre class="brush: perl; title: ; notranslate">
#!/usr/bin/perl
for (my $i = 0; $i &lt; 10; ++$i){
   print &quot;$i\n&quot;;
}
</pre>
<p>While loops</p>
<p>In R:</p>
<pre class="brush: r; title: ; notranslate">
n &lt;- 10
i &lt;- 1
while(i &lt;= n) {
   print(i)
   i &lt;- i + 1
}
</pre>
<p>In Python:</p>
<pre class="brush: python; title: ; notranslate">
#!/usr/bin/python

count = 0
#stop at 10 so that 0 to 9 are printed, matching the Perl loop
while count &lt; 10:
   print(count)
   count = count + 1
</pre>
<p>In Perl:</p>
<pre class="brush: perl; title: ; notranslate">
#!/usr/bin/perl
my $i = 0;
while ($i &lt; 10){
   print &quot;$i\n&quot;;
   ++$i;
}
</pre>
<p>Arrays</p>
<p>In Python (more information <a href="http://www.dreamincode.net/forums/topic/82900-working-with-arrays-in-python/">here</a>):</p>
<pre class="brush: python; title: ; notranslate">
#!/usr/bin/python

#import only the array type instead of everything in the module
from array import array
#'i' is the type code for signed integers
a = array('i', [1, 2, 3, 4, 5])
for i in a:
   print(i)
</pre>
<p>This page will be continually updated.</p>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/05/03/equivalents-in-r-python-and-perl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Creating UCSC Genome Browser custom tracks with links</title>
		<link>http://davetang.org/muse/2012/04/27/creating-ucsc-genome-browser-custom-tracks-with-links/</link>
		<comments>http://davetang.org/muse/2012/04/27/creating-ucsc-genome-browser-custom-tracks-with-links/#comments</comments>
		<pubDate>Fri, 27 Apr 2012 11:41:02 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1051</guid>
		<description><![CDATA[An extremely useful feature of the UCSC Genome Browser, which I have been using for many years, is the ability to create links from the genomic features in your custom track. For more information, see this page, step 5. For example, &#8230; <a href="http://davetang.org/muse/2012/04/27/creating-ucsc-genome-browser-custom-tracks-with-links/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>An extremely useful feature of the UCSC Genome Browser, which I have been using for many years, is the ability to create links from the genomic features in your custom track. For more information, see <a href="https://cgwb.nci.nih.gov/goldenPath/help/customTrack.html">this page</a>, step 5.</p>
<p>For example, I want to make a bed file of SNPs and load them as a custom track. When displayed on the Genome Browser, I want to go directly to the dbSNP website by clicking on a feature. All you need to do is add this track line to your custom bed file; the $$ in the url is replaced by the name of the feature that was clicked:</p>
<p>track name=&quot;dbSNPs&quot; description=&quot;SNPs from dbSNP&quot; url=&quot;http://www.ncbi.nlm.nih.gov/snp/?term=$$&quot;</p>
<p>Then when you click on the SNP, you should see a page with an &#8220;Outside Link&#8221;.</p>
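<p>As a rough illustration, here is a small Python sketch that builds such a custom track as text. The function name, coordinates and rs identifiers are placeholders I made up; only the track line itself comes from the example above. The fourth BED column (the feature name) is what gets substituted for $$ in the url.</p>

```python
def snp_track_text(snps):
    """Build the text of a BED custom track whose features link out to dbSNP."""
    lines = [
        'track name="dbSNPs" description="SNPs from dbSNP" '
        'url="http://www.ncbi.nlm.nih.gov/snp/?term=$$"'
    ]
    for chrom, start, end, rs_id in snps:
        # BED columns are chrom, start, end, name; the name replaces $$ in the url
        lines.append("\t".join([chrom, str(start), str(end), rs_id]))
    return "\n".join(lines) + "\n"


# Placeholder coordinates and identifiers, for illustration only
example_snps = [("chr1", 1000, 1001, "rs0000001"), ("chr2", 2000, 2001, "rs0000002")]
print(snp_track_text(example_snps), end="")
```

<p>Saving the printed text to a file gives something you can upload on the custom track page.</p>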
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/04/27/creating-ucsc-genome-browser-custom-tracks-with-links/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Learning to use biomaRt</title>
		<link>http://davetang.org/muse/2012/04/27/learning-to-use-biomart/</link>
		<comments>http://davetang.org/muse/2012/04/27/learning-to-use-biomart/#comments</comments>
		<pubDate>Fri, 27 Apr 2012 09:20:40 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[annotation]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1040</guid>
		<description><![CDATA[In the past I&#8217;ve been manually downloading tables of annotation data and parsing them with Perl. I guess it&#8217;s time to do things more elegantly. Below is code taken from the biomaRt vignette: The vignette contains other cool examples, which &#8230; <a href="http://davetang.org/muse/2012/04/27/learning-to-use-biomart/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>In the past I&#8217;ve been manually downloading tables of annotation data and parsing them with Perl. I guess it&#8217;s time to do things more elegantly. Below is code taken from the <a href="http://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.pdf">biomaRt vignette</a>:</p>
<pre class="brush: r; title: ; notranslate">

library(&quot;biomaRt&quot;)
listMarts()
ensembl=useMart(&quot;ensembl&quot;)
listDatasets(ensembl)
ensembl = useMart(&quot;ensembl&quot;,dataset=&quot;hsapiens_gene_ensembl&quot;)
#building a query, requires filters, attributes and values
#listFilters shows all filters
filters = listFilters(ensembl)
#listAttributes shows all attributes
attributes = listAttributes(ensembl)
#the getBM function is the main query function in biomaRt
#as an example, let's convert affy ids into Entrez gene ids
affyids=c(&quot;202763_at&quot;,&quot;209310_s_at&quot;,&quot;207500_at&quot;)
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)
</pre>
<p>The vignette contains other cool examples, which you should look at if you are interested in using biomaRt.</p>
<p>I will try to build a query that converts a human RefSeq id into the Ensembl gene id:</p>
<pre class="brush: r; title: ; notranslate">

library(&quot;biomaRt&quot;)
ensembl = useMart(&quot;ensembl&quot;,dataset=&quot;hsapiens_gene_ensembl&quot;)
filters = listFilters(ensembl)
#look for filters with RefSeq
grep(&quot;refseq&quot;, filters$name, ignore.case=T, value=T)
# [1] &quot;with_refseq_peptide_predicted&quot;  &quot;with_ox_refseq_genomic&quot;         &quot;with_ox_refseq_mrna&quot;            &quot;with_ox_refseq_mrna_predicted&quot;
# [5] &quot;with_ox_refseq_ncrna&quot;           &quot;with_ox_refseq_ncrna_predicted&quot; &quot;refseq_mrna&quot;                    &quot;refseq_mrna_predicted&quot;
# [9] &quot;refseq_ncrna&quot;                   &quot;refseq_ncrna_predicted&quot;         &quot;refseq_peptide&quot;                 &quot;refseq_peptide_predicted&quot;
#[13] &quot;refseq_genomic&quot;
attributes = listAttributes(ensembl)

#RefSeq for beta actin
my_refseq &lt;- 'NM_001101'
getBM(attributes='ensembl_gene_id', filters = 'refseq_mrna', values = my_refseq , mart = ensembl)
#  ensembl_gene_id
#1 ENSG00000075624
getBM(attributes=c('ensembl_gene_id','description'), filters = 'refseq_mrna', values = my_refseq , mart = ensembl)
#  ensembl_gene_id                              description
#1 ENSG00000075624 actin, beta [Source:HGNC Symbol;Acc:132]
</pre>
<p>And lastly an example taken from <a href="http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf">http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf</a>:</p>
<pre class="brush: r; title: ; notranslate">

snp &lt;- useMart(&quot;snp&quot;,dataset=&quot;hsapiens_snp&quot;)
out=getBM(attributes=c(&quot;refsnp_id&quot;,&quot;allele&quot;,&quot;chrom_start&quot;), filters=c(&quot;chr_name&quot;,&quot;chrom_start&quot;,&quot;chrom_end&quot;), values=list(8,148350, 158612), mart=snp)
nrow(out)
#[1] 465
head(out)
#   refsnp_id allele chrom_start
#1 rs78403279    C/A      148354
#2  rs4057463    T/G      148382
#3  rs3869584    T/C      148423
#4  rs4066939    T/C      148423
#5 rs71213765    -/T      148473
#6  rs3870584    G/A      148523
</pre>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/04/27/learning-to-use-biomart/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making a barplot in R</title>
		<link>http://davetang.org/muse/2012/04/26/making-a-barplot-in-r/</link>
		<comments>http://davetang.org/muse/2012/04/26/making-a-barplot-in-r/#comments</comments>
		<pubDate>Thu, 26 Apr 2012 10:26:53 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1032</guid>
		<description><![CDATA[Just a short post on making barplots in R after reading in data via the read.table() function. I created a file with two rows, the first row containing the header and the second row containing the data values. a b &#8230; <a href="http://davetang.org/muse/2012/04/26/making-a-barplot-in-r/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Just a short post on making barplots in R after reading in data via the read.table() function. I created a file with two rows, the first row containing the header and the second row containing the data values.</p>
<pre>
a       b       c       d       e
10      20      30      20      10
</pre>
<pre class="brush: r; title: ; notranslate">

x &lt;- read.table(&quot;some.file&quot;)
#the data needs to be in matrix format for barplot()
barplot(as.matrix(x[2,]))
#to label the x-axis
barplot(as.matrix(x[2,]),names.arg=as.matrix(x[1,]))
#to label the x-axis with labels perpendicular to the axis (las=2)
barplot(as.matrix(x[2,]),names.arg=as.matrix(x[1,]),las=2)
#export as postscript
postscript(file=&quot;some_file.ps&quot;)
barplot(as.matrix(x[2,]),names.arg=as.matrix(x[1,]),las=2)
dev.off()
</pre>
<p><a href="http://davetang.org/muse/2012/04/26/making-a-barplot-in-r/barplot/" rel="attachment wp-att-1037"><img src="http://davetang.org/muse/wp-content/uploads/2012/04/barplot.jpg" alt="" title="barplot" width="622" height="580" class="aligncenter size-full wp-image-1037" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/04/26/making-a-barplot-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comparing different distributions</title>
		<link>http://davetang.org/muse/2012/04/17/comparing-different-distributions/</link>
		<comments>http://davetang.org/muse/2012/04/17/comparing-different-distributions/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 15:09:35 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1022</guid>
		<description><![CDATA[I recently learned of the Kolmogorov-Smirnov Test and how one can use it to test whether two datasets are likely to be different. Strictly speaking, the p-value gives us a probability of whether or not we can reject the null &#8230; <a href="http://davetang.org/muse/2012/04/17/comparing-different-distributions/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I recently learned of the Kolmogorov-Smirnov test and how one can use it to test whether two datasets are likely to be different. Strictly speaking, the p-value is the probability of seeing a difference at least as large as the observed one if the null hypothesis, which is that the two datasets have the same distribution, were true; a small p-value lets us reject the null. Using R:</p>
<pre class="brush: r; title: ; notranslate">

x &lt;- rpois(n=1000,lambda=100)
y &lt;- rpois(n=1000,lambda=100)
ks.test(x,y)

#        Two-sample Kolmogorov-Smirnov test
#
#data:  x and y
#D = 0.028, p-value = 0.828
#alternative hypothesis: two-sided
#
#Warning message:
#In ks.test(x, y) : p-values will be approximate in the presence of ties
</pre>
<p>We cannot reject the null hypothesis that the two datasets have the same distribution, which is the right call in this case i.e. they are both from a Poisson distribution with the same lambda.</p>
<pre class="brush: r; title: ; notranslate">

x &lt;- rpois(n=1000,lambda=100)
y &lt;- rnorm(n=1000,mean=100)
ks.test(x,y)

#        Two-sample Kolmogorov-Smirnov test

#data:  x and y
#D = 1, p-value &lt; 2.2e-16
#alternative hypothesis: two-sided

#Warning message:
#In ks.test(x, y) : p-values will be approximate in the presence of ties
</pre>
<p>In this case we get an extremely low p-value, so we can reject the null hypothesis that the two distributions are the same, and indeed they are not (one is a Normal distribution and the other a Poisson distribution).</p>
<p>The warning messages arise because the KS test assumes a continuous distribution, in which identical values (i.e. ties) should not occur; when there are ties, R reports an approximate p-value. I&#8217;ve read several sources and they all mention that the KS test can deal with both discrete and continuous data (I&#8217;m guessing because it mainly deals with cumulative quantiles) but I&#8217;m not sure about the implementation in R.</p>
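<p>For intuition, the D value in the output above is the largest vertical distance between the two empirical cumulative distribution functions. Below is a minimal Python 3 sketch of that calculation using only the standard library. The function name is my own, it does not compute a p-value, and it is not meant to mirror how R implements ks.test().</p>

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest absolute difference between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The ECDFs only change at observed values, so checking those is enough
    for value in set(xs) | set(ys):
        fx = bisect_right(xs, value) / len(xs)  # fraction of x at or below value
        fy = bisect_right(ys, value) / len(ys)  # fraction of y at or below value
        d = max(d, abs(fx - fy))
    return d


print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 for identical samples
print(ks_statistic([1, 2, 3, 4], [5, 6, 7, 8]))  # 1.0 for non-overlapping samples
```

<p>Ties are not a problem for computing D itself, since tied values simply make the ECDF jump by more than one step at once; it is the p-value that depends on the continuity assumption.</p>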
<p>Related <a href="http://www.physics.csbsju.edu/stats/KS-test.html">http://www.physics.csbsju.edu/stats/KS-test.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/04/17/comparing-different-distributions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Twitter</title>
		<link>http://davetang.org/muse/2012/04/17/twitter/</link>
		<comments>http://davetang.org/muse/2012/04/17/twitter/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 02:10:43 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[/etc]]></category>
		<category><![CDATA[bioinformatics]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=1018</guid>
		<description><![CDATA[Today while reading a paper, I found some interesting one-liner facts. They are way too short to create a post on but I would like to make a repository of them. What better place to store these facts than Twitter! &#8230; <a href="http://davetang.org/muse/2012/04/17/twitter/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Today while reading a paper, I found some interesting one-liner facts. They are way too short to create a post on but I would like to make a repository of them. What better place to store these facts than Twitter!</p>
<p>You can <a href="https://twitter.com/#!/davetangdotorg">follow me on Twitter</a> for a list of facts on molecular biology and on bioinformatics that I didn&#8217;t know or have forgotten about over the years.</p>
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/04/17/twitter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Variance in RNA-Seq data</title>
		<link>http://davetang.org/muse/2012/04/14/variance-in-rna-seq-data/</link>
		<comments>http://davetang.org/muse/2012/04/14/variance-in-rna-seq-data/#comments</comments>
		<pubDate>Sat, 14 Apr 2012 05:06:35 +0000</pubDate>
		<dc:creator>Davo</dc:creator>
				<category><![CDATA[bioinformatics]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[DGE]]></category>

		<guid isPermaLink="false">http://davetang.org/muse/?p=976</guid>
		<description><![CDATA[Using data from this paper. Generate some random data from a Poisson distribution]]></description>
			<content:encoded><![CDATA[<p>Using data from this <a href="http://www.pnas.org/content/early/2008/12/16/0807121105">paper</a>.</p>
<pre class="brush: r; title: ; notranslate">

data &lt;- read.table(&quot;pnas_expression_filtered.tsv&quot;,header=T,row.names=1)
lanes &lt;- data[,c(&quot;lane1&quot;,&quot;lane2&quot;,&quot;lane3&quot;,&quot;lane4&quot;)]
#per-gene mean and variance across the four lanes, on a log2 scale
#(named log2_mean and log2_var to avoid masking the mean() and var() functions)
log2_mean &lt;- log2(rowMeans(lanes))
log2_var &lt;- log2(apply(lanes,1,var))

plot(log2_var,log2_mean,xlab=&quot;var&quot;,ylab=&quot;mean&quot;)
</pre>
<p><a href="http://davetang.org/muse/2012/04/14/variance-in-rna-seq-data/var_vs_mean/" rel="attachment wp-att-978"><img src="http://davetang.org/muse/wp-content/uploads/2012/04/var_vs_mean.png" alt="" title="var_vs_mean" width="640" height="640" class="aligncenter size-full wp-image-978" /></a></p>
<pre class="brush: r; title: ; notranslate">

library(&quot;edgeR&quot;)
d &lt;- data[,1:4]
group &lt;- c(rep(&quot;Control&quot;,4))
d &lt;- DGEList(counts = d, group=group)
d &lt;- calcNormFactors(d)
d &lt;- estimateCommonDisp(d)
d$common.dispersion
#[1] 0.01897251
sqrt(d$common.dispersion)
#[1] 0.137740
getPriorN(d)
d &lt;- estimateTagwiseDisp(d, prop.used=0.5, grid.length=500)
summary(d$tagwise.dispersion)
#    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#0.008581 0.017790 0.034540 0.052080 0.070670 0.640000
</pre>
<p>Generate some random data from a Poisson distribution:</p>
<pre class="brush: r; title: ; notranslate">
data &lt;- matrix(rep(&quot;0&quot;,15000*6),ncol=6,nrow=15000)
for (i in c(1:15000)){
   mean &lt;- sample(x=100000,size=1);
   blah &lt;- c(rpois(n=6,lambda=mean));
   for (j in c(1:6)){
      data[i,j] &lt;- blah[j]
   }
}
head(data)
#     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]
#[1,] &quot;90542&quot; &quot;90596&quot; &quot;90713&quot; &quot;90803&quot; &quot;90454&quot; &quot;90783&quot;
#[2,] &quot;78677&quot; &quot;78763&quot; &quot;78622&quot; &quot;78126&quot; &quot;78575&quot; &quot;78706&quot;
#[3,] &quot;67682&quot; &quot;67740&quot; &quot;67558&quot; &quot;67064&quot; &quot;67567&quot; &quot;67649&quot;
#[4,] &quot;84725&quot; &quot;85171&quot; &quot;84971&quot; &quot;84136&quot; &quot;84704&quot; &quot;84885&quot;
#[5,] &quot;22303&quot; &quot;22338&quot; &quot;22320&quot; &quot;22089&quot; &quot;22247&quot; &quot;22465&quot;
#[6,] &quot;58768&quot; &quot;58938&quot; &quot;59048&quot; &quot;58856&quot; &quot;58958&quot; &quot;58651&quot;
data2 &lt;- matrix(as.numeric(data),ncol=6,nrow=15000)
library(&quot;edgeR&quot;)
d &lt;- data2[,1:6]
group &lt;- c(rep(&quot;Control&quot;,6))
d &lt;- DGEList(counts = d, group=group)
d &lt;- calcNormFactors(d)
d &lt;- estimateCommonDisp(d)
d$common.dispersion
#[1] 0.0001005378
sqrt(d$common.dispersion)
#[1] 0.01002685
getPriorN(d)
d &lt;- estimateTagwiseDisp(d, prop.used=0.5, grid.length=500)
summary(d$tagwise.dispersion)
#     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#1.571e-06 1.571e-06 1.571e-06 2.225e-06 1.571e-06 8.297e-05

mean2 &lt;- ''
for (i in 1:nrow(data2)){
   mean2[i] &lt;- log2(mean(data2[i,]))
}

var2 &lt;- ''
for (i in 1:nrow(data2)){
   var2[i] &lt;- log2(var(data2[i,]))
}

cor(as.numeric(var2),as.numeric(mean2))
#[1] 0.821295
cor(as.numeric(var2),as.numeric(mean2),method=&quot;spearman&quot;)
#[1] 0.7203007

plot(var2,mean2)
</pre>
<p><a href="http://davetang.org/muse/2012/04/14/variance-in-rna-seq-data/poisson_var2_mean2/" rel="attachment wp-att-1007"><img src="http://davetang.org/muse/wp-content/uploads/2012/04/poisson_var2_mean2.png" alt="" title="poisson_var2_mean2" width="660" height="598" class="aligncenter size-full wp-image-1007" /></a></p>
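<p>The strong correlation between log2 variance and log2 mean in the simulated data is expected, because the variance of a Poisson distribution equals its mean. Below is a small Python sketch (not R, and standard library only) that checks this numerically. The sampler uses Knuth's multiplication method, which is only practical for small lambda, so the lambdas here are much smaller than the ones used above; the function name and the chosen values are my own.</p>

```python
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    # Keep multiplying uniform draws until the product drops to exp(-lam) or below
    while product > limit:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(0)  # fixed seed so the run is repeatable
for lam in (5, 20, 50):
    counts = [poisson_sample(lam, rng) for _ in range(5000)]
    # The sample mean and the sample variance should both be close to lam
    print(lam, round(statistics.mean(counts), 2), round(statistics.variance(counts), 2))
```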
]]></content:encoded>
			<wfw:commentRss>http://davetang.org/muse/2012/04/14/variance-in-rna-seq-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

