ARCHS4 (All RNA-seq and ChIP-seq sample and signature search) is a resource that provides access to gene and transcript counts uniformly processed (using kallisto) from all human and mouse RNA-seq experiments from the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA). The tool gget and its archs4 sub-tool can be used to query ARCHS4 with your gene of interest. I've previously written about how you can use gget to check where a gene is expressed from the command line. In this post, I'll show how you can use gget archs4 to find genes that are correlated (by expression pattern) with your gene of interest, and how you can generate a heatmap from the list of correlated genes (plus your gene of interest).
The scripts needed to generate the heatmap are hosted on GitHub. You can clone the repository using git; the necessary scripts are in the script directory.
git clone https://github.com/davetang/archs4_heatmap.git
You will need to install some dependencies if you plan to use the scripts in your environment. Alternatively, I have also prepared a Docker image that contains all the dependencies so you can generate the heatmap easily.
The plot_heatmap.sh script does all the work. The usage is listed below.
Usage: ./script/plot_heatmap.sh
[ -p | --max-procs INT (default 8) ]
[ -t | --tmp-dir STR (default /tmp) ]
[ -k | --keep keep tmp files ]
[ -s | --species STR (default human) ]
[ -n | --num-genes INT (default 100) ]
[ -c | --cluster-cols ]
[ -v | --version ]
[ -h | --help ]
<HGNC gene symbol>
The only mandatory input is the official HGNC gene symbol. The script first runs gget to get the list of genes correlated with your gene of interest, then runs gget again for each gene in the list, and finally runs heatmap.R to generate the heatmap.
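The two kinds of gget queries described above can be sketched roughly as follows. This is not the actual plot_heatmap.sh. It assumes that the per-gene step is a tissue expression query, that gget archs4 accepts the -gc, -w, -s, -csv and -o options, and it uses hypothetical output file names. By default the sketch only prints the commands; set DRY_RUN=0 to execute them (this requires gget and network access).

```shell
#!/usr/bin/env bash
# Sketch only: approximates the gget steps described above,
# it is NOT the actual plot_heatmap.sh. Output file names are hypothetical.
set -eu

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: fetch the genes most correlated with the gene of interest.
# Assumption: -gc sets how many correlated genes gget returns.
correlated_genes() {
  local gene="$1" num_genes="${2:-100}"
  run gget archs4 -gc "$num_genes" -csv -o "${gene}_correlated.csv" "$gene"
}

# Step 2: fetch tissue expression for one gene (repeat per correlated gene).
# Assumption: the per-gene query is a tissue expression lookup (-w tissue).
tissue_expression() {
  local gene="$1" species="${2:-human}"
  run gget archs4 -w tissue -s "$species" -csv -o "${gene}_tissue.csv" "$gene"
}

correlated_genes TNF 50
tissue_expression TNF human
```

Running the per-gene step for many genes is where a parallelism setting such as --max-procs would matter, since each gene needs its own query.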
If you have Docker installed, you can simply run the following to fetch the 50 genes most correlated with TNF from ARCHS4 and plot the results as a heatmap.
docker run --rm -v $(pwd):$(pwd) -w $(pwd) davetang/archs4_heatmap:0.0.4 -p 4 -n 50 TNF
The command above will generate TNF_top50.png and TNF_top50.csv. The CSV file contains the expression data used for the heatmap, which looks like this.
I'm not sure what the upper limit is on the number of genes you can request from gget, but here's the top 150.
docker run --rm -v $(pwd):$(pwd) -w $(pwd) davetang/archs4_heatmap:0.0.4 -p 4 -n 150 TNF
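Each run writes a CSV next to the PNG, so a quick size check can confirm that the requested number of genes came back. The sketch below assumes the first line of the CSV is a header, that no field contains a quoted comma, and that the output follows the same naming pattern as before (TNF_top150.csv); none of these are confirmed by the script's documentation.

```shell
#!/usr/bin/env bash
# Sketch only: report the dimensions of a CSV produced by a run.
# Assumes a header row and no quoted commas inside fields.
set -eu

# Print "<data rows> rows x <columns> columns" for the given CSV file.
csv_shape() {
  awk -F',' 'NR == 1 { cols = NF } END { printf "%d rows x %d columns\n", NR - 1, cols }' "$1"
}

# Assumed file name, following the TNF_top50.csv pattern.
if [ -f TNF_top150.csv ]; then
  csv_shape TNF_top150.csv
  head -n 5 TNF_top150.csv
fi
```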
plot_heatmap.sh can cluster the columns, but this is turned off by default. To perform sample clustering, use the -c parameter. In the command below, I'm using more processors (8) and fetching the 200 genes most correlated with TNF.
docker run --rm -v $(pwd):$(pwd) -w $(pwd) davetang/archs4_heatmap:0.0.4 -c -p 8 -n 200 TNF
If you would like to keep the raw files used for generating the plot, use the -k parameter and specify where to store them with --tmp-dir. For example, the following command will save the raw files into the current directory.
docker run --rm -v $(pwd):$(pwd) -w $(pwd) davetang/archs4_heatmap:0.0.4 -p 6 -k -t $(pwd) CCL2
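To generate heatmaps for several genes in one go, the Docker command can be wrapped in a loop. The sketch below reuses the image tag and options from the commands above; the helper names (run, heatmap_for) and the DRY_RUN switch are hypothetical additions for illustration. By default it only prints the commands; set DRY_RUN=0 to actually run Docker.

```shell
#!/usr/bin/env bash
# Sketch only: loop the Docker command from this post over several genes.
set -eu

IMAGE="davetang/archs4_heatmap:0.0.4"

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Plot the top N correlated genes for one gene, using 4 processes.
heatmap_for() {
  local gene="$1" num_genes="${2:-100}"
  run docker run --rm -v "$(pwd):$(pwd)" -w "$(pwd)" "$IMAGE" \
    -p 4 -n "$num_genes" "$gene"
}

# The two genes used as examples in this post.
for g in TNF CCL2; do
  heatmap_for "$g" 50
done
```

Quoting "$(pwd)" keeps the volume mount intact when the working directory contains spaces.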
For more information, check out the GitHub repo, and submit an issue if you run into any problems!
Please cite the following if you use this for your work:
- Efficient querying of genomic reference databases with gget
- Massive mining of publicly available RNA-seq data from human and mouse
You can cite this blog post and/or the GitHub repo if you found this useful.
This post was sponsored by Logos Biosystems:
Logos Biosystems specializes in developing cutting-edge life science imaging solutions that empower researchers to explore beyond the cellular level. The company’s scientist-led team creates accessible and affordable tools applicable to a broad spectrum of research areas, from drug discovery to agriculture. Logos Biosystems offers automated cell counting, digital cell imaging, and tissue clearing 3D imaging technologies to bolster research endeavors.
This work is licensed under a Creative Commons Attribution 4.0 International License.