An example RNA-seq count table

I have been using pnas_expression.txt as a test dataset for count table analyses for many years. It was created by Davis McCarthy and was originally hosted on their Google Sites website; after the site became unavailable, I started hosting it on my own web server. The RNA-seq libraries were generated using a procedure based on double-random priming and solid-phase selection, and were sequenced on an Illumina/Solexa flowcell.

pnas_expression.txt has served me well for many years, but I was wondering whether a more modern example RNA-seq count table existed. I did find some example datasets on Zenodo, but it wasn't entirely clear how the data were generated. Then I thought I'd just generate my own example, because that way I'd know all the details. (I confess that I don't actually know the details of how pnas_expression.txt was generated; all the more reason to generate my own example!) In this post, I go through how I generated an example RNA-seq count table from raw FASTQ files.

I chose data from the paper The transcriptional landscape and mutational profile of lung adenocarcinoma, which I found some time ago when I wanted to compare some lung adenocarcinoma RNA-seq data I was working on with publicly available data. The study deposited its data and metadata to GEO under the accession GSE40419. The libraries were sequenced on an Illumina HiSeq 2000.

Shown below are the steps to download the raw FASTQ files for eight RNA-seq samples (four controls and four carcinomas). (The data are actually paired: each patient contributed both a control and a carcinoma sample.)

ffq GSE40419 > GSE40419.json
cat GSE40419.json | jq '.[].geo_samples[].samples[].experiments[].runs[].files.ftp[]' > GSE40419_runs.json
cat GSE40419.json | jq '.[].geo_samples[].samples[] | {title: .title, run_accession: .experiments[].runs[].accession}' > GSE40419_title.json
cat GSE40419_title.json | ./json_to_csv.pl | grep "C[1235]$" | cut -f1 -d',' > runs.txt
cat GSE40419_runs.json | ./json_to_csv.pl | grep -f runs.txt | cut -f7 -d',' | wget -i - -nc -P fastq
cat GSE40419_runs.json | ./json_to_csv.pl | grep -f runs.txt | awk '{print $6,"fastq/"$2}' FS=, OFS="  " > checksum.txt
md5sum -c checksum.txt

The first line uses ffq to fetch the metadata associated with GSE40419, including the FTP URLs that will be used to download the data.

The second and third lines use jq to parse GSE40419.json and create simplified JSON files that are much easier to parse.
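
To give an idea of the structure, each object in GSE40419_title.json pairs a sample title with its run accession, along the lines of the sketch below (the title and accession here are placeholders, not real values from the dataset):

{
  "title": "LC_C1",
  "run_accession": "SRR0000000"
}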

The fourth line generates runs.txt using a Perl script I wrote because I missed writing Perl. The file contains the run IDs that will be used to generate the count table.
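
I won't reproduce the script here, but a rough jq equivalent of what json_to_csv.pl does could look like the one-liner below; it joins each object's values into one comma-separated line, sorted by key name, which is an assumption on my part (alphabetical key order is what the downstream grep and cut field positions suggest):

cat GSE40419_title.json | jq -r 'to_entries | sort_by(.key) | map(.value | tostring) | join(",")'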

The fifth line downloads the FASTQ files for the run IDs listed in runs.txt into the fastq directory using wget.

The sixth line generates a checksum file for the FASTQ files, and the seventh line uses this checksum file to verify that the files were downloaded properly.
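
For reference, each line of checksum.txt contains an MD5 hash and a file path separated by two spaces, which is the format that md5sum -c expects (hence OFS being set to two spaces in the awk command). The hash and run ID below are placeholders:

0123456789abcdef0123456789abcdef  fastq/SRR0000000_1.fastq.gz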

Once the FASTQ files were downloaded and verified, I used nf-core/rnaseq (version 3.14.0) to process them and to produce the RNA-seq count tables. The exact command (inside a Bash script) that was used is shown below.

#!/usr/bin/env bash

set -euo pipefail

FASTA=/home/dtang/data/ensembl/release-112/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
GTF=/home/dtang/data/ensembl/release-112/Homo_sapiens.GRCh38.112.gtf.gz
THREADS=4
MAXMEM=50GB
DATETIME=$(date +%Y_%m_%d_%H_%M_%S)

export NXF_SINGULARITY_CACHEDIR=${HOME}/nf-core/sif

nextflow run ${HOME}/nf-core/rnaseq/3_14_0/main.nf \
    -resume \
    -with-report execution_report_${DATETIME}.html \
    -with-trace \
    -with-dag flowchart_${DATETIME}.html \
    --input samplesheet.csv \
    --outdir results \
    --fasta ${FASTA} \
    --gtf ${GTF} \
    --aligner star_rsem \
    --save_reference true \
    -profile singularity \
    --max_cpus ${THREADS} \
    --max_memory ${MAXMEM}

>&2 echo Done
exit 0
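
The samplesheet.csv passed via --input follows the standard nf-core/rnaseq samplesheet format, with one row per sample pointing to its FASTQ files. A minimal sketch is shown below; the sample names and run IDs are placeholders, not the actual values I used, and fastq_2 is left empty for single-end runs:

sample,fastq_1,fastq_2,strandedness
control_1,fastq/SRR0000001_1.fastq.gz,fastq/SRR0000001_2.fastq.gz,auto
cancer_1,fastq/SRR0000002_1.fastq.gz,fastq/SRR0000002_2.fastq.gz,auto

Setting strandedness to auto lets the workflow infer the library strandedness for each sample.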

The script above is available in my GitHub repo, which also contains instructions on how to get nf-core/rnaseq up and running. After the workflow completed successfully, the counts and TPM values generated by RSEM (along with the MultiQC report) were deposited to Zenodo. Below is the badge with the DOI:

[DOI badge]

I wanted to upload the BAM and BAM index files too, but the upload kept showing a pending status on Zenodo, so I gave up.

The very nice thing about using nf-core/rnaseq is that it is reproducible: all the tools used are packaged into Singularity images and the workflow is version controlled. In addition, the workflow is nicely documented on the nf-core website, so it is clear how the data were generated.
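
As an illustration of the version control, a specific release can also be run straight from the nf-core repository (instead of from a local clone, as I did above) by pinning the revision with Nextflow's -r option; a sketch with only the key options shown:

nextflow run nf-core/rnaseq \
    -r 3.14.0 \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    --fasta ${FASTA} \
    --gtf ${GTF}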

Finally, I carried out all of the steps on a refurbished HP ProDesk 600 G4 SF box that I bought on Amazon for around 180 USD for testing stuff. It runs Debian 12 (bookworm) and has a six-core CPU (an Intel Core i5-8500), which is good enough for small tasks. I upgraded the RAM to 64 GB and added a 1 TB SSD, which together cost more than the refurbished computer itself.

This work is licensed under a Creative Commons Attribution 4.0 International License.
