The {GenomicDataCommons} Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the Genomic Data Commons (GDC). The About the GDC webpage provides a brief description of the program:
The Genomic Data Commons (GDC) is a research program of the National Cancer Institute (NCI). The mission of the GDC is to provide the cancer research community with a unified repository and cancer knowledge base that enables data sharing across cancer genomic studies in support of precision medicine.
The National Cancer Institute, part of the National Institutes of Health (NIH), is the federal government's principal agency for cancer research and training. NCI’s mission is to lead, conduct, and support cancer research across the nation to advance scientific knowledge and help all people to live longer, healthier lives. NCI’s scope of work spans a broad spectrum of cancer research across a variety of disciplines and supports research training opportunities at career stages across the academic continuum.
If you need harmonised cancer genomics data, the GDC is a great place to start looking!
Installation
Use BiocManager::install() to install {GenomicDataCommons}.
if (! "GenomicDataCommons" %in% installed.packages()[, 1]){
BiocManager::install("GenomicDataCommons")
}
library(GenomicDataCommons)
packageVersion("GenomicDataCommons")
[1] ‘1.30.0’
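The check above scans the whole library with installed.packages(), which can be slow when many packages are installed. As a lighter alternative, here is a minimal sketch that uses requireNamespace() instead; note that missing_packages() is a helper name I made up for this post and is not part of BiocManager or any other package.

```r
# Return the subset of `pkgs` that cannot be found in the current library.
# missing_packages() is a hypothetical helper written for this post.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Packages that ship with R are always present, so nothing is reported here.
missing_packages(c("stats", "utils"))

# Typical use (commented out so that nothing gets installed by accident):
# to_install <- missing_packages("GenomicDataCommons")
# if (length(to_install) > 0) BiocManager::install(to_install)
```

The helper only reports what is missing; the decision to install stays with you.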
I will be using the Tidyverse for this post, so install it if you haven't already.
library(tidyverse)
Getting started
Before we get started, we should check the current status of the GDC to see if it's OK to query it. (For example, if there is a government shutdown, NIH services get shut down too.)
GenomicDataCommons::status()
stopifnot(GenomicDataCommons::status()$status=="OK")
$commit
[1] "48add4be7ac46e7db10e0c6f0e3010d5bb2a50aa"
$data_release
[1] "Data Release 41.0 - August 28, 2024"
$data_release_version
$data_release_version$major
[1] 41
$data_release_version$minor
[1] 0
$data_release_version$release_date
[1] "2024-08-28"
$status
[1] "OK"
$tag
[1] "7.7"
$version
[1] 1
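If the GDC cannot be reached at all, the status call itself is likely to fail with an error before stopifnot() ever sees a value. The following is a minimal sketch of a wrapper that turns both cases (an error or a non-"OK" status) into a plain FALSE. The name gdc_is_ok() and its status_fun argument are my own inventions; the argument only exists so that the logic can be tried without a network connection.

```r
# Return TRUE only if the status call succeeds and reports "OK".
# gdc_is_ok() is a hypothetical helper; `status_fun` defaults to the
# package function and can be swapped for a stand-in when offline.
gdc_is_ok <- function(status_fun = GenomicDataCommons::status) {
  res <- tryCatch(status_fun(), error = function(e) NULL)
  !is.null(res) && identical(res$status, "OK")
}

# Offline demonstration with stand-in functions (not real GDC responses).
gdc_is_ok(function() list(status = "OK"))
gdc_is_ok(function() stop("service unavailable"))
```

In a real session you would simply call gdc_is_ok() with no arguments and skip the queries when it returns FALSE.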
The following code builds a manifest that can be used to guide the download of raw transcriptomic data. The filtering performed below finds open (i.e., openly accessible) gene expression files quantified as raw counts using STAR (an RNA-seq aligner) from TCGA ovarian cancer patients.
ge_manifest <- files() %>%
filter(cases.project.project_id == 'TCGA-OV') %>%
filter(type == 'gene_expression' ) %>%
filter(access == 'open') %>%
filter(analysis.workflow_type == 'STAR - Counts') %>%
manifest()
head(ge_manifest)
# A tibble: 6 × 17
id data_format access file_name submitter_id data_category acl_1 type platform file_size
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 13af947c-a945-4b… TSV open 984c9d22… 962e286d-54… Transcriptom… open gene… Illumina 4230731
2 6ab604cb-ea60-44… TSV open b4388ff2… 5d2257c4-0a… Transcriptom… open gene… Illumina 4242021
3 0da0d708-16f2-45… TSV open 95493f7f… dd17894d-43… Transcriptom… open gene… Illumina 4234995
4 400dc107-3381-4f… TSV open cfdb5780… 0dad997d-7a… Transcriptom… open gene… Illumina 4271814
5 3fadac67-3b7c-42… TSV open 25ff83ac… 657b1f82-36… Transcriptom… open gene… Illumina 4255118
6 6d20dd34-f36e-46… TSV open d59c9b4d… 0c2cff25-73… Transcriptom… open gene… Illumina 4232670
# ℹ 7 more variables: created_datetime <chr>, md5sum <chr>, updated_datetime <chr>, file_id <chr>,
# data_type <chr>, state <chr>, experimental_strategy <chr>
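Since the manifest carries a file_size column (in bytes), it is easy to estimate how much disk space a download will need before starting it. Below is a minimal sketch; manifest_size_mb() is a name I made up, and the toy manifest only reuses the first three file sizes printed above with placeholder IDs.

```r
# Total size of the files listed in a manifest, in mebibytes.
# manifest_size_mb() is a hypothetical helper written for this post.
# as.numeric() avoids integer overflow when summing many file sizes.
manifest_size_mb <- function(manifest) {
  sum(as.numeric(manifest$file_size)) / 1024^2
}

# Toy manifest: placeholder IDs, file sizes copied from the output above.
toy_manifest <- data.frame(
  id = c("file-1", "file-2", "file-3"),
  file_size = c(4230731L, 4242021L, 4234995L)
)
round(manifest_size_mb(toy_manifest), 1)
```

With the real object, manifest_size_mb(ge_manifest) should work the same way, as long as the file_size column is present.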
The gdcdata function is used to download GDC files. In the example below I download gene counts as TSV files for the first three entries of the manifest.
fnames <- lapply(ge_manifest$id[1:3], gdcdata)
fnames
[[1]]
13af947c-a945-4bb3-9258-842f902e2f0c
"~/.cache/GenomicDataCommons/13af947c-a945-4bb3-9258-842f902e2f0c/984c9d22-35ae-4d5e-a506-30ef69568d46.rna_seq.augmented_star_gene_counts.tsv"
[[2]]
6ab604cb-ea60-44cd-93c7-1d93e5670802
"~/.cache/GenomicDataCommons/6ab604cb-ea60-44cd-93c7-1d93e5670802/b4388ff2-9482-41b4-a80b-a361ec3ef4d8.rna_seq.augmented_star_gene_counts.tsv"
[[3]]
0da0d708-16f2-45e4-a704-36bd69680e13
"~/.cache/GenomicDataCommons/0da0d708-16f2-45e4-a704-36bd69680e13/95493f7f-1071-44c5-b007-0dd5b8637c29.rna_seq.augmented_star_gene_counts.tsv"
Files are downloaded and stored in the directory specified by gdc_cache(), which you can already see from the output above (~/.cache/GenomicDataCommons/).
gdc_cache()
[1] "~/.cache/GenomicDataCommons"
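Once the files are in the cache, they can be read into R. The sketch below makes some assumptions about the layout of the augmented STAR gene count files that you should verify against your own downloads: a leading comment line starting with "#", a header row with a gene_id column, and a few alignment summary rows whose gene_id starts with "N_" (e.g., N_unmapped). The function name read_star_counts() is my own, and the toy file contains made-up counts.

```r
# Read a STAR gene counts TSV and drop the alignment summary rows.
# read_star_counts() is a hypothetical helper; the file layout is assumed.
read_star_counts <- function(path) {
  counts <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
  counts[!grepl("^N_", counts$gene_id), , drop = FALSE]
}

# Tiny made-up file that follows the assumed layout.
toy_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "# gene-model: GENCODE v36",
  "gene_id\tgene_name\tunstranded",
  "N_unmapped\t\t100",
  "N_multimapping\t\t50",
  "ENSG00000000003.15\tTSPAN6\t1234",
  "ENSG00000000005.6\tTNMD\t0"
), toy_file)

read_star_counts(toy_file)
```

If the assumptions hold, something like read_star_counts(fnames[[1]]) should return one row per gene for a downloaded file.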
The query below will fetch all available STAR gene counts that are open for download; we can use dim() to find out how many datasets we can download.
open_star_manifest <- files() %>%
filter(analysis.workflow_type == 'STAR - Counts') %>%
filter(access == 'open') %>%
manifest()
dim(open_star_manifest)
[1] 24788 17
Metadata queries
Queries in the {GenomicDataCommons} package follow the four metadata endpoints available at the GDC; there are four convenience functions that each create GDCQuery objects:
projects()
cases()
files()
annotations()
The four endpoints (projects, cases, files, and annotations) have various associated fields; below are the default fields.
endpoints <- c("projects", "cases", "files", "annotations")
sapply(endpoints, default_fields)
$projects
[1] "dbgap_accession_number" "disease_type" "intended_release_date"
[4] "name" "primary_site" "project_autocomplete"
[7] "project_id" "releasable" "released"
[10] "state"
$cases
[1] "aliquot_ids" "analyte_ids" "case_autocomplete"
[4] "case_id" "consent_type" "created_datetime"
[7] "days_to_consent" "days_to_lost_to_followup" "diagnosis_ids"
[10] "disease_type" "index_date" "lost_to_followup"
[13] "portion_ids" "primary_site" "sample_ids"
[16] "slide_ids" "state" "submitter_aliquot_ids"
[19] "submitter_analyte_ids" "submitter_diagnosis_ids" "submitter_id"
[22] "submitter_portion_ids" "submitter_sample_ids" "submitter_slide_ids"
[25] "updated_datetime"
$files
[1] "access" "acl"
[3] "average_base_quality" "average_insert_size"
[5] "average_read_length" "cancer_dna_fraction"
[7] "channel" "chip_id"
[9] "chip_position" "contamination"
[11] "contamination_error" "created_datetime"
[13] "data_category" "data_format"
[15] "data_type" "error_type"
[17] "experimental_strategy" "file_autocomplete"
[19] "file_id" "file_name"
[21] "file_size" "genome_doubling"
[23] "imaging_date" "magnification"
[25] "md5sum" "mean_coverage"
[27] "msi_score" "msi_status"
[29] "pairs_on_diff_chr" "plate_name"
[31] "plate_well" "platform"
[33] "proc_internal" "proportion_base_mismatch"
[35] "proportion_coverage_10x" "proportion_coverage_30x"
[37] "proportion_reads_duplicated" "proportion_reads_mapped"
[39] "proportion_targets_no_coverage" "read_pair_number"
[41] "revision" "stain_type"
[43] "state" "state_comment"
[45] "subclonal_genome_fraction" "submitter_id"
[47] "tags" "total_reads"
[49] "tumor_ploidy" "tumor_purity"
[51] "type" "updated_datetime"
[53] "wgs_coverage"
$annotations
[1] "annotation_autocomplete" "annotation_id" "case_id"
[4] "case_submitter_id" "category" "classification"
[7] "created_datetime" "entity_id" "entity_submitter_id"
[10] "entity_type" "legacy_created_datetime" "legacy_updated_datetime"
[13] "notes" "state" "status"
[16] "submitter_id" "updated_datetime"
Below is the number of available fields for each endpoint, i.e., not just the default fields. As you can see, the case and file datasets are richly annotated!
all_fields <- sapply(endpoints, available_fields)
names(all_fields) <- endpoints
sapply(all_fields, length)
projects cases files annotations
22 1172 1193 30
These fields can be used for filtering purposes to find an appropriate dataset.
head(all_fields$files)
[1] "access" "acl" "analysis.analysis_id"
[4] "analysis.analysis_type" "analysis.created_datetime" "analysis.input_files.access"
The facet function is useful for aggregating on the values of a particular field. For example, we can check how many datasets are open and how many are controlled (i.e., require permission).
files() %>% facet("access") %>% aggregations()
$access
doc_count key
1 690181 controlled
2 337336 open
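Raw counts are easier to interpret as percentages. Here is a minimal sketch that assumes the aggregation is a data frame with doc_count and key columns, as printed above; facet_proportions() is a name I made up, and the input below is typed in by hand from that output rather than fetched from the GDC.

```r
# Add a percentage column to a facet aggregation table.
# facet_proportions() is a hypothetical helper written for this post.
facet_proportions <- function(agg) {
  data.frame(
    key = agg$key,
    doc_count = agg$doc_count,
    percent = round(100 * agg$doc_count / sum(agg$doc_count), 1)
  )
}

# Counts copied by hand from the output above.
access_counts <- data.frame(
  doc_count = c(690181, 337336),
  key = c("controlled", "open")
)
facet_proportions(access_counts)
```

For these counts, roughly a third of the files are open access.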
Since there are many fields, we can use grep() to search for fields of interest; for example, we can grep for "project" to find project fields.
grep("project", all_fields$files, ignore.case = TRUE, value = TRUE)
[1] "cases.project.dbgap_accession_number"
[2] "cases.project.disease_type"
[3] "cases.project.intended_release_date"
[4] "cases.project.name"
[5] "cases.project.primary_site"
[6] "cases.project.program.dbgap_accession_number"
[7] "cases.project.program.name"
[8] "cases.project.program.program_id"
[9] "cases.project.project_id"
[10] "cases.project.releasable"
[11] "cases.project.released"
[12] "cases.project.state"
[13] "cases.tissue_source_site.project"
Note that the components of each field name above are separated by periods (.); this reflects the hierarchical structure of the fields. We can summarise the top-level fields by using sub().
unique(sub("^(\\w+)\\..*", "\\1", all_fields$cases))
[1] "aliquot_ids" "analyte_ids" "annotations"
[4] "case_autocomplete" "case_id" "consent_type"
[7] "created_datetime" "days_to_consent" "days_to_lost_to_followup"
[10] "demographic" "diagnoses" "diagnosis_ids"
[13] "disease_type" "exposures" "family_histories"
[16] "files" "follow_ups" "index_date"
[19] "lost_to_followup" "portion_ids" "primary_site"
[22] "project" "sample_ids" "samples"
[25] "slide_ids" "state" "submitter_aliquot_ids"
[28] "submitter_analyte_ids" "submitter_diagnosis_ids" "submitter_id"
[31] "submitter_portion_ids" "submitter_sample_ids" "submitter_slide_ids"
[34] "summary" "tissue_source_site" "updated_datetime"
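To see which top-level groups contribute the most fields, the same idea can be combined with table(). This is a minimal sketch; count_top_level() is a name I made up, and the short vector of field names is only there for illustration.

```r
# Count how many fields fall under each top-level group.
# count_top_level() is a hypothetical helper written for this post.
count_top_level <- function(fields) {
  top_level <- sub("\\..*$", "", fields)  # keep the text before the first period
  sort(table(top_level), decreasing = TRUE)
}

# A handful of field names for illustration only.
example_fields <- c(
  "case_id",
  "demographic.gender",
  "demographic.race",
  "diagnoses.age_at_diagnosis",
  "diagnoses.primary_diagnosis",
  "diagnoses.treatments.treatment_type",
  "project.project_id"
)
count_top_level(example_fields)
```

Running count_top_level(all_fields$cases) on the real field list should show where most of the 1,172 case fields come from.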
Files
Files that contain genetic sequence information are controlled since they can be used to reveal sensitive information about an individual's health, ancestry, and identity; for example all BAM files are under controlled access. If you need access to these files, see Obtaining Access to Controlled Data.
files() %>%
filter(data_format == 'bam') %>%
facet("access") %>%
aggregations()
$access
doc_count key
1 171451 controlled
All VCF files are also under controlled access for the same reason.
files() %>%
filter(data_format == 'vcf') %>%
facet("access") %>%
aggregations()
$access
doc_count key
1 218666 controlled
However, Mutation Annotation Format (MAF) data are openly available. These files are tab-delimited text files with aggregated mutation information from VCF files. Since they are aggregated and not patient-specific they do not need to be controlled.
files() %>%
filter(access == 'open') %>%
filter(experimental_strategy == 'WXS') %>%
facet("data_format") %>%
aggregations()
$data_format
doc_count key
1 17773 maf
Project
Below are all the project fields.
all_fields$projects
[1] "dbgap_accession_number"
[2] "disease_type"
[3] "intended_release_date"
[4] "name"
[5] "primary_site"
[6] "program.dbgap_accession_number"
[7] "program.name"
[8] "program.program_id"
[9] "project_autocomplete"
[10] "project_id"
[11] "releasable"
[12] "released"
[13] "state"
[14] "summary.case_count"
[15] "summary.data_categories.case_count"
[16] "summary.data_categories.data_category"
[17] "summary.data_categories.file_count"
[18] "summary.experimental_strategies.case_count"
[19] "summary.experimental_strategies.experimental_strategy"
[20] "summary.experimental_strategies.file_count"
[21] "summary.file_count"
[22] "summary.file_size"
Use projects() to fetch project information and ids() to list all available projects.
projects() %>% results_all() -> project_info
head(sort(ids(project_info)))
[1] "APOLLO-LUAD" "BEATAML1.0-COHORT" "BEATAML1.0-CRENOLANIB" "CDDP_EAGLE-1"
[5] "CGCI-BLGSP" "CGCI-HTMCP-CC"
The results() method will fetch actual results.
projects() %>% results(size = 10) -> my_proj
str(my_proj, max.level = 1)
List of 9
$ id : chr [1:10] "TARGET-AML" "MATCH-Z1I" "HCMI-CMDC" "MATCH-W" ...
$ primary_site :List of 10
$ dbgap_accession_number: chr [1:10] "phs000465" "phs002058" NA "phs001948" ...
$ project_id : chr [1:10] "TARGET-AML" "MATCH-Z1I" "HCMI-CMDC" "MATCH-W" ...
$ disease_type :List of 10
$ name : chr [1:10] "Acute Myeloid Leukemia" "Genomic Characterization CS-MATCH-0007 Arm Z1I" "NCI Cancer Model Development for the Human Cancer Model Initiative" "Genomic Characterization CS-MATCH-0007 Arm W" ...
$ releasable : logi [1:10] TRUE FALSE TRUE FALSE FALSE FALSE ...
$ state : chr [1:10] "open" "open" "open" "open" ...
$ released : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
- attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
- attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"
Clinical data
The gdc_clinical function:
The NCI GDC has a complex data model that allows various studies to supply numerous clinical and demographic data elements. However, across all projects that enter the GDC, there are similarities. This function returns four data.frames associated with case_ids from the GDC.
case_ids <- cases() %>% results(size=10) %>% ids()
clindat <- gdc_clinical(case_ids)
names(clindat)
[1] "demographic" "diagnoses" "exposures" "follow_ups" "main"
We'll take a closer look at the main data as an example.
idx <- apply(clindat$main, 2, function(x) all(is.na(x)))
head(clindat$main[, !idx])
# A tibble: 6 × 9
id disease_type submitter_id created_datetime primary_site updated_datetime case_id index_date state
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 58771… Myeloid Leu… TARGET-20-P… 2019-02-25T10:1… Hematopoiet… 2023-07-21T02:1… 587713… Diagnosis rele…
2 28da5… Myeloid Leu… TARGET-20-P… 2021-10-12T15:1… Hematopoiet… 2023-07-20T22:4… 28da5b… Diagnosis rele…
3 28dae… Myeloid Leu… TARGET-20-D… 2021-10-12T15:1… Hematopoiet… 2022-09-06T12:5… 28dae0… NA rele…
4 28f7e… Myeloid Leu… TARGET-20-K… 2019-02-25T10:1… Hematopoiet… 2019-10-24T08:2… 28f7e6… NA rele…
5 28ffd… Myeloid Leu… TARGET-20-P… 2019-02-25T10:1… Hematopoiet… 2023-07-21T00:1… 28ffdf… Diagnosis rele…
6 29312… Myeloid Leu… TARGET-20-P… 2019-02-25T10:1… Hematopoiet… 2023-07-20T22:4… 293128… Diagnosis rele…
Cases
Cases can be used to find all files related to a specific case, or sample donor. Below are some details that are available for a particular case.
case1 <- cases() %>% results(size=1)
str(case1, max.level = 1)
List of 17
$ id : chr "58771370-5082-485e-ac68-13edfbd9ef0c"
$ lost_to_followup : logi NA
$ disease_type : chr "Myeloid Leukemias"
$ days_to_lost_to_followup: logi NA
$ submitter_id : chr "TARGET-20-PAWKJC"
$ aliquot_ids :List of 1
$ submitter_aliquot_ids :List of 1
$ created_datetime : chr "2019-02-25T10:13:06.478422-06:00"
$ diagnosis_ids :List of 1
$ sample_ids :List of 1
$ submitter_sample_ids :List of 1
$ primary_site : chr "Hematopoietic and reticuloendothelial systems"
$ submitter_diagnosis_ids :List of 1
$ updated_datetime : chr "2023-07-21T02:14:53.858464-05:00"
$ case_id : chr "58771370-5082-485e-ac68-13edfbd9ef0c"
$ index_date : chr "Diagnosis"
$ state : chr "released"
- attr(*, "row.names")= int 1
- attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"
TCGA
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. TCGA data is commonly used in cancer genomics studies to compare new results, find biomarkers, etc. A long time ago I used to download TCGA data using Firehose, but now we can use the {GenomicDataCommons} package!
TCGA nomenclature
Data from TCGA (gene expression, copy number variation, clinical information, etc.) are available via the Genomic Data Commons (GDC). Primary sequence data (stored in BAM files) are under controlled access; access should be requested via dbGaP, and the request should be made by the PI. Since I often forget the cancer abbreviations, I have included the lookup table below.
Study Abbreviation | Study Name |
---|---|
LAML | Acute Myeloid Leukemia |
ACC | Adrenocortical carcinoma |
BLCA | Bladder Urothelial Carcinoma |
LGG | Brain Lower Grade Glioma |
BRCA | Breast invasive carcinoma |
CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma |
CHOL | Cholangiocarcinoma |
LCML | Chronic Myelogenous Leukemia |
COAD | Colon adenocarcinoma |
CNTL | Controls |
ESCA | Esophageal carcinoma |
FPPP | FFPE Pilot Phase II |
GBM | Glioblastoma multiforme |
HNSC | Head and Neck squamous cell carcinoma |
KICH | Kidney Chromophobe |
KIRC | Kidney renal clear cell carcinoma |
KIRP | Kidney renal papillary cell carcinoma |
LIHC | Liver hepatocellular carcinoma |
LUAD | Lung adenocarcinoma |
LUSC | Lung squamous cell carcinoma |
DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma |
MESO | Mesothelioma |
MISC | Miscellaneous |
OV | Ovarian serous cystadenocarcinoma |
PAAD | Pancreatic adenocarcinoma |
PCPG | Pheochromocytoma and Paraganglioma |
PRAD | Prostate adenocarcinoma |
READ | Rectum adenocarcinoma |
SARC | Sarcoma |
SKCM | Skin Cutaneous Melanoma |
STAD | Stomach adenocarcinoma |
TGCT | Testicular Germ Cell Tumors |
THYM | Thymoma |
THCA | Thyroid carcinoma |
UCS | Uterine Carcinosarcoma |
UCEC | Uterine Corpus Endometrial Carcinoma |
UVM | Uveal Melanoma |
The following information is from https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html.
A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. Refer to the following figure for an illustration of how metadata identifiers comprise a barcode. An aliquot barcode contains the highest number of identifiers. For example:
Aliquot barcode: TCGA-G4-6317-02A-11D-2064-05
Participant: TCGA-G4-6317
Sample: TCGA-G4-6317-02
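The barcode is just a hyphen-separated string, so it can be taken apart with base R. Here is a minimal sketch for full aliquot barcodes (seven hyphen-separated parts); parse_tcga_barcode() is a name I made up and it does not handle shorter barcodes.

```r
# Split a full TCGA aliquot barcode into its identifiers.
# parse_tcga_barcode() is a hypothetical helper written for this post.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7, parts[1] == "TCGA")
  list(
    participant      = paste(parts[1:3], collapse = "-"),
    sample           = paste(c(parts[1:3], substr(parts[4], 1, 2)), collapse = "-"),
    sample_type_code = substr(parts[4], 1, 2),  # e.g., 01 = primary solid tumour
    vial             = substr(parts[4], 3, 3),
    portion          = substr(parts[5], 1, 2),
    analyte          = substr(parts[5], 3, 3),  # e.g., D = DNA, R = RNA
    plate            = parts[6],
    center           = parts[7]
  )
}

str(parse_tcga_barcode("TCGA-G4-6317-02A-11D-2064-05"))
```

The participant and sample values reproduce the two shortened barcodes listed above.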
As an example, say we are working on ovarian serous cystadenocarcinoma (TCGA-OV) and we want RNA expression information. The function below also adds case IDs to the results, so that we can look up additional information about the cases.
get_star_metadata <- function(proj){
  files() %>%
    filter(cases.project.project_id == proj) %>%
    filter(analysis.workflow_type == 'STAR - Counts') %>%
    filter(access == 'open') %>%
    GenomicDataCommons::select(
      c(
        default_fields('files'),
        "cases.case_id",
        "cases.samples.sample_type",
        "cases.samples.sample_id"
      )
    ) %>%
    results_all()
}
ov_star <- get_star_metadata("TCGA-OV")
str(ov_star, max.level = 1)
List of 18
$ id : chr [1:429] "13af947c-a945-4bb3-9258-842f902e2f0c" "6ab604cb-ea60-44cd-93c7-1d93e5670802" "0da0d708-16f2-45e4-a704-36bd69680e13" "400dc107-3381-4fa9-81e8-76fce69d853c" ...
$ data_format : chr [1:429] "TSV" "TSV" "TSV" "TSV" ...
$ cases :List of 429
$ access : chr [1:429] "open" "open" "open" "open" ...
$ file_name : chr [1:429] "984c9d22-35ae-4d5e-a506-30ef69568d46.rna_seq.augmented_star_gene_counts.tsv" "b4388ff2-9482-41b4-a80b-a361ec3ef4d8.rna_seq.augmented_star_gene_counts.tsv" "95493f7f-1071-44c5-b007-0dd5b8637c29.rna_seq.augmented_star_gene_counts.tsv" "cfdb5780-7a6a-41ec-80f6-1405a06ba81d.rna_seq.augmented_star_gene_counts.tsv" ...
$ submitter_id : chr [1:429] "962e286d-54a6-486f-b2c4-72731366637c" "5d2257c4-0ae7-431f-9b91-d469dd275bf3" "dd17894d-432a-4bec-9dd2-962afe9ffae3" "0dad997d-7a38-4b5e-a839-9a89ae70052a" ...
$ data_category : chr [1:429] "Transcriptome Profiling" "Transcriptome Profiling" "Transcriptome Profiling" "Transcriptome Profiling" ...
$ acl :List of 429
$ type : chr [1:429] "gene_expression" "gene_expression" "gene_expression" "gene_expression" ...
$ platform : chr [1:429] "Illumina" "Illumina" "Illumina" "Illumina" ...
$ file_size : int [1:429] 4230731 4242021 4234995 4271814 4255118 4232670 4250007 4239148 4256453 4208419 ...
$ created_datetime : chr [1:429] "2021-12-13T20:51:04.846189-06:00" "2021-12-13T20:51:57.381949-06:00" "2021-12-13T20:50:59.983850-06:00" "2021-12-13T20:53:18.474460-06:00" ...
$ md5sum : chr [1:429] "78588ba9044de3e2cddc9d5d731e0518" "3697a1390e73861ceb232e48889a377e" "6db226b5e5eefb2b02072af1482139ff" "8ae25ce818674daf47058aec276fa24c" ...
$ updated_datetime : chr [1:429] "2024-07-30T13:59:16.456078-05:00" "2024-07-30T14:01:14.448800-05:00" "2024-07-30T12:53:46.707112-05:00" "2024-07-30T14:00:53.771498-05:00" ...
$ file_id : chr [1:429] "13af947c-a945-4bb3-9258-842f902e2f0c" "6ab604cb-ea60-44cd-93c7-1d93e5670802" "0da0d708-16f2-45e4-a704-36bd69680e13" "400dc107-3381-4fa9-81e8-76fce69d853c" ...
$ data_type : chr [1:429] "Gene Expression Quantification" "Gene Expression Quantification" "Gene Expression Quantification" "Gene Expression Quantification" ...
$ state : chr [1:429] "released" "released" "released" "released" ...
$ experimental_strategy: chr [1:429] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
- attr(*, "row.names")= int [1:429] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "class")= chr [1:3] "GDCfilesResults" "GDCResults" "list"
Examine a single case.
str(ov_star$cases$`96aca0af-a776-460d-95ff-87e364e4ac99`)
'data.frame': 1 obs. of 2 variables:
$ case_id: chr "9446e349-71e6-455a-aa8f-53ec96597146"
$ samples:List of 1
..$ :'data.frame': 1 obs. of 2 variables:
.. ..$ sample_id : chr "1d568bd2-d658-40fa-a341-daa4d2a5bb22"
.. ..$ sample_type: chr "Primary Tumor"
Build a data frame of all cases to see what samples are available.
sapply(ov_star$cases, function(x) x$samples) |>
do.call(rbind.data.frame, args = _) -> ov_star_cases
head(ov_star_cases)
sample_id sample_type
13af947c-a945-4bb3-9258-842f902e2f0c bcd16bf3-0877-4e4b-b70b-7d6a497af7ac Primary Tumor
6ab604cb-ea60-44cd-93c7-1d93e5670802 c8de9eaf-cda6-4680-a710-93fd9e9a8903 Primary Tumor
0da0d708-16f2-45e4-a704-36bd69680e13 7404e874-190e-4353-8755-1d9be35eedb7 Primary Tumor
400dc107-3381-4fa9-81e8-76fce69d853c 3669f3bd-dffe-4737-bf81-07f38deceb2e Primary Tumor
3fadac67-3b7c-4218-a52a-de7d6808b75d 8eab6a91-e73b-4555-9915-8ac4b91d9de3 Primary Tumor
6d20dd34-f36e-468e-bcfe-99fc45dd68a2 13f9890c-e405-4625-888e-496eee30dfb4 Primary Tumor
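The table above drops the case IDs, which are needed to link files back to the clinical data. Below is a minimal sketch that keeps the file ID, case ID, and sample information together. It assumes the nested structure shown earlier (one case per file, with samples stored as a list column); flatten_cases() and make_case() are names I made up, and the toy input uses placeholder IDs rather than real GDC identifiers.

```r
# Build one nested entry shaped like an element of ov_star$cases.
# make_case() is a hypothetical helper used only to create toy input.
make_case <- function(case_id, sample_id, sample_type) {
  df <- data.frame(case_id = case_id, stringsAsFactors = FALSE)
  df$samples <- list(data.frame(
    sample_id = sample_id, sample_type = sample_type, stringsAsFactors = FALSE
  ))
  df
}

# Flatten a named list of such entries into one row per file and sample.
# flatten_cases() is a hypothetical helper; it assumes one case per file.
flatten_cases <- function(cases) {
  rows <- lapply(names(cases), function(file_id) {
    x <- cases[[file_id]]
    samples <- x$samples[[1]]
    data.frame(
      file_id = file_id,
      case_id = x$case_id[1],
      sample_id = samples$sample_id,
      sample_type = samples$sample_type,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy input with placeholder IDs.
toy_cases <- list(
  "file-1" = make_case("case-A", "sample-1", "Primary Tumor"),
  "file-2" = make_case("case-B", "sample-2", "Recurrent Tumor")
)
flatten_cases(toy_cases)
```

If the real results have the same shape, flatten_cases(ov_star$cases) followed by table() on the sample_type column should give a quick overview of the available sample types.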
Summary
The {GenomicDataCommons} package makes it easy to interact with the GDC via R. You can use it to find available files that can be linked to clinical data. Raw sequence files and files that contain genomic information about a patient/case are controlled and require a formal application process before access is granted.
TCGA provides richly annotated datasets including the genetic and molecular characteristics of various cancer subtypes. Since TCGA employs a standardised analysis pipeline that ensures consistency in data processing and minimises biases associated with manual curation, TCGA data is widely used in many cancer studies. Use the {GenomicDataCommons} package to find and download TCGA data!
This post was sponsored by Next Advance.

This work is licensed under a Creative Commons
Attribution 4.0 International License.