Using the GenomicDataCommons package

The {GenomicDataCommons} Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the Genomic Data Commons (GDC). The About the GDC webpage provides a brief description of the program:

The Genomic Data Commons (GDC) is a research program of the National Cancer Institute (NCI). The mission of the GDC is to provide the cancer research community with a unified repository and cancer knowledge base that enables data sharing across cancer genomic studies in support of precision medicine.

The National Cancer Institute, part of the National Institutes of Health (NIH), is the federal government's principal agency for cancer research and training. NCI’s mission is to lead, conduct, and support cancer research across the nation to advance scientific knowledge and help all people to live longer, healthier lives. NCI’s scope of work spans a broad spectrum of cancer research across a variety of disciplines and supports research training opportunities at career stages across the academic continuum.

If you need harmonised cancer genomics data, the GDC is a great place to start looking!

Installation

Use BiocManager::install() to install {GenomicDataCommons}.

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("GenomicDataCommons", quietly = TRUE)) {
  BiocManager::install("GenomicDataCommons")
}
library(GenomicDataCommons)
packageVersion("GenomicDataCommons")
[1] ‘1.30.0’

I will be using the Tidyverse for this post, so install it if you haven't already.

library(tidyverse)

Getting started

Before we get started, we should check the current status of the GDC to see if it's OK to query it. (For example, if there is a government shutdown, NIH services get shut down too.)

GenomicDataCommons::status()
stopifnot(GenomicDataCommons::status()$status=="OK")
$commit
[1] "48add4be7ac46e7db10e0c6f0e3010d5bb2a50aa"

$data_release
[1] "Data Release 41.0 - August 28, 2024"

$data_release_version
$data_release_version$major
[1] 41

$data_release_version$minor
[1] 0

$data_release_version$release_date
[1] "2024-08-28"

$status
[1] "OK"

$tag
[1] "7.7"

$version
[1] 1
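
In a script that runs unattended, it may be useful to treat a network failure the same way as a non-OK status instead of letting the error stop the script. Below is a minimal sketch; gdc_is_ok() is a hypothetical helper name (it is not part of the package), and the status function is passed in as an argument so that the logic can be tried without network access.

```r
# Hypothetical helper: TRUE only if the status call succeeds and reports "OK".
# `status_fun` is any zero-argument function returning a list with a `status`
# element, for example GenomicDataCommons::status.
gdc_is_ok <- function(status_fun) {
  tryCatch(
    identical(status_fun()$status, "OK"),
    error = function(e) FALSE
  )
}

# Offline illustration with stand-in functions
gdc_is_ok(function() list(status = "OK"))         # TRUE
gdc_is_ok(function() stop("connection refused"))  # FALSE
```

With the package loaded, gdc_is_ok(GenomicDataCommons::status) would perform the real check.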

The following code builds a manifest that can be used to guide the download of raw transcriptomic data. The filtering performed below finds open (i.e., openly accessible) gene expression files quantified as raw counts using STAR (an RNA-seq aligner) from TCGA ovarian cancer patients.

ge_manifest <- files() %>%
  filter(cases.project.project_id == 'TCGA-OV') %>% 
  filter(type == 'gene_expression' ) %>%
  filter(access == 'open') %>%
  filter(analysis.workflow_type == 'STAR - Counts')  %>%
  manifest()

head(ge_manifest)
# A tibble: 6 × 17
  id                data_format access file_name submitter_id data_category acl_1 type  platform file_size
  <chr>             <chr>       <chr>  <chr>     <chr>        <chr>         <chr> <chr> <chr>        <int>
1 13af947c-a945-4b… TSV         open   984c9d22… 962e286d-54… Transcriptom… open  gene… Illumina   4230731
2 6ab604cb-ea60-44… TSV         open   b4388ff2… 5d2257c4-0a… Transcriptom… open  gene… Illumina   4242021
3 0da0d708-16f2-45… TSV         open   95493f7f… dd17894d-43… Transcriptom… open  gene… Illumina   4234995
4 400dc107-3381-4f… TSV         open   cfdb5780… 0dad997d-7a… Transcriptom… open  gene… Illumina   4271814
5 3fadac67-3b7c-42… TSV         open   25ff83ac… 657b1f82-36… Transcriptom… open  gene… Illumina   4255118
6 6d20dd34-f36e-46… TSV         open   d59c9b4d… 0c2cff25-73… Transcriptom… open  gene… Illumina   4232670
# ℹ 7 more variables: created_datetime <chr>, md5sum <chr>, updated_datetime <chr>, file_id <chr>,
#   data_type <chr>, state <chr>, experimental_strategy <chr>
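
The manifest includes an md5sum column (listed among the additional variables above), which can be used to confirm that a file on disk is intact once it has been downloaded. Below is a minimal sketch using tools::md5sum() from base R; verify_md5() is a hypothetical helper name, and a temporary file stands in for a real GDC download.

```r
# Hypothetical helper: compare the MD5 checksum of each local file with the
# value expected from the manifest's md5sum column.
verify_md5 <- function(paths, expected) {
  observed <- unname(tools::md5sum(path.expand(paths)))
  # A missing file yields NA, which is reported as a failed check
  !is.na(observed) & observed == expected
}

# Offline illustration: a temporary file in place of a downloaded TSV
tmp <- tempfile(fileext = ".tsv")
writeLines("gene_id\tunstranded", tmp)
good <- unname(tools::md5sum(tmp))

verify_md5(tmp, good)                                # TRUE
verify_md5(tmp, "00000000000000000000000000000000")  # FALSE
```

For real downloads, the paths would be the local file names and the expected values would be the matching entries of ge_manifest$md5sum; this assumes the two vectors are in the same order.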

The gdcdata function is used to download GDC files. In the example below I download gene counts as TSV files for the first three entries of the manifest.

fnames <- lapply(ge_manifest$id[1:3], gdcdata)
fnames
[[1]]
                                                                                                          13af947c-a945-4bb3-9258-842f902e2f0c 
"~/.cache/GenomicDataCommons/13af947c-a945-4bb3-9258-842f902e2f0c/984c9d22-35ae-4d5e-a506-30ef69568d46.rna_seq.augmented_star_gene_counts.tsv" 

[[2]]
                                                                                                          6ab604cb-ea60-44cd-93c7-1d93e5670802 
"~/.cache/GenomicDataCommons/6ab604cb-ea60-44cd-93c7-1d93e5670802/b4388ff2-9482-41b4-a80b-a361ec3ef4d8.rna_seq.augmented_star_gene_counts.tsv" 

[[3]]
                                                                                                          0da0d708-16f2-45e4-a704-36bd69680e13 
"~/.cache/GenomicDataCommons/0da0d708-16f2-45e4-a704-36bd69680e13/95493f7f-1071-44c5-b007-0dd5b8637c29.rna_seq.augmented_star_gene_counts.tsv" 

Files are downloaded and stored in the directory specified by gdc_cache(), which you can already see from the output above (~/.cache/GenomicDataCommons/).

gdc_cache()
[1] "~/.cache/GenomicDataCommons"
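
Once a counts file is in the cache, it can be read into R like any other tab-delimited file. The sketch below is based on my understanding of the layout of the *.augmented_star_gene_counts.tsv files (a leading comment line, a header, four STAR summary rows whose gene_id starts with N_, then one row per gene); treat that layout as an assumption and check it against a real file. The mock file and its count values are made up for illustration, and read_star_counts() is a hypothetical helper name.

```r
# Mock file mimicking the assumed layout; the counts are invented
mock <- tempfile(fileext = ".tsv")
writeLines(c(
  "# gene-model: GENCODE v36",
  "gene_id\tgene_name\tgene_type\tunstranded",
  "N_unmapped\t\t\t100",
  "N_multimapping\t\t\t200",
  "N_noFeature\t\t\t300",
  "N_ambiguous\t\t\t400",
  "ENSG00000000003.15\tTSPAN6\tprotein_coding\t1500",
  "ENSG00000000005.6\tTNMD\tprotein_coding\t3"
), mock)

# Hypothetical helper: read one file and keep only gene-level rows
read_star_counts <- function(path) {
  x <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
  # Drop the STAR summary rows (N_unmapped, N_multimapping, ...)
  x[!grepl("^N_", x$gene_id), c("gene_id", "gene_name", "unstranded")]
}

counts <- read_star_counts(mock)
counts
```

The same helper could be applied to each downloaded path with lapply() and the unstranded columns combined into a count matrix, provided all files list the genes in the same order.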

The query below will fetch all available STAR gene counts that are open for download; we can use dim() to find out how many datasets we can download.

open_star_manifest <- files() %>%
    filter(analysis.workflow_type == 'STAR - Counts') %>%
    filter(access == 'open') %>%
    manifest()

dim(open_star_manifest)
[1] 24788    17

Metadata queries

Queries in the {GenomicDataCommons} package follow the four metadata endpoints available at the GDC; there are four convenience functions that each create GDCQuery objects:

  1. projects()
  2. cases()
  3. files()
  4. annotations()

Each of the four endpoints (projects, cases, files, and annotations) has its own set of associated fields; below are the default fields for each.

endpoints <- c("projects", "cases", "files", "annotations")
sapply(endpoints, default_fields)
$projects
 [1] "dbgap_accession_number" "disease_type"           "intended_release_date" 
 [4] "name"                   "primary_site"           "project_autocomplete"  
 [7] "project_id"             "releasable"             "released"              
[10] "state"                 

$cases
 [1] "aliquot_ids"              "analyte_ids"              "case_autocomplete"       
 [4] "case_id"                  "consent_type"             "created_datetime"        
 [7] "days_to_consent"          "days_to_lost_to_followup" "diagnosis_ids"           
[10] "disease_type"             "index_date"               "lost_to_followup"        
[13] "portion_ids"              "primary_site"             "sample_ids"              
[16] "slide_ids"                "state"                    "submitter_aliquot_ids"   
[19] "submitter_analyte_ids"    "submitter_diagnosis_ids"  "submitter_id"            
[22] "submitter_portion_ids"    "submitter_sample_ids"     "submitter_slide_ids"     
[25] "updated_datetime"        

$files
 [1] "access"                         "acl"                           
 [3] "average_base_quality"           "average_insert_size"           
 [5] "average_read_length"            "cancer_dna_fraction"           
 [7] "channel"                        "chip_id"                       
 [9] "chip_position"                  "contamination"                 
[11] "contamination_error"            "created_datetime"              
[13] "data_category"                  "data_format"                   
[15] "data_type"                      "error_type"                    
[17] "experimental_strategy"          "file_autocomplete"             
[19] "file_id"                        "file_name"                     
[21] "file_size"                      "genome_doubling"               
[23] "imaging_date"                   "magnification"                 
[25] "md5sum"                         "mean_coverage"                 
[27] "msi_score"                      "msi_status"                    
[29] "pairs_on_diff_chr"              "plate_name"                    
[31] "plate_well"                     "platform"                      
[33] "proc_internal"                  "proportion_base_mismatch"      
[35] "proportion_coverage_10x"        "proportion_coverage_30x"       
[37] "proportion_reads_duplicated"    "proportion_reads_mapped"       
[39] "proportion_targets_no_coverage" "read_pair_number"              
[41] "revision"                       "stain_type"                    
[43] "state"                          "state_comment"                 
[45] "subclonal_genome_fraction"      "submitter_id"                  
[47] "tags"                           "total_reads"                   
[49] "tumor_ploidy"                   "tumor_purity"                  
[51] "type"                           "updated_datetime"              
[53] "wgs_coverage"                  

$annotations
 [1] "annotation_autocomplete" "annotation_id"           "case_id"                
 [4] "case_submitter_id"       "category"                "classification"         
 [7] "created_datetime"        "entity_id"               "entity_submitter_id"    
[10] "entity_type"             "legacy_created_datetime" "legacy_updated_datetime"
[13] "notes"                   "state"                   "status"                 
[16] "submitter_id"            "updated_datetime"

We can also count all the available fields for each endpoint, not just the default ones. As you can see, the case and file datasets are richly annotated!

all_fields <- sapply(endpoints, available_fields)
names(all_fields) <- endpoints

sapply(all_fields, length)
   projects       cases       files annotations 
         22        1172        1193          30

These fields can be used for filtering purposes to find an appropriate dataset.

head(all_fields$files)
[1] "access"                      "acl"                         "analysis.analysis_id"       
[4] "analysis.analysis_type"      "analysis.created_datetime"   "analysis.input_files.access"

The facet function is useful for aggregating on the values used for a particular field. For example, we can check how many files are open and how many are controlled (i.e., require permission).

files() %>% facet("access") %>% aggregations()
$access
  doc_count        key
1    690181 controlled
2    337336       open
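
The aggregation result is an ordinary data frame, so the counts are easy to turn into percentages. The sketch below re-creates the table from the numbers printed above instead of querying the GDC, so it runs offline.

```r
# Counts copied from the aggregations() output above
access_counts <- data.frame(
  doc_count = c(690181, 337336),
  key = c("controlled", "open")
)

# Share of each access level, as a percentage of all files
access_counts$percent <- round(
  100 * access_counts$doc_count / sum(access_counts$doc_count), 1
)
access_counts
```

With these numbers, roughly one third of the files are open access.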

Since there are many fields, we can use grep() to search for fields of interest. For example, we can grep for "project" to find project-related fields.

grep("project", all_fields$files, ignore.case = TRUE, value = TRUE)
 [1] "cases.project.dbgap_accession_number"        
 [2] "cases.project.disease_type"                  
 [3] "cases.project.intended_release_date"         
 [4] "cases.project.name"                          
 [5] "cases.project.primary_site"                  
 [6] "cases.project.program.dbgap_accession_number"
 [7] "cases.project.program.name"                  
 [8] "cases.project.program.program_id"            
 [9] "cases.project.project_id"                    
[10] "cases.project.releasable"                    
[11] "cases.project.released"                      
[12] "cases.project.state"                         
[13] "cases.tissue_source_site.project"

Note that the components of each field name above are separated by periods (.), which indicates the hierarchical structure of the fields. We can summarise the top-level fields of the cases endpoint by using sub().

unique(sub("^(\\w+)\\..*", "\\1", all_fields$cases))
 [1] "aliquot_ids"              "analyte_ids"              "annotations"             
 [4] "case_autocomplete"        "case_id"                  "consent_type"            
 [7] "created_datetime"         "days_to_consent"          "days_to_lost_to_followup"
[10] "demographic"              "diagnoses"                "diagnosis_ids"           
[13] "disease_type"             "exposures"                "family_histories"        
[16] "files"                    "follow_ups"               "index_date"              
[19] "lost_to_followup"         "portion_ids"              "primary_site"            
[22] "project"                  "sample_ids"               "samples"                 
[25] "slide_ids"                "state"                    "submitter_aliquot_ids"   
[28] "submitter_analyte_ids"    "submitter_diagnosis_ids"  "submitter_id"            
[31] "submitter_portion_ids"    "submitter_sample_ids"     "submitter_slide_ids"     
[34] "summary"                  "tissue_source_site"       "updated_datetime"
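
The same idea can be extended to count how many fields sit under each top-level branch. The sketch below uses a handful of field names copied from the outputs above so that it runs without querying the GDC.

```r
# A few field names taken from the files endpoint outputs above
fields <- c(
  "access",
  "acl",
  "analysis.analysis_id",
  "analysis.analysis_type",
  "cases.project.name",
  "cases.project.project_id",
  "cases.tissue_source_site.project"
)

# Keep everything before the first period, then tabulate
top_level <- sub("\\..*$", "", fields)
field_counts <- sort(table(top_level), decreasing = TRUE)
field_counts
```

Applied to all_fields$files or all_fields$cases, the same two lines would show which branches account for most of the fields.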

Files

Files that contain genetic sequence information are controlled because they can reveal sensitive information about an individual's health, ancestry, and identity; for example, all BAM files are under controlled access. If you need access to these files, see Obtaining Access to Controlled Data.

files() %>%
  filter(data_format == 'bam') %>%
  facet("access") %>%
  aggregations()
$access
  doc_count        key
1    171451 controlled

All VCF files are also under controlled access for the same reason.

files() %>%
  filter(data_format == 'vcf') %>%
  facet("access") %>%
  aggregations()
$access
  doc_count        key
1    218666 controlled

However, Mutation Annotation Format (MAF) data are openly available. These files are tab-delimited text files with aggregated mutation information from VCF files. Since they are aggregated and not patient-specific they do not need to be controlled.

files() %>%
  filter(access == 'open') %>%
  filter(experimental_strategy == 'WXS') %>%
  facet("data_format") %>%
  aggregations()
$data_format
  doc_count key
1     17773 maf

Project

Below are all the project fields.

all_fields$projects
 [1] "dbgap_accession_number"                               
 [2] "disease_type"                                         
 [3] "intended_release_date"                                
 [4] "name"                                                 
 [5] "primary_site"                                         
 [6] "program.dbgap_accession_number"                       
 [7] "program.name"                                         
 [8] "program.program_id"                                   
 [9] "project_autocomplete"                                 
[10] "project_id"                                           
[11] "releasable"                                           
[12] "released"                                             
[13] "state"                                                
[14] "summary.case_count"                                   
[15] "summary.data_categories.case_count"                   
[16] "summary.data_categories.data_category"                
[17] "summary.data_categories.file_count"                   
[18] "summary.experimental_strategies.case_count"           
[19] "summary.experimental_strategies.experimental_strategy"
[20] "summary.experimental_strategies.file_count"           
[21] "summary.file_count"                                   
[22] "summary.file_size" 

Use projects() to fetch project information and ids() to list all available projects.

projects() %>% results_all() -> project_info

head(sort(ids(project_info)))
[1] "APOLLO-LUAD"           "BEATAML1.0-COHORT"     "BEATAML1.0-CRENOLANIB" "CDDP_EAGLE-1"         
[5] "CGCI-BLGSP"            "CGCI-HTMCP-CC"

The results() method will fetch actual results.

projects() %>% results(size = 10) -> my_proj

str(my_proj, max.level = 1)
List of 9
 $ id                    : chr [1:10] "TARGET-AML" "MATCH-Z1I" "HCMI-CMDC" "MATCH-W" ...
 $ primary_site          :List of 10
 $ dbgap_accession_number: chr [1:10] "phs000465" "phs002058" NA "phs001948" ...
 $ project_id            : chr [1:10] "TARGET-AML" "MATCH-Z1I" "HCMI-CMDC" "MATCH-W" ...
 $ disease_type          :List of 10
 $ name                  : chr [1:10] "Acute Myeloid Leukemia" "Genomic Characterization CS-MATCH-0007 Arm Z1I" "NCI Cancer Model Development for the Human Cancer Model Initiative" "Genomic Characterization CS-MATCH-0007 Arm W" ...
 $ releasable            : logi [1:10] TRUE FALSE TRUE FALSE FALSE FALSE ...
 $ state                 : chr [1:10] "open" "open" "open" "open" ...
 $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
 - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

Clinical data

The documentation for the gdc_clinical function describes it as follows:

The NCI GDC has a complex data model that allows various studies to supply numerous clinical and demographic data elements. However, across all projects that enter the GDC, there are similarities. This function returns four data.frames associated with case_ids from the GDC.

case_ids <- cases() %>% results(size=10) %>% ids()
clindat <- gdc_clinical(case_ids)
names(clindat)
[1] "demographic" "diagnoses"   "exposures"   "follow_ups"  "main"       

We'll take a closer look at the main data as an example.

idx <- apply(clindat$main, 2, function(x) all(is.na(x)))
head(clindat$main[, !idx])
# A tibble: 6 × 9
  id     disease_type submitter_id created_datetime primary_site updated_datetime case_id index_date state
  <chr>  <chr>        <chr>        <chr>            <chr>        <chr>            <chr>   <chr>      <chr>
1 58771… Myeloid Leu… TARGET-20-P… 2019-02-25T10:1… Hematopoiet… 2023-07-21T02:1… 587713… Diagnosis  rele…
2 28da5… Myeloid Leu… TARGET-20-P… 2021-10-12T15:1… Hematopoiet… 2023-07-20T22:4… 28da5b… Diagnosis  rele…
3 28dae… Myeloid Leu… TARGET-20-D… 2021-10-12T15:1… Hematopoiet… 2022-09-06T12:5… 28dae0… NA         rele…
4 28f7e… Myeloid Leu… TARGET-20-K… 2019-02-25T10:1… Hematopoiet… 2019-10-24T08:2… 28f7e6… NA         rele…
5 28ffd… Myeloid Leu… TARGET-20-P… 2019-02-25T10:1… Hematopoiet… 2023-07-21T00:1… 28ffdf… Diagnosis  rele…
6 29312… Myeloid Leu… TARGET-20-P… 2019-02-25T10:1… Hematopoiet… 2023-07-20T22:4… 293128… Diagnosis  rele…
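
The apply() call above converts the whole table to a character matrix before testing each column. An alternative that checks the columns one at a time is sketched below; drop_empty_cols() is a hypothetical helper name and the toy data frame is made up for illustration.

```r
# Hypothetical helper: drop columns in which every value is NA
drop_empty_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Toy stand-in for clindat$main
toy <- data.frame(
  id = c("a", "b"),
  lost_to_followup = c(NA, NA),
  state = c("released", NA)
)

names(drop_empty_cols(toy))  # "id" "state"
```

The same helper could be applied to each element of clindat with lapply().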

Cases

Cases can be used to find all files related to a specific case, or sample donor. Below are some details that are available for a particular case.

case1 <- cases() %>% results(size=1)
str(case1, max.level = 1)
List of 17
 $ id                      : chr "58771370-5082-485e-ac68-13edfbd9ef0c"
 $ lost_to_followup        : logi NA
 $ disease_type            : chr "Myeloid Leukemias"
 $ days_to_lost_to_followup: logi NA
 $ submitter_id            : chr "TARGET-20-PAWKJC"
 $ aliquot_ids             :List of 1
 $ submitter_aliquot_ids   :List of 1
 $ created_datetime        : chr "2019-02-25T10:13:06.478422-06:00"
 $ diagnosis_ids           :List of 1
 $ sample_ids              :List of 1
 $ submitter_sample_ids    :List of 1
 $ primary_site            : chr "Hematopoietic and reticuloendothelial systems"
 $ submitter_diagnosis_ids :List of 1
 $ updated_datetime        : chr "2023-07-21T02:14:53.858464-05:00"
 $ case_id                 : chr "58771370-5082-485e-ac68-13edfbd9ef0c"
 $ index_date              : chr "Diagnosis"
 $ state                   : chr "released"
 - attr(*, "row.names")= int 1
 - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"

TCGA

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. TCGA data is commonly used in cancer genomics studies to compare new results, find biomarkers, etc. A long time ago I used to download TCGA data using Firehose, but now we can use the {GenomicDataCommons} package instead!

TCGA nomenclature

Data from TCGA (gene expression, copy number variation, clinical information, etc.) are available via the Genomic Data Commons (GDC). Primary sequence data (stored in BAM files) are under controlled access; access should be requested via dbGaP, and the request should be made by the PI. Since I often forget the cancer abbreviations, I have included the lookup table below.

Study Abbreviation  Study Name
LAML                Acute Myeloid Leukemia
ACC                 Adrenocortical carcinoma
BLCA                Bladder Urothelial Carcinoma
LGG                 Brain Lower Grade Glioma
BRCA                Breast invasive carcinoma
CESC                Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL                Cholangiocarcinoma
LCML                Chronic Myelogenous Leukemia
COAD                Colon adenocarcinoma
CNTL                Controls
ESCA                Esophageal carcinoma
FPPP                FFPE Pilot Phase II
GBM                 Glioblastoma multiforme
HNSC                Head and Neck squamous cell carcinoma
KICH                Kidney Chromophobe
KIRC                Kidney renal clear cell carcinoma
KIRP                Kidney renal papillary cell carcinoma
LIHC                Liver hepatocellular carcinoma
LUAD                Lung adenocarcinoma
LUSC                Lung squamous cell carcinoma
DLBC                Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
MESO                Mesothelioma
MISC                Miscellaneous
OV                  Ovarian serous cystadenocarcinoma
PAAD                Pancreatic adenocarcinoma
PCPG                Pheochromocytoma and Paraganglioma
PRAD                Prostate adenocarcinoma
READ                Rectum adenocarcinoma
SARC                Sarcoma
SKCM                Skin Cutaneous Melanoma
STAD                Stomach adenocarcinoma
TGCT                Testicular Germ Cell Tumors
THYM                Thymoma
THCA                Thyroid carcinoma
UCS                 Uterine Carcinosarcoma
UCEC                Uterine Corpus Endometrial Carcinoma
UVM                 Uveal Melanoma

Table source.
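
In R, a lookup table like this is conveniently stored as a named character vector. The sketch below includes only a few of the rows above, and study_name() is a hypothetical helper name; it strips the TCGA- prefix from a GDC project ID before looking up the abbreviation.

```r
# A few rows of the table above as a named vector (abbreviation -> name)
tcga_studies <- c(
  LAML = "Acute Myeloid Leukemia",
  BRCA = "Breast invasive carcinoma",
  LUAD = "Lung adenocarcinoma",
  OV   = "Ovarian serous cystadenocarcinoma"
)

# Hypothetical helper: translate project IDs such as "TCGA-OV" to study names;
# abbreviations missing from the lookup give NA
study_name <- function(project_id, lookup = tcga_studies) {
  unname(lookup[sub("^TCGA-", "", project_id)])
}

study_name(c("TCGA-OV", "TCGA-BRCA"))
# "Ovarian serous cystadenocarcinoma" "Breast invasive carcinoma"
```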

The following information is from https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html.

A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. Refer to the following figure for an illustration of how metadata identifiers comprise a barcode. An aliquot barcode contains the highest number of identifiers. For example:

Aliquot barcode: TCGA-G4-6317-02A-11D-2064-05
Participant: TCGA-G4-6317
Sample: TCGA-G4-6317-02
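
Because the identifiers sit at fixed positions, an aliquot barcode can be split with base R string functions. The sketch below reproduces the participant and sample identifiers from the example above; the names given to the remaining pieces (vial, portion, analyte, plate, center) follow the TCGA barcode layout as I understand it and should be treated as assumptions. parse_tcga_barcode() is a hypothetical helper name.

```r
# Hypothetical helper: split a full aliquot barcode into its components
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)  # only full aliquot barcodes are handled
  sample_code <- substr(parts[4], 1, 2)
  list(
    participant      = paste(parts[1:3], collapse = "-"),
    sample           = paste(c(parts[1:3], sample_code), collapse = "-"),
    sample_type_code = sample_code,
    vial             = substr(parts[4], 3, 3),
    portion          = substr(parts[5], 1, 2),
    analyte          = substr(parts[5], 3, 3),
    plate            = parts[6],
    center           = parts[7]
  )
}

bc <- parse_tcga_barcode("TCGA-G4-6317-02A-11D-2064-05")
bc$participant  # "TCGA-G4-6317"
bc$sample       # "TCGA-G4-6317-02"
```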

As an example, say we are working on ovarian serous cystadenocarcinoma (TCGA-OV) and we want RNA expression information. The function below also adds case IDs to the results, so that we can look up additional information about the cases.

get_star_metadata <- function(proj){
  files() %>%
    filter(cases.project.project_id == proj) %>% 
    filter(analysis.workflow_type == 'STAR - Counts') %>%
    filter(access == 'open') %>%
    GenomicDataCommons::select(
      c(
        default_fields('files'),
        "cases.case_id",
        "cases.samples.sample_type",
        "cases.samples.sample_id"
      )
    ) %>%
    results_all()
}

ov_star <- get_star_metadata("TCGA-OV")

str(ov_star, max.level = 1)
List of 18
 $ id                   : chr [1:429] "13af947c-a945-4bb3-9258-842f902e2f0c" "6ab604cb-ea60-44cd-93c7-1d93e5670802" "0da0d708-16f2-45e4-a704-36bd69680e13" "400dc107-3381-4fa9-81e8-76fce69d853c" ...
 $ data_format          : chr [1:429] "TSV" "TSV" "TSV" "TSV" ...
 $ cases                :List of 429
 $ access               : chr [1:429] "open" "open" "open" "open" ...
 $ file_name            : chr [1:429] "984c9d22-35ae-4d5e-a506-30ef69568d46.rna_seq.augmented_star_gene_counts.tsv" "b4388ff2-9482-41b4-a80b-a361ec3ef4d8.rna_seq.augmented_star_gene_counts.tsv" "95493f7f-1071-44c5-b007-0dd5b8637c29.rna_seq.augmented_star_gene_counts.tsv" "cfdb5780-7a6a-41ec-80f6-1405a06ba81d.rna_seq.augmented_star_gene_counts.tsv" ...
 $ submitter_id         : chr [1:429] "962e286d-54a6-486f-b2c4-72731366637c" "5d2257c4-0ae7-431f-9b91-d469dd275bf3" "dd17894d-432a-4bec-9dd2-962afe9ffae3" "0dad997d-7a38-4b5e-a839-9a89ae70052a" ...
 $ data_category        : chr [1:429] "Transcriptome Profiling" "Transcriptome Profiling" "Transcriptome Profiling" "Transcriptome Profiling" ...
 $ acl                  :List of 429
 $ type                 : chr [1:429] "gene_expression" "gene_expression" "gene_expression" "gene_expression" ...
 $ platform             : chr [1:429] "Illumina" "Illumina" "Illumina" "Illumina" ...
 $ file_size            : int [1:429] 4230731 4242021 4234995 4271814 4255118 4232670 4250007 4239148 4256453 4208419 ...
 $ created_datetime     : chr [1:429] "2021-12-13T20:51:04.846189-06:00" "2021-12-13T20:51:57.381949-06:00" "2021-12-13T20:50:59.983850-06:00" "2021-12-13T20:53:18.474460-06:00" ...
 $ md5sum               : chr [1:429] "78588ba9044de3e2cddc9d5d731e0518" "3697a1390e73861ceb232e48889a377e" "6db226b5e5eefb2b02072af1482139ff" "8ae25ce818674daf47058aec276fa24c" ...
 $ updated_datetime     : chr [1:429] "2024-07-30T13:59:16.456078-05:00" "2024-07-30T14:01:14.448800-05:00" "2024-07-30T12:53:46.707112-05:00" "2024-07-30T14:00:53.771498-05:00" ...
 $ file_id              : chr [1:429] "13af947c-a945-4bb3-9258-842f902e2f0c" "6ab604cb-ea60-44cd-93c7-1d93e5670802" "0da0d708-16f2-45e4-a704-36bd69680e13" "400dc107-3381-4fa9-81e8-76fce69d853c" ...
 $ data_type            : chr [1:429] "Gene Expression Quantification" "Gene Expression Quantification" "Gene Expression Quantification" "Gene Expression Quantification" ...
 $ state                : chr [1:429] "released" "released" "released" "released" ...
 $ experimental_strategy: chr [1:429] "RNA-Seq" "RNA-Seq" "RNA-Seq" "RNA-Seq" ...
 - attr(*, "row.names")= int [1:429] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:3] "GDCfilesResults" "GDCResults" "list"

Examine a single case.

str(ov_star$cases$`96aca0af-a776-460d-95ff-87e364e4ac99`)
'data.frame': 1 obs. of  2 variables:
 $ case_id: chr "9446e349-71e6-455a-aa8f-53ec96597146"
 $ samples:List of 1
  ..$ :'data.frame':  1 obs. of  2 variables:
  .. ..$ sample_id  : chr "1d568bd2-d658-40fa-a341-daa4d2a5bb22"
  .. ..$ sample_type: chr "Primary Tumor"

Build a data frame of all cases to see what samples are available.

sapply(ov_star$cases, function(x) x$samples) |>
  do.call(rbind.data.frame, args = _) -> ov_star_cases

head(ov_star_cases)
                                                                sample_id   sample_type
13af947c-a945-4bb3-9258-842f902e2f0c bcd16bf3-0877-4e4b-b70b-7d6a497af7ac Primary Tumor
6ab604cb-ea60-44cd-93c7-1d93e5670802 c8de9eaf-cda6-4680-a710-93fd9e9a8903 Primary Tumor
0da0d708-16f2-45e4-a704-36bd69680e13 7404e874-190e-4353-8755-1d9be35eedb7 Primary Tumor
400dc107-3381-4fa9-81e8-76fce69d853c 3669f3bd-dffe-4737-bf81-07f38deceb2e Primary Tumor
3fadac67-3b7c-4218-a52a-de7d6808b75d 8eab6a91-e73b-4555-9915-8ac4b91d9de3 Primary Tumor
6d20dd34-f36e-468e-bcfe-99fc45dd68a2 13f9890c-e405-4625-888e-496eee30dfb4 Primary Tumor
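
The data frame above keeps the file ID only as a row name and drops the case ID. An alternative that keeps both as regular columns is sketched below. It runs on a small mock list that mimics the nested structure shown by str() earlier (one data frame per file, with a list column of samples); the IDs are placeholders, flatten_cases() and make_case() are hypothetical helper names, and the sketch assumes one case per file.

```r
# Build one mock entry with the same shape as an element of ov_star$cases
make_case <- function(case_id, sample_id, sample_type) {
  case <- data.frame(case_id = case_id, stringsAsFactors = FALSE)
  case$samples <- list(
    data.frame(sample_id = sample_id, sample_type = sample_type,
               stringsAsFactors = FALSE)
  )
  case
}

mock_cases <- list(
  "file-1" = make_case("case-1", "sample-1", "Primary Tumor"),
  "file-2" = make_case("case-2", "sample-2", "Primary Tumor")
)

# Hypothetical helper: one row per sample, with file and case IDs as columns
flatten_cases <- function(cases) {
  rows <- lapply(names(cases), function(file_id) {
    case <- cases[[file_id]]
    samples <- do.call(rbind, case$samples)
    data.frame(file_id = file_id, case_id = case$case_id[1], samples,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

flat <- flatten_cases(mock_cases)
table(flat$sample_type)
```

With the real query result, flatten_cases(ov_star$cases) would be the equivalent call, and table() would then summarise which sample types are available.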

Summary

The {GenomicDataCommons} package makes it easy to interact with the GDC via R. You can use it to find available files that can be linked to clinical data. Raw sequence files and files that contain genomic information about a patient/case are controlled and will require a formal application process before access is granted.

TCGA provides richly annotated datasets including the genetic and molecular characteristics of various cancer subtypes. Since TCGA employs a standardised analysis pipeline that ensures consistency in data processing and minimises biases associated with manual curation, TCGA data is widely used in many cancer studies. Use the {GenomicDataCommons} package to find and download TCGA data!


This work is licensed under a Creative Commons Attribution 4.0 International License.
