Split a multi-valued column into multiple columns

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Therefore by these criteria, a column with multiple values, i.e., a multi-valued column, is messy. (However, there are MultiValue databases like NoSQL that support and encourage the use of attributes which can take a list of values!) Within the Tidyverse, I find working with tidy data much easier.

Two widely used file formats in bioinformatics, the Variant Call Format (VCF) and Gene Transfer Format, which are really just tab-delimited files, have multi-valued columns that are inconvenient to work with. In this post, we will use tools from the Tidyverse to split a multi-valued column into multiple columns to make the data easier to work with in R. We will use a VCF and a GTF file to demonstrate.

Splitting the INFO column in a VCF file

To get started, install the tidyverse if you haven't already.

if(!require("tidyverse")){
  install.packages("tidyverse")
}
library(tidyverse)

We will load a small portion of a VCF file using read_tsv; in addition we will rename #CHROM to CHROM and then change all the column names to lower case.

vcf_url <- "https://raw.githubusercontent.com/davetang/learning_vcf_file/master/eg/Pfeiffer.vcf"
read_tsv(vcf_url, comment = "##", show_col_types = FALSE, n_max = 1000) |>
  dplyr::rename(CHROM = `#CHROM`) |>
  dplyr::rename_with(tolower) -> my_vcf

head(my_vcf)
# A tibble: 6 × 10
  chrom    pos id         ref   alt    qual filter info            format manuel
  <dbl>  <dbl> <chr>      <chr> <chr> <dbl> <chr>  <chr>           <chr>  <chr> 
1     1 866511 rs60722469 C     CCCCT 259.  PASS   AC=2;AF=1.00;A… GT:AD… 1/1:6…
2     1 879317 rs7523549  C     T     151.  PASS   AC=1;AF=0.50;A… GT:AD… 0/1:1…
3     1 879482 .          G     C     485.  PASS   AC=1;AF=0.50;A… GT:AD… 0/1:2…
4     1 880390 rs3748593  C     A     288.  PASS   AC=1;AF=0.50;A… GT:AD… 0/1:1…
5     1 881627 rs2272757  G     A     486.  PASS   AC=1;AF=0.50;A… GT:AD… 0/1:1…
6     1 884091 rs7522415  C     G      65.5 PASS   AC=1;AF=0.50;A… GT:AD… 0/1:6…

Note that the info column is packed with all sorts of information for each variant. Also note the consistent format of the info column: each annotation is separated by a semi-colon (;) and annotations are stored as key-value pairs with an equal sign in between.

my_vcf |>
  select(info) |>
  head()
# A tibble: 6 × 1
  info                                                                          
  <chr>                                                                         
1 AC=2;AF=1.00;AN=2;DB;DP=11;FS=0.000;HRun=0;HaplotypeScore=41.3338;MQ0=0;MQ=61…
2 AC=1;AF=0.50;AN=2;BaseQRankSum=1.455;DB;DP=21;Dels=0.00;FS=1.984;HRun=0;Haplo…
3 AC=1;AF=0.50;AN=2;BaseQRankSum=1.934;DP=48;Dels=0.00;FS=4.452;HRun=0;Haplotyp…
4 AC=1;AF=0.50;AN=2;BaseQRankSum=-4.517;DB;DP=29;Dels=0.00;FS=1.485;HRun=0;Hapl…
5 AC=1;AF=0.50;AN=2;BaseQRankSum=0.199;DB;DP=33;Dels=0.00;FS=0.000;HRun=1;Haplo…
6 AC=1;AF=0.50;AN=2;BaseQRankSum=-0.259;DB;DP=12;Dels=0.00;FS=0.000;HRun=1;Hapl…

Firstly, we will use separate_rows to create a new row for each annotation by using ; as the separator/delimiter; note that I have included \\s* after ;, which is a regex for specifying a single whitespace occurring 0 or more times. By including the regex, whitespace after ; will be removed, which is good because we do not want whitespaces in our data. In addition, a mutate call is used prior to calling separate_rows and it is used to remove a trailing semicolon, if it exists.

my_vcf |>
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) |>
  separate_rows(info, sep = ";\\s*") |>
  head()
# A tibble: 6 × 10
  chrom    pos id         ref   alt    qual filter info     format        manuel
  <dbl>  <dbl> <chr>      <chr> <chr> <dbl> <chr>  <chr>    <chr>         <chr> 
1     1 866511 rs60722469 C     CCCCT  259. PASS   AC=2     GT:AD:DP:GQ:… 1/1:6…
2     1 866511 rs60722469 C     CCCCT  259. PASS   AF=1.00  GT:AD:DP:GQ:… 1/1:6…
3     1 866511 rs60722469 C     CCCCT  259. PASS   AN=2     GT:AD:DP:GQ:… 1/1:6…
4     1 866511 rs60722469 C     CCCCT  259. PASS   DB       GT:AD:DP:GQ:… 1/1:6…
5     1 866511 rs60722469 C     CCCCT  259. PASS   DP=11    GT:AD:DP:GQ:… 1/1:6…
6     1 866511 rs60722469 C     CCCCT  259. PASS   FS=0.000 GT:AD:DP:GQ:… 1/1:6…

The next step is to split the key-value pairs and we will use the separate function to separate the pairs into two columns, which we will name key and value, using the equal sign as the separator/delimiter. Sometimes a key is missing a value and in these cases, the value will be NA.

my_vcf |>
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) |>
  separate_rows(info, sep = ";\\s*") |>
  separate(info, c('key', 'value'), sep = "=") |>
  head(10)
# A tibble: 10 × 11
   chrom    pos id         ref   alt    qual filter key      value format manuel
   <dbl>  <dbl> <chr>      <chr> <chr> <dbl> <chr>  <chr>    <chr> <chr>  <chr> 
 1     1 866511 rs60722469 C     CCCCT  259. PASS   AC       2     GT:AD… 1/1:6…
 2     1 866511 rs60722469 C     CCCCT  259. PASS   AF       1.00  GT:AD… 1/1:6…
 3     1 866511 rs60722469 C     CCCCT  259. PASS   AN       2     GT:AD… 1/1:6…
 4     1 866511 rs60722469 C     CCCCT  259. PASS   DB       <NA>  GT:AD… 1/1:6…
 5     1 866511 rs60722469 C     CCCCT  259. PASS   DP       11    GT:AD… 1/1:6…
 6     1 866511 rs60722469 C     CCCCT  259. PASS   FS       0.000 GT:AD… 1/1:6…
 7     1 866511 rs60722469 C     CCCCT  259. PASS   HRun     0     GT:AD… 1/1:6…
 8     1 866511 rs60722469 C     CCCCT  259. PASS   Haploty… 41.3… GT:AD… 1/1:6…
 9     1 866511 rs60722469 C     CCCCT  259. PASS   MQ0      0     GT:AD… 1/1:6…
10     1 866511 rs60722469 C     CCCCT  259. PASS   MQ       61.94 GT:AD… 1/1:6…

The current state of the transformation produces a new row for each variant annotation and two columns containing the key and value. If we want our data in wide format where each annotation is a column, we can use the pivot_wider function.

In the code below, I have used the first seven columns (id_cols = chrom:filter) to specify the columns that uniquely identifies each variant, i.e. the same variant will have the same values in these columns. We specify our column names from the key column and the values for these cells will come from the value column.

my_vcf |>
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) |>
  separate_rows(info, sep = ";\\s*") |>
  separate(info, c('key', 'value'), sep = "=") |>
  distinct() |> # remove duplicated annotations, if any
  pivot_wider(id_cols = chrom:filter, names_from = key, values_from = value) -> my_vcf_tidy

head(my_vcf_tidy, 10)
# A tibble: 10 × 24
   chrom    pos id       ref   alt     qual filter AC    AF    AN    DB    DP   
   <dbl>  <dbl> <chr>    <chr> <chr>  <dbl> <chr>  <chr> <chr> <chr> <chr> <chr>
 1     1 866511 rs60722… C     CCCCT  259.  PASS   2     1.00  2     <NA>  11   
 2     1 879317 rs75235… C     T      151.  PASS   1     0.50  2     <NA>  21   
 3     1 879482 .        G     C      485.  PASS   1     0.50  2     <NA>  48   
 4     1 880390 rs37485… C     A      288.  PASS   1     0.50  2     <NA>  29   
 5     1 881627 rs22727… G     A      486.  PASS   1     0.50  2     <NA>  33   
 6     1 884091 rs75224… C     G       65.5 PASS   1     0.50  2     <NA>  12   
 7     1 884101 rs49704… A     C       85.8 PASS   1     0.50  2     <NA>  12   
 8     1 892460 rs41285… G     C     1737.  PASS   1     0.50  2     <NA>  152  
 9     1 897730 rs75496… C     T      225.  PASS   1     0.50  2     <NA>  21   
10     1 909238 rs38297… G     C      248.  PASS   1     0.50  2     <NA>  19   
# ? 12 more variables: FS <chr>, HRun <chr>, HaplotypeScore <chr>, MQ0 <chr>,
#   MQ <chr>, QD <chr>, set <chr>, BaseQRankSum <chr>, Dels <chr>,
#   MQRankSum <chr>, ReadPosRankSum <chr>, DS <chr>

Now each row is a single variant and each column is a variable, making it much easier to work with! We can easily subset variants with a QD > 50 (after transforming the type of numeric).

my_vcf_tidy |>
  dplyr::mutate(QD = as.numeric(QD)) |>
  dplyr::filter(QD > 50)
# A tibble: 6 × 24
  chrom      pos id       ref   alt    qual filter AC    AF    AN    DB    DP   
  <dbl>    <dbl> <chr>    <chr> <chr> <dbl> <chr>  <chr> <chr> <chr> <chr> <chr>
1     1  1276973 rs70949… G     GACAC  459. PASS   2     1.00  2     <NA>  7    
2     1  3396278 .        T     TGCC…  447. PASS   1     0.50  2     <NA>  8    
3     1 19208145 rs11254… G     GGGT… 2735. PASS   2     1.00  2     <NA>  25   
4     1 22207804 rs66803… ACCC… A     1769. PASS   2     1.00  2     <NA>  27   
5     1 31764713 rs11166… C     CAAG  1018. PASS   2     1.00  2     <NA>  16   
6     1 31905889 rs30504… A     ACAG  2172. PASS   2     1.00  2     <NA>  39   
# ? 12 more variables: FS <chr>, HRun <chr>, HaplotypeScore <chr>, MQ0 <chr>,
#   MQ <chr>, QD <dbl>, set <chr>, BaseQRankSum <chr>, Dels <chr>,
#   MQRankSum <chr>, ReadPosRankSum <chr>, DS <chr>

Splitting the group column in a GTF file

The GTF also has a multi-valued column with key-value pairs.

my_gtf <- read_tsv(
  file = "https://github.com/davetang/importbio/raw/master/inst/extdata/gencode.v38.annotation.sample.gtf.gz",
  comment = "#",
  show_col_types = FALSE,
  col_names = c('chr', 'src', 'feat', 'start', 'end', 'score', 'strand', 'frame', 'group')
)

my_gtf |>
  select(group) |>
  head()
# A tibble: 6 × 1
  group                                                                         
  <chr>                                                                         
1 "gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogen…
2 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
3 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
4 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
5 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
6 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000450305.2\"; gene_typ…

We can use the same strategy (but with some additional formatting steps) to split the column up.

my_gtf |>
  mutate(group = sub(pattern = ";$", replacement = "", x = .data$group)) |>
  mutate(group = gsub(pattern = '"', replacement = "", x = .data$group)) |>
  separate_rows(group, sep = ";\\s*") |>
  separate(group, c('key', 'value'), sep = "\\s") |>
  distinct() |> # remove duplicated annotations, if any
  pivot_wider(id_cols = chr:frame, names_from = key, values_from = value) -> my_gtf_split

head(my_gtf_split, 10)
# A tibble: 10 × 24
   chr   src    feat  start   end score strand frame gene_id gene_type gene_name
   <chr> <chr>  <chr> <dbl> <dbl> <chr> <chr>  <chr> <list>  <list>    <list>   
 1 chr1  HAVANA gene  11869 14409 .     +      .     <chr>   <chr [1]> <chr [1]>
 2 chr1  HAVANA tran… 11869 14409 .     +      .     <chr>   <chr [1]> <chr [1]>
 3 chr1  HAVANA exon  11869 12227 .     +      .     <chr>   <chr [1]> <chr [1]>
 4 chr1  HAVANA exon  12613 12721 .     +      .     <chr>   <chr [1]> <chr [1]>
 5 chr1  HAVANA exon  13221 14409 .     +      .     <chr>   <chr [1]> <chr [1]>
 6 chr1  HAVANA tran… 12010 13670 .     +      .     <chr>   <chr [1]> <chr [1]>
 7 chr1  HAVANA exon  12010 12057 .     +      .     <chr>   <chr [1]> <chr [1]>
 8 chr1  HAVANA exon  12179 12227 .     +      .     <chr>   <chr [1]> <chr [1]>
 9 chr1  HAVANA exon  12613 12697 .     +      .     <chr>   <chr [1]> <chr [1]>
10 chr1  HAVANA exon  12975 13052 .     +      .     <chr>   <chr [1]> <chr [1]>
# ? 13 more variables: level <list>, hgnc_id <list>, havana_gene <list>,
#   transcript_id <list>, transcript_type <list>, transcript_name <list>,
#   transcript_support_level <list>, tag <list>, havana_transcript <list>,
#   exon_number <list>, exon_id <list>, ont <list>, protein_id <list>

However, the split columns are lists because there were some cases where there were multiple annotations with the same key and a list is needed to store multiple values (which was what the warning above was about). For example the tag key was repeated more than once with different unique values for some annotations.

map_lgl(my_gtf_split$tag, function(x) length(x) > 1)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[49]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
[61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
[73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

We can check which columns have multiple values.

check_column <- function(x){
  any(map_lgl(x, function(y) length(y) > 1))
}

my_check <- map_lgl(my_gtf_split, check_column)
my_check[my_check]
    transcript_id   transcript_name               tag havana_transcript 
             TRUE              TRUE              TRUE              TRUE 
      exon_number               ont 
             TRUE              TRUE 

Therefore despite only a subset of the columns containing multiple values, all the pivoted columns were turned into lists. However we can turn the columns back into characters, which makes sense for the gene_id column which only contained single unique character values in the first place.

my_gtf_split |>
  mutate(gene_id = as.character(gene_id)) |>
  head()
# A tibble: 6 × 24
  chr   src    feat   start   end score strand frame gene_id gene_type gene_name
  <chr> <chr>  <chr>  <dbl> <dbl> <chr> <chr>  <chr> <chr>   <list>    <list>   
1 chr1  HAVANA gene   11869 14409 .     +      .     ENSG00… <chr [1]> <chr [1]>
2 chr1  HAVANA trans… 11869 14409 .     +      .     ENSG00… <chr [1]> <chr [1]>
3 chr1  HAVANA exon   11869 12227 .     +      .     ENSG00… <chr [1]> <chr [1]>
4 chr1  HAVANA exon   12613 12721 .     +      .     ENSG00… <chr [1]> <chr [1]>
5 chr1  HAVANA exon   13221 14409 .     +      .     ENSG00… <chr [1]> <chr [1]>
6 chr1  HAVANA trans… 12010 13670 .     +      .     ENSG00… <chr [1]> <chr [1]>
# ? 13 more variables: level <list>, hgnc_id <list>, havana_gene <list>,
#   transcript_id <list>, transcript_type <list>, transcript_name <list>,
#   transcript_support_level <list>, tag <list>, havana_transcript <list>,
#   exon_number <list>, exon_id <list>, ont <list>, protein_id <list>

But we can also do this to the tag column (even if it needed a list to store the multiple values) and entries with multiple values get turned into R (character) code!

my_gtf_split |>
  mutate(tag = as.character(tag)) |>
  select(tag) |>
  head()
# A tibble: 6 × 1
  tag                                  
  <chr>                                
1 "NULL"                               
2 "basic"                              
3 "basic"                              
4 "basic"                              
5 "basic"                              
6 "c(\"basic\", \"Ensembl_canonical\")"

If you don't mind having R (character) code in your data, you can perform this transformation across all pivoted columns.

my_gtf_split |>
  mutate(across(gene_id:protein_id, as.character))
# A tibble: 94 × 24
   chr   src    feat  start   end score strand frame gene_id gene_type gene_name
   <chr> <chr>  <chr> <dbl> <dbl> <chr> <chr>  <chr> <chr>   <chr>     <chr>    
 1 chr1  HAVANA gene  11869 14409 .     +      .     ENSG00… transcri… DDX11L1  
 2 chr1  HAVANA tran… 11869 14409 .     +      .     ENSG00… transcri… DDX11L1  
 3 chr1  HAVANA exon  11869 12227 .     +      .     ENSG00… transcri… DDX11L1  
 4 chr1  HAVANA exon  12613 12721 .     +      .     ENSG00… transcri… DDX11L1  
 5 chr1  HAVANA exon  13221 14409 .     +      .     ENSG00… transcri… DDX11L1  
 6 chr1  HAVANA tran… 12010 13670 .     +      .     ENSG00… transcri… DDX11L1  
 7 chr1  HAVANA exon  12010 12057 .     +      .     ENSG00… transcri… DDX11L1  
 8 chr1  HAVANA exon  12179 12227 .     +      .     ENSG00… transcri… DDX11L1  
 9 chr1  HAVANA exon  12613 12697 .     +      .     ENSG00… transcri… DDX11L1  
10 chr1  HAVANA exon  12975 13052 .     +      .     ENSG00… transcri… DDX11L1  
# ? 84 more rows
# ? 13 more variables: level <chr>, hgnc_id <chr>, havana_gene <chr>,
#   transcript_id <chr>, transcript_type <chr>, transcript_name <chr>,
#   transcript_support_level <chr>, tag <chr>, havana_transcript <chr>,
#   exon_number <chr>, exon_id <chr>, ont <chr>, protein_id <chr>

Summary

The following steps can be used to split a column containing key-value pairs into separate columns:

  1. Use separate_rows() to split a single column into rows
  2. Use separate() to split a key-value pair into two columns
  3. Use pivot_wider() to convert the long format table back to wide format

However, sometimes data is packed into a single column because it cannot be nicely formatted in the first place.


This post was sponsored by Ecodyst:

Ecodyst's innovative metallic condensers redefine rotary evaporation. They significantly enhance recovery rates while demonstrating a strong commitment to environmental responsibility. By eliminating the need for coolant liquids and reducing energy consumption by over 50%, Ecodyst's rotary evaporators offer a groundbreaking solution to your lab's needs.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.