Split single column of key-value pairs into multiple columns

Two widely used file formats in bioinformatics, VCF and GTF, have single columns that are packed with annotation information. This makes them a bit inconvenient to work with in R when using data frames because the values need to be unpacked, i.e. split. In addition, this violates one of the conditions for tidy data, which is that every cell is a single value. In this post, we will use tools from the tidyverse to split the values into multiple columns to make the data easier to work with in R.

To get started, install the tidyverse if you haven't already.

if(!require("tidyverse")){
  install.packages("tidyverse")
}
library(tidyverse)

I have a small package called importbio that can be used to load a VCF and GTF file. You can install it using the remotes package.

if(!require("remotes")){
  install.packages("remotes")
}

if(!require("importbio")){
  remotes::install_github('davetang/importbio')
}

Splitting the info column in a VCF file

We will load a small VCF file using importvcf.

library(importbio)
my_vcf <- importvcf("https://raw.githubusercontent.com/davetang/learning_vcf_file/master/eg/Pfeiffer.vcf")

my_vcf %>%
  select(info) %>%
  head()
# A tibble: 6 × 1
  info                                                                                                         
  <chr>                                                                                                        
1 AC=2;AF=1.00;AN=2;DB;DP=11;FS=0.000;HRun=0;HaplotypeScore=41.3338;MQ0=0;MQ=61.94;QD=23.51;set=variant        
2 AC=1;AF=0.50;AN=2;BaseQRankSum=1.455;DB;DP=21;Dels=0.00;FS=1.984;HRun=0;HaplotypeScore=0.0000;MQ0=0;MQ=60.00…
3 AC=1;AF=0.50;AN=2;BaseQRankSum=1.934;DP=48;Dels=0.00;FS=4.452;HRun=0;HaplotypeScore=0.5784;MQ0=0;MQ=59.13;MQ…
4 AC=1;AF=0.50;AN=2;BaseQRankSum=-4.517;DB;DP=29;Dels=0.00;FS=1.485;HRun=0;HaplotypeScore=0.0000;MQ0=0;MQ=56.9…
5 AC=1;AF=0.50;AN=2;BaseQRankSum=0.199;DB;DP=33;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=1.8893;MQ0=0;MQ=60.00…
6 AC=1;AF=0.50;AN=2;BaseQRankSum=-0.259;DB;DP=12;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.0000;MQ0=0;MQ=53.2…

Note that the info column is packed with all sorts of information for each variant. Also note the consistent format of the info column: each annotation is separated by a semi-colon (;) and annotations are stored as key-value pairs with an equal sign in between.

Firstly, we will use separate_rows to create a new row for each annotation by using ; as the separator/delimiter; note that I have included \\s* after ;, which is a regex for specifying a single whitespace occurring 0 or more times. By including the regex, whitespace after ; will be removed, which is good because we do not want whitespace in our data. In addition, a mutate call is used prior to calling separate_rows and it is used to remove a trailing semicolon, if it exists.

my_vcf %>%
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) %>%
  separate_rows(info, sep = ";\\s*") %>%
  head()
# A tibble: 6 × 10
  vid              chrom    pos id         ref   alt   qual   filter info     type 
  <chr>            <fct>  <int> <chr>      <chr> <chr> <chr>  <chr>  <chr>    <chr>
1 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AC=2     ins  
2 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AF=1.00  ins  
3 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AN=2     ins  
4 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   DB       ins  
5 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   DP=11    ins  
6 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   FS=0.000 ins  

The next step is to split the key-value pairs and we will use the separate function to separate the pairs into two columns, which we will name key and value, using the equal sign as the separator/delimiter. Sometimes a key is missing a value and in these cases, the value will be NA.

my_vcf %>%
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) %>%
  separate_rows(info, sep = ";\\s*") %>%
  separate(info, c('key', 'value'), sep = "=") %>%
  head(10)
# A tibble: 10 × 11
   vid              chrom    pos id         ref   alt   qual   filter key            value   type 
   <chr>            <fct>  <int> <chr>      <chr> <chr> <chr>  <chr>  <chr>          <chr>   <chr>
 1 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AC             2       ins  
 2 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AF             1.00    ins  
 3 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AN             2       ins  
 4 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   DB             NA      ins  
 5 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   DP             11      ins  
 6 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   FS             0.000   ins  
 7 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   HRun           0       ins  
 8 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   HaplotypeScore 41.3338 ins  
 9 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   MQ0            0       ins  
10 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   MQ             61.94   ins  

The current state of the transformation produces a new row for each variant annotation and two columns containing the key and value. If we want our data in wide format where each annotation is a column, we can use the pivot_wider function.

In the code below, I have used the first eight columns (id_cols = vid:filter) to specify the columns that uniquely identifies each variant, i.e. the same variant will have the same values in these columns. We specify our column names from the key column and the values for these cells will come from the value column.

my_vcf %>%
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) %>%
  separate_rows(info, sep = ";\\s*") %>%
  separate(info, c('key', 'value'), sep = "=") %>%
  distinct() %>% # remove duplicated annotations, if any
  pivot_wider(id_cols = vid:filter, names_from = key, values_from = value) %>%
  head(10)

Now each row is a single variant and each column is a variable.

# A tibble: 10 × 28
   vid     chrom    pos id    ref   alt   qual  filter AC    AF    AN    DB    DP    FS    HRun  HaplotypeScore
   <chr>   <fct>  <int> <chr> <chr> <chr> <chr> <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>         
 1 1_8665… 1     866511 rs60… C     CCCCT 258.… PASS   2     1.00  2     NA    11    0.000 0     41.3338       
 2 1_8793… 1     879317 rs75… C     T     150.… PASS   1     0.50  2     NA    21    1.984 0     0.0000        
 3 1_8794… 1     879482 .     G     C     484.… PASS   1     0.50  2     NA    48    4.452 0     0.5784        
 4 1_8803… 1     880390 rs37… C     A     288.… PASS   1     0.50  2     NA    29    1.485 0     0.0000        
 5 1_8816… 1     881627 rs22… G     A     486.… PASS   1     0.50  2     NA    33    0.000 1     1.8893        
 6 1_8840… 1     884091 rs75… C     G     65.46 PASS   1     0.50  2     NA    12    0.000 1     0.0000        
 7 1_8841… 1     884101 rs49… A     C     85.81 PASS   1     0.50  2     NA    12    0.000 3     0.0000        
 8 1_8924… 1     892460 rs41… G     C     1736… PASS   1     0.50  2     NA    152   0.000 0     0.0000        
 9 1_8977… 1     897730 rs75… C     T     225.… PASS   1     0.50  2     NA    21    6.419 1     1.8410        
10 1_9092… 1     909238 rs38… G     C     247.… PASS   1     0.50  2     NA    19    0.000 1     0.0000        
# … with 12 more variables: MQ0 <chr>, MQ <chr>, QD <chr>, set <chr>, BaseQRankSum <chr>, Dels <chr>,
#   MQRankSum <chr>, ReadPosRankSum <chr>, DS <chr>, GENE <chr>, INHERITANCE <chr>, MIM <chr>

Splitting the group column in a GTF file

The GTF also has a column packed with key-value pairs.

my_gtf <- read_tsv(
  file = "https://github.com/davetang/importbio/raw/master/inst/extdata/gencode.v38.annotation.sample.gtf.gz",
  comment = "#",
  show_col_types = FALSE,
  col_names = c('chr', 'src', 'feat', 'start', 'end', 'score', 'strand', 'frame', 'group'))

my_gtf %>%
  select(group) %>%
  head()
# A tibble: 6 × 1
  group                                                                                                        
  <chr>                                                                                                        
1 "gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogene\"; gene_name \"DDX11L1\"; lev…
2 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pse…
3 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pse…
4 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pse…
5 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_type \"transcribed_unprocessed_pse…
6 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000450305.2\"; gene_type \"transcribed_unprocessed_pse…

We can use the same strategy (but with some additional formatting steps) to split the column up.

my_gtf %>%
  mutate(group = sub(pattern = ";$", replacement = "", x = .data$group)) %>%
  mutate(group = gsub(pattern = '"', replacement = "", x = .data$group)) %>%
  separate_rows(group, sep = ";\\s*") %>%
  separate(group, c('key', 'value'), sep = "\\s") %>%
  distinct() %>% # remove duplicated annotations, if any
  pivot_wider(id_cols = chr:frame, names_from = key, values_from = value) -> my_gtf_split

head(my_gtf_split, 10)
# A tibble: 10 × 24
   chr   src    feat       start   end score strand frame gene_id gene_type gene_name level hgnc_id havana_gene
   <chr> <chr>  <chr>      <dbl> <dbl> <chr> <chr>  <chr> <list>  <list>    <list>    <lis> <list>  <list>     
 1 chr1  HAVANA gene       11869 14409 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 2 chr1  HAVANA transcript 11869 14409 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 3 chr1  HAVANA exon       11869 12227 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 4 chr1  HAVANA exon       12613 12721 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 5 chr1  HAVANA exon       13221 14409 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 6 chr1  HAVANA transcript 12010 13670 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 7 chr1  HAVANA exon       12010 12057 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 8 chr1  HAVANA exon       12179 12227 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
 9 chr1  HAVANA exon       12613 12697 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
10 chr1  HAVANA exon       12975 13052 .     +      .     <chr>   <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
# … with 10 more variables: transcript_id <list>, transcript_type <list>, transcript_name <list>,
#   transcript_support_level <list>, tag <list>, havana_transcript <list>, exon_number <list>, exon_id <list>,
#   ont <list>, protein_id <list>

However, the split columns are lists because there were some cases where there were multiple annotations with the same key and a list is needed to store multiple values. For example the tag key was repeated more than once with different unique values for some annotations.

map_lgl(my_gtf_split$tag, function(x) length(x) > 1)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[18]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[35]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
[52]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[69]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
[86] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

We can check which columns have multiple values.

check_column <- function(x){
  any(map_lgl(x, function(y) length(y) > 1))
}

my_check <- map_lgl(my_gtf_split, check_column)
names(my_check[my_check])
[1] "transcript_id"     "transcript_name"   "tag"               "havana_transcript" "exon_number"      
[6] "ont"       

Therefore despite only a subset of the columns containing multiple values, all the pivoted columns were turned into lists. However we can turn the columns back into characters, which makes sense for the gene_id column which only contained single unique character values in the first place.

my_gtf_split %>%
  mutate(gene_id = as.character(gene_id)) %>%
  head()
# A tibble: 6 × 24
  chr   src    feat       start   end score strand frame gene_id  gene_type gene_name level hgnc_id havana_gene
  <chr> <chr>  <chr>      <dbl> <dbl> <chr> <chr>  <chr> <chr>    <list>    <list>    <lis> <list>  <list>     
1 chr1  HAVANA gene       11869 14409 .     +      .     ENSG000… <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
2 chr1  HAVANA transcript 11869 14409 .     +      .     ENSG000… <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
3 chr1  HAVANA exon       11869 12227 .     +      .     ENSG000… <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
4 chr1  HAVANA exon       12613 12721 .     +      .     ENSG000… <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
5 chr1  HAVANA exon       13221 14409 .     +      .     ENSG000… <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
6 chr1  HAVANA transcript 12010 13670 .     +      .     ENSG000… <chr [1]> <chr [1]> <chr> <chr>   <chr [1]>  
# … with 10 more variables: transcript_id <list>, transcript_type <list>, transcript_name <list>,
#   transcript_support_level <list>, tag <list>, havana_transcript <list>, exon_number <list>, exon_id <list>,
#   ont <list>, protein_id <list>

But we can also do this to the tag column (even if it needed a list to store the multiple values) and entries with multiple values get turned into R (character) code!

my_gtf_split %>%
  mutate(tag = as.character(tag)) %>%
  select(tag) %>%
  head()
# A tibble: 6 × 1
  tag                                  
  <chr>                                
1 "NULL"                               
2 "basic"                              
3 "basic"                              
4 "basic"                              
5 "basic"                              
6 "c(\"basic\", \"Ensembl_canonical\")"

If you don't mind having R (character) code in your data, you can perform this transformation across all pivoted columns.

my_gtf_split %>%
  mutate(across(gene_id:protein_id, as.character))
# A tibble: 94 × 24
   chr   src    feat       start   end score strand frame gene_id gene_type gene_name level hgnc_id havana_gene
   <chr> <chr>  <chr>      <dbl> <dbl> <chr> <chr>  <chr> <chr>   <chr>     <chr>     <chr> <chr>   <chr>      
 1 chr1  HAVANA gene       11869 14409 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 2 chr1  HAVANA transcript 11869 14409 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 3 chr1  HAVANA exon       11869 12227 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 4 chr1  HAVANA exon       12613 12721 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 5 chr1  HAVANA exon       13221 14409 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 6 chr1  HAVANA transcript 12010 13670 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 7 chr1  HAVANA exon       12010 12057 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 8 chr1  HAVANA exon       12179 12227 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
 9 chr1  HAVANA exon       12613 12697 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
10 chr1  HAVANA exon       12975 13052 .     +      .     ENSG00… transcri… DDX11L1   2     HGNC:3… OTTHUMG000…
# … with 84 more rows, and 10 more variables: transcript_id <chr>, transcript_type <chr>,
#   transcript_name <chr>, transcript_support_level <chr>, tag <chr>, havana_transcript <chr>,
#   exon_number <chr>, exon_id <chr>, ont <chr>, protein_id <chr>

Summary

The following steps can be used to split a column containing key-value pairs into separate columns:

  1. Use separate_rows to split a single column into rows
  2. Use separate to split a key-value pair into two columns
  3. Use pivot_wider to convert the long format table back to wide format

However, sometimes data is packed into a single column because it cannot be nicely formatted in the first place.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.