Reading irregular data into R

Some bioinformatics tools output files that are visually nice and are meant for manual inspection. This practice of generating visually nice output (and/or) Excel files may be rooted in how bioinformaticians and biologists used to work with each other; give the bioinformatician/s some data to analyse and they will generate output that will be manually inspected and verified by the biologist/s. I say "used to" because nowadays a lot of biologists are much more savvy with data processing and analysis, and can either analyse their own data or don't need to do everything manually.

One such tool is NetMHCIIpan, which generates output that requires a bit more work to parse in contrast to parsing a CSV or TSV file. Here's how the output looks:

gunzip -c data/netmhciipan.out.gz | head -20
# NetMHCIIpan version 4.0

# Input is in FASTA format

# Peptide length 15

# Prediction Mode: EL

# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank)   10%

# Allele: DRB1_0101
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos           MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1     DRB1_0101      MAEMKTDAATLAQEA    3   MKTDAATLA     0.980          P9WNK5      0.097960     9.99       NA   <=WB
   2     DRB1_0101      AEMKTDAATLAQEAG    2   MKTDAATLA     0.867          P9WNK5      0.041685    16.05       NA       
   3     DRB1_0101      EMKTDAATLAQEAGN    4   DAATLAQEA     0.613          P9WNK5      0.012857    28.51       NA       
   4     DRB1_0101      MKTDAATLAQEAGNF    3   DAATLAQEA     0.813          P9WNK5      0.009156    33.36       NA       
   5     DRB1_0101      KTDAATLAQEAGNFE    2   DAATLAQEA     0.627          P9WNK5      0.004353    45.83       NA       

The base R function read.table() can skip n number of lines at the start of a file, which would have solved the problem (after figuring out how many lines to skip) but the end of the file contains some free text and lines starting with hyphens, i.e. -.

gunzip -c data/netmhciipan.out.gz | tail
  80     DRB1_0101      GVQYSRADEEQQQAL    3   YSRADEEQQ     0.940          P9WNK5      0.033621    17.96       NA       
  81     DRB1_0101      VQYSRADEEQQQALS    2   YSRADEEQQ     0.900          P9WNK5      0.005510    41.52       NA       
  82     DRB1_0101      QYSRADEEQQQALSS    1   YSRADEEQQ     0.473          P9WNK5      0.001152    73.90       NA       
  83     DRB1_0101      YSRADEEQQQALSSQ    5   EEQQQALSS     0.733          P9WNK5      0.003404    50.77       NA       
  84     DRB1_0101      SRADEEQQQALSSQM    4   EEQQQALSS     0.840          P9WNK5      0.004729    44.32       NA       
  85     DRB1_0101      RADEEQQQALSSQMG    3   EEQQQALSS     0.667          P9WNK5      0.008057    35.30       NA       
  86     DRB1_0101      ADEEQQQALSSQMGF    5   QQALSSQMG     0.593          P9WNK5      0.006416    38.95       NA       
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 3 Number of weak binders: 5
--------------------------------------------------------------------------------------------------------------------------------------------

The approach I opted for, for parsing this output, was to create a function that skips lines that match a specified pattern called a regular expression or regex. We will use the readr package, a very useful package for loading data into R.

library(readr)

The read_lines() function will be used first to read the input. This function reads up to n_max lines from a file and stores each line as a character vector. Notably:

  • Files ending in .gz, .bz2, .xz or .zip will be automatically uncompressed.
  • Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.

Specifying n_max = -1L will read all lines of a file; read_lines is typically used to scan only a small portion of a file to figure out its content but here we will use it to read every line.

read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  str()
 chr [1:104] "# NetMHCIIpan version 4.0" "" "# Input is in FASTA format" "" ...

We will define a skip_line function that can be used to skip lines that match a regex. This is typical of how I use Perl or Python to read and store files.

The regex below will match (and therefore skip) lines that:

  • start with a - (^-) or (|)
  • start with Number (^Number) or (|)
  • start with a # (^#) or (|)
  • are blank, i.e. does not contain any characters (^$) or (|)
  • start with a space and Pos (^\\sPos).
skip_line <- function(x, regex = "^#"){
  wanted <- !grepl(pattern = regex, x = x, perl = TRUE)
  return(x[wanted])
}

read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
  head()
[1] "   1     DRB1_0101      MAEMKTDAATLAQEA    3   MKTDAATLA     0.980          P9WNK5      0.097960     9.99       NA   <=WB"
[2] "   2     DRB1_0101      AEMKTDAATLAQEAG    2   MKTDAATLA     0.867          P9WNK5      0.041685    16.05       NA       "
[3] "   3     DRB1_0101      EMKTDAATLAQEAGN    4   DAATLAQEA     0.613          P9WNK5      0.012857    28.51       NA       "
[4] "   4     DRB1_0101      MKTDAATLAQEAGNF    3   DAATLAQEA     0.813          P9WNK5      0.009156    33.36       NA       "
[5] "   5     DRB1_0101      KTDAATLAQEAGNFE    2   DAATLAQEA     0.627          P9WNK5      0.004353    45.83       NA       "
[6] "   6     DRB1_0101      TDAATLAQEAGNFER    5   LAQEAGNFE     0.640          P9WNK5      0.007844    35.70       NA       "

We use the native R pipe (|>) to pipe the character vector to the skip_line function, which will then skip entries matching our regex; this requires R-4.1.0 or above. Note the use of the _ placeholder for the input vector (this requires R-4.2.0 or above); this is not necessary since the vector will be automatically forwarded to the skip_line function when using |> but I have included it here to show how you can refer to piped input.

Finally, we will use the read_table function (this is similar to the read.table function) to output a tibble from the lines that we want, i.e. those that were not skipped. This function is very useful for this type of data where each column is separated by an irregular number of spaces (one or more).

We will define the column names and the column types that are stored as a vector and list, respectively. This is good practice because this ensures that your column contains the same data type. In addition, by explicitly defining the columns, read_table can assign NAs when data is not available.

my_colnames <- c(
  "pos",
  "mhc",
  "peptide",
  "offset",
  "core",
  "core_rel",
  "id",
  "score_el",
  "perc_rank_el",
  "exp_bind",
  "bind_level"
)

my_coltypes <- list(
  pos = "i",
  mhc = "c",
  peptide = "c",
  offset = "i",
  core = "c",
  core_rel = "d",
  id = "c",
  score_el = "d",
  perc_rank_el = "d",
  exp_bind = "d",
  bind_level = "c"
)

read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
  read_table(col_names = my_colnames, col_types = my_coltypes) |>
  tail()
# A tibble: 6 × 11
    pos mhc   peptide offset core  core_rel id    score_el perc_rank_el exp_bind
  <int> <chr> <chr>    <int> <chr>    <dbl> <chr>    <dbl>        <dbl>    <dbl>
1    81 DRB1… VQYSRA…      2 YSRA…    0.9   P9WN…  0.00551         41.5       NA
2    82 DRB1… QYSRAD…      1 YSRA…    0.473 P9WN…  0.00115         73.9       NA
3    83 DRB1… YSRADE…      5 EEQQ…    0.733 P9WN…  0.00340         50.8       NA
4    84 DRB1… SRADEE…      4 EEQQ…    0.84  P9WN…  0.00473         44.3       NA
5    85 DRB1… RADEEQ…      3 EEQQ…    0.667 P9WN…  0.00806         35.3       NA
6    86 DRB1… ADEEQQ…      5 QQAL…    0.593 P9WN…  0.00642         39.0       NA
# ? 1 more variable: bind_level <chr>

Summary

My initial approach was to use read_lines to figure out how many lines to skip when using read_table but the text at the end of the output resulted in parsing errors. This was when I thought a better approach is to explicitly state which lines to skip, no matter where they occur in the output.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.