OBO

The OBO flat file format (version 1.2) is described here -> http://www.geneontology.org/GO.format.obo-1_2.shtml

Comments start with an exclamation mark (!) and continue to the end of the line. A tag-value line has the following form:

<tag>: <value> {<trailing modifiers>} ! <comment>

The tag name is always a string. The value is always a string, but the value string may require special parsing depending on the tag with which it is associated.
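The line format above can be sketched in Perl as follows. This is a minimal illustration, not the parser used later on this page: the helper name parse_tag_line is hypothetical, and it assumes that trailing modifiers in {...} and exclamation marks inside quoted values do not need to be handled.

```perl
use strict;
use warnings;

# Split one OBO line into (tag, value, comment).
# Assumptions: no {trailing modifiers}, and no "!" inside quoted values.
sub parse_tag_line {
    my ($line) = @_;
    my $comment;
    # Cut off a trailing "! comment" at the first unescaped exclamation mark.
    if ($line =~ s/\s*(?<!\\)!\s*(.*)$//) {
        $comment = $1;
    }
    # The tag runs up to the first colon; the value is the rest of the line.
    my ($tag, $value) = $line =~ /^([^:\s]+):\s*(.*?)\s*$/
        or return;    # blank line, comment-only line, or stanza header
    return ($tag, $value, $comment);
}

my ($tag, $value, $comment) = parse_tag_line('is_a: TEST:0000001 ! car');
print "$tag | $value | $comment\n";
# prints: is_a | TEST:0000001 | car
```

The value comes back as a plain string; any tag-specific parsing (for example splitting a relationship value into type and target) would be a separate step.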

At present, every OBO stanza begins with an id tag.

is_a: This tag describes a subclassing relationship between one term and another. The value is the id of the term of which this term is a subclass. A term may have any number of is_a relationships.

intersection_of: This tag indicates that this term is equivalent to the intersection of several other terms. The value is either a term id, or a relationship type id, a space, and a term id.

relationship: This tag describes a typed relationship between this term and another term. The value of this tag should be the relationship type id, and then the id of the target term. The relationship type name must be a relationship type name as defined in a typedef tag stanza. The [Typedef] must either occur in a document in the current parse batch, or in a file imported via an import header tag. If the relationship type name is undefined, a parse error will be generated. If the id of the target term cannot be resolved by the end of parsing the current batch of files, this tag describes a "dangling reference"; see the parser requirements section for information about how a parser may handle dangling references. If a relationship is specified for a term with an is_obsolete value of true, a parse error will be generated.
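The two checks described for the relationship tag (undefined relationship type, dangling target) can be sketched as below. The function name check_relationship and the EX: ids are made up for illustration, and the sketch assumes all [Typedef] and [Term] ids have already been collected into hashes. The is_obsolete rule is not covered.

```perl
use strict;
use warnings;

# Hypothetical ids collected from [Typedef] and [Term] stanzas.
my %typedefs = map { $_ => 1 } qw(has_color has_make);
my %terms    = map { $_ => 1 } qw(EX:0000001 EX:0000002);

# Check one relationship value of the form "<type id> <target id>".
# Dies on an undefined type; returns 'ok' or 'dangling' for the target.
sub check_relationship {
    my ($value, $typedefs, $terms) = @_;
    my ($type, $target) = $value =~ /^(\S+)\s+(\S+)/
        or die "Malformed relationship value: $value\n";
    die "Undefined relationship type: $type\n" unless $typedefs->{$type};
    return $terms->{$target} ? 'ok' : 'dangling';
}

print check_relationship('has_color EX:0000002', \%typedefs, \%terms), "\n";
# prints: ok
print check_relationship('has_color EX:0000009', \%typedefs, \%terms), "\n";
# prints: dangling
```

Returning 'dangling' instead of dying leaves the decision about dangling references to the caller, which matches the note that a parser may handle them in different ways.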

OBO-Edit

Download OBO-Edit here -> http://oboedit.org/

Download Graphviz here -> http://www.graphviz.org/

Once OBO-Edit is installed, example files can be found in the test_resources folder. Let's take a look at one of them, structured-car.obo:

format-version: 1.2
date: 10:05:2011 11:25
saved-by: nomi
auto-generated-by: OBO-Edit 2.1-beta13
default-namespace: file:/Users/nomi/Documents/workspace/OBO-Edit/test_resources/car.obo

[Term]
id: TEST:0000001
name: car
synonym: "automobile" EXACT []

[Term]
id: TEST:0000002
name: blue

[Term]
id: TEST:0000003
name: blue car
is_a: TEST:0000001 ! car
relationship: has_color TEST:0000002 ! blue

[Term]
id: TEST:0000004
name: blue VW
is_a: TEST:0000005 ! VW
relationship: has_color TEST:0000002 ! blue

[Term]
id: TEST:0000005
name: VW
is_a: TEST:0000001 ! car

[Term]
id: TEST:0000006
name: automobile
is_obsolete: true

[Typedef]
id: has_color
name: has_color

[Typedef]
id: has_make
name: has_make

Among the non-obsolete terms, TEST:0000001 (car) and TEST:0000002 (blue) are at the top level and have no parents. TEST:0000003 (blue car) is_a TEST:0000001 (car) and also has the relationship has_color TEST:0000002 (blue). For the is_a relationship, TEST:0000001 is the parent and TEST:0000003 is the child.

TEST:0000005 (VW) is_a TEST:0000001 (car), and TEST:0000004 (blue VW) is_a TEST:0000005 (VW). This gives a chain of two is_a steps: TEST:0000004 -> TEST:0000005 -> TEST:0000001.
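The is_a chain can be followed programmatically. The sketch below hard-codes the is_a edges from structured-car.obo in a hash (the name %is_a and the function ancestors are illustrative only) and walks upwards breadth-first, so it would also cope with a term that has several parents.

```perl
use strict;
use warnings;

# is_a edges from structured-car.obo: child => [parents]
my %is_a = (
    'TEST:0000003' => ['TEST:0000001'],
    'TEST:0000004' => ['TEST:0000005'],
    'TEST:0000005' => ['TEST:0000001'],
);

# Return all ancestors of a term, nearest first.
sub ancestors {
    my ($id, $is_a) = @_;
    my (@found, %seen);
    my @queue = @{ $is_a->{$id} || [] };
    while (defined(my $parent = shift @queue)) {
        next if $seen{$parent}++;    # skip terms reached by another path
        push @found, $parent;
        push @queue, @{ $is_a->{$parent} || [] };
    }
    return @found;
}

print join(' -> ', 'TEST:0000004', ancestors('TEST:0000004', \%is_a)), "\n";
# prints: TEST:0000004 -> TEST:0000005 -> TEST:0000001
```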

Perl parser

#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Std;

my %opts = ();
getopts('f:h', \%opts);

if ($opts{'h'} || !keys %opts){
   usage();
}

sub usage {
   print STDERR <<EOF;

Program: parse_obo.pl (parses an OBO file)
Version: 0.0.1

Usage: $0 -f infile

where -f      the name of the obo file
      -h      this helpful usage message
EOF
   exit;
}

my @terms    = ();
my @obsos    = ();
my @synonyms = ();
my @parents  = ();
my %id2idx   = ();
my %children = ();

my @maps      = qw(TERM OBSOLETE PARENTS CHILDREN ANCESTOR OFFSPRING);
my %mapcounts = ();

open(IN,'<',$opts{'f'}) || die "Could not open $opts{f}: $!\n";
while (<IN>) {
   chomp;
   if (/^\[Term\]/) {
      my $id     = '';
      my $term   = '';
      my $obso   = 0;
      my @altids = ();
      my @syns   = ();
      my @rels   = ();
      while (<IN>) {
         chomp;
         last if (/^$/);
         if (/^id: (\S+)/) {
            $id   = $1;
         } elsif (/^name: (.*)$/) {
            $term = $1;
         } elsif (/^is_obsolete:\s+true/) {
            $obso = 1;
         } elsif (/^alt_id: (.*)$/) {
            push(@altids, $1);
         } elsif (/^synonym: (.*)$/) {
            my $line = $1;
            # Expected form: "text" SCOPE [dbxrefs], e.g. "automobile" EXACT []
            my ($syn) = $line =~ /^"(.*)" \S+ \[.*\]$/;
            if (!defined $syn || $syn eq '') {
               print STDERR "Unexpected line: $line\n";
            } else {
               push(@syns, $syn);
            }
         } elsif (/^is_a: (\S+)/) {
            push(@rels, [$1, 'is_a']);
         } elsif (/^relationship: (\S+) (\S+)/) {
            push(@rels, [$2, $1]);
         }
      }
      if ($term eq '') {
         print STDERR "name is blank for $id\n";
         next;
      }
      if (!$obso) {
         push(@terms, [$id, $term]);
         foreach my $altid (@altids) {
            push(@synonyms, [$id, $altid, $altid, 1]);
         }
         foreach my $syn (@syns) {
            push(@synonyms, [$id, $syn, "", 0]);
         }
         foreach my $rel (@rels) {
            push(@parents, [$id, @$rel]);
            $children{$rel->[0]}->{$id}++;
         }
      } else {
         push(@obsos, [$id, $term]);
      }
   }
}
close(IN);

use Data::Dumper;
print Dumper(@parents);

Running the script:

parse_obo.pl -f structured-car.obo
$VAR1 = [
          'TEST:0000003',
          'TEST:0000001',
          'is_a'
        ];
$VAR2 = [
          'TEST:0000003',
          'TEST:0000002',
          'has_color'
        ];
$VAR3 = [
          'TEST:0000004',
          'TEST:0000005',
          'is_a'
        ];
$VAR4 = [
          'TEST:0000004',
          'TEST:0000002',
          'has_color'
        ];
$VAR5 = [
          'TEST:0000005',
          'TEST:0000001',
          'is_a'
        ];
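Each row of this output is a [child, parent, relationship type] triple. As a small follow-up sketch, the rows can be turned into a parent-to-children index. The rows are copied from the output above; restricting the index to is_a is a choice made here for illustration, since the script's own %children hash counts every relationship type.

```perl
use strict;
use warnings;

# Rows as printed above: [child, parent, relationship type]
my @parents = (
    ['TEST:0000003', 'TEST:0000001', 'is_a'],
    ['TEST:0000003', 'TEST:0000002', 'has_color'],
    ['TEST:0000004', 'TEST:0000005', 'is_a'],
    ['TEST:0000004', 'TEST:0000002', 'has_color'],
    ['TEST:0000005', 'TEST:0000001', 'is_a'],
);

# Index children by parent, keeping only is_a edges.
my %children;
for my $row (@parents) {
    my ($child, $parent, $type) = @$row;
    next unless $type eq 'is_a';
    push @{ $children{$parent} }, $child;
}

for my $parent (sort keys %children) {
    print "$parent: @{ $children{$parent} }\n";
}
# prints:
# TEST:0000001: TEST:0000003 TEST:0000005
# TEST:0000005: TEST:0000004
```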