From Dave's wiki
Revision as of 03:29, 7 March 2014 by Admin (talk | contribs) (→‎Perl parser)

The OBO flat file format is described here -> http://www.geneontology.org/GO.format.obo-1_2.shtml

Comments start with an exclamation mark (!) and run to the end of the line. Each tag-value pair has the form:

<tag>: <value> {<trailing modifiers>} ! <comment>

The tag name is always a string. The value is always a string, but the value string may require special parsing depending on the tag with which it is associated.
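
As a rough illustration of the tag-value layout above, the following Perl sketch splits one line into its tag, value, trailing modifiers, and comment. The sub name parse_tag_line is hypothetical, and the sketch assumes that "!" and "{" never occur inside a quoted value, which a complete parser would have to handle.

```perl
use strict;
use warnings;

# Hypothetical helper: split one OBO line into tag, value, trailing
# modifiers, and comment. Assumes "!" and "{" do not occur inside
# quoted values.
sub parse_tag_line {
   my ($line) = @_;
   chomp $line;
   my %parsed = (tag => undef, value => undef, modifiers => undef, comment => undef);

   # Everything after the first unescaped "!" is a comment
   if ($line =~ s/(?<!\\)!\s*(.*)$//) {
      $parsed{comment} = $1;
   }
   # Optional trailing modifiers in curly braces
   if ($line =~ s/\s*\{([^}]*)\}\s*$//) {
      $parsed{modifiers} = $1;
   }
   # What remains should be "<tag>: <value>"
   if ($line =~ /^([^:\s]+):\s*(.*?)\s*$/) {
      @parsed{qw(tag value)} = ($1, $2);
   }
   return \%parsed;
}

my $p = parse_tag_line('is_a: TEST:0000001 ! car');
print "$p->{tag} | $p->{value} | $p->{comment}\n";
# prints "is_a | TEST:0000001 | car"
```

The value is returned as a plain string; any tag-specific parsing (for example of a synonym value) would happen in a later step.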

Each stanza opens with a stanza header in square brackets, such as [Term] or [Typedef]. At present, the first tag in every OBO stanza is always an id tag.

is_a: This tag describes a subclassing relationship between one term and another. The value is the id of the term of which this term is a subclass. A term may have any number of is_a relationships.

intersection_of: This tag indicates that this term is equivalent to the intersection of several other terms. The value is either a term id, or a relationship type id, a space, and a term id.

relationship: This tag describes a typed relationship between this term and another term. The value of this tag should be the relationship type id, and then the id of the target term. The relationship type name must be a relationship type name as defined in a typedef tag stanza. The [Typedef] must either occur in a document in the current parse batch, or in a file imported via an import header tag. If the relationship type name is undefined, a parse error will be generated. If the id of the target term cannot be resolved by the end of parsing the current batch of files, this tag describes a "dangling reference"; see the parser requirements section for information about how a parser may handle dangling references. If a relationship is specified for a term with an is_obsolete value of true, a parse error will be generated.
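
The three conditions described for the relationship tag (undefined relationship type, dangling reference, relationship on an obsolete term) could be checked after parsing along the following lines. This is only a sketch: the sub name check_relationships, the data layout, and the message wording are assumptions, and has_owner and TEST:0000099 are made-up ids used to trigger the checks.

```perl
use strict;
use warnings;

# Hypothetical validator for relationship records. Assumed layout:
# $terms maps term id => { obsolete => 0|1 }, $typedefs maps
# relationship type id => 1, and each record is [source, type, target].
sub check_relationships {
   my ($terms, $typedefs, $rels) = @_;
   my @problems = ();
   foreach my $rel (@$rels) {
      my ($source, $type, $target) = @$rel;
      # The relationship type must be defined in a [Typedef] stanza
      if (!exists $typedefs->{$type}) {
         push(@problems, "error: undefined relationship type $type ($source)");
      }
      # An unresolved target is a dangling reference; reported as a warning here
      if (!exists $terms->{$target}) {
         push(@problems, "warning: dangling reference to $target ($source)");
      }
      # Obsolete terms must not carry relationships
      if (exists $terms->{$source} && $terms->{$source}{obsolete}) {
         push(@problems, "error: relationship on obsolete term $source");
      }
   }
   return @problems;
}

my %terms = (
   'TEST:0000002' => { obsolete => 0 },
   'TEST:0000003' => { obsolete => 0 },
   'TEST:0000006' => { obsolete => 1 },
);
my %typedefs = (has_color => 1);
my @rels = (
   ['TEST:0000003', 'has_color', 'TEST:0000002'],   # valid
   ['TEST:0000003', 'has_owner', 'TEST:0000099'],   # undefined type, dangling target
   ['TEST:0000006', 'has_color', 'TEST:0000002'],   # source term is obsolete
);
print "$_\n" foreach check_relationships(\%terms, \%typedefs, \@rels);
```

The checks run only after all stanzas have been read, since a target term or a [Typedef] may be defined later in the file than the relationship that uses it.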


Download OBO-Edit here -> http://oboedit.org/

Download graphviz -> http://www.graphviz.org/

Once OBO-Edit is installed, there are example files in the test_resources folder. Let's take a look at one of them, structured-car.obo:

format-version: 1.2
date: 10:05:2011 11:25
saved-by: nomi
auto-generated-by: OBO-Edit 2.1-beta13
default-namespace: file:/Users/nomi/Documents/workspace/OBO-Edit/test_resources/car.obo

[Term]
id: TEST:0000001
name: car
synonym: "automobile" EXACT []

[Term]
id: TEST:0000002
name: blue

[Term]
id: TEST:0000003
name: blue car
is_a: TEST:0000001 ! car
relationship: has_color TEST:0000002 ! blue

[Term]
id: TEST:0000004
name: blue VW
is_a: TEST:0000005 ! VW
relationship: has_color TEST:0000002 ! blue

[Term]
id: TEST:0000005
name: VW
is_a: TEST:0000001 ! car

[Term]
id: TEST:0000006
name: automobile
is_obsolete: true

[Typedef]
id: has_color
name: has_color

[Typedef]
id: has_make
name: has_make

At the top level are the terms TEST:0000001 (car) and TEST:0000002 (blue), which have no is_a parent. TEST:0000003 (blue car) is_a TEST:0000001 (car) and has the relationship has_color TEST:0000002 (blue), so TEST:0000001 is the parent and TEST:0000003 is the child.

TEST:0000005 (VW) is_a TEST:0000001 (car), and TEST:0000004 (blue VW) is_a TEST:0000005 (VW). That gives two levels of is_a links: TEST:0000004 -> TEST:0000005 -> TEST:0000001.
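
The two-level chain can be followed programmatically. The sketch below hard-codes the is_a links from the example file and walks from a term up to its root; the sub name is_a_path is hypothetical, and the sketch assumes a single is_a parent per term, although OBO allows any number.

```perl
use strict;
use warnings;

# is_a links from structured-car.obo (child => parent), hard-coded here
my %is_a = (
   'TEST:0000003' => 'TEST:0000001',
   'TEST:0000004' => 'TEST:0000005',
   'TEST:0000005' => 'TEST:0000001',
);

# Hypothetical helper: follow is_a links from a term up to a root.
# Assumes at most one is_a parent per term.
sub is_a_path {
   my ($links, $id) = @_;
   my @path = ($id);
   my %seen = ($id => 1);
   while (exists $links->{$path[-1]}) {
      my $parent = $links->{$path[-1]};
      last if $seen{$parent}++;   # stop if the links form a cycle
      push(@path, $parent);
   }
   return @path;
}

print join(' -> ', is_a_path(\%is_a, 'TEST:0000004')), "\n";
# prints "TEST:0000004 -> TEST:0000005 -> TEST:0000001"
```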

Perl parser

#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Std;
use Data::Dumper;

my %opts = ();
getopts('f:h', \%opts);

# Show the usage message when -h is given or no options are supplied
if ($opts{'h'} || !keys %opts){
   usage();
}

my @terms    = ();   # [id, name] for each live term
my @obsos    = ();   # [id, name] for each obsolete term
my @synonyms = ();   # [id, synonym, alt_id, is_alt_id]
my @parents  = ();   # [id, target id, relationship type]
my %id2idx   = ();   # declared but not populated by this script
my %children = ();   # declared but not populated by this script

my %mapcounts = ();  # declared but not populated by this script

open(my $fh,'<',$opts{'f'}) || die "Could not open $opts{f}: $!\n";
while (<$fh>) {
   # Only [Term] stanzas are parsed; other stanzas are skipped
   if (/^\[Term\]/) {
      my $id     = '';
      my $term   = '';
      my $obso   = 0;
      my @altids = ();
      my @syns   = ();
      my @rels   = ();
      while (<$fh>) {
         # A blank line ends the stanza
         last if (/^\s*$/);
         if (/^id: (\S+)/) {
            $id   = $1;
         } elsif (/^name: (.*)$/) {
            $term = $1;
         } elsif (/^is_obsolete:\s+true/) {
            $obso = 1;
         } elsif (/^alt_id: (.*)$/) {
            push(@altids, $1);
         } elsif (/^synonym: (.*)$/) {
            my $line = $1;
            # Expect quoted text, a scope keyword, then a bracketed dbxref list
            my ($syn) = $line =~ /^\"(.*)\" \S+ \[.*\]$/;
            if (defined $syn && $syn ne '') {
               push(@syns, $syn);
            } else {
               print STDERR "Unexpected line: $line\n";
            }
         } elsif (/^is_a: (\S+)/) {
            push(@rels, [$1, 'is_a']);
         } elsif (/^relationship: (\S+) (\S+)/) {
            # Stored as [target id, relationship type]
            push(@rels, [$2, $1]);
         }
      }
      if ($term eq '') {
         print STDERR "name is blank for $id\n";
      }
      if (!$obso) {
         push(@terms, [$id, $term]);
         foreach my $altid (@altids) {
            push(@synonyms, [$id, $altid, $altid, 1]);
         }
         foreach my $syn (@syns) {
            push(@synonyms, [$id, $syn, "", 0]);
         }
         foreach my $rel (@rels) {
            push(@parents, [$id, @$rel]);
         }
      } else {
         push(@obsos, [$id, $term]);
      }
   }
}
close($fh);

print Dumper(@parents);

sub usage {
   print STDERR <<EOF;

Program: parse_obo.pl (parses an OBO file)
Version: 0.0.1

Usage: $0 -f infile

where -f      the name of the obo file
      -h      this helpful usage message

EOF
   exit(1);
}

Running the script:

parse_obo.pl -f structured-car.obo
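$VAR1 = [
          'TEST:0000003',
          'TEST:0000001',
          'is_a'
        ];
$VAR2 = [
          'TEST:0000003',
          'TEST:0000002',
          'has_color'
        ];
$VAR3 = [
          'TEST:0000004',
          'TEST:0000005',
          'is_a'
        ];
$VAR4 = [
          'TEST:0000004',
          'TEST:0000002',
          'has_color'
        ];
$VAR5 = [
          'TEST:0000005',
          'TEST:0000001',
          'is_a'
        ];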
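
The records that the script collects in @parents have the form [id, target id, relationship type]. One possible next step, sketched here with a hypothetical sub named children_of and records hard-coded from the example file, is to invert them into a lookup of children per parent for a single relationship type.

```perl
use strict;
use warnings;

# Records in the form [id, target id, relationship type], hard-coded
# from structured-car.obo for this sketch
my @parents = (
   ['TEST:0000003', 'TEST:0000001', 'is_a'],
   ['TEST:0000003', 'TEST:0000002', 'has_color'],
   ['TEST:0000004', 'TEST:0000005', 'is_a'],
   ['TEST:0000004', 'TEST:0000002', 'has_color'],
   ['TEST:0000005', 'TEST:0000001', 'is_a'],
);

# Hypothetical helper: index child ids by parent id for one relationship type
sub children_of {
   my ($type, @records) = @_;
   my %children = ();
   foreach my $rec (@records) {
      my ($child, $parent, $rel) = @$rec;
      next unless $rel eq $type;
      push(@{ $children{$parent} }, $child);
   }
   return \%children;
}

my $kids = children_of('is_a', @parents);
foreach my $parent (sort keys %$kids) {
   print "$parent: @{ $kids->{$parent} }\n";
}
# prints:
# TEST:0000001: TEST:0000003 TEST:0000005
# TEST:0000005: TEST:0000004
```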