meeting today


Michel Dumontier

Oct 11, 2007, 7:21:59 PM
to BioPAX...@googlegroups.com
Hey,
  Thanks for the great discussion guys - I've been digesting the class proposal (all reactions are classes with restrictions on their participants - and if these are asserted as an equivalentClass, then we can classify the reactions from different databases) over the last hour and a half, and I definitely see merit in pursuing that. I think the big challenge is whether you can find the mappings between reactome:Glucose and biocyc:Glucose - you might be able to do the syntactic match by assuming they refer to the same thing (which may be reasonable, but there can be no semantic guarantee), but what will you do in cases where you don't have syntactic matches? I understand the "debugging the bug" work is relevant here, perhaps you can summarize?

The instance proposal I laid out uses the following:
0) use an ontology of basic types of reactions (similar to BioPAX, which don't identify specific participants), instantiate ontologies of reactions such as the GO process or EC
1) use owl:sameAs between database individuals and their semantic equivalents (e.g. a uniprot protein)
2) assert data as instances of their respective classes (generally from OBO ontologies)
3) normalize relations: use basic (i.e. RO-compatible) relations, and create classes out of BioPAX datatype properties.
4) use situational objects in which relevant information (location, qualities, etc.) is associated with processes. These situational objects have real molecules as their parts.

ok, that's it for now...

-=Michel=-
--
Michel Dumontier
Assistant Professor of Bioinformatics
http://dumontierlab.com

Alan Ruttenberg

Oct 12, 2007, 2:44:35 AM
to BioPAX...@googlegroups.com
On Oct 11, 2007, at 7:21 PM, Michel Dumontier wrote:

> I think the big challenge is whether you can find the mappings
> between reactome:Glucose and biocyc:Glucose - you might be able to
> do the syntactic match by assuming they refer to the same thing
> (which may be reasonable, but there can be no semantic guarantee),
> but what will you do in cases where you don't have syntactic matches?

I agree that this is a challenge. (Glucose is a particularly nasty
case, as it happens)

In cases where both database providers have provided what are
intended as unification xrefs to a third database, and there is an
entity in each that identifies using the same third party, we can
hypothesize that they are the same. This may turn out to be correct
or incorrect. A common case of being incorrect, we found (exposed by
the reasoner), was in the case of using KEGG ids that were, in
effect, classes of compounds. In that case it wasn't uncommon for the
provider to mistakenly identify a specific compound with the generic
KEGG identifier.

In DTB, we found the generics in KEGG and treated them differently.
We found them with the combination of script and then manual review
of the results.

In other cases, a similar strategy may be needed - script to identify
candidates, and then review to sanction their use. Where no exact
match can be made, relations between "compounds" can be made, such as
using subclass relations between KEGG generics, and more specific
compounds.

I claim there is no representational way around this. However, if we
construct other parts of the representation well, when such a mapping
between compounds is made, other relationships, such as equivalence
between reactions, can be inferred - those are cases where we have
not had to curate the mapping (though we may want to review them to
ensure that the database provider has not made a mistake. They do
make mistakes, you know)

We can also write systems with enough statements that an erroneous
mapping will trigger an inconsistency, which can then be manually
reviewed and corrected. Our experience with DBD showed that even
simple constraints can uncover many mistakes.
More aggressively assigning mappings, such as syntactically, and then
collecting mismappings uncovered by inconsistency, can also be of
use. For instance, where a curator might be presented with a choice
of 5 possible meanings for a term, these mismappings might eliminate
3 cases...
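
The narrowing step in that last example can be sketched in a few lines of Python. This is only an illustration: the term, the candidate identifiers and the recorded mismappings are all made up, and in practice the mismappings would come from inconsistencies reported by a reasoner rather than a hand-written set.

```python
def remaining_candidates(term, candidates, mismappings):
    """Drop candidates already shown to be inconsistent for this term."""
    return [c for c in candidates if (term, c) not in mismappings]


# Hypothetical data: five possible meanings for one term.
candidates = ["ex:c1", "ex:c2", "ex:c3", "ex:c4", "ex:c5"]

# Hypothetical mismappings collected from earlier inconsistency reports.
mismappings = {
    ("glucose", "ex:c1"),
    ("glucose", "ex:c3"),
    ("glucose", "ex:c5"),
}

# The curator only has to choose among what is left.
shortlist = remaining_candidates("glucose", candidates, mismappings)
```

Publishing the mismappings alongside the mappings is what would let one group's review shrink the shortlist for everyone else.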

What we can look forward to, and specify, is how to write and
publish, in OWL, such mappings and mismappings, so that the manual
labor of a few can be shared with the many.

More specific examples from DBD, as I have time...


-Alan


zuc...@research.dfci.harvard.edu

Oct 12, 2007, 3:59:12 AM
to BioPAX...@googlegroups.com, michel.d...@gmail.com
Hey Michel,


> Thanks for the great discussion guys - I've been digesting the class
> proposal (all reactions are classes with restrictions on their
> participants - and if these are asserted as an equivalentClass, then
> we can classify the reactions from different databases) over the last
> hour and half, and i definitely see merit in pursuing that. I think
> the big challenge is whether you can find the mappings between
> reactome:Glucose and biocyc:Glucose - you might be able to do the
> syntactic match by assuming they refer to the same thing (which may be
> reasonable, but there can be no semantic guarantee), but what will you
> do in cases where you don't have syntactic matches? I understand the
> "debugging the bug" work is relevant here, perhaps you can
> summarize?
>

Sure. Glucose is a good example.

In the representation suggested by Alan, we should consider all
molecules as classes that are represented to some degree of
specification.


There are several choices for trying to reconcile compounds in different
databases.


In Reactome,
1. unify by Xref. If both compounds have the same KEGG ID, CHEBI ID, CAS
ID or Pubchem ID, then they belong in the class of compounds with the same
Xref.

In Biocyc, beta-D-Glucose (Frame ID: GLC) has links to
PUBCHEM:64689, LIGAND-COMPOUND:C00031, and CAS:50-99-7

Amazingly, Reactome does not appear to contain an entry for beta-D-Glucose.
It does contain an entry for alpha-D-Glucose, and it has links to
ChEBI:17925 LIGAND-COMPOUND:C00267, and PUBCHEM:8143952

BioCyc also has an entry for alpha-D-Glucose (Frame ID: ALPHA-GLUCOSE),
but unfortunately, it does not contain any links to external databases.
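
A minimal Python sketch of unification by Xref, using only the links quoted above. The function name and the dictionary layout are assumptions made for the illustration, not anything the databases provide.

```python
from collections import defaultdict

# Unification xrefs exactly as listed above.
xrefs = {
    "biocyc:GLC": {"PUBCHEM:64689", "LIGAND-COMPOUND:C00031", "CAS:50-99-7"},
    "biocyc:ALPHA-GLUCOSE": set(),  # no external links
    "reactome:alpha-D-Glucose": {
        "ChEBI:17925", "LIGAND-COMPOUND:C00267", "PUBCHEM:8143952",
    },
}


def candidate_matches(xrefs):
    """Return {(entry_a, entry_b): shared_xrefs} for entries sharing an xref."""
    by_xref = defaultdict(set)
    for entry, refs in xrefs.items():
        for ref in refs:
            by_xref[ref].add(entry)

    pairs = defaultdict(set)
    for ref, entries in by_xref.items():
        ordered = sorted(entries)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)].add(ref)
    return dict(pairs)


# Empty for this data: none of the three entries share an identifier.
matches = candidate_matches(xrefs)
```

Any pair this returns is still only a hypothesis, for the reason Alan gave about generic KEGG identifiers.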

2. unify by chemical structure. If both compounds have a chemical
structure (whether InChI, SMILES, Molfile, CML... etc) then they can be
compared and unified.

The SMILES structure of biocyc:ALPHA-GLUCOSE is: C1(C(O)C(O)C(O)C(O1)CO)(O)

The SMILES structure of biocyc:GLC is: C(O)C1(C(O)C(O)C(O)C(O1)O)

The SMILES structure of reactome:alpha-D-Glucose does not exist. However,
the SMILES structure of its link to ChEBI is
C[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@@H]1O
The canonical SMILES structure of its link to PUBCHEM is:
C(C1C(C(C(C(O1)O)O)O)O)O

The isomeric SMILES structure of its link to PUBCHEM is:
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O

So, the SMILES string of biocyc:alpha-D-Glucose does not match any
structure in ChEBI or Pubchem. In fact, the SMILES structure of
alpha-D-Glucose in ChEBI does not match the SMILES structure of
alpha-D-Glucose in PubChem.
LIGAND does not use SMILES, but they do use Molfiles.

ChEBI also uses InChI's. This is the ChEBI InChI for alpha-D-Glucose:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1

PubChem also uses InChI's. This is the PubChem InChI for alpha-D-Glucose:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1

So it looks like we have a match between ChEBI:alpha-D-Glucose and
PubChem:alpha-D-Glucose.

When I converted the SMILES for biocyc:alpha-D-Glucose using
http://inchi.info/converter_en.html
I got the following InChI:

InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2

Note that this is *not* identical to the ChEBI and PUBCHEM compounds, as
it does not include the stereochemical layer:
http://wwmm.ch.cam.ac.uk/inchifaq/#Specifically,%20what%20are%20InChI%20layers?

However, because InChI's are layered, one could assign CHEBI and PUBCHEM
compounds as a subclass of the (underspecified) BioCyc compound.

Similarly, when converting the Molfile for LIGAND-COMPOUND:alpha-D-Glucose
I got the following InChI:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2

Note that this unifies the KEGG and BioCyc versions of alpha-D-Glucose.

So far, InChI's seem to be the best bet for unifying database compounds.
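
The layer comparison above can be sketched in Python. The InChI strings are the ones quoted in this message; treating "one string's layers are a leading run of the other's" as a subclass relation is a simplifying assumption for illustration, not a rule from the InChI specification.

```python
def layers(inchi):
    """Split an InChI into its '/'-separated layers, dropping 'InChI='."""
    return inchi.split("=", 1)[1].split("/")


def relate(a, b):
    """Compare two InChIs layer by layer, from the point of view of a."""
    la, lb = layers(a), layers(b)
    if la == lb:
        return "equivalent"
    if la[:len(lb)] == lb:
        return "subclass"    # a carries extra layers, so it is more specific
    if lb[:len(la)] == la:
        return "superclass"  # a is the underspecified one
    return "unrelated"


CHEBI = ("InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/"
         "h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1")
PUBCHEM = CHEBI  # the two strings quoted above are identical
BIOCYC = "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
KEGG = BIOCYC    # the Molfile conversion gave the same string

results = {
    "ChEBI vs PubChem": relate(CHEBI, PUBCHEM),
    "ChEBI vs BioCyc": relate(CHEBI, BIOCYC),
    "BioCyc vs KEGG": relate(BIOCYC, KEGG),
}
```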

3. unify by reaction participants. If two compounds are used in the
same context (i.e. they are both substrates of the same reaction), that
can give a hint.

None of the reactions for biocyc:alpha-D-Glucose match any of the
reactions in reactome:alpha-D-Glucose.

4. unify by taxonomy. Both Reactome and BioCyc have compound taxonomies.
By aligning the taxonomies, one can group compounds as subclasses of
the same class.

In the biocyc ontology, the relevant compound hierarchy is as follows.
(We will ignore the fact that L-Glucose is an instance and D-Glucose is a
class. When we map the compounds to BioPAX, everything becomes a class)

Hexoses
+L-Hexoses
++L-Glucose
+D-Hexoses
++D-Glucose
+++alpha-D-Glucose
+++beta-D-Glucose

For Reactome, there does not appear to be a compound hierarchy explicitly
represented, although there is a cryptic reference to the fact that
Reactome has classified alpha-D-glucose as a hexose:
Is represented by generalisation(s):
hexoses transported by GLUT2 [cytosol]
hexoses transported by SGLT1 [cytosol].

As a result, by unifying Reactome:Hexose with BioCyc:Hexoses, all the
subclasses are pooled together.
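
A rough Python sketch of that pooling, using the BioCyc hierarchy listed above. The prefixes and the decision to treat the two hexose classes as one are assumptions of the example.

```python
# BioCyc hierarchy as listed above (parent -> children).
biocyc_children = {
    "Hexoses": ["L-Hexoses", "D-Hexoses"],
    "L-Hexoses": ["L-Glucose"],
    "D-Hexoses": ["D-Glucose"],
    "D-Glucose": ["alpha-D-Glucose", "beta-D-Glucose"],
}


def descendants(children, root):
    """Collect every class below root, depth first."""
    found = []
    for child in children.get(root, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found


# Reactome only tells us that alpha-D-glucose is generalised as a hexose.
reactome_hexoses = ["reactome:alpha-D-Glucose"]

# Assumption: Reactome:Hexose and BioCyc:Hexoses are declared the same class.
pooled = (
    ["biocyc:" + name for name in descendants(biocyc_children, "Hexoses")]
    + reactome_hexoses
)
```

Pooling only narrows the search: it puts reactome:alpha-D-Glucose in the same bucket as the BioCyc glucoses without saying which of them, if any, it equals.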

5. IUPAC matching: synonyms are still the way that humans match
compounds. it is possible to take an IUPAC name and uniquely identify
the structure, but not the other way around.

http://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry_nomenclature


6. Group by chemical formula. If all you have is a chemical formula, at
least you can group them into the same class of isomers.

One could imagine creating a compound hierarchy based on InChI levels.

it would be very easy to classify compounds according to this hierarchy in
an automated way.
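
As a sketch of what such an automated classification might look like, the Python below nests compounds in a tree keyed by successive InChI layers, formula first. The InChI strings are the ones from this message; the tree layout and function names are assumptions made for illustration.

```python
def inchi_layers(inchi):
    """Return the layers after the version number, formula first."""
    return inchi.split("=", 1)[1].split("/")[1:]


def build_hierarchy(compounds):
    """Nest compounds in a tree keyed by successive InChI layers."""
    root = {"members": [], "children": {}}
    for name, inchi in compounds.items():
        node = root
        for layer in inchi_layers(inchi):
            node = node["children"].setdefault(
                layer, {"members": [], "children": {}}
            )
        node["members"].append(name)
    return root


FULL = ("InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/"
        "h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1")
NO_STEREO = "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"

compounds = {
    "ChEBI:17925": FULL,
    "PUBCHEM:8143952": FULL,
    "biocyc:ALPHA-GLUCOSE": NO_STEREO,
    "LIGAND-COMPOUND:C00267": NO_STEREO,
}

tree = build_hierarchy(compounds)
```

Everything under the "C6H12O6" node is an isomer class; the BioCyc and KEGG entries sit at the hydrogen layer, and the ChEBI and PubChem entries sit three levels further down, below the stereochemical layers.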

Anyway, this was a long-winded way of saying what Alan said.

There is no easy way to unify compounds from different databases. Glucose
happens to be a particularly ugly example, but there are plenty of others,
I assure you.

Sincerely,

Jeremy

> The instance proposal i laid out uses the following:
> 0) use an ontology of basic types of reactions (similar to BioPAX,
> which don't identify specific participants), instantiate ontologies of
> reactions such as the GO process or EC
> 1) use owl:sameAs between database individuals and their semantic
> equivalents (i.e. a uniprot protein)
> 2) assert data as instances of their respective classes (generally
> from OBO ontologies)
> 3) normalize relations: use basic (i.e. RO-compatible) relations, and

zuc...@research.dfci.harvard.edu

Oct 12, 2007, 4:22:01 AM
to BioPAX...@googlegroups.com, michel.d...@gmail.com
I sent off the last email before finishing the analysis:

5. IUPAC matching: synonyms are still the way that humans match
compounds. it is possible to take an IUPAC name and uniquely identify the
structure, but not the other way around.

http://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_Chemistry_nomenclature

For example, in ChEBI, the IUPAC name for Alpha-D-Glucose is:
α-D-glucopyranose

However, in PubChem, the IUPAC name for alpha-D-Glucose is:
(2S,3R,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol

It turns out that it may be possible to automatically convert IUPAC
nomenclature into structures.

If you haven't checked out Richard Apodaca's website, this url should give
you a taste:
http://depth-first.com/articles/tag/iupac


Joanne Luciano

Oct 12, 2007, 7:22:34 AM
to BioPAX...@googlegroups.com
Are you in town now?

On Oct 11, 2007, at 7:21 PM, Michel Dumontier wrote:


Joanne Luciano, PhD
Predictive Medicine, Inc.
45 Orchard Street
Belmont MA 02478-3008



Michel Dumontier

Oct 12, 2007, 9:40:37 AM
to BioPAX...@googlegroups.com
Alan, Jeremy

Thanks! What a great analysis! Well substantiated.

:-)

So, is it worthwhile to scan through the pathway/reaction databases and enumerate which identifiers are being used?

-=Michel=-
