Hey,

Thanks for the great discussion, guys. I've been digesting the class proposal (all reactions are classes with restrictions on their participants, and if these are asserted as an equivalentClass, then we can classify the reactions from different databases) over the last hour and a half, and I definitely see merit in pursuing that. I think the big challenge is whether you can find the mappings between reactome:Glucose and biocyc:Glucose. You might be able to do the syntactic match by assuming they refer to the same thing (which may be reasonable, but there can be no semantic guarantee), but what will you do in cases where you don't have syntactic matches? I understand that the "debugging the bug" work is relevant here; perhaps you can summarize?
The instance proposal I laid out uses the following:
0) use an ontology of basic types of reactions (similar to BioPAX, which don't identify specific participants), instantiate ontologies of reactions such as the GO process or EC
1) use owl:sameAs between database individuals and their semantic equivalents (i.e. a uniprot protein)
2) assert data as instances of their respective classes (generally from OBO ontologies)
3) normalize relations: use basic (i.e. RO-compatible) relations, and create classes out of BioPAX datatype properties.
4) use situational objects in which relevant information (location, qualities, etc.) is associated with processes. These situational objects have real molecules as their parts.
ok, that's it for now...
-=Michel=-
--
Michel Dumontier
Assistant Professor of Bioinformatics
http://dumontierlab.com
On Oct 11, 2007, at 7:21 PM, Michel Dumontier wrote:
> I think the big challenge is whether you can find the mappings
> between reactome:Glucose and biocyc:Glucose - you might be able to
> do the syntactic match by assuming they refer to the same thing
> (which may be reasonable, but there can be no semantic guarantee),
> but what will you do in cases where you don't have syntactic matches?
I agree that this is a challenge. (Glucose is a particularly nasty case, as it happens)
In cases where both database providers have provided what are intended as unification xrefs to a third database, and there is an entity in each that is identified using the same third-party identifier, we can hypothesize that they are the same. This may turn out to be correct or incorrect. A common case of being incorrect, we found (exposed by the reasoner), was the use of KEGG ids that were, in effect, classes of compounds. In that case it wasn't uncommon for the provider to mistakenly identify a specific compound with the generic KEGG identifier.
In DTB, we found the generics in KEGG and treated them differently. We found them with a combination of a script and then manual review of the results.
In other cases, a similar strategy may be needed - script to identify candidates, and then review to sanction their use. Where no exact match can be made, relations between "compounds" can be made, such as using subclass relations between KEGG generics, and more specific compounds.
I claim there is no representational way around this. However, if we construct other parts of the representation well, when such a mapping between compounds is made, other relationships, such as equivalence between reactions, can be inferred - those are cases where we have not had to curate the mapping (though we may want to review them to ensure that the database provider has not made a mistake. They do make mistakes, you know)
We can also write systems with enough statements that an erroneous mapping will trigger an inconsistency, which can then be manually reviewed and corrected. Our experience with DTB showed that even simple constraints can uncover many mistakes. More aggressively assigning mappings, such as syntactically, and then collecting mismappings uncovered by inconsistency, can also be of use. For instance, where a curator might be presented with a choice of 5 possible meanings for a term, these mismappings might eliminate 3 cases...
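To make the "erroneous mapping triggers an inconsistency" idea concrete, here is a minimal sketch in Python. It is only a toy stand-in for what an OWL reasoner does with owl:sameAs and owl:differentFrom assertions, not the DTB implementation. The identifiers (`dbA:alpha-D-glucose`, `dbB:beta-D-glucose`, `kegg:generic-glucose`) and the function names are hypothetical, chosen to mirror the generic-KEGG-id mistake described above.

```python
def find(parent, x):
    """Return the representative of x's equivalence class (union-find with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def inconsistent_pairs(same_as, different_from):
    """Merge all sameAs pairs, then report every differentFrom pair that got merged."""
    parent = {}
    for a, b in same_as:
        parent[find(parent, a)] = find(parent, b)
    return [(a, b) for a, b in different_from if find(parent, a) == find(parent, b)]


# Hypothetical mappings: two providers each map a specific compound
# to the same generic KEGG identifier.
same_as = [
    ("dbA:alpha-D-glucose", "kegg:generic-glucose"),
    ("dbB:beta-D-glucose", "kegg:generic-glucose"),
]
# A simple constraint: the two anomers are known to be distinct.
different_from = [("dbA:alpha-D-glucose", "dbB:beta-D-glucose")]

conflicts = inconsistent_pairs(same_as, different_from)
print(conflicts)  # the flagged pair goes to a curator for manual review
```

The point of the sketch is that neither mapping looks wrong on its own; the mistake only surfaces once both mappings and the constraint are combined.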
What we can look forward to, and specify, is how to write and publish, in OWL, such mappings and mismappings, so that the manual labor of a few can be shared with the many.
More specific examples from DTB, as I have time...
> Thanks for the great discussion guys - I've been digesting the class
> proposal (all reactions are classes with restrictions on their
> participants - and if these are asserted as an equivalentClass, then
> we can classify the reactions from different databases) over the last
> hour and half, and i definitely see merit in pursuing that. I think
> the big challenge is whether you can find the mappings between
> reactome:Glucose and biocyc:Glucose - you might be able to do the
> syntactic match by assuming they refer to the same thing (which may be
> reasonable, but there can be no semantic guarantee), but what will you
> do in cases where you don't have syntactic matches? I understand
> the "debugging the bug" work is relevant here, perhaps you can
> summarize?
Sure. Glucose is a good example.
In the representation suggested by Alan, we should consider all molecules as classes that are represented to some degree of specification.
There are several choices for trying to reconcile compounds in different databases.
In Reactome, 1. unify by Xref. If both compounds have the same KEGG ID, CHEBI ID, CAS ID or Pubchem ID, then they belong in the class of compounds with the same Xref.
In Biocyc, beta-D-Glucose (Frame ID: GLC) has links to PUBCHEM:64689, LIGAND-COMPOUND:C00031, and CAS:50-99-7
Amazingly, Reactome does not appear to contain an entry for beta-D-Glucose. It does contain an entry for alpha-D-Glucose, and it has links to ChEBI:17925 LIGAND-COMPOUND:C00267, and PUBCHEM:8143952
BioCyc also has an entry for alpha-D-Glucose (Frame ID: ALPHA-GLUCOSE), but unfortunately, it does not contain any links to external databases.
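As a rough illustration of unification by Xref, the sketch below indexes entries by their external references and reports cross-database pairs that share one. The xref values are the ones quoted in this thread; the dictionary layout and the function name `candidate_matches` are assumptions made for the example.

```python
from collections import defaultdict

# Xrefs as reported above; the data layout itself is hypothetical.
XREFS = {
    "biocyc:GLC": {"PUBCHEM:64689", "LIGAND-COMPOUND:C00031", "CAS:50-99-7"},
    "biocyc:ALPHA-GLUCOSE": set(),  # no links to external databases
    "reactome:alpha-D-Glucose": {
        "ChEBI:17925",
        "LIGAND-COMPOUND:C00267",
        "PUBCHEM:8143952",
    },
}


def candidate_matches(xrefs):
    """Return sorted (entry_a, entry_b, shared_xref) triples across different databases."""
    by_xref = defaultdict(set)
    for entry, refs in xrefs.items():
        for ref in refs:
            by_xref[ref].add(entry)
    pairs = set()
    for ref, entries in by_xref.items():
        for a in entries:
            for b in entries:
                # a < b avoids duplicates; the prefix check skips same-database pairs
                if a < b and a.split(":")[0] != b.split(":")[0]:
                    pairs.add((a, b, ref))
    return sorted(pairs)


matches = candidate_matches(XREFS)
print(matches)  # → []
```

With the three glucose entries above, no xref is shared, so the list comes back empty. Any pair it does return is still only a hypothesis to be reviewed, as Alan noted for generic KEGG identifiers.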
2. Unify by chemical structure. If both compounds have a chemical structure (whether InChI, SMILES, Molfile, CML, etc.) then the structures can be compared and the compounds unified.
The SMILES structure of biocyc:ALPHA-GLUCOSE is: C1(C(O)C(O)C(O)C(O1)CO)(O)
The SMILES structure of biocyc:GLC is: C(O)C1(C(O)C(O)C(O)C(O1)O)
The SMILES structure of reactome:alpha-D-Glucose does not exist. However, the SMILES structure of its link to ChEBI is:
C[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@@H]1O
The canonical SMILES structure of its link to PUBCHEM is:
C(C1C(C(C(C(O1)O)O)O)O)O
The isomeric SMILES structure of its link to PUBCHEM is:
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
So, the SMILES string of biocyc:alpha-D-Glucose does not match any structure in ChEBI or Pubchem. In fact, the SMILES structure of alpha-D-Glucose in ChEBI does not match the SMILES structure of alpha-D-Glucose in PubChem. LIGAND does not use SMILES, but they do use Molfiles.
ChEBI also uses InChI's. This is the ChEBI InChI for alpha-D-Glucose:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
PubChem also uses InChI's. This is the PubChem InChI for alpha-D-Glucose:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
So it looks like we have a match between ChEBI:alpha-D-Glucose and PubChem:alpha-D-Glucose.
However, because InChI's are layered, one could assign CHEBI and PUBCHEM compounds as a subclass of the (underspecified) BioCyc compound.
Similarly, when converting the Molfile for LIGAND-COMPOUND:alpha-D-Glucose I got the following InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2
Note that this unifies the KEGG and BioCyc versions of alpha-D-Glucose.
So far, InChI's seem to be the best bet for unifying database compounds.
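The layered comparison can be sketched in a few lines of Python. This is a simplification under stated assumptions: it treats one InChI as an underspecified form of another when its layers are a strict prefix of the other's, which works for the two glucose strings quoted here but is not a general InChI comparison algorithm. The function names are hypothetical, and the ChEBI string is written with `/m1/s1` joined, assuming the space in the quoted version is a line-wrap artifact.

```python
def inchi_layers(inchi):
    """Split an InChI string into its version prefix and its list of layers."""
    body = inchi.split("=", 1)[1]
    parts = body.split("/")
    return parts[0], parts[1:]


def is_underspecified_form_of(general, specific):
    """True if `general` has strictly fewer layers and they prefix those of `specific`."""
    g_version, g_layers = inchi_layers(general)
    s_version, s_layers = inchi_layers(specific)
    return (
        g_version == s_version
        and len(g_layers) < len(s_layers)
        and s_layers[: len(g_layers)] == g_layers
    )


# InChI obtained from the LIGAND-COMPOUND Molfile (no stereo layers).
KEGG_DERIVED = "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
# ChEBI InChI for alpha-D-Glucose (with stereo layers).
CHEBI = (
    "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
    "/t2-,3-,4+,5-,6+/m1/s1"
)

print(is_underspecified_form_of(KEGG_DERIVED, CHEBI))  # → True
print(is_underspecified_form_of(CHEBI, KEGG_DERIVED))  # → False
```

A `True` result here is what would justify asserting the fully specified ChEBI or PubChem compound as a subclass of the underspecified one.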
3. Unify by reaction participants. If two compounds are used in the same context (i.e. they are both substrates of the same reaction), that can give a hint.
None of the reactions for biocyc:alpha-D-Glucose match any of the reactions in reactome:alpha-D-Glucose.
4. Unify by taxonomy. Both Reactome and BioCyc have compound taxonomies. By aligning the taxonomies, one can group compounds as subclasses of the same class.
In the biocyc ontology, the relevant compound hierarchy is as follows. (We will ignore the fact that L-Glucose is an instance and D-Glucose is a class. When we map the compounds to BioPAX, everything becomes a class)
For Reactome, there does not appear to be a compound hierarchy explicitly represented, although there is a cryptic reference to the fact that Reactome has classified alpha-D-glucose as a hexose: Is represented by generalisation(s): hexoses transported by GLUT2 [cytosol] hexoses transported by SGLT1 [cytosol].
As a result, by unifying Reactome:Hexose with BioCyc:Hexoses, all the subclasses are pooled together.
5. IUPAC matching: synonyms are still the way that humans match compounds. It is possible to take an IUPAC name and uniquely identify the structure, but not the other way around.
6. Group by chemical formula. If all you have is a chemical formula, at least you can group them into the same class of isomers.
One could imagine creating a compound hierarchy based on InChI levels.
It would be very easy to classify compounds according to this hierarchy in an automated way.
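A minimal sketch of such an automated classification, assuming just three levels (chemical formula, then connectivity plus hydrogens, then the full InChI string). The level choice, the function names and the compound labels are assumptions for illustration; the two InChI strings are the ones quoted earlier in this message, with `/m1/s1` written without the stray space.

```python
from collections import defaultdict


def level_keys(inchi):
    """Return (formula, skeleton, full) keys, each more specific than the last."""
    layers = inchi.split("=", 1)[1].split("/")[1:]
    formula = layers[0]
    # Skeleton = formula plus the connectivity (c) and hydrogen (h) layers.
    skeleton = "/".join([formula] + [l for l in layers[1:] if l[:1] in ("c", "h")])
    return formula, skeleton, "/".join(layers)


def classify(compounds):
    """Build a formula -> skeleton -> full InChI -> [names] hierarchy."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for name, inchi in compounds.items():
        formula, skeleton, full = level_keys(inchi)
        tree[formula][skeleton][full].append(name)
    return tree


# Labels are hypothetical; the InChI strings come from this thread.
COMPOUNDS = {
    "kegg:C00267 (from Molfile)": "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2",
    "chebi:17925": (
        "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
        "/t2-,3-,4+,5-,6+/m1/s1"
    ),
}

tree = classify(COMPOUNDS)
for formula, skeletons in tree.items():
    print(formula)
    for skeleton, fulls in skeletons.items():
        print("  " + skeleton)
        for full, names in fulls.items():
            print("    " + ", ".join(names))
```

The top level alone already gives the "same class of isomers" grouping from item 6; each further level narrows the class.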
Anyway, this was a long-winded way of saying what Alan said.
There is no easy way to unify compounds from different databases. Glucose happens to be a particularly ugly example, but there are plenty of others, I assure you.
I sent off the last email before finishing the analysis:
5. IUPAC matching: synonyms are still the way that humans match compounds. It is possible to take an IUPAC name and uniquely identify the structure, but not the other way around.
Joanne Luciano, PhD Predictive Medicine, Inc. 45 Orchard Street Belmont MA 02478-3008 Email: jluci...@predmed.com
> Sincerely,
> Jeremy