meeting today
Michel Dumontier  
From: "Michel Dumontier" <michel.dumont...@gmail.com>
Date: Thu, 11 Oct 2007 19:21:59 -0400
Local: Thurs, Oct 11 2007 7:21 pm
Subject: meeting today

Hey,
  Thanks for the great discussion guys - I've been digesting the class
proposal (all reactions are classes with restrictions on their participants
- and if these are asserted as an equivalentClass, then we can classify the
reactions from different databases) over the last hour and a half, and I
definitely see merit in pursuing that. I think the big challenge is whether
you can find the mappings between reactome:Glucose and biocyc:Glucose - you
might be able to do the syntactic match by assuming they refer to the same
thing (which may be reasonable, but there can be no semantic guarantee), but
what will you do in cases where you don't have syntactic matches? I
understand that the "debugging the bug" work is relevant here; perhaps you
can summarize?

The instance proposal I laid out uses the following:
0) use an ontology of basic types of reactions (similar to BioPAX, which
doesn't identify specific participants), instantiate ontologies of reactions
such as the GO process or EC
1) use owl:sameAs between database individuals and their semantic
equivalents (i.e. a uniprot protein)
2) assert data as instances of their respective classes (generally from OBO
ontologies)
3) normalize relations: use basic (i.e. RO-compatible) relations, and create
classes out of BioPAX datatype properties.
4) use situational objects in which relevant information (location,
qualities, etc.) are associated with processes. These situational objects
have real molecules as their parts.

ok, that's it for now...

-=Michel=-
--
Michel Dumontier
Assistant Professor of Bioinformatics
http://dumontierlab.com


Alan Ruttenberg  
From: Alan Ruttenberg <alanruttenb...@gmail.com>
Date: Fri, 12 Oct 2007 02:44:35 -0400
Local: Fri, Oct 12 2007 2:44 am
Subject: Re: meeting today
On Oct 11, 2007, at 7:21 PM, Michel Dumontier wrote:

> I think the big challenge is whether you can find the mappings  
> between reactome:Glucose and biocyc:Glucose - you might be able to  
> do the syntactic match by assuming they refer to the same thing  
> (which may be reasonable, but there can be no semantic guarantee),  
> but what will you do in cases where you don't have syntactic matches?

I agree that this is a challenge. (Glucose is a particularly nasty  
case, as it happens)

In cases where both database providers have provided what are  
intended as unification xrefs to a third database, and there is an  
entity in each that identifies using the same third party, we can  
hypothesize that they are the same. This may turn out to be correct  
or incorrect. A common case of being incorrect, we found (exposed by  
the reasoner), was in the case of using KEGG ids that were, in  
effect, classes of compounds. In that case it wasn't uncommon for the  
provider to mistakenly identify a specific compound with the generic  
KEGG identifier.

In DTB, we found the generics in KEGG and treated them differently.  
We found them with the combination of script and then manual review  
of the results.
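
As a rough illustration of the xref-based hypothesis and the separate treatment of generics described above, here is a minimal Python sketch. It assumes compounds are held as plain dicts of unification xrefs and that the set of generic identifiers has already been curated by hand. The function name, record layout and all identifiers are hypothetical placeholders, not real database entries.

```python
from itertools import product

def hypothesize_same(db_a, db_b, generic_ids):
    """Pair compounds from two databases that share a unification xref.

    Pairs whose shared xrefs are all known generics go to a review list
    instead of being hypothesized as the same compound.
    """
    same, review = [], []
    for (name_a, xrefs_a), (name_b, xrefs_b) in product(db_a.items(), db_b.items()):
        shared = set(xrefs_a) & set(xrefs_b)
        if not shared:
            continue
        if shared <= generic_ids:
            review.append((name_a, name_b, sorted(shared)))
        else:
            same.append((name_a, name_b, sorted(shared)))
    return same, review

# Hypothetical records; "KEGG:G1" stands in for a generic (class-like) id.
db_a = {"a:compound1": ["KEGG:X1"], "a:compound2": ["KEGG:G1"]}
db_b = {"b:compound1": ["KEGG:X1"], "b:compound2": ["KEGG:G1"]}
same, review = hypothesize_same(db_a, db_b, generic_ids={"KEGG:G1"})
```

Here `same` holds the pair that shares a specific identifier, while `review` holds the pair that matched only through the generic one.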

In other cases, a similar strategy may be needed - script to identify  
candidates, and then review to sanction their use. Where no exact  
match can be made, relations between "compounds" can be made, such as  
using subclass relations between KEGG generics, and more specific  
compounds.

I claim there is no representational way around this. However, if we  
construct other parts of the representation well, when such a mapping  
between compounds is made, other relationships, such as equivalence  
between reactions, can be inferred - those are cases where we have  
not had to curate the mapping (though we may want to review them to  
ensure that the database provider has not made a mistake. They do  
make mistakes, you know)

We can also write systems with enough statements that an erroneous  
mapping will trigger an inconsistency, which can then be manually  
reviewed and corrected. Our experience with DBD showed that even  
simple constraints can uncover many mistakes.

More aggressively assigning mappings, such as syntactically, and then  
collecting mismappings uncovered by inconsistency, can also be of  
use. For instance, where a curator might be presented with a choice  
of 5 possible meanings for a term, these mismappings might eliminate  
3 cases...
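
The idea of letting inconsistencies eliminate candidate mappings can be sketched as follows. This is only a toy stand-in for what an OWL reasoner would do: it assumes each entity has one asserted type and that disjointness between types is given as a set of pairs. The type names, entity names and the `prune_candidates` function are hypothetical.

```python
def prune_candidates(term, candidates, type_of, disjoint):
    """Split candidate mappings for `term` into consistent and inconsistent ones.

    A candidate is inconsistent when its asserted type is declared disjoint
    with the type of `term`; equating the two would make the ontology
    inconsistent, which is what a reasoner would report.
    """
    kept, rejected = [], []
    for cand in candidates:
        pair = frozenset((type_of[term], type_of[cand]))
        if pair in disjoint:
            rejected.append(cand)
        else:
            kept.append(cand)
    return kept, rejected

# Hypothetical data: one term with five candidate meanings.
type_of = {
    "dbA:term": "SmallMolecule",
    "dbB:m1": "SmallMolecule",
    "dbB:m2": "SmallMolecule",
    "dbB:m3": "Protein",
    "dbB:m4": "Protein",
    "dbB:m5": "Complex",
}
disjoint = {
    frozenset(("SmallMolecule", "Protein")),
    frozenset(("SmallMolecule", "Complex")),
}
kept, rejected = prune_candidates(
    "dbA:term", ["dbB:m1", "dbB:m2", "dbB:m3", "dbB:m4", "dbB:m5"], type_of, disjoint
)
```

In this made-up case three of the five candidates are rejected, leaving the curator two to choose from.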

What we can look forward to, and specify, is how to write and  
publish, in OWL, such mappings and mismappings, so that the manual  
labor of a few can be shared with the many.

More specific examples from DBD, as I have time...

-Alan


zuc...@research.dfci.harvard.edu  
From: zuc...@research.dfci.harvard.edu
Date: Fri, 12 Oct 2007 03:59:12 -0400 (EDT)
Local: Fri, Oct 12 2007 3:59 am
Subject: Re: meeting today
Hey Michel,

>   Thanks for the great discussion guys - I've been digesting the class
> proposal (all reactions are classes with restrictions on their
> participants - and if these are asserted as an equivalentClass, then
> we can classify the reactions from different databases) over the last
> hour and a half, and I definitely see merit in pursuing that. I think
> the big challenge is whether you can find the mappings between
> reactome:Glucose and biocyc:Glucose - you might be able to do the
> syntactic match by assuming they refer to the same thing (which may be
> reasonable, but there can be no semantic guarantee), but what will you
> do in cases where you don't have syntactic matches? I understand that
> the "debugging the bug" work is relevant here; perhaps you can
> summarize?

Sure. Glucose is a good example.

In the representation suggested by Alan, we should consider all
molecules as classes that are represented to some degree of
specification.

There are several choices for trying to reconcile compounds in different
databases.

1. unify by Xref.  If both compounds have the same KEGG ID, ChEBI ID, CAS
ID or PubChem ID, then they belong in the class of compounds with the same
Xref.

In Biocyc, beta-D-Glucose (Frame ID: GLC) has links to
PUBCHEM:64689, LIGAND-COMPOUND:C00031, and CAS:50-99-7

Amazingly, Reactome does not appear to contain an entry for beta-D-Glucose.
It does contain an entry for alpha-D-Glucose, and it has links to
ChEBI:17925 LIGAND-COMPOUND:C00267, and PUBCHEM:8143952

BioCyc also has an entry for alpha-D-Glucose (Frame ID: ALPHA-GLUCOSE),
but unfortunately, it does not contain any links to external databases.
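
The xref comparison above can be reproduced with a small Python sketch. The xref lists are the ones quoted in this message; the helper names and the "DATABASE:ID" string convention are assumptions made for illustration.

```python
def by_database(xrefs):
    """Map upper-cased database name -> identifier for 'DB:ID' strings."""
    pairs = (x.split(":", 1) for x in xrefs)
    return {db.upper(): ident for db, ident in pairs}

def comparable_databases(a, b):
    """Databases that both compounds link to (a match is at least possible)."""
    return sorted(by_database(a).keys() & by_database(b).keys())

def shared_xrefs(a, b):
    """Databases in which both compounds carry the same identifier."""
    a, b = by_database(a), by_database(b)
    return {db: a[db] for db in a.keys() & b.keys() if a[db] == b[db]}

# Xrefs as listed above.
biocyc_glc = ["PUBCHEM:64689", "LIGAND-COMPOUND:C00031", "CAS:50-99-7"]
reactome_alpha = ["ChEBI:17925", "LIGAND-COMPOUND:C00267", "PUBCHEM:8143952"]
biocyc_alpha = []  # no external links

glc_vs_alpha = shared_xrefs(biocyc_glc, reactome_alpha)
alpha_vs_alpha = comparable_databases(biocyc_alpha, reactome_alpha)
```

`glc_vs_alpha` is empty, as expected for two different anomers, and `alpha_vs_alpha` is empty because BioCyc's alpha-D-Glucose has nothing to compare, so unification by Xref cannot settle either case.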

2. unify by chemical structure.  If both compounds have a chemical
structure (whether InChI, SMILES, Molfile, CML, etc.), then they can be
compared and unified.

The SMILES structure of biocyc:ALPHA-GLUCOSE is: C1(C(O)C(O)C(O)C(O1)CO)(O)

The SMILES structure of biocyc:GLC is: C(O)C1(C(O)C(O)C(O)C(O1)O)

The SMILES structure of reactome:alpha-D-Glucose does not exist. However,
the SMILES structure of its link to ChEBI is
C[C@H]1O[C@H](O)[C@H](O)[C@@H](O)[C@@H]1O
The canonical SMILES structure of its link to PUBCHEM is:
C(C1C(C(C(C(O1)O)O)O)O)O

The isomeric SMILES structure of its link to PUBCHEM is:
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O

So, the SMILES string of biocyc:alpha-D-Glucose does not match any
structure in ChEBI or Pubchem.  In fact, the SMILES structure of
alpha-D-Glucose in ChEBI does not match the SMILES structure of
alpha-D-Glucose in PubChem.
LIGAND does not use SMILES, but they do use Molfiles.

ChEBI also uses InChI's.  This is the ChEBI InChI for alpha-D-Glucose:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1

PubChem also uses InChI's. This is the PubChem InChI for alpha-D-Glucose:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1

So it looks like we have a match between ChEBI:alpha-D-Glucose and
PubChem:alpha-D-Glucose.

When I converted the SMILES for biocyc:alpha-D-Glucose using
http://inchi.info/converter_en.html
I got the following InChI:

InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2

Note that this is *not* identical to the ChEBI and PUBCHEM compounds, as
it does not include the stereochemical layer:
http://wwmm.ch.cam.ac.uk/inchifaq/#Specifically,%20what%20are%20InChI...

However, because InChI's are layered, one could assign CHEBI and PUBCHEM
compounds as a subclass of the (underspecified) BioCyc compound.

Similarly, when converting the Molfile for LIGAND-COMPOUND:alpha-D-Glucose
I got the following InChI:
InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2

Note that this unifies the KEGG and BioCyc versions of alpha-D-Glucose.

So far, InChI's seem to be the best bet for unifying database compounds.
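
The layered comparison described above can be sketched in Python. This is a simplification: it treats an InChI as a '/'-separated list of layers and calls one string an underspecified version of another when its layers are a strict prefix of the other's. Real InChI layer semantics are more subtle, and the function names are hypothetical. The strings are the ones quoted in this message, with line-wrap whitespace removed.

```python
def inchi_layers(inchi):
    """Split an InChI string into '/'-separated layers, dropping 'InChI='."""
    return inchi.split("=", 1)[1].split("/")

def is_underspecified_version(general, specific):
    """True if `general` has fewer layers and agrees with `specific` on all of them."""
    g, s = inchi_layers(general), inchi_layers(specific)
    return len(g) < len(s) and s[:len(g)] == g

chebi = ("InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
         "/t2-,3-,4+,5-,6+/m1/s1")
pubchem = chebi  # identical string in the message above
biocyc = "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
kegg = biocyc    # conversion of the KEGG Molfile gave the same string

exact_match = chebi == pubchem
chebi_under_biocyc = is_underspecified_version(biocyc, chebi)
```

Under this prefix rule the ChEBI and PubChem entries match exactly, and both could be placed as subclasses of the stereo-less BioCyc/KEGG entry.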

3. unify by reaction participants.  If two compounds are used in the
same context (i.e. they are both substrates of the same reaction), that
can give a hint.

None of the reactions for biocyc:alpha-D-Glucose match any of the
reactions in reactome:alpha-D-Glucose.

4. unify by taxonomy.  Both Reactome and BioCyc have compound taxonomies.
 By aligning the taxonomies, one can group compounds as subclasses of
the same class.

 In the biocyc ontology, the relevant compound hierarchy is as follows.
(We will ignore the fact that L-Glucose is an instance and D-Glucose is a
class.  When we map the compounds to BioPAX, everything becomes a class)

Hexoses
+L-Hexoses
++L-Glucose
+D-Hexoses
++D-Glucose
+++alpha-D-Glucose
+++beta-D-Glucose

For Reactome, there does not appear to be a compound hierarchy explicitly
represented, although there is a cryptic reference to the fact that
Reactome has classified alpha-D-glucose as a hexose:
Is represented by generalisation(s):
    hexoses transported by GLUT2 [cytosol]
    hexoses transported by SGLT1 [cytosol].

As a result, by unifying Reactome:Hexose with BioCyc:Hexoses, all the
subclasses are pooled together.
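
A minimal Python sketch of pooling subclasses by aligning the two taxonomies. The BioCyc hierarchy is the one shown above; the single Reactome parent link and the name reactome:Hexose are simplifying assumptions standing in for the "hexoses transported by ..." generalisations, and the function names are hypothetical.

```python
def ancestors(cls, parent, equivalent):
    """Walk child -> parent links, replacing classes by their unified equivalent."""
    chain = []
    cls = equivalent.get(cls, cls)
    while cls in parent:
        cls = equivalent.get(parent[cls], parent[cls])
        chain.append(cls)
    return chain

def pooled_subclasses(root, parent, equivalent):
    """All classes that end up below `root` once equivalences are applied."""
    root = equivalent.get(root, root)
    return sorted(c for c in parent if root in ancestors(c, parent, equivalent))

# child -> parent links
parent = {
    "biocyc:L-Hexoses": "biocyc:Hexoses",
    "biocyc:D-Hexoses": "biocyc:Hexoses",
    "biocyc:L-Glucose": "biocyc:L-Hexoses",
    "biocyc:D-Glucose": "biocyc:D-Hexoses",
    "biocyc:alpha-D-Glucose": "biocyc:D-Glucose",
    "biocyc:beta-D-Glucose": "biocyc:D-Glucose",
    "reactome:alpha-D-Glucose": "reactome:Hexose",  # assumed single link
}
# The one curated alignment between the two taxonomies.
equivalent = {"reactome:Hexose": "biocyc:Hexoses"}

pooled = pooled_subclasses("biocyc:Hexoses", parent, equivalent)
```

With the single equivalence in place, `pooled` contains the six BioCyc classes plus reactome:alpha-D-Glucose. Note that this only groups the two alpha-D-Glucose entries under a common class; it does not assert that they are the same compound.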

5. IUPAC matching: synonyms are still the way that humans match
compounds.  It is possible to take an IUPAC name and uniquely identify
the structure, but not the other way around.

http://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_...

6. Group by chemical formula.  If all you have is a chemical formula, at
least you can group them into the same class of isomers.

One could imagine creating a compound hierarchy based on InChI levels.

It would be very easy to classify compounds according to this hierarchy in
an automated way.
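
One way such an automated classification might look, as a hedged Python sketch: every cumulative prefix of InChI layers is treated as a class, so the formula-only prefix is the class of isomers from item 6 and each further layer narrows it. The function names and the dict layout are hypothetical; the InChI strings are the ones quoted earlier in this message.

```python
from collections import defaultdict

def level_keys(inchi):
    """Cumulative layer prefixes: formula, formula + connectivity, and so on."""
    parts = inchi.split("/")
    return ["/".join(parts[:i]) for i in range(2, len(parts) + 1)]

def classify(compounds):
    """Map each InChI-level class to the set of compounds that fall under it."""
    members = defaultdict(set)
    for name, inchi in compounds.items():
        for key in level_keys(inchi):
            members[key].add(name)
    return members

full = ("InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
        "/t2-,3-,4+,5-,6+/m1/s1")
flat = "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"
compounds = {
    "chebi:alpha-D-Glucose": full,
    "pubchem:alpha-D-Glucose": full,
    "biocyc:ALPHA-GLUCOSE": flat,
    "kegg:alpha-D-Glucose": flat,
}
members = classify(compounds)
isomers = members["InChI=1/C6H12O6"]
stereo_specific = members[full]
```

All four entries land in the C6H12O6 isomer class, while only the ChEBI and PubChem entries reach the fully specified stereochemical level.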

Anyway, this was a long-winded way of saying what Alan said.

There is no easy way to unify compounds from different databases.  Glucose
happens to be a particularly ugly example, but there are plenty of others,
I assure you.

Sincerely,

Jeremy


zuc...@research.dfci.harvard.edu  
From: zuc...@research.dfci.harvard.edu
Date: Fri, 12 Oct 2007 04:22:01 -0400 (EDT)
Local: Fri, Oct 12 2007 4:22 am
Subject: Re: meeting today
I sent off the last email before finishing the analysis:

5. IUPAC matching: synonyms are still the way that humans match
compounds.  It is possible to take an IUPAC name and uniquely identify the
structure, but not the other way around.

http://en.wikipedia.org/wiki/International_Union_of_Pure_and_Applied_...

For example, in ChEBI, the IUPAC name for Alpha-D-Glucose is:
α-D-glucopyranose

However, in PubChem, the IUPAC name for alpha-D-Glucose is:
(2S,3R,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol

It turns out that it may be possible to automatically convert IUPAC
nomenclature into structures.

If you haven't checked out Richard Apodaca's website, this url should give
you a taste:
http://depth-first.com/articles/tag/iupac


Joanne Luciano  
From: Joanne Luciano <jluci...@predmed.com>
Date: Fri, 12 Oct 2007 07:22:34 -0400
Local: Fri, Oct 12 2007 7:22 am
Subject: Re: meeting today

Are you in town now?

Joanne Luciano, PhD
Predictive Medicine, Inc.
45 Orchard Street
Belmont MA 02478-3008
Email: jluci...@predmed.com

Michel Dumontier  
From: "Michel Dumontier" <michel.dumont...@gmail.com>
Date: Fri, 12 Oct 2007 09:40:37 -0400
Local: Fri, Oct 12 2007 9:40 am
Subject: Re: meeting today

Alan, Jeremy

Thanks! What a great analysis! Well substantiated.

:-)

So, is it worthwhile to scan through the pathway/reaction databases and
enumerate which identifiers are being used?

-=Michel=-

--
Michel Dumontier
Assistant Professor of Bioinformatics
http://dumontierlab.com
