FW: Categories of Generic entity representation issues

Joanne Luciano

Jul 12, 2005, 9:14:27 PM
to Debuggin...@googlegroups.com
Hi Aaron and Xiaoxia, co-authors if you accept the offer ....

Jeremy and I are hoping to submit a paper to PSB (Pacific Symposium on
Biocomputing). http://psb.stanford.edu/cfp-semweb.html

The deadline is totally unrealistic, July 18th, but I felt
we already have enough material for a paper, if we could just 'edit it.'
I started by recasting our notes from the tutorial we gave at ISMB. The
notes and slides can be found here:

http://www.biopathways.org/ismb2005tutorial-am6/

The recast paper is attached

During our tutorial we found we needed to introduce people to RDF
and the semantic web. So, my notion for the paper is to take the integration
case studies, show what kinds of issues arise, and then show how they were
resolved and how they could more easily be addressed with RDF and semantic
web technologies.

For example, we already have text that discusses how SBML can be extended
(annotated) with BioPAX; something Jeremy and I worked out a while ago,
but have not yet published.

Recently there has been a lot of discussion on the biopax-discuss list
about representing generic entities. I haven't caught up with it yet,
but I've got a sense of some of the issues. The discussions led to
the text Jeremy wrote below. Jeremy thought it would be good to start with
the generic idea and I agreed to recast that into PSB format. It's below,
but I'm not sure how to structure it in the new doc.... Jeremy???

Joanne
-----Original Message-----
From: Jeremy Zucker [mailto:zuc...@research.dfci.harvard.edu]
Sent: Tuesday, July 12, 2005 1:05 PM
To: jluc...@predmed.com
Subject: Categories of Generic entity representation issues

Hello folks,

After discussions with Xiaoxia and Aaron yesterday, it seems there are
several representation issues that are being confounded.

The overall goal is to use BioCyc data to develop complete and consistent
metabolic flux models. Issues arise when the data is incorrect or the
representation scheme is ambiguous.

Below is a discussion of some of the ambiguities that currently exist in
BioCyc.

BioCyc organizes compounds into a class hierarchy from more general to more
specific. Instances of these classes contain information such as molecular
weights, chemical formulas, and/or atomic structure. This information is
useful for determining whether a reaction is balanced, so that it can be
incorporated into a metabolic flux model. As long as all the participants of
a reaction are instances, the meaning is clear.
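
To make the balance check concrete, here is a minimal sketch in Python. It is
not BioCyc code: the function names are hypothetical, and it assumes simple
formulas with no parentheses, charges, or R-groups.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H6O' (assumed: no parentheses or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_total(side):
    """Sum atom counts over one side, given (formula, coefficient) pairs."""
    total = Counter()
    for formula, coeff in side:
        for element, n in parse_formula(formula).items():
            total[element] += coeff * n
    return total

def is_balanced(reactants, products):
    """A reaction is balanced when both sides carry the same atoms."""
    return side_total(reactants) == side_total(products)

# Glucose oxidation: C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O
glucose_ok = is_balanced([("C6H12O6", 1), ("O2", 6)], [("CO2", 6), ("H2O", 6)])
```

A check like this only works when every participant is an instance with a
known formula, which is why class participants are a problem.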

However, BioCyc also permits classes of compounds to participate in a
reaction. We call this a generalized reaction. Generalized reactions
contain several ambiguities that need to be resolved.

1. A generalized reaction is typically shorthand for several specific
reactions. The challenge in this case is to infer the specific reactions
from the general reaction. We present several categories of generalized
reactions and discuss the strategy for recognizing and resolving the correct
inference.
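
As a rough illustration of why the inference is hard, the sketch below expands
a generalized reaction by naively substituting class members. The class
membership table and the expand function are made up for illustration; real
BioCyc classes are far larger and are not accessed this way.

```python
from itertools import product

# Hypothetical class membership, for illustration only.
INSTANCES = {
    "an alcohol": ["ethanol", "1-propanol"],
    "an aldehyde": ["acetaldehyde", "propanal"],
}

def expand(reactants, products):
    """Yield every instance-level reaction obtained by substituting class members."""
    names = list(reactants) + list(products)
    # A name that is not a known class is treated as an instance and kept as-is.
    choices = [INSTANCES.get(name, [name]) for name in names]
    for combo in product(*choices):
        yield combo[:len(reactants)], combo[len(reactants):]

candidates = list(expand(["an alcohol", "NAD+"], ["an aldehyde", "NADH"]))
```

With two alcohols and two aldehydes this produces four candidates, but only
two of them (ethanol with acetaldehyde, 1-propanol with propanal) pair up
correctly. Some further test, such as a balance check or structure matching,
is needed to discard the rest.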




To describe the various categories, it is important to have a prototypical
example of each type of generic.

atomic structure matching:
EC# 1.1.1.1: an alcohol -> an aldehyde or ketone

Polymerization reaction:
Glycogens + Glucose -> Glycogens

Symbol/name matching:
NAD(P)H -> NAD(P)
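
For the symbol/name matching case, one possible reading is that the optional
"(P)" must be resolved the same way on both sides. A minimal sketch under that
assumption (expand_optional is a hypothetical helper, not part of any BioCyc
tool):

```python
def expand_optional(reactant, product, marker="(P)"):
    """Return the paired specific reactions, with the marker dropped and then kept."""
    letter = marker.strip("()")
    pairs = []
    # Apply the same choice to both sides so the pairs stay consistent.
    for replacement in ("", letter):
        pairs.append((reactant.replace(marker, replacement),
                      product.replace(marker, replacement)))
    return pairs

pairs = expand_optional("NAD(P)H", "NAD(P)")
# pairs == [("NADH", "NAD"), ("NADPH", "NADP")]
```

This avoids the mixed pairings (NADH -> NADP, NADPH -> NAD) that a naive
class expansion would also generate.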

More to follow...

Jeremy


PSB-ZuckerLuciano.doc

Markus Krummenacker

Jul 13, 2005, 4:49:30 PM
to Debuggin...@googlegroups.com
Joanne Luciano writes:
> -----Original Message-----
> From: Jeremy Zucker [mailto:zuc...@research.dfci.harvard.edu]
> Sent: Tuesday, July 12, 2005 1:05 PM
> To: jluc...@predmed.com
> Subject: Categories of Generic entity representation issues

> Below is a discussion of some of the ambiguities that currently exist in
> BioCyc.
>
> BioCyc organizes compounds into a class hierarchy from more general to more
> specific. Instances of these classes contain information such as molecular
> weights, chemical formulas, and/or atomic structure. This information is
> useful for determining whether a reaction is balanced, so that it can be
> incorporated into a metabolic flux model. As long as all the participants of
> a reaction are instances, the meaning is clear.
>
> However, BioCyc also permits classes of compounds to participate in a
> reaction. We call this a generalized reaction. Generalized reactions
> contain several ambiguities that need to be resolved.

hi everybody,

i'd like to make a few comments regarding this:

- biocyc uses cpd (compound) classes in reaction equations largely because
the EC system itself contains a huge number of reactions that
contain R-groups. the reason is that many enzymes have broad
substrate specificity. with cpd classes, we can capture this
notion.

- however, there are several separate reasons for when a cpd class can
be used, and this is probably one main source of ambiguity that is
not cleanly resolved currently. our curator's guide at:
http://bioinformatics.ai.sri.com/ptools/curatorsguide.pdf

discusses these issues in more depth in chapter 3.2.3 "Compound
Classes for Broad Substrate Specificity and Polymerization". in
short, a cpd class can be used when an enzyme is known to have a
broad substrate specificity, and (most) of the actual compounds could
be enumerated explicitly (though it might not be desirable to have
this kind of bloated data around). related to this is the treatment
of polymers, which are also represented by classes, with some
auxiliary infrastructure, to deal with the different names they can
have.

a different but common reason for a cpd class is when an enzyme has
not been sufficiently characterised experimentally so we can be sure
about the exact scope of substrates it might process. biocyc then
often uses a cpd class that is not supposed to contain any instances
(and those classes ought to be marked more explicitly, to make the
degree of ignorance more explicit). also, often dna sequence
matches can be made to homologs of enzymes that allow the genome
annotators to say something about the likely substrate class for a
given gene, but without knowing the precise substrate, though it may
become known at a later point in time.


because of the level of ignorance and/or actual broadness of substrate
specificity of real world enzymes, using cpd classes in reaction
equations has been a reasonable choice for capturing some of what is
going on. however, i realize that there are still a number of
unresolved issues, that will require adding some additional complexity
to the schema to capture everything that will be needed. it will
happen at some point. :-)

--
Regards
Markus Krummenacker