order of XREFs in smallMolecule

Rafael Alcántara

Jun 6, 2012, 6:32:40 AM

to biopax-...@googlegroups.com

Hi.

I am trying to find a workaround for a problem; perhaps some of you can help.
Reactome's BioPAX level 2 file contains some smallMolecule instances
with more than one XREF to ChEBI, which were converted from Reactome's
EntitySets (alternative substrates/products).
I want to parse this file and extract the individual reactions between the
concrete substrates/products, but to that end both collections of xrefs
(alternative substrates and alternative products) must be ordered so that
they can be matched up correctly. While the BioPAX file seems to have the
XREFs ordered, some of these Reactome EntitySets are matched correctly,
but others are not.

I see the implementation in ReferenceHelper
(http://biopax.sourceforge.net/paxtools-4.1.1/paxtools-core/xref/org/biopax/paxtools/impl/level2/ReferenceHelper.html#25)
is a HashSet. That might be the reason, but I don't know how Jena is
handling the model under the hood (I use
JenaIOHandler().convertFromOWL(InputStream)).

Any ideas are more than welcome.
Thanks in advance,

--
Dr. Rafael Alcántara
<rafael.a...@ebi.ac.uk>
Software Engineer
Cheminformatics and Metabolism Team
European Bioinformatics Institute - EMBL
Tel. +44 (0)1223 494414

Oliver Ruebenacker

Jun 6, 2012, 8:13:30 AM

to biopax-...@googlegroups.com

Hello Rafael,

Properties in BioPAX (and in RDF/OWL in general) do not have an
order. If Reactome output behaves as you describe, they are doing
something non-standard. Perhaps someone from Reactome can comment on
this?

If for some reason you still need the order: OpenRDF Sesame allows
you to write your own RDF handler, which receives the statements
directly from the RDF parser, I'm assuming in the order they appear in
the input.

I prefer Sesame to Jena, because Sesame is much more transparently
divided into components, allowing you to use some components of it
while skipping others, and to reimplement some of their interfaces. If
all you need is reading and writing RDF, you only need a small
fraction of Sesame (the parts named RIO, Model and Util, I think).

Take care
Oliver

--
Oliver Ruebenacker
Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker)
Knowomics, The Bioinformatics Network (http://www.knowomics.com)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)

D'Eustachio, Peter

Jun 6, 2012, 10:30:35 AM

to biopax-...@googlegroups.com, Guanming Wu

When we annotate a reaction in which an enzyme with broad substrate specificity can convert any one of a group of input / substrate molecules to corresponding output / product molecules, we use sets as input and output. The order in which the specific molecules are listed in the input set corresponds to the order in the output set, which is intended to mean that the reaction can convert input-1 to output-1 or input-2 to output-2 and so forth. There is nothing in the logic of our data model or curator software that enforces correct ordering of the two sets - it depends on the alertness of the human curator.
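The positional reading described above (input-1 to output-1, input-2 to output-2) can be sketched roughly as follows. This is only an illustration, not Reactome or Paxtools code: the class `PositionalPairing` and the method `pairByPosition` are hypothetical names, and the sketch assumes both member lists reach the caller in the order the curator entered them.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Reactome or Paxtools.
final class PositionalPairing {

    // Pairs the i-th substrate with the i-th product. Both lists are assumed
    // to be in curated order; nothing here can detect a mis-ordered set.
    static List<Map.Entry<String, String>> pairByPosition(
            List<String> substrates, List<String> products) {
        if (substrates.size() != products.size()) {
            throw new IllegalArgumentException(
                "Set sizes differ: " + substrates.size() + " vs " + products.size());
        }
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (int i = 0; i < substrates.size(); i++) {
            pairs.add(new SimpleImmutableEntry<>(substrates.get(i), products.get(i)));
        }
        return pairs;
    }
}
```

A size mismatch is the only inconsistency such a helper can catch; two sets of equal size listed in different orders would still be paired silently, which is why the result depends on the curator.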

We do not catalog the attributes of small molecules, but get them by reference from ChEBI. With the growth of their data set and ontology structure, we now often have a better way (we think) to annotate broad substrate specificity. Instead of using home-made entity sets, we use higher-level terms from ChEBI. Thus, instead of assembling our own sets to list nucleotide monophosphates or aliphatic alcohols, we use the terms for those concepts from ChEBI to create a generic molecule instance in Reactome, then use that generic molecule as input or output in a reaction. A limitation of this approach is that not all molecule collections we need correspond to entities in ChEBI, so our annotation still sometimes uses sets. Also, we haven't systematically replaced the existing legacy sets.

That's where we stand at present - I hope it at least explains the situation.

Peter D'E

Oliver Ruebenacker

Jun 6, 2012, 11:25:27 AM

to biopax-...@googlegroups.com

Hello Rafael,

Actually, I just remembered that Sesame preserves order by default, so
the simplest solution would be to read the RDF graph relying on the
default implementation.

Technically, Sesame's RDFGraph is an interface extending a Java
Collection of Sesame Statements. The default implementation is a Java
List of Sesame Statements. To my mind, that is not the most faithful
implementation of the RDF/OWL standards, but it does have the advantage
that order is preserved.

If you choose to reimplement RDFGraph, you can use a Java Collection
other than a Java List, for example a Java Set, which does not
preserve order. Then you can still catch the order of statements in
the input by using a custom RDF handler.
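The difference between list-backed and set-backed storage can be seen with plain `java.util` collections. The sketch below deliberately uses no Sesame, Jena or Paxtools API; `XrefOrderDemo` and `iterationOrder` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical demo class; uses only java.util, no RDF library.
final class XrefOrderDemo {

    // Copies the ids into the given collection, then reports the order in
    // which that collection iterates over them.
    static List<String> iterationOrder(Collection<String> target, List<String> ids) {
        target.addAll(ids);
        return new ArrayList<>(target);
    }
}
```

Passing a `new ArrayList<>()` or a `new LinkedHashSet<>()` as `target` returns the ids in input order. Passing a `new HashSet<>()` returns the same ids, but the HashSet contract makes no promise about their order, so any positional pairing built on top of it is unreliable. Note that both set types also drop duplicate ids.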

Take care
Oliver

Oliver Ruebenacker

Jun 6, 2012, 11:42:36 AM

to biopax-...@googlegroups.com

Hello Peter,

Let me see if I understand your data correctly: you have a reaction
R: A -> B, and A is a superset of A1, A2, etc. while B is a superset
of B1, B2, etc. And R is really a superset of R1: A1 -> B1, R2: A2 ->
B2, etc.

You have a unification cross reference to ChEBI that describes A1.
Unless it is also valid for A, you cannot legally list it as a
unification cross reference of A.

For example, it would be illegal to assign to the same physical
entity unification cross references to both methanol and ethanol,
because it cannot be both at the same time. I suppose you could use
cross reference instead of unification cross reference, but it would
be unclear what that means.

Can't you just spell out the individual reactions R1, R2, etc?

Take care
Oliver

D'Eustachio, Peter

Jun 6, 2012, 12:50:39 PM

to biopax-...@googlegroups.com, Guanming Wu

Hello Oliver,

You understand correctly. I don't understand "unification cross-reference" reliably enough to be sure, but what you say sounds plausible. We avoid a combinatorial explosion (which we would need to handle manually) by using sets in this way. I guess (but Guanming would know better) that a Reactome reaction involving a set could be expanded, at the time of the export to BioPAX, into a list of reactions that each have a single input and a single output chemical entity. I know we've considered that, and there are reasons (Guanming and Gary may know better) why we haven't done it.

Peter

Guanming Wu

Jun 6, 2012, 1:03:17 PM

to biopax-...@googlegroups.com, Peter D'Eustachio

Hi Oliver,

Just as Peter said, we have not tried to expand reactions involving sets, both to avoid combinatorial explosion and to keep the exported data closer to our original Reactome content.

Thanks,

Guanming

Oliver Ruebenacker

Jun 7, 2012, 11:58:42 AM

to biopax-...@googlegroups.com

Hello Peter, Guanming,

I don't see combinatorial explosion in the way I (and Rafael, I
think) understood it, i.e. if you only have reactions R1: A1 -> B1,
R2: A2 -> B2, etc.

To have combinatorial explosion, you would also need to have more
reactions, e.g. R12: A1 -> B2. In that case, I don't see why order
would matter (as Rafael assumed).

My understanding of unification cross reference is that two entities
can be considered identical if they have one identical unification
cross reference. E.g. if X and Y both have UXR "ethanol", they are the
same, i.e. they are both ethanol. That wouldn't work if B at the same
time also had UXR "methanol".
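A minimal sketch of that reading of unification cross references, assuming each entity is reduced to the set of its UXR identifiers. The class `UnificationCheck` and the method `consideredIdentical` are hypothetical names, not Paxtools API, and the plain strings stand in for real database identifiers.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical helper, not part of Paxtools.
final class UnificationCheck {

    // Two entities count as the same thing if they share at least one
    // unification cross reference.
    static boolean consideredIdentical(Set<String> uxrsOfFirst, Set<String> uxrsOfSecond) {
        return !Collections.disjoint(uxrsOfFirst, uxrsOfSecond);
    }
}
```

With a generic entity carrying both "methanol" and "ethanol", an entity carrying only "ethanol", and another carrying only "methanol", the generic entity is considered identical to each of the other two, although those two share nothing with each other. Merging on shared UXRs would therefore fold methanol and ethanol into one entity.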

I would prefer you to spell out all reactions for the sake of
correct BioPAX and the possibility of integrating your data with
other data. And how bad can it be? If you have reaction A + B -> C, and
both A and B have twenty options each, let there be 400 reactions, no
big deal.

You can also make expansion optional, to keep it simple for those
who evaluate pathways by eye.
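For a sense of scale, the full expansion of A + B -> C can be sketched as a nested loop over the members of both sets. This is an illustration only: `SetExpansion`, `expandAllCombinations` and `members` are hypothetical names, reactions are represented as plain strings, and the product is kept fixed as C for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; reactions are plain strings for illustration.
final class SetExpansion {

    // Emits one reaction for every combination of a member of A with a
    // member of B. The product is assumed to be the same for all of them.
    static List<String> expandAllCombinations(
            List<String> membersOfA, List<String> membersOfB, String product) {
        List<String> reactions = new ArrayList<>();
        for (String a : membersOfA) {
            for (String b : membersOfB) {
                reactions.add(a + " + " + b + " -> " + product);
            }
        }
        return reactions;
    }

    // Generates placeholder member names such as A1, A2, ..., A<count>.
    static List<String> members(String prefix, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            names.add(prefix + i);
        }
        return names;
    }
}
```

Calling `expandAllCombinations(members("A", 20), members("B", 20), "C")` yields 20 x 20 = 400 reaction strings. Order within the sets plays no role here, in contrast to the positional case R1: A1 -> B1, R2: A2 -> B2.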

Take care
Oliver

D'Eustachio, Peter

Jun 7, 2012, 4:06:11 PM

to biopax-...@googlegroups.com

Hello Oliver,

For our attempts at a human-readable graphic representation of pathways, 20 versus 400, or even 40, is a substantial difference. For a data set that will only be traversed computationally, 20 versus 400 or even 4000 is probably not important, so ways of taking our "compact" representation and expanding the instances that involve generic or set inputs could be workable - simply getting rid of the "compact" representation within Reactome, I think, isn't workable.

Another issue is maintenance. If we create a reaction with a set as input, two kinds of change are easy and reliable. Changing the list of allowed inputs requires only an edit of the set instance, not the addition or deletion of a whole reaction, and changing some other feature of the reaction again involves only a single edit, not parallel edits on each of a series of reactions.

Another use of sets is to group different enzymes that can all catalyze the same reaction. Human Reactome does that only to a limited extent, but groups using the Reactome data model to annotate reactions for diverse microbes and possibly for plants will use this annotation strategy extensively (as does KEGG already).

IgorRodchenkov

Jun 7, 2012, 6:23:16 PM

to biopax-...@googlegroups.com

Just a couple of comments and thoughts:

1. According to the L3 UnificationXref semantics, if any two BioPAX objects share at least one UnificationXref, they are about the same Thing and can be merged. In most cases (in my experience), a UnificationXref ends up on more than one parent BioPAX object by mistake. For example: the RelationshipXref class should be used for ortholog IDs; a UniProt UnificationXref is NOT for a Protein (it's for a ProteinReference); a ChEBI UnificationXref is not for a SmallMolecule (it's for a SmallMoleculeReference); EntrezGene does not "work" as a ProteinReference's UnificationXref; etc.

2. There are pros, cons and limitations to having either "20" (generalized knowledge) or "400" (more specific, experimental data) interactions, and people actually need both (curated). But, as Peter said, a "compact" BioPAX export is more convenient from both the visualization and the curation/support perspectives. For a more specific computational analysis, e.g. before mapping to SBML, SIF or GSEA, it should be possible (though not easy) to expand and filter these data, i.e. convert them to another, perhaps one-organism, no-generics BioPAX representation and then remove all undesired evidence, interactions, and members. How to expand/filter won't be the same for all projects; it has to be decided by a researcher or software engineer on a case-by-case basis. Also, if there is a reaction A + B -> C, and both A and B have twenty options each, it won't necessarily lead to 400 reactions if done wisely (one should not mix up different organisms, and should consider publications, evidence, etc.).

IR.
