PRO

Oliver Ruebenacker

Mar 23, 2009, 10:07:41 PM

to biopa...@googlegroups.com

Hello, All,

So I downloaded PRO
(ftp://ftp.pir.georgetown.edu/databases/ontology/pro_obo/pro.obo) and
looked at it with Protege. I expected to find a deep hierarchy with
lots of generic classes and many properties. Instead, I see a very flat
hierarchy with mostly highly specific classes and only one property.
Am I missing something? Is PRO just another catalogue?
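
A quick way to check hierarchy depth and the number of properties outside of Protege is to walk the is_a links in the OBO file. The following is a minimal sketch, assuming the file uses ordinary [Term] and [Typedef] stanzas; the sample text, its EX: identifiers, and the relation name are made up for illustration and are not taken from pro.obo.

```python
from collections import defaultdict

# Hypothetical sample in OBO stanza syntax (not real PRO content).
SAMPLE_OBO = """\
[Term]
id: EX:0000001
name: protein

[Term]
id: EX:0000002
name: example family
is_a: EX:0000001 ! protein

[Term]
id: EX:0000003
name: example isoform
is_a: EX:0000002 ! example family

[Typedef]
id: example_relation
name: example_relation
"""


def parse_obo(text):
    """Return (parents, typedefs): is_a parents per term id, and Typedef ids."""
    parents = defaultdict(list)
    typedefs = []
    stanza, current = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            stanza, current = line[1:-1], None
        elif line.startswith("id:"):
            current = line[3:].strip()
            if stanza == "Term":
                parents[current]  # register the term even if it has no parent
            elif stanza == "Typedef":
                typedefs.append(current)
        elif line.startswith("is_a:") and stanza == "Term" and current:
            # Drop the trailing "! comment" that OBO allows after the id.
            parents[current].append(line[5:].split("!")[0].strip())
    return dict(parents), typedefs


def depth(term, parents, seen=()):
    """Length of the longest is_a chain from term up to a root."""
    if term in seen:  # guard against accidental cycles
        return 0
    ups = parents.get(term, [])
    if not ups:
        return 0
    return 1 + max(depth(p, parents, seen + (term,)) for p in ups)


parents, typedefs = parse_obo(SAMPLE_OBO)
max_depth = max(depth(t, parents) for t in parents)
print("max is_a depth:", max_depth, "| relations:", typedefs)
```

Pointing parse_obo at the text of a downloaded file instead of SAMPLE_OBO would give the same two numbers for that file.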

So PRO is designed for the single molecule level. Do they intend
stoichiometric coefficients to be properties of restrictions on
reactions rather than properties of reactions themselves? Is it useful
at all for the ensemble level?

Take care
Oliver

--
Oliver Ruebenacker, Computational Cell Biologist
BioPAX Integration at Virtual Cell (http://vcell.org/biopax)
Center for Cell Analysis and Modeling
http://www.oliver.curiousworld.org

Alan Ruttenberg

Mar 24, 2009, 2:52:26 AM

to biopa...@googlegroups.com

You need the annotations file too. They've coded stuff as xrefs that
should be changed to relations. Also you should read some of their web
site and paper.

-Alan

Oliver Ruebenacker

Mar 24, 2009, 6:44:32 AM

to biopa...@googlegroups.com

Hello Alan, All,

On Tue, Mar 24, 2009 at 2:52 AM, Alan Ruttenberg
<alanrut...@gmail.com> wrote:
> You need the annotations file too.

Which file do you mean? PAF.txt?

> They've coded stuff as xrefs that
> should be changed to relations.

You mean, you don't like the way PRO is now either?

> Also you should read some of their web
> site and paper.

I read some of their website and am scratching my head why the PRO
class hierarchy seems to be only two levels deep after "protein" when
they talk about four levels.

Anyway, they say PRO is single molecule level. Is it then any use to
Systems Biologists who work on the ensemble level?

What exactly is PRO supposed to teach us? How is their class
hierarchy special?

Alan Ruttenberg

Mar 24, 2009, 9:21:15 AM

to biopa...@googlegroups.com

On Tue, Mar 24, 2009 at 6:44 AM, Oliver Ruebenacker <cur...@gmail.com> wrote:
>
> Hello Alan, All,
>
> On Tue, Mar 24, 2009 at 2:52 AM, Alan Ruttenberg
> <alanrut...@gmail.com> wrote:
>> You need the annotations file too.
>
> Which file do you mean? PAF.txt?
>
>> They've coded stuff as xrefs that
>> should be changed to relations.
>
> You mean, you don't like the way PRO is now either?

Oliver, you should see by now that I don't like the way a lot of
things are. Otherwise I wouldn't be working to fix things. So I try to
identify which projects and strategies are most likely to advance
where we are.

I have a date with the PRO developers to address their representation.
Their curation is very good and this is a matter of rendering to be
fixed. But the underlying principle - that proteins as entities need
to be identified as distinct from records about them - is sound.

For example, consider the section marked sequence annotations in
http://www.uniprot.org/uniprot/P04637

What are the entities that are being described? Are they all the sort
of thing that coexist in one person? Consider the region described as
associated with 1-44. What do the "natural variations" with indices
between 1 and 4 have to do with the function ascribed to 1-44? Are
there papers that describe the molecular machine that has the natural
variation at 35?

>
>> Also you should read some of their web
>> site and paper.
>
> I read some of their website and am scratching my head why the PRO
> class hierarchy seems to be only two levels deep after "protein" when
> they talk about four levels.
>
> Anyway, they say PRO is single molecule level. Is it then any use to
> Systems Biologists who work on the ensemble level?

SBML and others describe species that are quite often identified as
collections of molecules of a certain sort. The molecules implied by
the natural variations lines behave quite differently than those
without the variations (namely they occur in cancer and are perhaps
involved in mechanisms unique to cancer). A model of cancer ought to
behave differently than a model in which cancer does not occur. The
differences that explain it need to be somewhere.

> What exactly is PRO supposed to teach us? How is their class
> hierarchy special?

At the base PRO gives identifiers for distinct molecular machines or
their parts. This is a sound basis upon which to record information
about function. Without anything else this is useful.

Regards,
Alan

Oliver Ruebenacker

Mar 24, 2009, 12:17:25 PM

to biopa...@googlegroups.com

Hello Alan, All,

On Tue, Mar 24, 2009 at 9:21 AM, Alan Ruttenberg
<alanrut...@gmail.com> wrote:
> Oliver, you should see by now that I don't like the way a lot of
> things are. Otherwise I wouldn't be working to fix things. So I try to
> identify which projects and strategies are most likely to advance
> where we are.

Sure, we all want to change the world. So the point of interest is
less what PRO is but what it may have the potential to become?

> But the underlying principle - that proteins as entities need
> to be identified as distinct from records about them - is sound.

I understand entities are distinct from records, but I don't see how
their identifications are different from records. So I don't
understand what the difference is effectively.

> For example, consider the section marked sequence annotations in
> http://www.uniprot.org/uniprot/P04637
>
> What are the entities that are being described? Are they all the sort
> of thing that coexist in one person? Consider the region described as
> associated with 1-44. What do the "natural variations" with indices
> between 1 and 4 have to do with the function ascribed to 1-44? Are
> there papers that describe the molecular machine that has the natural
> variation at 35?

What is a molecular machine?

I understand there is missing information, but I don't understand
how that relates to the entity versus record distinction. Besides, is
it possible that the information is missing because it is not known?

> SBML and others describe species that are quite often identified as
> collections of molecules of a certain sort. The molecules implied by
> the natural variations lines behave quite differently than those
> without the variations (namely they occur in cancer and are perhaps
> involved in mechanisms unique to cancer). A model of cancer ought to
> behave differently than a model in which cancer does not occur. The
> differences that explain it need to be somewhere.

Sure, that information needs to be somewhere. How about reporting
the UniProt variant number?

> At the base PRO gives identifiers for distinct molecular machines or
> their parts. This is a sound basis upon which to record information
> about function. Without anything else this is useful.

I don't understand how that is different from what UniProt intends to do.

Alan Ruttenberg

Mar 24, 2009, 1:28:05 PM

to biopa...@googlegroups.com

On Tue, Mar 24, 2009 at 12:17 PM, Oliver Ruebenacker <cur...@gmail.com> wrote:
>
> Hello Alan, All,
>
> On Tue, Mar 24, 2009 at 9:21 AM, Alan Ruttenberg
> <alanrut...@gmail.com> wrote:
>> Oliver, you should see by now that I don't like the way a lot of
>> things are. Otherwise I wouldn't be working to fix things. So I try to
>> identify which projects and strategies are most likely to advance
>> where we are.
>
> Sure, we all want to change the world. So the point of interest is
> less what PRO is but what it may have the potential to become?

It is both.

>> But the underlying principle - that proteins as entities need
>> to be identified as distinct from records about them - is sound.
>
> I understand entities are distinct from records, but I don't see how
> their identifications are different from records. So I don't
> understand what the difference is effectively.

Database records do not make clear what entities they refer to. They
are ambiguous for some kinds of data exchange because people are able
to mean different things by citing them. And do.

>> For example, consider the section marked sequence annotations in
>> http://www.uniprot.org/uniprot/P04637
>>
>> What are the entities that are being described? Are they all the sort
>> of thing that coexist in one person? Consider the region described as
>> associated with 1-44. What do the "natural variations" with indices
>> between 1 and 4 have to do with the function ascribed to 1-44? Are
>> there papers that describe the molecular machine that has the natural
>> variation at 35?
>
> What is a molecular machine?

It is a word I and others have used to emphasize that proteins and
protein complexes do things, by analogy to machines and machine parts
we build.

>
> I understand there is missing information, but I don't understand
> how that relates to the entity versus record distinction. Besides, is
> it possible that the information is missing because it is not known?

What information could be missing? How could you say what was known
and unknown? The organization of these records changes. Do we not
agree that one record describes several types of things, types whose
differences we care about, such as between proteins that are part of
the mechanism by which cancer takes hold and others that are not?

>> SBML and others describe species that are quite often identified as
>> collections of molecules of a certain sort. The molecules implied by
>> the natural variations lines behave quite differently than those
>> without the variations (namely they occur in cancer and are perhaps
>> involved in mechanisms unique to cancer). A model of cancer ought to
>> behave differently than a model in which cancer does not occur. The
>> differences that explain it need to be somewhere.
>
> Sure, that information needs to be somewhere. How about reporting
> the UniProt variant number?

What's its URI? Is it stable? What is the sequence of the protein
referred to by the variant number? etc.

>> At the base PRO gives identifiers for distinct molecular machines or
>> their parts. This is a sound basis upon which to record information
>> about function. Without anything else this is useful.
>
> I don't understand how that is different from what UniProt intends to do.

I'm not sure what to say other than "think harder".

-Alan

Oliver Ruebenacker

Mar 24, 2009, 2:06:56 PM

to biopa...@googlegroups.com

Hello Alan, All,

On Tue, Mar 24, 2009 at 1:28 PM, Alan Ruttenberg
<alanrut...@gmail.com> wrote:
> I'm not sure what to say other than "think harder".

I'm not sure what to say other than "try harder to communicate what you mean".

Alan Ruttenberg

Mar 24, 2009, 2:19:01 PM

to biopa...@googlegroups.com

We're meeting in a couple of hours. Let's talk then. I'm willing to
try harder, though I'm mystified as to why this particular topic comes
up over and over. Can you bring a use case to the table so we have
something concrete that we can discuss the matter in reference to?

Remember, the BioPAX OBO effort is an effort to build an *ontology*,
and the job of an ontology, at least the sort we work on in the
foundry, is to make it very clear what we are talking about. And it is
very clear that some bits and a protein are different sorts of things.

As one more example, consider the fact that there are many resources
that have records about proteins. How do you suggest that we say that
some of the protein identifiers enumerated at

http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=7157&ordinalpos=1&itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum#refseq

and

http://www.uniprot.org/uniprot/P04637

refer to the same thing? With owl:sameAs? Seems that would be wrong,
wouldn't it?

Shouldn't there be something that they are both *about*? And shouldn't
we document exactly what those things are?
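
The difference between linking two records with owl:sameAs and saying that both are *about* a third thing can be sketched with a handful of triples. This is only an illustration under stated assumptions: the ex: names (ex:is_about, ex:maintained_by, ex:protein_entity_1) are hypothetical placeholders rather than terms from any published ontology, and the claim that both records are about one entity is assumed for the sake of the example.

```python
# Triples as (subject, predicate, object). All "ex:" names are hypothetical.
TRIPLES = {
    ("refseq:NP_000537", "rdf:type", "ex:DatabaseRecord"),
    ("uniprot:P04637", "rdf:type", "ex:DatabaseRecord"),
    ("refseq:NP_000537", "ex:maintained_by", "NCBI"),
    ("uniprot:P04637", "ex:maintained_by", "UniProt"),
    # Assumed for illustration: both records describe one protein entity.
    ("refseq:NP_000537", "ex:is_about", "ex:protein_entity_1"),
    ("uniprot:P04637", "ex:is_about", "ex:protein_entity_1"),
    ("ex:protein_entity_1", "rdf:type", "ex:Protein"),
}


def objects(subject, predicate, triples):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def shared_referents(a, b, triples):
    """Entities that records a and b are both about."""
    return objects(a, "ex:is_about", triples) & objects(b, "ex:is_about", triples)


def merge_same_as(a, b, triples):
    """Naive owl:sameAs reading: every statement about a also holds of b, and vice versa."""
    merged = set(triples)
    for s, p, o in triples:
        if s == a:
            merged.add((b, p, o))
        if s == b:
            merged.add((a, p, o))
    return merged


referents = shared_referents("refseq:NP_000537", "uniprot:P04637", TRIPLES)
merged = merge_same_as("refseq:NP_000537", "uniprot:P04637", TRIPLES)
maintainers = objects("uniprot:P04637", "ex:maintained_by", merged)
print(referents)    # the one entity both records are about
print(maintainers)  # after sameAs, the UniProt record also appears NCBI-maintained
```

With the is_about links the two records stay distinct and keep their own provenance; with the sameAs merge, statements that are true of only one record leak onto the other.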

-Alan

Oliver Ruebenacker

Mar 24, 2009, 2:44:31 PM

to biopa...@googlegroups.com

Hello Alan, All,

On Tue, Mar 24, 2009 at 2:19 PM, Alan Ruttenberg
<alanrut...@gmail.com> wrote:
>
> We're meeting in a couple of hours. Let's talk then.

I'm sure we will.

> I'm willing to
> try harder, though I'm mystified as to why this particular topic comes
> up over and over.

Maybe because the issue has never been resolved?

> Can you bring a use case to the table so we have
> something concrete that we can discuss the matter in reference to?

OK, how about:

http://www.reactome.org/cgi-bin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=177934&

> Remember, the BioPAX OBO effort is an effort to build an *ontology*,
> and the job of an ontology, at least the sort we work on in the
> foundry, is to make it very clear what we are talking about. And it is
> very clear that some bits and a protein are different sorts of things.

Well, I've never heard about a project with the goal of being unclear.

> As one more example, consider the fact that there are many resources
> that have records about protein. How do you suggest that we say that
> both some of the protein identifiers enumerated at
>
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=7157&ordinalpos=1&itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum#refseq
>
> and
>
> http://www.uniprot.org/uniprot/P04637
>
> refer to the same thing? With owl:sameAs? Seems that would be wrong,
> wouldn't it?
>
> Shouldn't there be something that they are both *about*? And shouldn't
> we document exactly what those things are?

Already BioPAX has XREF and sub properties, with entities as domain
and records as range. So BP clearly does distinguish between entity
and record.

What is the alternative? Have PRO include a class for any variation
of a protein you may possibly be interested in? Is PRO going to
provide a class for UniProt's P04637 with 44 sub classes? Or 44*35?
What if a protein has 10 sites for potential phosphorylation, is PRO
going to provide 2^10 sub classes?
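
The arithmetic behind the 2^10 figure can be checked by enumerating every on/off pattern over ten sites. This is a small sketch that assumes each site is independently either phosphorylated or not; the function name is made up for illustration.

```python
from itertools import product


def phosphoforms(n_sites):
    """Every phosphorylated/unphosphorylated pattern over n_sites independent sites."""
    return product((False, True), repeat=n_sites)


count = sum(1 for _ in phosphoforms(10))
print(count)  # 1024, i.e. 2**10
```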

Michel Dumontier

Mar 24, 2009, 2:45:22 PM

to biopa...@googlegroups.com

On Tue, Mar 24, 2009 at 2:19 PM, Alan Ruttenberg <alanrut...@gmail.com> wrote:

> We're meeting in a couple of hours. Let's talk then. I'm willing to
> try harder, though I'm mystified as to why this particular topic comes
> up over and over. Can you bring a use case to the table so we have
> something concrete that we can discuss the matter in reference to?
>
> Remember, the BioPAX OBO effort is an effort to build an *ontology*,
> and the job of an ontology, at least the sort we work on in the
> foundry, is to make it very clear what we are talking about. And it is
> very clear that some bits and a protein are different sorts of things.
>
> As one more example, consider the fact that there are many resources
> that have records about proteins. How do you suggest that we say that
> some of the protein identifiers enumerated at
>
> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=7157&ordinalpos=1&itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum#refseq
>
> and
>
> http://www.uniprot.org/uniprot/P04637
>
> refer to the same thing? With owl:sameAs? Seems that would be wrong,
> wouldn't it?
>
> Shouldn't there be something that they are both *about*? And shouldn't
> we document exactly what those things are?

The NCBI record cross-references the NCBI proteins NP_000537 and NP_001119585 to Uniprot's P04637. Given they share the same sequence (in the same organism), they are for all intensive purposes equivalent.

-=Michel=-

> -Alan

On Tue, Mar 24, 2009 at 2:06 PM, Oliver Ruebenacker <cur...@gmail.com> wrote:
>
> Hello Alan, All,
>
> On Tue, Mar 24, 2009 at 1:28 PM, Alan Ruttenberg
> <alanrut...@gmail.com> wrote:
>> I'm not sure what to say other than "think harder".
>
> I'm not sure what to say other than "try harder to communicate what you mean".
>
> Take care
> Oliver
>
> --
> Oliver Ruebenacker, Computational Cell Biologist
> BioPAX Integration at Virtual Cell (http://vcell.org/biopax)
> Center for Cell Analysis and Modeling
> http://www.oliver.curiousworld.org
>

--
Michel Dumontier
Assistant Professor of Bioinformatics
http://dumontierlab.com

Alan Ruttenberg

Mar 24, 2009, 3:01:30 PM

to biopa...@googlegroups.com

On Tue, Mar 24, 2009 at 2:44 PM, Oliver Ruebenacker <cur...@gmail.com> wrote:
>
> Hello Alan, All,
>
> On Tue, Mar 24, 2009 at 2:19 PM, Alan Ruttenberg
> <alanrut...@gmail.com> wrote:
>>
>> We're meeting in a couple of hours. Let's talk then.
>
> I'm sure we will.
>
>> I'm willing to
>> try harder, though I'm mystified as to why this particular topic comes
>> up over and over.
>
> Maybe because the issue has never been resolved?

It actually has been resolved. The BioPAX-OBO effort is an attempt to
pursue goals similar to BioPAX's, but to accomplish them in a manner
compatible with OBO foundry principles. IAO, and this distinction, are
supported within the foundry.

>
>> Can you bring a use case to the table so we have
>> something concrete that we can discuss the matter in reference to?
>
> OK, how about:
>
> http://www.reactome.org/cgi-bin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=177934&

Sorry, a use case talks about *use*. Please expand.

> Well, I've never heard about a project with the goal of being unclear.

Well, goal or not, many land up this way, not necessarily because of intention.

>
>> As one more example, consider the fact that there are many resources
>> that have records about protein. How do you suggest that we say that
>> both some of the protein identifiers enumerated at
>>
>> http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=ShowDetailView&TermToSearch=7157&ordinalpos=1&itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum#refseq
>>
>> and
>>
>> http://www.uniprot.org/uniprot/P04637
>>
>> refer to the same thing? With owl:sameAs? Seems that would be wrong,
>> wouldn't it?
>>
>> Shouldn't there be something that they are both *about*? And shouldn't
>> we document exactly what those things are?
>
> Already BioPAX has XREF and sub properties, with entities as domain
> and records as range. So BP clearly does distinguish between entity
> and record.

No it doesn't. It has no theory of what a "utility class" is, and
there is no consistent theory of whether the entities in BioPAX that
are called proteins are a) protein records, b) protein molecules, c)
classes of protein molecules, or d) ensembles of protein molecules. In
fact there is evidence of them being used as each of these in
different places.

Moreover, the XREF and subproperties don't work the way the semantic
web works. They do not guarantee, never mind even suggest, how
consistency of reference is to be accomplished.
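
To make the four readings of "protein" listed above concrete, here is a minimal sketch that gives each one its own type. The class names, the choice to have records point at a class, and the copy number are assumptions made for illustration; nothing here is taken from BioPAX or PRO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinClass:        # (c) a class of protein molecules
    label: str


@dataclass(frozen=True)
class ProteinMolecule:     # (b) one particular molecule
    instance_of: ProteinClass


@dataclass(frozen=True)
class ProteinEnsemble:     # (d) a collection of molecules of one class
    members_of: ProteinClass
    copy_number: int


@dataclass(frozen=True)
class ProteinRecord:       # (a) a database record about a class
    accession: str
    is_about: ProteinClass


p53 = ProteinClass("cellular tumor antigen p53")
record = ProteinRecord("P04637", is_about=p53)
molecule = ProteinMolecule(instance_of=p53)
pool = ProteinEnsemble(members_of=p53, copy_number=1000)  # made-up count

# All four refer to the same class, yet none can stand in for another.
kinds = {type(x).__name__ for x in (p53, record, molecule, pool)}
print(sorted(kinds))
```

A property such as a stoichiometric coefficient or a concentration would naturally attach to the ensemble, an accession to the record, and a sequence to the class, which is one way to see why conflating the four causes trouble.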

> What is the alternative? Have PRO include a class for any variation
> of a protein you may possibly be interested in? Is PRO going to
> provide a class for UniProt's P04637 with 44 sub classes? Or 44*35?
> What if a protein has 10 sites for potential phosphorylation, is PRO
> going to provide 2^10 sub classes?

PRO chooses as its domain proteins that have been experimentally
observed, not ones that potentially might be observed.

-Alan

Alan Ruttenberg

Mar 24, 2009, 3:03:30 PM

to biopa...@googlegroups.com

> The NCBI record cross-references the NCBI proteins
> NP_000537 and NP_001119585 to Uniprot's P04637. Given they share the same
> sequence (in the same organism), they are for all intensive purposes
> equivalent.
>
>
> -=Michel=-

I am sure that, for the intents and purposes of the NCBI and the EBI,
they are not equivalent. Certainly neither of them considers itself
redundant. Does equivalent mean one should be shut down? So I contest
the "all" in your claim.

Are you suggesting that one *does* use sameAs on these two identifiers?

-Alan

Michel Dumontier

Mar 24, 2009, 3:38:53 PM

to biopa...@googlegroups.com

It doesn't matter what the intent and purpose of the NCBI and EBI are. It matters that the entities identified by their respective identifiers, which are purposely and meaningfully linked, are different names for the same biological entity. They are, for all intensive purposes (of the perspective of a biological scientist), equivalent.

-=Michel=-

Alan Ruttenberg

Mar 24, 2009, 3:45:25 PM

to biopa...@googlegroups.com

On Tue, Mar 24, 2009 at 3:38 PM, Michel Dumontier
<michel.d...@gmail.com> wrote:
> It doesn't matter what the intent and purpose of the NCBI and EBI are. It
> matters that the entities identified by their respective identifiers, which
> are purposely and meaningfully linked, are different names for the same
> biological entity. They are, for all intensive purposes (of the perspective
> of a biological scientist), equivalent.
> -=Michel=-

So shall we coin the relation:

sameForSomeIntentsAndPurposesFromThePerspectiveOfABiologicalScientist?

What shall we give as a definition?

Do you consider scientists who determine protein structures to be biologists?

FWIW, I think you are rather presumptuously claiming to speak for all
intents and purposes for all biologists. Particularly as I know some
who don't agree. I guess they wouldn't be "reasonable" ones, right?

-Alan

Michel Dumontier

Mar 24, 2009, 3:52:16 PM

to biopa...@googlegroups.com

As a biological scientist myself, i can claim some authority wrt the perspective of one. It's also unnecessary to flame me in this way.

-=Michel=-

Alan Ruttenberg

Mar 24, 2009, 4:03:29 PM

to biopa...@googlegroups.com

Michel,

You are flaming away at this issue, for which there has been much
discussion. If you meant to speak for yourself you could have made
that more clear.

s/They are, for all intensive purposes (of the perspective of a
biological scientist), equivalent/They are, for all intensive purposes
(from my perspective as a biological scientist), equivalent/

If I misinterpreted this message, I apologize. However you seem to
be speaking rather globally about this issue here and on other forums. For
example:

"In the life sciences, scientists don't care about database records -
they care about the molecules and the biological processes for which
facts have been collected about."

Even this is overbroad. For computational biologists I have worked
with, with repeatable results only possible against a particular
version of a record or genome build, records *are* important.

I also want to clarify a comment you made about shared names (please
consider writing to our mailing list where discussion of the effort
happens)

"I, like several others, am interested to see how the committee will
"make sure that its URIs ... resolve to information that is useful". I
expect that this will be challenging to establish utility,
particularly in the context of a term contained in an expressive
ontology."

First, the domain of the shared names effort is database records, not
entities named in ontologies. The "useful" information we have
specified and agreed upon concerns how to find a type assertion,
pointers to specific encodings of the record, and pointers to third
party metadata about the record. It is not within scope, for instance,
to translate Entrez Gene to RDF. The scope of shared names is simply to
point to any such encodings in a way that clients can make decisions
about which they want to retrieve.
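
As a rough picture of what such a resolved description might carry, the sketch below models the three pieces named above (a type assertion, pointers to encodings, pointers to third-party metadata) and lets a client pick an encoding. Everything in it is an assumption for illustration: the class, the field names, the example.org URLs, and the media types are not part of any shared names specification.

```python
from dataclasses import dataclass, field


@dataclass
class RecordDescription:
    record_type: str                                   # the type assertion for the record
    encodings: dict = field(default_factory=dict)      # media type -> URL of that encoding
    third_party_metadata: list = field(default_factory=list)  # URLs of metadata held elsewhere


def choose_encoding(desc, preferred):
    """Return (media_type, url) for the first preferred media type on offer, else None."""
    for media_type in preferred:
        if media_type in desc.encodings:
            return media_type, desc.encodings[media_type]
    return None


# Hypothetical description of one record; all URLs are placeholders.
desc = RecordDescription(
    record_type="ex:GeneRecord",
    encodings={
        "text/html": "http://example.org/record/7157.html",
        "application/xml": "http://example.org/record/7157.xml",
    },
    third_party_metadata=["http://example.org/annotations/7157"],
)

choice = choose_encoding(desc, ["application/rdf+xml", "application/xml"])
print(choice)
```

The point of the design, as described, is that the name only points at encodings; whether an RDF rendering exists is up to whoever publishes one, and the client decides which pointer to follow.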

I'm happy to talk with you in more detail about the effort. But I
would prefer that you attempt to understand its scope and content
before passing judgement.

Incidentally, Marc-Alexandre is on our steering committee,
representing, to the extent he can, Bio2RDF.

Regards,
Alan

Michel Dumontier

Mar 24, 2009, 4:39:20 PM

to biopa...@googlegroups.com

On Tue, Mar 24, 2009 at 4:03 PM, Alan Ruttenberg <alanrut...@gmail.com> wrote:

> Michel,
>
> You are flaming away at this issue, for which there has been much
> discussion. If you meant to speak for yourself you could have made
> that more clear.

ummm... no... all i said was that the two identifiers refer to the same thing...

> s/They are, for all intensive purposes (of the perspective of a
> biological scientist), equivalent/They are, for all intensive purposes
> (from my perspective as a biological scientist), equivalent/
>
> If I misinterpreted this message, I apologize. However you seem to
> be speaking rather globally about this issue here and on other forums. For
> example:
>
> "In the life sciences, scientists don't care about database records -
> they care about the molecules and the biological processes for which
> facts have been collected about."
>
> Even this is overbroad. For computational biologists I have worked
> with, with repeatable results only possible against a particular
> version of a record or genome build, records *are* important.

there is no doubt that the records contain information about the biomolecules, and from a computational scientist perspective (rather than a biological scientist perspective), access to such documents is important. however, when we engage in conversation about proteins and other biomolecules, we don't talk about records.

> I also want to clarify a comment you made about shared names (please
> consider writing to our mailing list where discussion of the effort
> happens)
>
> "I, like several others, am interested to see how the committee will
> "make sure that its URIs ... resolve to information that is useful". I
> expect that this will be challenging to establish utility,
> particularly in the context of a term contained in an expressive
> ontology."
>
> First, the domain of the shared names effort is database records, not
> entities named in ontologies. The "useful" information we have
> specified and agreed upon concerns how to find a type assertion,
> pointers to specific encodings of the record, and pointers to third
> party metadata about the record. It is not within scope, for instance,
> to translate Entrez Gene to RDF. The scope of shared names is simply to
> point to any such encodings in a way that clients can make decisions
> about which they want to retrieve.

i'm glad you're addressing this here, but it is not apparent from the web site, which is where i gathered information about the effort.

> I'm happy to talk with you in more detail about the effort. But I
> would prefer that you attempt to understand its scope and content
> before passing judgement.

There was no judgement - i was raising a point about scope that wasn't apparent to me from the information that was publicly provided. I think provisions towards unifying the URI naming is a good thing, which is what i stated:

"If the sharednames group wants to recommend a consensual approach on the _syntax_ of any given name, with appropriate rationale, then it’s possible that more people will use it as a guiding principle."

please refrain from taking my comments out of context.

> Incidentally, Marc-Alexandre is on our steering committee,
> representing, to the extent he can, Bio2RDF.