PhyloWS, CQL, NeXML on TreeBASE2

Rutger Vos

Jul 2, 2009, 7:10:05 PM
to TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, William Piel, Mark Dominus, Hilmar Lapp, jfharney
Hi,

I've implemented nexml export on treebase2, and made the serializer
attach predicates from this list
(http://spreadsheets.google.com/pub?key=rL--O7pyhR8FcnnG5-ofAlw) in
the indicated locations. The predicates with asterisks can be used as
search predicates through the PhyloWS architecture as described here:
http://localhost:8080/treebase-web/help/urlAPI.jsp

In addition, CQL searches can be tried out on the main search tabs by
clicking the "Advanced search..." links, e.g. see:
http://localhost:8080/treebase-web/search/studySearch.html

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com

Rutger Vos

Jul 2, 2009, 9:40:38 PM
to TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, William Piel, Mark Dominus, Hilmar Lapp, jfharney
erm, please substitute 8ball.sdsc.edu:6666 for localhost:8080 in the
example URLs in my previous message.

William Piel

Jul 3, 2009, 12:13:43 PM
to Rutger Vos, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, Hilmar Lapp, jfharney
cool stuff.

I notice that this departs a bit from the phylows that is proposed here.  For example, the proposed phylows puts "/find/" before "/tree/", whereas you have it the other way. And the other major difference is that the proposed phylows suggest that to search on trees you do something like:

/phylows/find/tree/?name=Primates

whereas you are implementing:

/phylows/taxon/find/?name=Primates&recordSchema=tree

Your method is probably better and clearer -- in that it makes more sense that we're doing a find on a taxon with the result being a tree (only the taxon label is inherently part of the tree object), but perhaps we should get the other PhyloWS developers in agreement (i.e. Ryan Scherle), and then modify the wiki accordingly. 

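I notice that while the following produces a hit of one record:

http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo

...I'm unable to get any results via rss:

http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&format=rss1

Is my syntax incorrect?

Also, I believe that this should give me a list of trees:

http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&recordSchema=tree

but instead it gives me a list of taxa. Perhaps my syntax is wrong?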

bp

Rutger Vos

Jul 3, 2009, 6:16:53 PM
to William Piel, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, Hilmar Lapp, jfharney
Hi Bill,

glad you like it. I think I will use this on one of the days in Lisbon
to have students download data and process it.

On Fri, Jul 3, 2009 at 9:13 AM, William Piel <willia...@yale.edu> wrote:
> cool stuff.
> I notice that this departs a bit from the phylows that is proposed here.
>  For example, the proposed phylows puts "/find/" before "/tree/", whereas
> you have it the other way. And the other major difference is that the
> proposed phylows suggest that to search on trees you do something like:
> /phylows/find/tree/?name=Primates
> whereas you are implementing:
> /phylows/taxon/find/?name=Primates&recordSchema=tree

The former, "standard" way to me seems very ambiguous. I would
interpret it to mean the name of the tree, not of a taxon in the tree.

> Your method is probably better and clearer -- in that it makes more sense
> that we're doing a find on a taxon with the result being a tree (only the
> taxon label is inherently part of the tree object), but perhaps we should
> get the other PhyloWS developers in agreement (i.e. Ryan Scherle), and then
> modify the wiki accordingly.
> I notice that while the following produces a hit of one record:
> http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo
> ...yet I'm unable to get any results via rss:
> http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&format=rss1
> Is my syntax incorrect?

I don't know - I *am* getting an rss feed with a single item returned.
Maybe you should "view source" to see it?

> Also, I believe that this should give me a list of trees:
> http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&recordSchema=tree
> but instead it gives me a list of taxa. Perhaps my syntax is wrong?

The recordSchema switch is only used in combination with format=rss1,
the thinking being that the web interface behaviour should stay the
same (we can switch tabs anyway to project a result set into a
different context) but for programmatic access we do need recordSchema
(because - no tabs).
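
For programmatic access, a client might assemble such a request roughly as follows. This is only a sketch: the host and path are the ones quoted in this thread, build_find_url is a made-up helper name, and leaving "=" unescaped in the query string is an assumption about what the server accepts.

```python
from urllib.parse import urlencode

# Base URL as quoted in this thread; the server may no longer exist.
BASE = "http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find"


def build_find_url(cql, record_schema=None):
    """Hypothetical helper: build an RSS search URL for a CQL query."""
    params = {"query": cql, "format": "rss1"}
    if record_schema is not None:
        # Per the message above, recordSchema only applies with format=rss1.
        params["recordSchema"] = record_schema
    # Keep '=' readable, matching the URLs in this thread (an assumption;
    # full percent-encoding may be the safer choice).
    return BASE + "?" + urlencode(params, safe="=")


print(build_find_url("tb.title.taxon==Homo", record_schema="tree"))
```

The parameter order differs from the hand-written URLs in this thread, which should not matter to a server that parses the query string normally.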

Cheers,

Rutger

William Piel

Jul 3, 2009, 9:24:16 PM
to Rutger Vos, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, Hilmar Lapp, jfharney

On Jul 3, 2009, at 6:16 PM, Rutger Vos wrote:

>> I notice that while the following produces a hit of one record:
>> http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo
>> ...yet I'm unable to get any results via rss:
>> http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&format=rss1
>> Is my syntax incorrect?
>
> I don't know - I *am* getting an rss feed with a single item returned.
> Maybe you should "view source" to see it?

Ah... indeed, it works in Firefox and Camino, but it does not work
in Safari (which says "zero articles").

>> Also, I believe that this should give me a list of trees:
>> http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&recordSchema=tree
>> but instead it gives me a list of taxa. Perhaps my syntax is wrong?
>
> The recordSchema switch is only used in combination with format=rss1,
> the thinking being that the web interface behaviour should stay the
> same (we can switch tabs anyway to project a result set into a
> different context) but for programmatic access we do need recordSchema
> (because - no tabs).

Ok. Although perhaps this could be a low-priority feature to be added
later. (I can imagine this being a useful feature for web sites like
tolweb.org and eol.org, in which each species page could have a
simple hyperlink called "trees in TreeBASE with taxon x" -- thus
sparing users another mouse-click on a tab once they
get to TreeBASE.)

I can't figure out why your rss does not work in Safari. For example,
these two urls produce, more or less, the same content since they are
making the same query:

http://8ball.sdsc.edu:6666/treebase-web/phylows/taxon/find?query=tb.title.taxon==Homo&recordSchema=tree&format=rss1
http://purl.org/phylo/treebase/phylows/find/tree/?query=taxon_name+any+Homo&operation=searchRetrieve&recordSchema=pc

...but yours says "0 articles" in Safari while mine says "25 articles".
Could you try adding "<?xml version="1.0" encoding="utf-8"?>" as a
header? That's the only substantive difference between the two, as far
as I can tell.
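
One way to compare two feeds outside the browser is to check each payload for the declaration and count its items directly. The sketch below assumes RSS 1.0 (item elements in the http://purl.org/rss/1.0/ namespace); the sample feed and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

RSS1_ITEM = "{http://purl.org/rss/1.0/}item"  # RSS 1.0 item element


def has_xml_declaration(payload: bytes) -> bool:
    """True if the payload starts with an XML declaration."""
    return payload.startswith(b"<?xml ")


def count_items(payload: bytes) -> int:
    """Count RSS 1.0 <item> elements anywhere in the feed."""
    root = ET.fromstring(payload)
    return sum(1 for _ in root.iter(RSS1_ITEM))


# Made-up feed for illustration; it deliberately lacks the declaration.
SAMPLE = b"""<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed"><title>demo</title></channel>
  <item rdf:about="http://example.org/tree/1"><title>Tree 1</title></item>
</rdf:RDF>"""

print(has_xml_declaration(SAMPLE), count_items(SAMPLE))
```

If a parser counts the same items in both feeds, that would suggest the difference lies in how the browser detects the feed rather than in the item markup.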

(also, it would be cool if yours included some other human-readable
metadata -- like tree name, tree title, article citation, etc -- just
a little synopsis so that people can use this in an RSS client)

bp


Hilmar Lapp

Jul 4, 2009, 2:50:53 AM
to William Piel, Rutger Vos, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney

On Jul 4, 2009, at 3:24 AM, William Piel wrote:

> Could you try adding "<?xml version="1.0" encoding="utf-8"?>" as a
> header?

BTW the <?xml> line is required to be present for it to be valid XML.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

Hilmar Lapp

Jul 4, 2009, 3:08:22 AM
to William Piel, Rutger Vos, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney

On Jul 3, 2009, at 6:13 PM, William Piel wrote:

> cool stuff.

I agree - the API is definitely heading in the right direction.

I suggest some tweaks:

>
> I notice that this departs a bit from the phylows that is proposed
> here. For example, the proposed phylows puts "/find/" before
> "/tree/", whereas you have it the other way.

Right, this is not in compliance with the spec. find/ comes first as
it changes the resource from a record and its URI to a finder.

I.e., although it's possible that we change the spec, I don't see the
reason that would justify that.

In general, note that REST APIs at present aren't formally declared in
a descriptor document that a general purpose validator could use and
validate compliance. So really extra care needs to be taken to comply
with the spec, or otherwise it's not a spec but a loose prescription.
It seems like we should also implement a PhyloWS validator that
uncovers violations quickly.
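
As a rough idea of what the smallest such check could look like, here is a sketch that tests only the ordering rule discussed here (find/ directly after phylows/, followed by an object type). The set of object types is taken from this thread, not from the spec, and check_find_path is a hypothetical name.

```python
from urllib.parse import urlsplit

# Object types mentioned in this thread; the real spec may define others.
KNOWN_TYPES = {"tree", "study", "matrix", "taxon"}


def check_find_path(url):
    """Return a list of problems with the finder part of a PhyloWS-style URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if "phylows" not in segments:
        return ["no 'phylows' segment in path"]
    rest = segments[segments.index("phylows") + 1:]
    problems = []
    if "find" not in rest:
        return problems  # not a finder URL; nothing to check here
    if rest[0] != "find":
        problems.append("'find' should come directly after 'phylows'")
    types = [s for s in rest if s != "find"]
    if not types or types[0] not in KNOWN_TYPES:
        problems.append("missing or unknown object type after 'find'")
    return problems


print(check_find_path("/phylows/find/tree/?name=Primates"))
print(check_find_path("/phylows/taxon/find/?name=Primates&recordSchema=tree"))
```

A real validator would also need to check query parameters and response formats, which this sketch ignores.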

> And the other major difference is that the proposed phylows suggest
> that to search on trees you do something like:
>
> /phylows/find/tree/?name=Primates
>
> whereas you are implementing:
>
> /phylows/taxon/find/?name=Primates&recordSchema=tree

Note BTW that a taxon finder is a custom addition to the API. Which is
fine in principle, except that I'd suggest you conform to the pattern
in the API spec and put find/ first.

Also, find/taxon/ would imply that you are finding (and returning)
taxa, which if I understand correctly is not the case - rather it
seems you have one query parameter in the URI path (namely that you
are searching by taxon?) and one in the query string. So if this is
searching trees, it needs to be find/tree/, and if you are matching
against taxon names, the query parameter needs to be tb.taxon.name or
whatever the blessed metadata term for this purpose is.

Third, recordSchema=tree means that you want records back in the tree
schema. Unless you have invented that schema meanwhile, this is in all
likelihood not what you want. Rather, the value should be nexml I
suppose. find/tree already implies that you are finding (and
returning) trees, so there is no point in expressing that redundantly
in the query string. You might want to specify that you only want the
tree and not also the matrix, but that would be a separate query
parameter and should not be confounded with the return format.

Rutger Vos

Jul 7, 2009, 8:17:59 PM
to Hilmar Lapp, William Piel, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney
Hi Hilmar, all,

thanks for your comments!

>> I notice that this departs a bit from the phylows that is proposed here.
>>  For example, the proposed phylows puts "/find/" before "/tree/", whereas
>> you have it the other way.
>
> Right, this is not in compliance with the spec. find/ comes first as it
> changes the resource from a record and its URI to a finder.

Right, switching that around is fairly trivial, so I'll do that.

> Also, find/taxon/ would imply that you are finding (and returning) taxa,
> which if I understand correctly is not the case - rather it seems you have
> one query parameter in the URI path (namely that you are searching by
> taxon?) and one in the query string. So if this is searching trees, it needs
> to be find/tree/, and if you are matching against taxon names, the query
> parameter needs to be tb.taxon.name or whatever the blessed metadata term
> for this purpose is.
>
> Third, recordSchema=tree means that you want records back in the tree
> schema. Unless you have invented that schema meanwhile, this is in all
> likelihood not what you want. Rather, the value should be nexml I suppose.
> find/tree already implies that you are finding (and returning) trees, so
> there is no point in expressing that redundantly in the query string. You
> might want to specify that you only want the tree and not also the matrix,
> but that would be a separate query parameter and should not be confounded
> with the return format.

Mmmmm... I think this warrants a little more discussion. It's probably
true that for most implementors their searches can be conveniently
decomposed into several domains (tree search/matrix search/taxon
search/etc.) and that for each domain the metaphor is that of
searching a single table where the CQL indices are that table's
columns.

Then, within each domain there is a limited number of concerns: how to
search on the provided indices and how to format the results. For
example, for a search like
http://8ball.sdsc.edu:6666/treebase-web/search/studySearch.html?query=dcterms.identifier=S2484&format=rss1&recordSchema=tree
the implementation is thus:

* there is a self-contained study searcher
* the searcher knows how predicates map onto columns in the study
table (e.g. dcterms.identifier is the same as study.id)
* the searcher knows how to unpack a study object and get the trees out

if instead we'd have phylows/tree/find?query=study.identifier=S2484,
the implementation would be something like:

* there is a tree searcher
* the tree searcher needs to know not just about the tree table but
also about how all other predicates map onto all other tables, and how
they join with the tree table
* the tree searcher needs to know how to traverse study objects and
where trees are inside the study object
* (and similar overlap of concerns becomes necessary if we want the
trees for a given matrix, or for a taxon, or what have you)

To me that seems like bad design. We'll lose any separation of concerns
and might end up with a lot of redundancy between searchers - and a
lot more code (and bugs) to write. I realize that I'm overloading the
"recordSchema" token (and should fix that) but some way of saying
"search THIS domain and project the results into THAT domain" seems
very, very handy - especially because CQL doesn't have a notion of
joins.
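
The "search THIS domain, project into THAT domain" idea can be sketched in a few lines. Everything below is illustrative: the records are made up (only the ID S2484 comes from this thread), and the class and function names are not taken from the TreeBASE code.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    study_id: str
    title: str
    trees: list = field(default_factory=list)     # tree IDs in this study
    matrices: list = field(default_factory=list)  # matrix IDs in this study


# Made-up records for the example.
STUDIES = [
    Study("S2484", "Example study", trees=["T1", "T2"], matrices=["M1"]),
    Study("S0001", "Another study", trees=["T3"]),
]

# The study searcher only knows how predicates map onto study columns.
STUDY_COLUMNS = {"dcterms.identifier": "study_id", "dcterms.title": "title"}


def search_studies(predicate, value):
    """Search the study domain on one of its own indices."""
    column = STUDY_COLUMNS[predicate]
    return [s for s in STUDIES if getattr(s, column) == value]


def project(studies, target):
    """Unpack matching studies into another domain ('tree' or 'matrix')."""
    attr = {"tree": "trees", "matrix": "matrices"}[target]
    return [item for s in studies for item in getattr(s, attr)]


hits = search_studies("dcterms.identifier", "S2484")
print(project(hits, "tree"))
```

The searcher never has to know about tree or matrix predicates; it only needs a way to unpack its own result objects.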

Rutger Vos

Jul 7, 2009, 8:19:19 PM
to Hilmar Lapp, William Piel, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney

Hilmar Lapp

Jul 7, 2009, 8:39:25 PM
to Rutger Vos, William Piel, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney

On Jul 7, 2009, at 8:17 PM, Rutger Vos wrote:

> I realize that I'm overloading the "recordSchema" token (and should
> fix that)

That was my main point in this regard.

> but some way of saying "search THIS domain and project the results
> into THAT domain" seems very, very handy - especially because CQL
> doesn't have a notion of
> joins.

I fully agree. We may just be talking past each other, but I'm not
seeing why something like

phylows/tree/find?query=study.identifier=S2484

doesn't achieve exactly that - it says search the study domain and
project the results into the tree domain. Conversely,

phylows/study/find?query=tree.identifier=TB2484

says to search in the tree domain and project the results into the
study domain (i.e., return studies that have a tree matching the query).

You aren't trying to suggest that dcterms.title or dcterms.identifier
should mean different things for different finders, right?

I get the sense that you are tying URL patterns and implementations
closely together; i.e., phylows/tree/find executes one and the same
chunk of code no matter what the query is, and so there would be one
chunk of code sitting under phylows/study/find that finds trees and another,
separate, chunk of code sitting under phylows/tree/find that finds
trees. But of course the URL patterns and the code they execute (if
any - it may just be indexed files and XSLTs) are two completely
separate things. There is no reason that phylows/study/find and
phylows/tree/find couldn't (in fact shouldn't) use the exact same tree
finder class for finding trees.

I think we really need to look at the PhyloWS as a standardized
pattern of web-service URLs that are completely decoupled from the
underlying implementation which can take a multitude of shapes.
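
To illustrate the decoupling, here is a sketch in which two differently shaped URLs from this thread end up in one shared finder. The data, handler names and routing table are all invented; a real implementation would parse full CQL rather than split on "=".

```python
# Made-up data: which trees belong to which study.
TREES_BY_STUDY = {"S2484": ["T1", "T2"], "S0001": ["T3"]}


def find_trees(study_id):
    """One shared tree finder, whatever URL the request came in on."""
    return TREES_BY_STUDY.get(study_id, [])


def handle_tree_find(query):
    # Shape: phylows/tree/find?query=study.identifier=S2484
    predicate, _, value = query.partition("=")
    if predicate != "study.identifier":
        raise ValueError("unsupported predicate: " + predicate)
    return find_trees(value)


def handle_study_search(query, record_schema):
    # Shape: studySearch.html?query=dcterms.identifier=S2484&recordSchema=tree
    predicate, _, value = query.partition("=")
    if predicate != "dcterms.identifier" or record_schema != "tree":
        raise ValueError("unsupported request")
    return find_trees(value)


# The URL pattern only selects a thin handler; the finder is shared.
ROUTES = {
    "phylows/tree/find": lambda q: handle_tree_find(q["query"]),
    "search/studySearch.html": lambda q: handle_study_search(
        q["query"], q["recordSchema"]),
}

print(ROUTES["phylows/tree/find"]({"query": "study.identifier=S2484"}))
```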

What we should pay attention to though is that the API *allows*
optimizing of code reuse and clean design of implementations. Are you
saying that it stands in the way of that, and if so, how does it
prevent clean design of implementations?

Rutger Vos

Jul 7, 2009, 8:58:07 PM
to Hilmar Lapp, William Piel, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney
> You aren't trying to suggest that dcterms.title or dcterms.identifier should
> mean different things for different finders, right?

I kind of am: in find/study, dcterms.identifier is a study ID, in
find/tree, dcterms.identifier is a tree ID. Internally, the finders
traverse a CQL parse tree and translate these predicates into more
refined subproperties (tb.identifier.study and tb.identifier.tree,
respectively). In other words, if a tree is the subject, then the
predicate dcterms.identifier is interpreted as the refined subproperty
tb.identifier.tree.
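
A toy version of that translation step might look like the following. The parse tree here is just nested tuples standing in for whatever a real CQL parser returns, and the mapping table is a sketch built from the two refinements named above.

```python
# Refinements named in this message; the table layout is an assumption.
REFINEMENTS = {
    "study": {"dcterms.identifier": "tb.identifier.study"},
    "tree": {"dcterms.identifier": "tb.identifier.tree"},
}


def refine(node, subject):
    """Rewrite generic predicates in a toy parse tree for a given finder.

    A node is either (boolean_op, left, right) or (relation, predicate, value).
    """
    op, left, right = node
    if op in ("and", "or"):
        return (op, refine(left, subject), refine(right, subject))
    # Predicates without a refinement for this subject pass through unchanged.
    predicate = REFINEMENTS[subject].get(left, left)
    return (op, predicate, right)


query = ("and", ("=", "dcterms.identifier", "S2484"),
                ("=", "dcterms.title", "x"))
print(refine(query, "study"))
print(refine(query, "tree"))
```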

By the way, I made a simple ontology (attached) that formalizes this
inheritance. Would be nice to have this available as
http://purl.org/phylo/treebase/terms# or whatever (speaking of which:
have you had a chance to add me to the treebase & phylows purl
domains?) Seems to me that's pretty much in line with the Contextual
part of CQL - I've seen many examples using dublin core predicates
whose exact semantics are context-dependent.

(By the way 2, you're saying "*should* mean different things for
different finders". I don't know whether they *should*, but that's
certainly how they are implemented now.)

> What we should pay attention to though is that the API *allows* optimizing
> of code reuse and clean design of implementations. Are you saying that it
> stands in the way of that, and if so, how does it prevent clean design of
> implementations?

I think it stands in the way of clean design because any finder
(find/tree, find/matrix, find/study) potentially needs to process
predicates from any other domain (e.g. find/tree apparently needs to
know about study IDs), which is harder than just having to deal with
your own domain and subsequently having to project your result set
into a different domain.

treebase.owl

Hilmar Lapp

Jul 7, 2009, 11:16:18 PM
to Rutger Vos, William Piel, TreeBASE Developers, nexml-...@lists.sourceforge.net, PhyloWS Forum, Val Tannen, Mark Dominus, jfharney

On Jul 7, 2009, at 8:58 PM, Rutger Vos wrote:

>> You aren't trying to suggest that dcterms.title or
>> dcterms.identifier should
>> mean different things for different finders, right?
>
> I kind of am: in find/study, dcterms.identifier is a study ID, in
> find/tree, dcterms.identifier is a tree ID.

I think that's a very bad idea. It defeats the purpose of a controlled
vocabulary (let alone ontology) to formalize unambiguously what we
mean, and that we mean the same thing when we use the same term in the
same application.

> Internally, the finders traverse a CQL parse tree and translate
> these predicates into more refined subproperties
> (tb.identifier.study and tb.identifier.tree,
> respectively). In other words, if a tree is the subject, then the
> predicate dcterms.identifier is interpreted as the refined subproperty
> tb.identifier.tree.

To me this is backwards from how an ontology works. You would use the
refined sub-properties, and if an agent doesn't understand what to do
with one, it would use the ontology to get at a more general term which
it might recognize.

In RDF and OWL properties don't change their meaning based on subject
or object. Rather, subject and object can change their semantics by
applying a property (that has range or domain defined) to them.
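
The direction described here (start from the refined term, fall back to a more general one) can be sketched as a walk up the subproperty chain. The table below is a stand-in for what would be read from the ontology, and generalize is a hypothetical helper name.

```python
# Toy subproperty table; in practice this would come from the ontology.
SUBPROPERTY_OF = {
    "tb.identifier.study": "dcterms.identifier",
    "tb.identifier.tree": "dcterms.identifier",
}


def generalize(term, understood):
    """Walk up the subproperty chain to the first term the agent understands.

    Returns None if no term in the chain is understood.
    """
    seen = set()
    while term is not None and term not in seen:
        if term in understood:
            return term
        seen.add(term)  # guard against accidental cycles in the table
        term = SUBPROPERTY_OF.get(term)
    return None


# An agent that only knows the generic term still gets something usable.
print(generalize("tb.identifier.tree", {"dcterms.identifier"}))
```

With this arrangement the same term means the same thing everywhere, and only the level of specificity an agent can use varies.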

> By the way, I made a simple ontology (attached) that formalizes this
> inheritance.

What I can see is that they are declared as subproperty of
dc.identifier. They make no assertions about range or domain, no?

> I've seen many examples using dublin core predicates whose exact
> semantics are context-dependent.

Yes, but not within the same application profile (metadata
vocabulary), right?

>
>> What we should pay attention to though is that the API *allows*
>> optimizing
>> of code reuse and clean design of implementations. Are you saying
>> that it
>> stands in the way of that, and if so, how does it prevent clean
>> design of
>> implementations?
>
> I think it stands in the way of clean design because any finder
> (find/tree, find/matrix, find/study) potentially needs to process
> predicates from any other domain (e.g. find/tree apparently needs to
> know about study IDs

But that is only true for TreeBASE. That one finder implementation in
TreeBASE should not cooperate with another finder implementation in
TreeBASE is your design decision, not one from PhyloWS, right?
