Re: telecon summary


Hilmar Lapp

May 27, 2009, 10:27:48 AM
to William Piel, Rutger A. Vos, Val Tannen, Mark Dominus, John Harney, PhyloWS Forum, Arlin Stoltzfus, Enrico Pontelli

On May 26, 2009, at 9:09 PM, William Piel wrote:

> [...]
> I'm not sure I understand the objection here. [...] But perhaps I
> don't understand the problem

Val's concern (I wouldn't call it an objection) is that CQL doesn't
have a way to say what you want back, i.e., the SELECT component of a
query is missing.

I explained how this isn't (and in fact shouldn't be) within CQL's
scope. I.e., CQL is about saying *which* records you want, not *what*
you want from those records that match. It is the surrounding API
specification that takes care of the latter. So for example, the SRU
query API standard has ways to express which format you want back. You
can define at will different formats that you support, each of which
may have a different richness of the data that it returns. (And as
PhyloWS is SRU-based and inspired, it supports that principle already.)
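
To make the split concrete, here is a minimal sketch of an SRU-style searchRetrieve request. The parameter names (operation, version, query, recordSchema, maximumRecords) come from the SRU standard; the endpoint URL, the CQL index used in the query, and the "nexml" schema name are assumptions made up for illustration, not something PhyloWS defines.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; not an actual PhyloWS service URL.
BASE_URL = "http://example.org/phylows/sru"

def sru_search_url(cql_query, record_schema, max_records=10):
    """Build an SRU searchRetrieve URL.

    The CQL query says *which* records match; recordSchema says *what*
    representation of each matching record should be returned.
    """
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "recordSchema": record_schema,
        "maximumRecords": str(max_records),
    }
    return BASE_URL + "?" + urlencode(params)

# Same CQL query, two different (assumed) record schemas.
query = 'dc.title any "primates"'
full_url = sru_search_url(query, "nexml")   # rich records
brief_url = sru_search_url(query, "dc")     # brief Dublin Core records
```

The query string is identical in both URLs; only recordSchema changes, which is the sense in which the "SELECT" part lives outside CQL.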

I felt that we had reasonable agreement on that being a fine and
acceptable way of addressing that.

> [...]
> Okay with CDAO, but I'm guessing that we have to go beyond CDAO's
> vocabulary -- e.g. assuming that CDAO does not have
> treebase_study_id or ubio_namebankid in its controlled vocabulary.

I'm not sure yet that CDAO itself would be the default metadata
vocabulary ('context set'). I was rather thinking that that would be
bootstrapped from CDAO, and then extended based on community needs. I
would expect that some of these extensions would be suitable to go
right back into CDAO (and we spoke about the need for a defined
process by which new terms can be proposed for inclusion into CDAO and
would eventually make it in there or get resolved in some other
fashion).

As for your examples, you wouldn't want treebase_study_id in a
standard PhyloWS context set. However, you might want studyName and
studyIdentifier (knowing that many providers won't support those as
the concept of study as a filter for phylogenetic data isn't widely
used). And you probably wouldn't want ubio_namebankid but you may want
taxonIdentifier.

I.e., the terms in the standard context set will be at an abstract and
broadly applicable level (this is what users and developers will look
through to figure out how to query multiple databases in a consistent
fashion), and individual providers will map those to their own
database or application schema. For example, TreeBASE could map
"taxonIdentifier = ?" to "(ubio_namebankid = ? OR ncbi_taxonid = ?)".

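The provider-side mapping described above could be sketched roughly as follows. This is only an illustration: the taxonIdentifier entry follows the TreeBASE example in the text, while the studyIdentifier entry and the function name are assumptions, and a real provider would map to its own schema.

```python
# Hypothetical provider-side mapping from abstract context-set terms to
# local columns. Only the taxonIdentifier entry comes from the example
# in the text; the studyIdentifier entry is an assumption.
TERM_MAP = {
    "taxonIdentifier": ["ubio_namebankid", "ncbi_taxonid"],
    "studyIdentifier": ["treebase_study_id"],
}

def expand_term(term, value):
    """Translate 'term = value' into a parameterized SQL fragment."""
    try:
        columns = TERM_MAP[term]
    except KeyError:
        raise ValueError(f"unsupported term: {term}") from None
    clause = " OR ".join(f"{col} = ?" for col in columns)
    if len(columns) > 1:
        clause = f"({clause})"
    # One bound parameter per column, all carrying the same value.
    return clause, [value] * len(columns)

sql, params = expand_term("taxonIdentifier", "9504")
```

Clients only ever see the abstract term; the column names stay private to the provider and can change without breaking queries.
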
There could also be more than one metadata vocabulary (in a sense,
there already are: Dublin Core and Darwin Core should probably at
least be partially supported, or imported into the standard context
set), having terms at different levels of application or provider
specificity.

At any rate, building this standard metadata vocabulary is at the top
of the agenda for PhyloWS-related activities - the need for this
became very evident at the hackathon, and will come up fast in Dazhi's
Summer of Code project (and possibly also in John Harney's VDC summer
project). I'm copying here the newly created PhyloWS Google Group,
which is the best place for discussing this further, and I hope that
Ryan can guide us well through the bootstrapping process. (I'm also
cc'ing Arlin and Enrico re: the points about CDAO).

> [...]
> Rutger: is there consensus on the way to attach metadata in NeXML?
> (e.g. that RDFa stuff).

Yes, Rutger is implementing and testing that right now, with very
promising results. See the nexml mailing list.

> [...] I was hoping that the TDWG TCTS would help us communicate
> taxonomic metadata, but the bits we need to communicate
> (ncbi_taxid, ubio_namebankid, etc) don't seem to be addressed by TCTS.

They are - they are taxon identifiers. TCTS is (and should be)
application and provider agnostic. NCBI_taxid is obviously not.
There'll always be many (and a changing number of) application and
provider-specific metadata elements that could be useful at any given
time, but I think the right strategy is for providers and applications
to map these to generic terms in metadata vocabularies. (BTW note that
there has been some thought in TDWG-related circles to create an
ontology based on TCS; I don't know where that currently stands.)

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

William Piel

May 27, 2009, 11:25:07 AM
to Hilmar Lapp, Rutger Vos, Val Tannen, Mark Dominus, John Harney, Arlin Stoltzfus, Enrico Pontelli, PhyloWS Forum

On May 27, 2009, at 10:27 AM, Hilmar Lapp wrote:

> For example, TreeBASE could map "taxonIdentifier = ?" to
> "(ubio_namebankid = ? OR ncbi_taxonid = ?)".

I see the advantage of using the generic term "taxonIdentifier" -- but don't we also want to be able to specify what kind of taxonIdentifier we mean? If TreeBASE always translates taxonIdentifier into "(ubio_namebankid = ? OR ncbi_taxonid = ?)" then we are creating imprecision, which in many cases defeats the purpose of resorting to ID numbers in the first place (e.g. suppose someone decides to use taxonIdentifier = 9504 instead of taxonName = "Aotus" in order to disambiguate between the plant and the monkey, but this 9504 number results in both a genus of monkey in ncbi and a pigeon in ubio). We need to provide some sort of namespace. For example, taxonIdentifier = "ncbi_taxid:12345" vs taxonIdentifier = "ubio_namebankid:12345" so that we know what to do with the 12345 part. Or we could insist on using LSIDs -- but that works for only some identifiers (e.g. ubio) but not others (e.g. ncbi). 

Additionally, TreeBASE would like to distinguish between taxon_label (any string attached to a leaf node of a tree or row of a matrix), taxon_name (a controlled nomenclature that leaf nodes map to), and higher_taxon_name (explicitly asking for all descendants of a name in a classification).  How can the nuance among these be expressed in PhyloWS?

bp


Hilmar Lapp

May 27, 2009, 11:37:43 AM
to William Piel, Rutger A. Vos, Val Tannen, Mark Dominus, John Harney, Arlin Stoltzfus, Enrico Pontelli, PhyloWS Forum, Ryan Scherle

On May 27, 2009, at 11:25 AM, William Piel wrote:

>
> On May 27, 2009, at 10:27 AM, Hilmar Lapp wrote:
>
>> For example, TreeBASE could map "taxonIdentifier = ?" to
>> "(ubio_namebankid = ? OR ncbi_taxonid = ?)".
>
> I see the advantage of using the generic term "taxonIdentifier" --
> but don't we also want to be able to specify what kind of
> taxonIdentifier we mean? If TreeBASE always translates
> taxonIdentifier into "(ubio_namebankid = ? OR ncbi_taxonid = ?)"
> then we are creating imprecision, which in many cases defeats the
> purpose of resorting to ID numbers in the first place

Good point. The value for querying by taxon identifier would obviously
have to include the namespace of an identifier, if that precision is
desired. When the identifier is an LSID, a DOI, or an HTTP URI, the
namespace is included by definition. If it is a database-internal
identifier, such as a uBio or NCBI taxon key, then they should be
qualified if the user (or querying application) knows what they are.
(BTW that's the same as in genomic database identifiers, for example.)
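
As a rough sketch of how a provider might honor a qualified identifier, the snippet below uses the "prefix:value" form suggested earlier in this thread. The prefix names, the column names, and the fallback behavior for unqualified values are all assumptions for illustration; LSIDs, DOIs and HTTP URIs are not handled specially here.

```python
# Assumed namespace prefixes and the local columns they select.
NAMESPACE_COLUMNS = {
    "ncbi_taxid": "ncbi_taxonid",
    "ubio_namebankid": "ubio_namebankid",
}

def taxon_identifier_clause(raw):
    """Return (sql_fragment, params) for a taxonIdentifier value."""
    prefix, sep, local_id = raw.partition(":")
    if sep and prefix in NAMESPACE_COLUMNS:
        # Qualified: search only the column for that namespace.
        return f"{NAMESPACE_COLUMNS[prefix]} = ?", [local_id]
    # Unqualified (or unknown prefix): fall back to matching every column.
    columns = sorted(set(NAMESPACE_COLUMNS.values()))
    clause = " OR ".join(f"{col} = ?" for col in columns)
    return f"({clause})", [raw] * len(columns)

precise = taxon_identifier_clause("ncbi_taxid:9504")
broad = taxon_identifier_clause("9504")
```

The qualified form restricts the search to one column, so 9504 can no longer match both an NCBI node and an unrelated uBio record.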

> [...]


> Additionally, TreeBASE would like to distinguish between taxon_label
> (any string attached to a leaf node of a tree or row of a matrix),

Is that an OTU label?

> taxon_name (a controlled nomenclature that leaf nodes map to), and
> higher_taxon_name (explicitly asking for all descendants of a name
> in a classification).

The latter essentially is query term expansion using a hierarchical
vocabulary (aka taxonomy). I've been wondering how best to specify
this. Maybe Ryan has some insight here?
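
For readers unfamiliar with the idea, query term expansion over a hierarchy can be sketched in a few lines. The classification below is a toy, hand-written one (a real service would load a full taxonomy), and the function names are hypothetical.

```python
from collections import deque

# Toy classification (child -> parent); names are only illustrative.
PARENT = {
    "Primates": "Mammalia",
    "Perissodactyla": "Mammalia",
    "Aotus": "Primates",
    "Homo": "Primates",
    "Homo sapiens": "Homo",
    "Equus": "Perissodactyla",
    "Equus caballus": "Equus",
}

def expand(name):
    """Return name plus all of its descendants in the classification."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)
    found, queue = {name}, deque([name])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found

def tree_matches(tree_labels, higher_taxon):
    """True if any label in the tree falls under higher_taxon."""
    return not expand(higher_taxon).isdisjoint(tree_labels)
```

With this, a tree labeled only with "Homo sapiens" and "Equus caballus" matches a query for "Primates" even though no node carries that name.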

William Piel

May 27, 2009, 12:11:49 PM
to Hilmar Lapp, Rutger A. Vos, Val Tannen, Mark Dominus, John Harney, Arlin Stoltzfus, Enrico Pontelli, PhyloWS Forum, Ryan Scherle

On May 27, 2009, at 11:37 AM, Hilmar Lapp wrote:

>> [...]
>> Additionally, TreeBASE would like to distinguish between taxon_label
>> (any string attached to a leaf node of a tree or row of a matrix),
>
> Is that an OTU label?

yes.

bp


William Piel

May 27, 2009, 12:36:33 PM
to Karen Cranston, Rutger Vos, Val Tannen, Mark Dominus, John Harney, Arlin Stoltzfus, Enrico Pontelli, PhyloWS Forum, Ryan Scherle

On May 27, 2009, at 11:52 AM, Karen Cranston wrote:

> I agree with Bill. We need to allow for both generic queries (e.g.
> label="pigeon") and also for specific queries where a user knows
> exactly what they want (e.g. node number 9504 in the ncbi taxonomy).
>
> I'm unclear about the need for higher_taxon_name vs taxon_name. It
> seems that what we need to specify is the type of query, not the type
> of identifier. For example, I might want:
>
> - all trees containing a given taxon_name
> - subtrees of a taxon_name
> - trees that are subtrees of the MRCA of a list of taxon_name
>
> Assuming that names are coming from a controlled vocabulary, what is
> the advantage of "higher_taxon_name=primates + taxon_name=E.
> caballus" over "taxon_name=primates + taxon_name=E. caballus"? The
> provider can tell whether a name is a leaf or an internal node. Why
> does the user need to specify this?

Not all nodes in all trees will be mapped with internal node names, which is why asking for "any kind of bird" or "any kind of primate" is such a powerful query.

For example, here is a query that asks for "taxon_name any "Equus caballus" and h.taxon_name any Primates", which means that the tree has to have a horse in it and *any* kind of primate. The result is 12 trees:


On the other hand, we could do your other query ("taxon_name any "Equus caballus" and taxon_name any Primates"):


Which results in zero hits. Not too surprising because trees that have a node called "Primates" (e.g. these 10 trees) usually deal with relationships among mammal orders, in which case they are unlikely to have species names (like Equus caballus) in them. 

bp

William Piel

May 27, 2009, 1:52:53 PM
to Karen Cranston, Rutger Vos, Val Tannen, Mark Dominus, John Harney, Arlin Stoltzfus, Enrico Pontelli, PhyloWS Forum, Ryan Scherle
Hi Karen,

(FYI -- are you replying to "all" as BCC -- because otherwise I'm the
only one getting your responses)

On May 27, 2009, at 1:35 PM, Karen Cranston wrote:

> How does TreeBase define "any kind of bird" without an internal node
> named "Aves"?

By storing a dump of the NCBI taxonomy tables (let's call them
ncbi_names and ncbi_nodes), and then running this query:

SELECT DISTINCT t.tree_id
FROM trees t
JOIN nodes n ON (t.tree_id = n.tree_id)
JOIN taxon_variants tv ON (n.taxon_variant_id = tv.taxon_variant_id)
JOIN taxa tx ON (tv.taxon_id = tx.taxon_id)
JOIN ncbi_names nna ON (tx.taxid = nna.tax_id)
JOIN ncbi_nodes nno ON (nna.tax_id = nno.tax_id),
-- hna/hno: the higher taxon whose descendants we are asking for
ncbi_names hna NATURAL JOIN ncbi_nodes hno
-- nested-set test: nno falls inside hno's [left_id, right_id) interval
WHERE nno.left_id >= hno.left_id
AND nno.left_id < hno.right_id
AND hna.taxon_name LIKE 'Aves';

This returns a list of all trees that contain any kind of Aves, even
if none of their nodes, whether internal or distal, have been labeled
with 'Aves'.
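
The nested-set idea behind this kind of query can be tried out on a toy database. The sketch below is a simplified stand-in, not the TreeBASE schema: names and nodes are merged into one table, trees link directly to taxa, and the tax_id values and left/right numbers are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ncbi_nodes (tax_id INTEGER PRIMARY KEY, name TEXT,
                         left_id INTEGER, right_id INTEGER);
CREATE TABLE tree_taxa (tree_id INTEGER, tax_id INTEGER);
-- Toy nested-set numbering: a descendant's left_id lies inside the
-- [left_id, right_id) interval of each of its ancestors.
INSERT INTO ncbi_nodes VALUES
  (1, 'Amniota',  1, 12),
  (2, 'Aves',     2,  7),
  (3, 'Columba',  3,  4),
  (4, 'Gallus',   5,  6),
  (5, 'Mammalia', 8, 11),
  (6, 'Equus',    9, 10);
INSERT INTO tree_taxa VALUES (100, 3), (100, 6), (200, 6), (300, 4);
""")

def trees_containing(higher_name):
    """IDs of trees with at least one taxon at or below higher_name."""
    rows = conn.execute(
        """
        SELECT DISTINCT tt.tree_id
        FROM tree_taxa tt
        JOIN ncbi_nodes leaf ON leaf.tax_id = tt.tax_id
        JOIN ncbi_nodes higher ON leaf.left_id >= higher.left_id
                              AND leaf.left_id < higher.right_id
        WHERE higher.name = ?
        ORDER BY tt.tree_id
        """,
        (higher_name,),
    )
    return [tree_id for (tree_id,) in rows]

bird_trees = trees_containing("Aves")
```

No tree in the toy data has a taxon mapped to 'Aves' itself, yet trees 100 and 300 are found because Columba and Gallus fall inside the Aves interval.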

So this is a taxon query that targets a special (and separate)
classification tree -- rather than the topological ("MRCA") queries
that you listed, which target the actual trees.

bp


Karen Cranston

May 27, 2009, 2:00:42 PM
to William Piel, Rutger Vos, Val Tannen, Mark Dominus, John Harney, Arlin Stoltzfus, Enrico Pontelli, PhyloWS Forum, Ryan Scherle
(Replying to all now, not just Bill)

I understand the differentiation that Bill is making. I just want to
be clear about the distinction between defining a type of identifier
and a type of query. We are proposing:
- contains(higher_taxon_name) and contains(taxon_name)

instead of
- subtree_of(taxon_name) and contains(taxon_name)

In the current spec (https://www.nescent.org/wg_evoinfo/PhyloWS/REST), we have:

Subtree query: /phylows/tree/<identifier>/clade/<nodeID>?
MRCA query: /phylows/tree/<identifier>/clade/mrca/?includes=<nodeID1,nodeID2,...>&[excludes=<nodeID1,nodeID2,...>]

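For illustration, the two URL templates quoted from the spec could be filled in by small helper functions like these. The function names and the example identifiers are hypothetical, and the trailing "?" of the subtree template is left off here since the template shows no query parameters for it.

```python
from urllib.parse import quote

def subtree_path(tree_id, node_id):
    """Path for the subtree rooted at node_id within tree_id."""
    return f"/phylows/tree/{quote(tree_id, safe='')}/clade/{quote(node_id, safe='')}"

def mrca_path(tree_id, includes, excludes=()):
    """Path for the clade defined by the MRCA of the included nodes."""
    if not includes:
        raise ValueError("at least one node ID is required in 'includes'")
    path = f"/phylows/tree/{quote(tree_id, safe='')}/clade/mrca/"
    query = "includes=" + ",".join(quote(n, safe="") for n in includes)
    if excludes:
        query += "&excludes=" + ",".join(quote(n, safe="") for n in excludes)
    return f"{path}?{query}"

# Made-up identifiers, for illustration only.
subtree = subtree_path("T1", "n3")
mrca = mrca_path("T1", ["n3", "n7"], excludes=["n9"])
```

Both templates address nodes within one known tree, which is a different kind of question from "which trees contain any descendant of this taxon".
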
So, we seem to be departing a bit from this idea. I'm ok with that,
but I want to make sure we've thought it through.

--
~~~~~~~~~~~~~~~~~~~~~~~
karen.c...@gmail.com
~~~~~~~~~~~~~~~~~~~~~~~

Hilmar Lapp

May 27, 2009, 4:21:37 PM
to Rutger Vos, William Piel, Val Tannen, Mark Dominus, John Harney, PhyloWS Forum, Arlin Stoltzfus, Enrico Pontelli

On May 27, 2009, at 4:06 PM, Rutger Vos wrote:

> On Wed, May 27, 2009 at 10:27 AM, Hilmar Lapp <hl...@nescent.org>
> wrote:
>>
>> On May 26, 2009, at 9:09 PM, William Piel wrote:
>>
>>> [...]
>>> I'm not sure I understand the objection here. [...] But perhaps I
>>> don't understand the problem
>>
>> Val's concern (I wouldn't call it an objection) is that CQL doesn't
>> have a way to say what you want back, i.e., the SELECT component of a
>> query is missing.
>

> As long as we're talking about query languages, why hasn't anyone
> brought up SPARQL yet? When I was on Okinawa (gloat) I saw a very
> spectacular demonstration (by Mark Wilkinson) of its usage in chaining
> together (joining) web services on the fly. In fact, that was the
> initial inspiration for the proposal that lured John Harney in.
> Building up a bit of a SPARQL knowledge base might also come in handy
> for CDAO->NeXML conversion done by whatever lives behind
> sawsdl:loweringSchemaMapping.

Oh absolutely. But that's on top of a triple store, not a relational
database, and not for retrieving structured data types. So the two are
fully complementary in my view in the sense of what they do and
accomplish, and synergistic in the sense that getting programmatic
access to NeXML documents with fully RDFa-extractable embedded
semantics should not only put anyone in the position to play with a
SPARQL endpoint, but also give you the necessary mappings for a D2RQ
bridge.
