TreeBASE search predicates

rutge...@gmail.com

Jun 12, 2009, 7:45:03 PM

to phy...@googlegroups.com

Hi,

I'm sharing a Google Docs spreadsheet with you. It contains candidate
search predicates we would like to expose through a TreeBASE web service
interface. In addition, it lists the subjects they may apply to, the
value space of the objects, where and how they would be expressed and
retrieved in NeXML, and a short description of the application of each
predicate.

All implementation details aside, we imagine one should be able to search,
for example, on dc.title='foo' and get a result set where the study titles
match 'foo'. The list of predicates is a combination of Dublin Core/PRISM
(for publication metadata) and a tb (TreeBASE) prefix.

As a request for team CDAO, are any of the tb predicates in the spreadsheet
concepts in CDAO? Could they be?

To everyone else, please comment on the naming scheme. For example, it
seems redundant to have taxonID and taxaID and treeID (etc.); on the other
hand, it disambiguates the subject of the query. Should things be renamed?
Does it make sense as is?

Thanks,

Rutger

TreeBASE search predicates
http://spreadsheets.google.com/ccc?key=rL--O7pyhR8FcnnG5-ofAlw

rutge...@gmail.com

Jun 12, 2009, 7:45:31 PM

to phy...@googlegroups.com

Rutger Vos

Jun 12, 2009, 8:07:53 PM

to Karen Cranston, PhyloWS Forum

I just sent Karen a personal invite; if anyone else wants one too, just
let me know. This is a Google Docs spreadsheet, and I'm not sure how
to share that with a forum (and I don't want to make it
world-read/writable).

On Fri, Jun 12, 2009 at 4:49 PM, Karen Cranston<karen.c...@gmail.com> wrote:
> Thanks for putting this together, Rutger. But, the link doesn't work for me
> (I get a permission denied error for this email address). Should this
> instead go in the "Files" section of the PhyloWS group?
>
> Karen

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com

Karen Cranston

Jun 12, 2009, 8:24:23 PM

to rutge...@gmail.com, phy...@googlegroups.com

Can I make one initial request? Can we make this a little less
TreeBASE specific? I assume that we want to be friendly to other
existing or future databases of trees, so when we make this public, we
may want to have a core group of predicates that apply to trees in
general and then examples of how to extend to a specific
implementation (e.g. the tb.matrixTB1ID, which is pretty specific for
this project).

There does seem to be a fair bit of redundancy, as well as labels
whose meanings aren't really that clear (the difference between
treeKind and treeType or matrixID and matrixLabel is not immediately
obvious). Does it make more sense to split these into separate
predicates? For example, have matrix.xxx and tree.xxx. The matrix
labels could then be used by projects that only have data matrices
(e.g. benchmark data sets for alignment or phylogeny reconstruction)
without having to worry about tree-specific terms.

I'd like to see some simplification and renaming. Are there places
where we can make use of Darwin Core terms rather than defining new
terms? DWC has a datasetID as well as a whole pile of Taxon-related
terms.

Rutger Vos

Jun 12, 2009, 9:15:31 PM

to William Piel, Val Tannen, Mark Dominus, Hilmar Lapp, Arlin Stoltzfus, Enrico Pontelli, TreeBASE Developers, PhyloWS Forum

Hi,

On Fri, Jun 12, 2009 at 5:23 PM, William Piel<willia...@yale.edu> wrote:
> Thanks Rutger. This is really useful.
>
> Some questions:
>
> -- Regarding the "prism.startingPage" and "prism.endingPage", I think our
> model stores these in one field (i.e. "123-132") -- I guess that means
> splitting the field with some sort of regular expression -- e.g.
> /^(\d+)[\s.-]+(\d*)$/ -- unless prism also offers a combined "pages"
> option.

There is a pageRange property, which I've added.
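
For anyone who wants to try the split on a combined field, here is a minimal Python sketch, assuming the stored value looks like "123-132". The function name split_page_range is made up for illustration and is not part of TreeBASE or PRISM. The pattern follows the one quoted above, with the hyphen placed last in the character class so that Python's re module reads it as a literal character.

```python
import re

# Hyphen is last in the class, so it is a literal rather than a range operator.
PAGE_RANGE = re.compile(r"^(\d+)[\s.-]+(\d*)$")


def split_page_range(pages):
    """Split a combined pages field such as '123-132'.

    Returns (starting_page, ending_page) as strings; ending_page is None
    when nothing follows the separator. Returns None when the field does
    not match. Like the quoted pattern, a bare '123' with no separator
    does not match.
    """
    match = PAGE_RANGE.match(pages.strip())
    if match is None:
        return None
    start, end = match.groups()
    return start, (end or None)
```

A field that does not match returns None, so a caller could fall back to exposing the unsplit value through the pageRange property.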

> -- In instances where an LSID exists (e.g. all taxonNamebankIDs have LSIDs),
> would it be better to offer that, or stick with CDAO?

W.r.t. the identifiers, I'm the least pleased with what I'm suggesting.
Right now the identifiers are treated as TreeBASE-specific (e.g.
tb:taxonID). It's possible that these can be moved into CDAO. Or, if
the notion that objects have IDs doesn't fit in with CDAO's mission of
representing the core knowledge of phylogenetics (IDs being more of
an implementation detail), maybe they should be moved into a PhyloWS
vocabulary? And should different classes of IDs have different syntax,
e.g. a special predicate for LSIDs, versus namespaced IDs for other
authorities (say, "TreeBASE:Tr1231", "Dryad:2324" etc.)?
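
To make the syntax question concrete, here is a rough Python sketch of how a service might tell the two classes of identifier apart. It assumes the standard LSID layout (urn:lsid:authority:namespace:object, with an optional revision) and the "Authority:localID" form used in the examples above. The function name classify_identifier and the returned dictionary keys are hypothetical, and the LSID value in the usage note is made up for illustration.

```python
def classify_identifier(identifier):
    """Classify an identifier as an LSID, a namespaced ID, or unknown."""
    parts = identifier.split(":")

    # LSID: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
    is_lsid = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:])
    )
    if is_lsid:
        result = {
            "kind": "lsid",
            "authority": parts[2],
            "namespace": parts[3],
            "object": parts[4],
        }
        if len(parts) == 6:
            result["revision"] = parts[5]
        return result

    # Namespaced ID: <authority>:<local id>, e.g. "TreeBASE:Tr1231"
    if len(parts) == 2 and all(parts):
        return {"kind": "namespaced", "authority": parts[0], "local_id": parts[1]}

    return {"kind": "unknown"}
```

With this, classify_identifier("TreeBASE:Tr1231") reports a namespaced ID with authority "TreeBASE", while something shaped like "urn:lsid:example.org:names:42" is reported as an LSID. Since both forms can be recognised from the value alone, one generic ID predicate may be enough, though that is a design choice for the group rather than something this sketch settles.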

> -- I was, in a way, chagrined to see that a new "superset" of taxa is
> available -- the GNI (http://globalnames.org/). They've essentially grabbed
> all of uBio's data and added Species2000 and ZooBank to become a source of
> names for EOL and GBIF, together with a names architecture
> (http://gnapartnership.org/gna/wiki) that is under development. Given (a)
> the similarity with uBio's mission, and (b) the fact that big money players
> are involved while uBio seems to be languishing, it may be that this marks
> the beginning of the end for uBio. And that may mean that some day a lot of
> our taxon intel work will need to be rewritten. I only mention this in case
> a bit of foresight, while designing our API terms, might help us adapt to a
> future changing name informatics landscape.

> -- I take it that separate dc.creator elements are created for each author:
> is there a way to communicate author order?

Actually, this is treated inconsistently in practice: I've seen
multiple dc:creator annotations with one author each and I've seen
them all concatenated within a single dc:creator annotation. I would
like us to be as granular as possible so I'd favour the former.
Alternatively, authors could be annotated using FOAF, so we can break
it down into first/last/middle name, and add other contact info (email).

> -- Is there a dc. or prism. for author email, abstract, or keywords?

There is a prism.keyword (used as a set of atomic annotations) and
dc.subject (best practice dictates this would be a comma-separated
list of terms from a controlled vocab.). If we want to make available
more about authors/editors perhaps we might use FOAF?

> [Actually, I just realized that I was thinking about all this vocabulary
> largely in terms of decorating returned NeXML with metadata rather than as
> PhyloWS search terms. Of course people don't need to search on "email"
> (etc)]

Mmmm... maybe they do need to search on "email", I don't want to
presume to know that :)

Rutger

Rutger Vos

Jun 12, 2009, 9:58:33 PM

to Karen Cranston, phy...@googlegroups.com, TreeBASE Developers

Hi,

On Fri, Jun 12, 2009 at 5:24 PM, Karen Cranston<karen.c...@gmail.com> wrote:
> Can I make one initial request? Can we make this a little less TreeBASE
> specific? I assume that we want to be friendly to other existing or future
> databases of trees, so when we make this public, we may want to have a core
> group of predicates that apply to trees in general and then examples of how
> to extend to a specific implementation (e.g. the tb.matrixTB1ID, which is
> pretty specific for this project).

Absolutely 100% correct. I was hoping this discussion would start,
because I think many of the predicates I have now pushed into the tb:
namespace can be moved up either to a PhyloWS vocabulary that defines
generic search fields for phylogenetic web services (e.g. to look up
things by their IDs or labels) or even to CDAO (assuming the fields
are relevant to CDAO's mission). Ideally only the distinction between
the TreeBASE1 and TreeBASE2 identifiers would be something for the tb:
vocabulary namespace.

> There does seem to be a fair bit of redundancy, as well as labels whose
> meanings aren't really that clear (the difference between treeKind and
> treeType or matrixID and matrixLabel is not immediately obvious).

treeKind and treeType are ambiguous, but that's what they are called in
the TreeBASE schema. The treeType is meant to indicate whether the
tree is an atomic result (e.g. a single, optimal topology) or some
kind of summary (e.g. a supertree, a consensus tree). The treeKind says
something about what we assume the tips to mean (species or single
sequences), which in turn says something about how the data are
homologized.

matrixID and matrixLabel should be obvious - they're just like taxonID
and taxonLabel, or treeID and treeLabel etc. An ID is an identifier
(e.g. "TreeBASE:M21313"), a Label is a human readable string (e.g.
"Cytochrome B matrix, aligned using ClustalW").

> Does it
> make more sense to split these into separate predicates? For example, have
> matrix.xxx and tree.xxx. The matrix labels could then be used by projects
> that only have data matrices (e.g. benchmark data sets for alignment or
> phylogeny reconstruction) without having to worry about tree-specific terms.

I think the goal is that we can run queries such as:

select * from matrices where tb.matrixID='TreeBASE:M21313';

...which implies that there is a vocabulary, identified by the tb
prefix, that explains what a 'matrixID' is. I agree that it might seem
less redundant to do something like:

select * from matrices where matrix.id='TreeBASE:M21313';

...but all that does is imply a separate vocabulary with a matrix
prefix. This in turn implies that there would have to be vocabularies
for matrix, taxon, taxa, tree, trees (etc.?) which would all have to
be developed and maintained, and whose namespaces would need to be
imported by whoever is formulating the query (imagine this as a SPARQL
query, for example). I believe the end result is actually *more*
redundancy and *longer* queries, so nothing much would be gained that
way.
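
As a rough illustration of the namespace overhead, here is a small Python sketch that works out which prefix declarations a SPARQL-style query would need for a given set of predicates. The namespace URIs are placeholders invented for this example (no such vocabularies exist at those addresses), and the function names are hypothetical.

```python
# Placeholder namespace URIs, for illustration only.
NAMESPACES = {
    "tb": "http://example.org/treebase/terms#",
    "matrix": "http://example.org/matrix#",
    "tree": "http://example.org/tree#",
    "taxon": "http://example.org/taxon#",
}


def required_prefixes(predicates):
    """Return the distinct prefixes used by predicates like 'tb.matrixID'."""
    return sorted({predicate.split(".", 1)[0] for predicate in predicates})


def sparql_prologue(predicates):
    """Build the PREFIX lines a query using these predicates would need."""
    return "\n".join(
        f"PREFIX {prefix}: <{NAMESPACES[prefix]}>"
        for prefix in required_prefixes(predicates)
    )
```

Under these assumptions, a query touching tb.matrixID, tb.treeID and tb.taxonID needs a single PREFIX line, whereas the same query written with matrix.id, tree.id and taxon.id needs three, one per vocabulary.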

> I'd like to see some simplification and renaming. Are there places where we
> can make use of Darwin Core terms rather than defining new terms? DWC has a
> datasetID as well as a whole pile of Taxon-related terms.

Good idea, haven't looked at that yet.

Thanks for your comments - let's keep this discussion going!

Rutger

Arlin

Jun 24, 2009, 1:18:31 PM

to PhyloWS

Sorry for getting into this late. I am confused about how to handle
some of these issues about developing a general vocabulary. For
instance, there are several concepts that are "labels". How do we
handle this?

Do we have a CDAO concept of "display label" (i.e., label for the
purposes of display), or do we get this concept from somewhere else (a
display label is an information artefact), since of course display
labels are used for all sorts of things?

Do we sub-class the label concept, e.g., "treeBlockLabel" and
"matrixLabel", or do we just have a generic label concept with a broad
domain?

Arlin

Arlin

Jun 24, 2009, 5:24:47 PM

to PhyloWS

On Jun 24, 1:18 pm, Arlin <stolt...@umbi.umd.edu> wrote:
> Do we have a CDAO concept of "display label" (i.e., label for the
> purposes of display), or do we get this concept from somewhere else (a
> display label is an information artefact), since of course display
> labels are used for all sorts of things?

In particular, is a label nothing more than an rdfs:label as defined
by the W3C?

http://www.w3.org/TR/rdf-schema/#ch_label

Or is that mixing syntax with semantics?

Rutger Vos

Jun 24, 2009, 6:17:15 PM

to Arlin, PhyloWS

Hi Arlin,

Great questions! Thanks for getting involved.

I *think* rdfs:label would be mixing syntax with semantics.

It would make some amount of sense to use dc:title for labels (very
generic, so very likely to be recognized by off-the-shelf tools),
except I was hesitant to go with that in the spreadsheet because I
imagined it might create ambiguity about what the subject is in CQL
queries.

E.g., if we compose a query like "dc.title any fish", are we matching
against tree labels? Matrix labels? Study labels? All of the above? I
guess it depends on the rest of the PhyloWS specification.

If we specify that the API is "/PhyloWS/tree/?query=dc.title+any+fish",
such that the path elements below PhyloWS determine the context, there
would be no ambiguity and we could use dc:title.
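
A minimal Python sketch of that idea, assuming the URL layout suggested above (the first path element below PhyloWS names the context, and the CQL string travels in a "query" parameter). The function names are made up for illustration and are not part of any PhyloWS specification.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit


def build_search_url(context, cql):
    """Build a search URL whose path element gives the query its subject."""
    return f"/PhyloWS/{context}/?query={quote_plus(cql)}"


def parse_search_url(url):
    """Return (context, cql) from a URL such as /PhyloWS/tree/?query=..."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) > 1 and segments[0] == "PhyloWS":
        context = segments[1]
    else:
        context = None
    # parse_qs decodes '+' back into spaces.
    cql = parse_qs(parts.query).get("query", [None])[0]
    return context, cql
```

Here build_search_url("tree", "dc.title any fish") gives "/PhyloWS/tree/?query=dc.title+any+fish", and parsing that URL yields the context "tree" alongside the CQL string, so the same dc.title predicate could be matched against tree labels on one path and matrix labels on another.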

Rutger
