Over the last couple of days I have worked on modifying the TreeBASE2
architecture to make it accessible in a more RESTful way, using RDF
and NeXML to expose the underlying data. I believe something *like*
this architecture should be part of the treebase web service
architecture as a way to de-reference search results (e.g. from a list
of returned URLs).
Check out this URL:
http://8ball.sdsc.edu:6666/treebase-web/PhyloWS/TB2:S1787 - this is
the resource that represents a "Study" record in TreeBASE. At that
address, what you get back is a little rdf response that lists other
resources that are related to the current one, and what mime type they
respond with[see bottom*]. In this case, the subtended structure is as
follows:
PhyloWS/TB2:S1787 - rdf index file
PhyloWS/TB2:S1787.xml - nexml serialization
PhyloWS/TB2:S1787.nex - nexus serialization
PhyloWS/TB2:S1787.rdf - rdf serialization of the rdfa attachments in
the nexml file
PhyloWS/TB2:S1787.html - web page version
The nexml serialization attaches an annotation to every element that
serializes a treebase object; each annotation contains that object's
resource URI (these URIs are structured the same way, recursively, as
that of the Study resource). It also contains attachments for NCBI
taxon identifiers, uBio Namebank IDs, and dublin core metadata about
the associated publication.
The idea is that from the top level study record you can drill down to
other related resources in various formats. For example, here is the
rdf index for a tree within this study:
http://8ball.sdsc.edu:6666/treebase-web/PhyloWS/TB2:Tr4816
So what do you think about the API this implies? I'm debating whether
to introduce more structure in the URLs (as in Vivek's code), e.g.
PhyloWS/study/TB2:S1787/tree/TB2:Tr4816 but I'm not sure if that's
going to be a good idea when/if we're mapping our URLs onto purls
(incidentally, Hilmar, you control the namespace @ purl, can I somehow
join?).
And what about the identifiers? I think for this to work in a
generalizable way this is the minimum - a prefix to dispatch to the
right authority (note I did the same for uBio: and NCBI:) and an
opaque string. In TreeBASE's case this string is sort-of-namespaced as
well; the short prefixes ("S", "Tr", etc.) map onto classes of the
TreeBASE code bases so that hibernate can figure out where to fetch
the persisted objects from.
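To make the identifier scheme concrete, here is a minimal sketch (in
Python, not the actual TreeBASE code) of splitting an identifier into
the authority prefix and the opaque string, and, for TreeBASE only,
pulling out the short class prefix. The names parse_identifier and
ParsedId are made up for illustration, and "NCBI:9606" in the usage
note is just an example value.

```python
import re
from typing import NamedTuple, Optional


class ParsedId(NamedTuple):
    authority: str               # dispatch prefix, e.g. "TB2", "NCBI", "uBio"
    local_id: str                # opaque string, e.g. "S1787"
    class_prefix: Optional[str]  # TreeBASE only: "S", "Tr", ...


# Assumption: a TreeBASE local ID is a short alphabetic class prefix
# followed by digits, as in S1787 or Tr4816.
_TB2_LOCAL = re.compile(r"^([A-Z][a-z]*)([0-9]+)$")


def parse_identifier(identifier: str) -> ParsedId:
    """Split 'authority:opaque' and, for TB2, extract the class prefix."""
    authority, sep, local_id = identifier.partition(":")
    if not sep or not authority or not local_id:
        raise ValueError(f"expected 'authority:local', got {identifier!r}")
    class_prefix = None
    if authority == "TB2":
        match = _TB2_LOCAL.match(local_id)
        if match is None:
            raise ValueError(f"not a TreeBASE-style local ID: {local_id!r}")
        class_prefix = match.group(1)
    return ParsedId(authority, local_id, class_prefix)
```

For example, parse_identifier("TB2:Tr4816") yields the authority "TB2"
and the class prefix "Tr", while parse_identifier("NCBI:9606") leaves
the local part fully opaque.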
How do we like the "file extensions" to indicate the requested format?
Not exactly SRU, I know. I saw it done this way by Katayama's group in
Tokyo and I thought it was cute (but trivially changeable of course).
Any comments greatly appreciated!
Rutger
* to get a semantics-aware view of these index pages, go here:
http://demo.openlinksw.com/rdfbrowser2/ and paste one of the treebase
URLs into the box and hit "describe".
you can also view the rdf with tools such as this one:
http://linkeddata.uriburner.com/about/html/http://8ball.sdsc.edu:6666/treebase-web/PhyloWS/TB2:S1787
--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
> over the last couple of days I have worked on modifying the TreeBASE2
> architecture to make it accessible in a more RESTful way, using RDF
> and NeXML to expose the underlying data.
I think this is great stuff. It's cool to put an entire study in a
single serialized stream (and some day when there is a formal way of
communicating the structure of analyses, we'll be able to say
something about how trees are derived from matrices, etc, within a
NeXML file).
I'll just mention that for cases where there's more than one taxon
block, the ".nex" implementation lacks "LINK" commands for the trees,
which causes Mesquite to do weird stuff with them.
For example, for this study:
http://8ball.sdsc.edu:6666/treebase-web/PhyloWS/TB2:S260.nex
... when opened in Mesquite, some trees are associated with matrices
while others are "orphaned." To solve the problem, just add the
following:
tree block Tb4985 needs the command "LINK TAXA = TaxonLabelSet9634;"
tree block Tb4984 needs the command "LINK TAXA = TaxonLabelSet9633;"
tree block Tb4986 needs the command "LINK TAXA = TaxonLabelSet9635;"
etc
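As a rough sketch of that fix, the following Python function inserts
the missing commands into a NEXUS string. It rests on assumptions that
are not confirmed anywhere in this thread: that each TREES block
carries a "TITLE <label>;" command on its own line, and that blocks end
with "END;". The function name add_taxa_links is made up, and it does
not check whether a LINK command is already present.

```python
import re

_BEGIN_TREES = re.compile(r"^begin\s+trees\s*;", re.IGNORECASE)
_END = re.compile(r"^end\s*;", re.IGNORECASE)
_TITLE = re.compile(r"^title\s+(\S+?)\s*;", re.IGNORECASE)


def add_taxa_links(nexus_text, links):
    """Insert 'LINK TAXA = <taxa block>;' after the TITLE of each tree block.

    links maps a tree block title (e.g. "Tb4985") to the title of the
    taxa block it belongs to (e.g. "TaxonLabelSet9634").
    """
    out = []
    in_trees = False
    for line in nexus_text.splitlines():
        out.append(line)
        stripped = line.strip()
        if _BEGIN_TREES.match(stripped):
            in_trees = True
        elif _END.match(stripped):
            in_trees = False
        elif in_trees:
            match = _TITLE.match(stripped)
            if match and match.group(1) in links:
                # Reuse the indentation of the TITLE line.
                indent = line[: len(line) - len(line.lstrip())]
                out.append(f"{indent}LINK TAXA = {links[match.group(1)]};")
    return "\n".join(out)
```

Called with {"Tb4985": "TaxonLabelSet9634"}, this adds the first of the
three commands listed above and leaves all other blocks untouched.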
> I'm debating whether to introduce more structure in the URLs (as in
> Vivek's code), e.g.
> PhyloWS/study/TB2:S1787/tree/TB2:Tr4816 but I'm not sure if that's
> going to be a good idea when/if we're mapping our URLs onto purls
I definitely vote for /PhyloWS/study/TB2:S260.xml instead of
/PhyloWS/TB2:S260.xml
I'm not sure why purls should matter here with respect to including
/study/. The advantage of /study/ is that it is more compatible with
the existing PhyloWS specification, and it allows us to also create
purls to /tree/ and /matrix/.
My other comment is that since these serve as GUIDs, I'd think it is
best to minimize the proliferation of distinct GUIDs that are pointing
at the same object (except that they pull back different formats).
Instead of:
.../PhyloWS/study/TB2:S260
.../PhyloWS/study/TB2:S260.xml
.../PhyloWS/study/TB2:S260.nex
why not do:
.../PhyloWS/study/TB2:S260 or
.../PhyloWS/study/TB2:S260?format=list_resources
.../PhyloWS/study/TB2:S260?format=nexml
.../PhyloWS/study/TB2:S260?format=nexus
The advantage being that (1) the ".../PhyloWS/study/TB2:S260" part
remains unique, so when different resources try to cite the same study
with a purl, they'll be less at risk of using different but synonymous
purls, and (2) the format is more explicit -- i.e. "format=nexml" is
more specific as compared to just ".xml" (e.g., what if some day we
want to serve PhyloXML -- easy: format=phyloxml)
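A small Python sketch of what this buys a client: the resource part of
the URL stays fixed and the format travels as an ordinary query
parameter that the standard library can build and parse. The host
example.org and the function names are placeholders, not part of any
real deployment.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def serialization_url(base, kind, identifier, fmt=None):
    """Build .../<kind>/<identifier>, optionally with ?format=<fmt>."""
    url = f"{base}/{kind}/{identifier}"
    if fmt is not None:
        url += "?" + urlencode({"format": fmt})
    return url


def split_request(url):
    """Return (resource URL without query, requested format or None)."""
    parts = urlsplit(url)
    fmt = parse_qs(parts.query).get("format", [None])[0]
    resource = parts._replace(query="").geturl()
    return resource, fmt
```

With this, serialization_url("http://example.org/PhyloWS", "study",
"TB2:S260", "nexml") and the same call with "nexus" differ only in the
query string, so both map back to one citable resource URL.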
Can the ".../PhyloWS/study/TB2:S260" default to an RSS 1.0 list of
available formats/resources? -- i.e. essentially communicating the
same RDF as you have, but formatted so that browsers display it in a
readable fashion.
bp
Hi all,
thanks so much for your detailed response! I've gone in and made some
of the changes you suggest (not live on the server yet, though, just
on my laptop for now).
> Minor note: URLs are case-sensitive, and the PhyloWS prefix is all-lowercase
> (phylows).
OK, I changed that, it's all lower case now.
> Note that, as we discussed, adding URI patterns into PhyloWS is fair
> game if your resource has levels of data organization that aren't in the
> standard. I would recommend, though, to follow the design pattern of the
> standard. I.e., use a prefix that indicates the type of entity that the URI
> identifies, rather than just putting the ID under the phylows/ base. So for
> example for studies, you could do phylows/study/TB2:S1787, inserting the
> phylows/study/ pattern as a proprietary extension.
I still have mixed feelings about the structured data organization,
with type strings as path elements, inside the URLs. Here's why my
feelings are mixed:
* pro: this could actually be handy if a request comes in on the
server, say /tree/TB2:Tr2312/node/TB2:Tn3423/, we can provide more
metadata (and maybe more serialization formats) in the resource
description because we now have the tree context for the focal node.
* con: we now need to make a mapping between treebase classes and
these path elements - and if we want to expose *every* object in the
database (including, say, character states, authors and weight sets)
that would mean having to come up with ....mmmm... 50 of these path
fragment strings (admittedly that's a wild stab, but it'll be at least
on that order).
Still, that's doable - but the worst con I think is more serious: I
think we want our identifiers to function a bit like DOIs, in that
people might want to go to http://${phylowsResolver}/phylows/${id} and
get redirected to the correct page without having to know how to
construct the subtended path.
Say I'm reading a paper, and it says "we deposited our tree as
TB2:Tr2312 and our matrix as TB2:M2342", I would like to be able to
just copy those IDs from the document and tack them onto a resolver
url the way I do with DOIs, i.e. http://dx.doi.org/${copiedDOI} -
having to implement the logic that the tree ID lives under
phylows/tree/TB2:Tr2312 and the matrix under phylows/matrix/TB2:M2342
is enough of a hassle to do programmatically (especially for nodes and
other nested objects), let alone having that be understood by innocent
bystanders.
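To illustrate the DOI-like behaviour, here is a minimal Python sketch
of a resolver that takes a bare ID copied from a paper and works out
the typed path to redirect to. The lookup table is hypothetical; only
the prefixes that show up in this thread are listed, and the function
name resolve is made up.

```python
import re

# Hypothetical table from TreeBASE class prefix to path element.
PATH_FOR_PREFIX = {
    "S": "study",
    "Tr": "tree",
    "M": "matrix",
}

_TB2_ID = re.compile(r"^TB2:([A-Z][a-z]*)[0-9]+$")


def resolve(identifier):
    """Return the typed path for a bare TreeBASE ID, or None if unknown."""
    match = _TB2_ID.match(identifier)
    if match is None:
        return None
    segment = PATH_FOR_PREFIX.get(match.group(1))
    if segment is None:
        return None
    return f"/phylows/{segment}/{identifier}"
```

A request for http://${phylowsResolver}/phylows/TB2:Tr2312 could then
be answered with a redirect to the path returned by
resolve("TB2:Tr2312"), so the reader never has to know about /tree/.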
> I agree the format extension looks cute, but I'm not sure it isn't more
> clever than necessary (e.g., standard runtime API calls in nearly every
> language allow you to get at a query parameter or HTTP header value, but the
> above requires custom parsing, no matter how simple that might be).
> Second, it provides for different URIs - that's OK I suppose if we all agree
> that the things identified by them are all different. Are we sure we want to
> look at these as essentially different things rather than different
> serializations of the same thing?
Yeah, you're right (Bill mentioned this too), I changed it to say
?format=foo because I do see them as different serializations of the
same resource.
> Third, it's not what the standard currently says. While you can add on
> capabilities to the standard w/o violating it, it creates flavors, the
> effect of which when it wouldn't have been necessary we're all too familiar
> with.
Agreed.
> Fourth, I think any scheme following a phylows/<type>/<identifier>.rdf
> pattern better return a serialization of the entire "thing" (i.e., record),
> rather than only some, and possibly processed metadata. If you are going to
> return a subset of the metadata only then that should be obvious from the
> URI (or by specific request parameter, altering the default).
That's certainly the intention: the ?format=rdf serialization should
be the full CDAO serialization including extracted RDFa. It's just
that John Harney (our semantic web services point man) is still
working on the liftingSchemaReference stylesheet that will make that
happen.
> Fifth, what I think this scheme would be at odds with is the Linked Data
> principles. In fact, that's what I think needs fixing in the current PhyloWS
> spec too - we kind of confound the URI for a thing (such as a tree)
In this implementation that would be the base address, without
format=foo, which, as per Bill's suggestion, will return an RSS1.0
description of available views/serializations (RSS1.0 is the flavour
of RSS that is actually RDF/XML - as an aside I think we should also
use it to list the resource addresses of search results).
> with the
> URI of its description (the data and metadata as RDF)
i.e. format=rdf
> and its serialization
> in a standard exchange format (as NeXML).
i.e. format=nexml
> We should probably start thinking
> about the appropriate 303 redirect structure that is recommended,
Ah! It's 303? I was using 302, but I very happily changed that (I have
goofy reasons for liking the number 303:
http://video.google.com/videoplay?docid=2520461739591700600)
> and to
> what extent we want to recommend or dissuade from content negotiation
> (conneg).
> -hilmar
What do you mean by content negotiation - that the client says which
content types it accepts (including gzipped content, which will become
useful for large rdf documents) or that the server decides what the
client will get based on its user agent string?
Cheers,
Rutger
> * con: we now need to make a mapping between treebase classes and
> these path elements - and if we want to expose *every* object in the
> database (including, say, character states, authors and weight sets)
> that would mean having to come up with ....mmmm... 50 of these path
> fragment strings (admittedly that's a wild stab, but it'll be at least
> on that order).
I don't know that we're obliged to create URN-GUIDs for *every* object
-- just the ones that matter. The most important "deliverables" are
study, tree, and matrix. (Certainly, people should be able to *search*
on other objects -- author, journal, taxon, character title, character
state, etc (e.g. /phylows/find/study/query="author+any+Darwin"), but I
don't know that these are central enough to the mission of TreeBASE
that they require their own resolvable URN-GUIDs).
> I think we want our identifiers to function a bit like DOIs, in that
> people might want to go to http://${phylowsResolver}/phylows/${id} and
> get redirected to the correct page without having to know how to
> construct the subtended path.
True, but like DOIs, we can advertise our identifiers as the full
string (i.e. the "official" identifier is
".../phylows/study/TB2:S1234" not "S1234") -- so in our instructions to
authors, we give them the full GUID as what they should quote in their
paper.
bp
I think that the plan should be to make everything that has a treebase
ID be a uniquely identifiable resource. This is so that any semantic
computation (i.e. any sparql query that might need to join things) is
at least possible. It should be possible to search for something like
"get me all lemurs that have a four-tooth grooming comb", which means
that a reasoner must deal with characters, state sets, states, matrix
cells, taxa, all of which need to be identifiable resources. In any
case, once we start creating URN-GUIDs for some objects (as I have)
it's easier to do it generically for every persistable anyway so it
would be more of an obligation *not* to create them for some objects.
> True, but like DOIs, we can advertise our identifiers as the full
> string (i.e. the "official" identifier is
> ".../phylows/study/TB2:S1234" not "S1234") -- so in our instructions to
> authors, we give them the full GUID as what they should quote in their
> paper.
OK, that's how I implemented it now. Conceptually there are now sub
"folders" within phylows for /study/, /taxon/, /matrix/ and /tree/
which are simply obtained from the last part of the package name of
the subtended object (e.g. a tree object in treebase has as a fully
qualified name org.cipres.treebase.domain.tree.PhyloTree, so we use
the "tree" string that precedes PhyloTree). It means, as you say, that
people would now need to use "study/TB2:S1787" as the published
identifier.
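The package-name trick can be sketched in a few lines. This is a Python
illustration of the idea only (the real implementation lives in the
TreeBASE Java code and is not shown here); path_segment and typed_path
are made-up names.

```python
def path_segment(qualified_class_name):
    """Return the last package component, i.e. the part before the class name."""
    parts = qualified_class_name.split(".")
    if len(parts) < 2:
        raise ValueError(f"no package in {qualified_class_name!r}")
    return parts[-2]


def typed_path(qualified_class_name, identifier):
    """Build /phylows/<segment>/<identifier> from an object's class name."""
    return f"/phylows/{path_segment(qualified_class_name)}/{identifier}"
```

For org.cipres.treebase.domain.tree.PhyloTree this gives "tree", so a
tree with ID TB2:Tr4816 ends up under /phylows/tree/TB2:Tr4816 without
the "/tree/" string being hardcoded anywhere.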
I see. I agree.
bp
>> I would recommend, though, to follow the design pattern of the
>> standard. I.e., use a prefix that indicates the type of entity that
>> the URI
>> identifies, rather than just putting the ID under the phylows/
>> base. So for
>> example for studies, you could do phylows/study/TB2:S1787,
>> inserting the
>> phylows/study/ pattern as a proprietary extension.
>
> I still have mixed feelings about the structured data organization,
> with type strings as path elements, inside the URLs. Here's why my
> feelings are mixed:
>
> [...]
> * con: we now need to make a mapping between treebase classes and
> these path elements - and if we want to expose *every* object in the
> database (including, say, character states, authors and weight sets)
> that would mean having to come up with ....mmmm... 50 of these path
> fragment strings
That's not true. If you want a catch-all, or if your data resource
supports uniquely identifying every data element w/o knowing its
container element (which many resources will *not* be able to support
because they weren't designed to) then create a URI prefix such as
/phylows/item/ or /phylows/element/ or /phylows/object/ which is
followed by an identifier.
This could also be your resolver: /phylows/resolve/<identifier>
The standard can't demand that capability to be present (for the
reason stated above), but would, I guess, be well advised to recommend
URI prefixes that one should use if one wants to support that
capacity. So how do the above sound?
>> We should probably start thinking
>> about the appropriate 303 redirect structure that is recommended,
>
> Ah! It's 303? I was using 302, but I very happily changed that:
See the httpRange-14 document:
http://www.w3.org/2001/tag/doc/httpRange-14/2007-05-31/HttpRange-14
>
> What do you mean by content negotiation - that the client says which
> content types it accepts (including gzipped content, which will become
> useful for large rdf documents) or that the server decides what the
> client will get based on its user agent string?
The server decides which representation to return based on the Accept
request header value sent by the client. Deciding by the user agent is
practiced quite frequently too, but I don't think that technically
qualifies as content negotiation.
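For reference, a simplified Python sketch of server-side negotiation on
the Accept header. It honours q-values and the */* and type/* wildcards
but deliberately ignores the finer precedence rules of the HTTP
specification, so treat it as an illustration rather than a compliant
implementation. The function names are made up.

```python
def parse_accept(header):
    """Return (media_type, q) pairs, highest q first (ties keep header order)."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type.lower(), q))
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries


def negotiate(header, available):
    """Pick the first available media type the client accepts, else None."""
    if not available:
        return None
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            major = media_type[:-1]  # e.g. "text/"
            for candidate in available:
                if candidate.startswith(major):
                    return candidate
        elif media_type in available:
            return media_type
    return None
```

A client sending only "application/rdf+xml" gets RDF or nothing, while
a browser that ends its list with "*/*" always gets something back.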
Yup, that was what I was planning to do, but then I realized that:
i) I can get any object from the database using just the
TB2:[A-Z][a-z]*[0-9]+ strings, and,
ii) once hibernate has fetched it from the database I can get the
object's package name and use that to reconstruct the URI prefix
(because, fortuitously, the last word of the package name is either
"tree", "study", "taxon", or "matrix").
So now treebase *can* use URIs such as /phylows/study/TB2:S1787 and
there is no hardcoding of the "/study/" bit anywhere.
I've been reading up on the pattern of 303 redirect + content
negotiation (of which I wasn't aware), will try to implement that too.
I think it should be that ?format=foo will override the Accept header
so that I'm not forced to spoof headers when trying to debug the rdf
in a browser window.
> I think it should be that ?format=foo will override the Accept header
I think so too, though technically if what a client can accept is at
odds with the valid return options from a request I'm not sure this
should not result in an error (but for debugging you wouldn't want
this, obviously).
Thanks for stepping in! That's useful info - I suppose it's a 4xx
error because the client shouldn't have asked something the server
doesn't support (hence the client is in error)?
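Putting the pieces of this exchange together, here is one possible
reading as a Python sketch: an explicit ?format=foo wins over the
Accept header, a bare request is redirected with 303 to a format chosen
from Accept, and a request nothing can satisfy gets 406. The media
types in the table, the choice of 400 for an unknown format value, and
the name choose_format are all assumptions for illustration; the Accept
handling here ignores q-values.

```python
# Hypothetical mapping from ?format= values to the media types served.
FORMATS = {
    "rdf": "application/rdf+xml",
    "nexml": "application/xml",
    "nexus": "text/plain",
}
DEFAULT_FORMAT = "rdf"


def choose_format(query_format, accept_header):
    """Return (HTTP status, format name or None).

    200: serve query_format directly (it overrides Accept).
    303: redirect to the same resource with ?format=<name>.
    400: unknown ?format= value (assumed choice of 4xx code).
    406: nothing in Accept matches what the server can produce.
    """
    if query_format is not None:
        if query_format in FORMATS:
            return 200, query_format
        return 400, None
    accepted = [
        item.split(";")[0].strip().lower()
        for item in accept_header.split(",")
        if item.strip()
    ]
    # Exact matches first, in the order the client listed them.
    for media_type in accepted:
        for fmt, served_type in FORMATS.items():
            if served_type == media_type:
                return 303, fmt
    if "*/*" in accepted:
        return 303, DEFAULT_FORMAT
    return 406, None
```

So choose_format("rdf", "text/html") still serves RDF, which is what
makes debugging in a browser possible without spoofing headers.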
> Should this discussion be taking place on the -devel list?
There's no nexml-devel list, if that's what you mean. Frankly I don't
think we need one - there might be a spike in traffic right now but in
general it's so quiet I don't think we want to fragment things even
further by having conversations going on on two separate lists.
Rutger
> I believe browsers nearly always accept */* as a last resort
Browsers yes. But as a semantic web agent I would only put
application/rdf+xml under the Accept header, unless I'm prepared to take
other things too.
BTW most servers don't follow the 406 recommendation. Try running curl
with an Accept header value of for example solely application/rdf+xml
against your web server of choice with a GET on an HTML page. You'll
normally get the HTML page (unless the server has an RDF blurb for the
page).