We have two kinds of IDs:
The ID of a record is a partial ID local to the dataset. The ID of the dataset is the base ID used to create a complete reference ID for the records of the dataset. A full ID is created by concatenating the (base) ID of a dataset with the ID of a record. The full ID created has to be a valid URI.
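
As a rough illustration of the concatenation rule (a sketch only: the
function name full_id and the example.org IDs are made up here, and the
scheme check is much looser than real URI validation):

```python
from urllib.parse import urlparse


def full_id(dataset_id: str, record_id: str) -> str:
    """Concatenate the dataset (base) ID with the record's partial ID."""
    candidate = dataset_id + record_id
    # Loose sanity check only: an absolute URI must at least have a scheme.
    if not urlparse(candidate).scheme:
        raise ValueError(f"not an absolute URI: {candidate!r}")
    return candidate


# Hypothetical base ID and record ID, for illustration.
print(full_id("http://example.org/datasets/people/", "keane_p"))
```
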
--peter keane
It would be an interesting feature to define a way to extract the dataset url from a record url but I don't think it is necessary.
I'm not sure this is even a BibJSON specification issue. What we have called the "ref" id in a BibJSON dataset is not a permanent identifier and the dataset itself is portable.
BKNPeople for sure is not committed to maintaining permanent identifiers. I think the more important
issue is how we find a dataset url for a specific identifier. Jim raised this issue with me in a conversation yesterday.
At the risk of sounding soap-boxy, I wanted to express a few reactions
here. Note that my perspective is based on an understanding of the
principles behind REST-based architectures (
http://tomayko.com/writings/rest-to-my-wife ). I am of the opinion
that REST (the real principles, not the hype that you sometimes hear
about) offers the best way to evaluate the design of network-based
information systems (indeed, it describes the set of principles on
which the web itself is based).
In a nutshell, REST is based on a few basic tenets:
1. every important resource needs a name (on the web, that means a URL).
2. every such resource needs one or more representations (HTML is most
common, but XML, JSON, etc. are common as well).
3. there is a basic (uniform) set of operations that can be performed
on a resource (GET, PUT, DELETE, POST)
4. application state (i.e. the flow of operations) should be driven by
hypertext
Those might seem overly abstract, but it turns out they have very
useful real-world application, which I'll try to apply here. I think
that 1. and 4. are probably most important to the issues being
addressed here.
On Tue, Mar 2, 2010 at 5:19 PM, Benjamin Kalish <bka...@gmail.com> wrote:
> Hi Jack,
> The specification says "If the value refers to a global ID, it means that
> this record can be local, or remote. If it is remote, its representation
> should be resolvable on the network specified by the URI." I read that as
> meaning that if the record is remote its full id should in fact be a URL,
Note that an entity might very well have one or more "IDs". These
may/may not be the same as the "name" as understood by REST. Notably,
the Atom Syndication Format (which, along with its partner
specification Atom Publishing Protocol offers "best practice" RESTful
design) has an "ID" element, but it is NOT meant to be a URL (the spec
says: "Its content MUST be an IRI, as defined by [RFC3987]. Note that
the definition of "IRI" excludes relative references. Though the IRI
might use a dereferencable scheme, Atom Processors MUST NOT assume it
can be dereferenced.").
To get the "name" (in the REST sense) of an Atom Document, you look to
hypertext: there is a link relation called "self". It is the URL in
the href attribute of the link element with @rel=self that defines the
URL (i.e., network "name" of the document).
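
To make the atom:id vs. link distinction concrete, here is a small
sketch using Python's standard XML parser. The sample entry, its URN,
and the example.org URLs are invented for illustration; only the Atom
namespace and the rel="self" convention come from the Atom spec.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Invented sample entry: the id is an opaque IRI, the self link is the URL.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <title>Example record</title>
  <link rel="self" href="http://example.org/dataset/records/1"/>
  <link rel="alternate" href="http://example.org/dataset/records/1.html"/>
</entry>"""


def self_url(atom_xml: str):
    """Return the href of the link with rel="self", or None if absent."""
    root = ET.fromstring(atom_xml)
    for link in root.findall(ATOM_NS + "link"):
        if link.get("rel") == "self":
            return link.get("href")
    return None


print(self_url(SAMPLE_ENTRY))
```

Note that the code never tries to dereference the atom:id; it only
trusts the link.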
> but perhaps I misunderstood.
> I agree that the important issue is how we find a dataset url given an
> identifier. Full record id's need not be URLs, but the usefulness of a
> record reference will be severely restricted if there isn't some way to
> retrieve the record in question.
I am not convinced that it is important to find a dataset url given an
identifier. While it is important that an ID be globally unique (like
the atom:id) and thus allowing disambiguation, etc., suggesting that
it has some sort of semantics, or is itself a bit of "hypertext" from
which you can extract a URL is probably a dangerous assumption. That
said, it is of course hugely important that we be able to find a
dataset given its URL *or* that we be able to find the dataset for a
given record when we have that record's URL. And that is where
hypertext comes in. In many instances, it is a dereference operation.
But often, it may require two operations: a dereference AND a look at
the retrieved document, the semantics of which should allow me to
extract the link I am seeking (based on pre-established link
relationships, e.g.,
http://www.iana.org/assignments/link-relations/link-relations.xhtml ).
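
A sketch of that two-step lookup, under some loud assumptions: BibJSON
does not currently define a "links" member, so the "links", "rel", and
"href" names below are hypothetical, as are the example.org URLs. I
use the registered "up" relation to point at the parent, and a fake
fetch function stands in for the HTTP GET so the sketch runs offline.

```python
import json
from typing import Callable, Optional


def find_link(document: dict, rel: str) -> Optional[str]:
    """Return the href of the first link with the given relation, if any."""
    for link in document.get("links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


def dataset_url_for(record_url: str, fetch: Callable[[str], str]) -> Optional[str]:
    # Step 1: dereference the record's URL.
    document = json.loads(fetch(record_url))
    # Step 2: read the answer out of the representation, not out of the URL.
    return find_link(document, "up")


# Stand-in for the web: maps a URL to the body a GET would return.
FAKE_WEB = {
    "http://example.org/r/42": json.dumps({
        "id": "42",
        "links": [{"rel": "up", "href": "http://example.org/datasets/people"}],
    })
}

print(dataset_url_for("http://example.org/r/42", FAKE_WEB.__getitem__))
```

The record URL here deliberately gives no hint of the dataset URL; the
client learns it only from the link.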
> It occurs to me that there are two distinct use cases involving retrieving
> records:
>
> The dataset is available as a BibJSON file. The server can only respond with
> the file itself and retrieving an individual record requires client side
> processing.
> The dataset is available through some web application such as bib server.
> Such web applications may choose to make individual records available.
>
The client really should not be expected to know anything about the
HTTP server other than its address. Whether it is a web application
or a static file system is immaterial. The important thing is that it
adheres to the HTTP standard. Content negotiation is part of that
standard -- the client can send an accept header saying whether it
wants JSON or HTML or whatever. The server can say "here it is" or
"no I don't have that format" as it wishes. Whether individual
records are available or not should also be an "arm's length"
transaction (i.e., does not require the client to have special
knowledge of the server). If a record is not available, the server
simply responds with a 404 "Not Found" response. Alternatively, a
request for the whole dataset should be in a format with sufficient
semantics (an Atom feed is a good example of this) to allow the client
to discover the URLs for individual items.
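
For what it's worth, the client side of that "arm's length"
transaction is only a few lines. This is a sketch with Python's
standard library; the function names and the example.org URL are
placeholders, and fetch() is defined but not called here since it
needs a live server.

```python
import urllib.error
import urllib.request


def build_request(url: str, media_type: str = "application/json"):
    """Build a GET request that states the preferred representation."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch(url: str, media_type: str = "application/json"):
    """Return (status, body); body is None when the server declines."""
    try:
        with urllib.request.urlopen(build_request(url, media_type)) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # e.g. 404 (no such resource) or 406 (format not offered).
        return err.code, None


req = build_request("http://example.org/dataset/records/1")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

The client knows nothing about the server beyond the URL and the
status code it gets back.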
> In my mind the BibJSON specification should be most concerned with the first
> use case, though it would be nice to have a consistent API for the second
> use case as well. If we wish to have a single scheme for URI's that can be
> used for both purposes then it is probably best to place the record in
> the query string, e.g., "http://foo.bar/dataset?id=baz".
> If "http://foo.bar/dataset" points to a simple BibJSON file then the query
> string is ignored, and the file will be returned. If,
> however, "http://foo.bar/dataset" points to a web application, then it may
> use the query string to identify and retrieve the record in question.
> If the above scheme were implemented it would then be necessary for
> applications to determine in some way whether the retrieved data represents
> an individual record or the entire dataset.
This is where I think it would be very useful to step back and see how
this approach breaks the principles outlined in REST. This approach
effectively takes the individual records off the web. They are no
longer individually addressable, but must be interacted with by way of
a "tunnel" that goes through the parent (dataset) record. Of course,
if the individual records *did* still have their own URL, this is less
problematic (we see query strings as a "filtering" mechanism in
REST-based systems with some regularity. Although it's likely
something of a violation of the basic principles, in practice it can
be useful).
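
To show what the query-string form looks like from the client's side,
here is a sketch that takes Benjamin's example URL apart. The function
name is made up; the point is that the only thing the URL actually
names is the dataset, with the record reduced to a filter parameter.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def split_record_reference(url: str):
    """Split a ?id= style reference into (dataset URL, record id or None)."""
    parts = urlsplit(url)
    record_ids = parse_qs(parts.query).get("id", [])
    # Drop the query and fragment to recover the dataset's own URL.
    dataset_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return dataset_url, (record_ids[0] if record_ids else None)


print(split_record_reference("http://foo.bar/dataset?id=baz"))
```
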
> (Imagine the confusion if an
> application were to accidentally cache a BibJSON file thinking it was an
> entire dataset when it in fact only contained a single record!)
As long as we are talking about using URLs to "name" things (i.e., a
dataset has a URL, a record has a URL), I don't know that it would be a
huge problem.
> The
> specification could allow for a specific response to indicate this, but it
> is an additional layer of complexity. By including such elements we would be
> making BibJSON not just a serialization format, but also an API, and that
Interesting point -- a good serialization format (e.g., HTML or ATOM)
does give you an API "out of the box". By good serialization format,
I simply mean "has well-understood link semantics." It is an
attainable and useful goal for BibJSON.
> may not be something we want to do. I'm not sure.
> I'm sure that this problem has been addressed by others. Does anyone know
> how it has been solved in other cases?
If you are interested, there are two resources that I can recommend
pretty highly that address just the sort of challenges being faced
here.
RESTful Web Service Cookbook http://oreilly.com/catalog/9780596801694/
REST in Practice http://oreilly.com/catalog/9781449383169/
Looking at Atom/AtomPub is great as well (much of that is covered in
the Cookbook).
I suspect there may be an inclination to think that the unique
challenges faced here are in some way not aligned with the somewhat
more abstract or theoretical issues addressed by the REST folks. It's
my opinion that not only are the REST principles really quite useful,
but also that ignoring them now will limit the applicability,
flexibility, and robustness of the BibJSON specification. Conversely,
designing with REST principles in mind invariably leads to
serendipitous reuse and interoperability that is hard to predict from
this starting vantage point.
The nutshell of my take is this: First, it's important that
everything you may wish to interact with have a URL. (I'll note here
that we are mainly talking about GET operations, but when you start
being able to perform PUT, DELETE, and POST operations, things get
very interesting -- in a *good* way, fast). Second, when you need
more information about something (e.g., the URL of its parent), don't
look to the name itself, but rather "get" a representation and let
that representation answer your question.
Obviously, to make this all happen, the BibJSON format needs
well-defined link semantics (this is different than the "linkage"
document -- all BibJSON-related documents need to be able to express a
link and assert a particular relationship of that link to the current
document).
I hope this is all taken in the spirit intended (to be helpful!) --
--Peter Keane
On Wed, Mar 3, 2010 at 12:24 PM, Benjamin Kalish <bka...@gmail.com> wrote:
> Hi all,
> Thank you all for your input on this issue. I think the issue
> of determining dataset URL's may have been a bit of a red herring, and I'm
> afraid I got sidetracked by it. I think, what needs addressing is this:
> having come across a non-local reference in a BibJSON dataset, how can I
> tell if it refers to a record which may be retrieved, and if it may be
> retrieved, then how do I retrieve it? I imagine that retrieving such a
> record would require several steps:
> - Determine whether or not the record is retrievable. Maybe this is as
> simple as assuming it is retrievable unless a 404 or similar response is
> encountered?
agreed.
> - Determine the URL associated with the record. This may or may not be the
> dataset URL. Also, there should be a means of providing such information
> within the original BibJSON file—this need not be the only way, but dataset
> authors should be able to create datasets with references which can be
> resolved without requiring external name resolution services.
agreed.
> - Retrieve the resource associated with the URL. This should be as simple as
> making a request using the protocol specified in the URL.
agreed.
> - Examine the retrieved resource (which should be valid BibJSON but may
> contain other data in addition to the desired record) and extract the
> desired record. While I think it is reasonable to give the server the option
> to return either an individual record or the entire dataset it is essential
> that applications need not know which will be returned in advance, i.e. they
> should be able to find all the information they need by examining the
> response in some standard and efficient way.
yes, I agree. It might be useful to have some sort of "document type"
attribute that the client could check. In the world of Atom, it's
common to look at the root element to see if we are looking at an
atom:entry or an atom:feed.
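
Something like the following is what I have in mind, as a sketch only.
BibJSON has no such attribute today, so the "type" member and its
"record"/"dataset" values are hypothetical, as are the "records" and
"id" names used for the sample data.

```python
def extract_record(document: dict, record_id: str):
    """Find a record in a response that may be a record or a whole dataset."""
    doc_type = document.get("type")
    if doc_type == "record":
        # The server returned just one record; make sure it is the right one.
        return document if document.get("id") == record_id else None
    if doc_type == "dataset":
        # The server returned the whole dataset; pick the record out of it.
        for record in document.get("records", []):
            if record.get("id") == record_id:
                return record
        return None
    raise ValueError(f"unrecognized document type: {doc_type!r}")


single = {"type": "record", "id": "baz", "title": "One record"}
whole = {"type": "dataset", "records": [{"id": "foo"}, {"id": "baz"}]}

print(extract_record(single, "baz"))
print(extract_record(whole, "baz"))
```

The client calls the same function either way and never needs to know
in advance which kind of response the server will send.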
> Does this make sense? Is it in line with RESTful principles? (I don't have a
> lot of time to do research on REST right now.)
Makes sense to me, and I think it's in line w/ REST principles.
--peter