dataset ids and record ids


Benjamin Kalish

Mar 1, 2010, 8:15:43 AM
to bib...@googlegroups.com
Hey folks,

The BibJSON specification contains the following:

We have two kind of ids:

    1. ID of a Dataset
    2. ID of an record

The ID of an record is a partial ID local to the dataset. The ID of the dataset is the base ID used to create a complete reference ID of the records of the dataset. A full ID is created by concatenating the (base) ID of a dataset with the ID of an record. The full ID created has to be a valid URI.


This is pretty clear, but it does leave me with some questions. For example, does this mean that it would be allowable to have a dataset id of 'http://foo/bar' and a record id 'baz' giving a full ID of 'http://foo/barbaz'? This is, after all, a valid URI specified according to the stated rules. Following the current rules, if I understand them, in order to get the more sensibly formed 'http://foo/bar/baz' either the dataset id must have a trailing slash or the record id must begin with a slash. This strikes me as an awkward requirement since a trailing slash on a URL means the URL points to a directory, while a dataset may well be a single file, and it is useful to be able to import record ids unchanged from another source, such as a BibTeX file.
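
To make the concern concrete, here is a minimal Python sketch of the rule as I read it. The function name `full_id` is made up for illustration; the spec does not define one.

```python
def full_id(dataset_id: str, record_id: str) -> str:
    """Build a full ID by plain concatenation, as the quoted rule describes."""
    return dataset_id + record_id


# No separator is implied, so the result depends entirely on where the slash lives.
print(full_id("http://foo/bar", "baz"))    # http://foo/barbaz
print(full_id("http://foo/bar/", "baz"))   # http://foo/bar/baz
print(full_id("http://foo/bar", "/baz"))   # http://foo/bar/baz
```

All three results are valid URIs, which is why the rule alone does not steer authors toward the sensible form.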

Personally, I would prefer to use URL fragment identifiers, so that the 'baz' record could be referred to as 'http://foo/bar#baz'. This is possible according to the current specification, but would require that the record ids be prefixed with, or the dataset id be postfixed with, '#'. This is less problematic since 'http://foo/bar' and 'http://foo/bar#' are generally understood to be equivalent, but declaring the dataset id to be 'http://foo/bar#' seems strange.
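
For what it's worth, a fragment-based id can be taken apart with Python's standard library alone. This is only a sketch of the idea; the helper name `split_fragment` is hypothetical.

```python
from urllib.parse import urldefrag


def split_fragment(full_id: str):
    """Split a full ID of the form <dataset>#<record> into its two parts."""
    url, fragment = urldefrag(full_id)
    return url, fragment


print(split_fragment("http://foo/bar#baz"))  # ('http://foo/bar', 'baz')
print(split_fragment("http://foo/bar#"))     # ('http://foo/bar', '')
```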

What do other folks think? Am I misunderstanding the specification? Making a mountain out of a mole hill? Or is this something worth changing?

Benjamin M. Kalish

Benjamin Kalish

Mar 1, 2010, 8:32:14 AM
to bib...@googlegroups.com
Hey folks,

A little reflection suggests that my suggestion of using fragment identifiers was a poor one, since they are generally not seen by the server. The basic difficulty remains. Perhaps the solution is as simple as saying that the full id should be the concatenation of the dataset id, a slash, and the record id, with the caveat that the resulting string be normalized by removing any resulting double slashes?
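
A rough sketch of that rule in Python follows. The name `join_ids` is invented for illustration, and leaving the '//' after the scheme untouched is my own assumption about what "normalized" should mean here.

```python
import re


def join_ids(dataset_id: str, record_id: str) -> str:
    """Join with a slash, then collapse any doubled slashes that result."""
    joined = dataset_id + "/" + record_id
    # Assumption: the "//" that follows the scheme (as in "http://") must survive.
    scheme, separator, rest = joined.partition("://")
    if not separator:
        return re.sub(r"/{2,}", "/", joined)
    return scheme + separator + re.sub(r"/{2,}", "/", rest)


print(join_ids("http://foo/bar", "baz"))    # http://foo/bar/baz
print(join_ids("http://foo/bar/", "/baz"))  # http://foo/bar/baz
```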

Benjamin M. Kalish

Benjamin Kalish

Mar 1, 2010, 8:44:52 AM
to bib...@googlegroups.com
Hey folks,

Sorry for the multiple messages, but it occurs to me that a very desirable property of a record's full id is the ability to decompose it into a dataset id and a record id. This simply can't be done under the current specification because there is no unique delimiter separating the components. 'http://foo/bar/baz' could decompose into 'http://foo/' and 'bar/baz' as easily as into 'http://foo/bar/' and 'baz'. It could also decompose into 'http://foo/bar' and '/baz' or even 'http://foo/b' and 'ar/baz'!

This ambiguity could create some real problems!
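
To show how bad it gets, this small Python sketch enumerates every pair of strings that concatenate to a given full id. The function name is made up; nothing like it appears in the spec.

```python
def candidate_splits(full_id: str):
    """Yield every (dataset_id, record_id) pair whose concatenation is full_id."""
    for i in range(1, len(full_id)):
        yield full_id[:i], full_id[i:]


splits = list(candidate_splits("http://foo/bar/baz"))
print(len(splits))                         # 17
print(("http://foo/bar/", "baz") in splits)  # True
print(("http://foo/b", "ar/baz") in splits)  # True
```

Without a delimiter or some outside knowledge of the dataset id, nothing in the string itself favors one split over another.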

Benjamin M. Kalish
4 Lawn Ave, Apt 2L
Northampton, MA  01060-2221
Phone: 413-687-7738
Email: bka...@gmail.com

Peter Keane

Mar 1, 2010, 1:50:01 PM
to bib...@googlegroups.com
My instinct would be to avoid URL composition/decomposition, and (if
possible) look to the REST principle of using "hypertext as the
engine of application state." Ideally, the representation returned
from an HTTP GET on an item URL would include a link to (i.e., a URL
for) the dataset of which it was a part. Keeping URLs opaque in this
way can be a useful practice.
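
As a rough illustration in Python: the "links" array and the "up"
relation below are assumptions made for this sketch, not anything
BibJSON currently defines.

```python
import json

# Hypothetical record representation that carries its own links.
RECORD_JSON = """
{
  "id": "baz",
  "title": "An example record",
  "links": [
    {"rel": "self", "href": "http://foo/bar/baz"},
    {"rel": "up",   "href": "http://foo/bar"}
  ]
}
"""


def link_for(document: dict, rel: str):
    """Return the href of the first link with the given relation, or None."""
    for link in document.get("links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


record = json.loads(RECORD_JSON)
print(link_for(record, "up"))  # http://foo/bar
```

The client never parses the record URL; it asks the representation
where the dataset lives.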

--peter keane

Benjamin Kalish

Mar 1, 2010, 3:40:56 PM
to bib...@googlegroups.com
Hi Peter,

What you say makes sense for HTTP requests to a server that has been specially configured to return BibJSON data, but what if you have a full record URL and the dataset is served as a single BibJSON file, or the dataset will be used locally by reading the file from disk? In these cases it would be difficult to even find the dataset, since there is no way to determine the dataset URL from the record URL.

Benjamin M. Kalish

Jack Alves

Mar 2, 2010, 1:10:03 PM
to bib...@googlegroups.com
It would be an interesting feature to define a way to extract the dataset url from a record url but I don't think it is necessary. I'm not sure this is even a BibJSON specification issue. What we have called the "ref" id in a BibJSON dataset is not a permanent identifier and the dataset itself is portable. BKNPeople for sure is not committed to maintaining permanent identifiers. I think the more important issue is how we find a dataset url for a specific identifier. Jim raised this issue with me in a conversation yesterday.

I am saying all this without having deeply reviewed the BibJSON spec. I do remember a "linkage" feature and it seems like there should be some mechanism in linkage to map an id attribute to a url of a dataset associated with the id.

It is also interesting to look at existing dataset urls and see that each has a convention but there is no standard.

http://www.researcherid.com/rid/B-7723-2009
http://genealogy.math.ndsu.nodak.edu/id.php?id=30968
http://en.wikipedia.org/wiki/Albert_Einstein
http://openlibrary.org/a/OL3402689A/Bryan_Coombs
http://www.freebase.com/view/en/albert_einstein


Jack

Benjamin Kalish

Mar 2, 2010, 6:19:28 PM
to bib...@googlegroups.com
Hi Jack,

The specification says "If the value refers to a global ID, it means that this record can be local, or remote. If it is remote, its representation should be resolvable on the network specified by the URI." I read that as meaning that if the record is remote its full id should in fact be a URL, but perhaps I misunderstood.

I agree that the important issue is how we find a dataset url given an identifier. Full record id's need not be URLs, but the usefulness of a record reference will be severely restricted if there isn't some way to retrieve the record in question.

It occurs to me that there are two distinct use cases involving retrieving records:
  1. The dataset is available as a BibJSON file. The server can only respond with the file itself and retrieving an individual record requires client side processing.
  2. The dataset is available through some web application such as bib server. Such web applications may choose to make individual records available.
In my mind the BibJSON specification should be most concerned with the first use case, though it would be nice to have a consistent API for the second use case as well. If we wish to have a single scheme for URIs that can be used for both purposes then it is probably best to place the record id in the query string, e.g., "http://foo.bar/dataset?id=baz". If "http://foo.bar/dataset" points to a simple BibJSON file then the query string is ignored, and the file will be returned. If, however, "http://foo.bar/dataset" points to a web application, then it may use the query string to identify and retrieve the record in question.
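
Here is a minimal Python sketch of that query-string scheme. The parameter name "id" comes from the example above; the two function names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def record_url(dataset_url: str, record_id: str) -> str:
    """Build a record URL by putting the record id in the query string."""
    return dataset_url + "?" + urlencode({"id": record_id})


def split_record_url(url: str):
    """Recover (dataset_url, record_id) from a URL built by record_url."""
    parts = urlsplit(url)
    dataset_url = parts._replace(query="", fragment="").geturl()
    record_id = parse_qs(parts.query).get("id", [None])[0]
    return dataset_url, record_id


print(record_url("http://foo.bar/dataset", "baz"))
# http://foo.bar/dataset?id=baz
print(split_record_url("http://foo.bar/dataset?id=baz"))
# ('http://foo.bar/dataset', 'baz')
```

Unlike plain concatenation, the '?' gives an unambiguous place to split, and percent-encoding lets record ids contain slashes or spaces.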

If the above scheme were implemented it would then be necessary for applications to determine in some way whether the retrieved data represents an individual record or the entire dataset. (Imagine the confusion if an application were to accidentally cache a BibJSON file thinking it was an entire dataset when it in fact only contained a single record!) The specification could allow for a specific response to indicate this, but it is an additional layer of complexity. By including such elements we would be making BibJSON not just a serialization format, but also an API, and that may not be something we want to do. I'm not sure.

I'm sure that this problem has been addressed by others. Does anyone know how it has been solved in other cases?

Benjamin M. Kalish

Dr. Micah Altman

Mar 3, 2010, 9:59:52 AM
to bibjson
Concurrence, mostly.


> It would be an interesting feature to define a way to extract the dataset url from a record url but I don't think it is necessary.

Concur.

> I'm not sure this is even a BibJSON specification issue. What we have called the "ref" id in a BibJSON dataset is not a permanent identifier and the dataset itself is portable.

Concur.

> BKNPeople for sure is not committed to maintaining permanent identifiers. I think the more important

This is surprising. If the ids for people aren't persistent, then their value for disambiguation is quite limited.

Along similar lines, things that you wish people to cite or include in a citation benefit from a persistent id, e.g. annotated, authoritative, or hand-crafted bibliographies.

> issue is how we find a dataset url for a specific identifier. Jim raised this issue with me in a conversation yesterday.

Use a standard name resolution service? DOIs are the preferred solution where there is a higher budget for maintenance, handles where there is a low budget, possibly PURLs for some unusual situations.
 

Peter Keane

Mar 3, 2010, 12:16:18 PM
to bib...@googlegroups.com
Hi Benjamin-

At the risk of sounding soap-boxy, I wanted to express a few reactions
here. Note that my perspective is based on an understanding of the
principles behind REST-based architectures (
http://tomayko.com/writings/rest-to-my-wife ). I am of the opinion
that REST (the real principles, not the hype that you sometimes hear
about) offers the best way to evaluate the design of network-based
information systems (indeed, it describes the set of principles on
which the web itself is based).

In a nutshell, REST is based on a few basic tenets:

1. every important resource needs a name (on the web, that means a URL).
2. every such resource needs one or more representations (HTML is most
common, but XML, JSON, etc. are common as well).
3. there is a basic (uniform) set of operations that can be performed
on a resource (GET, PUT, DELETE, POST)
4. application state (i.e. the flow of operations) should be driven by
hypertext

Those might seem overly abstract, but it turns out they have very
useful real-world application, which I'll try to apply here. I think
that 1. and 4. are probably most important to the issues being
addressed here.

On Tue, Mar 2, 2010 at 5:19 PM, Benjamin Kalish <bka...@gmail.com> wrote:
> Hi Jack,
> The specification says "If the value refers to a global ID, it means that
> this record can be local, or remote. If it is remote, its representation
> should be resolvable on the network specified by the URI." I read that as
> meaning that if the record is remote its full id should in fact be a URL,

Note that an entity might very well have one or more "IDs". These
may/may not be the same as the "name" as understood by REST. Notably,
the Atom Syndication Format (which, along with its partner
specification Atom Publishing Protocol offers "best practice" RESTful
design) has an "ID" element, but it is NOT meant to be a URL (the spec
says: "Its content MUST be an IRI, as defined by [RFC3987]. Note that
the definition of "IRI" excludes relative references. Though the IRI
might use a dereferencable scheme, Atom Processors MUST NOT assume it
can be dereferenced.").

To get the "name" (in the REST sense) of an Atom Document, you look to
hypertext: there is a link relation called "self". It is the URL in
the href attribute of the link element with @rel=self that defines the
URL (i.e., network "name" of the document).
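
A small Python sketch of that lookup, using only the standard
library. The sample entry and its URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Invented sample entry: the id is an opaque IRI, the self link is the URL.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <title>Example entry</title>
  <link rel="self" href="http://example.org/entries/1"/>
  <link rel="alternate" href="http://example.org/entries/1.html"/>
</entry>"""


def self_link(atom_xml: str):
    """Return the href of the top-level link with rel="self", or None."""
    root = ET.fromstring(atom_xml)
    for link in root.findall("{%s}link" % ATOM_NS):
        if link.get("rel") == "self":
            return link.get("href")
    return None


print(self_link(ENTRY))  # http://example.org/entries/1
```

Note that the atom:id is never dereferenced; only the link is.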


> but perhaps I misunderstood.
> I agree that the important issue is how we find a dataset url given an
> identifier. Full record id's need not be URLs, but the usefulness of a
> record reference will be severely restricted if there isn't some way to
> retrieve the record in question.

I am not convinced that it is important to find a dataset url given an
identifier. While it is important that an ID be globally unique (like
the atom:id), thus allowing disambiguation, etc., suggesting that
it has some sort of semantics, or is itself a bit of "hypertext" from
which you can extract a URL, is probably a dangerous assumption. That
said, it is of course hugely important that we be able to find a
dataset given its URL *or* that we be able to find the dataset for a
given record when we have that record's URL. And that is where
hypertext comes in. In many instances, it is a dereference operation.
But often, it may require two operations: a dereference AND a look at
the retrieved document, the semantics of which should allow me to
extract the link I am seeking (based on pre-established link
relationships, e.g.,
http://www.iana.org/assignments/link-relations/link-relations.xhtml ).


> It occurs to me that there are two distinct use cases involving retrieving
> records:
>
> The dataset is available as a BibJSON file. The server can only respond with
> the file itself and retrieving an individual record requires client side
> processing.
> The dataset is available through some web application such as bib server.
> Such web applications may choose to make individual records available.
>

The client really should not be expected to know anything about the
HTTP server other than its address. Whether it is a web application
or a static file system is immaterial. The important thing is that it
adheres to the HTTP standard. Content negotiation is part of that
standard -- the client can send an accept header saying whether it
wants JSON or HTML or whatever. The server can say "here it is" or
"no I don't have that format" as it wishes. Whether individual
records are available or not should also be an "arm's length"
transaction (i.e., does not require the client to have special
knowledge of the server). If a record is not available, the server
simply responds with a 404 "Not Found" response. Alternatively, the
response to a request for the whole dataset should be in a format with
sufficient semantics (an Atom feed is a good example of this) to allow
the client to discover the URLs for individual items.
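
A sketch of such an arm's-length request in Python. The media type
"application/json" is an assumption (BibJSON has no registered type
that I know of), and the function names are made up.

```python
import urllib.error
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Ask for JSON via the Accept header; the server is free to decline."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch(url: str, opener=urllib.request.urlopen):
    """Return (status, body). Error statuses such as 404 are reported, not raised."""
    try:
        with opener(build_request(url)) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, None
```

The client code is the same whether the other end is a static file
or a web application; only the status code and body differ.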


> In my mind the BibJSON specification should be most concerned with the first
> use case, though it would be nice to have a consistent API for the second
> use case as well. If we wish to have a single scheme for URI's that can be
> used for both purposes then it is probably best to place the record in
> the query string, e.g "http://foo.bar/dataset?id=baz".
> If "http://foo.bar/dataset" points to a simple BibJSON file then the query
> string is ignored, and the file will be returned. If,
> however, "http://foo.bar/dataset" points to a web application, then it may
> use the query string to identify and retrieve the record in question.
> If the above scheme were implemented it would then be necessary for
> applications to determine in some way whether the retrieved data represents
> an individual record or the entire dataset.

This is where I think it would be very useful to step back and see how
this approach breaks the principles outlined in REST. This approach
effectively takes the individual records off the web. They are no
longer individually addressable, but must be interacted with by way of
a "tunnel" that goes through the parent (dataset) record. Of course,
if the individual records *did* still have their own URLs, this is less
problematic (we see query strings as a "filtering" mechanism in
REST-based systems with some regularity; although it's likely
something of a violation of the basic principles, in practice it can
be useful).

> (Imagine the confusion if an
> application were to accidentally cache a BibJSON file thinking it was an
> entire dataset when it in fact only contained a single record!)

As long as we are talking about using URLs to "name" things (i.e., a
dataset has a URL, a record has a URL), I don't know that it'd be a
huge problem.

> The
> specification could allow for a specific response to indicate this, but it
> is an additional layer of complexity. By including such elements we would be
> making BibJSON not just a serialization format, but also an API, and that

Interesting point -- a good serialization format (e.g., HTML or ATOM)
does give you an API "out of the box". By good serialization format,
I simply mean "has well-understood link semantics." It is an
attainable and useful goal for BibJSON.

> may not be something we want to do. I'm not sure.
> I'm sure that this problem has been addressed by others. Does anyone know
> how it has been solved in other cases?

If you are interested, there are two resources that I can recommend
pretty highly that address just the sort of challenges being faced
here.

RESTful Web Services Cookbook http://oreilly.com/catalog/9780596801694/
REST in Practice http://oreilly.com/catalog/9781449383169/

Looking at Atom/AtomPub is great as well (much of that is covered in
the Cookbook).

I suspect there may be an inclination to think that the unique
challenges faced here are in some way not aligned with the somewhat
more abstract or theoretical issues addressed by the REST folks. It's
my opinion that not only are the REST principles really quite useful,
but also that ignoring them now will limit the applicability,
flexibility, and robustness of the BibJSON specification. Conversely,
designing with REST principles in mind invariably leads to
serendipitous reuse and interoperability that is hard to predict from
this starting vantage point.

The nutshell of my take is this: First, it's important that
everything you may wish to interact with have a URL. (I'll note here
that we are mainly talking about GET operations, but when you start
being able to perform PUT, DELETE, and POST operations, things get
very interesting -- in a *good* way, fast). Second, when you need
more information about something (e.g., the URL of its parent), don't
look to the name itself, but rather "get" a representation and let
that representation answer your question.

Obviously, to make this all happen, the BibJSON format needs
well-defined link semantics (this is different than the "linkage"
document -- all BibJSON-related documents need to be able to express a
link and assert a particular relationship of that link to the current
document).

I hope this is all taken in the spirit intended (to be helpful!) --

--Peter Keane

Benjamin Kalish

Mar 3, 2010, 1:24:48 PM
to bib...@googlegroups.com
Hi all,

Thank you all for your input on this issue. I think the issue of determining dataset URLs may have been a bit of a red herring, and I'm afraid I got sidetracked by it. I think what needs addressing is this: having come across a non-local reference in a BibJSON dataset, how can I tell if it refers to a record which may be retrieved, and if it may be retrieved, then how do I retrieve it? I imagine that retrieving such a record would require several steps:

- Determine whether or not the record is retrievable. Maybe this is as simple as assuming it is retrievable unless a 404 or similar response is encountered?

- Determine the URL associated with the record. This may or may not be the dataset URL. Also, there should be a means of providing such information within the original BibJSON file—this need not be the only way, but dataset authors should be able to create datasets with references which can be resolved without requiring external name resolution services.

- Retrieve the resource associated with the URL. This should be as simple as making a request using the protocol specified in the URL.

- Examine the retrieved resource (which should be valid BibJSON but may contain other data in addition to the desired record) and extract the desired record. While I think it is reasonable to give the server the option to return either an individual record or the entire dataset it is essential that applications need not know which will be returned in advance, i.e. they should be able to find all the information they need by examining the response in some standard and efficient way.
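
The steps above could be sketched in Python roughly as follows. The key names "records" and "id" are assumptions about the document layout, and `fetch` is a caller-supplied function returning (status, body), so the sketch makes no claim about how the request itself is made.

```python
import json


def resolve(ref_url: str, record_id: str, fetch):
    """Return the referenced record as a dict, or None if it cannot be found."""
    # Steps 1-3: dereference the URL; treat any non-200 status as "not retrievable".
    status, body = fetch(ref_url)
    if status != 200:
        return None
    document = json.loads(body)
    # Step 4: the response may be a whole dataset or a single record.
    if isinstance(document.get("records"), list):
        candidates = document["records"]
    else:
        candidates = [document]
    for record in candidates:
        if record.get("id") == record_id:
            return record
    return None
```

The caller does not need to know in advance which kind of response it will get; it only needs the record id it is looking for.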

Does this make sense? Is it in line with RESTful principles? (I don't have a lot of time to do research on REST right now.)

Benjamin M. Kalish

Jack Alves

Mar 3, 2010, 5:07:22 PM
to bib...@googlegroups.com
It has been a while since I read the BibJSON spec, but I think the issue here is that we are putting too much of a burden on the url to represent itself. A schema should define an attribute. The "Uri" and "Url" are defined as primitive string types in the BibJSON spec. We probably need a richer type that allows for describing a url with additional attributes. So if a url represents a dataset, then the url should be in a type that describes the url as a dataset. If a schema wants to link to a dataset, it would use something like a "dataset" type: an object that contains a url to a dataset.
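
Something along these lines, sketched in Python. The class name `TypedUrl` and the field `kind` are invented here; the spec defines nothing like them today.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TypedUrl:
    """A url plus a statement of what it points at (hypothetical type)."""
    url: str
    kind: str  # e.g. "dataset" or "record"; the values are illustrative


# A record that links to its dataset through the richer type, not a bare string.
record = {"id": "baz", "dataset": asdict(TypedUrl("http://foo/bar", "dataset"))}
print(json.dumps(record))
```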

Peter Keane

Mar 5, 2010, 8:29:19 AM
to bib...@googlegroups.com
Hi Benjamin-


On Wed, Mar 3, 2010 at 12:24 PM, Benjamin Kalish <bka...@gmail.com> wrote:
> Hi all,
> Thank you all for your input on this issue. I think the issue
> of determining dataset URL's may have been a bit of a red herring, and I'm
> afraid I got sidetracked by it. I think, what needs addressing is this:
> having come across a non-local reference in a BibJSON dataset, how can I
> tell if it refers to a record which may be retrieved, and if it may be
> retrieved, then how do I retrieve it? I imagine that retrieving such a
> record would require several steps:
> - Determine whether or not the record is retrievable. Maybe this is as
> simple as assuming it is retrievable unless a 404 or similar response is
> encountered?

agreed.

> - Determine the URL associated with the record. This may or may not be the
> dataset URL. Also, there should be a means of providing such information
> within the original BibJSON file—this need not be the only way, but dataset
> authors should be able to create datasets with references which can be
> resolved without requiring external name resolution services.

agreed.

> - Retrieve the resource associated with the URL. This should be as simple as
> making a request using the protocol specified in the URL.

agreed.

> - Examine the retrieved resource (which should be valid BibJSON but may
> contain other data in addition to the desired record) and extract the
> desired record. While I think it is reasonable to give the server the option
> to return either an individual record or the entire dataset it is essential
> that applications need not know which will be returned in advance, i.e. they
> should be able to find all the information they need by examining the
> response in some standard and efficient way.

yes, I agree. It might be useful to have some sort of "document type"
attribute that the client could check. In the world of Atom, it's
common to look at the root element to see if we are looking at an
atom:entry or an atom:feed.
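
A sketch of what such a check might look like in Python. The "type"
attribute and its two values are assumptions for illustration only,
as is the "records" key.

```python
def records_in(document: dict) -> list:
    """Return the records a response carries, using an explicit type marker."""
    doc_type = document.get("type")
    if doc_type == "dataset":
        return list(document.get("records", []))
    if doc_type == "record":
        return [document]
    # Refuse to guess from the shape of the data.
    raise ValueError(f"unrecognized document type: {doc_type!r}")


print(len(records_in({"type": "dataset", "records": [{"id": "a"}, {"id": "b"}]})))  # 2
print(len(records_in({"type": "record", "id": "baz"})))  # 1
```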

> Does this make sense? Is it in line with RESTful principles? (I don't have a
> lot of time to do research on REST right now.)

Makes sense to me, and I think it's in line w/ REST principles.

--peter

Benjamin Kalish

Mar 5, 2010, 11:02:32 AM
to bib...@googlegroups.com
Excellent!

Now we need to figure out what additions or clarifications the spec might require to make this easy...

Does anyone have any suggestions?


Benjamin M. Kalish
4 Lawn Ave, Apt 2L
Northampton, MA  01060-2221
Phone: 413-687-7738
Email: bka...@gmail.com

