Some BibJSON datasets

Frederick Giasson

Dec 16, 2009, 10:15:02 AM

to bib...@googlegroups.com

Benjamin Kalish

Dec 16, 2009, 11:05:01 PM

to bib...@googlegroups.com

Hi Fred,

Conceptually, these files are all part of one dataset. The current spec says, under the heading of "The Dataset Concept", that "a dataset can be split into multiple dataset segments...Each segment of a dataset shares the same <id> of the dataset." I can see that you have done that here, but what isn't clear to me is how a user or application is supposed to know about the existence of these additional segments. Is this information provided in some way of which I am not aware? Or is this something we should consider adding to the specification?

Thanks!

Benjamin Kalish

Frederick Giasson

Dec 17, 2009, 8:48:24 AM

to bib...@googlegroups.com

Hi Benjamin!

> Conceptually, these files are all part of one dataset. The current
> spec says, under the heading of "The Dataset Concept", that "a dataset
> can be split into multiple dataset segments...Each segment of a
> dataset shares the same <id> of the dataset." I can see that you have
> done that here, but what isn't clear to me is how a user or application is
> supposed to know about the existence of these additional segments. Is
> this information provided in some way of which I am not aware? Or is
> this something we should consider adding to the specification?

Well, right now it is based on the Open World Assumption, which means
that just because you don't have access to something (or don't know it
exists), it doesn't follow that it doesn't exist. However, it is true
that we could gain by adding one attribute to the description of the
datasets: something like "accessibleAt" or "instantiatedAt", which
would link a dataset description to one or multiple files that
instantiate the description of the records that belong to this dataset
(important: without changing the ID of that dataset).

So, we could end up with something like:

"dataset": {
    "id": "http://people.bibkn.org/wsf/datasets/106/",
    "prefLabel": "Oberwolfach Photo Collection",
    "description": "Oberwolfach Photo Collection",
    "prefURL": "http://owpdb.mfo.de/",
    "instantiatedAt": "http://people.bibkn.org/drupal/data/oberwolfach/A.bibjson",
    "instantiatedAt": "http://people.bibkn.org/drupal/data/oberwolfach/B.bibjson",
    "instantiatedAt": "http://people.bibkn.org/drupal/data/oberwolfach/C.bibjson",
    "instantiatedAt": "http://people.bibkn.org/drupal/data/oberwolfach/D.bibjson",
    "schema": "http://www.bibkn.org/drupal/bibjson/bibjson_schema.json",
    "linkage": [
        "http://www.bibkn.org/drupal/bibjson/oberwolfach_linkage.json",
        "http://www.bibkn.org/drupal/bibjson/iron_linkage.json"
    ]
},

This new attribute (instantiatedAt) has to be seen as a convenient way
to describe information about a dataset.
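
One practical caveat with writing "instantiatedAt" several times in the same object: JSON parsers are not required to keep every occurrence of a repeated key. The following is a minimal Python sketch of that behaviour, assuming the standard json module; the helper name collect_repeats is hypothetical and the fragment is shortened from the example above.

```python
import json

# Shortened version of the dataset description above, with the
# "instantiatedAt" key repeated once per segment file.
text = """
{
  "dataset": {
    "id": "http://people.bibkn.org/wsf/datasets/106/",
    "instantiatedAt": "http://people.bibkn.org/drupal/data/oberwolfach/A.bibjson",
    "instantiatedAt": "http://people.bibkn.org/drupal/data/oberwolfach/B.bibjson"
  }
}
"""


def collect_repeats(pairs):
    """Hypothetical helper: build a dict, gathering repeated keys into a list."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # Keys seen once keep their plain value; repeated keys become a list.
    return {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}


# Default parsing: Python's json module keeps only the last repeated key.
plain = json.loads(text)

# With object_pairs_hook, every occurrence is preserved.
merged = json.loads(text, object_pairs_hook=collect_repeats)

print(plain["dataset"]["instantiatedAt"])
print(merged["dataset"]["instantiatedAt"])
```

With default parsing only the last segment survives, so a consumer that does not install such a hook would see a single file. Writing the attribute once with a JSON array value, as "linkage" already does, would avoid the ambiguity.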

Is this what you meant?

Thanks!

Take care,

Fred

Benjamin Kalish

Dec 17, 2009, 5:06:43 PM

to bib...@googlegroups.com

So, is the Open World Assumption that "If it exists, it will be found"? It sounds like this is what you are saying. It also sounds very optimistic! Even if an agent can find such files, by checking appropriate indexes/search engines/services, I dislike the idea of requiring so much work when it could be avoided by a simple solution!

Your "instantiatedAt" proposal would get rid of this problem, although it could make for difficult-to-maintain datasets, since every file in a dataset would have to be updated if a single file were added or removed. I would prefer to see each file in the dataset point towards one common file which would in turn point towards each of the segments. Perhaps we could use the existing "metaFile" attribute in each segment, with each "metaFile" pointing towards a single file which would contain, in addition to other dataset metadata, a list of segments using the "instantiatedAt" attribute.

Benjamin Kalish
4 Lawn Ave, Apt 2L
Northampton, MA 01060-2221
Phone: 413-687-7738
Email: bka...@gmail.com

Frederick Giasson

Dec 18, 2009, 10:12:17 AM

to bib...@googlegroups.com

Hi Benjamin,

> So, is the Open World Assumption that "If it exists, it will be
> found"? It sounds like this is what you are saying. It also sounds very
> optimistic! Even if an agent can find such files, by checking
> appropriate indexes/search engines/services, I dislike the idea of
> requiring so much work when it could be avoided by a simple solution!

No, the open world assumption means that just because you don't
know about something, it doesn't follow that it doesn't exist (quite the
opposite). This also means that even if you have a list of X "dataset
files", it doesn't mean you have the entire set of files: there may be
another one, somewhere, that you don't have access to. This is just a
framework to use so that you never assume you know everything about
something :)

> Your "instantiatedAt" proposal would get rid of this problem, although
> it could make for difficult-to-maintain datasets, since every file in
> a dataset would have to be updated if a single file were added or
> removed. I would prefer to see each file in the dataset point towards
> one common file which would in turn point towards each of the
> segments. Perhaps we could use the existing "metaFile" attribute in
> each segment, with each "metaFile" pointing towards a single file
> which would contain, in addition to other dataset metadata, a list of
> segments using the "instantiatedAt" attribute.

I am glad that you carefully read the spec! I think this would be a good
use of the dataset metaFiles. In fact, you would still need an
attribute such as "instantiatedAt" to let this metaFile point to all
files. So, we would have something like this:

Dataset A, file 1:
==================

{
    "dataset": {
        "id": "http://dataset/a/",
        "metaFile": "http://dataset/a/metafile.bibjson"
    }
}

==================

Dataset A, file 2:
==================

{
    "dataset": {
        "id": "http://dataset/a/",
        "metaFile": "http://dataset/a/metafile.bibjson"
    }
}

==================

Dataset A, Metafile:
==================

{
    "dataset": {
        "id": "http://dataset/a/",
        "instantiatedAt": "http://dataset/a/datasetA_1.bibjson",
        "instantiatedAt": "http://dataset/a/datasetA_2.bibjson"
    }
}

==================

Note: the dataset ID should not be confused with the location of a
dataset segment file.

So, as you suggest, you only have to maintain the metafile as the dataset
grows, instead of all the dataset files, which greatly simplifies the task.
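
As a rough illustration of how a client could use this layout, here is a minimal Python sketch that starts from one segment, follows its "metaFile", and returns the segments listed there. Several things are assumptions made for the example: the FILES dictionary stands in for fetching URLs over HTTP, the function names load and known_segments are hypothetical, and the metafile lists its segments as a JSON array under "instantiatedAt" rather than repeating the key.

```python
import json

# Hypothetical in-memory stand-in for the web: URL -> file content.
# A real client would fetch these URLs over HTTP instead.
FILES = {
    "http://dataset/a/datasetA_1.bibjson": json.dumps({
        "dataset": {
            "id": "http://dataset/a/",
            "metaFile": "http://dataset/a/metafile.bibjson",
        }
    }),
    "http://dataset/a/datasetA_2.bibjson": json.dumps({
        "dataset": {
            "id": "http://dataset/a/",
            "metaFile": "http://dataset/a/metafile.bibjson",
        }
    }),
    "http://dataset/a/metafile.bibjson": json.dumps({
        "dataset": {
            "id": "http://dataset/a/",
            # Assumption: segments are listed as an array.
            "instantiatedAt": [
                "http://dataset/a/datasetA_1.bibjson",
                "http://dataset/a/datasetA_2.bibjson",
            ],
        }
    }),
}


def load(url):
    """Parse the JSON document stored at url (simulated fetch)."""
    return json.loads(FILES[url])


def known_segments(segment_url):
    """Follow a segment's metaFile and return the segment URLs it lists."""
    dataset = load(segment_url)["dataset"]
    meta_url = dataset.get("metaFile")
    if meta_url is None:
        # No metafile: this segment is the only one we know about.
        return [segment_url]
    meta = load(meta_url)["dataset"]
    if meta["id"] != dataset["id"]:
        raise ValueError("metafile describes a different dataset")
    listed = meta.get("instantiatedAt", [])
    # Accept either a single URL string or a list of URLs.
    return [listed] if isinstance(listed, str) else list(listed)


segments = known_segments("http://dataset/a/datasetA_1.bibjson")
print(segments)
```

Adding a segment then only means editing the metafile. In keeping with the open world assumption, the returned list is the set of segments the metafile knows about, not a guarantee that no others exist.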

This is certainly something I would suggest doing, but I would look
for a different attribute name than "instantiatedAt".

Is this what you had in mind?

Thanks!

Take care,

Fred

Benjamin Kalish

Dec 18, 2009, 1:33:39 PM

to bib...@googlegroups.com

Yes, this is exactly what I had in mind. And I agree that the "instantiatedAt" attribute could be better named. Perhaps "dataSegment" or "dataSegmentURL"?

Benjamin Kalish
