Some BibJSON datasets

Frederick Giasson

Dec 16, 2009, 10:15:02 AM
to bib...@googlegroups.com
Hi all,

As requested by Jim, here are some BibJSON datasets (latest spec). We
have two of them:

(1) The Oberwolfach dataset, which you can download from here:

http://people.bibkn.org/drupal/data/oberwolfach/A.bibjson
<http://people.bibkn.org/drupal/data/oberwolfach/X.bibjson>
http://people.bibkn.org/drupal/data/oberwolfach/B.bibjson
<http://people.bibkn.org/drupal/data/oberwolfach/Y.bibjson>
http://people.bibkn.org/drupal/data/oberwolfach/C.bibjson
<http://people.bibkn.org/drupal/data/oberwolfach/Z.bibjson>

....

http://people.bibkn.org/drupal/data/oberwolfach/X.bibjson
http://people.bibkn.org/drupal/data/oberwolfach/Y.bibjson
http://people.bibkn.org/drupal/data/oberwolfach/Z.bibjson

This dataset comes from here: http://owpdb.mfo.de/

(2) The AuthorClaims dataset, which can be downloaded from here:

http://people.bibkn.org/drupal/data/authorclaims/authorclaims.bibjson

The next dataset to be updated is the math genealogy project. I will post a
new email to this mailing list when it is available.

Thanks!

Take care,

Fred

Benjamin Kalish

Dec 16, 2009, 11:05:01 PM
to bib...@googlegroups.com
Hi Fred,

Conceptually, these files are all part of one dataset. The current spec says, under the heading of "The Dataset Concept", that "a dataset can be split into multiple dataset segments...Each segment of a dataset shares the same <id> of the dataset." I can see that you have done that here, but what isn't clear to me is how a user or application is supposed to know about the existence of these additional segments. Is this information provided in some way of which I am not aware? Or is this something we should consider adding to the specification?

Thanks!

Benjamin Kalish

Frederick Giasson

Dec 17, 2009, 8:48:24 AM
to bib...@googlegroups.com
Hi Benjamin!

> Conceptually, these files are all part of one dataset. The current
> spec says, under the heading of "The Dataset Concept", that "a dataset
> can be split into multiple dataset segments...Each segment of a
> dataset shares the same <id> of the dataset." I can see that you have
> done that here, but what isn't clear to me is how a user or application is
> supposed to know about the existence of these additional segments. Is
> this information provided in some way of which I am not aware? Or is
> this something we should consider adding to the specification?

Well, right now it is based on the Open World Assumption, which means
that the fact that you don't have access to something (or don't know it
exists) does not mean that it doesn't exist. However, it is true that we
could gain by adding one attribute to the description of the datasets:
something like "accessibleAt" or "instantiatedAt", which would link a
dataset description to one or more files that instantiate the
description of the records that belong to this dataset (important:
without changing the ID of that dataset).

So, we could end up with something like:

"dataset": {
"id": "http://people.bibkn.org/wsf/datasets/106/",
"prefLabel": "Oberwolfach Photo Collection ",
"description": "Oberwolfach Photo Collection ",
"prefURL": "http://owpdb.mfo.de/",
"instantiatedAt": [
"http://people.bibkn.org/drupal/data/oberwolfach/A.bibjson",
"http://people.bibkn.org/drupal/data/oberwolfach/B.bibjson",
"http://people.bibkn.org/drupal/data/oberwolfach/C.bibjson",
"http://people.bibkn.org/drupal/data/oberwolfach/D.bibjson"
],
"schema": "http://www.bibkn.org/drupal/bibjson/bibjson_schema.json",
"linkage": [
"http://www.bibkn.org/drupal/bibjson/oberwolfach_linkage.json",
"http://www.bibkn.org/drupal/bibjson/iron_linkage.json"
]
},

This new attribute (instantiatedAt) has to be seen as a convenient way
to describe information about a dataset.
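
For readers who want to consume such a description, here is a minimal Python sketch of collecting the segment URLs. It assumes the description is wrapped in a top-level JSON object and that "instantiatedAt" (a name proposed in this thread, not part of the spec) either holds one URL, holds a list of URLs, or appears as a repeated key. Python's json module keeps only the last value of a repeated key by default, so the sketch gathers repeats through object_pairs_hook. The function names and the example URLs are hypothetical.

```python
import json

SEGMENT_KEY = "instantiatedAt"  # name proposed in this thread; not part of the spec


def _keep_repeats(pairs):
    """Build a dict from one JSON object's pairs, gathering every SEGMENT_KEY value."""
    obj = {}
    for key, value in pairs:
        if key == SEGMENT_KEY:
            # Accept a single URL or a list of URLs, and keep repeated keys.
            values = value if isinstance(value, list) else [value]
            obj.setdefault(key, []).extend(values)
        else:
            obj[key] = value
    return obj


def segment_urls(text):
    """Return the segment URLs listed in the "dataset" description of a document."""
    doc = json.loads(text, object_pairs_hook=_keep_repeats)
    return doc.get("dataset", {}).get(SEGMENT_KEY, [])


# Hypothetical document in which the key is repeated.
example = """
{
  "dataset": {
    "id": "http://dataset/a/",
    "instantiatedAt": "http://dataset/a/datasetA_1.bibjson",
    "instantiatedAt": "http://dataset/a/datasetA_2.bibjson"
  }
}
"""
print(segment_urls(example))
```

Writing the URLs as a single JSON array avoids the problem entirely, since every parser then sees all of them.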


Is this what you meant?


Thanks!


Take care,


Fred

Benjamin Kalish

Dec 17, 2009, 5:06:43 PM
to bib...@googlegroups.com
So, is the Open World Assumption that "If it exists, it will be found"? It sounds like this is what you are saying. It also sounds very optimistic! Even if an agent can find such files, by checking appropriate indexes/search engines/services, I dislike the idea of requiring so much work when it could be avoided by a simple solution!

Your "instantiatedAt" proposal would get rid of this problem, although it could make for difficult-to-maintain datasets, since every file in a dataset would have to be updated if a single file were added or removed. I would prefer to see each file in the dataset point towards one common file which would in turn point towards each of the segments. Perhaps we could use the existing "metaFile" attribute in each segment, with each "metaFile" pointing towards a single file which would contain, in addition to other dataset metadata, a list of segments using the "instantiatedAt" attribute.

Benjamin Kalish

Frederick Giasson

Dec 18, 2009, 10:12:17 AM
to bib...@googlegroups.com
Hi Benjamin,

> So, is the Open World Assumption that "If it exists, it will be
> found"? It sounds like this is what you are saying. It also sounds very
> optimistic! Even if an agent can find such files, by checking
> appropriate indexes/search engines/services, I dislike the idea of
> requiring so much work when it could be avoided by a simple solution!

No, the Open World Assumption means that not knowing about something
does not mean that it doesn't exist (quite the opposite). It also means
that even if you have a list of X "dataset files", you cannot assume
that you have the entire set: there may be another file, somewhere, that
you don't have access to. It is just a framework to use so that you
never assume you know everything about something :)
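
To make the distinction concrete, here is a small Python sketch of an open-world lookup. The function name and the URLs are hypothetical, and returning None for "unknown" is just one possible convention. A closed-world lookup would answer False for anything not listed; under the Open World Assumption the only honest answers are "yes" and "unknown".

```python
def knows_segment(url, known_segments):
    """Open-world lookup: True if the URL is listed, None (unknown) otherwise.

    The absence of a URL from the list is not treated as proof that no
    such segment exists, so this never returns False.
    """
    return True if url in known_segments else None


# Hypothetical list of the segments we happen to know about.
known = {
    "http://dataset/a/datasetA_1.bibjson",
    "http://dataset/a/datasetA_2.bibjson",
}
print(knows_segment("http://dataset/a/datasetA_1.bibjson", known))  # True
print(knows_segment("http://dataset/a/datasetA_3.bibjson", known))  # None
```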

> Your "instantiatedAt" proposal would get rid of this problem, although
> it could make for difficult-to-maintain datasets, since every file in
> a dataset would have to be updated if a single file were added or
> removed. I would prefer to see each file in the dataset point towards
> one common file which would in turn point towards each of the
> segments. Perhaps we could use the existing "metaFile" attribute in
> each segment, with each "metaFile" pointing towards a single file
> which would contain, in addition to other dataset metadata, a list of
> segments using the "instantiatedAt" attribute.

I am glad that you carefully read the spec! I think this would be a good
use of the dataset's metaFile. You would still need an attribute such as
"instantiatedAt" to let this metaFile point to all the files. So, we
would have something like this:


Dataset A, file 1:
==================

{
"dataset": {
"id": "http://dataset/a/",
"metaFile": "http://dataset/a/metafile.bibjson"
}
}

==================


Dataset A, file 2:
==================

{
"dataset": {
"id": "http://dataset/a/",
"metaFile": "http://dataset/a/metafile.bibjson"
}
}

==================

Dataset A, Metafile:
==================

{
"dataset": {
"id": "http://dataset/a/",
"instantiatedAt": [
"http://dataset/a/datasetA_1.bibjson",
"http://dataset/a/datasetA_2.bibjson"
]
}
}

==================

Note: nobody should confuse the dataset ID and the location of the
dataset slice file.

So, as you suggest, you only have to maintain the metafile as the
dataset grows, instead of all the dataset files, which greatly
simplifies the task.
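
As an illustration of how a client might use this layout, here is a minimal Python sketch that starts from any one segment, follows its "metaFile", and returns the segments listed there. It assumes the metafile lists its segments under "instantiatedAt" (the provisional name used in this thread) as either a single URL or a list. The function name, the fetch parameter, and the in-memory stand-in for the web are all hypothetical; a real client would fetch the metafile over HTTP.

```python
import json


def list_segments(segment_text, fetch, segment_key="instantiatedAt"):
    """Follow a segment's metaFile and return the segment URLs listed there."""
    dataset = json.loads(segment_text)["dataset"]
    meta_url = dataset.get("metaFile")
    if meta_url is None:
        return []  # this segment does not point to a metafile
    meta = json.loads(fetch(meta_url))["dataset"]
    # The metafile must describe the same dataset: same ID, different location.
    if meta.get("id") != dataset.get("id"):
        raise ValueError("metaFile describes a different dataset")
    urls = meta.get(segment_key, [])
    return [urls] if isinstance(urls, str) else list(urls)


# In-memory stand-in for the web, so the sketch runs without network access.
FAKE_WEB = {
    "http://dataset/a/metafile.bibjson": json.dumps({
        "dataset": {
            "id": "http://dataset/a/",
            "instantiatedAt": [
                "http://dataset/a/datasetA_1.bibjson",
                "http://dataset/a/datasetA_2.bibjson",
            ],
        }
    })
}

segment = json.dumps({
    "dataset": {
        "id": "http://dataset/a/",
        "metaFile": "http://dataset/a/metafile.bibjson",
    }
})

print(list_segments(segment, FAKE_WEB.__getitem__))
```

Adding a segment then means editing only the metafile; the existing segment files stay untouched because they all point to the same "metaFile" URL.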

This is certainly something I would suggest doing. But I would think
about another attribute name than "instantiatedAt".

Is this what you had in mind?


Thanks!


Take care,


Fred

Benjamin Kalish

Dec 18, 2009, 1:33:39 PM
to bib...@googlegroups.com
Yes, this is exactly what I had in mind. And I agree that the "instantiatedAt" attribute could be better named. Perhaps "dataSegment" or "dataSegmentURL"?

Benjamin Kalish