Guidance around dataset autodiscovery


Leigh Dodds

May 29, 2012, 7:15:04 AM
to void-di...@googlegroups.com
Hi,

I'd like to get some feedback on how best to use the dataset autodiscovery aspects of VoID. Some issues and questions are outlined below.

Firstly, it took me a little while to find the appropriate documentation, as some of the older docs are still very visible in Google. You might want to consider adding some redirects to point people to the latest versions. That would be very helpful!

I'm assuming that the latest advice can be found at:

http://www.w3.org/TR/void/#well-known

The documentation suggests that all datasets should be described in the document served by the well known URI. The implication (at least to me) is the document also contains a full description of each dataset. The guidance also notes that the dataset description document should reference each dataset via a foaf:primaryTopic or foaf:topic relationship.

I'm thinking of using this mechanism in a couple of different scenarios:

1. Where there is a single dataset for a domain. This is trivial

2. Where there are several datasets for a domain, which may or may not come from different publishers and which may not be subsets. I'm assuming here that multiple foaf:topic relationships to "top level" datasets is the expected behaviour? 

The alternative would be to synthesize a new "top level" dataset which has all the others as subsets. However, as the datasets may have no relationship to one another (other than a common publisher), asserting a subset relationship doesn't seem correct to me.

3. Where there are a large number of datasets for a domain, e.g. Kasabi. In this case the auto-discovery document could get very large if a full description is included of each dataset. 

I'm wondering on the third point specifically, whether the Dataset Description document needs to contain a full description of each dataset. Might a simplified description, e.g. type and label, be sufficient?

It might help to document expected application behaviour when processing the document. If I just want to present a list of datasets to a user a simplified description might be sufficient. 

Should consuming applications be expected to follow their nose to find more information about a dataset?

Michael also mentioned DCAT on Twitter but I'm not sure how that fits into this scenario.

Cheers,

L.

Richard Cyganiak

May 29, 2012, 2:33:03 PM
to void-di...@googlegroups.com
Hi Leigh,

On 29 May 2012, at 12:15, Leigh Dodds wrote:
> I'm assuming that the latest advice can be found at:
>
> http://www.w3.org/TR/void/#well-known

Yes.
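
For anyone finding this thread later: that section of the VoID note places the description document at /.well-known/void on the host. A minimal Python sketch of deriving that URI from any URL on the same host is below. The function name void_well_known_uri and the example.org URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Path defined for VoID autodiscovery in the "well-known" section of the VoID note.
WELL_KNOWN_VOID_PATH = "/.well-known/void"


def void_well_known_uri(any_url_on_host):
    """Return the VoID well-known URI for the host that serves any_url_on_host.

    Hypothetical helper: it keeps the scheme and host (with port, if any)
    and drops the original path, query and fragment.
    """
    parts = urlsplit(any_url_on_host)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL such as http://example.org/x")
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN_VOID_PATH, "", ""))


if __name__ == "__main__":
    print(void_well_known_uri("http://example.org/data/people?page=2#me"))
```

A client would then issue a normal HTTP GET against the returned URI and parse whatever RDF comes back.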

> The documentation suggests that all datasets should be described in the document served by the well known URI. The implication (at least to me) is the document also contains a full description of each dataset. The guidance also notes that the dataset description document should reference each dataset via a foaf:primaryTopic or foaf:topic relationship.
>
> I'm thinking of using this mechanism in a couple of different scenarios:
>
> 1. Where there is a single dataset for a domain. This is trivial
>
> 2. Where there are several datasets for a domain, which may or may not come from different publishers and which may not be subsets. I'm assuming here that multiple foaf:topic relationships to "top level" datasets is the expected behaviour?

Yes.

> The alternative would be to synthesize a new "top level" dataset which has all the others as subsets. However, as the datasets may have no relationship to one another (other than a common publisher), asserting a subset relationship doesn't seem correct to me.

There are no constraints regarding what can go into a shared dataset, so this wouldn't be wrong either. But if you feel there's nothing really connecting the datasets that you want to express using VoID, then it's perhaps better to leave them as separate datasets and link them all to the dataset description document using foaf:topic.
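
To make the "separate datasets, each linked by foaf:topic" option concrete, here is a rough Python sketch that writes such a description document as Turtle. The helper name describe_datasets and the example.org URIs are assumptions for illustration only. The void: and foaf: terms are the standard vocabulary terms, and "<>" stands for the description document itself.

```python
VOID = "http://rdfs.org/ns/void#"
FOAF = "http://xmlns.com/foaf/0.1/"


def describe_datasets(dataset_uris):
    """Build Turtle for a void:DatasetDescription with one foaf:topic per dataset.

    Hypothetical helper: no synthetic "top level" dataset and no void:subset
    links are created, so the datasets stay unrelated to one another.
    """
    uris = list(dataset_uris)
    if not uris:
        raise ValueError("at least one dataset URI is required")
    lines = [
        f"@prefix void: <{VOID}> .",
        f"@prefix foaf: <{FOAF}> .",
        "",
        "<> a void:DatasetDescription ;",
    ]
    # One foaf:topic statement per dataset, joined into a single subject block.
    topics = [f"    foaf:topic <{uri}>" for uri in uris]
    lines.append(" ;\n".join(topics) + " .")
    lines.append("")
    # Type each dataset so a client can recognise it as a VoID dataset.
    for uri in uris:
        lines.append(f"<{uri}> a void:Dataset .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(describe_datasets(["http://example.org/a", "http://example.org/b"]))
```

Further properties for each dataset can be appended to the same document as needed.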

> 3. Where there are a large number of datasets for a domain, e.g. Kasabi. In this case the auto-discovery document could get very large if a full description is included of each dataset.
>
> I'm wondering on the third point specifically, whether the Dataset Description document needs to contain a full description of each dataset. Might a simplified description, e.g. type and label, be sufficient?

This is not a case we have really considered when designing the mechanism, so to be honest your guess is as good as mine.

In practice a simplified description may be all that is possible.
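
As an illustration of what a "type and label" entry could look like, here is a small Python sketch that emits one minimal Turtle entry per dataset. Treat it as one possible shape rather than anything the VoID note prescribes: the function names are invented here, and using rdfs:label for the label is an assumption (dcterms:title would be another reasonable choice).

```python
VOID = "http://rdfs.org/ns/void#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def turtle_literal(text):
    """Quote text as a Turtle string literal, escaping the characters that need it."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def minimal_entries(datasets):
    """Return Turtle with only a type and a label for each (uri, label) pair.

    Hypothetical helper: the full description of each dataset is assumed
    to live elsewhere, for example at the dataset URI itself.
    """
    lines = [f"@prefix void: <{VOID}> .", f"@prefix rdfs: <{RDFS}> .", ""]
    for uri, label in datasets:
        lines.append(f"<{uri}> a void:Dataset ;")
        lines.append(f"    rdfs:label {turtle_literal(label)} .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(minimal_entries([("http://example.org/a", 'The "A" data')]))
```

With two triples per dataset, the document grows linearly but slowly, which is the point of the simplified form for catalogues with many entries.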

> It might help to document expected application behaviour when processing the document. If I just want to present a list of datasets to a user a simplified description might be sufficient.
>
> Should consuming applications be expected to follow their nose to find more information about a dataset?

Both of these questions apply whenever we find a description of some resource in an RDF file, right? So it's not specific to VoID really? VoID provides a way of saying things about datasets. It's clients that make the rules regarding processing.
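
To show what "clients make the rules" might mean in practice, here is a hedged Python sketch of one possible client policy. Nothing in it comes from the VoID note: the function name urls_to_fetch and the want_details flag are invented for illustration. A client that only needs a list fetches nothing further, while one that wants details follows its nose by dereferencing each dataset URI (minus any fragment).

```python
from urllib.parse import urldefrag


def urls_to_fetch(dataset_uris, want_details):
    """Decide which documents a client would retrieve for more information.

    Hypothetical policy: return nothing when a plain listing is enough,
    otherwise return each dataset URI with its fragment removed, without
    duplicates and in first-seen order.
    """
    if not want_details:
        return []
    documents = (urldefrag(uri).url for uri in dataset_uris)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(documents))


if __name__ == "__main__":
    uris = [
        "http://example.org/void.ttl#a",
        "http://example.org/void.ttl#b",
        "http://example.org/c",
    ]
    print(urls_to_fetch(uris, want_details=True))
```

Two datasets identified by fragments of the same document result in a single request, which matters when the description lists many datasets.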

> Michael also mentioned DCAT on Twitter but I'm not sure how that fits into this scenario.

VoID is for describing individual RDF datasets. DCAT is for describing collections (catalogs) of datasets in whatever format. DCAT is not aligned in any way with VoID autodiscovery, though.

Best,
Richard



