
Representing dataset statistics


Alasdair J G Gray

Jan 29, 2014, 7:09:27 AM
Dear VoIDers

(Please cc answers to

I am involved in an activity within the W3C Healthcare and Life Sciences Interest Group to specify a set of guidelines for describing datasets. We are well on our way to finalising our community recommendation; you can find the working draft of our note at

I am particularly interested in gaining your insight on the dataset statistics that we are aiming to capture and represent using VoID. The following link will take you to the correct section of the document.

As you will see we go beyond the simple statistics that are represented in the VoID vocabulary. (The motivation for all these statistics has come from the experiences of the Bio2RDF community.) Specifically, are we violating the VoID predicates and linksets in the way we are using these constructs?

Best regards,


Alasdair J G Gray
Lecturer in Computer Science, Heriot-Watt University, UK.
Telephone: +44 131 451 3429
Twitter: @gray_alasdair

Richard Cyganiak

Jan 29, 2014, 9:05:14 AM

Just a quick reaction on the dataset statistics section.

Less is probably more there. Unless you have a very concrete need for the more complex constructs there (e.g., you have a federation framework that requires exactly those statistics), then I’d recommend sticking to the simplest constructs. If there is a particular number you want to include that cannot be expressed with a simple VoID property, it may be better to introduce a new property.

I say this because the more complex constructs (e.g., clever stuff with class and property partitions) tend to go unused and can be misleading.
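
To make "the simplest constructs" concrete, here is a minimal sketch of the dataset-level counts that plain VoID properties can carry (void:triples, void:distinctSubjects, void:properties, void:distinctObjects, void:classes). The toy triples, the `ex:` names, the dataset IRI and both helper functions are assumptions made up for illustration; none of them come from the HCLS note.

```python
RDF_TYPE = "rdf:type"

# Hypothetical toy data: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:drug1", RDF_TYPE, "ex:Drug"),
    ("ex:drug1", "ex:targets", "ex:protein1"),
    ("ex:drug2", RDF_TYPE, "ex:Drug"),
    ("ex:drug2", "ex:targets", "ex:protein1"),
    ("ex:protein1", RDF_TYPE, "ex:Protein"),
    ("ex:protein1", "rdfs:label", '"Protein one"'),
]


def simple_void_stats(triples):
    """Compute dataset-level counts that map onto simple VoID properties."""
    unique = set(triples)  # an RDF graph is a set, so drop duplicate triples
    return {
        "void:triples": len(unique),
        "void:distinctSubjects": len({s for s, _, _ in unique}),
        "void:properties": len({p for _, p, _ in unique}),
        "void:distinctObjects": len({o for _, _, o in unique}),
        # Here a "class" is anything used as the object of rdf:type.
        "void:classes": len({o for _, p, o in unique if p == RDF_TYPE}),
    }


def to_turtle(dataset_iri, stats):
    """Serialise the counts as one flat void:Dataset description in Turtle."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "",
        f"<{dataset_iri}> a void:Dataset ;",
    ]
    items = list(stats.items())
    for i, (prop, value) in enumerate(items):
        end = " ." if i == len(items) - 1 else " ;"
        lines.append(f"    {prop} {value}{end}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_turtle("http://example.org/dataset", simple_void_stats(TRIPLES)))
```

Everything here is one flat description with one number per property, so a consumer does not have to interpret nested partitions to read it.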



Michel Dumontier

Jan 29, 2014, 11:17:23 AM
Hi Richard,
  The use case is being driven to a large extent by our work to provide summary statistics for Bio2RDF datasets. You can see an example of a dataset page here:

and a view of all the datasets here

We want to provide specific recommendations on the structure of a dataset description, so that the descriptions are available both for download and for querying in an endpoint. That way people won't be querying our endpoint over and over with the same statistics-gathering queries.


Kjetil Kjernsmo

Jan 29, 2014, 2:30:20 PM
On Wednesday 29. January 2014 15.05.14 Richard Cyganiak wrote:
> Less is probably more there. Unless you have a very concrete need for the
> more complex constructs there (e.g., you have a federation framework that
> requires exactly those statistics), then I'd recommend sticking to the
> simplest constructs. If there is a particular number you want to include
> that cannot be expressed with a simple VoID property, it may be better to
> introduce a new property.
> I say this because the more complex constructs (e.g., clever stuff with
> class and property partitions) tend to go unused and can be misleading.

So, just a quick note from me too, as I'm doing some clever data profiling stuff
for my Ph.D. ;-) Most of the statistics proposed here are useful for
federation, as shown by Olaf Görlitz et al. in their SPLENDID paper. However,
as I'm computing them in my own code, I can only note that they are pretty
heavy to compute, and indeed, it is quite unlikely that people will do it
unless the data providers have a very compelling reason to.
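
As a rough illustration of both points (what a federation engine can do with the numbers, and why they cost something to produce), here is a minimal sketch of per-property statistics and a naive cardinality estimate built on them. The toy triples, the function names and the uniform-distribution estimate are assumptions for illustration only; this is not the algorithm from the SPLENDID paper.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:drug1", "ex:targets", "ex:protein1"),
    ("ex:drug2", "ex:targets", "ex:protein1"),
    ("ex:drug2", "ex:targets", "ex:protein2"),
    ("ex:drug1", "rdfs:label", '"Drug one"'),
]


def property_partitions(triples):
    """Per predicate: triple count, distinct subjects, distinct objects.

    This needs a full pass over the data and keeps every distinct
    subject and object per predicate in memory, which is where the
    cost comes from on large datasets.
    """
    counts = defaultdict(int)
    subjects = defaultdict(set)
    objects = defaultdict(set)
    for s, p, o in set(triples):
        counts[p] += 1
        subjects[p].add(s)
        objects[p].add(o)
    return {
        p: {
            "triples": counts[p],
            "distinctSubjects": len(subjects[p]),
            "distinctObjects": len(objects[p]),
        }
        for p in counts
    }


def estimate_cardinality(partitions, predicate,
                         subject_bound=False, object_bound=False):
    """Naive estimate of matches for a triple pattern with a fixed predicate.

    Assumes values are uniformly distributed, which real data rarely is.
    """
    part = partitions.get(predicate)
    if part is None:
        return 0.0  # predicate does not occur: this source can be skipped
    estimate = float(part["triples"])
    if subject_bound:
        estimate /= part["distinctSubjects"]
    if object_bound:
        estimate /= part["distinctObjects"]
    return estimate


if __name__ == "__main__":
    parts = property_partitions(TRIPLES)
    # Pattern: ?drug ex:targets ex:protein1
    print(estimate_cardinality(parts, "ex:targets", object_bound=True))
```

A predicate that is absent from the partitions tells the engine it can skip the source entirely, and the estimates give it something to order joins by.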

I've seen that in the last few days, Philip Stutz has been implementing
cardinality caching in their TripleRush triple store. That's one case where it
is likely that such statistics can be provided, since they become much more
affordable to compute. See

Another case where they are likely to exist is when the statistics are used
for internal optimizations.

For all others, I think the key is to argue for *why* a certain piece of
information is important to expose, keeping in mind that it is possibly
demanding to produce. Just an IG recommendation is unlikely to suffice, I
suspect; it would have to be of the form "to enable $foo, expose $bar".



Michel Dumontier

Jan 29, 2014, 2:48:15 PM
One way to understand the contents of a dataset is to determine the connectivity between the different elements of the data. One such way is to indicate which object properties or datatype properties are linked to the types. Another way is to show how different types are connected to one another (and which relation is used to connect them). From this you can list them in tables or develop graphical overviews.
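
The connectivity described above can be sketched as a count of (subject type, property, object type) combinations. This is only a minimal illustration: the toy triples, the `ex:` names and the function are hypothetical, and it is not how the Bio2RDF statistics are actually produced.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"

# Hypothetical toy data: (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:drug1", RDF_TYPE, "ex:Drug"),
    ("ex:drug2", RDF_TYPE, "ex:Drug"),
    ("ex:protein1", RDF_TYPE, "ex:Protein"),
    ("ex:drug1", "ex:targets", "ex:protein1"),
    ("ex:drug2", "ex:targets", "ex:protein1"),
    ("ex:protein1", "rdfs:label", '"Protein one"'),
]


def type_connectivity(triples):
    """Count how often a property links instances of one type to another.

    Keys are (subject type, property, object type). The object type is
    None when the object is a literal or has no rdf:type in the data,
    which covers the datatype-property case.
    """
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    links = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for subject_type in types.get(s, ()):
            if o in types:
                for object_type in types[o]:
                    links[(subject_type, p, object_type)] += 1
            else:
                links[(subject_type, p, None)] += 1
    return links


if __name__ == "__main__":
    # Print one row per link, most frequent first.
    for (s_type, prop, o_type), n in type_connectivity(TRIPLES).most_common():
        print(s_type, prop, o_type or "(literal/untyped)", n)
```

Each key can be read as one row of a table, or as one labelled edge in a graphical overview of the dataset.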


You received this message because you are subscribed to the Google Groups "void-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
For more options, visit

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group

Kjetil Kjernsmo

Jan 29, 2014, 4:34:44 PM
to, Michel Dumontier, HCLS IG
On Wednesday 29. January 2014 20.48.15 Michel Dumontier wrote:
> one way to understand the contents of a dataset is to determine the
> connectivity between the different elements of the data. One such way is to
> indicate what object properties or what datatype properties are linked to
> the types. another way is to show show different types are connected to one
> another (and which relation is used to connect them). from this you can
> list them in tables or develop graphical overviews.

Right! I'm not the one who needs to be convinced, nor are the IG members, I
suspect. :-) The audience that needs to be convinced is the random publishers
of lifesci data, but they have a cost-benefit analysis to make, and for things
as heavy as this, you need to be very, very convincing about the benefits; the
costs are likely to be quite apparent.


Michel Dumontier

Jan 29, 2014, 4:38:51 PM
ah yes, we will be expanding our note to include these kinds of statements :)


