Representing dataset statistics

Alasdair J G Gray

Jan 29, 2014, 7:09:27 AM

to void-di...@googlegroups.com, HCLS IG

Dear VoIDers

(Please cc answers to public-sem...@w3.org)

I am involved in an activity within the W3C Healthcare and Life Sciences Interest Group to specify a set of guidelines for describing datasets. We are well on our way to finalising our community recommendation; you can find the working draft of our note at

https://docs.google.com/document/d/1zGQJ9bO_dSc8taINTNHdnjYEzUyYkbjglrcuUPuoITw/edit#

I am particularly interested in gaining your insight on the dataset statistics that we are aiming to capture and represent using VoID. The following link will take you to the correct section of the document.

https://docs.google.com/document/d/1zGQJ9bO_dSc8taINTNHdnjYEzUyYkbjglrcuUPuoITw/edit#heading=h.citrvpsndja

As you will see, we go beyond the simple statistics that are represented in the VoID vocabulary. (The motivation for all these statistics has come from the experiences of the Bio2RDF community.) Specifically, are we misusing the VoID predicates and linksets in the way we are applying these constructs?

Best regards,

Alasdair

Alasdair J G Gray

Lecturer in Computer Science, Heriot-Watt University, UK.

Email: A.J.G...@hw.ac.uk

Web: http://www.macs.hw.ac.uk/~ajg33

ORCID: http://orcid.org/0000-0002-5711-4872

Telephone: +44 131 451 3429

Twitter: @gray_alasdair

Arrange a Meeting: http://doodle.com/agray

--

PLEASE NOTE: There may be a delay in me dealing with your email as I am participating in UCU industrial action by ‘working to contract’ in support of the union’s campaign for fair pay in higher education.

For more info go here www.ucu.org.uk/hepay13

Sunday Times Scottish University of the Year 2011-2013
Top in the UK for student experience
Fourth university in the UK and top in Scotland (National Student Survey 2012)

We invite research leaders and ambitious early career researchers to join us in leading and driving research in key inter-disciplinary themes. Please see www.hw.ac.uk/researchleaders for further information and how to apply.

Heriot-Watt University is a Scottish charity registered under charity number SC000278.

Richard Cyganiak

Jan 29, 2014, 9:05:14 AM

to void-di...@googlegroups.com, HCLS IG

Alasdair,

Just a quick reaction on the dataset statistics section.

Less is probably more there. Unless you have a very concrete need for the more complex constructs there (e.g., you have a federation framework that requires exactly those statistics), then I’d recommend sticking to the simplest constructs. If there is a particular number you want to include that cannot be expressed with a simple VoID property, it may be better to introduce a new property.

I say this because the more complex constructs (e.g., clever stuff with class and property partitions) tend to go unused and can be misleading.

Best,
Richard


Michel Dumontier

Jan 29, 2014, 11:17:23 AM

to void-di...@googlegroups.com, HCLS IG

Hi Richard,

The use case is being driven to a large extent by our work to provide summary statistics for Bio2RDF datasets. You can see an example of a dataset page here:

http://download.bio2rdf.org/release/2/drugbank/drugbank.html

and a view of all the datasets here

http://download.bio2rdf.org/release/2/release.html

We want to provide specific recommendations on the structure of a dataset description, so that the descriptions are available both for download and for querying in an endpoint. That way, people won't be querying our endpoint over and over with the same statistics-gathering queries.
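As a rough illustration of the "compute once, publish the result" idea, here is a minimal Python sketch that derives a few simple counts in a single pass over a set of triples. The toy data, the `ex:` names and the `summarise` helper are all hypothetical; the `void:` keys are only labels borrowed from the VoID vocabulary, and this is not how the Bio2RDF statistics are actually produced.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical toy data: (subject, predicate, object) triples as prefixed names.
TRIPLES = [
    ("ex:drug1", "rdf:type", "ex:Drug"),
    ("ex:drug2", "rdf:type", "ex:Drug"),
    ("ex:target1", "rdf:type", "ex:Target"),
    ("ex:drug1", "ex:hasTarget", "ex:target1"),
    ("ex:drug2", "ex:hasTarget", "ex:target1"),
    ("ex:drug1", "rdfs:label", '"Aspirin"'),
]


def summarise(triples):
    """Compute a handful of simple VoID-style counts in one pass."""
    unique = set(triples)          # an RDF graph is a set, so drop duplicates
    subjects, objects = set(), set()
    per_property = Counter()       # triples per predicate
    per_class = Counter()          # typed instances per class
    for s, p, o in unique:
        subjects.add(s)
        objects.add(o)
        per_property[p] += 1
        if p == RDF_TYPE:
            per_class[o] += 1
    return {
        "void:triples": len(unique),
        "void:distinctSubjects": len(subjects),
        "void:distinctObjects": len(objects),
        "void:properties": len(per_property),
        "void:classes": len(per_class),
        "propertyPartition": dict(per_property),
        "classPartition": dict(per_class),
    }


stats = summarise(TRIPLES)
```

The resulting dictionary could then be serialised once as a dataset description and served as a static file, rather than recomputed by every client.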

m.

Kjetil Kjernsmo

Jan 29, 2014, 2:30:20 PM

to void-di...@googlegroups.com, HCLS IG

On Wednesday 29. January 2014 15.05.14 Richard Cyganiak wrote:
> Less is probably more there. Unless you have a very concrete need for the
> more complex constructs there (e.g., you have a federation framework that
> requires exactly those statistics), then I'd recommend sticking to the
> simplest constructs. If there is a particular number you want to include
> that cannot be expressed with a simple VoID property, it may be better to
> introduce a new property.
>
> I say this because the more complex constructs (e.g., clever stuff with
> class and property partitions) tend to go unused and can be misleading.

So, just a quick note from me too, as I'm doing some clever data profiling stuff
for my PhD. ;-) Most of the statistics proposed here are useful for
federation, as shown by Olaf Görlitz et al. in their SPLENDID paper. However,
as I'm computing them in my own code, I can only note that they are pretty heavy
to compute, and indeed, it is quite unlikely that people will do so unless the
data providers have a very compelling reason.

I've seen that in the last few days, Philip Stutz has been implementing
cardinality caching in the TripleRush triple store. That's one case where it
is likely that such statistics can be provided, since it becomes much more
affordable to do. See https://github.com/uzh/triplerush

Another case where they are likely to exist is when the statistics are used for
internal optimizations.

For all others, I think the key is to argue for *why* a certain piece of
information is important to expose, keeping in mind that it is possibly
demanding to produce. I suspect that an IG recommendation alone is unlikely to
suffice; it would have to be of the form "to enable $foo, expose $bar".

Cheers,

Kjetil

Michel Dumontier

Jan 29, 2014, 2:48:15 PM

to void-di...@googlegroups.com, HCLS IG

One way to understand the contents of a dataset is to determine the connectivity between the different elements of the data. One such way is to indicate which object properties or datatype properties are linked to the types. Another way is to show how different types are connected to one another (and which relation is used to connect them). From this you can list them in tables or develop graphical overviews.
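To make the type-connectivity idea concrete, here is a minimal Python sketch that counts, for every (subject type, relation, object type) combination, how many triples connect instances of those types. The toy data, the `ex:` names and the `type_links` helper are hypothetical and only meant to illustrate the kind of summary described above.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (subject, predicate, object) triples as prefixed names.
TRIPLES = [
    ("ex:drug1", "rdf:type", "ex:Drug"),
    ("ex:drug2", "rdf:type", "ex:Drug"),
    ("ex:target1", "rdf:type", "ex:Target"),
    ("ex:drug1", "ex:hasTarget", "ex:target1"),
    ("ex:drug2", "ex:hasTarget", "ex:target1"),
    ("ex:drug1", "rdfs:label", '"Aspirin"'),
]


def type_links(triples, type_predicate="rdf:type"):
    """Count triples linking typed subjects to typed objects."""
    unique = set(triples)
    # First pass: record the declared types of every resource.
    types = defaultdict(set)
    for s, p, o in unique:
        if p == type_predicate:
            types[s].add(o)
    # Second pass: count (subject type, relation, object type) combinations.
    links = Counter()
    for s, p, o in unique:
        if p == type_predicate:
            continue
        for s_type in types.get(s, ()):
            for o_type in types.get(o, ()):
                links[(s_type, p, o_type)] += 1
    return links


links = type_links(TRIPLES)
```

Each key in `links` corresponds to one row of a connectivity table or one labelled edge in a graphical overview; untyped objects such as the label literal contribute nothing.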

m.

--

Michel Dumontier

Associate Professor of Medicine (Biomedical Informatics), Stanford University

Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group

http://dumontierlab.com

Kjetil Kjernsmo

Jan 29, 2014, 4:34:44 PM

to void-di...@googlegroups.com, Michel Dumontier, HCLS IG

On Wednesday 29. January 2014 20.48.15 Michel Dumontier wrote:
> one way to understand the contents of a dataset is to determine the
> connectivity between the different elements of the data. One such way is to
> indicate what object properties or what datatype properties are linked to
> the types. another way is to show how different types are connected to one
> another (and which relation is used to connect them). from this you can
> list them in tables or develop graphical overviews.

Right! I'm not the one who needs to be convinced, nor are the IG members, I
suspect. :-) The audience that needs to be convinced is the random publishers
of lifesci data, but they have a cost-benefit analysis to make, and for things
as heavy as this, you need to be very, very convincing about the benefits; the
costs are likely to be quite apparent.

Kjetil

Michel Dumontier

Jan 29, 2014, 4:38:51 PM

to void-di...@googlegroups.com, HCLS IG

ah yes, we will be expanding our note to include these kinds of statements :)

m.

