DataCube Consuming Tools/Services

Leigh Dodds

Oct 8, 2012, 8:55:04 AM

to publishing-st...@googlegroups.com

Hi,

I'm currently doing a piece of work that will be looking at how best to surface some data using the DataCube vocabulary. To this end I'm interested in speaking to implementers about their needs. I'm interested to find out what tools and/or services are available for consuming Data Cube data. If you know of, or are working on such a tool, I'd be grateful if you could drop me a line.

I've already spoken to a few people about this but wanted to ask more publicly to try and find good examples.

Some things I'm interested in:

* If the data is available as Linked Data, then is there an expectation that a SPARQL endpoint will be provided too?

* Is a SPARQL endpoint more important than Linked Data (e.g. because tools need to perform arbitrary queries, rather than follow-your-nose traversal)?

* Might the Linked Data API offer a means for exposing richer access to data, without (direct) recourse to a SPARQL endpoint?

* Or are data dumps really the preferred option?

Clearly there are a lot of different factors that can influence the answers to those questions, but I'm interested to know what approaches people are initially taking.

Cheers,

L.

Keith Alexander

Oct 8, 2012, 1:07:58 PM

to publishing-st...@googlegroups.com

Hi Leigh,

I have created some linked data statistical datasets and tried
to think of useful approaches, but I don't think I've ever written
an app that consumed Data Cube data. So these are just my opinions
from that perspective:

>
> * If the data is available as Linked Data, then is there an expectation that
> a SPARQL endpoint will be provided too?

The verbosity of RDF for representing multi-dimensional data makes me
think that RDF stores might not be the most efficient way of storing
and querying large volumes of stats, so I think expectation of a
SPARQL endpoint might be lower (though it's possible something like
D2RQ might work better?).

When (at Talis) we had to produce a 'triplification' of the European
Central Bank stats for the LATC project for instance, we decided that
doing a complete conversion and putting it into a triple store would
be completely impractical, even with the resources we had available to
us. We hit upon a hybrid approach of making the metadata about the
datasets (and perhaps some geo data) available through SPARQL + Linked
Data API, and making URIs of the datasets themselves deref to a
script that fetched the original data and converted it to RDF on the
fly.
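A minimal sketch of the "convert on dereference" idea, assuming the original data is a simple CSV with `area`, `period` and `value` columns. The `http://example.org/stats/` namespace, the column names and the dimension/measure property URIs are all hypothetical; only the `qb:` terms come from the Data Cube vocabulary. This is not the script used for the ECB data, just an illustration of the shape such a script might take.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
# Hypothetical namespace; a real deployment would use its own URIs.
BASE = "http://example.org/stats/"


def row_to_ntriples(dataset_id, row):
    """Turn one CSV row (a dict) into N-Triples lines for a single qb:Observation.

    Assumes the cell values are already safe to embed in URIs and literals.
    """
    area, period, value = row["area"], row["period"], row["value"]
    obs = f"<{BASE}{dataset_id}/obs/{area}/{period}>"
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        f"{obs} <{QB}dataSet> <{BASE}{dataset_id}> .",
        f"{obs} <{BASE}dimension/area> <{BASE}area/{area}> .",
        f'{obs} <{BASE}dimension/period> "{period}" .',
        f'{obs} <{BASE}measure/value> "{value}"^^<{XSD_DECIMAL}> .',
    ]


def convert_on_request(dataset_id, source_csv):
    """Convert the source CSV text to N-Triples at request time; nothing is stored."""
    reader = csv.DictReader(io.StringIO(source_csv))
    lines = []
    for row in reader:
        lines.extend(row_to_ntriples(dataset_id, row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(convert_on_request("exr", "area,period,value\nDE,2011,1.39\n"))
```

In a real service the handler for a dataset URI would fetch the source file and return the output of something like `convert_on_request`, while the dataset metadata stays in the triple store.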

> * Is a SPARQL endpoint more important than Linked Data (e.g. because tools
> need to perform arbitrary queries, rather than follow-your-nose traversal)
> * Might the Linked Data API offer a means for exposing richer access to
> data, without (direct) recourse to a SPARQL endpoint

Basic Linked Data (dereferencing the CBDs of all the observations, for
instance) could be pretty tiresome.
SPARQL is more efficient in that regard, but it would be nice to
combine the follow-your-nose discovery of Linked Data with SPARQL's
ability to slice in different ways and calculate aggregates.
The Linked Data API can let you provide these different views, but, for
statistical data, it would be nice to also have calculated
properties: the average, max and min values for a slice, for instance, or
the value of the area dimension with the highest/lowest value for a given
measure property and time period - that kind of thing.
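To make the "calculated properties" idea concrete, here is a rough sketch of the SPARQL 1.1 queries a publisher could run behind such a view. It assumes the cube uses the common `sdmx-dimension:refArea`, `sdmx-dimension:refPeriod` and `sdmx-measure:obsValue` properties and that periods are identified by URIs; the function names and the example URIs are hypothetical.

```python
PREFIXES = "\n".join([
    "PREFIX qb: <http://purl.org/linked-data/cube#>",
    "PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>",
    "PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>",
])


def _observation_pattern(dataset_uri, period_uri):
    """Triple pattern matching all observations of one dataset for one period."""
    return "\n".join([
        "  ?obs a qb:Observation ;",
        f"       qb:dataSet <{dataset_uri}> ;",
        f"       sdmx-dimension:refPeriod <{period_uri}> ;",
        "       sdmx-dimension:refArea ?area ;",
        "       sdmx-measure:obsValue ?value .",
    ])


def slice_summary_query(dataset_uri, period_uri):
    """Average, minimum and maximum of the measure across all areas in a period."""
    return "\n".join([
        PREFIXES,
        "SELECT (AVG(?value) AS ?avg) (MIN(?value) AS ?min) (MAX(?value) AS ?max)",
        "WHERE {",
        _observation_pattern(dataset_uri, period_uri),
        "}",
    ])


def top_area_query(dataset_uri, period_uri):
    """The area with the highest measure value in a period."""
    return "\n".join([
        PREFIXES,
        "SELECT ?area ?value",
        "WHERE {",
        _observation_pattern(dataset_uri, period_uri),
        "}",
        "ORDER BY DESC(?value)",
        "LIMIT 1",
    ])


if __name__ == "__main__":
    print(slice_summary_query("http://example.org/dataset/d1",
                              "http://example.org/period/2011"))
```

Swapping `DESC` for `ASC` in `top_area_query` would give the lowest-valued area instead.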

At root, statistics are interesting because they let you compare
things, so I think there's a lot of value in trying to surface and
facilitate those comparisons in their publication.

Best

Keith

BillRoberts

Oct 8, 2012, 1:44:56 PM

to publishing-st...@googlegroups.com

Hi Leigh

When publishing data cube data, I generally have followed the 'flattened' structure and usually not made use of slices. The main reason for this is I've always been providing a SPARQL endpoint with the data, and then you can easily produce whatever slice you want using SPARQL - and as you point out, by including every dimension with every observation, that makes the structure of SPARQL queries simpler than if you use slices to reduce redundancy.
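A small illustration of why the flattened layout keeps slicing simple: when every observation carries every dimension, a slice is just a filter on fixed dimension values. The sketch below uses plain Python dicts as a stand-in for observations; the dimension names and values are made up. In SPARQL the same thing amounts to fixing those dimension values in the triple pattern.

```python
# Hypothetical flattened observations: every dimension appears on every row.
OBSERVATIONS = [
    {"area": "A1", "period": "2010", "sex": "F", "value": 10.0},
    {"area": "A1", "period": "2011", "sex": "F", "value": 12.0},
    {"area": "A2", "period": "2011", "sex": "F", "value": 7.5},
    {"area": "A2", "period": "2011", "sex": "M", "value": 8.0},
]


def select_slice(observations, **fixed):
    """Return the observations whose dimensions match every fixed value."""
    return [
        obs for obs in observations
        if all(obs.get(dim) == val for dim, val in fixed.items())
    ]


if __name__ == "__main__":
    # Fix period and sex, leaving area free: a one-dimensional slice.
    for obs in select_slice(OBSERVATIONS, period="2011", sex="F"):
        print(obs["area"], obs["value"])
```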

A minor problem - not a big deal - with using slices is that if you provide a CBD when describing resources, then since the recommendation is to link each slice to each observation in that slice, the CBD of a slice can sometimes be very large.

Cheers

Bill

BillRoberts

Oct 8, 2012, 2:26:33 PM

to publishing-st...@googlegroups.com

And when consuming linked data to make visualisations, I've used SPARQL to pull out the relevant data to put into maps, histograms etc.

The Linked Data API would also be pretty easy in most cases, though if a SPARQL endpoint was available too, I would use SPARQL by preference.

But I'm already a SPARQL convert, and am not typical of the data-using public at large!

For the DCLG Open Data Communities site, which has lots of data cube data in it (and offers a SPARQL endpoint), we get quite a lot of enquiries asking how to get the data as CSV. Using APIs and/or SPARQL is something that a lot of data users are not familiar with, so finding effective ways to support those users is definitely worth thinking about.
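One low-cost way to serve the CSV audience is to flatten SELECT results into a table. The sketch below converts a document in the standard SPARQL JSON results format into CSV using only the Python standard library; the function name and the sample data are hypothetical, and it is not a description of how the DCLG site works.

```python
import csv
import io
import json


def sparql_json_to_csv(results_json):
    """Flatten a SPARQL JSON results document into CSV text.

    Columns follow head.vars; unbound variables become empty cells.
    """
    data = json.loads(results_json)
    columns = data["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in data["results"]["bindings"]:
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()


if __name__ == "__main__":
    sample = json.dumps({
        "head": {"vars": ["area", "value"]},
        "results": {"bindings": [
            {"area": {"type": "uri", "value": "http://example.org/area/A1"},
             "value": {"type": "literal", "value": "12.5"}},
        ]},
    })
    print(sparql_json_to_csv(sample))
```

Only the `value` of each binding is kept, so datatypes and language tags are dropped, which is usually what spreadsheet users expect.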

Sarven Capadisli

Oct 8, 2012, 2:48:23 PM

to publishing-st...@googlegroups.com

On 2012-10-08 14:55, Leigh Dodds wrote:
> * If the data is available as Linked Data, then is there an expectation
> that a SPARQL endpoint will be provided too?

If the statistical data at hand tends to be on the "heavy" side, and if
that's generally acknowledged by its consumers, it may be okay to not
have that expectation. After all, given the store size and available
system resources, it is not always feasible to return results within a
reasonable amount of time for non-trivial queries anyway.

On the other hand, I think the expectation should be there because it
opens up the possibility to do federated queries on the fly. I think
this is fairly important for data comparisons and visualizations.

> * Is a SPARQL endpoint more important than Linked Data (e.g. because
> tools need to perform arbitrary queries, rather than follow-your-nose
> traversal)

I don't think so. Tools are just one type of user. As a base case, all
things of importance should have an HTTP URI with an appropriate
representation for humans and machines. If I have to take one or the
other, I would most certainly take the Linked Data because I can point
at it, or grab it as dumps.

> * Might the Linked Data API offer a means for exposing richer access to
> data, without (direct) recourse to a SPARQL endpoint
> * Or are data dumps really the preferred option.

I don't see it as either/or. Dumps should be there because I see that as
a best practice, and it probably helps to cut down on others using my
resources, especially if they can do things better with the data on
their end, for whatever purpose.

-Sarven

Dave Reynolds

Oct 8, 2012, 4:11:38 PM

to publishing-st...@googlegroups.com

On 08/10/12 13:55, Leigh Dodds wrote:
> Hi,
>
> I'm currently doing a piece of work that will be looking at how best to
> surface some data using the DataCube vocabulary. To this end I'm
> interested in speaking to implementers about their needs. I'm interested
> to find out what tools and/or services are available for consuming Data
> Cube data. If you know of, or are working on such a tool, I'd be
> grateful if you could drop me a line.

As mentioned on Twitter, we (Ian) did prototype a generic cube explorer
some time ago but haven't developed it beyond the early prototype stage.

For the cube visualizations we've done on projects we've typically
rendered them client side with custom code, fetching the data
as JSON from an LDA endpoint. That's how the bathing water detail pages
like [1] are done.

> I've already spoken to a few people about this but wanted to ask more
> publicly to try and find good examples.
>
> Some things I'm interested in:
>
> * If the data is available as Linked Data, then is there an expectation
> that a SPARQL endpoint will be provided too?

Ideally but you can get a long way with a good LDA spec, especially if
you define slices - which in turn makes it easier for developers to get
CSV or JSON views of whole slices as well.

> * Is a SPARQL endpoint more important than Linked Data (e.g. because
> tools need to perform arbitrary queries, rather than follow-your-nose
> traversal)

See above - combo of slices and LDA works well and you can then filter
down LDA style. In fact arguably in cube browsing you do more filtering
than joining so LDA is a particularly good match.
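As a rough illustration of "filtering down LDA style", the sketch below builds a request URL for a list endpoint, using property short names as filter parameters, a `min-` prefix for a lower bound, `_pageSize` for paging and a `.json` suffix for the format. The endpoint URL and the short names `refArea` and `refPeriod` are hypothetical and depend entirely on how a given LDA spec is configured.

```python
from urllib.parse import urlencode


def lda_slice_url(base, filters, page_size=50, fmt="json"):
    """Build a Linked Data API style URL that filters a list endpoint.

    `filters` maps parameter names (property short names, optionally with
    prefixes such as "min-") to the values they should match.
    """
    params = dict(filters)
    params["_pageSize"] = page_size
    return f"{base}.{fmt}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint and property short names.
    print(lda_slice_url(
        "http://example.org/api/observations",
        {"refArea": "E07000001", "min-refPeriod": "2010"},
        page_size=10,
    ))
```

A client-side chart would then fetch this URL and plot the returned items, with no joins needed.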

> * Might the Linked Data API offer a means for exposing richer access to
> data, without (direct) recourse to a SPARQL endpoint

[Should have read ahead :)] Yes.

In SDMX there is now a standard for REST-style access to SDMX data
resources. I haven't looked at the full spec, but an early talk I went to
on it made it look very flexible and pretty compatible with a well
fleshed out LDA spec.

> * Or are data dumps really the preferred option.

We took the view that you want dumps as well but the common need was to
project out a slice of the data for graphing and that's what LDA enables.

Dave

Benedikt Kämpgen

Oct 9, 2012, 1:41:58 PM

to publishing-st...@googlegroups.com

Hi Leigh,

> consuming Data Cube data. If you know of, or are working on such a
> tool, I'd be grateful if you could drop me a line.

We are working on an OLAP engine [1] that translates OLAP queries into SPARQL 1.1. It is still in alpha. Although it could in theory run directly against a SPARQL endpoint via HTTP, we currently mostly use the typical data warehousing architecture: we (semi-)automatically crawl by follow-your-nose traversal, or download dumps of, all the RDF Data Cubes we want to integrate. We then load this data into a client-side triple store for further pre-processing and eventual analysis.

A high-performance SPARQL endpoint and federated query engine are convenient, but I think this data warehousing approach is the most common.

Thus:

> * If the data is available as Linked Data, then is there an
> expectation that a SPARQL endpoint will be provided too?

No, not for me at least.

> * Is a SPARQL endpoint more important than Linked Data (e.g. because
> tools need to perform arbitrary queries, rather than
> follow-your-nose traversal)

No, dereferenceable URIs are more important.

> * Or are data dumps really the preferred option.

Dereferenceable URIs probably should not return gigabytes of RDF; instead, the data should be spread more evenly over the Linked Data source, e.g. by using slices, more granular datasets or data structure definitions, and data dumps (especially if they are pointed to via standard properties such as those provided by VoID).

Best,

Benedikt

[1] <http://code.google.com/p/olap4ld/>

--
AIFB, Karlsruhe Institute of Technology (KIT)
Phone: +49 721 608-47946
Email: benedikt...@kit.edu
Web: http://www.aifb.kit.edu/web/Hauptseite/en
