Data Dumps at source.data.gov.uk


Leigh Dodds

Aug 24, 2010, 8:52:53 AM
to uk-government-data-developers
Hi,

I've just put together an initial set of data dumps for the majority
of the Linked Data currently being published by data.gov.uk. More
information on what's not included and why in a moment.

(Disclaimer: what follows is my understanding of the current state of
play, so blame me for any errors/omissions :)


THE REPOSITORY

There is a server at http://source.data.gov.uk which has been set up
to provide access to both data dumps and (eventually) the code used to
generate/convert the data. The data dumps can be found at:

http://source.data.gov.uk/data/

The intention is to create a repository of versioned datasets that
will allow anyone to mirror the data for their own use/purposes, e.g.
to perform local analysis or to host in your own triple store. Over
time this repository should become a complete archival copy of all of
the Linked Data that is published through data.gov.uk, complete with
information on the provenance of individual datasets.

The team behind data.gov.uk are still working through a number of the
best practices, so right now I've simply put up copies of all the
currently live datasets.


HOW THE DATA IS ORGANISED

The web archive is organised into a series of sub-directories:

* Sector - top-level sector, e.g. as used in *.data.gov.uk
* Dataset - dataset directory, named with a short identifier for the
dataset. I've made some of these up at present
* Date-stamped directory - in the format yyyy-mm-dd
* Data files - there may be a number of data files in different
formats. E.g. the data may span a number of small files; some files may
be N-Triples for loading into the default graph and some files may be
N-Quads.
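
As a rough illustration of the N-Triples/N-Quads distinction above, here is
a minimal Python sketch that guesses how a file should be loaded from its
extension. The `.nt` and `.nq` extensions are the conventional ones for
these formats, but whether the dumps use them is an assumption, and the
file names and the `load_target` helper are hypothetical.

```python
from pathlib import PurePosixPath


def load_target(filename: str) -> str:
    """Guess how a dump file should be loaded, based on its extension."""
    # suffixes also copes with compressed names such as "data.nt.gz"
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    if ".nq" in suffixes:
        return "named graphs"   # N-Quads: each statement can name its graph
    if ".nt" in suffixes:
        return "default graph"  # N-Triples: no graph information in the file
    return "unknown"


# Hypothetical file names, for illustration only
for name in ["edubase-part-001.nt", "edubase-meta.nq", "README.txt"]:
    print(name, "->", load_target(name))
```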

For example, the RDF version of Edubase currently available from
http://education.data.gov.uk can be found here:

http://source.data.gov.uk/data/education/edubase/2009-08-14/

with the general pattern being:

http://source.data.gov.uk/data/[sector]/[dataset]/[timestamp]/
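
The pattern above can be sketched in Python as follows. This is only an
illustration: the `dump_url` helper is a hypothetical name, and it assumes
the timestamp segment is always a plain yyyy-mm-dd date, as in the Edubase
example.

```python
from datetime import date

BASE = "http://source.data.gov.uk/data"


def dump_url(sector: str, dataset: str, stamp: date) -> str:
    """Build the directory URL for one date-stamped copy of a dataset."""
    # date.isoformat() yields yyyy-mm-dd, matching the directory naming
    return f"{BASE}/{sector}/{dataset}/{stamp.isoformat()}/"


print(dump_url("education", "edubase", date(2009, 8, 14)))
# -> http://source.data.gov.uk/data/education/edubase/2009-08-14/
```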

Currently only the latest versions of each dataset are being loaded
into the live SPARQL endpoints, but over time there will be a move
towards using named graphs for versioning (as described at [1]).


LINKED DATA, DATA DUMPS & SERVICES

The sector identifier ties together the Linked Data, the data dumps,
and the SPARQL endpoints and other services. For example if you're
looking at some Linked Data, e.g.:

http://education.data.gov.uk/id/school/100866

Then this data will be included in the SPARQL endpoint at:

http://services.data.gov.uk/education/sparql

The search interface at:

http://services.data.gov.uk/education/search

And the raw data can be found in one (or more) of the datasets accessible from:

http://source.data.gov.uk/data/education/
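
To make the mapping concrete, here is a small Python sketch that derives
the three locations from a Linked Data URI. The `related_locations` helper
is hypothetical, and it assumes every sector follows the same URL layout as
the education examples above, which may not hold for every subdomain.

```python
from urllib.parse import urlparse

SUFFIX = ".data.gov.uk"


def related_locations(linked_data_uri: str) -> dict:
    """Derive the SPARQL, search and dump locations from a Linked Data URI."""
    host = urlparse(linked_data_uri).hostname or ""
    if not host.endswith(SUFFIX):
        raise ValueError(f"not a *.data.gov.uk URI: {linked_data_uri}")
    sector = host[: -len(SUFFIX)]
    if not sector or "." in sector:
        raise ValueError(f"cannot determine sector from host: {host}")
    return {
        "sector": sector,
        "sparql": f"http://services.data.gov.uk/{sector}/sparql",
        "search": f"http://services.data.gov.uk/{sector}/search",
        "dumps": f"http://source.data.gov.uk/data/{sector}/",
    }


info = related_locations("http://education.data.gov.uk/id/school/100866")
for key, value in info.items():
    print(key, value)
```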


WHAT IS NOT INCLUDED?

As I explained at the start of this email, not all of the Linked Data
being published from data.gov.uk, or by the UK government, is currently
represented in these data dumps.

The RDF available from legislation.gov.uk is currently only
available as Linked Data because it's surfaced directly from the
website. Ditto the RDFa published on the London Gazette website.
It would be possible to regularly crawl and dump those sources,
but I'm not sure if there are plans to do that yet. Other departments
and projects may also surface their own data and data dumps.

The other dataset that is not represented in the dumps is the set of
date-time URIs available from reference.data.gov.uk, e.g. [2], as
these are all algorithmically generated. I don't recommend anyone
crawls those :)
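
To show why a dump of these makes little sense, here is a minimal Python
sketch that builds a day URI for any date. It assumes the URI shape seen in
[2] holds for every day; the `day_uri` helper is a hypothetical name and is
not the code that runs on reference.data.gov.uk.

```python
from datetime import date


def day_uri(d: date) -> str:
    """Build the day URI for a date, following the shape of example [2]."""
    return f"http://reference.data.gov.uk/id/day/{d.isoformat()}"


# Any date yields a URI, so the full set has no practical end to crawl
print(day_uri(date(2010, 9, 24)))
# -> http://reference.data.gov.uk/id/day/2010-09-24
```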

Any questions then please ask.

Cheers,

L.

[1]. http://www.jenitennison.com/blog/node/141
[2]. http://reference.data.gov.uk/id/day/2010-09-24

--
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh...@talis.com
http://www.talis.com

Feargal Hogan

Aug 24, 2010, 9:21:14 AM
to uk-government-...@googlegroups.com
Leigh
Are you going to wiki that info?
Thks

Feargal Hogan


Leigh Dodds

Aug 24, 2010, 9:32:43 AM
to uk-government-...@googlegroups.com
On 24 August 2010 14:21, Feargal Hogan <fea...@thehoganfamily.info> wrote:
> Leigh
> Are you going to wiki that info?

I would, but I don't seem to be able to edit the wiki, despite signing in:

You do not have permission to edit this page, for the following reason:

The action you have requested is limited to users in the group: Users.

Cheers,

L.

Kingsley Idehen

Aug 24, 2010, 9:53:33 AM
to uk-government-...@googlegroups.com, Leigh Dodds
On 8/24/10 8:52 AM, Leigh Dodds wrote:
> Hi,
>
> I've just put together an initial set of data dumps for the majority
> of the Linked Data currently being published by data.gov.uk. More
> information on what's not included and why in a moment.
>
> (Disclaimer: what follows is my understanding of the current state of
> play, so blame me for any errors/omissions :)
>
>
> THE REPOSITORY
>
> There is a server at http://source.data.gov.uk which has been set up
> to provide access to both data dumps and (eventually) the code used to
> generate/convert the data. The data dumps can be found at:
>
> http://source.data.gov.uk/data/
>
> The intention is to create a repository of versioned datasets that
> will allow anyone to mirror the data for their own use/purposes, e.g.
> to perform local analysis or to host in your own triple store. Over
> time this repository should become a complete archival copy of all of
> the Linked Data that is published through data.gov.uk, complete with
> information on the provenance of individual datasets.
>
> The team behind data.gov.uk are still working through a number of the
> best practices, so right now I've simply put up copies of all the
> currently live datasets.
>
>
> HOW THE DATA IS ORGANISED
>
> The web archive is organised into a series of sub-directories:
>
> * Sector - top-level sector, e.g. as used in *.data.gov.uk
> * Dataset - dataset directory, named with a short identifier for the
> dataset. I've made some of these up at present
> * Date-stamped directory - in the format yyyy-mm-dd
> * Data files - there may be a number of data files in different
> formats. E.g. the data may span a number of small files; some files may
> be N-Triples for loading into the default graph and some files may be
> N-Quads.
>
> For example, the RDF version of Edubase currently available from
> http://education.data.gov.uk can be found here:
>
> http://source.data.gov.uk/data/education/edubase/2009-08-14/
>
> with the general pattern being:
>
> http://source.data.gov.uk/data/[sector]/[dataset]/[timestamp]/
>
> Currently only the latest versions of each dataset are being loaded
> into the live SPARQL endpoints, but over time there will be a move
> towards using named graphs for versioning (as described at [1]).
>
>
> LINKED DATA, DATA DUMPS & SERVICES
>
> The sector identifier ties together the Linked Data, the data dumps,
> and the SPARQL endpoints and other services. For example if you're
> looking at some Linked Data, e.g.:
>
> http://education.data.gov.uk/id/school/100866
>
> Then this data will be included in the SPARQL endpoint at:
>
> http://services.data.gov.uk/education/sparql
>
> The search interface at:
>
> http://services.data.gov.uk/education/search
>
> And the raw data can be found in one (or more) of the datasets accessible from:
>
> http://source.data.gov.uk/data/education/
>
>
> WHAT IS NOT INCLUDED?
>
> As I explained at the start of this email, not all of the Linked Data
> being published from data.gov.uk, or by the UK government, is currently
> represented in these data dumps.
>
> The RDF available from legislation.gov.uk is currently only
> available as Linked Data because it's surfaced directly from the
> website. Ditto the RDFa published on the London Gazette website.
> It would be possible to regularly crawl and dump those sources,
> but I'm not sure if there are plans to do that yet. Other departments
> and projects may also surface their own data and data dumps.

What about sitemaps (basic variety) for the HTML+RDFa resource crawling
guidance?


> The other dataset that is not represented in the dumps is the set of
> date-time URIs available from reference.data.gov.uk, e.g. [2], as
> these are all algorithmically generated. I don't recommend anyone
> crawls those :)
>
> Any questions then please ask.

So how do we get this data loaded in other RDF data stores?

Kingsley


--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Leigh Dodds

Aug 24, 2010, 10:05:41 AM
to Kingsley Idehen, uk-government-...@googlegroups.com
Hi,

On 24 August 2010 14:53, Kingsley Idehen <kid...@openlinksw.com> wrote:
>  On 8/24/10 8:52 AM, Leigh Dodds wrote:

...


>> The RDF available from legislation.gov.uk is currently only
>> available as Linked Data because it's surfaced directly from the
>> website. Ditto the RDFa published on the London Gazette website.
>> It would be possible to regularly crawl and dump those sources,
>> but I'm not sure if there are plans to do that yet. Other departments
>> and projects may also surface their own data and data dumps.
>
> What about sitemaps (basic variety) for the HTML+RDFa resource crawling
> guidance?

Yes, that would be useful. I'm not in a position to help there though, I'm afraid.

>> The other dataset that is not represented in the dumps is the set of
>> date-time URIs available from reference.data.gov.uk, e.g. [2], as
>> these are all algorithmically generated. I don't recommend anyone
>> crawls those :)
>>
>> Any questions then please ask.
>
> So how do we get this data loaded in other RDF data stores?

To be clear: it's not in *any* RDF store at the moment. It's generated
by code that was developed by Stuart Williams at Epimorphics, which
runs as a web service deployed at reference.data.gov.uk.

Stuart has a nice blog post that introduces the data and some intended
uses here:

http://www.epimorphics.com/web/wiki/using-interval-set-uris-statistical-data

It's an interesting question how data like this, which can't
feasibly be completely materialised, is included in a triple store.
Limited constrained crawls are a possibility of course, but another
approach would be to look at providing access to the data through
SPARQL extensions.
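
As one possible reading of a "limited constrained crawl", the Python
sketch below collects only the day URIs that a local dump actually
mentions, so that just those would need fetching. This is an
interpretation rather than anything the data.gov.uk team has described;
the `referenced_day_uris` helper and the sample triples (including the
example.org predicate) are made up for illustration.

```python
import re
from typing import List

# Matches day URIs of the shape used by reference.data.gov.uk
DAY_URI = re.compile(
    r"http://reference\.data\.gov\.uk/id/day/\d{4}-\d{2}-\d{2}"
)


def referenced_day_uris(ntriples_text: str) -> List[str]:
    """Return the distinct day URIs mentioned in some N-Triples text."""
    return sorted(set(DAY_URI.findall(ntriples_text)))


# Hypothetical sample data, not taken from a real dump
sample = """
<http://education.data.gov.uk/id/school/100866> <http://example.org/opened> <http://reference.data.gov.uk/id/day/2010-09-24> .
<http://education.data.gov.uk/id/school/100867> <http://example.org/opened> <http://reference.data.gov.uk/id/day/2010-09-24> .
<http://education.data.gov.uk/id/school/100868> <http://example.org/opened> <http://reference.data.gov.uk/id/day/2009-08-14> .
"""

for uri in referenced_day_uris(sample):
    print(uri)  # candidates for a bounded crawl
```

The crawl stays finite because it is bounded by the data you already hold
rather than by the generator behind the URIs.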

Cheers,

L.

Kingsley Idehen

Aug 24, 2010, 12:04:02 PM
to Leigh Dodds, uk-government-...@googlegroups.com