data.gov.uk format verifier

Tom Morris

Feb 2, 2010, 7:17:37 PM

to uk-government-data-developers

Last week, I hacked this together at the Hacks and Hackers Hackday in London:

http://dataformatchecker.heroku.com/

A number of journalists and I (representing the hackers!) had a
little bit of a moan about the datasets on data.gov.uk. Don't get me
wrong: we are very happy that the government have released all the
data, but we decided it would be useful to see if we could prompt the
data.gov.uk people into somehow figuring out how to turn all those
PDFs and Excel files into real data files (XML, JSON, CSV, RDF et al.).

But if we are going to bring about a change in policy, we need to know
where we stand. So I have taken the datasets as of 2010-01-21, and
asked people to crawl through them, specifying what format the data is
in, and whether it is actually available (a few of the pages 404 for
instance). Some of the health data requires registration. I want the
statistics about what format they are in, but I'm too lazy to go
through the 2,900+ records checking them. So I've built a little app
to crowdsource the donkey work of statistics collection.

I launched it on Twitter about ten minutes ago, and we've already had
about a hundred entries checked.

Those who want to follow the viral spread of the data.gov.uk format
checker on Twitter may want to watch:
http://backtweets.com/search?q=http%3A%2F%2Fdataformatchecker.heroku.com%2F

Once the Twitter hordes have gone through the data.gov.uk offerings,
I'll pluck the statistics out of the database and will release the raw
data (probably as RDF and JSON) and also compile the statistics into a
report. The report will be made public, of course, but if anyone wants
to give me any advice on how to get the relevant people to read it, I'd
be grateful.
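
A minimal sketch of the kind of tally that could feed both the JSON release and the report. The rows below are made-up placeholders, not the real results, and the field names (`total`, `by_format`, `share`) are assumptions for illustration only.

```python
import json
from collections import Counter

# Hypothetical crowd-sourced results: (dataset identifier, reported format).
# These rows are invented for illustration; they are not the real data.
checks = [
    ("dataset-a", "pdf"),
    ("dataset-b", "xls"),
    ("dataset-c", "pdf"),
    ("dataset-d", "fail"),
    ("dataset-e", "csv"),
]

def format_statistics(rows):
    """Count how many datasets were reported in each format."""
    counts = Counter(fmt for _, fmt in rows)
    total = sum(counts.values())
    return {
        "total": total,
        "by_format": dict(counts),
        # Fraction of all checked datasets in each format, to 3 decimals.
        "share": {fmt: round(n / total, 3) for fmt, n in counts.items()},
    }

stats = format_statistics(checks)
print(json.dumps(stats, indent=2, sort_keys=True))
```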

--
Tom Morris
<http://tommorris.org/>

bill.r...@planet.nl

Feb 3, 2010, 2:22:45 AM

to uk-government-...@googlegroups.com

Tom - great idea, just done a few. Very quick and easy.

From: uk-government-...@googlegroups.com on behalf of Tom Morris
Sent: Wed 3-2-2010 1:17
To: uk-government-data-developers
Subject: [uk-government-data-developers] data.gov.uk format verifier

Stuart Harrison

Feb 3, 2010, 5:02:01 AM

to uk-government-...@googlegroups.com

Great job Tom, I've put out a call on Twitter and done a few myself. Plus kudos for the kitten pictures. If there's one thing the internet likes, it's pictures of cats.

Glyn Wintle

Feb 3, 2010, 5:37:00 AM

to uk-government-...@googlegroups.com

Great idea, Tom. Clearly some of the pages are a little hard to classify,
but I imagine that will be a useful data set you're collecting there.

What should someone select if the page provides a link to more pages
that provide links to the data in PDF format?

Tom Morris

Feb 3, 2010, 7:24:47 AM

to uk-government-data-developers

Thanks.

Basically follow your gut - if you have to click through a page or
two, that's okay - just mark it as whatever format you find. If you
have to register, dance a jig, invoke the name of the Dark Lord and
balance three cups of coffee on your head, mark it as 'fail'. ;)

Christopher Gutteridge

Feb 3, 2010, 9:25:28 AM

to uk-government-...@googlegroups.com

As have I.

I suggest a bit more clarity would improve the data you gather, e.g. if the data is provided in both .xls and .pdf then call it ".xls".

I guess we want to highlight how far along each dataset is.
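
One way to make that rule explicit is a fixed precedence list, sketched below. Only ".xls beats .pdf" comes from the suggestion above; the rest of the ordering in `PRECEDENCE` and the function name `best_format` are assumptions for illustration.

```python
# Assumed ordering, most machine-readable first. Only ".xls beats .pdf" is
# taken from the thread; the rest of the ranking is an illustration.
PRECEDENCE = ["rdf", "xml", "json", "csv", "xls", "pdf"]

def best_format(found):
    """Return the most machine-readable format among those found, or 'fail'."""
    normalised = {f.lower().lstrip(".") for f in found}
    for fmt in PRECEDENCE:
        if fmt in normalised:
            return fmt
    return "fail"

print(best_format([".xls", ".pdf"]))
```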

Also, add <label>...</label> around the radio button and text, then you can click anywhere on the text to select it, rather than the little "O" (yes, I'm that lazy)
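
As a rough sketch of what that markup looks like when generated server-side: the helper name `radio_option` and the field name `format` are hypothetical, and nothing here is taken from the app's actual code.

```python
from html import escape

def radio_option(name, value, text):
    """Render one radio button wrapped in a <label> so the text is clickable."""
    return (
        f'<label><input type="radio" name="{escape(name)}" '
        f'value="{escape(value)}"> {escape(text)}</label>'
    )

print(radio_option("format", "pdf", "PDF"))
```

Because the input sits inside the label, no `for`/`id` pairing is needed for the click target to cover the text.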

-- 
Christopher Gutteridge -- http://www.ecs.soton.ac.uk/people/cjg

Lead Developer, EPrints Project, http://eprints.org/

Web Projects Manager, School of Electronics and Computer Science,
University of Southampton.

Scott Wilcox

Feb 3, 2010, 9:49:53 AM

to uk-government-...@googlegroups.com

Agreed completely with Chris, was going to say the exact same thing.

Sent from my iPhone

Jonathan Wyatt

Feb 3, 2010, 11:48:39 AM

to uk-government-...@googlegroups.com

Nice work.

Looks like the whole lot will be done soon.

Maybe once the whole list has been done, the option buttons could be swapped for check boxes and those pages with more than one type of data could be relisted, so we can find out exactly what those multiple data types are.

Similarly with the pages that fail, options for the type of fail.

Are you able to calculate the average number of pages that each user processes?
Or the number of different people taking part?

Those would be interesting statistics.

Tom Morris

Feb 3, 2010, 11:55:21 AM

to uk-government-data-developers

On Wed, Feb 3, 2010 at 16:48, Jonathan Wyatt <jon...@googlemail.com> wrote:
> Maybe once the whole list has been done, the option buttons could be swapped
> for check boxes and those pages with more than one type of data could
> be relisted, so we can find out exactly what those multiple data types are.
>

I decided against that as it would make the page more complicated. For
those which have multiple data types, I can just check them by hand.

> Similarly with the pages that fail, options for the type of fail.
>
> Are you able to calculate the average number of pages that each user
> processes?
> Or the number of different people taking part?
>

I've just added that. I'm tracking IP addresses but will anonymise
them once it's all done. So we can see roughly how many
pages-per-participant.
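
A minimal sketch of how pages-per-participant could be computed while keeping raw addresses out of the published figures. It assumes a log of `(ip, dataset_id)` pairs and uses a salted SHA-256 digest as the pseudonym; the salt value, function names and sample addresses are all hypothetical, and this is not necessarily how the app does it.

```python
import hashlib
from collections import Counter

# Hypothetical secret salt. Throw it away once the statistics are computed,
# otherwise the small IPv4 address space makes the digests easy to reverse.
SALT = b"replace-with-a-secret-value"

def anonymise(ip):
    """Replace an IP address with a truncated salted SHA-256 digest."""
    return hashlib.sha256(SALT + ip.encode("utf-8")).hexdigest()[:12]

def pages_per_participant(log):
    """log is a list of (ip, dataset_id) pairs; returns (participants, mean)."""
    counts = Counter(anonymise(ip) for ip, _ in log)
    participants = len(counts)
    mean = sum(counts.values()) / participants if participants else 0.0
    return participants, mean

# Addresses from the documentation range 192.0.2.0/24, used as placeholders.
log = [("192.0.2.1", "a"), ("192.0.2.1", "b"), ("192.0.2.2", "c")]
print(pages_per_participant(log))
```

Counting by address only approximates the number of people, since several participants can share one address and one participant can use several.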

I've added the <label> element to the page too.

Jonathan Wyatt

Feb 3, 2010, 12:52:24 PM

to uk-government-...@googlegroups.com

> those which have multiple data types, I can just check them by hand.

Cool, let's hope that's a short list. I've seen quite a few so far that have both PDF and Excel.

Also seen some interesting graphs, and one page had tables of stats displayed as images (each table was a separate jpeg)

Only a couple of 404s so far, which is promising; but far too many PDFs for my liking.

Rufus Pollock

Feb 3, 2010, 4:06:21 PM

to uk-government-...@googlegroups.com, David Read

Great effort so far! I wanted to chime in and make a suggestion:

Would you be interested in contributing these format results directly
back into data.gov.uk?

To be more specific:

1. All the metadata in data.gov.uk comes from an HMG dedicated CKAN
instance (similar to ckan.net but just for data.gov.uk)

2. It has long been the plan to pull *all* of this metadata to a
public CKAN instance -- most likely ckan.net. ckan.net allows complete
wiki-like editing (and an API for making and pulling changes).
Furthermore each data "package" has an associated set of package
resources each of which has a dedicated "format" field so you could
write this kind of info directly into the dataset/package metadata.

3. Assuming the data.gov.uk packages on ckan.net are openly editable
then you can write back your format info directly to the
package/dataset metadata on ckan.net. This can then be pushed,
subject, I imagine, to a bit of reviewing, back into the data.gov.uk
CKAN instance and hence appear directly on data.gov.uk.

Even better, reformatted versions of the dataset can also be listed on
the relevant dataset page simply by adding an extra package resource
-- see for example this "unofficial" package on ckan.net for PESA:
<http://www.ckan.net/package/ukgov-finances-pesa>

4. Result: we've nicely closed the loop between these "unofficial"
efforts :) and the official data.gov.uk metadata.
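
To illustrate the write-back step in the list above, here is a sketch that records a checked format on a package's resource. It works on a plain dictionary and makes no API calls; the record layout (`resources`, each with `url` and `format`), the package name and the URL are assumptions for illustration, so check them against the CKAN instance before relying on them.

```python
import json

def set_resource_format(package, url, fmt):
    """Return a copy of package with the format recorded for the given URL."""
    updated = dict(package)
    updated["resources"] = [
        {**res, "format": fmt} if res.get("url") == url else dict(res)
        for res in package.get("resources", [])
    ]
    return updated

# Hypothetical package record; the name and URL are placeholders.
package = {
    "name": "example-dataset",
    "resources": [{"url": "http://example.org/data.pdf", "format": ""}],
}
updated = set_resource_format(package, "http://example.org/data.pdf", "PDF")
print(json.dumps(updated, indent=2))
```

Returning a copy rather than editing in place keeps the original record available for the review step before anything is pushed back.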

Regards,

Rufus

PS: the Open Knowledge Foundation are the original and primary
developers of the CKAN software (but it's open source so anyone can use
it and contribute to it ...)
--
Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/

Prof Nigel Shadbolt

Feb 3, 2010, 6:17:47 PM

to <uk-government-data-developers@googlegroups.com>, Rufus Pollock, David Read

Can I endorse the power of this sort of approach - this is precisely where we can get fast and effective collaborative improvement. As Rufus points out, we will need a process of review before consolidation back in - but the value of a public improvement process is very powerful and a good argument in favour of open public data.

Best

Nigel

Tom Morris

Feb 3, 2010, 9:02:37 PM

to uk-government-data-developers

On Wed, Feb 3, 2010 at 21:06, Rufus Pollock <rufus....@okfn.org> wrote:
> Would you be interested in contributing these format results directly
> back into data.gov.uk?
>

Sure. Once the first round of data format checking is done, I'll
release both the data (RDF and JSON) and a report summarising the
statistics. It may take a few days to get around to it, but it will
get done.

There are still 538 to go. Do keep cracking on, folks!

Jonathan Wyatt

Feb 4, 2010, 12:03:22 PM

to uk-government-...@googlegroups.com

No more jobs. Well done, amorphous anonymous community. We'll now go and pester the relevant people to suck less.

Hooray!!!!!!!

I look forward to finding out who sucks and who rules in the world of UK government data publication.

Richard Pope

Feb 4, 2010, 12:06:57 PM

to uk-government-...@googlegroups.com

On Wed, Feb 3, 2010 at 12:17 AM, Tom Morris <t...@tommorris.org> wrote:
> Last week, I hacked this together at the Hacks and Hackers Hackday in London:

For anyone who's interested, I finally got round to writing up the
various things other people got up to on the day:

http://blog.scraperwiki.com/post/370673225/hacks-and-hackers-hack-day-report

--
/*
ric...@memespring.co.uk
memespring.co.uk
++44 7976730458
memespring (flickr/skype/etc)
memspr (aim)
*/

Mia

Feb 21, 2010, 3:21:16 PM

to uk-government-...@googlegroups.com

Did you ever get a chance to put that together? I'd also be interested in the typical length of a session.

cheers, Mia

--------------------------------------------
http://openobjects.org.uk/

http://twitter.com/mia_out

Ed Summers

Feb 27, 2010, 4:18:54 PM

to uk-government-...@googlegroups.com

Really nice work Tom. I see the work was completed? I wonder if you
considered publishing the results as an RDF file on the heroku site?
Something like:

<http://data.gov.uk/dataset/2008_injury_road_traffic_collisions_in_northern_ireland>
dc:format "application/pdf" .
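
A small sketch of how such statements could be generated from the checked results, written out as N-Triples. It assumes the `dc:` prefix is bound to the Dublin Core elements namespace and that results are held as a mapping from dataset URI to media type; both are assumptions, as are the function names.

```python
# Assumed binding of the dc: prefix (Dublin Core elements 1.1).
DC_FORMAT = "http://purl.org/dc/elements/1.1/format"

def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def to_ntriples(results):
    """results maps dataset URI -> media type; returns N-Triples text."""
    lines = [
        f'<{uri}> <{DC_FORMAT}> "{escape_literal(media_type)}" .'
        for uri, media_type in sorted(results.items())
    ]
    return "\n".join(lines) + "\n"

results = {
    "http://data.gov.uk/dataset/2008_injury_road_traffic_collisions_in_northern_ireland":
        "application/pdf",
}
print(to_ntriples(results), end="")
```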

//Ed

Tom Morris

Feb 27, 2010, 6:03:39 PM

to uk-government-...@googlegroups.com

I'll try to do so.

I'm planning to get the data out of the database and start fiddling
with it tomorrow. Dev8D this week rather got in the way of doing
anything with data.gov.uk data!

Christopher Gutteridge

Feb 28, 2010, 5:50:53 AM

to uk-government-...@googlegroups.com

Not for me it didn't *grin* I've been working on an idea...

I want to try to dramatically lower the barrier for people to start
hacking with the RDF. To enable this I have created a new (yet another?)
PHP RDF library called Graphite:
http://lemur.ecs.soton.ac.uk/~cjg/Graphite/

I've produced some examples of working with the data.gov.uk RDF files.
http://lemur.ecs.soton.ac.uk/~cjg/Graphite/gov.php

--

Christopher Gutteridge -- http://www.ecs.soton.ac.uk/people/cjg

Lead Developer, EPrints Project, http://eprints.org/

Web Projects Manager, University of Southampton,

Andy Turner

Mar 2, 2010, 12:32:37 PM

to uk-government-...@googlegroups.com

Hello,

I was at Dev8D too!

I think I replied on another list thread about Road Accident data. Anyway, by coincidence, I added all the Northern Ireland road accident data to an ESDS usage 40885 (https://www.esds.ac.uk/newRegistration/showProjectDetails.asp?pn=40885) yesterday. (This already has the data for GB 1985 to 2008 as per http://www.geog.leeds.ac.uk/people/a.turner/data/Stats19/). The following are the specific study numbers I added:

SN 6386 Northern Ireland Road Traffic Collision Data, 2004
- http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6386
SN 6391 Northern Ireland Road Traffic Collision Data, 2005
- http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6391
SN 6392 Northern Ireland Road Traffic Collision Data, 2006
- http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6392
SN 6393 Northern Ireland Road Traffic Collision Data, 2007
- http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6393
SN 6394 Northern Ireland Road Traffic Collision Data, 2008
- http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6394

I know the basic layout for the data for Northern Ireland (3 related tables) matches with those for Stats 19 for Great Britain, but I've not looked into this any further...

Bye for now,

Andy

Marcus Zubed

Mar 3, 2010, 9:15:12 AM

to UK Government Data Developers

We have done a web GIS solution for a major motorway maintenance
company, enabling internal and external self-service of information:
http://www.zubed.com/assets/1.png

Can you tell me if/where we can find a Great Britain data set?

regards

Marcus


Andy Powell

Mar 3, 2010, 11:07:56 AM

to UK Government Data Developers

Apologies if this has come up before...

I'm looking at Designing URI Sets for the UK Public Sector [1] and finding it hard to get my head around the difference between a List URI and a Set URI.

List URIs are defined as: "These provide a list of the Identifier URIs that are contained within a set."

Set URIs are defined as: "A type of Identifier URI that names the URI set and can be resolved to provide the quality characteristics of the set."

Does anyone have any real examples of these? I'm confused! Am I right in thinking that "set" and "URI set" are being used interchangeably here?

[1] http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf

Thanks,

Andy

--
Andy Powell
Research Programme Director
Eduserv
t: 01225 474319
m: 07989 476710
twitter: @andypowe11
blog: efoundations.typepad.com

www.eduserv.org.uk

Jeni Tennison

Mar 3, 2010, 11:39:33 AM

to uk-government-...@googlegroups.com

Hi Andy,

The pattern we're now using, which differs a little from that in the
URI sets document you quote, is that

/id/{concept} (eg /id/school)

refers to a URI set, which is a set of resources that are of the same
type (in this case schools) and follow the same basic pattern within
their URI (in this case /id/school/{number}).

The URI set is a kind of abstract notion: a potentially infinite
collection of resources that share the same URI pattern. When you
request the URI set with that URI, you will be redirected to a URI in
the form:

/doc/{concept} (eg /doc/school)

which will return the first page of a list of schools as
well as metadata about the URI set (such as the pattern that the URIs
follow). This is a much more concrete notion: an ordered list of known
resources for which you get the first page.

There are of course other lists of schools that you could envisage,
such as:

/doc/school/level/secondary
/doc/school/district/00UA

and so on, depending on the way that an API is configured.
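
The /id/ to /doc/ step described above can be sketched as a small path-mapping function. The 303 status code is an assumption (it is the usual linked-data convention, but the description above only says "redirected"), as are the function name and the character set allowed in a concept name.

```python
import re

# Assumed shape of a concept name; the real rules may be looser or stricter.
ID_PATTERN = re.compile(r"^/id/(?P<concept>[a-z][a-z0-9-]*)$")

def resolve_uri_set(path):
    """Map a URI-set path such as /id/school to (status, location).

    Returns None for any other path, which is out of scope for this sketch.
    """
    match = ID_PATTERN.match(path)
    if match is None:
        return None
    # 303 See Other is an assumption, not something stated in the thread.
    return 303, f"/doc/{match.group('concept')}"

print(resolve_uri_set("/id/school"))
```

Individual identifiers such as /id/school/{number} and filtered lists such as /doc/school/level/secondary are deliberately not handled here.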

(There are some more details here which we have yet to pull into a
proper document, but I'd be glad to give you access to our initial
drafts and/or discuss this further if you want some specific advice;
drop me a line.)

Cheers,

Jeni

--
Jeni Tennison
http://www.jenitennison.com
