http://dataformatchecker.heroku.com/
A number of journalists and myself (representing the hackers!) had a
little bit of a moan about the datasets on data.gov.uk. Don't get me
wrong: we are very happy that the government have released all the
data, but we decided it would be useful to see if we could prompt the
data.gov.uk people into somehow figuring out how to turn all those
PDFs and Excel files into real data files (XML, JSON, CSV, RDF et al.).
But if we are going to bring about a change in policy, we need to know
where we stand. So I have taken the datasets as of 2010-01-21, and
asked people to crawl through them, specifying what format the data is
in, and whether it is actually available (a few of the pages 404 for
instance). Some of the health data requires registration. I want the
statistics about what format they are in, but I'm too lazy to go
through the 2,900+ records checking them. So I've built a little app
to crowdsource the donkey work of statistics collection.
I launched it on Twitter about ten minutes ago, and we've already had
about a hundred entries checked.
Those who want to follow the viral spread of the data.gov.uk format
checker on Twitter may want to watch:
http://backtweets.com/search?q=http%3A%2F%2Fdataformatchecker.heroku.com%2F
Once the Twitter hordes have gone through the data.gov.uk offerings,
I'll pluck the statistics out of the database and will release the raw
data (probably as RDF and JSON) and also compile the statistics into a
report. The report will be made public, of course, but if anyone wants
to give me any advice on how to get the relevant people to read it, I'd
be grateful.
--
Tom Morris
<http://tommorris.org/>
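
A minimal sketch of the statistics step described above, not the app's actual code. It assumes each crowdsourced entry is a (dataset URL, reported format) pair and that a dataset may be checked more than once, in which case the most frequently reported format wins. The example.org URLs, the format labels and both function names are hypothetical.

```python
import json
from collections import Counter

# Hypothetical crowdsourced entries: (dataset URL, format reported by a checker).
entries = [
    ("http://example.org/dataset/a", "PDF"),
    ("http://example.org/dataset/a", "PDF"),
    ("http://example.org/dataset/b", "CSV"),
    ("http://example.org/dataset/c", "fail"),
    ("http://example.org/dataset/c", "XLS"),
    ("http://example.org/dataset/c", "fail"),
]


def majority_format(entries):
    """Return {dataset_url: most frequently reported format}."""
    votes = {}
    for url, fmt in entries:
        votes.setdefault(url, Counter())[fmt] += 1
    # most_common(1) yields [(format, count)]; ties go to the first format seen.
    return {url: counter.most_common(1)[0][0] for url, counter in votes.items()}


def format_statistics(entries):
    """Count how many datasets ended up in each format."""
    return dict(Counter(majority_format(entries).values()))


stats = format_statistics(entries)
# Raw statistics as JSON, one of the release formats mentioned above.
print(json.dumps(stats, indent=2, sort_keys=True))
```

Datasets with conflicting reports would still need checking by hand; the majority rule only gives a first pass.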
What should someone select if the page provides a link to more pages
that provide links to the data in PDF format?
Thanks.
Basically follow your gut - if you have to click through a page or
two, that's okay - just mark it as whatever format you find. If you
have to register, dance a jig, invoke the name of the Dark Lord and
balance three cups of coffee on your head, mark it as 'fail'. ;)
--
Christopher Gutteridge -- http://www.ecs.soton.ac.uk/people/cjg
Lead Developer, EPrints Project, http://eprints.org/
Web Projects Manager, School of Electronics and Computer Science,
University of Southampton.
I decided against that as it would make the page more complicated. For
those which have multiple data types, I can just check them by hand.
> Similarly with the pages that fail, options for the type of fail.
>
> Are you able to calculate the average number of pages that each user
> processes?
> Or the number of different people taking part?
>
I've just added that. I'm tracking IP addresses but will anonymise
them once it's all done. So we can see roughly how many
pages-per-participant.
I've added the <label> element to the page too.
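
A rough sketch of how a pages-per-participant figure could be computed without keeping raw IP addresses. This is an illustration under stated assumptions, not the app's implementation: the log layout, the function names and the choice of a salted SHA-256 digest are all hypothetical.

```python
import hashlib
import secrets
from collections import Counter


def anonymise(ip, salt):
    """Replace an IP address with a truncated salted SHA-256 digest."""
    return hashlib.sha256(salt + ip.encode("utf-8")).hexdigest()[:12]


def pages_per_participant(log, salt):
    """log is an iterable of (ip, dataset_id) rows.

    Returns (number of distinct participants, mean pages checked each).
    """
    counts = Counter(anonymise(ip, salt) for ip, _dataset_id in log)
    participants = len(counts)
    mean = sum(counts.values()) / participants if participants else 0.0
    return participants, mean


# Hypothetical log rows, using addresses from the 192.0.2.0/24 documentation range.
log = [("192.0.2.1", 1), ("192.0.2.1", 2), ("192.0.2.1", 3), ("192.0.2.7", 4)]

# Throwing the salt away afterwards makes it harder to recompute the digests
# from a list of candidate addresses.
salt = secrets.token_bytes(16)

participants, mean = pages_per_participant(log, salt)
print(participants, mean)
```

An IP address is only a rough proxy for a person (shared networks, people moving between machines), so the result is an estimate.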
Great effort so far! I wanted to chime in and make a suggestion:
Would you be interested in contributing these format results directly
back into data.gov.uk?
To be more specific:
1. All the metadata in data.gov.uk comes from an HMG dedicated CKAN
instance (similar to ckan.net but just for data.gov.uk)
2. It has long been the plan to pull *all* of this metadata to a
public CKAN instance -- most likely ckan.net. ckan.net allows complete
wiki-like editing (and an API for making and pulling changes).
Furthermore each data "package" has an associated set of package
resources each of which has a dedicated "format" field so you could
write this kind of info directly into the dataset/package metadata.
3. Assuming the data.gov.uk packages on ckan.net are openly editable
then you can write back your format info directly to the
package/dataset metadata on ckan.net. This can then be pushed,
subject, I imagine, to a bit of reviewing, back into the data.gov.uk
CKAN instance and hence appear directly on data.gov.uk.
Even better, reformatted versions of the dataset can also be listed on
the relevant dataset page simply by adding an extra package resource
-- see for example this "unofficial" package on ckan.net for PESA:
<http://www.ckan.net/package/ukgov-finances-pesa>
4. Result: we've nicely closed the loop between these "unofficial"
efforts :) and the official data.gov.uk metadata.
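
To make the write-back idea in points 2 and 3 concrete, here is a small sketch that works on plain dictionaries only and makes no CKAN API calls. The package layout (a "resources" list whose items carry "url" and "format" keys) is an assumption based on the description above, and the function name and example.org URLs are hypothetical.

```python
import copy


def set_resource_format(package, resource_url, fmt):
    """Return a copy of `package` with `fmt` recorded for `resource_url`.

    If no existing resource has that URL, a new resource is appended, which
    is also how a reformatted version of a dataset could be listed.
    """
    updated = copy.deepcopy(package)
    resources = updated.setdefault("resources", [])
    for resource in resources:
        if resource.get("url") == resource_url:
            resource["format"] = fmt
            break
    else:
        # No matching resource found: add one.
        resources.append({"url": resource_url, "format": fmt})
    return updated


# Hypothetical package metadata with the format field left empty.
package = {
    "name": "example-dataset",
    "resources": [{"url": "http://example.org/report.pdf", "format": ""}],
}

checked = set_resource_format(package, "http://example.org/report.pdf", "PDF")
with_csv = set_resource_format(checked, "http://example.org/report.csv", "CSV")
print(with_csv["resources"])
```

Pushing the updated metadata to a CKAN instance would be a separate step and is deliberately left out here.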
Regards,
Rufus
PS: the Open Knowledge Foundation are the original and primary
developers of the CKAN software (but it's open source so anyone can use
it and contribute to it ...)
--
Open Knowledge Foundation
Promoting Open Knowledge in a Digital Age
http://www.okfn.org/ - http://blog.okfn.org/
Best
Nigel
Sure. Once the first round of data format checking is done, I'll
release both the data (RDF and JSON) and a report summarising the
statistics. It may take a few days to get around to it, but it will
get done.
There are still 538 to go. Do keep cracking on, folks!
Hooray!!!!!!! No more jobs. Well done, amorphous anonymous community. We'll now go and pester the relevant people to suck less.
For anyone who's interested, I finally got round to writing up the
various things other people got up to on the day:
http://blog.scraperwiki.com/post/370673225/hacks-and-hackers-hack-day-report
--
/*
ric...@memespring.co.uk
memespring.co.uk
++44 7976730458
memespring (flickr/skype/etc)
memspr (aim)
*/
<http://data.gov.uk/dataset/2008_injury_road_traffic_collisions_in_northern_ireland>
dc:format "application/pdf" .
//Ed
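
A sketch of how checked results could be turned into dc:format statements like the one above. The mapping from format labels to MIME types, the function name and the choice of the Dublin Core elements namespace for the dc: prefix are assumptions; dataset URLs are assumed not to need escaping.

```python
# Hypothetical mapping from the checker's format labels to MIME types.
FORMAT_TO_MIME = {
    "PDF": "application/pdf",
    "CSV": "text/csv",
    "XLS": "application/vnd.ms-excel",
    "XML": "application/xml",
    "JSON": "application/json",
}


def to_turtle(results):
    """Serialise {dataset_url: format_label} as Turtle dc:format triples.

    Entries whose label has no known MIME type (for example 'fail') are skipped.
    """
    lines = ["@prefix dc: <http://purl.org/dc/elements/1.1/> .", ""]
    for url, label in sorted(results.items()):
        mime = FORMAT_TO_MIME.get(label)
        if mime is None:
            continue
        lines.append(f"<{url}>")
        lines.append(f'    dc:format "{mime}" .')
    return "\n".join(lines)


results = {
    "http://data.gov.uk/dataset/2008_injury_road_traffic_collisions_in_northern_ireland": "PDF",
    "http://example.org/dataset/unreachable": "fail",  # hypothetical entry
}
turtle = to_turtle(results)
print(turtle)
```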
I'll try to do so.
I'm planning to get the data out of the database and start fiddling
with it tomorrow. Dev8D this week rather got in the way of doing
anything with data.gov.uk data!
I want to try and dramatically lower the barrier to people starting to hack
with the RDF. To enable this I have created a new (yet another?) PHP RDF
library called Graphite:
http://lemur.ecs.soton.ac.uk/~cjg/Graphite/
I've produced some examples of working with the data.gov.uk RDF files.
http://lemur.ecs.soton.ac.uk/~cjg/Graphite/gov.php
--
Christopher Gutteridge -- http://www.ecs.soton.ac.uk/people/cjg
Lead Developer, EPrints Project, http://eprints.org/
Web Projects Manager, University of Southampton,
Can you tell me if/where we can find a Great Britain data set?
regards
Marcus
On Mar 2, 5:32 pm, Andy Turner <a.g.d.tur...@gmail.com> wrote:
> Hello,
>
> I was at Dev8D too!
>
> I think I replied on another list thread about Road Accident data. Anyway,
> by coincidence, I added all the Northern Ireland road accident data to an
> ESDS usage 40885 (https://www.esds.ac.uk/newRegistration/showProjectDetails.asp?pn=40885)
> yesterday. (This already has the data for GB 1985 to 2008 as per http://www.geog.leeds.ac.uk/people/a.turner/data/Stats19/). The following
> are the specific study numbers I added:
>
> - SN 6386 Northern Ireland Road Traffic Collision Data, 2004
>   http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6386
> - SN 6391 Northern Ireland Road Traffic Collision Data, 2005
>   http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6391
> - SN 6392 Northern Ireland Road Traffic Collision Data, 2006
>   http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6392
> - SN 6393 Northern Ireland Road Traffic Collision Data, 2007
>   http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6393
> - SN 6394 Northern Ireland Road Traffic Collision Data, 2008
>   http://www.data-archive.ac.uk/findingData/snDescription.asp?sn=6394
I'm looking at Designing URI Sets for the UK Public Sector [1] and finding it hard to get my head around the difference between a List URI and a Set URI.
List URIs are defined as: "These provide a list of the Identifier URIs that are contained within a set."
Set URIs are defined as: "A type of Identifier URI that names the URI set and can be resolved to provide the quality characteristics of the set."
Does anyone have any real examples of these? I'm confused! Am I right in thinking that "set" and "URI set" are being used interchangeably here?
[1] http://www.cabinetoffice.gov.uk/media/308995/public_sector_uri.pdf
Thanks,
Andy
--
Andy Powell
Research Programme Director
Eduserv
t: 01225 474319
m: 07989 476710
twitter: @andypowe11
blog: efoundations.typepad.com
The pattern we're now using, which differs a little from that in the
URI sets document you quote, is that
/id/{concept} (eg /id/school)
refers to a URI set, which is a set of resources that are of the same
type (in this case schools) and follow the same basic pattern within
their URI (in this case /id/school/{number}).
The URI set is a kind of abstract notion: a potentially infinite
collection of resources that share the same URI pattern. When you
request the URI set with that URI, you will be redirected to a URI in
the form:
/doc/{concept} (eg /doc/school)
which will return the first page of a list of schools as
well as metadata about the URI set (such as the pattern that the URIs
follow). This is a much more concrete notion: an ordered list of known
resources for which you get the first page.
There are of course other lists of schools that you could envisage,
such as:
/doc/school/level/secondary
/doc/school/district/00UA
and so on, depending on the way that an API is configured.
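
The distinction between the three kinds of path can be sketched as follows. This only illustrates the pattern described above and is not how any real API is implemented; the regular expression, the function names and the example school number are hypothetical.

```python
import re

# /id/{concept} names a URI set; /id/{concept}/{item} names one resource in it.
ID_PATTERN = re.compile(r"^/id/(?P<concept>[^/]+)(?:/(?P<item>[^/]+))?$")


def classify(path):
    """Classify a path as a URI set, an identifier within a set, or neither."""
    match = ID_PATTERN.match(path)
    if match is None:
        return None
    if match.group("item") is None:
        return ("uri-set", match.group("concept"))
    return ("identifier", match.group("concept"), match.group("item"))


def list_document_for(path):
    """Map a URI set path /id/{concept} to its list document /doc/{concept}."""
    kind = classify(path)
    if kind is None or kind[0] != "uri-set":
        raise ValueError(f"not a URI set path: {path}")
    return "/doc/" + kind[1]


print(classify("/id/school"))           # the abstract URI set
print(classify("/id/school/123456"))    # one (made-up) school in the set
print(list_document_for("/id/school"))  # where a request for the set would be redirected
```

Filtered lists such as /doc/school/level/secondary sit outside this mapping; they are further list documents that an API may choose to expose.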
(There are some more details here which we have yet to pull into a
proper document, but I'd be glad to give you access to our initial
drafts and/or discuss this further if you want some specific advice;
drop me a line.)
Cheers,
Jeni
--
Jeni Tennison
http://www.jenitennison.com