Museum Object Data - Scraping and Re-exposing

Dan Zambonini

May 21, 2008, 7:02:15 AM
to mashed...@googlegroups.com
Hello,

For those that haven't seen my tweets, I've been working away with Mike
(Ellis) on an idea he originally had (codenamed 'hoard.it') about
scraping semantic data from non-semantic (but templated) HTML pages. He
blogged about it last night:

http://electronicmuseum.org.uk/2008/05/20/hoardit-bootstrapping-the-naw/

We basically spidered about 7 (so far...) museum websites for their
object data, setting up specific templates so that we could grab
granular bits of information (Dublin Core, images, dimensions,
materials, etc) from each page. This has all gone into a central
repository, which can be queried in a REST-like way:

http://feeds.boxuk.com/museums/
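
To give a rough idea of what a per-site template could look like, here is a minimal Python sketch. It is not the hoard.it code: the field names, the regex patterns, the sample HTML and the scrape_record helper are all made up for illustration, and a real template would be tuned to each museum's page markup.

```python
import re
from html import unescape

# Hypothetical template for one site: field name -> regex with one capture group.
# The class names below are invented; each real site would need its own patterns.
EXAMPLE_TEMPLATE = {
    "dc.title": r'<h1 class="object-title">(.*?)</h1>',
    "dc.creator": r'<td class="maker">(.*?)</td>',
    "date.created": r'<td class="date">(.*?)</td>',
    "materials": r'<td class="materials">(.*?)</td>',
}


def scrape_record(page_html, template):
    """Apply each field pattern to the page and keep the fields that match."""
    record = {}
    for field, pattern in template.items():
        match = re.search(pattern, page_html, re.DOTALL | re.IGNORECASE)
        if match:
            # Strip nested tags, then decode HTML entities in the captured text.
            text = re.sub(r"<[^>]+>", "", match.group(1))
            record[field] = unescape(text).strip()
    return record


# Invented sample page in the shape the template expects.
SAMPLE_PAGE = """
<h1 class="object-title">Bronze <em>axe head</em></h1>
<table>
  <tr><td class="date">c. 1500 BC</td></tr>
  <tr><td class="materials">bronze &amp; wood</td></tr>
</table>
"""

record = scrape_record(SAMPLE_PAGE, EXAMPLE_TEMPLATE)
print(record)
```

Fields that a page doesn't contain (the maker, in this sample) are simply left out of the record, which is one way to cope with sites that expose different amounts of detail.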

We've got about 44,000 object records so far (and about 4,000 'museum
details'; the system can be used to spider any type of data), and for
most objects, we've normalised the date.created field (to 'from/to
year') and extracted long/lat for the 'place created', allowing us to
more easily combine the data in queries and visualisations.
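
As an illustration of the 'from/to year' idea, a normaliser along these lines would turn free-text dates into a pair of years. This is only a sketch under assumptions: the input formats handled, the +/- 10 year window for 'circa' dates and the normalise_date name are guesses for the example, not the rules the prototype actually uses.

```python
import re


def normalise_date(text):
    """Return (from_year, to_year) for a free-text date, or None if unparseable."""
    text = text.strip().lower()

    # "19th century" -> (1801, 1900)
    century = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) century", text)
    if century:
        n = int(century.group(1))
        return ((n - 1) * 100 + 1, n * 100)

    # "1850-1860" or "1850 to 1860"
    span = re.fullmatch(r"(\d{3,4})\s*(?:-|to)\s*(\d{3,4})", text)
    if span:
        return (int(span.group(1)), int(span.group(2)))

    # "c. 1850", "ca. 1850", "circa 1850": widened by an arbitrary 10 years each way
    circa = re.fullmatch(r"(?:c\.?|ca\.?|circa)\s*(\d{3,4})", text)
    if circa:
        year = int(circa.group(1))
        return (year - 10, year + 10)

    # A plain year maps to a single-year range.
    if re.fullmatch(r"\d{3,4}", text):
        return (int(text), int(text))

    # Anything else (BC dates, "unknown", etc.) is left unnormalised here.
    return None


for raw in ["1850", "1850-1860", "c. 1850", "19th Century", "unknown"]:
    print(raw, "->", normalise_date(raw))
```

With every record reduced to a from/to pair, a query such as "objects made between 1800 and 1850" becomes a simple range overlap test, whatever format the source site used.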

This is just a prototype at the moment, so is probably a bit flaky, and
some of the data isn't as high-quality as it could be, but as Frankie
would say, it's probably 'assez bon'.

We're setting up a wiki around this idea here:
http://hoardit.pbwiki.com/

Ta!

Dan
