Digital NZ Geoparser


Gordon Anderson

Sep 1, 2009, 6:41:04 PM
to nzopengo...@googlegroups.com
hi everyone

Inspired by the weekend bar camp, I have finally got round to releasing the code for the geoparser I wrote a couple of months back. For more information about the process involved, see http://groups.google.com/group/digitalnz/browse_thread/thread/b5b0c96ce08ca441?pli=1 - no point in repeating it here.

The parsing is far from perfect, but it is a start. It would be useful for me to see examples where it has failed (e.g. matching against words that ought to be stopped, or failing to pick out a street name) so I can tweak the algorithm. I think the way to go is a hybrid approach, something that was discussed over the weekend: items are automatically geoparsed (and marked as such visually to the user), then subsequently ratified by a human (and again marked as such and shown to the user). A useful metric for spotting erroneous geoparsing is the area of the match. If it is massive, the chances are a stop word such as 'Photographer' or 'Premises' has matched a street somewhere random in the world, while an area of 0 may indicate that place names have been missed, either through misspelling or an error in the algorithm.
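To make the area heuristic concrete, here is a minimal sketch in Ruby. It assumes each geoparsed record carries a list of matched lat/lng points; the method name, hash keys, and the 10-degree threshold are all illustrative, not taken from the released code.

```ruby
# Hedged sketch: flag geoparse results whose matched coordinates span a
# suspiciously large bounding box (likely a stop word matching a street
# somewhere random in the world), or that produced no matches at all.
def suspicious_geoparse?(points, max_span_degrees: 10.0)
  return :no_matches if points.empty?

  lats = points.map { |p| p[:lat] }
  lngs = points.map { |p| p[:lng] }
  # Use the larger of the lat/lng spans as a crude "area" proxy.
  span = [lats.max - lats.min, lngs.max - lngs.min].max

  span > max_span_degrees ? :too_spread_out : :plausible
end
```

With this sketch, an empty match list comes back `:no_matches`, a tight cluster of Wellington coordinates comes back `:plausible`, and a set mixing New Zealand and London coordinates comes back `:too_spread_out`.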

Oh the code :)


Instructions for installing on a clean Ubuntu Jaunty are in the README file. Let me know if there are any errors in those. There are still a bunch of things I wish to add, but they might have to wait until I have moved to Thailand.

Cheers

Gordon


Walter McGinnis

Sep 2, 2009, 11:56:26 PM
to nzopengo...@googlegroups.com
You might find another project that uses Open Calais interesting.
Here's a blog post that serves as a project introduction:

http://www.powerhousemuseum.com/dmsblog/index.php/2009/09/02/introducing-about-nsw-maps-census-visualisations-cross-search/

Cheers,
Walter

Gordon Anderson

Sep 3, 2009, 6:15:34 AM
to nzopengo...@googlegroups.com
hi Walter

Thanks for the link - looks interesting

Cheers

Gordon

Brett Cooper

Sep 3, 2009, 8:18:58 PM
to nzopengo...@googlegroups.com
Both projects are in Ruby; why was that language used?
Do you have any screenshots of it in use, or a screencast?

I have cloned your post to help document this on http://wiki.open.org.nz/Geo_Parse_Digital_NZ_API_Records_by_Placename_Extraction

Great stuff.

Regards,
Brett Cooper



Gordon Anderson

Sep 3, 2009, 10:41:58 PM
to nzopengo...@googlegroups.com
hi Brett

There is a link to some screen shots in the README file on github, but I probably should have posted it in the first email :)


No screencast at present; I might do that, or put a live copy of the site up, once I have removed non-required scaffolding. Kinda busy sorting out the move from NZ to Thailand at the moment :)

Cheers

Gordon

Gordon Anderson

Sep 3, 2009, 10:48:16 PM
to nzopengo...@googlegroups.com
hi Brett

With regards to using Ruby: it is my language of choice for language-agnostic tasks. I added Rails to the mix as it is fast for prototyping web apps (and the geoparser is most definitely a prototype). You also get a testing framework for free.

Cheers

Gordon

Gordon Anderson

Nov 19, 2009, 11:22:04 PM
to nzopengo...@googlegroups.com
hi

Better late than never... moving countries does take a while :)

I've finally put aside some time and deployed my geoparser online at
http://digitalnzgeoparser.tripodtravel.co.nz. A Google-like search
screen is presented on the front page. Upon entering a term, a search
is performed against the Digital NZ search API, and, if not cached,
metadata records are retrieved one by patient one from those search
results. This can result in slow page times, as a search returning 10
uncached entries will hit the API 11 times (the search API does not
return the full metadata, which I require to extract geographical
placenames).
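The fetch-and-cache flow above can be sketched in plain Ruby. The real app caches records in the database via Rails; this version uses an in-memory hash and an injected fetcher so the counting logic is visible, and all names here are illustrative rather than from the deployed code.

```ruby
# Hedged sketch of the search flow: one call to the Digital NZ search API
# returns record ids, then each UNCACHED record costs one further API call
# (so 10 uncached results mean 11 API hits in total).
class MetadataCache
  def initialize(fetcher)
    @fetcher = fetcher # callable: record id -> full metadata hash
    @store = {}        # in-memory stand-in for the real database cache
    @api_calls = 0
  end

  attr_reader :api_calls

  def metadata_for(record_ids)
    record_ids.map do |id|
      @store[id] ||= begin
        @api_calls += 1 # only uncached records hit the API
        @fetcher.call(id)
      end
    end
  end
end
```

Fetching ids 1, 2, 3 and then 2, 3, 4 against a fresh cache costs four API calls in total, since the second request only misses on record 4.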

As an example (cached) search, try 'railway station', a search term
independent of any geographical location so as to provide a decent
geographical spread of results (i.e. searching for Hataitai would
produce lots of results in the same area).

http://digitalnzgeoparser.tripodtravel.co.nz/archive_searches/search?q=railway+station

The initial page returns some images from Flickr and the Alexander
Turnbull collection. Clicking on the title of a result will redirect
the browser to the source web page, and clicking on 'Metadata' will
render a page containing the original image with all of the text I
could extract from the given metadata record. Where things get
more interesting is viewing the metadata for records that have been
geoparsed, so click on 'Map' on the first two results:

http://digitalnzgeoparser.tripodtravel.co.nz/natlib_metadatas/64803/map
http://digitalnzgeoparser.tripodtravel.co.nz/natlib_metadatas/86312/map

These metadata pages render a Google map using placenames extracted
from the text.
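The live pages use the JavaScript Maps API, but the shape of the data flow can be shown with a static-map variant: turn the extracted coordinates into a marker list on a Google Static Maps URL. This is a sketch under that assumption; the helper name and hash keys are hypothetical.

```ruby
require 'erb'

# Hedged sketch: build a Google Static Maps URL with one marker per
# extracted placename coordinate. Not the rendering path the site uses,
# just an illustration of feeding geoparsed points to a map.
def static_map_url(points, size: '512x512')
  markers = points.map { |p| format('%.4f,%.4f', p[:lat], p[:lng]) }.join('|')
  'http://maps.google.com/maps/api/staticmap' \
    "?size=#{size}&markers=#{ERB::Util.url_encode(markers)}&sensor=false"
end
```

Each point becomes a `lat,lng` pair, joined with `|` and percent-encoded into the `markers` query parameter.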


To see more (currently) geoparsed results click on the following:
- 'Images' under Category (a search result page containing 20 images
will be rendered)
- 'Alexander Turnbull Library' under creator

The search results rendered are now 'Images from the Alexander
Turnbull Collection for the search term "railway station"' -
http://digitalnzgeoparser.tripodtravel.co.nz/archive_searches/search?q=railway%20station&page=1&f[]=21&f[]=10
The first couple of result pages have maps associated with the records.

Note that ideally I'd like the facet URLs ultimately to look more like this:
http://digitalnzgeoparser.tripodtravel.co.nz/search/railway+station/category/images/provider/alexander+turnbull+collection


Caveats:
======
1) Slow: due to hitting the search API multiple times for uncached
searches, expect slow response times.
2) Geoparsing is too slow to run live on a web search. The
hosting provider also prevents RAM-hungry batch jobs and does not have
memcache, so I am running a cron from my laptop directly against the
production database. Geoparsing is taking roughly a minute per
National Library record at the moment, whereas locally 10 seconds is
more the norm (currently trying to fix). As such, if you
search for a place it may be a day before you see maps for it.
3) No check is made for the Digital NZ search API failing (i.e. the
Digital NZ search server being down, or my key having gone over the
limit for the day).
4) I've allowed up to 50,000 facet results to be returned for any one
field. The main problem here is that the creator list can be very
long, see http://digitalnzgeoparser.tripodtravel.co.nz/archive_searches/search?q=cricket&page=1&f[]=10
for an example. Also, see
http://groups.google.co.nz/group/digitalnz/browse_thread/thread/ae01eacfacfbd9a4?pli=1
for previous comments I have made about this.
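The missing failure check in caveat 3 could be handled with a thin wrapper around the search call, so a down server or an over-limit key surfaces as a typed error instead of a silent failure. A rough sketch, with class and method names that are illustrative rather than from the deployed code:

```ruby
require 'net/http'
require 'uri'

# Hedged sketch for caveat 3: typed error for Digital NZ API failures.
class DigitalNZError < StandardError; end

# Classify an HTTP response, raising for anything other than success.
def classify_response(response)
  case response
  when Net::HTTPSuccess   then :ok
  when Net::HTTPForbidden then raise DigitalNZError, 'API key may be over its daily limit'
  else raise DigitalNZError, "search API returned #{response.code}"
  end
end

# Fetch a search URL, converting network-level failures into the same error.
def fetch_search_results(url)
  response = Net::HTTP.get_response(URI(url))
  classify_response(response)
  response.body
rescue SocketError, Errno::ECONNREFUSED, Timeout::Error => e
  raise DigitalNZError, "search API unreachable: #{e.message}"
end
```

Callers can then rescue the one error class and show a "Digital NZ is unavailable" message rather than an empty result page.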


Please feel free to have a play around and search for the likes of
where you live, where you have been on holiday, etc., as this will give
the cached metadata records a better geographical spread than me just
typing in random search terms myself. I'll keep the geoparsing
running on my laptop in order to get more records mapped.

Suggestions / comments welcome

Cheers

Gordon