initial attempts at geoparsing digital nz data

Gordon Anderson

Jun 2, 2009, 7:17:17 AM
to digi...@googlegroups.com
hi

This is still a work in progress but I thought others might be interested.  I've been having a look at geoparsing a small portion of the corpus of digital nz metadata to see if relevant text can be isolated and geographic locations extracted from it.  I think I might have found a way forward...

Initially (and this was code from a previous project a year ago) I tried parsing the text and throwing strings of consecutive words starting with capitals at the Google geocoder (e.g. "Wellington High School, Wellington").  However, this produced noisy results, especially when the geocoder returned locations for the likes of "View", "Photographer" and other random strings, particularly at the start of sentences.  The main problem I had was isolating a geographical region with reasonable precision without the noise getting in the way.
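
A minimal sketch of that capitalised-word approach, assuming a simple regular expression is good enough to find the runs. The constant and method names are hypothetical and this is not the original script; no geocoder is called, it only shows which strings would be sent.

```ruby
# Hypothetical helper: find runs of consecutive capitalised words,
# optionally joined by a comma, to use as candidate geocoder queries.
CAPITALISED_RUN = /\b[A-Z][a-z]+(?:,? [A-Z][a-z]+)*/

def candidate_phrases(text)
  text.scan(CAPITALISED_RUN)
end

text = "Looking along Oriental Parade, Wellington. Taken by Sydney Charles Smith circa 1912"
p candidate_phrases(text)
# => ["Looking", "Oriental Parade, Wellington", "Taken", "Sydney Charles Smith"]
```

Only one of the four candidates is a real place, which shows where the noise comes from: sentence-initial words and personal names look exactly like place names to a pattern of this kind.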

As well as seemingly innocuous random words producing geographical results, less innocuous ones also caused issues.  For example, the string "Looking along Oriental Parade, Wellington. Taken by Sydney Charles Smith circa 1912" would end up matching Sydney in Australia, which is not exactly desirable.  As such it was time to look for other APIs to filter out non-geographic text prior to geoparsing.


Open Calais (http://www.opencalais.com/) was the first API to come to the rescue.  It is provided by Reuters and attempts to extract metadata such as politicians, organisations, people and many other categories from submitted text alone.  With this I was able to identify the likes of "Bank of New Zealand" as an organisation (thus avoiding 'Bank' geomatching somewhere in the USA) and our friend Sydney Charles Smith as a person, and so exclude them from geoparsing queries.


Still not quite there yet though :)  I also came across Yahoo Placemaker (http://developer.yahoo.com/geo/placemaker/) a couple of nights ago and it seems to do a good job of identifying a region for a piece of text, which was the main issue I was having.  However, it was still matching Sydney Charles Smith as Sydney, and indeed appears to be case insensitive, as 'sydney charles smith' also matched the Australian city.  I instead opted to remove such text completely using Open Calais and started to see much more accurate results.


Currently I have ruby scripts that do the following:

- Download a natlib record by id and squirrel the metadata away in a database on my laptop to avoid a second hit against the API
- Create a metadata string for geoparsing by adding title, description, coverage, subject and placenames together
- Use Open Calais to remove organisation and people names (there may be others worth removing) and thus avoid noise
- Use Yahoo Placemaker to get a list of regions / towns / suburbs, as well as geographic extents (lat/lon bounding box)
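
The middle two steps could be sketched roughly as follows. This is an illustration under assumptions, not the original scripts: the record is assumed to be a Hash keyed by field name, the method names are hypothetical, and the list of names stands in for whatever Open Calais tags as Person or Organization (no API call is made here). The sample output further down suggests the names are removed outright from the text sent to Yahoo and lower-cased in the text kept for Google, so both variants are shown.

```ruby
# Hypothetical field names, following the list above.
FIELDS = %w[title description coverage subject placename]

# Join whichever fields are present into one block of text.
def metatext(record)
  FIELDS.map { |field| record[field] }.compact.join("\n\n")
end

# Remove tagged names completely (the "TEXT FOR YAHOO" variant).
def strip_entities(text, names)
  names.reduce(text) { |t, name| t.gsub(name, "") }
end

# Lower-case tagged names so a capitalised-word scan skips them
# (the "TEXT FOR GOOGLE" variant).
def downcase_entities(text, names)
  names.reduce(text) { |t, name| t.gsub(name, name.downcase) }
end

record = {
  "title"       => "Part 2 of a 2 part panorama of Wellington Public Hospital, Newtown, Wellington",
  "description" => "Part 2 of a 2 part panorama of Wellington Public Hospital, Newtown, Wellington, taken in 1910 by Sydney Charles Smith.",
  "coverage"    => "Newtown"
}
names = ["Sydney Charles Smith", "Wellington Public Hospital"]

puts strip_entities(metatext(record), names)
puts
puts downcase_entities(metatext(record), names)
```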

The next step (yet to be done) is to reuse my previous approach of parsing with the Google Geocoder, but with the regional information and bounding box provided by the Yahoo API used to filter out noise.
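
As a rough idea of what that filtering might look like, here is a sketch that keeps only candidate points falling inside a bounding box. The struct and its method are hypothetical, the box uses the extents reported for record 63489 in the output appended to this message, and the two candidate points are illustrative values rather than real geocoder responses.

```ruby
# Hypothetical bounding box built from south-west and north-east corners.
BoundingBox = Struct.new(:south, :west, :north, :east) do
  def contains?(lat, lon)
    lat.between?(south, north) && lon.between?(west, east)
  end
end

# Extents from the sample output for record 63489.
wellington = BoundingBox.new(-41.3491, 174.694, -41.1882, 174.847)

# Illustrative candidates, not actual geocoder output.
candidates = [
  { name: "Newtown, Wellington", lat: -41.3142, lon: 174.779 },
  { name: "Sydney, Australia",   lat: -33.8688, lon: 151.2093 }
]

kept = candidates.select { |c| wellington.contains?(c[:lat], c[:lon]) }
kept.each { |c| puts c[:name] }
```

A plain comparison like this does not handle a box that crosses the 180 degree meridian, which would need extra care for some New Zealand data.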

Appended below is some example output from my scripts - comments welcome :)

Cheers

Gordon



ORIGINAL METATEXT FOR RECORD:75636
====
Electric tram in Invercargill

Electric tram in Invercargill (destination: Georgetown), circa 1912. It bears an advertisement for Dominion Tread Tires. Photographer unidentified.

Invercargill City

1912

Trams - New Zealand - Southland Region

FILTERING
========
CALAIS TAGS:
--- 
Position: 
- Photographer
City: 
- Georgetown
- Invercargill City
Country: 
- New Zealand
Relations: 
- ""


TEXT FOR YAHOO
======
Electric tram in Invercargill

Electric tram in Invercargill (destination: Georgetown), circa 1912. It bears an advertisement for Dominion Tread Tires. Photographer unidentified.

Invercargill City

1912

Trams - New Zealand - Southland Region


TEXT FOR GOOGLE
=======
Electric tram in Invercargill

Electric tram in Invercargill (destination: Georgetown), circa 1912. It bears an advertisement for Dominion Tread Tires. Photographer unidentified.

Invercargill City

1912

Trams - New Zealand - Southland Region

RESULTS
LOC:2348887 Invercargill, Southland, NZ Town -46.4118 168.352 WEIGHT=1 CONFIDENCE=10
LOC:15021750 Southland, NZ State -45.4649 167.853 WEIGHT=1 CONFIDENCE=10
LOC:28644394 Georgetown, Invercargill, Southland, NZ Suburb -46.4201 168.369 WEIGHT=1 CONFIDENCE=10
LOC:55875899 Invercargill City, Southland, NZ County -46.474 168.369 WEIGHT=1 CONFIDENCE=10

EXTENTS
SW: -47.2911, 166.427
NE: -44.2551, 169.279
GEO SCOPE
Invercargill City, Southland, NZ

ADMIN SCOPE
Invercargill City, Southland, NZ


ORIGINAL METATEXT FOR RECORD:63489
====
Part 2 of a 2 part panorama of Wellington Public Hospital, Newtown, Wellington

Part 2 of a 2 part panorama of Wellington Public Hospital, Newtown, Wellington, taken in 1910 by Sydney Charles Smith.

Newtown

1910

Hospitals - New Zealand - Wellington Region

Wellington Hospital

FILTERING
========
CALAIS TAGS:
--- 
Person: 
- Sydney Charles Smith
City: 
- Newtown
- Wellington
Country: 
- New Zealand
Relations: 
- ""
Organization: 
- Wellington Public Hospital
REMOVING FROM GEO SEARCH:Sydney Charles Smith
REMOVING FROM GEO SEARCH:Wellington Public Hospital


TEXT FOR YAHOO
======
Part 2 of a 2 part panorama of , Newtown, Wellington

Part 2 of a 2 part panorama of , Newtown, Wellington, taken in 1910 by .

Newtown

1910

Hospitals - New Zealand - Wellington Region

Wellington Hospital


TEXT FOR GOOGLE
=======
Part 2 of a 2 part panorama of wellington public hospital, Newtown, Wellington

Part 2 of a 2 part panorama of wellington public hospital, Newtown, Wellington, taken in 1910 by sydney charles smith.

Newtown

1910

Hospitals - New Zealand - Wellington Region

Wellington Hospital

RESULTS
LOC:2351310 Wellington, Wellington, NZ Town -41.2805 174.767 WEIGHT=1 CONFIDENCE=6
LOC:22726472 Newtown, Wellington, Wellington, NZ Suburb -41.3142 174.779 WEIGHT=1 CONFIDENCE=10

EXTENTS
SW: -41.3491, 174.694
NE: -41.1882, 174.847
GEO SCOPE
Wellington, Wellington, NZ

ADMIN SCOPE
Wellington, Wellington, NZ





ORIGINAL METATEXT FOR RECORD:26868
====
Bowling Green Invercargill F G R 6824

Showing the exterior of the Invercargill Bowling Club in Yarrow Street, between Doon and Elles Road

Southland Region (N.Z.)

Southland Region (N.Z.)

Invercargill

Bowling Clubs

Yarrow Street

FILTERING
========
CALAIS TAGS:
--- 
Facility: 
- Elles Road
- Yarrow Street
Relations: 
- ""
Organization: 
- Invercargill Bowling Club
REMOVING FROM GEO SEARCH:Invercargill Bowling Club


TEXT FOR YAHOO
======
Bowling Green Invercargill F G R 6824

Showing the exterior of the  in Yarrow Street, between Doon and Elles Road

Southland Region (N.Z.)

Southland Region (N.Z.)

Invercargill

Bowling Clubs

Yarrow Street


TEXT FOR GOOGLE
=======
Bowling Green Invercargill F G R 6824

Showing the exterior of the invercargill bowling club in Yarrow Street, between Doon and Elles Road

Southland Region (N.Z.)

Southland Region (N.Z.)

Invercargill

Bowling Clubs

Yarrow Street

RESULTS
LOC:2348887 Invercargill, Southland, NZ Town -46.4118 168.352 WEIGHT=1 CONFIDENCE=7
LOC:2367481 Bowling Green, KY, US Town 36.9946 -86.4456 WEIGHT=1 CONFIDENCE=3
LOC:2392943 Doon, IA, US Town 43.2782 -96.2359 WEIGHT=1 CONFIDENCE=5
LOC:15021750 Southland, NZ State -45.4649 167.853 WEIGHT=1 CONFIDENCE=7
LOC:23424916 New Zealand Country -43.5877 170.367 WEIGHT=1 CONFIDENCE=7

EXTENTS
SW: -52.6171, -96.2436
NE: 43.2865, 169.279
GEO SCOPE
Pacific/Auckland, ZZ

ADMIN SCOPE
New Zealand








Paul Hagon

Jun 3, 2009, 5:01:10 AM
to DigitalNZ
Awesome - you beat me to playing around with Placemaker. It looks to
be giving some good results. With the things I've been playing around
with I seem to be getting better results from Yahoo's geocoding
services than from Google's, but I still need to do further work. Thanks for
sharing.


On Jun 2, 9:17 pm, Gordon Anderson <gordon.b.ander...@gmail.com>
wrote:
> *REMOVING FROM GEO SEARCH:Sydney Charles Smith*
> *REMOVING FROM GEO SEARCH:Wellington Public Hospital*

Gordon Paynter

Jun 6, 2009, 5:51:32 AM
to DigitalNZ
Hi Gordon:

This is great. We (DigitalNZ) still hope to do something like this,
but we haven't gotten to it yet.

More generally, we'd like to automatically add all sorts of metadata to
selected records, not just geospatial metadata. And we'll probably
eventually open up a service so that (duly authorised) third parties
can submit metadata (geotags, or just tags, or anything else) back to
the DigitalNZ metadata database, but there are a few technical hurdles
first, and we probably have to go back to the content partners too.

I'm surprised the "placename" metadata isn't more generally useful to
you. FYI "placename" is a holding field where I dumped any
geographical coverage metadata until we have time to add some support
for getting it into a more useful form. Some content partners don't
provide anything suitable, and in other cases it is a bit messy, but
overall I am pretty sure that field is all geographical. I'd be
interested in any feedback you have on the placename values.

Anyway, thanks for sharing your experience, please let us know how you
get on with the Google Geocoder. My approach to this was going to be
to take the data that is in the placename field and run it through
http://www.geonames.org/ but your approach is going to be a lot more
useful for records with no placename metadata. It would be interesting
to see if the two methods give comparable results. Or if your methods
can reliably provide useful results for records with no formal
placename metadata.

Gordon


Gordon Anderson

Jun 10, 2009, 2:57:21 AM
to digi...@googlegroups.com
Hi Gordon

Thanks for your feedback.  Let's hope people don't get confused by a Gordon talking to a Gordon :)


> Hi Gordon:
>
> This is great. We (DigitalNZ) still hope to do something like this,
> but we haven't gotten to it yet.
>
> More generally, we'd like to automatically add all sorts of metadata to
> selected records, not just geospatial metadata.

Open Calais might well be of use to you here - I can post some examples later if you wish.  It attempts to isolate not only geographical regions but also people, politicians, organisations etc.

 
> And we'll probably
> eventually open up a service so that (duly authorised) third parties
> can submit metadata (geotags, or just tags, or anything else) back to
> the DigitalNZ metadata database, but there are a few technical hurdles
> first, and we probably have to go back to the content partners too.
Cool - I can see how it would be useful to have a human definitively isolate the correct geographic location from a number of automatically created points.


> I'm surprised the "placename" metadata isn't more generally useful to
> you. FYI "placename" is a holding field where I dumped any
> geographical coverage metadata until we have time to add some support
> for getting it into a more useful form.

From the random records I have downloaded, I did not see the placename field populated all that often.
 
> Some content partners don't
> provide anything suitable, and in other cases it is a bit messy, but
> overall I am pretty sure that field is all geographical. I'd be
> interested in any feedback you have on the placename values.

From a quick look, the values I am seeing are mainly regional.  I do make use of the value when it is available, but I am trying to get more precise results where possible.
 


> Anyway, thanks for sharing your experience, please let us know how you
> get on with the Google Geocoder. My approach to this was going to be
> to take the data that is in the placename field and run it through
> http://www.geonames.org/ but your approach is going to be a lot more
> useful for records with no placename metadata.
 
From memory, there are licensing issues with the Google Geocoder (such as results only being allowed to be shown on a Google map), so geonames, which I have also played with, might be a better option in the spirit of open data.

 
> It would be interesting
> to see if the two methods give comparable results. Or if your methods
> can reliably provide useful results for records with no formal
> placename metadata.

I have posted screenshots of some examples at http://picasaweb.google.com/gordon.b.anderson/DigitalNZGeoparsing - these are all automated results using the algorithm outlined in my first email on this thread, and have not been altered by myself.  It should be noted that images 4 and 5 in the above sequence are the same record, where one outlying point (Wakefield in the UK) exists and the other four are clustered closely around the correct location.  I'd welcome suggestions on how to statistically remove these anomalies.
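
One possible starting point, offered as an untested sketch rather than a recommendation: take the median latitude and longitude of the points as a robust centre, then drop anything further than some cut-off distance from it. The method names, the 100 km cut-off and the four clustered coordinates below are all made up for illustration; only the approximate position of Wakefield in the UK is a real-world value.

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two lat/lon points (haversine formula).
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Keep points within max_km of the component-wise median position.
def drop_outliers(points, max_km: 100)
  centre_lat = median(points.map { |p| p[:lat] })
  centre_lon = median(points.map { |p| p[:lon] })
  points.select do |p|
    haversine_km(centre_lat, centre_lon, p[:lat], p[:lon]) <= max_km
  end
end

# Four invented clustered points plus Wakefield in the UK.
points = [
  { lat: -41.40, lon: 173.05 }, { lat: -41.41, lon: 173.04 },
  { lat: -41.39, lon: 173.06 }, { lat: -41.40, lon: 173.05 },
  { lat: 53.68,  lon: -1.50 }
]
puts drop_outliers(points).length
```

The median is used instead of the mean because a single far-away point drags the mean a long way from the cluster. This only helps when most of the points are correct, and taking medians of longitude would misbehave for points either side of the 180 degree meridian.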

Cheers

Gordon

 
