initial attempts at geoparsing digital nz data
From: Gordon Anderson <gordon.b.ander...@gmail.com>
Date: Tue, 2 Jun 2009 23:17:17 +1200
Local: Tues, Jun 2 2009 7:17 am
Subject: initial attempts at geoparsing digital nz data

hi
This is still a work in progress but I thought others might be interested.
I've been having a look at geoparsing a small portion of the corpus of
digital nz metadata to see if relevant text can be isolated and geographic
locations extracted from it.  I think I might have found a way forward...

Initially (and this was code from a previous project a year ago) I tried
parsing the text and throwing strings of consecutive words starting with
capitals at the google geocoder (e.g. "Wellington High School, Wellington").
However this seemed to produce noisy results, especially when the google
geocoder returned locations for the likes of "View", "Photographer" and other
random strings, particularly at the start of sentences.  The main problem I had
was in isolating a geographical region with reasonable precision without the
noise getting in the way.

As well as seemingly innocuous random words producing geographical
results, less innocuous ones also caused issues - as an example the string
"Looking along Oriental Parade, Wellington. Taken by Sydney Charles Smith
circa 1912" would end up matching Sydney in Australia, which is not
exactly desirable.  As such it was time to look for other APIs to attempt to
filter out non-geographic text prior to geoparsing.
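
The capitalised-string approach described above can be sketched roughly as follows. This is a minimal illustration and not the original script: the method name `capitalised_runs` and the regular expression are assumptions, and no geocoder is actually called. It only shows which candidate strings would be sent off, and why "Looking", "Taken" and "Sydney Charles Smith" become noise.

```ruby
# Hypothetical helper (not from the original scripts): collect runs of
# consecutive capitalised words, optionally joined by a comma, as
# candidate strings for a geocoder.
def capitalised_runs(text)
  text.scan(/[A-Z][A-Za-z]*(?:,? [A-Z][A-Za-z]*)*/)
end

caption = "Looking along Oriental Parade, Wellington. " \
          "Taken by Sydney Charles Smith circa 1912"

capitalised_runs(caption).each { |candidate| puts candidate }
# Looking
# Oriental Parade, Wellington
# Taken
# Sydney Charles Smith
```

Only the second candidate is a useful place name; the sentence-initial words and the photographer's name are exactly the kind of input that needs filtering out before geocoding.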

Open Calais (http://www.opencalais.com/) was the first API to come to the
rescue.  It is provided by Reuters and attempts to extract metadata such as
politicians, organisations, people and many other categories from submitted
text alone.  With this I was able to identify the likes of "Bank of New
Zealand" as an organisation (thus avoiding 'Bank' geomatching somewhere in the
USA) and our friend Sydney Charles Smith as a person, and thus exclude them
from geoparsing queries.

Still not quite there yet though :)  I also came across Yahoo Placemaker
(http://developer.yahoo.com/geo/placemaker/) a couple of nights ago and it
seems to do a good job of identifying a region for a piece of text, the main
issue I was having.  However it was still matching Sydney Charles Smith as
Sydney, and indeed appears to be case insensitive as 'sydney charles smith'
also matched the Australian city.  I instead opted to remove such text
completely using Open Calais and started to see much more accurate results.

Currently I have ruby scripts that do the following:

- Download a natlib record by id and squirrel the metadata away in a database
on my laptop to avoid a second hit against the API
- Create a metadata string for geoparsing by adding title, description,
coverage, subject and placenames together
- Use open calais to remove organisations and people names (there may be
others worth removing) and thus avoid noise
- Use Yahoo Placemaker to get a list of regions / towns / suburbs, as well
as geographic extents (lat/lon bounding box)
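
The middle two steps can be sketched as below. This is an illustrative sketch only, not the actual scripts: the record hash, the method names and the shape of the tag hash are all assumptions, and no call is made to Open Calais or Placemaker. The sample output further down suggests that names are deleted from the text sent to Yahoo (which matches case-insensitively) and merely lowercased in the text kept for Google (so a capitalised-word parser skips them); the sketch follows that reading.

```ruby
# Hypothetical field names, taken from the list of steps above.
METATEXT_FIELDS = %w[title description coverage subject placename].freeze

# Tag categories whose values should not be geoparsed (assumed spelling,
# copied from the CALAIS TAGS sample output).
REMOVE_CATEGORIES = %w[Person Organization].freeze

# Join whichever metadata fields are present into one block of text.
def build_metatext(record)
  METATEXT_FIELDS.map { |field| record[field] }.compact.join("\n\n")
end

# tags is assumed to be a hash of category => array of strings.
def names_to_remove(tags)
  REMOVE_CATEGORIES.flat_map { |category| tags.fetch(category, []) }
end

# Text for a case-insensitive service: delete the names outright.
def text_for_placemaker(text, names)
  names.reduce(text) { |acc, name| acc.gsub(name, "") }
end

# Text for a capitalised-word parser: lowercase the names so they are skipped.
def text_for_geocoder(text, names)
  names.reduce(text) { |acc, name| acc.gsub(name, name.downcase) }
end

tags = {
  "Person"       => ["Sydney Charles Smith"],
  "City"         => ["Newtown", "Wellington"],
  "Organization" => ["Wellington Public Hospital"]
}
text = "Part 2 of a 2 part panorama of Wellington Public Hospital, " \
       "Newtown, Wellington, taken in 1910 by Sydney Charles Smith."

puts text_for_placemaker(text, names_to_remove(tags))
puts text_for_geocoder(text, names_to_remove(tags))
```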

The next step (yet to be done) is to use my previous attempts at parsing
using the Google Geocoder but use the regional information and bounding box
provided by the Yahoo API to filter out noise.
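
As this step has not been written yet, the following is only a sketch of how such a filter might look, under stated assumptions: `BoundingBox` and `within_extents` are hypothetical names, candidates are assumed to be hashes with `:lat` and `:lon` keys, and the box is built from SW/NE corners like those in the EXTENTS output. It does not handle a box that crosses the 180 degree meridian, which matters for some New Zealand territory.

```ruby
# Hypothetical bounding box built from SW and NE corners.
BoundingBox = Struct.new(:south, :west, :north, :east) do
  # True when the point lies inside the box (edges included).
  # Assumes the box does not cross the antimeridian.
  def contains?(lat, lon)
    lat.between?(south, north) && lon.between?(west, east)
  end
end

# Keep only the geocoder candidates that fall inside the region.
def within_extents(candidates, box)
  candidates.select { |c| box.contains?(c[:lat], c[:lon]) }
end

# Extents reported for record 63489 in the sample output.
box = BoundingBox.new(-41.3491, 174.694, -41.1882, 174.847)

candidates = [
  { name: "Newtown, Wellington", lat: -41.3142, lon: 174.779 },
  { name: "Sydney, Australia",   lat: -33.87,   lon: 151.21 } # approximate
]

within_extents(candidates, box).each { |c| puts c[:name] }
# Newtown, Wellington
```

A filter like this is only as good as the box it is given: for a record such as 26868, where stray US matches stretch the extents across the Pacific, almost nothing would be excluded.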

Appended below is some example output from my scripts - comments welcome :)

Cheers

Gordon

ORIGINAL METATEXT FOR RECORD:75636
====
Electric tram in Invercargill

Electric tram in Invercargill (destination: Georgetown), circa 1912. It
bears an advertisement for Dominion Tread Tires. Photographer unidentified.

Invercargill City

1912

Trams - New Zealand - Southland Region

FILTERING
========
CALAIS TAGS:
---
Position:
- Photographer
City:
- Georgetown
- Invercargill City
Country:
- New Zealand
Relations:
- ""

TEXT FOR YAHOO
======
Electric tram in Invercargill

Electric tram in Invercargill (destination: Georgetown), circa 1912. It
bears an advertisement for Dominion Tread Tires. Photographer unidentified.

Invercargill City

1912

Trams - New Zealand - Southland Region

TEXT FOR GOOGLE
=======
Electric tram in Invercargill

Electric tram in Invercargill (destination: Georgetown), circa 1912. It
bears an advertisement for Dominion Tread Tires. Photographer unidentified.

Invercargill City

1912

Trams - New Zealand - Southland Region

RESULTS
LOC:2348887 Invercargill, Southland, NZ Town -46.4118 168.352 WEIGHT=1 CONFIDENCE=10
LOC:15021750 Southland, NZ State -45.4649 167.853 WEIGHT=1 CONFIDENCE=10
LOC:28644394 Georgetown, Invercargill, Southland, NZ Suburb -46.4201 168.369 WEIGHT=1 CONFIDENCE=10
LOC:55875899 Invercargill City, Southland, NZ County -46.474 168.369 WEIGHT=1 CONFIDENCE=10

EXTENTS
SW: -47.2911, 166.427
NE: -44.2551, 169.279
GEO SCOPE
Invercargill City, Southland, NZ

ADMIN SCOPE
Invercargill City, Southland, NZ

ORIGINAL METATEXT FOR RECORD:63489
====
Part 2 of a 2 part panorama of Wellington Public Hospital, Newtown,
Wellington

Part 2 of a 2 part panorama of Wellington Public Hospital, Newtown,
Wellington, taken in 1910 by Sydney Charles Smith.

Newtown

1910

Hospitals - New Zealand - Wellington Region

Wellington Hospital

FILTERING
========
CALAIS TAGS:
---
Person:
- Sydney Charles Smith
City:
- Newtown
- Wellington
Country:
- New Zealand
Relations:
- ""
Organization:
- Wellington Public Hospital
REMOVING FROM GEO SEARCH:Sydney Charles Smith
REMOVING FROM GEO SEARCH:Wellington Public Hospital

TEXT FOR YAHOO
======
Part 2 of a 2 part panorama of , Newtown, Wellington

Part 2 of a 2 part panorama of , Newtown, Wellington, taken in 1910 by .

Newtown

1910

Hospitals - New Zealand - Wellington Region

Wellington Hospital

TEXT FOR GOOGLE
=======
Part 2 of a 2 part panorama of wellington public hospital, Newtown,
Wellington

Part 2 of a 2 part panorama of wellington public hospital, Newtown,
Wellington, taken in 1910 by sydney charles smith.

Newtown

1910

Hospitals - New Zealand - Wellington Region

Wellington Hospital

RESULTS
LOC:2351310 Wellington, Wellington, NZ Town -41.2805 174.767 WEIGHT=1 CONFIDENCE=6
LOC:22726472 Newtown, Wellington, Wellington, NZ Suburb -41.3142 174.779 WEIGHT=1 CONFIDENCE=10

EXTENTS
SW: -41.3491, 174.694
NE: -41.1882, 174.847
GEO SCOPE
Wellington, Wellington, NZ

ADMIN SCOPE
Wellington, Wellington, NZ

ORIGINAL METATEXT FOR RECORD:26868
====
Bowling Green Invercargill F G R 6824

Showing the exterior of the Invercargill Bowling Club in Yarrow Street,
between Doon and Elles Road

Southland Region (N.Z.)

Southland Region (N.Z.)

Invercargill

Bowling Clubs

Yarrow Street

FILTERING
========
CALAIS TAGS:
---
Facility:
- Elles Road
- Yarrow Street
Relations:
- ""
Organization:
- Invercargill Bowling Club
REMOVING FROM GEO SEARCH:Invercargill Bowling Club

TEXT FOR YAHOO
======
Bowling Green Invercargill F G R 6824

Showing the exterior of the  in Yarrow Street, between Doon and Elles Road

Southland Region (N.Z.)

Southland Region (N.Z.)

Invercargill

Bowling Clubs

Yarrow Street

TEXT FOR GOOGLE
=======
Bowling Green Invercargill F G R 6824

Showing the exterior of the invercargill bowling club in Yarrow Street,
between Doon and Elles Road

Southland Region (N.Z.)

Southland Region (N.Z.)

Invercargill

Bowling Clubs

Yarrow Street

RESULTS
LOC:2348887 Invercargill, Southland, NZ Town -46.4118 168.352 WEIGHT=1 CONFIDENCE=7
LOC:2367481 Bowling Green, KY, US Town 36.9946 -86.4456 WEIGHT=1 CONFIDENCE=3
LOC:2392943 Doon, IA, US Town 43.2782 -96.2359 WEIGHT=1 CONFIDENCE=5
LOC:15021750 Southland, NZ State -45.4649 167.853 WEIGHT=1 CONFIDENCE=7
LOC:23424916 New Zealand Country -43.5877 170.367 WEIGHT=1 CONFIDENCE=7

EXTENTS
SW: -52.6171, -96.2436
NE: 43.2865, 169.279
GEO SCOPE
Pacific/Auckland, ZZ

ADMIN SCOPE
New Zealand


 
From: Paul Hagon <paul.ha...@gmail.com>
Date: Wed, 3 Jun 2009 02:01:10 -0700 (PDT)
Local: Wed, Jun 3 2009 5:01 am
Subject: Re: initial attempts at geoparsing digital nz data
Awesome - you beat me to playing around with Placemaker. It looks to
be giving some good results. With the things I've been playing around
with I seem to be getting better results from Yahoo's geocoding
services than from Google's, but I still need to do further work. Thanks for
sharing.



 
From: Gordon Paynter <gordon.payn...@gmail.com>
Date: Sat, 6 Jun 2009 02:51:32 -0700 (PDT)
Local: Sat, Jun 6 2009 5:51 am
Subject: Re: initial attempts at geoparsing digital nz data
Hi Gordon:

This is great. We (DigitalNZ) still hope to do something like this,
but we haven't gotten to it yet.

More generally, we'd like to automatically add all sorts of metadata to
selected records, not just geospatial metadata. And we'll probably
eventually open up a service so that (duly authorised) third parties
can submit metadata (geotags, or just tags, or anything else) back to
the DigitalNZ metadata database, but there are a few technical hurdles
first, and we probably have to go back to the content partners too.

I'm surprised the "placename" metadata isn't more generally useful to
you. FYI "placename" is a holding field where I dumped any
geographical coverage metadata until we have time to add some support
for getting it into a more useful form. Some content partners don't
provide anything suitable, and in other cases it is a bit messy, but
overall I am pretty sure that field is all geographical. I'd be
interested in any feedback you have on the placename values.

Anyway, thanks for sharing your experience, please let us know how you
get on with the Google Geocoder.   My approach to this was going to be
to take the data that is in the placename field and run it through
http://www.geonames.org/  but your approach is going to be a lot more
useful for records with no placename metadata. It would be interesting
to see if the two methods give comparable results.  Or if your methods
can reliably provide useful results for records with no formal
placename metadata.

Gordon


 
From: Gordon Anderson <gordon.b.ander...@gmail.com>
Date: Wed, 10 Jun 2009 18:57:21 +1200
Local: Wed, Jun 10 2009 2:57 am
Subject: Re: [DigitalNZ] Re: initial attempts at geoparsing digital nz data

Hi Gordon

Thanks for your feedback.  Let's hope people don't get confused by a Gordon
talking to a Gordon :)

> Hi Gordon:

> This is great. We (DigitalNZ) still hope to do something like this,
> but we haven't gotten to it yet.

> More generally, we'd like to automatically add all sorts of metadata to
> selected records, not just geospatial metadata.

Open Calais might well be of use to you here - I can post some examples
later if you wish.  It not only attempts to isolate geographical regions but
people, politicians, organisations etc

> And we'll probably
> eventually open up a service so that (duly authorised) third parties
> can submit metadata (geotags, or just tags, or anything else) back to
> the DigitalNZ metadata database, but there are a few technical hurdles
> first, and we probably have to go back to the content partners too.

cool - I can see how it would be useful to have a human definitively
isolate the correct geographic location from a number of automatically
created points.

> I'm surprised the "placename" metadata isn't more generally useful to
> you. FYI "placename" is a holding field where I dumped any
> geographical coverage metadata until we have time to add some support
> for getting it into a more useful form.

from the random records I have downloaded I did not see the placename field
populated all that often.

> Some content partners don't
> provide anything suitable, and in other cases it is a bit messy, but
> overall I am pretty sure that field is all geographical. I'd be
> interested in any feedback you have on the placename values.

Just having a quick look I am seeing values that are mainly regional.   I am
making use of the value if it is available though, but I am trying to get
more accurate results where possible.

> Anyway, thanks for sharing your experience, please let us know how you
> get on with the Google Geocoder.   My approach to this was going to be
> to take the data that is in the placename field and run it through
> http://www.geonames.org/  but your approach is going to be a lot more
> useful for records with no placename metadata.

I think there are licensing issues as regards the Google Geocoder (from
memory, results may only be shown as points on a Google map), so geonames,
which I have also played with, might be a better option in the spirit of open
data.

> It would be interesting
> to see if the two methods give comparable results.  Or if your methods
> can reliably provide useful results for records with no formal
> placename metadata.

I have posted screenshots of some examples at
http://picasaweb.google.com/gordon.b.anderson/DigitalNZGeoparsing - these
are all automated results using the algorithm outlined in my first email on
this thread, and have not been altered by myself.  It should be noted that
images 4 and 5 in the above sequence are the same record, where one outlying
point (Wakefield in the UK) exists and the other four are clustered closely
around the correct location.  I'd welcome suggestions on how to statistically
remove these anomalies.
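
One possible approach, offered as a sketch rather than anything used in this thread: take the median latitude and longitude of the points as a robust centre, then drop any point further than some cut-off distance from it. The method names, the 100 km default and the sample coordinates below are all assumptions made up for illustration (the thread does not say which record or coordinates the screenshots show). Taking a plain median of longitudes also assumes the cluster does not straddle the 180 degree meridian.

```ruby
EARTH_RADIUS_KM = 6371.0

# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Great-circle distance in km between two lat/lon points (haversine formula).
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  # Clamp to guard against rounding pushing a just above 1.
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([a, 1.0].min))
end

# Drop points further than max_km from the median centre.
# With fewer than three points there is no meaningful majority, so keep all.
def drop_outliers(points, max_km: 100.0)
  return points if points.length < 3

  centre_lat = median(points.map { |p| p[:lat] })
  centre_lon = median(points.map { |p| p[:lon] })
  points.select do |p|
    haversine_km(centre_lat, centre_lon, p[:lat], p[:lon]) <= max_km
  end
end

# Made-up coordinates: four clustered points plus one distant outlier.
points = [
  { name: "A", lat: -41.40, lon: 173.05 },
  { name: "B", lat: -41.41, lon: 173.04 },
  { name: "C", lat: -41.39, lon: 173.06 },
  { name: "D", lat: -41.42, lon: 173.05 },
  { name: "Wakefield, UK (approximate)", lat: 53.68, lon: -1.50 }
]

puts drop_outliers(points).map { |p| p[:name] }.join(", ")
# A, B, C, D
```

The median is used rather than the mean because a single point on the other side of the world drags the mean far from the cluster, while the median stays with the majority.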

Cheers

Gordon


 