Pleiades Plus: Machine Alignment of Pleiades and GeoNames

Tom Elliott

Mar 18, 2014, 10:20:25 AM
to lod...@googlegroups.com
I thought some members might be interested in the following:

Pleiades Plus is an experimental machine alignment between Pleiades place resources and content in the GeoNames Gazetteer.

It pairs Pleiades URIs with GeoNames URIs when a given pair seems likely to identify the same place. Conceived and prototyped by Leif Isaksen (University of Southampton/Pelagios Project), the current version is produced daily by Ryan Baumann (Duke Collaboratory for Classics Computing). As with all of Pleiades' native data serializations, you can download the Pleiades Plus dataset for free from the Pleiades Download page.

We hope this alignment will facilitate supervised accession of GeoNames links into Pleiades that will then surface in our linked data outputs. We also hope that third parties will be able to make use of this resource in a variety of ways.

Tom

Tom Elliott, Ph.D.
Associate Director for Digital Programs and Senior Research Scholar
Institute for the Study of the Ancient World (NYU)
http://isaw.nyu.edu/people/staff/tom-elliott



Hugh Glaser

Mar 24, 2014, 12:40:22 PM
to lod...@googlegroups.com
Hi.
Nice stuff.
So I picked the mapping up and made a sameAs store out of it at
http://sameas.org/store/pleiades/
I also put the same data in the main sameAs.org store, so they can get more linkage.

However, the URIs I used for geonames are not the ones from the file.
The file has URIs on the www.geonames.org domain, which is fine to see the page, but the Linked Data URIs are sws.geonames.org, which are what I have used.
Leif, you may decide that the sws ones are better?

In case that was unclear, www.geonames.org URIs are *not* Linked Data URIs, because you can't get RDF back from them (as far as I know).
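
A minimal sketch of the host swap described above, assuming the file's GeoNames URIs look like http://www.geonames.org/<id> (an assumption; check the actual file). The function name is hypothetical, and this is not taken from the pleiades-plus script. It only rewrites the host and leaves the path untouched.

```python
from urllib.parse import urlsplit


def to_linked_data_uri(uri: str) -> str:
    """Swap the www.geonames.org host for sws.geonames.org, keeping the path.

    Hypothetical helper: URIs on any other host are returned unchanged.
    """
    cleaned = uri.strip()
    parts = urlsplit(cleaned)
    if parts.netloc != "www.geonames.org":
        return cleaned
    return parts._replace(netloc="sws.geonames.org").geturl()


print(to_linked_data_uri("http://www.geonames.org/146669"))
```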

Anyway, I've made a start - if anyone wants me to do any different or more, please ask.

Best
Hugh

Leif Isaksen

Mar 24, 2014, 4:54:49 PM
to lod...@googlegroups.com, Ryan Baumann
Fantastic, thanks Hugh!

@Ryan, I think Hugh is right here. I guess this should be a two-letter tweak to your script and the cron-job will do the rest?

@Hugh can sameAs.org be updated on a regular basis? Ryan's script is intended to run as a nightly job so that new entries to Pleiades and Geonames are included. If that complicates things, are there things that would make it easier?

@Rainer, if Hugh does this for similar alignments, I wonder if your gazetteer alignment tool could draw from SameAs.org directly, rather than locating and parsing individual alignment files?

@Everyone else - as Tom suggests, it would be great to hear about similar alignment activity, or even just requests for specific alignments (PastPlace? TGN? TMGeo? Ordnance Survey?). In many cases, Ryan's work may get us most of the way there already.

All the best

L.





Ryan Baumann

Mar 25, 2014, 11:19:41 AM
to Leif Isaksen, lod...@googlegroups.com
Hi all,

I've gone ahead and switched the script to use sws.geonames.org, so
that should hopefully get picked up in tomorrow's run:
https://github.com/ryanfb/pleiades-plus/commit/00fe86995a40de72176f83951cd1d8304965ddc0

Leif also asked that I repost some information about the reusability
of the Pleiades Plus script on this thread:

In its current state it will probably work best for aligning
gazetteers which, like GeoNames, have machine-actionable geo
coordinates associated with names in some way. For some databases it
might be necessary to perform a first pass against it that adds
approximate location based on some other criteria (country, region,
etc.) before using the pleiades-plus logic to align them. This is
mostly just to filter down results to likely candidates, as otherwise
in any ambiguous cases you'll get all permutations of name match
combinations. Thinking about it now, a tool for doing such first-pass
processing of hierarchically-organized place resources might be
generally useful and a nice separation of concerns for the tooling (we
have other databases we'd like to align against Pleiades that have the
same problem of having their own project-specific geographic
hierarchical organization without explicit geo coordinates).
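
To illustrate why coordinates matter for filtering candidates, here is a rough sketch of the general idea: keep only same-name records that also fall within some distance of each other. This is not the pleiades-plus code itself. The record layout, the ids, the 25 km threshold and the function names are all assumptions for illustration, and the coordinates are approximate.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (mean Earth radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def candidates(place, others, max_km=25.0):
    """Same-name records within max_km of `place` (threshold is an assumption)."""
    return [
        o for o in others
        if o["name"] == place["name"]
        and haversine_km(place["lat"], place["lon"], o["lat"], o["lon"]) <= max_km
    ]


# Made-up records: ids are hypothetical, coordinates approximate.
ancient = {"id": "a1", "name": "Alexandria", "lat": 31.18, "lon": 29.90}
modern = [
    {"id": "m1", "name": "Alexandria", "lat": 31.20, "lon": 29.92},   # Egypt
    {"id": "m2", "name": "Alexandria", "lat": 38.80, "lon": -77.05},  # Virginia
]
print([o["id"] for o in candidates(ancient, modern)])
```

Without the distance check, both same-name records would come back as matches, which is the "all permutations" problem described above.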

It would also be good to add some logic to pleiades-plus that goes
beyond exact string match for finding candidates, which would probably
help with GeoNames alignment as well. I'm actually at the Code4Lib
conference this week and attended a session on OpenRefine yesterday,
which got me thinking about the potential for both Pleiades and
GeoNames etc. reconciliation services for OpenRefine. One nice thing
about this is OpenRefine already has some string similarity and
clustering built in. I also wonder if there might be some potential
for a general geospatial processing extension for OpenRefine (for e.g.
spatial operations). I got a very initial first pass at a Pleiades
name reconciliation service working which I've gone ahead and put up
here: https://github.com/ryanfb/reconciliation_service_skeleton
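
As a small illustration of going beyond exact string match, the sketch below scores name pairs with Python's standard-library difflib. This is a stand-in, not OpenRefine's clustering methods and not anything in pleiades-plus; the function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def fuzzy_matches(name, names, threshold=0.8):
    """Names whose similarity to `name` meets the (assumed) threshold."""
    return [n for n in names if name_similarity(name, n) >= threshold]


print(fuzzy_matches("Athenae", ["Athenai", "Roma"]))
```

Any threshold would need tuning against real name data, and a fuzzy match is only a candidate for review, not an assertion of identity.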

Best,
-Ryan


Hugh Glaser

Mar 28, 2014, 11:42:54 AM
to lod...@googlegroups.com
OK, that's really nice - you have shown interest, so I'll put some more work in :-)
It seems that you will benefit from being on the new platform, which is much easier to manage.
For simplicity I have put you on a SOCIAM project machine - I'm sure Nigel will be happy to help.
There is now an (almost) empty store at
http://sociam-pub.ecs.soton.ac.uk/pleiades/sameas/
There you can put pairs to your heart's content, using the REST-like HTTP PUT, as specified in the usage page at
http://sociam-pub.ecs.soton.ac.uk/pleiades/sameas/usage/
E.g. http://sociam-pub.ecs.soton.ac.uk/pleiades/sameas/pairs/http%3A%2F%2Fpleiades.stoa.org%2Fplaces%2F993%20/http%3A%2F%2Fsws.geonames.org%2F146669/
which I have asserted.
The admin page (http://sociam-pub.ecs.soton.ac.uk/pleiades/sameas/admin/) lets you clear the store out etc.
As you would expect, you can also put a whole file of pairs at
http://sociam-pub.ecs.soton.ac.uk/pleiades/sameas/pairs/
I suspect simply emptying the store and reasserting all the data each time it changes will be the easiest thing.
I hope that makes sense.

If you want more stores, just ask - your auth won't let you create new ones yourself.
Other kinds of store are often useful too: a differentFrom store, for example, or ones that give labels for URIs, or vice versa (reconciliation).

Email me for the auth details - probably better than putting them here :-)

Finally, if you find yourself having to move between coordinate systems, we have a simple REST service that does that at http://dev.ragld.com/services/coords/
It only does UK for local ones, but I can add others if you want and you tell me the code to do it.

Good luck.



Leif Isaksen

Mar 28, 2014, 11:49:20 AM
to lod...@googlegroups.com
Cheers Hugh

Multiple translatable coordinate systems? Groovy! 

L.


Hugh Glaser

Mar 28, 2014, 11:50:41 AM
to lod...@googlegroups.com
Oops - important lesson :-)
There is a space (%20) at the end of the first URI in the PUT URI because it was created with cut and paste.
So there was a space at the end in the store.
I fixed it in the store.
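
One way to avoid the stray-space problem is to build the pair URL in code and strip whitespace before encoding. The sketch below only constructs the URL; it sends nothing. The `pairs/<uri>/<uri>/` shape is inferred from the example URL in the earlier message, so treat it as an assumption and defer to the usage page. The function name is hypothetical.

```python
from urllib.parse import quote

# Store address from the message above.
BASE = "http://sociam-pub.ecs.soton.ac.uk/pleiades/sameas/"


def pair_url(uri_a: str, uri_b: str, base: str = BASE) -> str:
    """Build a pairs URL, stripping whitespace and percent-encoding each URI.

    The path layout is inferred from the example in the thread (assumption).
    """
    enc_a = quote(uri_a.strip(), safe="")
    enc_b = quote(uri_b.strip(), safe="")
    return f"{base}pairs/{enc_a}/{enc_b}/"


# The trailing space mimics a cut-and-paste accident; it is stripped.
print(pair_url("http://pleiades.stoa.org/places/993 ", "http://sws.geonames.org/146669"))
```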



Dallan Quass

Mar 29, 2014, 12:54:07 PM
to lod...@googlegroups.com
WeRelate.org has a database of historical places: http://www.werelate.org/wiki/Portal:Place . I'd be happy to make the data available in a csv or json format if someone wanted to integrate it.


Leif Isaksen

Mar 31, 2014, 6:14:29 AM
to lod...@googlegroups.com
Thanks Dallan

An alignment to this would be extremely useful. A couple of quick Qs:

- Am I right in thinking that WeRelate specialises in historical places up to 1900? In some senses that's a meaningless question, of course, but I see that you assign regional jurisdictions to that time. I suppose what I'm really suggesting is that all gazetteers make a note of their intended temporal, geographic and conceptual scope, so that when we choose one to annotate with we can pick the most appropriate (while still being indirectly connected to all the other gazetteers).

- As we start to connect more gazetteers it may make sense to align them with a broad/shallow gazetteer like wikidata as a bridge to other gazetteers, rather than many-to-many. Rainer and Humphrey have been looking into this in more detail recently and have a meeting planned with the wikidata folks. 

Humphrey/Rainer do you have any thoughts about whether this is currently a workable strategy or are there any wrinkles that need ironing out first? Once this is clearer I suspect this might be an essential modification to Ryan's alignment tools.

All the best

Leif

Dallan Quass

Apr 1, 2014, 10:39:41 AM
to lod...@googlegroups.com
The places are a merge of three sources:

* A dump of the English-language Wikipedia around 2007 - I extracted around 50,000 places organized into the hierarchy that was current in 2007.
* Political places from the Getty TGN around the same time - those places were generally organized into the hierarchy that was current in 1990.
* The FHLC, which has roughly half a million historical places without lat/lon coordinates. Places in the FHLC appear according to a single snapshot in time. In general, most places in the FHLC are assigned the jurisdictional hierarchy that was current around 1900. Roughly half of the places at WeRelate appear only in the FHLC (and not in the other two sources), so I chose to follow the FHLC convention of using 1900 as the jurisdictional hierarchy to use when assigning Place page titles.

I spent a few months merging these sources as best as I could. It's not a perfect merge, and there are still some duplicates especially in Eastern Europe. Place pages in WeRelate support multiple jurisdictional hierarchies, so when an FHLC place was merged with a TGN or wikipedia place, the jurisdictional hierarchy from the other source was listed as an alternate.  Over the past several years WeRelate users have also been adding alternate jurisdictional hierarchies. For example: http://www.werelate.org/wiki/Place:Southorpe%2C_Northamptonshire%2C_England

I like the idea of using wikidata as a bridge. One to many is definitely easier than many to many. Hopefully the wikidata people will be amenable to adding additional historical jurisdictions if you find some that are missing.
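
To make the "one to many" point concrete: if each gazetteer is aligned only to a hub, links between the gazetteers themselves can be derived rather than curated pair by pair. The sketch below shows that derivation under invented data. All identifiers are hypothetical placeholders (not real Pleiades, GeoNames, WeRelate or Wikidata ids), and the function name is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def links_via_hub(hub_alignments):
    """Derive gazetteer-to-gazetteer pairs from (local_uri, hub_id) alignments."""
    by_hub = defaultdict(set)
    for local_uri, hub_id in hub_alignments:
        by_hub[hub_id].add(local_uri)
    pairs = set()
    for uris in by_hub.values():
        # Every two records sharing a hub id are treated as indirectly linked.
        pairs.update(combinations(sorted(uris), 2))
    return pairs


# Hypothetical placeholder ids, for illustration only.
alignments = [
    ("pleiades:1", "hub:A"),
    ("geonames:10", "hub:A"),
    ("werelate:X", "hub:A"),
    ("geonames:20", "hub:B"),
]
for pair in sorted(links_via_hub(alignments)):
    print(pair)
```

With n gazetteers this needs n alignments to the hub instead of up to n(n-1)/2 pairwise alignments, at the cost of depending on the hub's coverage and correctness.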

Also, not sure if you're aware of http://gov.genealogy.net/search/index. They appear to have fantastic coverage of German historical places. WeRelate's German places are not as good.

Another potential source of historical places is: https://familysearch.org/stdfinder/PlaceStandardLookup.jsp This is a separate database from the FHLC database; as far as I know, the two are not connected. This database is not currently available as open-content, though they might be talked into making it available. I once did a quick comparison between WeRelate's and FamilySearch's ability to match user-entered place texts against their respective historical place databases here: https://github.com/DallanQ/Places/wiki/Comparison-to-FamilySearch

I think it would be terrific if you were to create a wikidata-based database as a central historical place database and develop tools & procedures to help people match places to that database. I believe I could get WeRelate users to help match WeRelate places to it if that would help.


Tom Elliott

Apr 3, 2014, 12:11:14 PM
to lod...@googlegroups.com, Ryan Baumann
Hi all:

I’m embarrassed to say that this thread was my first introduction to WeRelate.org and its historical geographic content. What an eye-opener! 

Dallan’s last email caught my interest: did they keep the TGN and other native identifiers when they did the merge? A quick inspection reveals the answer to be yes, at least for TGN. In my opinion, werelate’s place data is not just impressive and valuable in its own right, but also is ripe for cross-walking to other datasets, as contemplated in the preceding message thread. The external identifiers will provide plenty of hooks, in addition to names and coordinates, on the basis of which to make tentative machine matches. And once matches have been made, the werelate.org data could really help downstream linking between third-party gazetteers. Recursive fun! :)
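
As a sketch of how shared external identifiers could serve as hooks, the snippet below joins two record sets on a common identifier field. The field name `tgn_id`, the record layout and every id shown are invented for illustration; they do not reflect the actual WeRelate CSV columns or any real TGN identifiers.

```python
def match_on_shared_id(records_a, records_b, key="tgn_id"):
    """Tentatively pair records from two sources that share an external id."""
    index = {}
    for rec in records_b:
        if rec.get(key):
            index.setdefault(rec[key], []).append(rec)
    return [
        (a["id"], b["id"])
        for a in records_a
        if a.get(key)
        for b in index.get(a[key], [])
    ]


# Hypothetical records; ids and field names are placeholders.
source_a = [{"id": "a1", "tgn_id": "T100"}, {"id": "a2", "tgn_id": None}]
source_b = [{"id": "b1", "tgn_id": "T100"}, {"id": "b2", "tgn_id": "T200"}]
print(match_on_shared_id(source_a, source_b))
```

Matches found this way would still be tentative and worth checking against names and coordinates.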

Anyway, this has got me thinking about moving toward trying to get a gazetteer alignment between werelate.org and Pleiades, and also seeing what could be done to build out Ryan’s Pleiades+Geonames tool to something more general and source-agnostic.

So, I think I can speak for many both on and off this list when I say I’d love to see a CSV or JSON dump of the werelate.org places, as Dallan so generously offered. 

Best,
Tom 

Tom Elliott, Ph.D.
Associate Director for Digital Programs and Senior Research Scholar
Institute for the Study of the Ancient World (NYU)
http://isaw.nyu.edu/people/staff/tom-elliott

Co-Managing Editor, Pleiades


Dallan Quass

Apr 3, 2014, 1:13:45 PM
to lod...@googlegroups.com
Thanks!  The places contain links to the Getty TGN, Wikipedia, and the FHLC.

What format would you like the data in?  Do you have a particular format in mind?  It's currently available in a CSV format described here: https://github.com/DallanQ/Places/wiki/Database-download but I could put it in a different format if that would be better.

Tom Elliott

Apr 3, 2014, 2:38:48 PM
to lod...@googlegroups.com
Dallan:

Thanks. That CSV looks like more than enough to go on for now. I'll get back in touch if I run into any snags and when I have something to report.

Best,
Tom

Tom Elliott, Ph.D.
Associate Director for Digital Programs and Senior Research Scholar
Institute for the Study of the Ancient World (NYU)
http://isaw.nyu.edu/people/staff/tom-elliott



Leif Isaksen

Aug 24, 2014, 8:10:22 AM
to lod...@googlegroups.com
Hi all

the recent release of the TGN as Linked Data has got me thinking about
this again. There are now quite a few URI-based historical gazetteers
floating around. So two questions:

- Do folks feel there is still any essential infrastructure missing?
(I'd be particularly interested to hear whether folks who have been
working with the wikidata data think it's still the right way
forwards)

- Is this something that might be best served by a) a one-off event
where we bring gazetteers and crunch through them and get most of the
work done, or b) piecemeal, as and when people find time?

Cheers

L.