Howdy once again,
I think it's worth posting Mike's helpful reply as well:
Hi Todd,
I can give you some brief answers now that will hopefully help you get
a better idea of where TextGrounder currently stands. As Jason
mentioned, we're looking to improve it starting in late August.
1) Is the format of the gazetteer hard-coded into TextGrounder, or is
there a simple way to add fields?
Right now we only fully support the GeoNames gazetteer, though there
is limited support for World Gazetteer as well (basically, you won't
get multipoint representations of countries unless you use GeoNames).
Theoretically, you could write your own gazetteer reader that mimics
the structure/API of topo.gaz.GeoNamesGazetteer. The load method does
the bulk of the work: it splits each line into fields and populates
the Gazetteer object accordingly.
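As a rough illustration of what such a reader has to do (this is not
TextGrounder's actual code or API, just a minimal Python sketch), the
snippet below splits tab-separated lines laid out like the GeoNames
dump, where column 0 is the id, 1 the name, 4 and 5 the latitude and
longitude, and 14 the population. The Location type, load_gazetteer,
and make_row are hypothetical names, and the sample rows use made-up
ids and rounded values.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    # Hypothetical record type; TextGrounder's own classes will differ.
    geoname_id: int
    name: str
    lat: float
    lon: float
    population: int


def load_gazetteer(lines):
    """Map each lowercased name to the list of places sharing that name."""
    gazetteer = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 15:
            continue  # skip lines that do not have the expected columns
        loc = Location(
            geoname_id=int(fields[0]),
            name=fields[1],
            lat=float(fields[4]),
            lon=float(fields[5]),
            population=int(fields[14] or 0),  # population may be empty
        )
        gazetteer[loc.name.lower()].append(loc)
    return gazetteer


def make_row(geoname_id, name, lat, lon, population):
    """Build one 19-column, tab-separated row for the demo below."""
    fields = [""] * 19
    fields[0], fields[1] = str(geoname_id), name
    fields[4], fields[5] = str(lat), str(lon)
    fields[14] = str(population)
    return "\t".join(fields)


# Two illustrative rows that share the name "Paris".
sample = [
    make_row(1, "Paris", 48.85, 2.35, 2000000),
    make_row(2, "Paris", 33.66, -95.56, 25000),
]
gazetteer = load_gazetteer(sample)
```

Adding a field would then mean reading one more column in the loader
and carrying it on the record; how that maps onto the real Gazetteer
object is something to check against the TextGrounder source.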
2) Does TextGrounder have a way of understanding the "popularity" of
geolocations? E.g. in a text about Austin, TX, is there a way to let
TextGrounder know that "Paris" (without any other qualifications) most
likely refers to the city in France, not the one in Texas?
There isn't exactly a way of encoding popularity per se, but
WeightedMinDistResolver has some notion of this. It's an iterative
algorithm that first assumes each location is equally prominent (or
popular, if you like), then resolves toponyms by trying to put them
close together on earth within a document, then recomputes these
prominences across the whole corpus based on which places were
frequently chosen, then re-resolves with those new weights, and so on.
This was designed based on the intuition that there might be a lot of
documents in a corpus about France that mention Paris and a lot of
other places in France (may or may not be true in reality), and a lot
of documents that mention Texas and Austin and Houston and so on, so
that both Texas the US state and Paris the French city would get
strongly weighted, ultimately leading them to be selected even in
documents that contain both Texas and Paris (unless there really was a
ton of evidence that it's Paris, TX that's being talked about). This
is a pretty crude way to do things, but one nice thing (or not nice
thing, depending on your goals and what data you're starting with) is
that it doesn't require any population information or a priori
knowledge about which places are more likely to be chosen. This goes
along with what Jason mentioned about having our algorithms be general
and applicable even to texts from 200 or 300 years ago, where
population/popularity information from today may be pretty invalid.
That said, it might be nice to have a way to say, "hey, I DO have a
lot of background knowledge about the corpus I want to resolve, so let
me set some defaults or general rules or use population from a
particular gazetteer as a knowledge source" and so on. I'll let Jason
comment further on this issue as this has mostly been a high level,
philosophical decision (though it is also due in part to the fact that
population information in many freely available gazetteers is often
spotty; Wikipedia might be a good way to go though, and is something
we'd like to work with heavily in the near future).
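To make the iterate-and-reweight idea above a bit more concrete, here
is a toy Python sketch. It is not the WeightedMinDistResolver code.
Several things in it are assumptions made only for illustration: the
places sit on a flat plane with made-up coordinates, a distance is
divided by the weights of both endpoints, and the weights are
re-estimated with add-one smoothing so that they never reach zero.
All names (resolve_document, weighted_min_dist, "midway") are
hypothetical.

```python
import math
from collections import Counter


def resolve_document(toponyms, candidates, coords, weights):
    """Pick, per toponym, the candidate with the lowest weighted cost."""
    resolution = {}
    for toponym in toponyms:
        def cost(loc):
            # Sum, over the other toponyms, the weighted distance to
            # whichever of their candidates is nearest.
            return sum(
                min(
                    math.dist(coords[loc], coords[k])
                    / (weights[loc] * weights[k])
                    for k in candidates[other]
                )
                for other in toponyms
                if other != toponym
            )
        resolution[toponym] = min(candidates[toponym], key=cost)
    return resolution


def weighted_min_dist(docs, candidates, coords, rounds=3):
    """Alternate between resolving every document and reweighting."""
    # Start with every location equally prominent.
    weights = {loc: 1.0 for locs in candidates.values() for loc in locs}
    resolutions = []
    for _ in range(rounds):
        resolutions = [
            resolve_document(doc, candidates, coords, weights)
            for doc in docs
        ]
        counts = Counter(loc for res in resolutions for loc in res.values())
        for locs in candidates.values():
            total = sum(counts[loc] for loc in locs)
            for loc in locs:
                # Smoothed share of choices, scaled so uniform equals 1.0.
                weights[loc] = len(locs) * (counts[loc] + 1) / (total + len(locs))
    return resolutions, weights


# Toy data: flat-plane coordinates, invented for this example.
coords = {
    "paris_fr": (0, 0), "paris_tx": (10, 0),
    "lyon": (1, 0), "marseille": (1, -1), "midway": (6, 0),
}
candidates = {
    "paris": ["paris_fr", "paris_tx"],
    "lyon": ["lyon"], "marseille": ["marseille"], "midway": ["midway"],
}
docs = [
    ["paris", "lyon"],
    ["paris", "marseille"],
    ["paris", "lyon", "marseille"],
    ["paris", "midway"],  # slightly closer to paris_tx on the plane
]
first_pass, _ = weighted_min_dist(docs, candidates, coords, rounds=1)
final_pass, final_weights = weighted_min_dist(docs, candidates, coords, rounds=3)
```

In this toy corpus the last document picks paris_tx on the first pass
because it is nearer, then switches to paris_fr once the other three
documents have pushed that location's weight up.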
3) At what point in the text analysis does TextGrounder invoke actual
geographic proximity to decide between locations?
Both the BasicMinDistResolver (BMD) and WeightedMinDistResolver (WMD)
are fundamentally built on the distance method of the Coordinate
class, which returns the great circle distance in radians to another
Coordinate.
With BMD, each toponym is resolved to the location that minimizes the
total distance to some possible resolution of all other toponyms in
the same document. WMD works in much the same way except that
distances are divided by the prominence weights discussed above, so
that locations that are highly prominent seem "closer" than those that
aren't.
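For a concrete picture of the distance-minimizing step, here is a
small Python sketch. It is an illustration under assumptions, not the
resolver code itself: the great circle distance is computed with the
haversine formula (the Coordinate class may well do it differently),
great_circle_radians and basic_min_dist are hypothetical names, and
the coordinates are approximate.

```python
import math


def great_circle_radians(a, b):
    """Central angle in radians between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to guard against tiny floating point overshoot above 1.
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def basic_min_dist(toponyms, candidates):
    """Resolve each toponym to the candidate with the smallest total
    distance to the nearest candidate of every other toponym."""
    resolution = {}
    for toponym in toponyms:
        def total_distance(label):
            here = candidates[toponym][label]
            return sum(
                min(
                    great_circle_radians(here, there)
                    for there in candidates[other].values()
                )
                for other in toponyms
                if other != toponym
            )
        resolution[toponym] = min(candidates[toponym], key=total_distance)
    return resolution


# Approximate (lat, lon) pairs in degrees, for illustration only.
candidates = {
    "paris": {"Paris, France": (48.86, 2.35), "Paris, TX": (33.66, -95.56)},
    "austin": {"Austin, TX": (30.27, -97.74)},
    "houston": {"Houston, TX": (29.76, -95.37)},
}
resolution = basic_min_dist(["paris", "austin", "houston"], candidates)
quarter_turn = great_circle_radians((0.0, 0.0), (0.0, 90.0))
```

Multiplying a result in radians by the Earth's mean radius (about
6371 km) gives a distance in kilometers. A weighted variant would
divide each distance by a prominence weight before summing.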
Clearly, one of the big problems with both BMD and WMD is that in
general you'll tend to get a clump of small towns that all happen to
be named after highly populous places, few or none of which are
actually the ones being mentioned. We're looking to develop more
sophisticated resolvers, likely using label propagation as a tool for
connecting related entities (words, locations, toponyms, documents,
etc.), that take advantage of a larger, richer dataset like Wikipedia.
Hopefully doing so can help solve these basic problems without the
need for a lot of heuristic rules and other quick fixes.
I hope that helps some. Thanks for your interest! It's great to see
somebody trying out TextGrounder besides us. :)
Mike