TextGrounder & "Popularity"

bobtodd

Jul 22, 2011, 8:58:06 AM

to TextGrounder Open Discussion, fl...@infochimps.org

Howdy all,

Jason suggested that I post some questions I had regarding
TextGrounder to the discussion forum. The basic questions I had
(which came up when evaluating the possibility of using TextGrounder
at Infochimps) were the following:

1) Is the format of the gazetteer hard-coded into TextGrounder, or is
there a simple way to add fields?
2) Does TextGrounder have a way of understanding the "popularity" of
geolocations? E.g. in a text about Austin, TX, is there a way to let
TextGrounder know that "Paris" (without any other qualifications) most
likely refers to the city in France, not the one in Texas?
3) At what point in the text analysis does TextGrounder invoke actual
geographic proximity to decide between locations?

Regards,
Todd

bobtodd

Jul 22, 2011, 9:01:31 AM

to TextGrounder Open Discussion, fl...@infochimps.org

Howdy once again,

I think it's worth posting Mike's helpful reply as well:

Hi Todd,

I can give you some brief answers now that will hopefully help you get
a better idea of where TextGrounder currently stands. As Jason
mentioned, we're looking to improve it starting in late August.

1) Is the format of the gazetteer hard-coded into TextGrounder, or is
there a simple way to add fields?

Right now we only fully support the GeoNames gazetteer, though there
is limited support for World Gazetteer as well (basically, you won't
get multipoint representations of countries unless you use GeoNames).
Theoretically, you could write your own gazetteer reader that mimics
the structure/API of topo.gaz.GeoNamesGazetteer. The load method is
where the meat of the work is done that splits each line into fields
and populates the Gazetteer object accordingly.
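
To make the "split each line into fields" step concrete, here is a minimal Python sketch of a gazetteer loader. It is not TextGrounder's code or API: the function names, the `extra_fields` parameter, and the dictionary layout are hypothetical. The only thing taken from outside is the column order of the tab-separated GeoNames dump; the sample rows use made-up ids and rounded values.

```python
from collections import defaultdict

# Column order of the tab-separated GeoNames main dump (19 fields per line).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature_class", "feature_code", "country_code", "cc2",
    "admin1_code", "admin2_code", "admin3_code", "admin4_code",
    "population", "elevation", "dem", "timezone", "modification_date",
]


def load_gazetteer(lines, extra_fields=("population",)):
    """Hypothetical loader: maps a lower-cased name to its candidate locations.

    extra_fields lists additional columns to copy into each entry, which is
    one way "adding a field" could look in a custom reader.
    """
    gazetteer = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != len(GEONAMES_COLUMNS):
            continue  # skip lines that do not have the expected field count
        row = dict(zip(GEONAMES_COLUMNS, parts))
        entry = {
            "id": int(row["geonameid"]),
            "lat": float(row["latitude"]),
            "lon": float(row["longitude"]),
        }
        for field in extra_fields:
            entry[field] = row[field]  # kept as the raw string from the file
        gazetteer[row["name"].lower()].append(entry)
    return gazetteer


def make_row(**values):
    """Build one illustrative dump line; unspecified columns stay empty."""
    return "\t".join(str(values.get(col, "")) for col in GEONAMES_COLUMNS)


# Illustrative rows only: ids, coordinates, and populations are made up/rounded.
sample = [
    make_row(geonameid=1, name="Paris", latitude=48.86, longitude=2.35,
             population=2000000),
    make_row(geonameid=2, name="Paris", latitude=33.66, longitude=-95.56,
             population=25000),
    "malformed line without tabs",
]
gaz = load_gazetteer(sample)
```

Here `gaz["paris"]` holds both candidates, each carrying the extra `population` field, and the malformed line is skipped.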

2) Does TextGrounder have a way of understanding the "popularity" of
geolocations? E.g. in a text about Austin, TX, is there a way to let
TextGrounder know that "Paris" (without any other qualifications) most
likely refers to the city in France, not the one in Texas?

There isn't exactly a way of encoding popularity per se, but
WeightedMinDistResolver has some notion of this. It's an iterative
algorithm that first assumes each location is equally prominent (or
popular, if you like), then resolves toponyms by trying to put them
close together on earth within a document, then recomputes these
prominences across the whole corpus based on which places were
frequently chosen, then re-resolves with those new weights, and so on.
This was designed based on the intuition that there might be a lot of
documents in a corpus about France that mention Paris and a lot of
other places in France (may or may not be true in reality), and a lot
of documents that mention Texas and Austin and Houston and so on, so
that both Texas the US state and Paris the French city would get
strongly weighted, ultimately leading them to be selected even in
documents that contain both Texas and Paris (unless there really was a
ton of evidence that it's Paris, TX that's being talked about). This
is a pretty crude way to do things, but one nice thing (or not nice
thing, depending on your goals and what data you're starting with) is
that it doesn't require any population information or a priori
knowledge about which places are more likely to be chosen. This goes
along with what Jason mentioned about having our algorithms be general
and applicable even to texts from 200 or 300 years ago, where
population/popularity information from today may be pretty invalid.
That said, it might be nice to have a way to say, "hey, I DO have a
lot of background knowledge about the corpus I want to resolve, so let
me set some defaults or general rules or use population from a
particular gazetteer as a knowledge source" and so on. I'll let Jason
comment further on this issue as this has mostly been a high level,
philosophical decision (though it is also due in part to the fact that
population information in many freely available gazetteers is often
spotty; Wikipedia might be a good way to go though, and is something
we'd like to work with heavily in the near future).
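
The iterate-and-reweight loop described above can be sketched in Python roughly as follows. This is a toy under stated assumptions, not the WeightedMinDistResolver source: the function names, the add-one smoothing, the per-toponym normalization of weights, and the choice to divide a candidate's total distance by its own weight are all assumptions. Coordinates are approximate and the corpus is invented to mirror the France/Texas intuition.

```python
import math
from collections import Counter


def angle(a, b):
    """Great-circle central angle in radians between (lat, lon) pairs in degrees."""
    (lat1, lon1), (lat2, lon2) = [tuple(map(math.radians, p)) for p in (a, b)]
    c = (math.sin(lat1) * math.sin(lat2)
         + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error


def resolve_doc(doc, gaz, weights):
    """Per toponym, pick the candidate with the lowest weighted distance
    to the nearest candidate of every other toponym in the document."""
    out = {}
    for t in doc:
        def cost(c):
            total = sum(min(angle(gaz[t][c], p) for p in gaz[o].values())
                        for o in doc if o != t)
            return total / weights[c]  # assumption: divide by c's own weight
        out[t] = min(gaz[t], key=cost)
    return out


def weighted_min_dist(corpus, gaz, iterations=5, smoothing=1.0):
    # Start with every candidate equally prominent.
    weights = {c: 1.0 for cands in gaz.values() for c in cands}
    for _ in range(iterations):
        # Resolve the whole corpus, then count how often each candidate won.
        counts = Counter(choice for doc in corpus
                         for choice in resolve_doc(doc, gaz, weights).values())
        # Recompute prominence per toponym (smoothed so no weight is zero).
        for cands in gaz.values():
            total = sum(counts[c] + smoothing for c in cands)
            for c in cands:
                weights[c] = len(cands) * (counts[c] + smoothing) / total
    return [resolve_doc(doc, gaz, weights) for doc in corpus], weights


# Toy gazetteer with approximate coordinates (degrees).
gaz = {
    "Paris": {"Paris, FR": (48.86, 2.35), "Paris, TX": (33.66, -95.56)},
    "Lyon": {"Lyon, FR": (45.76, 4.84)},
    "Marseille": {"Marseille, FR": (43.30, 5.37)},
    "Austin": {"Austin, TX": (30.27, -97.74)},
}
# Invented corpus: many documents about France, one mixing Paris and Austin.
corpus = [["Paris", "Lyon", "Marseille"]] * 100 + [["Paris", "Austin"]]
resolutions, weights = weighted_min_dist(corpus, gaz)
```

With uniform weights (`iterations=0`) the mixed document picks Paris, TX because it is nearest to Austin. After reweighting, the corpus-wide evidence for Paris, FR outweighs the distance, so the last document in `resolutions` resolves to Paris, FR. With far fewer French documents the flip does not happen, which matches the "unless there really was a ton of evidence" caveat.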

3) At what point in the text analysis does TextGrounder invoke actual
geographic proximity to decide between locations?

Both the BasicMinDistResolver (BMD) and WeightedMinDistResolver (WMD)
use the distance method of the Coordinate class fundamentally, which
returns the great circle distance in radians to another Coordinate.
With BMD, each toponym is resolved to the location that minimizes the
total distance to some possible resolution of all other toponyms in
the same document. WMD works in much the same way except that
distances are divided by the prominence weights discussed above, so
that locations that are highly prominent seem "closer" than those that
aren't.
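
As a rough illustration of the distance computation and the minimum-distance choice, here is a small Python sketch. It is not the Coordinate class or either resolver: the haversine formula stands in for whatever the `distance` method does internally, and the function names, the data layout, and dividing a candidate's total by its own weight are assumptions. Coordinates are approximate.

```python
import math


def great_circle_radians(a, b):
    """Central angle in radians between two (lat, lon) points in degrees (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def resolve_min_dist(doc, weights=None):
    """doc maps toponym -> {candidate label: (lat, lon)}.

    With weights=None every candidate counts equally (the unweighted case);
    otherwise a candidate's total distance is divided by its weight, so
    prominent candidates look "closer".
    """
    weights = weights or {}
    resolved = {}
    for toponym, candidates in doc.items():
        def cost(label):
            total = 0.0
            for other, other_cands in doc.items():
                if other == toponym:
                    continue
                # Distance to the nearest possible resolution of the other toponym.
                total += min(great_circle_radians(candidates[label], coord)
                             for coord in other_cands.values())
            return total / weights.get(label, 1.0)
        resolved[toponym] = min(candidates, key=cost)
    return resolved


doc = {
    "Paris": {"Paris, FR": (48.86, 2.35), "Paris, TX": (33.66, -95.56)},
    "Austin": {"Austin, TX": (30.27, -97.74)},
    "Houston": {"Houston, TX": (29.76, -95.37)},
}
unweighted = resolve_min_dist(doc)
weighted = resolve_min_dist(doc, weights={"Paris, FR": 100.0})
```

Unweighted, "Paris" resolves to Paris, TX because it sits next to Austin and Houston. Giving Paris, FR a large (here arbitrary) prominence weight shrinks its effective distance enough to win.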

Clearly, one of the big problems with both BMD and WMD is that in
general you'll tend to get a clump of small towns that all happen to
be named after highly populous places, none or few of which are
actually the ones being mentioned. We're looking to develop more
sophisticated resolvers, likely using label propagation as a tool for
connecting related entities (words, locations, toponyms, documents,
etc.), that take advantage of a larger, richer dataset like Wikipedia.
Hopefully doing so can help solve these basic problems without the
need for a lot of heuristic rules and other quick fixes.

I hope that helps some. Thanks for your interest! It's great to see
somebody trying out TextGrounder besides us. :)

Mike

bobtodd

Jul 22, 2011, 9:25:42 AM

to TextGrounder Open Discussion, fl...@infochimps.org

Howdy Mike,

Thanks very much for the great reply, especially when I know you have
other matters to attend to. That gives me a much better understanding
of how TextGrounder works.

There are two possible routes that come to mind in the effort to see
if TextGrounder can be helped to favor Paris, France over Paris,
Texas. One of course is population, as has been mentioned.
Practically speaking, this is probably the quickest measure of
popularity and the most broadly useful piece of information if we go
beyond purely linguistic data.

But another idea that would be in keeping with the notion that
TextGrounder follow very general linguistic principles would be to
bring the lexicon into play. (Of course here we get into the common
problem in linguistics that everything depends on everything.) But,
simply put, if you check the OED, Paris appears as a reference to the
city in France, and there's no mention of Texas. So it might be worth
weighting highly those place names that occur in a dictionary; and the
nice part about that is that there are several machine-readable
English dictionaries available. The added benefit is that such a
system would get you place names that are *historically* important,
and this is arguably more useful than current population. E.g. I have
no idea how Rome ranks population-wise among world cities, but when I
use the word it almost never refers to the town in New York, because I
refer to Rome as a mark of its historical importance.

Best,
t

Jason Baldridge

Jul 25, 2011, 7:06:44 PM

to textgrou...@googlegroups.com, fl...@infochimps.org

That's a good point about using the dictionaries. Wikipedia also has this property, and you can judge the popularity of a location by the number of links from other pages to its page -- this should have better coverage than any of the dictionaries, I think.
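
One simple way to turn link counts into a popularity prior is sketched below in Python. The function name, the add-one smoothing, and the linear scaling are assumptions, and the counts are made-up placeholders rather than real Wikipedia figures.

```python
def link_prior(inlink_counts, smoothing=1.0):
    """Normalize per-candidate inbound-link counts into weights that sum to 1."""
    total = sum(n + smoothing for n in inlink_counts.values())
    return {c: (n + smoothing) / total for c, n in inlink_counts.items()}


# Made-up counts for illustration only.
prior = link_prior({"Paris, FR": 9000, "Paris, TX": 100})
```

The smoothing keeps a candidate with no inbound links from getting a weight of exactly zero, which matters if the weights are later used as divisors.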

Thanks for posting Mike's response.

Jason

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
