Hello, and a question

Mark Betz

Jul 30, 2014, 1:40:28 PM

to clavin...@googlegroups.com

Hi, folks. Thank you for access to the group. I'm a developer and system architect working for a startup in NYC. We've recently been experimenting with CLAVIN for the purpose of extracting city and state from text descriptions of events that occur at specific places. So first of all, thanks for your work on this tool.

The version of CLAVIN that we are using is the clavin-rest repo, which we were able to get up and running inside a Docker container very easily using the supplied installation steps. So, thanks for that, too :).

As far as I can tell from peering into the source (I am not much of a Java dev, but I do have a strong C/C++ background and some Java), this version of CLAVIN is using the Stanford extractor. We have not changed anything, so we are using it with the default configuration and simply passing queries to the REST API from a Python wrapper.

Our application is probably somewhat less general than the problems the tool was built for. We know that what we are processing will generally contain either an address or a city/state pair at some point in the text, and we can usually pin it down well enough that there isn't a lot of extraneous text included (other than perhaps the name of a venue). So we began our testing by simply feeding the API some common city/state pairs that we encounter. One thing we noticed right off is that unless we title-cased the terms we got nothing back. I posted about that on the GitHub page before discovering this group, and have already been given some pointers about caseless models for the Stanford extractor. For our purposes it was simple enough to title-case the query text for the tests.
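For anyone trying the same thing, here is a minimal sketch of that title-casing step in Python. The helper name `title_case_query` is hypothetical, and the actual call to the REST API is left out because the endpoint is not described in this thread.

```python
import string


def title_case_query(text: str) -> str:
    """Capitalize the first letter of each whitespace-separated word.

    string.capwords also lowercases the rest of each word and collapses
    runs of whitespace into single spaces.
    """
    return string.capwords(text)


if __name__ == "__main__":
    for raw in ["nashville tn", "OMAHA NB", "washington,  dc"]:
        print(repr(raw), "->", repr(title_case_query(raw)))
```

One caveat: `string.capwords` turns "d.c." into "D.c.", so abbreviations with internal periods would need separate handling.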

Having done that we saw some interesting results. Many cases worked impressively well. "Nashville Tn" gets the right city, for example, and so does "Nashville" and "Omaha Nb". However some cases yielded anomalous results. The term "Washington Dc" gets the state, as does "Washington, Dc". The term "Washington, D.C." gets a city labelled "District of Columbia", while "Washington D.C." (no comma) gets a city labelled "Washington, D.C." Trying "Columbia Sc" as well as various other combinations consistently yielded the "Republic of Colombia" or "Pengkalan Baharu S.C."

Those are our results so far, and I would be very interested in any thoughts you all have on the applicability of the tool to our problem domain, and ways in which we might tune its use to get better results from cases like these.

Thanks again, and regards,

--Mark

Charlie Greenbacker

Aug 2, 2014, 6:04:24 PM

to Mark Betz, clavin...@googlegroups.com

Hi Mark,

> Our application is probably somewhat less general than the problems the tool was built for. We know that what we are processing will generally contain either an address or a city/state pair at some point in the text, and we can usually pin it down well enough that there isn't a lot of extraneous text included (other than perhaps the name of a venue). So we began our testing by simply feeding the API some common city/state pairs that we encounter.

This is the crux of the matter. CLAVIN is designed to operate on text in complete sentences as input. By feeding it simple queries or fragments of text, you're bound to see unexpected results.

That being said, CLAVIN is also at the mercy of whatever the entity extractor (whether Stanford NER, Apache OpenNLP NameFinder, etc.) is able to pull out of the input text. If the extractor misses something, CLAVIN won't be able to resolve it.

You can test out Stanford NER directly using this online demo: http://nlp.stanford.edu:8080/ner/process (Be sure to select the english.all.3class.distsim.crf.ser.gz classifier to see exactly what CLAVIN sees.)

> However some cases yielded anomalous results. The term "Washington Dc" gets the state, as does "Washington, Dc". The term "Washington, D.C." gets a city labelled "District of Columbia", while "Washington D.C." (no comma) gets a city labelled "Washington, D.C."

I tossed those examples into the Stanford NER demo, and here's what I got (inlineXML output format):

"Washington Dc" --> <LOCATION>Washington</LOCATION> Dc
"Washington, Dc" --> <LOCATION>Washington</LOCATION>, Dc
"Washington, D.C." --> <LOCATION>Washington</LOCATION>, <LOCATION>D.C.</LOCATION>
"Washington D.C." --> <LOCATION>Washington D.C.</LOCATION>

As you can see, Stanford NER doesn't handle commas or city/state pairs all that well. The only example that returned the "correct" output is #4. I hope this helps make it a bit more clear why CLAVIN did what it did. Blame the entity extractor! :)
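To compare outputs like these programmatically, a small sketch (not part of CLAVIN or Stanford NER) can pull the tagged spans out of inlineXML strings. The function name `extract_locations` is hypothetical, and the sample strings are simply the four outputs listed above.

```python
import re
from typing import List

# Non-greedy match so adjacent <LOCATION> tags are captured separately.
LOCATION_TAG = re.compile(r"<LOCATION>(.*?)</LOCATION>")


def extract_locations(inline_xml: str) -> List[str]:
    """Return the text of every <LOCATION> span, in order of appearance."""
    return LOCATION_TAG.findall(inline_xml)


if __name__ == "__main__":
    samples = {
        "Washington Dc": "<LOCATION>Washington</LOCATION> Dc",
        "Washington, Dc": "<LOCATION>Washington</LOCATION>, Dc",
        "Washington, D.C.": "<LOCATION>Washington</LOCATION>, <LOCATION>D.C.</LOCATION>",
        "Washington D.C.": "<LOCATION>Washington D.C.</LOCATION>",
    }
    for query, tagged in samples.items():
        print(repr(query), "->", extract_locations(tagged))
```

Only the last sample yields a single span covering the whole city name; the others hand the resolver either "Washington" alone or two separate locations.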

> Those are our results so far, and I would be very interested in any thoughts you all have on the applicability of the tool to our problem domain, and ways in which we might tune its use to get better results from cases like these.

Though not designed to address exactly the same use case as yours, I've been working on a new method for CLAVIN that circumvents the entity extractor and works with multi-part location names as input. The specific use case I've got in mind is tabular input with separate text columns for city, state/province, and country. It's working fairly well, but I haven't had the time to clean up the code and submit a pull request.

Can you tell us more about your startup and the particular problem you're tackling?

Cheers,

Charlie

--
You received this message because you are subscribed to the Google Groups "clavin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clavin-users...@googlegroups.com.
To post to this group, send email to clavin...@googlegroups.com.
Visit this group at http://groups.google.com/group/clavin-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/clavin-users/07abe90c-7ebe-4334-9eff-c3b611925d9f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark Betz

Aug 6, 2014, 10:10:40 AM

to clavin...@googlegroups.com, betz...@gmail.com

Hi, Charlie. Thanks for your reply, and sorry to be a few days posting back. The installation we tested was the clavin-rest package, and as I mentioned previously it appears to be using the Stanford extractor by default. Perhaps I was wrong about that. In any event, I can't say very much about the problem we're tackling, other than to note that descriptions of things that have happened or will happen are posted online with some verbiage indicating where the thing happened or will be happening. This text ranges from complete addresses to fragments such as "Hotel Excelsior, New York, NY." For our purposes we just want to resolve these snippets of text to a city/state. So far we appear to be on the right track with a package named twofishes, which was recommended by a reader who saw my post here. It seems to handle variations on city/state very well, so with that and the GeoNames API we are able to put together a pretty complete package of info on the city/state referenced in a piece of text.

One thing I have found odd is that POI places are consistently linked up to county level rather than city/town level in the GeoNames data. So we are currently engaged in trying to correctly resolve, say, "Bryant Park" to something that people will recognize as "New York City" rather than "New York County" (or "Murray Hill", which is what we get if we take the lat/long for Bryant Park and reverse geocode that on GeoNames to the closest place name).

My first brush with geocoding, so it's been interesting :).

Charlie Greenbacker

Aug 8, 2014, 4:31:53 PM

to Mark Betz, clavin...@googlegroups.com

Hi Mark,

> This text ranges from complete addresses to fragments such as "Hotel Excelsior, New York, NY."

CLAVIN really isn't designed to handle either of those. It's a geoparser, which means it wants complete sentences as input. It sounds like you might be better off using a geocoder instead of CLAVIN.

> So far we appear to be on the right track with a package named twofishes, which was recommended by a reader who saw my post here.

I've heard very good things about twofishes as well, and it's definitely worth exploring as a potential fit for your use case.

> One thing I have found odd is that POI places are consistently linked up to county level rather than city/town level in the GeoNames data.

Yes, I've been meaning to tweak the algorithm a bit to account for that. CLAVIN's core resolution algorithm is biased in favor of geopolitical entities with larger populations, meaning "New York" will generally be resolved to New York State rather than New York City, etc. I've got some ideas about how to tackle this, but nobody is currently paying me to improve CLAVIN and my children generally eat up all of my spare time. :)

Cheers,
Charlie

Mark Betz

Aug 8, 2014, 4:49:02 PM

to clavin...@googlegroups.com, betz...@gmail.com

I hear that. I have three myself :).

Twofishes will resolve "NY" to New York state, but it will resolve "New York" to the city, so I guess it is making some more aggressive assumptions.

We're still tweaking our approach to getting to a populated place from a POI such as an airport or park. We've tried querying GeoNames for the closest populated place, but that often leads to results that are probably correct but not relevant, e.g. "North Beach, NY" for "Laguardia Airport" or "Vermont, IN" for "Kokomo Municipal Airport." We have also tried the same query limited to PPLs with >5000 pop, which leads to "Long Island City, NY" and "Kokomo, IN" for those same queries. Definitely better for the latter query, but not unambiguously so for the former, and a rule like this is very likely to have side effects in other places that we haven't tested. We've tried GeoNames neighborhood queries, which work well for some cities, not at all for others. The Zillow data is apparently hit or miss. We have one final idea that can be done within the current data, which is to localize the POI to the county seat of the county in which it is located. Since we focus on large metro areas this might be good enough.
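As a rough illustration of the "closest populated place, optionally above a population threshold" rule, here is a sketch that works on an in-memory list rather than a live GeoNames query. The `Place` type, the function names, and the coordinates and populations in the demo are all made up for illustration; they are not real GeoNames records.

```python
import math
from typing import List, NamedTuple, Optional


class Place(NamedTuple):
    name: str
    lat: float
    lon: float
    population: int


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres (mean Earth radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_place(lat: float, lon: float, candidates: List[Place],
                  population_over: Optional[int] = None) -> Optional[Place]:
    """Closest candidate, optionally keeping only places above a population."""
    eligible = [p for p in candidates
                if population_over is None or p.population > population_over]
    if not eligible:
        return None
    return min(eligible, key=lambda p: haversine_km(lat, lon, p.lat, p.lon))


if __name__ == "__main__":
    # Synthetic candidates: a tiny hamlet right next to the POI, a town further off.
    places = [Place("Hamlet A", 40.00, -75.00, 300),
              Place("Town B", 40.05, -75.00, 12000)]
    poi = (40.01, -75.00)
    print(nearest_place(*poi, places).name)
    print(nearest_place(*poi, places, population_over=5000).name)
```

The threshold is the only knob here, which is exactly why it helps in one case and is ambiguous in another: it has no notion of which place people actually associate with the POI.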

In general you would want to attach "Laguardia Airport" reliably to the city people expect it to be associated with. "New York, NY" would be best, "Queens, NY" might suffice. This seems to be a fairly hard problem. Where my thinking is headed with this right now is: have a list of metro area shapefiles, and use the lat/long to figure out which area a place is in, then use whatever name we want for it. If the place is not in a metro area probably we would fall back on the closest PPL approach. But of shapefiles I am wholly ignorant so I have some learning to do.
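The metro-area lookup could be sketched roughly as follows, assuming each metro area has already been reduced to a single simple polygon of (lon, lat) vertices. Reading real shapefiles, and handling multi-part polygons or holes, is not covered here. The function names and the rectangular "Metro A" are hypothetical, and the test is a plain ray-casting point-in-polygon check.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def metro_for(point: Point, metros: Dict[str, List[Point]]) -> Optional[str]:
    """Name of the first metro polygon containing the point, or None."""
    for name, polygon in metros.items():
        if point_in_polygon(point, polygon):
            return name
    return None


if __name__ == "__main__":
    # Synthetic rectangle standing in for a real metro-area boundary.
    metros = {"Metro A": [(-75.0, 40.0), (-74.0, 40.0), (-74.0, 41.0), (-75.0, 41.0)]}
    print(metro_for((-74.5, 40.5), metros))  # inside the rectangle
    print(metro_for((-73.0, 40.5), metros))  # outside; caller falls back to closest PPL
```

A `None` result is where the closest-PPL fallback would kick in.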