Hi, folks. Thank you for access to the group. I'm a developer and system architect working for a startup in NYC. We've recently been experimenting with CLAVIN for the purpose of extracting city and state from text descriptions of events that occur at specific places. So first of all, thanks for your work on this tool.
The version of CLAVIN that we are using is the clavin-rest repo, which we were able to get up and running inside a Docker container very easily using the supplied installation steps. So, thanks for that, too :).
As far as I can tell from peering into the source (I am not much of a Java dev, but I do have a strong C/C++ background and some Java), this version of CLAVIN is using the Stanford extractor. We have not changed anything, so we are using it with the default configuration and simply passing queries to the REST API from a Python wrapper.
Our application is probably somewhat less general than the problems the tool was built for. We know that what we are processing will generally contain either an address or a city/state pair at some point in the text, and we can usually pin it down well enough that there isn't a lot of extraneous text included (other than perhaps the name of a venue). So we began our testing by simply feeding the API some common city/state pairs that we encounter. One thing we noticed right off is that unless we title-cased the terms, we got nothing back. I posted about that on the GitHub page before discovering this group, and have already been given some pointers about caseless models for the Stanford extractor. For our purposes it was simple enough to title-case the query text for the tests.
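For reference, the title-casing step looks roughly like the following. This is a minimal sketch rather than our exact wrapper: `to_title_case` is a hypothetical helper name (not part of CLAVIN), and the HTTP call to the REST API is left out entirely.

```python
def to_title_case(query: str) -> str:
    """Collapse runs of whitespace and title-case each word of the query.

    Hypothetical helper for illustration; it only prepares the text and
    does not talk to the CLAVIN REST API.
    """
    # split() with no arguments drops leading/trailing/repeated whitespace
    collapsed = " ".join(query.split())
    # str.title() upper-cases the first letter of each run of letters
    # and lower-cases the rest
    return collapsed.title()


if __name__ == "__main__":
    for raw in ["nashville tn", "OMAHA NB", "washington, d.c."]:
        print(to_title_case(raw))
```

One side effect worth keeping in mind is that state abbreviations come out as "Tn" or "Dc" rather than "TN" or "DC", which is the form used in the tests described next.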
Having done that, we saw some interesting results. Many cases worked impressively well. "Nashville Tn" gets the right city, for example, and so do "Nashville" and "Omaha Nb". However, some cases yielded anomalous results. The term "Washington Dc" gets the state, as does "Washington, Dc". The term "Washington, D.C." gets a city labelled "District of Columbia", while "Washington D.C." (no comma) gets a city labelled "Washington, D.C." Trying "Columbia Sc" as well as various other combinations consistently yielded the "Republic of Colombia" or "Pengkalan Baharu S.C."
Those are our results so far, and I would be very interested in any thoughts you all have on the applicability of the tool to our problem domain, and ways in which we might tune its use to get better results from cases like these.
Thanks again, and regards,
--Mark