Regarding custom entries in the data file.


Febin Thomas

Nov 10, 2013, 11:57:44 PM
to clavin...@googlegroups.com
I have been trying to add some new entries to the allCountries.txt file before creating the index.
But searches do not find the entries I added. Is there anything extra I need to take care of when adding new entries to the file?

Charlie Greenbacker

Nov 11, 2013, 9:16:32 AM
to Febin Thomas, clavin...@googlegroups.com
Hi Febin,

Rather than editing the allCountries.txt file directly, we recommend adding records to the src/main/resources/SupplementaryGazetteer.txt file instead. The reason is that you will want to download an updated allCountries.txt file from GeoNames.org periodically to get the latest geographic information, and you don't want to have to remember to edit the new allCountries.txt file each time you take an update.

Also, any records that you add will need to be in the same GeoNames.org format used in the existing allCountries.txt or SupplementaryGazetteer.txt files. More info on the GeoNames.org format can be found here: http://download.geonames.org/export/dump/
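As a rough illustration of the format requirement, here is a minimal Python sketch that checks one line against the 19 tab-separated columns of the GeoNames "geoname" table. The function name `check_record` and the example record (id, place name, coordinates) are hypothetical and exist only for this sketch; which of these columns CLAVIN actually reads is not covered here.

```python
# Column order of the GeoNames "geoname" table (19 tab-separated fields).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature_class", "feature_code", "country_code", "cc2",
    "admin1_code", "admin2_code", "admin3_code", "admin4_code",
    "population", "elevation", "dem", "timezone", "modification_date",
]


def check_record(line):
    """Return a list of problems found in one gazetteer line (empty if none)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GEONAMES_COLUMNS):
        return ["expected %d tab-separated fields, found %d"
                % (len(GEONAMES_COLUMNS), len(fields))]
    record = dict(zip(GEONAMES_COLUMNS, fields))
    problems = []
    if not record["geonameid"].isdigit():
        problems.append("geonameid is not a whole number")
    if not record["name"].strip():
        problems.append("name is empty")
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append("latitude/longitude out of range")
    except ValueError:
        problems.append("latitude/longitude are not decimal numbers")
    return problems


# Hypothetical record, for illustration only (not a real GeoNames entry).
example_line = "\t".join([
    "900000001", "Exampleville", "Exampleville", "", "10.5", "76.2",
    "P", "PPL", "IN", "", "", "", "", "", "0", "", "", "Asia/Kolkata",
    "2013-11-12",
])
print(check_record(example_line))
```

Running a check like this over every line of the supplementary file before rebuilding the index would catch records that were pasted with spaces instead of tabs or with missing columns.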

That being said, you will need to rebuild the CLAVIN index each time you modify the allCountries.txt or SupplementaryGazetteer.txt files. This is Step 6 in the README file. Please note that you should delete (or archive) the existing IndexDirectory before building a new one.
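The "archive the existing IndexDirectory" step can be sketched in Python as below. The helper name `archive_index` is hypothetical, and the sketch assumes the index lives in a directory called IndexDirectory relative to where the script runs; adjust the path to your checkout. It only moves the old index aside and does not run the index build itself.

```python
import os
import shutil
import time


def archive_index(index_dir="IndexDirectory"):
    """Move an existing index directory aside.

    Returns the new path, or None if there was no index to archive.
    """
    if not os.path.isdir(index_dir):
        return None
    base = "%s.%s" % (os.path.normpath(index_dir),
                      time.strftime("%Y%m%d-%H%M%S"))
    target, n = base, 1
    # Avoid clobbering an archive created within the same second.
    while os.path.exists(target):
        target = "%s-%d" % (base, n)
        n += 1
    shutil.move(index_dir, target)
    return target


if __name__ == "__main__":
    archived = archive_index()
    if archived:
        print("Old index moved to", archived)
    else:
        print("No existing index found")
```

After the old directory has been moved out of the way, the index can be rebuilt as described in the README.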

If that still doesn't work, send me the new records you're trying to add, along with the sentences you're trying to geotag, and I'll take a look at what's going on.

- Charlie





--
Charlie Greenbacker
Principal Data Scientist
Creator & Product Manager of CLAVIN
Berico Technologies
mobile: 860-965-8885

Charlie Greenbacker

Nov 12, 2013, 10:02:39 AM
to Febin Thomas, clavin...@googlegroups.com
Febin,

(I'm including the clavin-users mailing list in my reply so that everyone may benefit, and I've included your original attachments as well.)

There are a couple of issues I noticed in what you sent me:
  • Where did you get the geonameid values for the new records in the SupplementaryGazetteer? There already seem to be existing records with the same geonameids in the GeoNames.org gazetteer (e.g., 8629796 is a subdistrict in East Timor, 8629800 is an administrative area in Uganda). You should choose unique ids.
  • When you get "That's all folks!" as the only output from the WorkflowDemo program, it means that CLAVIN was unable to extract & resolve any locations in the input text. Otherwise, you would have received a list of ResolvedLocation objects prior to the "That's all folks!" statement. The input sentence you provided ("Kottekkad, Attore, Kuttur situated with in the perimeter of 5km.") is a bit unusual, and I suspect that the Apache OpenNLP entity extractor used by default was not trained on sentences like that one. I'd recommend trying a slightly different sentence (e.g., "I was born in Kottekkad and grew up in Attore."). I'd also suggest using the CLAVIN-NERD package instead, which is based on the GPL-licensed Stanford NER entity extractor in place of OpenNLP. Stanford NER tends to do a better job with the extraction phase, especially with slightly unusual input sentences.
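One way to check the first point, sketched in Python: collect the ids from the supplementary file and stream through the main gazetteer looking for the same values. The function names `read_ids` and `find_collisions` are hypothetical, and the sketch assumes that the geonameid is the first tab-separated field of each line, as in the GeoNames dump format. The id 900000001 in the sample data is made up; the other two ids are the ones mentioned above.

```python
def read_ids(lines):
    """Collect the geonameid (first tab-separated field) from each line."""
    ids = set()
    for line in lines:
        first = line.split("\t", 1)[0].strip()
        if first.isdigit():
            ids.add(int(first))
    return ids


def find_collisions(supplementary_lines, gazetteer_lines):
    """Return the sorted ids that appear in both inputs.

    The gazetteer is consumed line by line, so a large file can be
    passed as an open file object without loading it into memory.
    """
    wanted = read_ids(supplementary_lines)
    clashes = set()
    for line in gazetteer_lines:
        first = line.split("\t", 1)[0].strip()
        if first.isdigit() and int(first) in wanted:
            clashes.add(int(first))
    return sorted(clashes)


# Small in-memory sample; the rest of each record is elided for brevity.
supplementary = ["8629796\tKottekkad\t...", "900000001\tAttore\t..."]
gazetteer = ["8629796\t(existing record)\t...", "8629800\t(existing record)\t..."]
print(find_collisions(supplementary, gazetteer))
```

With real data, the two arguments would be open file objects for SupplementaryGazetteer.txt and allCountries.txt (opened with UTF-8 encoding). Any id the function reports should be replaced with a value that does not occur in the GeoNames data.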

Let me know if that works, or if you're still having trouble.

- Charlie


On Tue, Nov 12, 2013 at 12:08 PM, Febin Thomas <feb...@algotree.com> wrote:
Hi Charlie,

Thanks for the reply.

Please find the format I used in the attached file SupplementaryGazetteer.txt.

I found the latitude and longitude given in the file on the site http://pos-map.appspot.com/en/coordinates60.html

and I entered the WGS84 decimal degree values shown on the right side of the page.

After that I created the index using the command

MAVEN_OPTS="-Xmx2048M" mvn exec:java -Dexec.mainClass="com.bericotech.clavin.index.IndexDirectoryBuilder"

Then I ran the WorkflowDemo program with the text I had added. I have attached the file Somalia-doc.txt herewith.

I get just


"That's all folks!"


in the terminal.

--
Febin Thomas
=================
ALGOTREE,
83,DD Oceano Mall,
Marine Drive,
Kochin.
PIN-682 011

TELE:+91-484-2350275
in...@algotree.com
www.algotree.com
=================
SupplementaryGazetteer.txt
Somalia-doc.txt

Charlie Greenbacker

Nov 13, 2013, 7:21:17 AM
to Febin Thomas, clavin...@googlegroups.com
Glad to hear it's working for you, Febin. Please let us know if you run into any further problems.

Cheers,
Charlie


On Wed, Nov 13, 2013 at 2:03 AM, Febin Thomas <feb...@algotree.com> wrote:
Dear Charlie,

Yes, CLAVIN-NERD seems to be working with my examples.

I corrected the duplicate geonameids, but CLAVIN still did not give me the expected result. CLAVIN-NERD, however, recognised the places, including the new entries I added.

Thanks for your kind attention and for helping me get it running.


