Multi-language Entity Extraction

Peter Chan

Feb 23, 2013, 10:50:14 AM

to ep...@googlegroups.com

Here is Sit's reply on the status of multi-language support in entity extraction.

Here is the info link: http://nlp.stanford.edu/software/CRF-NER.shtml & the demo at http://nlp.stanford.edu:8080/ner/

It seems only English, German, and Chinese are available at the moment, and each is a separate model. So we are in a similar position to Lucene: additional work will be required for language detection and for multi-language documents.
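A minimal sketch of the language-detection step mentioned above, assuming each paragraph is routed to one of the three separate models by a language tag. The stopword lists, the 30% CJK threshold, and the function names `guess_language` and `route_segments` are illustrative assumptions, not part of Stanford NER or Lucene; a real system would use a trained language identifier.

```python
import re

# Tiny hypothetical stopword samples, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}

# Basic CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def guess_language(text):
    """Roughly guess 'en', 'de', or 'zh'; return None when unsure."""
    if not text:
        return None
    cjk_chars = len(CJK.findall(text))
    # Assumed threshold: at least 30% of non-space characters are CJK.
    if cjk_chars > 0 and cjk_chars >= 0.3 * len(text.replace(" ", "")):
        return "zh"
    # Runs of Unicode letters (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # No stopword hits, or a tie between languages: refuse to guess.
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best


def route_segments(document):
    """Split on blank lines and tag each paragraph with a guessed language."""
    segments = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [(guess_language(p), p) for p in segments]
```

Tagging per paragraph rather than per message is one way to handle multi-language documents: each segment can then be handed to the model for its own language, and segments tagged `None` can be skipped or sent to a fallback.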

Sudheendra Hangal

Feb 25, 2013, 1:41:34 AM

to Peter Chan, ep...@googlegroups.com

We should take the German and Chinese claims with a grain of salt, since those named entity recognizers are unlikely to have been trained, used, and tested as extensively as the English one. We will have to evaluate the quality of results once we have a real corpus in either of these languages.

An interesting advantage for us is that since we're in the domain of email, we have accurate names for correspondents. At a minimum, these can be used as a "gazette" of names to recognize, even if the named entity recognizer fails or is non-existent for a language.
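A minimal sketch of the gazette idea above, assuming the correspondent names have already been collected from the email headers. The helper names `build_gazetteer` and `find_names` are hypothetical, and this is plain regular-expression matching, not Stanford NER's own gazette feature.

```python
import re


def build_gazetteer(correspondent_names):
    """Compile one regex that matches any known name, longest names first."""
    names = sorted(
        {n.strip() for n in correspondent_names if n.strip()},
        key=lambda n: (-len(n), n),
    )
    if not names:
        return None
    alternation = "|".join(re.escape(n) for n in names)
    # The lookarounds stop a name from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def find_names(text, gazetteer):
    """Return (offset, matched text) for every gazette hit in the text."""
    if gazetteer is None:
        return []
    return [(m.start(), m.group(0)) for m in gazetteer.finditer(text)]
```

For example, with the names "Peter Chan" and "Peter" in the gazette, the text "Peterson declined" produces no hit, while "lunch with peter chan" matches the full name rather than just the first name. Because the match is on surface strings only, it works for any language whose names appear in the headers, but it will miss nicknames and inflected forms.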

Btw, Lucene analyzer language support means that a somewhat sensible analyzer (the part that does tokenization and stemming) is available for that language. However, you should still be able to search for virtually all Unicode strings -- for example, I can correctly search full words in Indic languages (Kannada and English) that are not on the Lucene analyzers page.
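A minimal sketch of why full-word search can work without a language-specific analyzer, assuming words are separated by whitespace. It is a toy inverted index in Python, not Lucene, and the function names are hypothetical. It would not help for Chinese, which does not put spaces between words.

```python
import unicodedata
from collections import defaultdict


def tokenize(text):
    """Split on whitespace, trim punctuation/symbols at the ends, case-fold."""
    tokens = []
    for raw in unicodedata.normalize("NFC", text).split():
        start, end = 0, len(raw)
        # Unicode categories starting with P (punctuation) or S (symbol).
        while start < end and unicodedata.category(raw[start])[0] in "PS":
            start += 1
        while end > start and unicodedata.category(raw[end - 1])[0] in "PS":
            end -= 1
        if start < end:
            tokens.append(raw[start:end].casefold())
    return tokens


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index


def search(index, word):
    """Look up a single full word; returns the matching document ids."""
    toks = tokenize(word)
    if len(toks) != 1:
        return set()
    return set(index.get(toks[0], set()))
```

Splitting on whitespace, rather than on a letters-only pattern, keeps combining marks such as Kannada vowel signs attached to their word. There is no stemming here, so only exact full words match, which is the trade-off of having no analyzer for the language.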

[s]

--
You received this message because you are subscribed to the Google Groups "ePADD" group.