Multi-language Entity Extraction

Peter Chan

Feb 23, 2013, 10:50:14 AM
to ep...@googlegroups.com
Here is Sit's reply on the status of multi-language entity extraction.

Here is the info link: http://nlp.stanford.edu/software/CRF-NER.shtml & the demo at http://nlp.stanford.edu:8080/ner/

It seems only English, German, and Chinese models are available at the moment, and they are separate models. So we are in a similar situation to Lucene: additional work will be required for language detection and for multi-language documents.
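
To make the "additional work" concrete, here is a rough sketch of per-paragraph language guessing, so that each paragraph of a mixed-language message could be handed to the matching model. Everything in it is an assumption for illustration: the function names, the tiny stopword lists, and the 0.5 threshold are made up, and a real setup would use a proper language-detection library rather than this heuristic.

```python
import re

# Tiny illustrative stopword lists (hypothetical, far from complete).
EN_STOPWORDS = {"the", "and", "is", "of", "to", "in", "that", "it"}
DE_STOPWORDS = {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"}

# Basic CJK Unified Ideographs block; used as a crude signal for Chinese.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")
# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def guess_language(text):
    """Return 'zh', 'de', 'en', or None when there is no usable evidence."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    # Mostly CJK ideographs: treat as Chinese.
    if len(CJK_RE.findall(text)) / len(letters) > 0.5:
        return "zh"
    words = [w.lower() for w in WORD_RE.findall(text)]
    en_hits = sum(w in EN_STOPWORDS for w in words)
    de_hits = sum(w in DE_STOPWORDS for w in words)
    if en_hits == 0 and de_hits == 0:
        return None
    return "de" if de_hits > en_hits else "en"


def split_by_language(document):
    """Split on blank lines and tag each paragraph with a guessed language."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [(guess_language(p), p) for p in paragraphs]
```

Paragraphs tagged `None` would need a fallback, for example the English model or no entity extraction at all.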


Sudheendra Hangal

Feb 25, 2013, 1:41:34 AM
to Peter Chan, ep...@googlegroups.com

We should take the German and Chinese claims with a grain of salt, since those named entity recognizers are unlikely to
have been trained, used, and tested as extensively as the English one. The quality of results will have to be evaluated
when we have a real corpus in either of these languages.

An interesting advantage for us is that since we're in the domain of email, we have accurate names for correspondents.
At a minimum, these can be used as a "gazetteer" of names to recognize, even if the named entity recognizer fails
or is non-existent for a language.
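
A minimal sketch of the gazetteer idea, assuming the correspondent names have already been pulled from the message headers. The function names are hypothetical, and the `\w+` tokenizer is a simplification that will split words in scripts that rely on combining marks; names and text are tokenized the same way, so lookups stay consistent.

```python
import re

TOKEN_RE = re.compile(r"\w+")


def build_gazetteer(correspondent_names):
    """Map each name's lowercased token tuple to its canonical spelling."""
    gazetteer = {}
    for name in correspondent_names:
        tokens = tuple(TOKEN_RE.findall(name.lower()))
        if tokens:
            gazetteer[tokens] = name
    return gazetteer


def find_names(text, gazetteer):
    """Return (token_index, canonical_name) pairs, preferring longest matches."""
    lowered = [t.lower() for t in TOKEN_RE.findall(text)]
    max_len = max((len(key) for key in gazetteer), default=0)
    hits = []
    i = 0
    while i < len(lowered):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in gazetteer:
                hits.append((i, gazetteer[key]))
                matched = n
                break
        i += matched if matched else 1
    return hits
```

For example, `find_names(body, build_gazetteer(["Peter Chan", "Sudheendra Hangal"]))` would tag those two names in a message body regardless of capitalization, with no language-specific model involved.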

Btw, Lucene analyzer language support means that a somewhat sensible analyzer is available (the part that does tokenization
and stemming). However, you should still be able to search for virtually all Unicode strings -- for example, I can
correctly search full words in Indic languages (Kannada and English) that are not on the Lucene analyzers page.

[s]



