Hi,
At work, we're currently using OpenNLP for named entity extraction. I want to do some testing with Epic to see how the two compare (since we code in Scala, it would be nice to use a Scala library). Here is the code I'm using:
    val pipeline = {
      MLSentenceSegmenter.bundled("en").get andThen TreebankTokenizer
    }
    val ner = Segmenter.nerSystem(NerSelector.loadNer("en").get.asInstanceOf[SemiCRF[String, String]])
    try {
      val preprocessedSlab = pipeline(Slab(txt))
      val nered = ner(preprocessedSlab)
      for ((span, sentence) <- nered.iterator[Sentence] if span.nonEmpty) {
        for ((espan, entity) <- nered.covered[EntityMention](span)) {
          println(entity.entityType + " " + preprocessedSlab.spanned(espan))
        }
      }
    } catch {
      case ex: Exception => println(s"Error while processing $txt", ex)
    }
I ran it against the following text:
Singer-songwriter David Crosby hit a jogger with his car Sunday evening, a spokesman said. The accident happened in Santa Ynez, California, near where Crosby lives, on January 31, 2015. Crosby was driving at approximately 50 mph when he struck the jogger, according to California Highway Patrol Spokesman Don Clotworthy. The posted speed limit was 55.
and Epic gave me these entities:
PER David Crosby
LOC Santa Ynez
LOC California
PER Crosby
PER Don Clotworthy
In comparison, OpenNLP also recognized these entities:
California Highway Patrol
Epic also misses the date information, for example. Is this most likely due to a difference in the models the two libraries use? Or is there something else I should be doing with Epic to pick up these entities?
Thanks,
scott s