Human names recognition


Jonathon Paarlberg

Aug 27, 2015, 9:54:16 AM
to OpenRefine
Do any of you know of a generic solution to let a computer recognize that a text string is probably the name of a human being? I'm thinking some massive dictionary of given names and surnames could be referenced? Are any of them FOSS?

Thanks.

Joe Wicentowski

Aug 27, 2015, 10:37:31 AM
to openr...@googlegroups.com
Jonathon,

You might take a look at Stanford NER, which includes "person" among the named entities it can identify.  I believe it's been trained on a corpus of modern English texts, so it relies on a statistical model rather than a simple dictionary lookup.  See http://nlp.stanford.edu/software/CRF-NER.shtml.

For example, if you download the package, unzip it, and cd into the directory, this command (adapted from the FAQ):

$ java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt -outputFormat inlineXML

will yield the following results:

The fate of <ORGANIZATION>Lehman Brothers</ORGANIZATION>, the beleaguered investment bank, hung in the balance on Sunday as <ORGANIZATION>Federal Reserve</ORGANIZATION> officials and the leaders of major financial institutions continued to gather in emergency meetings trying to complete a plan to rescue the stricken bank.  Several possible plans emerged from the talks, held at the <ORGANIZATION>Federal Reserve Bank of New York</ORGANIZATION> and led by <PERSON>Timothy R. Geithner</PERSON>, the president of the <ORGANIZATION>New York Fed</ORGANIZATION>, and <ORGANIZATION>Treasury</ORGANIZATION> Secretary <PERSON>Henry M. Paulson Jr</PERSON>.
CRFClassifier tagged 85 words in 2 documents at 1075.95 words per second.

I'm not sure whether anyone has integrated entity recognition as a resolver for OpenRefine, but you could pre-process your data before handing it to OpenRefine, for example to filter out unwanted entities such as organizations.
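
As a rough sketch of that pre-processing step (not something I've wired into OpenRefine), the inlineXML output could be scanned for the tags you care about. The function name extract_entities and the sample string are just illustrations; the three labels are the ones the 3class classifier emits.

```python
import re

# Matches the inline tags produced by -outputFormat inlineXML,
# e.g. <PERSON>...</PERSON>. The 3class model emits these three labels.
ENTITY_RE = re.compile(r"<(PERSON|ORGANIZATION|LOCATION)>(.*?)</\1>", re.DOTALL)


def extract_entities(tagged_text, wanted=("PERSON",)):
    """Return the text of every tagged entity whose label is in `wanted`."""
    return [
        match.group(2)
        for match in ENTITY_RE.finditer(tagged_text)
        if match.group(1) in wanted
    ]


# Hypothetical input: a fragment of the tagged output shown above.
sample = (
    "led by <PERSON>Timothy R. Geithner</PERSON>, the president of the "
    "<ORGANIZATION>New York Fed</ORGANIZATION>, and "
    "<ORGANIZATION>Treasury</ORGANIZATION> Secretary "
    "<PERSON>Henry M. Paulson Jr</PERSON>."
)

print(extract_entities(sample))
# ['Timothy R. Geithner', 'Henry M. Paulson Jr']
```

The resulting list could be written out one name per line and loaded into OpenRefine as an ordinary column.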

Joe


Jonathon Paarlberg

Aug 27, 2015, 10:48:41 AM
to OpenRefine
That's very interesting, Joe. I may want to put that in my little black bag. I'll check it out.

Jonathon Paarlberg

Aug 27, 2015, 10:49:56 AM
to OpenRefine
Just a thought, but another use for it would be to run a replace on all <PERSON> entities so as to anonymize the information.
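
Something like the following might do it, assuming the text has already been tagged with the inlineXML output format. The function name anonymize and the "[NAME]" placeholder are made up for illustration, and any name the tagger misses would of course stay in the text.

```python
import re

# A tagged person span in the inlineXML output, e.g. <PERSON>Jane Doe</PERSON>.
PERSON_RE = re.compile(r"<PERSON>.*?</PERSON>", re.DOTALL)
# The remaining opening/closing tags from the 3class model.
OTHER_TAG_RE = re.compile(r"</?(?:ORGANIZATION|LOCATION)>")


def anonymize(tagged_text, placeholder="[NAME]"):
    """Replace each tagged person with a placeholder and drop the other tags."""
    without_people = PERSON_RE.sub(placeholder, tagged_text)
    return OTHER_TAG_RE.sub("", without_people)


# Hypothetical tagged input.
tagged = (
    "Talks were led by <PERSON>Timothy R. Geithner</PERSON> of the "
    "<ORGANIZATION>New York Fed</ORGANIZATION>."
)

print(anonymize(tagged))
# Talks were led by [NAME] of the New York Fed.
```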

