Training of Named Entity Recognition

Abhilash Dighe

Oct 16, 2013, 11:51:19 AM

to nltk-...@googlegroups.com

Hi,

I am working on a project where I need to recognize universities in the education section of a resume. I plan to use NER for this, treating every named entity tagged as an Organization as an educational institute. However, the built-in NER implementation in NLTK isn't giving me the desired results.

I have decided to build my own corpus and train the NER to improve accuracy. However, I am not able to find any documentation that shows where the training data should be kept, what format it should be in, how to train the NER on that corpus, etc.

I have gone through the nltk/chunk/named_entity.py code. When defining its train_paths, it refers to a corpora/ace_data folder, which I'm not able to find. The code also isn't commented well enough for me to understand what to change in order to add my own training data.

I was hoping that you guys could help me out with this problem, or point me to some blog post which could be of help. Thanks in advance.

Regards,

Abhilash Dighe

Jacob Perkins

Oct 18, 2013, 5:22:51 PM

to nltk-...@googlegroups.com

Hi Abhilash,

I wrote a post about how to train a chunker here: http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/

You can use this for NER training when you treat NEs as chunk tags, which is basically how the NLTK NER chunker works.
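To make "treating NEs as chunk tags" concrete, here is a toy sketch in plain Python. It is not NLTK's implementation or the code from the post above: it only shows that once every token carries a chunk tag, finding entities becomes an ordinary tagging problem. The sentences, the UNI label, and the function names (`train_word_tagger`, `tag_sentence`) are all made up for illustration, and a real chunker would use far richer features than the word itself.

```python
from collections import Counter, defaultdict

# Toy training data: each sentence is a list of (word, pos, chunk_tag) triples.
# The UNI label and the sentences are invented for illustration only.
TRAIN = [
    [("She", "PRP", "O"), ("studied", "VBD", "O"), ("at", "IN", "O"),
     ("Example", "NNP", "B-UNI"), ("University", "NNP", "I-UNI"), (".", ".", "O")],
    [("He", "PRP", "O"), ("joined", "VBD", "O"),
     ("Sample", "NNP", "B-UNI"), ("College", "NNP", "I-UNI"), (".", ".", "O")],
]

def train_word_tagger(sentences):
    """Learn the most frequent chunk tag seen for each lowercased word."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, _pos, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_sentence(model, tagged_words):
    """Assign a chunk tag to each (word, pos) pair; unseen words get 'O'."""
    return [(w, p, model.get(w.lower(), "O")) for w, p in tagged_words]

model = train_word_tagger(TRAIN)
result = tag_sentence(
    model,
    [("I", "PRP"), ("attended", "VBD"), ("Sample", "NNP"), ("University", "NNP")],
)
```

A baseline like this only memorizes words it has seen, so it is useful mainly for checking that the training data is in the right shape before moving to a proper chunker.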

Jacob

---

http://streamhacker.com

http://twitter.com/japerk

Abhilash Dighe

Oct 20, 2013, 7:42:28 PM

to nltk-...@googlegroups.com

Thanks Jacob for replying.

I have gone through your post and it was quite helpful. I have downloaded the conll2000 corpus and am planning to replace its train.txt file with my own data. Let me restate my problem: I want to extract universities from the education section of a resume by training the chunker. I am planning on keeping just two tags for my chunks, UNI and NONUNI. So for training, I should first POS tag my training sentences and then add a chunk tag, UNI or NONUNI, after each one in the train.txt file. I will build the test data the same way and then train and test as you have done in your code.

Am I right? I'm asking because it will take time to build the training data, and I don't want to jump into it without being sure of my method.

Jacob Perkins

Oct 21, 2013, 1:12:43 PM

to nltk-...@googlegroups.com

Hi Abhilash,

To use the conll2000 format, you should use IOB chunk tags (the 3rd column in conll2000). That means any word that is not part of a university name should be tagged O, the first word in a university name should be B-UNI, and every other word in the name should be I-UNI. So you'd end up with something like this:

The DT O
Fake NN B-UNI
University NN I-UNI
Name NN I-UNI
is VB O
...

This way the chunk parsers can look specifically for B-* and I-* tags to determine which words make up a chunk.
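As a rough illustration of how B-* and I-* tags mark chunk boundaries, here is a small sketch in plain Python that reads lines in the three-column format above and collects the UNI chunks. The helper names (`parse_conll_lines`, `extract_chunks`) are made up for this example and are not part of NLTK.

```python
def parse_conll_lines(text):
    """Split 'word pos iob' lines into (word, pos, iob) triples, skipping blank lines."""
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]

def extract_chunks(triples, label="UNI"):
    """Collect the words of each chunk that starts with B-<label>."""
    chunks, current = [], []
    for word, _pos, iob in triples:
        if iob == "B-" + label:
            # A B- tag always starts a new chunk, closing any open one.
            if current:
                chunks.append(" ".join(current))
            current = [word]
        elif iob == "I-" + label and current:
            # An I- tag continues the chunk that is currently open.
            current.append(word)
        else:
            # 'O' (or an I- tag with no open chunk) ends the current chunk.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

SAMPLE = """\
The DT O
Fake NN B-UNI
University NN I-UNI
Name NN I-UNI
is VB O
"""

universities = extract_chunks(parse_conll_lines(SAMPLE))
```

NLTK has its own helpers for this format, such as nltk.chunk.conllstr2tree, which takes the chunk types to keep as an argument; check the documentation for your NLTK version before relying on the exact signature.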

Jacob

---

http://streamhacker.com

http://twitter.com/japerk

Daniel Wu

Aug 5, 2014, 11:19:57 AM

to nltk-...@googlegroups.com

Hi Abhilash,

I apologize for a question that isn't directly helpful, but were you able to solve this issue from last year? I'm working on a very similar problem and am curious what Python code and solution you came up with. I'd greatly appreciate learning from what you've learned.
