Need to improve custom NER tagging output, using HMM

Yogesh Kulkarni

Sep 7, 2017, 11:57:12 AM
to nltk-users
I am building a NER tagger for a new tag "LEGAL".

I have training data in IOB format; a sample is below:

Training sentence: [('kurian', 'O'), ...('the', 'O'), ('proceedings', 'B-LEGAL'), ('by', 'O'), ('notification', 'B-LEGAL'), ('dated', 'O'), ... ('the', 'O'), ('land', 'B-LEGAL'), ('acquisition', 'I-LEGAL'), ('act', 'B-LEGAL'), ... ('of', 'O'), ('acquisition', 'B-LEGAL'), ('is', 'O'), ('residential', 'O'), ('and', 'O'), ('commercial', 'B-LEGAL'), ('for', 'O'),...('period', 'B-LEGAL'), ('of', 'O'), ('three', 'O'), ('months', 'O'), ('from', 'O'), ('today', 'O')]


The sample above is abbreviated; the full sentence contains many more IOB-marked sub-sequences. About 400 such texts are used for training.

Code I am using is:

from nltk.tag import hmm

print("Training sentence: {}".format(train_data[0]))

# Train a supervised HMM tagger on the IOB-tagged sentences
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
print(tagger)

for tst in test_x[:20]:
    test_sentence = " ".join(tst)
    print("Test sentence: {}".format(test_sentence))
    result = tagger.tag(test_sentence.split())
    print("Tagged sentence: {}".format(result))
    # Keep every word whose tag is B-LEGAL or I-LEGAL
    catchphrases = [w for w, t in result if "LEGAL" in t]
    print("Catchphrases: {}".format(catchphrases))




The output is not very promising.

Test sentence: 1 after hea...ame are dismissed
Tagged sentence: [('1', 'O'), ..., ('customs', 'B-LEGAL'), ('excise', 'I-LEGAL'), ('service', 'I-LEGAL'), ('tax', 'I-LEGAL'), ('appellate', 'O'),...('dismissed', 'O')]
Catchphrases: ['customs', 'excise', 'service', 'tax']

Test sentence: 1 this app.....accordingly
Tagged sentence: [('1', 'O'), ...('ordered', 'O'), ('accordingly', 'O')]
Catchphrases: []

Test sentence: 1 an issue ... costs
Tagged sentence: [('1', 'O'), ('an', 'O'), ('issue', 'O'), ... ('to', 'O'), ('costs', 'O')]
Catchphrases: []

Above are one working and two non-working examples. In fact, most test sentences get no LEGAL tags at all.
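
As a side note on reading the output: the catchphrase list holds single words, so a multi-word match comes out as separate tokens. A minimal sketch of grouping them into whole phrases, assuming the usual IOB convention (B- opens a phrase, I- continues it). The helper name `iob_to_phrases` is hypothetical, and the sample is an abbreviated copy of the first tagged sentence with the elided tokens left out:

```python
def iob_to_phrases(tagged, label="LEGAL"):
    """Group (word, tag) pairs into phrases using B-/I- tags for one label."""
    phrases, current = [], []
    for word, tag in tagged:
        if tag == "B-" + label:
            # A new phrase starts; close any phrase that was still open.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-" + label and current:
            # Continue the phrase that is currently open.
            current.append(word)
        else:
            # "O", another label, or a stray I- tag with no open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Abbreviated version of the first tagged test sentence.
sample = [('1', 'O'), ('customs', 'B-LEGAL'), ('excise', 'I-LEGAL'),
          ('service', 'I-LEGAL'), ('tax', 'I-LEGAL'), ('appellate', 'O'),
          ('dismissed', 'O')]
print(iob_to_phrases(sample))  # prints ['customs excise service tax']
```

This only changes how results are displayed; it does not affect how many tokens the tagger marks as LEGAL.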

Any ideas to improve? Any hyper-parameters?
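
One knob that does exist is the `estimator` argument of `train_supervised`, which defaults to an unsmoothed maximum-likelihood estimate. A minimal sketch of passing a smoothed estimator instead, on the assumption (not verified on this data) that words unseen in training are part of the problem. The Lidstone estimator, `gamma=0.1`, the helper name `train_smoothed_hmm` and the toy sentences are illustrative choices, not something taken from the real corpus:

```python
from nltk.probability import LidstoneProbDist
from nltk.tag import hmm


def train_smoothed_hmm(train_data, gamma=0.1):
    """Train a supervised HMM tagger with Lidstone-smoothed probabilities."""
    def lidstone(fd, bins):
        # gamma is added to every count, so unseen words and transitions
        # keep a small non-zero probability instead of zero.
        return LidstoneProbDist(fd, gamma, bins)

    trainer = hmm.HiddenMarkovModelTrainer()
    return trainer.train_supervised(train_data, estimator=lidstone)


# Toy data in the same (word, tag) format as train_data; assumption only.
toy_train = [
    [("the", "O"), ("land", "B-LEGAL"), ("acquisition", "I-LEGAL"), ("act", "O")],
    [("by", "O"), ("notification", "B-LEGAL"), ("dated", "O")],
]
toy_tagger = train_smoothed_hmm(toy_train)

# "unseen" never occurs in toy_train; the tagger still returns a tag for it.
toy_result = toy_tagger.tag(["the", "unseen", "notification"])
print(toy_result)
```

If this helps at all, `gamma` would be the value to tune on held-out sentences; whether it is enough for 400 training texts is an open question.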

Should I use a CRF instead? If so, is there a link to a tutorial?