Extracting relations using custom entities.


Shawny Boy

Mar 2, 2018, 11:23:16 AM
to nltk-users
Hi! 

I have a relation extraction problem, and I am trying to use NLTK's built-in function, extract_rels(), to extract the relations. However, I wish to extract them using my own list of entity types, instead of the conventional named entity types such as ORGANIZATION, PERSON, LOCATION, etc.

Here is a sample of my code:
-----------------------------------------------------------------------------------------------------------------------------------------
import re
import sys

import nltk
from nltk.sem.relextract import extract_rels, rtuple

# Pattern for the filler between the two entities; '.' matches anything.
IN = re.compile(r'.')

for path in sys.argv[1:]:
    with open(path, 'r') as f:
        sample = f.read()

    sentences = nltk.sent_tokenize(sample)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

    for i, sent in enumerate(tagged_sentences):
        chunked = nltk.ne_chunk(sent)
        rels = extract_rels('PER', 'ORG', chunked, corpus='ace', pattern=IN)
        for rel in rels:
            print('{0:<5}{1}'.format(i, rtuple(rel)))
-----------------------------------------------------------------------------------------------------------------------------------------
I wish to replace 'PER' and 'ORG' with my own list of entities. If that is not possible, can anyone point me to a different resource that could help me achieve my goal? Thank you for the help!

Best,
Shawny Boy

Alexis

Mar 3, 2018, 3:38:00 AM
to nltk-...@googlegroups.com
To extract your own entity types, you need to train a classifier on your own annotated corpus (marked up with the entities you want to detect). Take a look at the NLTK book, especially chapters 6 and 7.

Alexis
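A minimal sketch of what such an annotated training corpus might look like, assuming the IOB encoding used in chapter 7 of the NLTK book; the KEYWORD entity label and the example sentence are made up for illustration:

```python
import nltk
from nltk.chunk import conlltags2tree

# Each training sentence is a list of (word, POS tag, IOB tag) triples,
# with B-/I- prefixes marking the beginning/inside of a custom entity.
train_iob = [
    [("Open", "VB", "O"), ("the", "DT", "O"),
     ("log", "NN", "B-KEYWORD"), ("file", "NN", "I-KEYWORD"),
     (".", ".", "O")],
]

# NLTK's chunker APIs work on trees, so convert each IOB sentence:
train_sents = [conlltags2tree(sent) for sent in train_iob]
print(train_sents[0])
```

Once the corpus is in this tree form, it can be fed to the chunker classes developed in chapter 7.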


Shawny Boy

May 3, 2018, 1:31:35 AM
to nltk-users
Thanks!

Shawny Boy

May 11, 2018, 9:22:27 AM
to nltk-users
Hi!

I've been reading chapters 6 and 7, and I'm looking at example 3.2 of chapter 7 (ConsecutiveNPChunker). I see that it takes "train_sents" and "test_sents" parameters. What exactly is the format of "train_sents" and "test_sents"? Thanks!

Best,
Shawny



Shawny Boy

May 11, 2018, 9:22:34 AM
to nltk-users
Hi, 

Is this: 
"Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O"
...the "train_sents" format? Also, am I supposed to write my own code to get my data into that format? Thanks!

Best,
Shawny
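One way to load listings like the one above, assuming whitespace-separated "word POS IOB" lines with blank lines between sentences (the helper name read_iob is made up):

```python
def read_iob(text):
    """Parse 'word POS IOB' lines into per-sentence lists of triples."""
    sents, current = [], []
    for line in text.splitlines():
        if not line.strip():           # blank line ends a sentence
            if current:
                sents.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:                        # flush the last sentence
        sents.append(current)
    return sents

sample = "phone NN B-KEYWORD\nnumber NN I-KEYWORD\n, , O"
print(read_iob(sample))
# → [[('phone', 'NN', 'B-KEYWORD'), ('number', 'NN', 'I-KEYWORD'), (',', ',', 'O')]]
```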


Shawny Boy

May 18, 2018, 6:38:39 PM
to nltk-users

Hi!

I have converted my "training corpus" to IOB format, but I am not sure what to do with it. In CH 7 under Named Entity Recognition it says: "based on this training corpus, we can construct a tagger that can be used to label new sentences...", but it doesn't seem to go into much detail on how to do so.

Attached is my "training corpus". Thanks!

Best,
 Shawny

openTrained.txt

Dimitriadis, A. (Alexis)

May 18, 2018, 7:11:32 PM
to <nltk-users@googlegroups.com>
On 18 May 2018, at 16:58, Shawny Boy <lilsh...@gmail.com> wrote:


> Hi!
>
> I have converted my "training corpus" to IOB format, but I am not sure what to do with it. In CH 7 under Named Entity Recognition it says: "based on this training corpus, we can construct a tagger that can be used to label new sentences...", but it doesn't seem to go into much detail on how to do so.

Read chapter 7 again, more carefully this time. It shows how "NP chunking" and named entity recognition can be turned into the same task (word classification), by encoding them in IOB format. The earlier sections of the chapter show step by step, with code, how to implement a chunker that can also be used for named entity recognition.
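A sketch of the simplest such chunker, adapted from chapter 7 of the NLTK book: a unigram tagger that learns the most likely IOB tag for each POS tag. Trained on sentences annotated with custom labels (KEYWORD below is an invented example), it chunks those entities instead of NPs:

```python
import nltk

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Learn, for each POS tag, its most frequent IOB chunk tag.
        train_data = [[(pos, iob) for _, pos, iob in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, POS) pairs.
        pos_tags = [pos for _, pos in sentence]
        tagged = self.tagger.tag(pos_tags)
        conlltags = [(word, pos, iob)
                     for (word, pos), (_, iob) in zip(sentence, tagged)]
        return nltk.chunk.conlltags2tree(conlltags)

# Tiny demonstration with a made-up KEYWORD-annotated sentence:
train_tree = nltk.chunk.conlltags2tree(
    [("phone", "NN", "B-KEYWORD"), (",", ",", "O")])
chunker = UnigramChunker([train_tree])
print(chunker.parse([("modem", "NN"), (",", ",")]))
```

The ConsecutiveNPChunker in the book replaces the unigram tagger with a feature-based classifier, but the surrounding interface is the same.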

> Attached is my "training corpus". Thanks!

> Best,
> Shawny

That looks like a word list. Your training corpus should look like the texts you will be tagging (presumably, technical manuals), with non-entity words tagged "O". (Also, the format must be ONE word per line, e.g. "file creation flags" should be three lines.)

For terms like your Unix error codes, a classifier is overkill. Every instance of the token O_WRONLY is an error code, for example, so a dictionary-based approach would be much easier to implement (especially since you seem to have a dictionary of the identifiers you want to tag). Where it starts to get interesting is multi-word expressions like "file creation flags"; but if you are still in over your head with the classifier approach, consider trying the simple approach first.
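A sketch of that dictionary-based approach; the flag names are real POSIX identifiers, but the list and the ERRORCODE label are just illustrative:

```python
# Known identifiers to tag (in practice, loaded from your dictionary file).
KNOWN_CODES = {"O_WRONLY", "O_RDONLY", "O_CREAT", "EACCES"}

def tag_entities(tokens):
    """Label a token ERRORCODE if it appears in the dictionary, else O."""
    return [(tok, "ERRORCODE" if tok in KNOWN_CODES else "O")
            for tok in tokens]

print(tag_entities(["open", "fails", "with", "EACCES"]))
# → [('open', 'O'), ('fails', 'O'), ('with', 'O'), ('EACCES', 'ERRORCODE')]
```

Multi-word terms can be handled the same way by matching token n-grams against the dictionary before falling back to single tokens.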

Alexis

Shawny Boy

Jun 6, 2018, 9:48:50 PM
to nltk-users
Hi Alexis! 

I've decided to approach my problem a different way, but thank you so much for the help so far! I really appreciate it! 