Who said that?! Figuring out dialogue speakers


Erik

Dec 29, 2014, 12:08:51 PM
to nltk-...@googlegroups.com
Hello,

I'm trying to build a tagger/classifier that can look at a paragraph of a novel and, from contextual clues, take a good guess at who said it.

I have a good idea of what features to pay attention to, but I'm not sure which method to use once I have them. I like some of the features of a Brill Tagger, Naive Bayes, etc. I was hoping someone could suggest the best way forward. Here's an example:

[1] Mike smiled. "I don't think we should do that."

[2] "If we don't, we're finished," Alicia said.

[3] "So you claim."

[4] "Wait a second," John interrupted. "Alicia, who told you that?"

[5] "Don't worry about it," she said.

My basic plan is to look at the most telling features and, in their absence, rely on some induction to make guesses. So here, the tagger would be extremely confident about lines [2] and [4], because there are names (Alicia, John) attached directly to them. It would be similarly confident about [1], because Mike's name appears in the sentence directly preceding the quote. For [3], I'd like it to say, "Okay, no clues here; let's assume this is a back-and-forth, and therefore it's Mike speaking again, since Alicia just took her turn." [5] doesn't have a name attached to it, but Alicia is called out specifically in [4], so there could be a feature encoding that. If John hadn't called out Alicia, maybe it could note the gender of the pronoun and guess based on that.
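For what it's worth, here's a minimal sketch of those fallback heuristics in plain Python. The character list is hard-coded for this passage and the rules are deliberately crude -- this is the baseline I'd want the learned model to beat, not the model itself:

```python
import re

# Character list hard-coded for this passage; in practice you'd pull
# proper nouns from a tagger or NER step.
NAMES = ["Mike", "Alicia", "John"]

def attribute_speakers(paragraphs):
    """Guess a speaker per paragraph from two crude cues:
    1. a known name in the narration (text outside quotation marks),
    2. otherwise assume a two-person back-and-forth (same speaker
       as two turns ago)."""
    speakers = []
    for text in paragraphs:
        narration = re.sub(r'"[^"]*"', ' ', text)  # drop quoted spans
        named = [n for n in NAMES if n in narration]
        if named:
            speakers.append(named[0])         # explicit attribution
        elif len(speakers) >= 2:
            speakers.append(speakers[-2])     # alternation fallback
        else:
            speakers.append(None)             # no evidence
    return speakers

paras = [
    'Mike smiled. "I don\'t think we should do that."',
    '"If we don\'t, we\'re finished," Alicia said.',
    '"So you claim."',
    '"Wait a second," John interrupted. "Alicia, who told you that?"',
    '"Don\'t worry about it," she said.',
]
print(attribute_speakers(paras))
# ['Mike', 'Alicia', 'Mike', 'John', 'Mike']
```

Note that [5] comes out as Mike under pure alternation, when the addressee cue in [4] and the pronoun gender point to Alicia -- exactly the extra features I'd want to add.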

I would use a classifier, but there is an arbitrary number of possible speakers, and really the tags we want are going to be PREVIOUS-SPEAKER, TWO-SPEAKERS-BACK, NEAREST-PROPER-NOUN, etc. But salted in among those will be some explicitly tagged speakers. So I suspect that a Brill tagger or Naive Bayes won't generalize properly with such a large/sparse tag set. Chapter 6 of the NLTK book mentions open-class classification once, but promptly drops the subject. Could that be promising?
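To make that concrete, here's a sketch of the relative-tag encoding I have in mind (tag names as above; NEW-SPEAKER is a placeholder I'm inventing for everything else):

```python
def encode_relative(speakers):
    """Map absolute speaker labels onto a small closed tag set, so a
    tagger never has to treat each character name as its own class.
    NEW-SPEAKER here stands in for the harder cases (e.g.
    NEAREST-PROPER-NOUN), which need the text, not just the labels."""
    tags = []
    for i, s in enumerate(speakers):
        if i >= 1 and speakers[i - 1] == s:
            tags.append("PREVIOUS-SPEAKER")
        elif i >= 2 and speakers[i - 2] == s:
            tags.append("TWO-SPEAKERS-BACK")
        else:
            tags.append("NEW-SPEAKER")
    return tags

print(encode_relative(["Mike", "Alicia", "Mike", "Alicia"]))
# ['NEW-SPEAKER', 'NEW-SPEAKER', 'TWO-SPEAKERS-BACK', 'TWO-SPEAKERS-BACK']
```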

That's about as far as I've gotten -- I'm struggling to wrap my head around this! Thanks for any advice.

-erik

Chris Hokamp

Dec 31, 2014, 1:42:51 PM
to nltk-...@googlegroups.com

> I'm trying to build a tagger/classifier that can look at a paragraph of a novel and, from contextual clues, take a good guess at who said it.
 
This looks like a sequence tagging task, because the previous and following paragraphs presumably add information that makes it easier to label the current speaker.

> I have a good idea of what features to pay attention to, but I'm not sure which method to use once I have them. I like some of the features of a Brill Tagger, Naive Bayes, etc.

You could use something like a Conditional Random Field (CRF), passing in each paragraph as an element of the sequence. NLTK has an interface to the Mallet CRF library [1], but you would need to provide the feature implementations yourself.
 
> My basic plan is to look at the most telling features, and in their absence, rely on some induction to make guesses. So here, the tagger would be extremely confident about lines [2] and [4], because there are names (Alicia, John) attached directly to them. It would be similarly confident about [1], because Mike's name appears in the sentence directly preceding it. For [3], I'd like it to say, 'Okay, no clues here, let's assume that this is a back-and-forth, and therefore it's Mike speaking again, since Alicia just took her turn." [5] doesn't have a name attached to it, but Alicia is called out specifically in [4], so there could be a feature coding for that. If John hadn't called out Alicia, maybe it could note the gender of the pronoun and take a guess based on that.

All of these cues can be encoded as features for a CRF.
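As a sketch, the per-paragraph features might look like this. The feature names are made up here, but the dict-per-sequence-element shape is what CRF toolkits such as python-crfsuite accept:

```python
import re

NAMES = ["Mike", "Alicia", "John"]  # hypothetical character list

def para_features(paras, i):
    """Binary features for paragraph i; names are illustrative."""
    text = paras[i]
    narration = re.sub(r'"[^"]*"', ' ', text)  # text outside quotes
    feats = {
        "has_quote": '"' in text,
        "name_in_narration": any(n in narration for n in NAMES),
        # a name appearing only inside the quote is likely an addressee
        "addressee_in_quote": any(n in text and n not in narration
                                  for n in NAMES),
        "has_she": " she " in text.lower(),
        "has_he": " he " in text.lower(),
    }
    if i > 0:
        feats["name_in_prev_para"] = any(n in paras[i - 1] for n in NAMES)
    return feats

paras = [
    'Mike smiled. "I don\'t think we should do that."',
    '"If we don\'t, we\'re finished," Alicia said.',
    '"So you claim."',
    '"Wait a second," John interrupted. "Alicia, who told you that?"',
    '"Don\'t worry about it," she said.',
]
X = [para_features(paras, i) for i in range(len(paras))]
```

Each of your five examples lights up a different subset of these features, which is exactly what lets the CRF weigh them against the surrounding sequence.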
 

> I would use a classifier, but there are any arbitrary number of possible speakers, and really the tags we want are going to be PREVIOUS-SPEAKER, TWO-SPEAKERS-BACK, NEAREST-PROPER-NOUN, etc. But salted in there will be some explicitly tagged speakers. So I suspect that a Brill tagger or a Naive Bayes will not be able to generalize properly with such a large/sparse set of tags. Chapter 6 of the NLTK book mentions open-class classification once, but promptly drops the subject. Could be promising?


I think you need to train on tagged data, meaning that you need training (and test) data where the paragraphs are labeled with the speaker names (or tags like <PREVIOUS-SPEAKER>). If you don't know all of the possible speakers for a given book, you would have to induce them with an unsupervised method like clustering or EM, but that seems pretty risky. Also, the set of possible speaker identities obviously changes drastically from book to book.

Hope this helps a bit,
Chris


Erik

Jan 1, 2015, 1:26:55 PM
to nltk-...@googlegroups.com
CRF looks like a tremendous lead, thank you, Chris.

Sneha Jha

Jan 1, 2015, 1:38:54 PM
to nltk-...@googlegroups.com
At least part of this warrants looking into coreference resolution, I think.
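A toy version of the pronoun side of it, just to illustrate the idea (the gender lexicon is a hypothetical stand-in for a real resource):

```python
GENDER = {"Mike": "m", "Alicia": "f", "John": "m"}  # hypothetical lexicon

def resolve_pronoun(pronoun, mentioned):
    """Resolve she/her (or he/him) to the most recently mentioned
    name of matching gender; `mentioned` lists names in order of
    appearance, most recent last. Real coreference systems weigh
    much more (number, salience, syntax), but this is the core idea."""
    want = "f" if pronoun.lower() in ("she", "her") else "m"
    for name in reversed(mentioned):
        if GENDER.get(name) == want:
            return name
    return None

print(resolve_pronoun("she", ["Mike", "Alicia", "John"]))  # Alicia
```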

--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.