Hello,
I'm trying to build a tagger/classifier that can look at a paragraph of a novel and, from contextual clues, make a good guess at who is speaking.
I have a good idea of what features to pay attention to, but I'm not sure which method to use once I have them. I like some aspects of a Brill tagger, Naive Bayes, etc., and I was hoping someone could suggest the best way forward. Here's an example:
[1] Mike smiled. "I don't think we should do that."
[2] "If we don't, we're finished," Alicia said.
[3] "So you claim."
[4] "Wait a second," John interrupted. "Alicia, who told you that?"
[5] "Don't worry about it," she said.
My basic plan is to look at the most telling features and, in their absence, rely on some induction to make guesses. So here, the tagger would be extremely confident about lines [2] and [4], because names (Alicia, John) are attached directly to them. It would be similarly confident about [1], because Mike's name appears in the narration directly preceding the quote. For [3], I'd like it to say, "Okay, no clues here; let's assume this is a back-and-forth, and therefore it's Mike speaking again, since Alicia just took her turn." [5] doesn't have a name attached to it, but Alicia is called out specifically in [4], so there could be a feature coding for that. If John hadn't called out Alicia, maybe it could note the gender of the pronoun and take a guess based on that.
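To make that plan concrete, here's a rough sketch of the rule cascade I have in mind. Everything here is a placeholder: the regexes, the verb list, and the hard-coded name set would all need to be replaced by real features, but it shows the fallback order (explicit attribution, preceding narration, addressee in the previous line, then back-and-forth alternation):

```python
import re

# Placeholder: in reality this would come from NER or a cast list.
KNOWN_NAMES = {"Mike", "Alicia", "John"}

def attribute_speakers(lines):
    """Assign a speaker guess to each line of dialogue, in order."""
    speakers = []
    for i, line in enumerate(lines):
        speaker = None
        # Rule 1: explicit attribution, e.g. '"...," Alicia said.'
        m = re.search(r'[,"]\s*(\w+)\s+(?:said|interrupted|asked)', line)
        if m and m.group(1) in KNOWN_NAMES:
            speaker = m.group(1)
        # Rule 2: narration before the quote names the speaker,
        # e.g. 'Mike smiled. "..."'
        if speaker is None:
            m = re.match(r'(\w+)\b[^"]*"', line)
            if m and m.group(1) in KNOWN_NAMES:
                speaker = m.group(1)
        # Rule 3: the previous line called someone out by name,
        # e.g. '"Alicia, who told you that?"'
        if speaker is None and i > 0:
            m = re.search(r'"(\w+),', lines[i - 1])
            if m and m.group(1) in KNOWN_NAMES:
                speaker = m.group(1)
        # Rule 4: no clues at all -- assume a back-and-forth, so the
        # speaker from two turns ago is speaking again.
        if speaker is None and len(speakers) >= 2:
            speaker = speakers[-2]
        speakers.append(speaker)
    return speakers
```

On the five example lines above, this cascade would come out Mike, Alicia, Mike, John, Alicia, which is the behavior I'm after. What I can't see is how to get a learned model to do this instead of a brittle pile of regexes.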
I would use a classifier, but there's an arbitrary number of possible speakers, and really the tags we want are relative: PREVIOUS-SPEAKER, TWO-SPEAKERS-BACK, NEAREST-PROPER-NOUN, etc. But salted in among them will be some explicitly tagged speakers. So I suspect that a Brill tagger or Naive Bayes won't be able to generalize properly with such a large/sparse set of tags. Chapter 6 of the NLTK book mentions open-class classification once, but promptly drops the subject. Could that be promising?
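The appeal of those relative tags, as I see it, is that the classifier only ever has to predict one of a small, closed set, and then a deterministic resolver maps each tag back to a concrete name using the surrounding text. Here's a sketch of that resolution step (the tag names and regexes are just placeholders, and the classifier that would produce the tags isn't shown):

```python
import re

def resolve(lines, tags):
    """Map a closed set of relative tags back to concrete names.

    lines: the dialogue lines; tags: one predicted tag per line.
    No learning happens here -- the resolver just dereferences each
    relative tag against the text and the speakers resolved so far.
    """
    speakers = []
    for i, tag in enumerate(tags):
        if tag == "EXPLICIT":
            # Name attached directly, e.g. '"...," Alicia said.'
            m = re.search(r'(\w+)\s+(?:said|interrupted|asked)', lines[i])
            speakers.append(m.group(1) if m else None)
        elif tag == "NEAREST-PROPER-NOUN":
            # Capitalized word in the narration opening the line.
            m = re.match(r'([A-Z]\w+)', lines[i])
            speakers.append(m.group(1) if m else None)
        elif tag == "ADDRESSEE-OF-PREVIOUS":
            # Previous line called someone out, e.g. '"Alicia, ..."'
            m = re.search(r'"([A-Z]\w+),', lines[i - 1])
            speakers.append(m.group(1) if m else None)
        elif tag == "TWO-SPEAKERS-BACK":
            speakers.append(speakers[-2] if len(speakers) >= 2 else None)
        else:
            speakers.append(None)
    return speakers
```

With the example above, a tag sequence of NEAREST-PROPER-NOUN, EXPLICIT, TWO-SPEAKERS-BACK, EXPLICIT, ADDRESSEE-OF-PREVIOUS would resolve to the right five names. My worry is whether any of the standard taggers/classifiers can learn to emit that tag set reliably, or whether I'm reinventing something that already has a name.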
That's about as far as I've gotten -- I'm struggling to wrap my head around this! Thanks for any advice.
-erik