Jesse Fulton
unread,Nov 15, 2010, 1:10:12 PM11/15/10Sign in to reply to author
Sign in to forward
You do not have permission to delete messages in this group
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to nltk-users
Hi all,
I'm an NLTK (and Python) newbie, so forgive me if this is a pretty
straightforward question...
I have an untagged set of text (ie., chat-room transcripts) which I'm
trying to use to generate new sentences. From what I've read, it seems
like the steps to do this are to train a HMM on a similar corpus, and
then use that to chunk & tag my chat-room transcripts, and then I can
use the HiddenMarkovModelTagger.random_sample() function to generate
my text. Is that correct? If so, how do I go about training the
Tagger? Do I need to train it on chunks (phrases) and words? I've
tried the following, and the results are less than stellar...
# example code
def conll_tag_chunks(chunk_sents):
tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in
chunk_sents]
return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in
tag_sents]
conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents)
conll_train_chunks = conll_tag_chunks(conll_train)
hmmTrainer = HiddenMarkovModelTrainer()
hmmTagger = hmmTrainer.train(conll_train_chunks)
#just realized this next line is wrong - I should be calling
hmmTagger.train() - but what would the args be?
hmmTagger = hmmTrainer.train(conll_train)
hmmTagger.random_sample(random, 20)
# sample output: [('cushion', 'NN'), ('of', 'IN'), ('year', 'NN'),
('Municipals', 'NNS'), ('and', 'CC'), ('increased', 'VBN'), ('letter',
'NN'), ('managers', 'NNS'), ('of', 'IN'), ('installed', 'VBN'), (',',
','), ('The', 'DT'), ('Economists', 'NNS'), ('.', '.'), ("''", "''"),
('exercises', 'VBZ'), ('required', 'VBN'), ('this', 'DT'), ('law',
'NN'), ('to', 'TO')]
I've been able to successfully mark up my sentences (somewhat
accurately) using NgramTaggers, but my research has led me to believe
that HMM is much better for *generating* text...
Thanks in advance!
-Jesse