How to generate text using Hidden Markov Models?


Jesse Fulton
Nov 15, 2010, 1:10:12 PM
to nltk-users
Hi all,

I'm an NLTK (and Python) newbie, so forgive me if this is a pretty
straightforward question...

I have an untagged set of text (i.e., chat-room transcripts) which I'm
trying to use to generate new sentences. From what I've read, it seems
like the steps are to train an HMM on a similar corpus, use that to
chunk & tag my chat-room transcripts, and then call
HiddenMarkovModelTagger.random_sample() to generate my text. Is that
correct? If so, how do I go about training the tagger? Do I need to
train it on chunks (phrases) and words? I've tried the following, and
the results are less than stellar...

# example code
import random
import nltk
from nltk.tag.hmm import HiddenMarkovModelTrainer

def conll_tag_chunks(chunk_sents):
    # convert each chunk tree to CoNLL (word, tag, chunk) triples,
    # then keep only the (tag, chunk) pairs
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]

# tree2conlltags needs chunk trees, so use chunked_sents(), not tagged_sents()
conll_sents = nltk.corpus.conll2000.chunked_sents()
conll_train = list(conll_sents)
conll_train_chunks = conll_tag_chunks(conll_train)

hmmTrainer = HiddenMarkovModelTrainer()
hmmTagger = hmmTrainer.train(conll_train_chunks)

# just realized this next line is wrong - I should be calling
# hmmTagger.train() - but what would the args be?
hmmTagger = hmmTrainer.train(nltk.corpus.conll2000.tagged_sents())
hmmTagger.random_sample(random, 20)

# sample output: [('cushion', 'NN'), ('of', 'IN'), ('year', 'NN'),
# ('Municipals', 'NNS'), ('and', 'CC'), ('increased', 'VBN'), ('letter',
# 'NN'), ('managers', 'NNS'), ('of', 'IN'), ('installed', 'VBN'), (',',
# ','), ('The', 'DT'), ('Economists', 'NNS'), ('.', '.'), ("''", "''"),
# ('exercises', 'VBZ'), ('required', 'VBN'), ('this', 'DT'), ('law',
# 'NN'), ('to', 'TO')]



I've been able to successfully mark up my sentences (somewhat
accurately) using NgramTaggers, but my research has led me to believe
that an HMM is much better for *generating* text...

Thanks in advance!

-Jesse

Alex Rudnick
Nov 17, 2010, 3:22:39 PM
to nltk-...@googlegroups.com
Hello Jesse,

So it looks like you're able to successfully get the HMM to generate
(unintelligible) text?

As I understand it, what the HMM is doing in the learning phase is
building two tables:
- the "transition probabilities" -- ie, the probability that a state
(in this case, a POS tag) is followed by another state
- the "emission probabilities" -- the probability that (for example)
the state "NN" generates the observation "dog".

So to generate text, the HMM code picks a POS tag (a state), then
generates a word from that state. Then it hops to another state, and
generates a word from that one. This means that while the parts of
speech /might/ make sense as a sequence, the actual words that it
picks probably won't.
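
To make that concrete, here's a toy sketch of that generate loop. The
transition/emission tables below are tiny made-up examples for
illustration, not NLTK's actual internals:

import random

# made-up tables: state -> {next state: probability} and
# state -> {word: probability}
transitions = {'DT': {'JJ': 0.3, 'NN': 0.7},
               'JJ': {'NN': 1.0},
               'NN': {'VBZ': 0.6, 'DT': 0.4},
               'VBZ': {'DT': 1.0}}
emissions = {'DT': {'the': 0.7, 'a': 0.3},
             'JJ': {'big': 0.5, 'red': 0.5},
             'NN': {'dog': 0.6, 'cat': 0.4},
             'VBZ': {'sees': 0.5, 'chases': 0.5}}

def pick(dist):
    # sample one key from a {item: probability} dict
    r, total = random.random(), 0.0
    for item, p in dist.items():
        total += p
        if r < total:
            return item
    return item

state = 'DT'
words = []
for _ in range(8):
    words.append(pick(emissions[state]))  # emit a word from the current state
    state = pick(transitions[state])      # hop to the next state
print(' '.join(words))

The POS sequence it walks through is locally plausible, but each word
is drawn independently given its tag, which is exactly why the output
reads like nonsense.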

Maybe if your goal is to generate plausible-looking text, HMMs and the
random_sample function aren't the best approach? You might want to try
a model that's just trained on the words, rather than on the parts of
speech. (nltk.model.NgramModel.generate might be a good place to
look.)
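
For instance, something along these lines (a sketch assuming the NLTK
2.x API of this era; nltk.model.NgramModel was removed in later
releases, so treat the exact signature as approximate):

import nltk
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

# train a trigram model directly on the words of a chat corpus,
# with a little smoothing so unseen n-grams don't get zero probability
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, nltk.corpus.nps_chat.words(), estimator=est)

# generate 20 words; the output should at least hang together locally
print(' '.join(lm.generate(20)))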

Hope this helps!

--
-- alexr

Jesse Fulton
Nov 17, 2010, 4:56:12 PM
to nltk-users
Hey Alex,

So I haven't actually figured out how to load my custom text
(untagged, unstructured - simply text) into the HMM in order to
generate something; the example above was just using the built-in
corpora. The approach I'm taking now is to train a chunker & tagger on
the nps_chat corpus, use them to analyze my text and break it up into
phrases, and then store each phrase along with all observed sentence
structures (e.g., NP-VP-NP) so I can re-use the chunks and simply
rearrange them - see the sketch below. But I feel like that may be a
bit too simplistic and will leave me with the same problem...
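
Roughly, I mean something like the following. Here `chunker` and
`my_tagged_sents` are placeholders for a chunker trained on nps_chat
and for my POS-tagged transcript sentences; this also uses the newer
Tree.label() accessor (older NLTK spells it tree.node):

from collections import defaultdict
import random

phrases = defaultdict(list)  # chunk label -> word sequences seen under it
patterns = []                # observed structures, e.g. ('NP', 'VP', 'NP')

for sent in my_tagged_sents:
    tree = chunker.parse(sent)
    labels = []
    for subtree in tree.subtrees(lambda t: t.label() in ('NP', 'VP', 'PP')):
        phrases[subtree.label()].append([w for (w, t) in subtree.leaves()])
        labels.append(subtree.label())
    if labels:
        patterns.append(tuple(labels))

# generate: pick an observed structure, fill each slot with a stored phrase
structure = random.choice(patterns)
print(' '.join(' '.join(random.choice(phrases[label])) for label in structure))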

I had tried the Ngram models before, but with only limited success. I
think the main issue is that my sample of text is really small
(<200KB) - does anybody have suggestions on how to generate text from
such a small body of text?

Thanks,
-Jesse