Sorry for my late response. I tried the example today, and it
seems to me that there is a typo in the book listing (I hope it can be
fixed before the book goes to print).
The function:
    def document_features(document):
        document_words = set(document)
        features = {}
        for word in all_words:
            features['contains(%s)' % word] = (word in document_words)
        return features
should read
    def document_features(document):
        document_words = set(document)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in document_words)
        return features
So the problem is in the for loop: the "all_words" list contains
every single word in the corpus instead of the 2000 most frequent
words, as was intended in the example. The "word_features" list is
the one containing those 2000 most frequent words.
>>> len(all_words)
39768
>>> len(word_features)
2000
My understanding is that the purpose of the function is to check
which of the 2000 most frequent words can be found in a given document.
Now the example takes seconds instead of hours.
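For anyone who wants to check the fix without loading the full corpus,
here is a minimal sketch of the same idea. The toy "documents" list and
the cutoff N = 3 are my own stand-ins (the book uses the movie review
corpus and 2000), and collections.Counter is used in place of the
book's frequency distribution, so treat it as an illustration rather
than the book's exact listing.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the movie reviews.
documents = [
    ["great", "great", "film", "a"],
    ["great", "film", "dull"],
    ["great", "film", "a", "plot"],
]

# Count every token across all documents.
counts = Counter(word for doc in documents for word in doc)
all_words = list(counts)  # every distinct word in the corpus

# Keep only the N most frequent words (assumed cutoff; the book uses 2000).
N = 3
word_features = [word for word, _ in counts.most_common(N)]

def document_features(document):
    document_words = set(document)
    features = {}
    # Loop over the short word_features list, not over all_words.
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

features = document_features(["a", "great", "plot"])
print(len(all_words), len(word_features), len(features))
```

Each document then gets exactly N features, no matter how large the
vocabulary is, which is why the run time drops so much.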
Hope it helps!
Javier
2009/3/30 Dilip kola <dilip...@gmail.com>: