# tokenizer, stop, exclude, and lemma are defined earlier
tokens = tokenizer.tokenize(data)

def clean(doc):
    stop_free = " ".join(i for i in doc.lower().split() if i not in stop)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in tokens]
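For context, here is a self-contained sketch of what `clean` does, with stand-ins for the objects defined elsewhere in my script (the tiny stopword set, `string.punctuation`, and the no-op lemmatizer below are assumptions, not my real `stop`/`exclude`/`lemma`):

```python
import string

stop = {"the", "is", "a"}             # stand-in stopword set (assumption)
exclude = set(string.punctuation)     # stand-in punctuation set

class _NoOpLemma:                     # stand-in for the real lemmatizer
    def lemmatize(self, word):
        return word

lemma = _NoOpLemma()

def clean(doc):
    # drop stopwords, strip punctuation, then lemmatize each word
    stop_free = " ".join(i for i in doc.lower().split() if i not in stop)
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    return " ".join(lemma.lemmatize(word) for word in punc_free.split())

print(clean("The cat is on a mat!"))  # -> cat on mat
```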
There are no issues with the bag-of-words model I get from the above code. The problem is that each document gets tokenized into a list of lists [[word1, word2, ...], [word1, word2, ...], [], ..., []], where each inner list is one sentence of the document. This is processed without complaint by the following code:
dictionary_1 = corpora.Dictionary(doc_clean)
But I actually want each document to be a single list [word1, word2, ...], so I combine the list of lists with:
combined = [item for sublist in doc_clean for item in sublist]
dictionary_1 = corpora.Dictionary(combined)
Now when I use this to build the bag-of-words model, I get this error: "TypeError: doc2bow expects an array of unicode tokens on input, not a single string".
I am not sure how to overcome this issue.
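If it helps to see the shape mismatch in isolation, here is a minimal sketch without gensim (the toy `doc_clean` data is made up): `Dictionary` iterates over its argument and treats each element as one document, i.e. one list of tokens, so after flattening, each element is a bare string, which is what triggers the TypeError.

```python
# Toy stand-in for doc_clean: two "sentences", each a list of tokens.
doc_clean = [["word1", "word2"], ["word3", "word4"]]

# Flattening produces a single flat list of strings ...
combined = [item for sublist in doc_clean for item in sublist]
print(combined)        # -> ['word1', 'word2', 'word3', 'word4']

# ... but Dictionary expects an iterable of documents, where each
# document is itself a list of tokens.  In the flat list, each element
# is a bare string, hence the doc2bow TypeError.  Wrapping the flat
# list in another list restores the expected shape:
corpus = [combined]    # one document, one token list
print(corpus)          # -> [['word1', 'word2', 'word3', 'word4']]
# dictionary_1 = corpora.Dictionary(corpus)  # would now accept the input
```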