Gensim Doc2vec error:

Ankit

Jan 13, 2023, 6:59:20 AM
to Gensim
Hi everyone,

I am new to machine learning and tried Doc2Vec on the Quora duplicate-questions dataset. The following is a sample tagged document:

Input:
tagged_data1[50001]

Output:
TaggedDocument(words=['senseless', 'movi', 'like', 'dilwal', 'happi', 'new', 'year', 'earn', 'easi', '100', 'crore', 'india'], tags=['50001'])

Input:
model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
train_documents1 = utils.shuffle(tagged_data1)
model_dbow1.train(tagged_data1, total_examples=len(train_documents1), epochs=30)

# to check if model trained right
model_dbow1.most_similar('senseless')

Error:
KeyError: "word 'senseless' not in vocabulary"

The training data I passed to the model contains the word "senseless", so why do I get this error? Could anyone please help?

Gordon Mohr

Jan 13, 2023, 9:04:20 PM
to Gensim
Are you sure the word 'senseless' appears in your data at least 5 times, the default `min_count`? These algorithms ignore rarer words, because that usually improves results. (And, it's generally better to focus evaluations on the surviving words, or gather more data, rather than make `min_count` very low.)
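
One quick way to check this is to count tokens across the corpus yourself before training. The following is only a minimal sketch: the `TaggedDocument` here is a stand-in namedtuple with the same `words`/`tags` fields (so the snippet runs without gensim), the three-document corpus is made up for illustration, and `word_counts` is a hypothetical helper name.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as gensim's TaggedDocument,
# so this sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical miniature corpus; substitute your own tagged_data1.
tagged_data1 = [
    TaggedDocument(words=["senseless", "movi", "like", "dilwal"], tags=["0"]),
    TaggedDocument(words=["movi", "earn", "crore", "india"], tags=["1"]),
    TaggedDocument(words=["movi", "like", "india"], tags=["2"]),
]


def word_counts(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.words)
    return counts


counts = word_counts(tagged_data1)
min_count = 5  # the default threshold mentioned above
word = "senseless"

# A word is kept only if it occurs at least min_count times.
survives = counts[word] >= min_count
print(f"{word!r} occurs {counts[word]} time(s); kept in vocabulary: {survives}")
```

If the count for your word is below `min_count`, the lookup failure is expected behavior rather than a training bug.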

Also, it's generally a good idea to set logging to at least the INFO level, & watch the output for anything anomalous. (Just watching the logged output teaches a lot about the steps, and if anything looks amiss – some total seems off compared to what you think your corpus/vocabulary contains, or some step completes anomalously fast, etc – it's good to dig deeper.)
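
A minimal sketch of turning on INFO-level logging with Python's standard `logging` module, run before building the vocabulary or training. The format string is just one reasonable choice, and `force=True` (Python 3.8+) is an assumption made here so the call also takes effect in notebooks that have already configured logging.

```python
import logging

# Show INFO-level messages with a timestamp and level name.
# force=True replaces any handlers already attached to the root logger.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
    force=True,
)

# Loggers are hierarchical, so this covers every logger whose name
# starts with "gensim", whether or not the library is imported yet.
logging.getLogger("gensim").setLevel(logging.INFO)
```

With this in place, the vocabulary-building and training steps should report their progress, which makes it easier to compare the logged totals against what you expect from your corpus.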

Separately, I'm assuming your `train_documents1` has exactly as many texts as `tagged_data1` – so you might as well use `total_examples=len(tagged_data1)` to guarantee consistency. And, as you don't show where you set `cores`, note that a bad value there – like `0` or `-1` – could result in no training. 
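
As a rough sketch of guarding against a bad `cores` value: the helper name `resolve_workers` is hypothetical, and falling back to the machine's CPU count is just one possible policy, not something the library requires.

```python
import multiprocessing


def resolve_workers(requested=None):
    """Return a worker count that is always at least 1.

    Falls back to the machine's CPU count when the requested value
    is missing or not positive (for example 0 or -1).
    """
    if requested is None or requested < 1:
        return max(1, multiprocessing.cpu_count())
    return requested


cores = resolve_workers()
print(f"Using {cores} worker thread(s)")

# The resulting value would then be passed as workers=cores, and the
# train() call would use total_examples=len(tagged_data1) so that the
# count always matches the corpus actually being trained on.
```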

- Gordon