Gensim Doc2vec error:


Ankit

Jan 13, 2023, 6:59:20 AM

to Gensim

Hi everyone,

I am new to machine learning and tried Doc2Vec on the Quora duplicate questions dataset. Here is a sample tagged document:

Input:

tagged_data1[50001]

Output:

TaggedDocument(words=['senseless', 'movi', 'like', 'dilwal', 'happi', 'new', 'year', 'earn', 'easi', '100', 'crore', 'india'], tags=['50001'])

Input:

model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)

model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])

train_documents1 = utils.shuffle(tagged_data1)

model_dbow1.train(tagged_data1,total_examples=len(train_documents1), epochs=30)

# to check if model trained right

model_dbow1.most_similar('senseless')

Error:

KeyError: "word 'senseless' not in vocabulary"

The training data I gave the model contains the word "senseless", so why do I get this error? Could anyone please help?

Gordon Mohr

Jan 13, 2023, 9:04:20 PM

to Gensim

Are you sure the word 'senseless' appears in your data at least `min_count=5` times? These algorithms ignore rarer words, because that usually improves results. (And, it's generally better to focus evaluations on the surviving words, or gather more data, rather than make `min_count` very low.)
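
One quick way to check this is to count how often the token occurs in the tokenized corpus before training. The following is only a minimal sketch: it assumes each document is a plain list of tokens (with `TaggedDocument` objects you would pass `doc.words` for each document), and the helper name `token_frequency` and the tiny sample corpus are made up for illustration.

```python
from collections import Counter


def token_frequency(token_lists):
    """Count total occurrences of every token across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


# Hypothetical stand-in for: [doc.words for doc in tagged_data1]
sample_corpus = [
    ['senseless', 'movi', 'like', 'dilwal'],
    ['movi', 'earn', 'crore', 'india'],
    ['like', 'movi', 'india'],
]

counts = token_frequency(sample_corpus)
min_count = 5  # the threshold mentioned above

# Report whether each word of interest would survive the min_count cut.
for word in ('senseless', 'movi'):
    status = 'kept' if counts[word] >= min_count else 'dropped'
    print(word, counts[word], status)
```

If the count for 'senseless' in the real corpus comes out below the threshold, the `KeyError` is expected behavior rather than a training failure.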

Also, it's generally a good idea to set logging to at least the INFO level, & watch the output for anything anomalous. (Just watching the logged output teaches a lot about the steps, and if anything looks amiss – some total seems off compared to what you think your corpus/vocabulary contains, or some step completes anomalously fast, etc – it's good to dig deeper.)
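
For reference, INFO-level logging can be switched on with Python's standard `logging` module before building the model. This is a sketch of one common setup, not the only one; the format string is just a convention, and `force=True` (Python 3.8+) is an assumption that you want to replace any handlers a notebook or framework has already installed.

```python
import logging

# Send INFO-and-above messages from all loggers to the console.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
    force=True,  # replace handlers that were already configured (Python 3.8+)
)
```

Run this once at the top of the script or notebook, before `build_vocab()` and `train()`, so the vocabulary and training progress messages are visible.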

Separately, I'm assuming your `train_documents1` has exactly as many texts as `tagged_data1` – so you might as well use `total_examples=len(tagged_data1)` to guarantee consistency. And, as you don't show where you set `cores`, note that a bad value there – like `0` or `-1` – could result in no training.
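
As a rough illustration of guarding against a bad `workers` value, the sketch below clamps the count to at least 1. The helper name `safe_worker_count` is hypothetical, and using the machine's CPU count as the fallback is an assumption about how `cores` was meant to be set.

```python
import multiprocessing


def safe_worker_count(requested=None):
    """Return a worker count of at least 1.

    Falls back to the machine's CPU count when no value is requested.
    """
    if requested is None:
        requested = multiprocessing.cpu_count()
    return max(1, int(requested))


cores = safe_worker_count()
print('workers:', cores)

# The calls from the question would then look roughly like this (not run here):
#   model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)
#   model_dbow1.build_vocab(tagged_data1)
#   model_dbow1.train(tagged_data1, total_examples=len(tagged_data1), epochs=30)
```

Taking `total_examples` from the same object that is passed to `train()` keeps the two in step even if the shuffled copy is later changed.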