Mismatch of vector embedding length at inference

Kranthi Kumar

Jun 3, 2024, 10:41:30 AM
to Gensim
Train Code:
    from gensim import corpora, models

    # Create a dictionary from the preprocessed documents (texts is a list of token lists)
    dictionary = corpora.Dictionary(texts)
    # Convert each preprocessed document into a bag-of-words vector
    corpus = [dictionary.doc2bow(text) for text in texts]
    # Train an LSI model on the corpus, requesting 1500 topics
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=1500)

Inference Code:
     doc = " ".join([str(item) for item in preprocess(text)])
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]  # convert the query to LSI space  
    query_vector = [i[1] for i in vec_lsi]

Need Help:

When applying the LSI model during inference, I'm observing that the resulting LSI representation contains only 1499 topics (the `vec_lsi` variable in the inference code) instead of the expected 1500 topics. This discrepancy occurs despite the following:

  1. The query words are present in the trained vocabulary, and I can successfully create a bag-of-words representation for the query.
  2. The LSI model was originally trained with 1500 topics.

Therefore, it appears that during the conversion from the bag-of-words representation to the LSI representation, one of the 1500 topics is being lost or omitted, resulting in an LSI representation with only 1499 topics. Could you please explain why this might be happening and provide any insights or potential solutions to address this issue?
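For reference, this is roughly how I compare the lengths (a minimal sketch assuming the `lsi`, `dictionary`, and `vec_bow` objects from the code above; `gensim.matutils.sparse2full` is used here on the assumption that the LSI output is a sparse list of (topic_id, weight) pairs that may omit some topics):

    from gensim import matutils

    vec_lsi = lsi[vec_bow]          # sparse list of (topic_id, weight) pairs
    print(len(vec_lsi))             # in my case this prints 1499
    print(lsi.num_topics)           # 1500, as requested at training time

    # sparse2full fills any missing topic ids with 0.0, producing a dense
    # vector whose length always equals the requested number of topics
    dense_query = matutils.sparse2full(vec_lsi, lsi.num_topics)
    print(len(dense_query))         # 1500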


Kranthi Kumar

Jun 4, 2024, 2:48:51 AM
to Gensim

My documents are very short, single-line texts of 3 to 5 words each, and the queries are of the same length.

I have also observed that only 1499 topic values are returned instead of the expected 1500.

Gordon Mohr

Jun 5, 2024, 4:37:45 PM
to Gensim
I'm not too familiar/experienced with the Gensim LsiModel, but some things I'd check:

(1) How large is your set of documents, and the effective vocabulary of unique words in your `dictionary`?
(2) At each earlier step, are the sizes/counts of lists/dicts/vectors as expected?
(3) If you set logging to the `INFO` level, does all reported progress (especially in the `LsiModel()` instantiation-training step) seem sensible in its logged steps and elapsed time/effort?
(4) Does looking at an individual doc BoW representation (like say `vec_bow`) reveal it to be sensible-looking?
(5) What does `lsi.get_topics().shape` show? (Have you seen the note in this method's docs about how the topic count can be lower than requested if the input matrix's real rank is too small? https://radimrehurek.com/gensim/models/lsimodel.html#gensim.models.lsimodel.LsiModel.get_topics)
(6) Do runs requesting fewer topics (500, 1000, 1499) succeed in creating the desired number of topics, with the unexpected gap only appearing at `num_topics=1500`? (A rough way to run a few of these checks is sketched below.)
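Something like the following (a minimal sketch, assuming the `dictionary` and `corpus` objects from your training code) should surface most of those numbers:

    import logging
    from gensim import models

    # INFO-level logging makes LsiModel report its training progress,
    # including any truncation of the requested topic count (check 3)
    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                        level=logging.INFO)

    print(len(dictionary))           # effective vocabulary size (check 1)
    print(len(corpus))               # number of training documents (check 1)

    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=1500)
    print(lsi.get_topics().shape)    # (actual topic count, vocabulary size) -- check 5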

- Gordon