LSI less accurate than TfIdf

Pete Bleackley

Mar 22, 2022, 10:53:39 AM

to gensim

I'm working with a model that, given three documents A, B and C, calculates two cosine similarities:

CS1 = cossim(tfidf(A),tfidf(B))

CS2 = cossim(tfidf(A),tfidf(C))

These similarities are then fed into a classifier, which achieves 70% accuracy on the downstream task.

I'm trying to improve on this, and my first idea was to reduce the dimensionality by performing Latent Semantic Indexing on the TfIdf data before calculating the cosine similarities. In theory this should improve the SNR of the data and enable the model to make use of significant relationships between terms.

However, I find that the accuracy actually drops to 50% on the downstream task, and a visualisation of the cosine similarities showed that when TfIdf is used, the classes to be predicted line up on clear bands, whereas with LSI they blur into each other.

Can anyone suggest what sort of things I should investigate to understand this behaviour?

Radim Řehůřek

Mar 22, 2022, 2:32:03 PM

to Gensim

Hi Pete,

the things to investigate are a) LSI dimensionality and b) LSI training corpus.

The LSI dimensionality controls how "like TFIDF" the LSI model is. In the extreme, when you set the number of topics (latent factors) to len(tfidf_dictionary), LSI becomes effectively equivalent to TFIDF: the projection no longer discards anything, so cosine similarities are unchanged. So you can think of LSI as a generalization of TFIDF: you can recover TFIDF from LSI with high enough num_topics.
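The effect of the dimensionality can be illustrated with a small sketch. It uses an exact NumPy SVD rather than gensim's `LsiModel`, and a random matrix as a hypothetical stand-in for a TF-IDF term-document matrix, so it only demonstrates the linear algebra, not gensim's behaviour on real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a TF-IDF matrix: 6 terms (rows) x 8 documents (columns).
X = rng.random((6, 8))

# Left singular vectors play the role of the LSI topics.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a, b = X[:, 0], X[:, 1]
original = cosine(a, b)

# Project both documents onto the top-k topics and compare similarities.
similarities = []
for k in range(1, U.shape[1] + 1):
    U_k = U[:, :k]
    similarities.append(cosine(U_k.T @ a, U_k.T @ b))

for k, sim in enumerate(similarities, start=1):
    print(f"k={k}: projected={sim:.4f}  original={original:.4f}")
```

With all six factors kept, the projection is an orthogonal change of basis and the similarity matches the original value. With a single factor, every pair of documents has similarity +1 or -1, which is the kind of collapse worth checking for when num_topics is set very low.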

The LSI training corpus should reflect the vocabulary & document themes seen later, during inference on actual documents. Otherwise the model won't be of much use of course (same as any other ML model).
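One quick way to check the second point is to measure how much of each inference document's vocabulary was seen during training. The helper below is a hypothetical sketch in plain Python, not a gensim function, and the example documents are invented.

```python
def vocabulary_coverage(training_docs, new_doc):
    """Fraction of tokens in new_doc that also occur in the training corpus."""
    vocabulary = {token for doc in training_docs for token in doc}
    if not new_doc:
        return 0.0
    known = sum(1 for token in new_doc if token in vocabulary)
    return known / len(new_doc)

# Hypothetical tokenized documents for illustration only.
training_docs = [
    ["graph", "minors", "trees"],
    ["user", "interface", "system"],
]
inference_doc = ["graph", "trees", "kernel", "scheduler"]

coverage = vocabulary_coverage(training_docs, inference_doc)
print(f"coverage: {coverage:.2f}")
```

A low value means most of the document is invisible to the model, since tokens missing from the training dictionary are ignored when the document is converted to a bag of words.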

Hope that helps,

Radim

Pete Bleackley

Mar 22, 2022, 3:43:00 PM

to gensim

Thanks. I suspect I may be underfitting.

