I'm working with a model that, given three documents A, B and C, computes two cosine similarities between their TF-IDF vectors:

CS1 = cossim(tfidf(A), tfidf(B))
CS2 = cossim(tfidf(A), tfidf(C))
These similarities are then fed into a classifier, which achieves 70% accuracy on the downstream task.
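For concreteness, here is a minimal sketch of the current pipeline using scikit-learn (the documents are toy placeholders; the real corpus and the downstream classifier aren't shown):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy placeholders standing in for documents A, B and C
docs = [
    "the cat sat on the mat",         # A
    "a cat lay on a rug",             # B
    "stocks fell sharply on monday",  # C
]

# Fit TF-IDF over the corpus and vectorise all three documents
X = TfidfVectorizer().fit_transform(docs)  # sparse, rows = A, B, C

# The two features fed to the classifier
cs1 = cosine_similarity(X[0:1], X[1:2])[0, 0]  # cossim(tfidf(A), tfidf(B))
cs2 = cosine_similarity(X[0:1], X[2:3])[0, 0]  # cossim(tfidf(A), tfidf(C))
features = np.array([cs1, cs2])
```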
I'm trying to improve on this, and my first idea was to reduce the dimensionality by performing Latent Semantic Indexing (LSI) on the TF-IDF data before calculating the cosine similarities. In theory this should improve the signal-to-noise ratio of the data and let the model exploit latent relationships between terms, such as synonymy.
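The LSI variant looks roughly like this. I'm treating LSI as a truncated SVD of the TF-IDF matrix (scikit-learn's TruncatedSVD); n_components=2 is only a placeholder small enough for the toy corpus, and in the real pipeline the SVD is fitted on the full training corpus with the component count as a hyperparameter:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same toy corpus as above, standing in for A, B and C
docs = [
    "the cat sat on the mat",         # A
    "a cat lay on a rug",             # B
    "stocks fell sharply on monday",  # C
]
X = TfidfVectorizer().fit_transform(docs)

# LSI step: truncated SVD of the TF-IDF matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)  # dense (n_docs, n_components) latent vectors

# Same two features, now computed in the latent space
cs1_lsi = cosine_similarity(X_lsi[0:1], X_lsi[1:2])[0, 0]
cs2_lsi = cosine_similarity(X_lsi[0:1], X_lsi[2:3])[0, 0]
```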
However, the accuracy actually drops to 50% on the downstream task, and a visualisation of the cosine similarities shows that with TF-IDF the classes to be predicted line up in clear bands, whereas with LSI they blur into each other.
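For reference, the visualisation was roughly a scatter plot of the (CS1, CS2) pairs coloured by class label, along these lines (random placeholder data here, just to show the plot):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: cs1_all, cs2_all are the per-example similarity
# features and y the class labels from the real dataset
rng = np.random.default_rng(0)
cs1_all = rng.random(200)
cs2_all = rng.random(200)
y = rng.integers(0, 2, size=200)

plt.scatter(cs1_all, cs2_all, c=y, cmap="coolwarm", s=12)
plt.xlabel("CS1 = cossim(A, B)")
plt.ylabel("CS2 = cossim(A, C)")
plt.title("Similarity features coloured by class")
plt.show()
```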
Can anyone suggest what sort of things I should investigate to understand this behaviour?