Hi,
I'm trying to understand the consequences of filtering out terms that are 'too rare' or 'too common' during LSI preprocessing (although the same question probably applies to LDA or word2vec preprocessing), and the possible effects of this kind of filtering on document similarity queries once the eigenvectors/topic models have been calculated.
Is this statement generally correct: terms that are 'too rare' and terms that are 'too common' both act like 'noise', since neither helps differentiate many documents from one another. True/False?
Let's say we have this scenario:
Start with a corpus of ~20 million documents and ~4 million unique terms before term filtering.
Preprocessing: filter out stopwords and other garbage. Then filter the extreme terms ('way too rare' = occurring in 5 documents or fewer / 'way too common' = occurring in 33% of documents or more), and after that keep only the 150K most common remaining terms. The other ~3.85 million less common terms get 'thrown out'; each of these occurs in roughly 100 documents or fewer across the corpus.
Re-write the corpus using the 150K terms we kept above.
Perform LSI processing.
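For concreteness, here is roughly how I picture those steps, written as a gensim-style sketch (gensim is just for illustration, not necessarily what we'd use; `tokenized_docs`, the exact thresholds, and the topic count of 400 are placeholders rather than our real setup):

```python
from gensim import corpora, models

def build_lsi(tokenized_docs, num_topics=400):
    """tokenized_docs: a re-iterable collection of token lists, one per document."""
    # Build the full vocabulary (~4M unique terms) without clipping it during construction.
    dictionary = corpora.Dictionary(tokenized_docs, prune_at=None)
    dictionary.filter_extremes(
        no_below=6,       # drop terms occurring in 5 documents or fewer
        no_above=0.33,    # drop terms occurring in roughly a third of documents or more
        keep_n=150_000,   # of the survivors, keep only the 150K most frequent
    )
    # Re-write the corpus using only the 150K kept terms, then run LSI.
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=num_topics)
    return dictionary, bow_corpus, lsi
```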
In this scenario, a colleague of mine is concerned that throwing out the ~3.85 million terms that occur in between 5 and ~100 documents will cause 'needle in the haystack' document similarity queries to fail often, since the topic models won't include the terms we threw out. I'm having a hard time arguing against that assertion. Do you believe it's true? Why or why not?
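To make his concern concrete, here is a self-contained toy illustration (made-up documents and a made-up rare term, nothing from our actual corpus, again using gensim only for illustration): the 'needle' document is distinguishable only by a rare term, and once that term is pruned, neither the document nor the query retains the signal.

```python
from gensim import corpora, models, similarities

docs = [
    ["contract", "review", "quarterly", "report"],
    ["contract", "review", "annual", "report"],
    ["contract", "review", "quarterly", "report", "zyzzyva"],  # the 'needle'
]

dictionary = corpora.Dictionary(docs)
# Prune rare terms as the real pipeline would (thresholds scaled down for the toy corpus).
dictionary.filter_extremes(no_below=2, no_above=1.0, keep_n=150_000)

bow_corpus = [dictionary.doc2bow(d) for d in docs]
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow_corpus])

# doc2bow silently drops out-of-vocabulary tokens, so 'zyzzyva' never reaches
# the LSI space: documents 0 and 2 end up with identical representations.
query = dictionary.doc2bow(["quarterly", "report", "zyzzyva"])
print(list(index[lsi[query]]))
```

In the toy case the similarity scores for documents 0 and 2 come back identical, which is exactly the kind of failure he expects at scale for queries whose distinguishing terms fall in that 5 to ~100 document range.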
Is it even fair to expect LSI to yield good results for 'needle in the haystack' document similarity queries?
Setting aside hardware limitations, what if we just 'gave in' and handed LSI 2 million unique terms or more? Would that tend to improve results for highly specific document similarity queries?
Kind Regards,
-John H