Hi,
I'm trying to understand the consequences of filtering out terms that are 'too rare' or 'too common' during LSI preprocessing (although the same question probably applies to LDA or word2vec preprocessing), and the possible effects of this kind of filtering on document similarity queries once the eigenvectors/topic models have been calculated.
Is this statement generally correct: terms that are 'too rare' and terms that are 'too common' both act like 'noise', since neither helps differentiate many documents from one another. True/False?
Let's say we have this scenario:
Start with a corpus of ~20 million documents and ~4 million unique terms before term filtering.
Preprocessing: filter out stopwords and other garbage. Then filter the extreme terms ('way too rare' = occurring in 5 documents or fewer / 'way too common' = occurring in 33% of documents or more), and after that keep only the 150K most common remaining terms. The other ~3.85 million less common terms get 'thrown out'; each of these occurs in roughly 100 documents or fewer across the corpus.
Re-write the corpus using the 150K terms we kept above.
Perform LSI processing.
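For concreteness, here is roughly how I picture those steps, written as a gensim-style sketch (gensim is just for illustration, not necessarily what we'd use; `tokenized_docs`, the exact thresholds, and the topic count of 400 are placeholders rather than our real setup):

```python
from gensim import corpora, models

def build_lsi(tokenized_docs, num_topics=400):
    """tokenized_docs: a re-iterable collection of token lists, one per document."""
    # Build the full vocabulary (~4M unique terms) without clipping it during construction.
    dictionary = corpora.Dictionary(tokenized_docs, prune_at=None)
    dictionary.filter_extremes(
        no_below=6,       # drop terms occurring in 5 documents or fewer
        no_above=0.33,    # drop terms occurring in roughly a third of documents or more
        keep_n=150_000,   # of the survivors, keep only the 150K most frequent
    )
    # Re-write the corpus using only the 150K kept terms, then run LSI.
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=num_topics)
    return dictionary, bow_corpus, lsi
```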
In this scenario, a colleague of mine is concerned that throwing out the ~3.85 million terms that occur in between 5 and ~100 documents will cause 'needle in the haystack' document similarity queries to fail often, since the topic models won't include the terms we threw out. I'm having a hard time arguing against that assertion. Do you believe it's true? Why or why not?
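To make his concern concrete, here is a self-contained toy illustration (made-up documents and a made-up rare term, nothing from our actual corpus, again using gensim only for illustration): the 'needle' document is distinguishable only by a rare term, and once that term is pruned, neither the document nor the query retains the signal.

```python
from gensim import corpora, models, similarities

docs = [
    ["contract", "review", "quarterly", "report"],
    ["contract", "review", "annual", "report"],
    ["contract", "review", "quarterly", "report", "zyzzyva"],  # the 'needle'
]

dictionary = corpora.Dictionary(docs)
# Prune rare terms as the real pipeline would (thresholds scaled down for the toy corpus).
dictionary.filter_extremes(no_below=2, no_above=1.0, keep_n=150_000)

bow_corpus = [dictionary.doc2bow(d) for d in docs]
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow_corpus])

# doc2bow silently drops out-of-vocabulary tokens, so 'zyzzyva' never reaches
# the LSI space: documents 0 and 2 end up with identical representations.
query = dictionary.doc2bow(["quarterly", "report", "zyzzyva"])
print(list(index[lsi[query]]))
```

In the toy case the similarity scores for documents 0 and 2 come back identical, which is exactly the kind of failure he expects at scale for queries whose distinguishing terms fall in that 5 to ~100 document range.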
Is it even fair to expect LSI to yield good results for 'needle in the haystack' document similarity queries?
Setting aside hardware limitations, what if we just 'gave in' and handed LSI 2 million unique terms or more? Would that tend to improve results for highly specific document similarity queries?
Kind Regards,
-John H