LSI and Incremental Document Clustering


Rui Encarnação

Oct 13, 2012, 12:48:55 AM
to gen...@googlegroups.com

Hello Radim and other Members,

 

First of all, I want to thank Radim for his awesome work on Gensim.

 

The topic of my PhD Thesis is Incremental and Hierarchical Document Clustering.

 

At the beginning, LSI (and other related "algebraic" techniques) was not one of the techniques I intended to use.

However, when I discovered Radim's work I started changing my opinion. With version 0.8.6 and its implementation of the hashing trick, I believed I had found solutions to some of my problems.

 

To run some experiments with Gensim and LSI, I used a very small corpus of 4 documents with 5 words (represented by letters for brevity). I started with the first 3 documents and calculated the hash dictionary, BOW, TF-IDF and LSI (from TF-IDF and also from the original BOW vectors). Then I added the 4th document and updated all the matrices. For LSI I tried two approaches: adding the document to the previous model, and recalculating a new model from scratch. I attach the code and the output.

From these experiments some questions and issues arise:

 

1. Even understanding the hashing trick and the advantages of the Hash Dictionary over the "normal" Dictionary, how can we "calculate TF-IDF on-the-fly"? We can compute the TF-IDF of new documents from the previous TF-IDF model (because with the Hash Dictionary the vectors have the same dimension), but this TF-IDF will not consider new terms and, more importantly, the TF-IDF values of previous documents will not be updated. With each new document the TF-IDF of all previous documents changes, so we need to recalculate everything from scratch. Am I right?

 

2. After running the same code several times, I realized that LSI sometimes produces symmetrical results (without any code change). The magnitude of the values remains unchanged but all the signs flip. I saw that the topic values also flip sign, so the final matrix should be the same. But why does this happen? (I haven't dug into Gensim's code to check whether there is some source of randomness.)

 

3. I confirmed, as expected, that LSI models starting from BOW produce exactly the same result whether computed incrementally or from scratch. However, the same did not happen when I used TF-IDF values as input to the LSI model. The values are close, but different. I suppose this difference is caused by the TF-IDF changes (the BOW vectors remain equal).

 

4. Finally, the biggest issue: the LSI values of documents already processed also change with the arrival of each new document. This is to be expected, since the topics are recalculated at each iteration. It is like having a set of points in a space whose basis vectors are constantly changing: even if the points remain "constant" (as in the BOW case), their representation in the permanently changing space changes accordingly.

If I'm getting the whole picture, this cancels the advantage of the hashing trick and LSI, because there is still the need to reprocess the previous documents.

 

Now I'm blocked, wondering if LSI can be used for incremental clustering.

My clustering algorithm creates a tree where each node represents a concept (a probabilistic description of the documents it stores). There are also requirements on the processing time per document, which make a global reorganization of the hierarchy at each step impractical.

 

Any remark from Radim or other members would be invaluable. If someone is using Gensim for incremental clustering, it would be great to trade ideas and experiences.

Sorry for the length of the post.

 

Thanks,

Rui Encarnação

 

PS - Radim, the "compactify" issue is solved! Thx!

test_gensim.py
output.txt
output with log.txt

bluishgreen

Oct 13, 2012, 1:08:45 AM
to gen...@googlegroups.com
Hi Rui,

I can confirm that issues 1 & 4 you are facing are in fact conceptual problems inherent in the LSI approach. I asked about this earlier in a thread. It is here:


In practice I deployed an online LSI app which did incremental computation until a certain cut-off number of new documents had been added, and then retrained from scratch. E.g. you could do incremental updates during busy times, and retrain the whole matrix at night.
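That hybrid strategy can be sketched as a small scheduler; `OnlineLsiScheduler`, the callbacks and the cut-off value below are all made-up names for illustration, not gensim API:

```python
# Hypothetical sketch: fold documents in incrementally, retrain past a cut-off.
RETRAIN_CUTOFF = 10000  # made-up threshold for accumulated new documents

class OnlineLsiScheduler:
    def __init__(self, retrain, fold_in, cutoff=RETRAIN_CUTOFF):
        self.retrain = retrain    # callback: rebuild the LSI model from scratch
        self.fold_in = fold_in    # callback: add documents to the current model
        self.cutoff = cutoff
        self.pending = 0          # new documents since the last full retrain

    def add(self, docs):
        self.fold_in(docs)
        self.pending += len(docs)
        if self.pending >= self.cutoff:   # too much drift: do a full rebuild
            self.retrain()
            self.pending = 0
```

The retrain callback would be the expensive full pass (e.g. at night), while `fold_in` wraps the cheap incremental update.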

Radim Řehůřek

Oct 16, 2012, 6:23:39 PM
to gensim
Hello Rui,

> To run some experiments with Gensim and LSI I used a very small corpus of 4
> documents with 5 words (represented by letters for short). I started with
> the 3 first documents and calculated the hash dictionary, BOW, TF-IDF and
> LSI (from TF-IDF and also from original BOW vectors). Then I add the 4th
> document and updated all the matrices. For LSI I tried two ways: add the
> document to previous model and recalculate a new model from scratch. I send
> in attachment the code and the output.
>
> From these experiments some questions and issues arise:
>
> 1. Even understanding the hashing trick and the advantages of Hash
> Dictionary over "normal" Dictionary, how can we "calculate TF-IDF
> on-the-fly". We can compute TF-IDF of new documents from the previous TF-IDF
> model (because with Hash Dictionary the vectors have the same dimension) but
> this TF-IDF will not consider new terms and, mainly, the TF-IDF values of
> previous documents will not be updated. With each new document the TF-IDF of
> all previous documents change, so we need to recalculate everything from
> scratch. Am I right?

when you change a model (incrementally or not), all documents computed
with a previous model are indeed different (from what they would have
been if computed with the new model).
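A tiny numeric illustration of why: adding one document changes N in the IDF term, so every previously computed TF-IDF vector goes stale even for terms whose counts did not change (toy corpus; the log2(N/df) weighting matches gensim's default, if I'm not mistaken):

```python
import math

def idf(term, docs):
    """Inverse document frequency, log2(N / df) as in gensim's default."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df, 2)

docs = [{"a", "b"}, {"b", "c"}, {"a", "c"}]   # toy corpus of 3 documents
before = idf("a", docs)                       # log2(3/2), df("a") = 2
docs.append({"b", "d"})                       # fold in a 4th document
after = idf("a", docs)                        # log2(4/2) = 1.0, yet "a" gained no occurrences
assert before != after
```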


> 2. After running several times the same code I realized that, sometimes, LSI
> produces symmetrical results (without any code change). The magnitude of
> values remains unchanged but all the signs change. I saw that the topic
> values also change to the symmetrical and the final matrix should be the
> same. But why this happen? (I haven't dived in Gensim's code to find if
> exists some source of randomness).

Yes, SVD is by definition unique up to signs of the corresponding left/
right singular vectors: U*S*V'=-U*S*-V'.

For all practical purposes, whether `u` or `-u` appears in the result
depends on chance. If that bothers you, you can normalize the u-v
pairs to your liking -- the outcome remains unchanged.
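For example, one common convention is to flip each singular-vector pair so that the largest-magnitude entry of each column of `u` is positive; a numpy sketch (not gensim code):

```python
import numpy as np

def normalize_svd_signs(u, vt):
    """Flip the sign of each left/right singular vector pair so that the
    largest-magnitude entry of each column of u is positive."""
    for k in range(u.shape[1]):
        pivot = np.argmax(np.abs(u[:, k]))
        if u[pivot, k] < 0:
            u[:, k] *= -1
            vt[k, :] *= -1      # flip the matching right vector too
    return u, vt

a = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
u, s, vt = np.linalg.svd(a, full_matrices=False)
u, vt = normalize_svd_signs(u, vt)
assert np.allclose(u @ np.diag(s) @ vt, a)   # reconstruction is unaffected
```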


> 3. I confirmed, as expected, that LSI models starting from BOW produced
> exactly the same result when computed incrementally and when computed from
> scratch. However the same didn't happen when I used TF-IDF values as input
> to LSI model. The values are near but different. I suppose that this
> difference is caused by TF-IDF changes (BOW remains equal).
>
> 4. Finally the biggest issue: LSI values of documents already processed also
> change with the arrival of each new document. This is expectable since the
> topics are recalculated in each iteration. So, is like have a set of points
> in a space whose basis vectors are constantly changing. Even if the points
> remain "constant" (like in BOW case) his representation in the permanently
> changing space will change accordingly.

Same as 1).

Except with LSI, you can update positions of old documents pretty
cheaply -- the modification of old model => new model always amounts
to rotating/scaling (=matrix multiplication). So if you apply the same
linear transformation to existing documents, you don't need to project
them again. In other words, with a little algebra you can come up with
the formula to update already projected documents incrementally.
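A numpy sketch of this idea (for simplicity the new basis is an exact rotation of the old one, so the update is exact; in general the subspace itself shifts and the k-by-k update is only an approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 50, 5
x = rng.standard_normal(m)          # a document's original term vector

u_old, _ = np.linalg.qr(rng.standard_normal((m, k)))  # old topic basis (orthonormal columns)
p_old = u_old.T @ x                                   # the stored low-dimensional projection

# Simulate a model update: the new basis is a rotation of the old one.
r, _ = np.linalg.qr(rng.standard_normal((k, k)))
u_new = u_old @ r

transform = u_new.T @ u_old    # the small k-by-k update matrix
p_new = transform @ p_old      # update the old projection without touching x

assert np.allclose(p_new, u_new.T @ x)   # same as re-projecting from scratch
```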

This will not be more efficient computationally (projecting from
scratch is also simply a matrix multiplication), but it is the only
way to update documents if you only have the transformed
representation available (not the original documents anymore).

Something one needs to be careful about is accumulating floating point
error with these incremental re-adjustments, because of finite machine
precision. But this may not be such a big deal in NLP tasks (as
opposed to launching rockets into space) :) Still a little experiment
would be in order.
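Such an experiment can be sketched in numpy: apply many incremental k-by-k updates in float32 and compare against one direct projection with the final basis; the gap is the accumulated rounding error (all sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, steps = 100, 10, 500
x = rng.standard_normal(m).astype(np.float32)
u = np.linalg.qr(rng.standard_normal((m, k)))[0].astype(np.float32)
p = u.T @ x                       # initial projection

for _ in range(steps):
    r = np.linalg.qr(rng.standard_normal((k, k)))[0].astype(np.float32)
    u = u @ r                     # the topic basis drifts by a rotation
    p = r.T @ p                   # incremental update of the stored projection

# Accumulated rounding error vs. projecting once with the final basis.
drift = float(np.max(np.abs(p - u.T @ x)))
```

With 500 float32 updates the drift stays tiny, supporting the "not a big deal for NLP" intuition, but it is nonzero and grows with the number of re-adjustments.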

HTH,
Radim



>
> If I'm getting the whole picture, this cancels the advantage of hashing
> trick and LSI because there is still the need of reprocessing the previous
> documents.
>
> Now I'm blocked, wondering if LSI can be used for incremental clustering.
>
> My clustering algorithm creates a tree where each node represents a concept
> (probabilistic description of the documents it stores). There are also some
> requirements in terms of processing time for each document, making
> impractical a global reorganization of the hierarchy in each step.
>
> Any remark from Radim or other members would be invaluable. If someone is
> using Gensim for incremental clustering it would be great if we could trade
> ideas and experiences.
>
> Sorry for the extension of the post.
>
> Thanks,
>
> Rui Encarnação
>
> PS - Radim, the "compactify" issue is solved! Thx!
>
>  test_gensim.py
>
>  output.txt
>
>  output with log.txt