Hello Radim and other Members,
First of all, I want to thank Radim for his awesome work on Gensim.
The topic of my PhD Thesis is Incremental and Hierarchical Document Clustering.
At the beginning, LSI (and related "algebraic" techniques) was not among the techniques I intended to use.
However, when I discovered Radim's work I started to change my mind. With version 0.8.6 and its implementation of the hashing trick, I believed I had found a solution to some of my problems.
To run some experiments with Gensim and LSI, I used a very small corpus of 4 documents over a vocabulary of 5 words (represented by letters, for brevity). I started with the first 3 documents and computed the hash dictionary, BOW, TF-IDF and LSI (from TF-IDF and also from the original BOW vectors). Then I added the 4th document and updated all the matrices. For LSI I tried two ways: adding the document to the previous model, and recalculating a new model from scratch. I attach the code and the output.
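For anyone who wants the idea without the attachment, here is a toy sketch of the hashing trick in plain Python (my own illustration, not gensim's actual HashDictionary): tokens are hashed into a fixed number of buckets, so every document maps to a vector of the same dimension, even when previously unseen terms arrive.

```python
import zlib
from collections import Counter

N_BUCKETS = 8  # fixed vector dimension, chosen up front

def hashed_bow(tokens, n_buckets=N_BUCKETS):
    """Count tokens per bucket; a stable hash maps each token to a bucket id."""
    counts = Counter(zlib.crc32(tok.encode()) % n_buckets for tok in tokens)
    return [counts.get(i, 0) for i in range(n_buckets)]

docs = [list("aabc"), list("bbcd"), list("acde")]
vectors = [hashed_bow(d) for d in docs]

# A 4th document containing a never-seen term "z" still fits the same
# dimension, so no dictionary rebuild is needed to vectorize it.
vectors.append(hashed_bow(list("aaez")))
```

(Bucket count and the crc32 hash are my arbitrary choices here; the point is only that the dimensionality is fixed in advance.)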
From these experiments some questions and issues arise:
1. Even understanding the hashing trick and the advantages of the Hash Dictionary over the "normal" Dictionary, how can we "calculate TF-IDF on-the-fly"? We can compute the TF-IDF of new documents from the previous TF-IDF model (because with the Hash Dictionary all vectors have the same dimension), but this TF-IDF will not take new terms into account and, more importantly, the TF-IDF values of the previous documents will not be updated. With each new document the TF-IDF of all previous documents changes, so we need to recalculate everything from scratch. Am I right?
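To make point 1 concrete, here is a tiny plain-Python illustration (one common TF-IDF variant, not gensim's exact weighting) of why the old documents' weights go stale: the IDF of a term depends on the document frequency over the whole corpus, so it shifts every time a document arrives, even for documents we never touch again.

```python
import math

def idf(term, corpus):
    """Inverse document frequency: log2(N / df), one common TF-IDF variant."""
    df = sum(term in doc for doc in corpus)
    return math.log2(len(corpus) / df)

corpus = [["a", "a", "b", "c"], ["b", "b", "c", "d"], ["a", "c", "d", "e"]]
before = idf("d", corpus)            # "d" appears in 2 of 3 documents

corpus.append(["a", "a", "e", "z"])  # a 4th document, without "d"
after = idf("d", corpus)             # now "d" appears in 2 of 4 documents

# The weight of "d" in the *old* documents changed without touching them.
assert before != after
```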
2. After running the same code several times I realized that LSI sometimes produces symmetrical results (without any code change): the magnitude of the values remains unchanged, but all the signs flip. I saw that the topic values also flip sign, so the final matrix should be the same. But why does this happen? (I haven't dived into Gensim's code to check whether there is some source of randomness.)
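About point 2, I believe the sign flips are inherent to the SVD underlying LSI rather than a bug: the decomposition is only determined up to the sign of each matched pair of singular vectors. A small numpy check of this (my own illustration, nothing to do with gensim internals):

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)

# Flip the sign of the first left singular vector together with the
# matching right singular vector: still a valid SVD of the same matrix.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1
Vt2[0, :] *= -1

assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(U2 @ np.diag(s) @ Vt2, A)
```

So two runs can legitimately disagree on signs while describing the same model.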
3. I confirmed, as expected, that LSI models starting from BOW produce exactly the same result when computed incrementally and when computed from scratch. However, the same did not happen when I used TF-IDF values as input to the LSI model: the values are close but not identical. I suppose this difference is caused by the TF-IDF changes (the BOW vectors remain equal).
4. Finally, the biggest issue: the LSI values of documents already processed also change with the arrival of each new document. This is to be expected, since the topics are recalculated at each iteration. It is like having a set of points in a space whose basis vectors are constantly changing: even if the points remain "constant" (as in the BOW case), their representation in the permanently changing space changes accordingly.
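The geometry behind point 4 can be seen with a toy projection: the same fixed vector gets different coordinates in two different orthonormal bases, even though the vector itself never moved (again, just an illustration of the geometry, not of gensim's code; the rotated basis stands in for a refitted topic space).

```python
import numpy as np

doc = np.array([1.0, 2.0, 3.0])  # a fixed BOW-like vector

# Two different orthonormal bases for the same 3-D space, standing in for
# topic spaces fitted before and after a new document arrived.
basis_old = np.eye(3)
theta = np.pi / 6
basis_new = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0,            0.0,           1.0]])

coords_old = basis_old.T @ doc
coords_new = basis_new.T @ doc

# Same point, different coordinates; only lengths/angles are preserved.
assert not np.allclose(coords_old, coords_new)
assert np.isclose(np.linalg.norm(coords_old), np.linalg.norm(coords_new))
```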
If I'm getting the whole picture, this cancels out the advantage of the hashing trick and of LSI, because there is still the need to reprocess the previous documents.
Now I'm stuck, wondering whether LSI can be used for incremental clustering at all.
My clustering algorithm creates a tree where each node represents a concept (a probabilistic description of the documents it stores). There are also some requirements on the processing time per document, which make a global reorganization of the hierarchy at each step impractical.
Any remark from Radim or other members would be invaluable. If someone is using Gensim for incremental clustering, it would be great to trade ideas and experiences.
Sorry for the length of the post.
Thanks,
Rui Encarnação
PS - Radim, the "compactify" issue is solved! Thx!