How to update tfidf.

bluishgreen

Jan 13, 2012, 1:58:38 AM

to gensim

Hi all,

I have a quick question. I have a bunch of documents that I turn into a
corpus, transform with a tfidf model, and then train an LSI model on the
result.

My question is what happens when I add a bunch of documents to this
LSI model. I only have word counts that I can supply. To calculate the
tfidf, I would have to go back into the previous documents and change
all the values there - since the incoming documents would affect the
frequency count and thereby make all the previous tfidf values
'dirty'.
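
To make the 'dirty' values concern concrete, here is a minimal pure-Python
sketch (not gensim code). It assumes the inverse document frequency is
computed as log2(N / df), which is a choice made for illustration; the
helper `idf_weights` and the toy documents are hypothetical.

```python
import math


def idf_weights(docs):
    """Return {token: log2(N / df)} for a list of tokenized documents.

    Assumption: idf = log2(total_docs / doc_freq), chosen for illustration.
    """
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for token in set(doc):  # count each token once per document
            df[token] = df.get(token, 0) + 1
    return {token: math.log2(n_docs / count) for token, count in df.items()}


# Hypothetical toy data.
old_docs = [
    ["human", "computer"],
    ["computer", "survey"],
    ["graph", "trees"],
    ["graph", "minors"],
]
new_docs = [["graph", "survey"], ["graph", "computer"]]

idf_before = idf_weights(old_docs)
idf_after = idf_weights(old_docs + new_docs)

# "human" does not occur in the new documents, yet its weight changes,
# because the total number of documents N has grown.
print(idf_before["human"], idf_after["human"])
print(idf_before["graph"], idf_after["graph"])
```

Under these assumptions, every tf-idf value computed with `idf_before` is
out of date once the new documents are counted, including values for words
the new documents never mention.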

Does the add_documents() function in the LSI model take this into
account? I find it hard to imagine how it could, so I thought I'd
just ask (instead of trying hard to imagine!)

-thanks,

Radim

Jan 15, 2012, 2:13:14 PM

to gensim

Hi,

you're right, the LSI update doesn't take new (or modified) features
into account. It only updates on new observations (=documents), using
the same old features.

Interestingly, feature updates can also be done incrementally, in
theory. This is not implemented in gensim though, so for modified
features, your only option right now is to re-train LSI.

HTH,
Radim

Radim

Jan 15, 2012, 2:30:24 PM

to gensim

Re. the incremental tf-idf, when adding extra documents:

1) can be done by simply incrementing document counts. Basically the
same thing as happens now inside `initialize`, but factored out into a
separate function. Difficulty: very easy.

2) is not necessarily a good idea. If your original collection was
large enough, the inverse document weights are likely set reasonably
already. Re-adjusting the weights (or adding new weights for new
vocabulary) has the unpleasant effect that subsequent methods like
online LSI will need to be retrained, because they cannot deal with
modifying input features dynamically, as I explained in the previous
post.
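
As a rough illustration of point 1), the bookkeeping could look like the
sketch below. This is not gensim's implementation: the class
`IncrementalIdf`, its method names, and the log2(N / df) weighting are all
assumptions made for the example, and no vector normalization is applied.
Documents are bag-of-words lists of (token_id, count) pairs, with each
token id appearing at most once per document.

```python
import math


class IncrementalIdf:
    """Hypothetical helper that keeps document frequencies updatable."""

    def __init__(self):
        self.num_docs = 0
        self.dfs = {}  # token id -> number of documents containing it

    def add_documents(self, corpus):
        """Increment document counts from new bag-of-words documents."""
        for bow in corpus:
            self.num_docs += 1
            for token_id, _count in bow:
                self.dfs[token_id] = self.dfs.get(token_id, 0) + 1

    def idf(self, token_id):
        # Assumption: idf = log2(N / df), chosen for illustration.
        return math.log2(self.num_docs / self.dfs[token_id])

    def transform(self, bow):
        """Weight a document with the current counts; skip unseen ids."""
        return [(tid, cnt * self.idf(tid)) for tid, cnt in bow if tid in self.dfs]


model = IncrementalIdf()
model.add_documents([[(0, 1), (1, 2)], [(1, 1), (2, 1)]])
weights_before = model.transform([(0, 1), (1, 1)])

# Adding documents only touches the counters; nothing is recomputed
# from the original collection.
model.add_documents([[(2, 3)], [(0, 1), (2, 1)]])
weights_after = model.transform([(0, 1), (1, 1)])
print(weights_before, weights_after)
```

The same document gets different weights before and after the update,
which is the effect point 2) warns about: anything trained on the old
weights, such as an LSI model, no longer sees consistent input features.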

Best,
Radim

Amit Tewari

Feb 7, 2012, 5:48:50 AM

to gensim

I am trying to do exactly this. I parsed new documents, used new
dictionary to create the corpus. But used original tfidf model to
project the new documents. When I add the transformed documents to the
original LSI, I get an error that "arrays are not the same size". Not
sure what's wrong.

Appreciate any help.

-A

Radim

Feb 8, 2012, 3:51:56 AM

to gensim

On Feb 7, 11:48 am, Amit Tewari <amittew...@gmail.com> wrote:
> I am trying to do exactly this. I parsed new documents, used new
> dictionary to create the corpus. But used original tfidf model to
> project the new documents.

^^^
this is the problem. Both corpora must be built using the same (=old)
dictionary, otherwise you have a feature mismatch on your hands.
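
A small pure-Python sketch of why the dictionary matters (the helpers
`build_dictionary` and `doc2bow` below are hypothetical stand-ins written
for this example, not gensim's classes): a dictionary built from the new
documents assigns ids independently, so the same id refers to different
words in the two corpora.

```python
def build_dictionary(docs):
    """Assign integer ids to tokens in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(token2id, doc):
    """Convert tokens to sorted (token_id, count) pairs; drop unknown tokens."""
    counts = {}
    for token in doc:
        if token in token2id:
            tid = token2id[token]
            counts[tid] = counts.get(tid, 0) + 1
    return sorted(counts.items())


# Hypothetical toy data.
old_docs = [["human", "computer"], ["graph", "trees"]]
new_docs = [["trees", "graph", "forest"]]

old_dict = build_dictionary(old_docs)  # human=0, computer=1, graph=2, trees=3
new_dict = build_dictionary(new_docs)  # trees=0, graph=1, forest=2

# Wrong: ids come from the new dictionary, so id 0 means "trees" here
# but "human" in the model trained on the old dictionary.
wrong = doc2bow(new_dict, new_docs[0])

# Right: reuse the old dictionary; the unseen word "forest" is dropped.
right = doc2bow(old_dict, new_docs[0])
print(wrong, right)
```

With the old dictionary, the new document is expressed in the same
feature space the tfidf and LSI models were trained on, at the cost of
ignoring vocabulary that was not in the original collection.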

This problem keeps coming up again and again for users; I already
opened an issue ticket at https://github.com/piskvorky/gensim/issues/74
.

I'm thinking of adding some dictionary digest (hash) to the corpus
object, and warning users if there's a digest mismatch when applying/
updating a model. This is an API change, so I still need to decide
what to do about this digest during serialization (storing it to
disk), how to pass on the digest to individual vectors during corpus
iteration &c.
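
One way such a digest check might look, purely as a sketch of the idea and
not the design that ended up in gensim: hash the token-to-id mapping,
store the digest alongside the model and the corpus, and compare the two
before applying or updating. The function names `dictionary_digest` and
`check_digest` and the choice of SHA-1 are assumptions for illustration.

```python
import hashlib
import warnings


def dictionary_digest(token2id):
    """Return a hex digest of a token -> id mapping.

    Items are sorted first so the digest does not depend on insertion order.
    """
    h = hashlib.sha1()
    for token, token_id in sorted(token2id.items()):
        h.update(f"{token}\t{token_id}\n".encode("utf-8"))
    return h.hexdigest()


def check_digest(model_digest, corpus_digest):
    """Warn and return False if the corpus was built with another dictionary."""
    if model_digest != corpus_digest:
        warnings.warn("corpus and model were built with different dictionaries")
        return False
    return True


# Hypothetical mappings: same words, but ids assigned differently.
trained_with = dictionary_digest({"human": 0, "graph": 1})
same_mapping = dictionary_digest({"graph": 1, "human": 0})
other_mapping = dictionary_digest({"graph": 0, "human": 1})

print(check_digest(trained_with, same_mapping))
print(check_digest(trained_with, other_mapping))
```

The sketch leaves open the questions raised above, namely where the
digest lives during serialization and how individual vectors would carry
it during corpus iteration.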

Best,
Radim
