Weight different parts of a document for TFIDF model

Matt Harrison

Jan 8, 2021, 9:14:07 AM1/8/21
to Gensim
Hi, 

I'm implementing a "find related documents" feature for an application and using Gensim. In my application, a document is actually a collection of different attributes, such as title, description, and others. I would like each document to be treated as a single entry, but I want to weight the different attributes somehow.

As an example, if the word "deployment" appears in a title, I would expect its TFIDF value to be higher than if it appeared in a description, because the document is then more likely to be about "deployment".

Do you have any pointers or examples on how I could achieve this?

Many thanks,

Matt

Matt Harrison

Jan 8, 2021, 11:46:41 AM1/8/21
to Gensim
I'm thinking now I could get a doc2bow representation for each attribute, weight the term frequencies by the per-attribute weighting, and then combine them into a single doc2bow representation. The counts may no longer be ints, but it seems Gensim won't have an issue with that internally.
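A minimal sketch of that combining step, in plain Python mirroring the (token_id, count) pairs that Dictionary.doc2bow produces. The attribute names and weights here are made-up placeholders, not anything from Gensim itself:

```python
from collections import defaultdict

# Hypothetical per-attribute weights: a "deployment" in the title counts
# five times as much as one in the description.
WEIGHTS = {"title": 5.0, "description": 1.0}

def combine_bows(attribute_bows):
    """Merge per-attribute bag-of-words vectors (lists of
    (token_id, count) pairs, as returned by doc2bow) into a single
    vector, scaling each attribute's counts by its weight."""
    combined = defaultdict(float)
    for attr, bow in attribute_bows.items():
        weight = WEIGHTS.get(attr, 1.0)
        for token_id, count in bow:
            combined[token_id] += weight * count
    return sorted(combined.items())

# e.g. token 0 = "deployment": once in the title, twice in the description
doc = {"title": [(0, 1)], "description": [(0, 2), (3, 1)]}
print(combine_bows(doc))  # [(0, 7.0), (3, 1.0)]
```

The combined vector has the same shape as a normal doc2bow result (just with float counts), so it should be usable as input to TfidfModel like any other corpus entry.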

Radim Řehůřek

Jan 8, 2021, 12:37:25 PM1/8/21
to Gensim
Hi Matt,

that's right – as the simplest approximation, you can boost the word counts for different attributes. E.g. each "deployment" in Title is worth five "deployment"s in Description etc.

This is actually a common approach, even if hackish, and tends to work well given its simplicity. The "combined" model effectively adds a handful of constants, since such weights typically don't depend on the query/indexed document at all, and are fixed throughout.

Of course, that leads to the question of "what weights to use". Do you have an annotated set from which to tune these extra parameters? Or do you plan to set them based on your intuition?

Other approaches include keeping the attribute vectors separate, and extending the query model instead. So instead of comparing two vectors (doc x query) with cosine similarity, you'd be comparing N x N vector pairs (N for query, N for doc) with a more sophisticated measure. Depending on the quality of your training data and appetite for uncharted "academic" algorithms, keeping the input signals "raw" and pushing the complexity (and explosion of model parameters) into evaluation can be quite the rabbit hole. That's the direction of modern attention algorithms and deep learning.
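A rough sketch of that "keep the vectors separate" direction, with a simple fixed-weight blend of per-attribute cosine scores standing in for the more sophisticated measure (the attribute names, weights, and sparse-dict representation are all illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as
    {token_id: value} dicts."""
    dot = sum(val * v.get(tid, 0.0) for tid, val in u.items())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combined_similarity(query_attrs, doc_attrs, weights):
    """Compare query and document attribute-by-attribute (title vs
    title, description vs description, ...) and blend the per-attribute
    cosine scores with fixed weights."""
    total = sum(weights.values())
    return sum(w * cosine(query_attrs.get(a, {}), doc_attrs.get(a, {}))
               for a, w in weights.items()) / total

# Illustrative: identical query and doc score 1.0 under any weighting.
q = {"title": {0: 1.0}, "description": {1: 1.0}}
print(combined_similarity(q, q, {"title": 5.0, "description": 1.0}))
```

A learned measure would replace the fixed weights (and possibly the cosine itself) with parameters tuned on annotated data, which is where the "explosion of model parameters" comes in.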

Hope that helps,
Radim

Matt Harrison

Jan 8, 2021, 12:51:42 PM1/8/21
to Gensim
Hi Radim,

I agree the idea did feel slightly hackish at first, but I think it should work for my use case, and I'm glad to hear it's a fairly common approach.

Regarding the question of "what weights to use": our SaaS application is highly configurable, and the administrators of each tenant will be able to explicitly configure which attributes to select and the relative weighting of each. We have loads of very different tenants with their own configurations, so it will be interesting to evaluate how this works in reality.

I'll try the simple approach first, but I really appreciate your other suggestions and will keep them in mind as we improve this feature. Thanks for the really quick response to my question and for your work on Gensim :)

Cheers,

Matt