Own weights for TfidfModel


Andrii Elyiv

unread,
Mar 9, 2016, 4:34:03 AM3/9/16
to gensim
Hello,

I use tfidf = models.TfidfModel(corpus)
Could you show me an example of how to use my own weights?

Thanks,
Andrii

Radim Řehůřek

unread,
Mar 9, 2016, 9:27:45 PM3/9/16
to gensim
Hello Andrii,


On Wednesday, March 9, 2016 at 5:34:03 PM UTC+8, Andrii Elyiv wrote:
> Hello,
>
> I use tfidf = models.TfidfModel(corpus)
> Could you show me an example of how to use my own weights?

what do you mean? What "weight"?

Radim

 


Andrii Elyiv

unread,
Mar 10, 2016, 3:53:41 AM3/10/16
to gensim
Hello Radim,

For example in function:
gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True)
I want to use
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
instead of the default weights.

How can I do this in practice?
Thanks,
Andrii

Radim Řehůřek

unread,
Mar 10, 2016, 6:27:49 AM3/10/16
to gensim
Ah, I see. To supply your own wlocal and wglobal weights, pass them as functions to the TfidfModel constructor.

For an example of what these functions look like (which parameters they take), see their default implementations in gensim: utils.identity and df2idf.

`wlocal` takes a single parameter (the frequency of a term in a document); `wglobal` takes two parameters (the number of documents a term appears in, plus the total number of documents).

By default, wlocal computes TF (it just returns the input frequency unchanged) and wglobal computes IDF. Together: TF-IDF.
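For reference, the defaults behave roughly like this (a minimal re-implementation sketch, not gensim's actual code; gensim's df2idf also accepts optional log-base and offset arguments):

```python
import math

def identity(freq):
    # default wlocal: plain term frequency, returned unchanged
    return freq

def df2idf(docfreq, totaldocs):
    # default wglobal: inverse document frequency, log base 2
    return math.log(totaldocs / docfreq, 2)
```

Replacing either function changes the corresponding factor in weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D).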

Hope that helps,
Radim

Andrii Elyiv

unread,
Mar 10, 2016, 6:35:36 AM3/10/16
to gensim
Thanks, but I still don't understand.

  def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.identity, wglobal=df2idf, normalize=True):

What does utils.identity mean?
How do I set, for example, wlocal = 1 + frequency^2 ?


Andrii

Radim Řehůřek

unread,
Mar 10, 2016, 7:42:57 AM3/10/16
to gensim
Like this:

def my_local_weight(freq):
    return 1.0 + freq**2

model = TfidfModel(..., wlocal=my_local_weight, ...)


HTH,
Radim

Andrii Elyiv

unread,
Mar 10, 2016, 7:57:08 AM3/10/16
to gensim
Great! Many thanks

Andrii Elyiv

unread,
Mar 10, 2016, 9:05:11 AM3/10/16
to gensim
Another question; I have the following problem:
words used very frequently across the document set, like London or Delhi, get very small weights.
As a result, some news items that mention only London or Delhi are not categorized correctly as UK or India, respectively.
Which kinds of local and global weights are better to use in this case?

Thanks,
Andrii

Radim Řehůřek

unread,
Mar 10, 2016, 9:37:31 PM3/10/16
to gensim
There's no one true way to handle feature extraction. If TF-IDF doesn't suit you, try something else. Maybe change the logarithm base in IDF (the default is base 2). Or ditch the logarithm and use a square root instead. Or treat "special" words and named entities in a special way, outside of the bag-of-words framework.
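As a sketch of those first two ideas (my own illustrative functions, not part of gensim), either one could be passed as wglobal:

```python
import math

def idf_base10(docfreq, totaldocs):
    # IDF with a base-10 logarithm instead of the default base 2
    return math.log10(totaldocs / docfreq)

def sqrt_global(docfreq, totaldocs):
    # drop the logarithm entirely and use a square root,
    # which penalizes frequent words less aggressively
    return math.sqrt(totaldocs / docfreq)

# e.g. tfidf = TfidfModel(corpus, wglobal=sqrt_global)
```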

One example would be using semantic models instead of tf-idf. Models like LSA (latent semantic analysis) or LDA (latent dirichlet allocation) construct condensed features from entire docs, so that each feature is less dependent on a particular word appearing in the input (London, Delhi).

Finally, check your categorization model. Some supervised algorithms work better on text than others. SVM is a safe choice; Naive Bayes is a good start, simple and easy to debug. Unless your documents are super short, the categorization shouldn't depend so strongly on a single word; that's a red flag.

HTH,
Radim

Andrii Elyiv

unread,
Mar 11, 2016, 4:38:26 AM3/11/16
to gensim
Thanks for the suggestions,

> the categorization shouldn't depend on a single word so strongly, that's just a red flag.

Very often a news item mentions just London, some non-famous surnames, and a general description of a local event. Such news will not be categorized as related to the UK. The word "london" appears in more than 30% of the documents in the set, so it has a low global weight. I set wglobal=1, but the result is almost the same.

Andrii

Radim Řehůřek

unread,
Mar 11, 2016, 9:32:36 PM3/11/16
to gensim
I'd say this is an issue with your classifier. If your training set contains London in "UK"-labelled docs, and doesn't contain London in non-UK-labelled docs, the classifier should pick up this signal, regardless of its tf-idf weight.

Random idea: are you really sure "London" is one of your features? Usually, there's a pruning step where the most frequent words are completely discarded, on account of them carrying little information. Maybe your pruning / feature selection step discarded London?
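One quick way to check, assuming the vocabulary was built with gensim's Dictionary: its filter_extremes method has a no_above parameter (0.5 by default) that discards tokens appearing in more than that fraction of documents. The helper below is my own illustration of that cutoff rule, not gensim code:

```python
# With a gensim Dictionary, the check would look roughly like:
#   dictionary.filter_extremes(no_above=0.5)
#   print('london' in dictionary.token2id)

def survives_pruning(docfreq, totaldocs, no_above=0.5):
    # a token appearing in more than `no_above` of all documents is discarded
    return docfreq / totaldocs <= no_above

# a word in 30% of documents survives the default 0.5 cutoff,
# but not a stricter 0.25 cutoff
```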

Radim



 

