Own weights for TfidfModel


Andrii Elyiv

unread,
Mar 9, 2016, 4:34:03 AM3/9/16
to gensim
Hello,

I use tfidf = models.TfidfModel(corpus)
Could you show me an example of how to use my own weights?

Thanks,
Andrii

Radim Řehůřek

unread,
Mar 9, 2016, 9:27:45 PM3/9/16
to gensim
Hello Andrii,


On Wednesday, March 9, 2016 at 5:34:03 PM UTC+8, Andrii Elyiv wrote:
> Hello,
>
> I use tfidf = models.TfidfModel(corpus)
> Could you show me an example of how to use my own weights?

what do you mean? What "weight"?

Radim

 


Andrii Elyiv

unread,
Mar 10, 2016, 3:53:41 AM3/10/16
to gensim
Hello Radim,

For example in function:
gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True)
I want to use
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
instead of the default weights.

How can I do this in practice?
Thanks,
Andrii

Radim Řehůřek

unread,
Mar 10, 2016, 6:27:49 AM3/10/16
to gensim
Ah, I see. To supply your own wlocal and wglobal weights, pass them as functions to the TfidfModel constructor.

For an example of what these functions look like (which parameters they take), see their default implementations in gensim: utils.identity and df2idf.

`wlocal` takes a single parameter (the frequency of a term in a document); `wglobal` takes two parameters (the number of documents a term appears in, plus the total number of documents).

By default, wlocal computes TF (it just returns the input frequency unchanged) and wglobal computes IDF. Together: TF-IDF.
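For reference, the defaults behave roughly like this (a minimal re-implementation sketch, not gensim's actual code; gensim's df2idf also accepts optional log-base and offset arguments):

```python
import math

def identity(freq):
    # default wlocal: plain term frequency, returned unchanged
    return freq

def df2idf(docfreq, totaldocs):
    # default wglobal: inverse document frequency, log base 2
    return math.log(totaldocs / docfreq, 2)
```

Replacing either function changes the corresponding factor in weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D).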

Hope that helps,
Radim

Andrii Elyiv

unread,
Mar 10, 2016, 6:35:36 AM3/10/16
to gensim
Thanks, but I still don't understand.

  def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.identity, wglobal=df2idf, normalize=True):

What does utils.identity mean?
How do I set, for example, wlocal = 1 + frequency^2 ?


Andrii

Radim Řehůřek

unread,
Mar 10, 2016, 7:42:57 AM3/10/16
to gensim
Like this:

def my_local_weight(freq):
    return 1.0 + freq**2

model = TfidfModel(..., wlocal=my_local_weight, ...)


HTH,
Radim

Andrii Elyiv

unread,
Mar 10, 2016, 7:57:08 AM3/10/16
to gensim
Great! Many thanks

Andrii Elyiv

unread,
Mar 10, 2016, 9:05:11 AM3/10/16
to gensim
Another question; I have the following problem:
words used very frequently across the document set, like London or Delhi, get very small weights.
As a result, some news items that mention only London or Delhi are not categorized correctly as UK or India, respectively.
Which kinds of local and global weights are better to use in this case?

Thanks,
Andrii

Radim Řehůřek

unread,
Mar 10, 2016, 9:37:31 PM3/10/16
to gensim
There's no one true way to handle feature extraction. If TF-IDF doesn't suit you, try something else. Maybe change the logarithm base in IDF (the default is base 2). Or ditch the logarithm and use a square root instead. Or treat "special" words and named entities in a special way, outside of the bag-of-words framework.
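As a sketch of those first two ideas (my own illustrative functions, not part of gensim), either one could be passed as wglobal:

```python
import math

def idf_base10(docfreq, totaldocs):
    # IDF with a base-10 logarithm instead of the default base 2
    return math.log10(totaldocs / docfreq)

def sqrt_global(docfreq, totaldocs):
    # drop the logarithm entirely and use a square root,
    # which penalizes frequent words less aggressively
    return math.sqrt(totaldocs / docfreq)

# e.g. tfidf = TfidfModel(corpus, wglobal=sqrt_global)
```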

One example would be using semantic models instead of tf-idf. Models like LSA (latent semantic analysis) or LDA (latent dirichlet allocation) construct condensed features from entire docs, so that each feature is less dependent on a particular word appearing in the input (London, Delhi).

Finally, check your categorization model. Some supervised algorithms work better on text than others. SVM is a safe choice; Naive Bayes is a good start, simple and easy to debug. Unless your documents are super short, the categorization shouldn't depend so strongly on a single word; that's a red flag.

HTH,
Radim

Andrii Elyiv

unread,
Mar 11, 2016, 4:38:26 AM3/11/16
to gensim
Thanks for the suggestions,

> the categorization shouldn't depend on a single word so strongly, that's just a red flag.

Very often a news item mentions just London, some non-famous surnames, and a general description of a local event. Such news will not be categorized as related to the UK. The word "london" appears in more than 30% of the documents in the set, so it has a low global weight. I set wglobal=1, but the result is almost the same.

Andrii

Radim Řehůřek

unread,
Mar 11, 2016, 9:32:36 PM3/11/16
to gensim
I'd say this is an issue with your classifier. If your training set contains London in "UK"-labelled docs, and doesn't contain London in non-UK-labelled docs, the classifier should pick up this signal, regardless of its tf-idf weight.

Random idea: are you really sure "London" is one of your features? Usually, there's a pruning step where the most frequent words are completely discarded, on account of them carrying little information. Maybe your pruning / feature selection step discarded London?
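One quick way to check, assuming the vocabulary was built with gensim's Dictionary: its filter_extremes method has a no_above parameter (0.5 by default) that discards tokens appearing in more than that fraction of documents. The helper below is my own illustration of that cutoff rule, not gensim code:

```python
# With a gensim Dictionary, the check would look roughly like:
#   dictionary.filter_extremes(no_above=0.5)
#   print('london' in dictionary.token2id)

def survives_pruning(docfreq, totaldocs, no_above=0.5):
    # a token appearing in more than `no_above` of all documents is discarded
    return docfreq / totaldocs <= no_above

# a word in 30% of documents survives the default 0.5 cutoff,
# but not a stricter 0.25 cutoff
```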

Radim



 

