tf Normalization


Felix Middendorf

Nov 18, 2011, 7:37:52 PM
to gensim
Hi everyone,

I am currently experimenting with gensim and it is really great. Good
job, Radim!
However, I was a bit surprised by the fact that gensim does not ship
with functionality to normalize tf (not tf.idf).

Let me explain: in order to avoid bias to large documents, tf is often
normalized like this
term_frequency / length_of_document
before it is multiplied with idf (see, e.g., http://en.wikipedia.org/wiki/Tf.idf
and p. 245 in Croft et al: "Search Engines", 2010.)
Browsing the code, I noticed that gensim's tfidf transformation only
offers to normalize the resulting vectors (tfidf = tf * log2(documents /
df)) to length 1, which is not the same.
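
To make the difference concrete, here is a minimal sketch in plain Python. It is not gensim's code: the helper names `tf_normalized_tfidf` and `unit_length_tfidf` and the toy numbers are assumptions made up for illustration, and idf is taken to be log2(documents / df) as described above.

```python
import math

def idf(num_docs, df):
    """idf = log2(documents / df), the formula quoted above."""
    return math.log2(num_docs / df)

def tf_normalized_tfidf(bow, dfs, num_docs):
    """Divide each raw count by the document length, then multiply by idf."""
    doc_length = sum(count for _, count in bow)
    return [(term, count / doc_length * idf(num_docs, dfs[term]))
            for term, count in bow]

def unit_length_tfidf(bow, dfs, num_docs):
    """Multiply raw counts by idf, then scale the vector to Euclidean length 1."""
    weights = [(term, count * idf(num_docs, dfs[term])) for term, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(term, w / norm) for term, w in weights] if norm else weights

# Toy data (assumed): 8 documents; term 0 occurs in 2 of them, term 1 in 4.
dfs = {0: 2, 1: 4}
doc = [(0, 3), (1, 1)]  # bag-of-words: (term_id, raw count)

by_length = tf_normalized_tfidf(doc, dfs, num_docs=8)
by_norm = unit_length_tfidf(doc, dfs, num_docs=8)
```

For a single document the two results point in the same direction and differ only by a per-document scale factor: the first depends on the raw document length, the second always has Euclidean length 1.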

Am I overlooking something? If it ships with this kind of
functionality, I'm not seeing the forest for the trees and would be
grateful for a hint ;).
Would anyone else be interested in this? I think it would be a great
fit for TfidfModel.

All the best

Felix

Radim

Nov 19, 2011, 6:46:28 AM
to gensim, Dieter Plaetinck
Hi Felix (cc Dieter),

the vectors are indeed normalized only after tf*idf. Feel free to
modify the function(s) any way you like! In fact, I think Dieter
mentioned that he is reworking TfIdf at the moment, to make it more
flexible. Perhaps you two can co-operate, to ensure smooth operation
and avoid collisions =)

Btw have a look at the logentropy_model, which attempts to do the
weighting in a more principled manner.

Best,
Radim



Felix Middendorf

Nov 19, 2011, 7:10:10 AM
to gensim
Hi Radim (+ Dieter),

thanks for directing me to logentropy, I'll look into that.

I'm aware of Dieter's ventures in Tfidf. I don't think my tf-
normalization conflicts with his different variants of implementing
idf (e.g. using log_10 instead of log_2). However, we'll probably have
to find a way around param-creep in the Tfidf constructor.

I have just created a feature branch for tf-normalization in my repo
and sent you a pull request on github :). As I am rather new to
Python, it'd be cool if you could take a look.
https://github.com/piskvorky/gensim/pull/67

Have a great weekend

Felix

Dieter Plaetinck

Nov 19, 2011, 8:24:34 AM
to gen...@googlegroups.com
While playing with modified df2idf functions I noticed I also wanted to modify TF. (Lucene does math.sqrt(tf), btw)
I'm pondering what the best way is though.  For most flexibility, I would like to allow the user to modify the entire calculation (i.e. not just df2idf but also what it does to TF, which is currently in __getitem__)
Then again, your patch which introduces a simple 'normalise or not' boolean might be enough for many folks?
Maybe we could do something like:

def __getitem__(self, bow):
    if is_corpus(..):
        ...  # existing corpus handling, elided
    return self.calculate(bow)

then the calculate method does the default calculation (controlled by the booleans and basic options), or can be completely overridden.
The downside is an extra function call (but I guess that's okay). We could also allow the user to modify the __getitem__ method directly, but that seems a bit messy.
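
A minimal sketch of that idea, under stated assumptions: `WeightingModel` and `SqrtTfModel` are hypothetical names, not gensim classes, the idf weights are passed in as a ready-made dict, and corpus handling is left out. The default `calculate` is controlled by a boolean, and a subclass can replace it entirely.

```python
import math

class WeightingModel:
    """Hypothetical base class for illustration; not part of gensim."""

    def __init__(self, idfs, normalize_tf=False):
        self.idfs = idfs                  # term_id -> precomputed idf weight
        self.normalize_tf = normalize_tf  # divide tf by document length?

    def __getitem__(self, bow):
        # Corpus handling is omitted; a single bag-of-words document is assumed.
        return self.calculate(bow)

    def calculate(self, bow):
        """Default calculation: (optionally length-normalized) tf times idf."""
        doc_length = sum(c for _, c in bow) if self.normalize_tf else 1
        return [(t, c / doc_length * self.idfs[t]) for t, c in bow]

class SqrtTfModel(WeightingModel):
    """Overrides the whole calculation, here with a square-root tf."""

    def calculate(self, bow):
        return [(t, math.sqrt(c) * self.idfs[t]) for t, c in bow]

default_vec = WeightingModel({0: 2.0, 1: 1.0}, normalize_tf=True)[[(0, 3), (1, 1)]]
sqrt_vec = SqrtTfModel({0: 2.0})[[(0, 4)]]
```

The extra cost is one method call per document; in exchange, `__getitem__` itself never has to be touched by users.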

Dieter

Radim

Nov 19, 2011, 10:36:01 AM
to gensim
On Nov 19, 2:24 pm, Dieter Plaetinck <die...@plaetinck.be> wrote:
> While playing with modified df2idf functions I also noticed I also wanted
> to modify TF. (Lucene does math.sqrt(tf), btw)

...and `logentropy` does math.log :-) There's no end to possible
modifications.


> I'm pondering what the best way is though.  For most flexibility, I would
> like to allow the user to modify the entire calculation (i.e. not just

Well this goes in the direction of a generic "local weight+global
weight+normalize" framework (ala the SMART system), where each of the
three pieces is supplied by the user, through injection. What I mean
is you'd construct the transformation as `tfidf =
GenericWeighter(corpus, local=some_tf_function,
global=some_idf_function, normalize=some_norm_fnc)`. Python is quite
handy for code injection (functions as first-class citizens that can
be passed around etc), so no problem there.
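
A rough sketch of what such an injected framework could look like, in plain Python. `GenericWeighter` here is a hypothetical class written for illustration, not an existing gensim API, and the helpers `idf_log2` and `l2_normalize` are likewise made up. One deviation from the call above: `global` is a reserved word in Python and cannot be used as a keyword argument, so the sketch names that parameter `global_weight`.

```python
import math

class GenericWeighter:
    """Hypothetical SMART-style weighter; not an existing gensim class."""

    def __init__(self, corpus, local, global_weight, normalize):
        corpus = list(corpus)
        self.num_docs = len(corpus)
        self.local = local
        self.normalize = normalize
        # Document frequency: number of documents containing each term.
        dfs = {}
        for bow in corpus:
            for term, _ in bow:
                dfs[term] = dfs.get(term, 0) + 1
        self.global_weights = {term: global_weight(df, self.num_docs)
                               for term, df in dfs.items()}

    def __getitem__(self, bow):
        # Terms unseen during construction are dropped.
        vec = [(t, self.local(c) * self.global_weights[t])
               for t, c in bow if t in self.global_weights]
        return self.normalize(vec)

def idf_log2(df, num_docs):
    """Global weight: log2(documents / df)."""
    return math.log2(num_docs / df)

def l2_normalize(vec):
    """Scale a sparse vector to Euclidean length 1 (zero vectors unchanged)."""
    norm = math.sqrt(sum(w * w for _, w in vec))
    return [(t, w / norm) for t, w in vec] if norm else vec

corpus = [[(0, 1), (1, 2)], [(1, 1)], [(0, 4), (2, 1)], [(1, 3)]]
tfidf = GenericWeighter(corpus, local=math.sqrt,
                        global_weight=idf_log2, normalize=l2_normalize)
vec = tfidf[[(0, 4), (2, 1)]]
```

Swapping `math.sqrt` for `math.log` or an identity function changes the local weight without touching the class, which is the flexibility being weighed against API complexity here.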

Question is, if we inject all functionality through parameters, what
will be left in the core? :)

If I recall correctly, there actually used to be such a weighting
framework in gensim, but I removed it.* The reason was that I wasn't
sure that the added complexity (=somebody looking at the API and
thinking WTF) outweighed the gain in flexibility.

In other words, modifying the models is so easy that I thought writing
the adjustment yourself is faster than studying API parameters and
documentation.

Having said that, providing good defaults for users who don't know how
to code is important. So I'm in favour of doing the modifications, as
long as we keep the code obfuscation level reasonably low for power
users.

Best,
Radim

* I believe I deleted more code from gensim than there remains :)

Felix Middendorf

Nov 19, 2011, 12:30:09 PM
to gensim
I think I'll have to agree.

I also had something like this in mind. While attractive, even with
the local*global solution that you described there'll still be the
problem of composition. Take e.g. the following tf.idf example from p.
246 of the aforementioned Croft et al.
http://i.imgur.com/JbZ53.jpg (sorry, cell phone photo)
Thus, we'd have to add more and more and more... So I'm not really sure
if it's a great idea.

I think the problem boils down to "there is no one way to tf.idf".
However, people have certain expectations, e.g. I totally expected
this tf/doc_length normalization to be available. Why? Maybe, because
no one told me what kind of tf.idf is implemented. ;-)

So I guess the documentation on TfidfModel should be more specific
with regard to the formula for tf and idf in order to set the right
expectations.
For advanced users, it's easy to work around this. Just preprocess the
corpus before using TfidfModel. So I don't really need to have my pull
request included (Yet, it would be handy ;)).
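
The preprocessing workaround could look roughly like this, assuming the corpus is an iterable of bag-of-words documents given as (term_id, count) pairs. The function name `length_normalized` is made up for this sketch.

```python
def length_normalized(corpus):
    """Yield each document with its counts divided by the document length."""
    for bow in corpus:
        doc_length = sum(count for _, count in bow)
        # An empty document yields an empty list; no division happens.
        yield [(term, count / doc_length) for term, count in bow]

corpus = [[(0, 3), (1, 1)], [], [(2, 5)]]
normalized = list(length_normalized(corpus))
```

Because it is a generator, the corpus is streamed one document at a time, and the output can be passed to whichever weighting step follows.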

Maybe one could have a more complex TfidfModel that supports all kinds
of funky jazz so that it's there for people who need it, and a more
simple one that is implemented by providing the complex one with some
sensible defaults?
Another option would be to have an easily extensible base class for
tf.idf transformations that advanced users can use to implement their
own idea of tf.idf. Thus, there's no API bloat, but everyone can log/
sqrt/normalize/... the hell out of tf.idf ;).
I'll have to give this some more thought, though.

All the best

Felix

Radim

Nov 19, 2011, 1:41:08 PM
to gensim
On Nov 19, 6:30 pm, Felix Middendorf <felix.middend...@gmail.com>
wrote:

> However, people have certain expectations, e.g. I totally expected
> this tf/doc_length normalization to be available. Why? Maybe, because
> no one told me what kind of tf.idf is implemented. ;-)

Good point. Can you fix the documentation?

Btw the way it's implemented in your patch, `normalize_tf` has no
effect if `normalize` is already set, because the uniform scaling by
`doc_length=sum(tf)` is lost when `normalize` sets the entire vector
to unit length afterward. So the combination of `normalize_tf=True and
normalize=True` is just a slow no-op.
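
A small numeric check of this argument in plain Python, using made-up counts and idf weights (nothing here is taken from the patch itself): dividing every tf by the same `doc_length` before unit-length normalization gives the same vector as not dividing at all.

```python
import math

def l2_normalize(weights):
    """Scale a dense weight list to Euclidean length 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

counts = [3, 1, 4]      # raw term frequencies of one document (assumed)
idfs = [2.0, 1.0, 0.5]  # made-up idf weights

doc_length = sum(counts)
plain = l2_normalize([tf * idf for tf, idf in zip(counts, idfs)])
scaled = l2_normalize([tf / doc_length * idf for tf, idf in zip(counts, idfs)])

# True: the uniform factor 1/doc_length cancels out in the normalization.
same = all(math.isclose(a, b) for a, b in zip(plain, scaled))
```

The factor 1/doc_length multiplies every entry and the norm alike, so it cancels when the vector is divided by its norm.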

Vectors are L2-normalized later on in the pipeline anyway (for
cosine similarity), so `normalize=True` typically holds. So I wonder
what the use case for having `normalize_tf=True` is? Maybe you meant
to use a different formula?

Best,
Radim

Brandon

Feb 17, 2012, 6:39:37 PM
to gen...@googlegroups.com
Hi,

I happen to be considering a project that looks into just what Radim proposed above: an exploration of SMART-style tf-idf variants. I thought I would ask about any further progress on this idea from those of you with more coding under your belts than I have! :)

Thanks all,

Brandon

Radim

Feb 18, 2012, 9:45:44 AM
to gensim
Hi Brandon,

go ahead & good luck! Let us know once you have some results, so we
may comment and give you feedback :)

Best,
Radim