the vectors are indeed normalized only after tf*idf. Feel free to
modify the function(s) any way you like! In fact, I think Dieter
mentioned that he is reworking TfIdf at the moment, to make it more
flexible. Perhaps you two can co-operate, to ensure smooth operation
and avoid collisions =)
Btw have a look at the logentropy_model, which attempts to do the
weighting in a more principled manner.
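In case it helps, here is a sketch of one common log-entropy formulation (local weight log(1+tf), global weight one minus the normalized entropy of the term's distribution across documents); check the logentropy_model docstring for the exact formula gensim implements:

```python
import math

# A sketch of one common log-entropy formulation; this is illustrative,
# not necessarily the exact formula gensim's logentropy_model uses.
# Assumes a corpus of at least two documents.
def log_entropy(corpus):
    """corpus: a list of documents, each a list of (term_id, count) pairs."""
    n_docs = len(corpus)
    # total corpus-wide count of each term
    totals = {}
    for doc in corpus:
        for term_id, count in doc:
            totals[term_id] = totals.get(term_id, 0) + count
    # global (entropy) weight: 1 + sum_j p_ij * log(p_ij) / log(n_docs)
    entropy = {term_id: 1.0 for term_id in totals}
    for doc in corpus:
        for term_id, count in doc:
            p = count / totals[term_id]
            entropy[term_id] += p * math.log(p) / math.log(n_docs)
    # local weight log(1 + tf), combined multiplicatively
    return [[(term_id, math.log(1 + count) * entropy[term_id])
             for term_id, count in doc] for doc in corpus]
```

A term spread evenly over all documents gets an entropy weight near zero, while a term concentrated in one document gets weight one, so "uninformative" terms are suppressed more smoothly than with plain idf.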
Best,
Radim
On Nov 19, 1:37 am, Felix Middendorf <felix.middend...@gmail.com>
wrote:
> Hi everyone,
>
> I am currently experimenting with gensim and it is really great. Good
> job, Radim!
> However, I was a bit surprised by the fact that gensim does not ship
> with functionality to normalize tf (not tf.idf).
>
> Let me explain: in order to avoid bias to large documents, tf is often
> normalized like this
> term_frequency / length_of_document
> before it is multiplied with idf (see, e.g., http://en.wikipedia.org/wiki/Tf.idf
thanks for directing me to logentropy, I'll look into that.
I'm aware of Dieter's ventures into Tfidf. I don't think my tf
normalization conflicts with his different variants of implementing
idf (e.g. using log_10 instead of log_2). However, we'll probably have
to find a way around param-creep in the Tfidf constructor.
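For concreteness, this is roughly the normalization I mean, on a gensim-style bag-of-words document (an illustrative sketch, not the actual patch code):

```python
# Illustrative sketch of per-document tf normalization; a document is a
# gensim-style list of (term_id, raw_count) pairs. Not the actual patch.
def normalize_tf(doc):
    doc_length = sum(count for _, count in doc)
    if doc_length == 0:
        return doc
    return [(term_id, count / doc_length) for term_id, count in doc]
```

e.g. `normalize_tf([(0, 3), (1, 1)])` gives `[(0, 0.75), (1, 0.25)]`, so a long and a short document about the same topic get comparable tf values.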
I have just created a feature branch for tf normalization in my repo
and sent you a pull request on GitHub :). As I am rather new to
Python, it'd be cool if you could take a look.
https://github.com/piskvorky/gensim/pull/67
Have a great weekend
Felix
...and `logentropy` does math.log :-) There's no end to possible
modifications.
> I'm pondering what the best way is though. For most flexibility, I would
> like to allow the user to modify the entire calculation (i.e. not just
Well this goes in the direction of a generic "local weight+global
weight+normalize" framework (ala the SMART system), where each of the
three pieces is supplied by the user, through injection. What I mean
is you'd construct the transformation as `tfidf =
GenericWeighter(corpus, local_weight=some_tf_function,
global_weight=some_idf_function, normalize=some_norm_function)`
(note `global` itself is a reserved word in Python, so that keyword
needs a different name). Python is quite handy for code injection
(functions are first-class citizens that can be passed around etc.),
so no problem there.
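For illustration, such a weighter could look roughly like this (`GenericWeighter` is hypothetical, as are all the parameter names):

```python
import math

# Hypothetical sketch of the injection idea -- GenericWeighter is not a
# real gensim class, and the parameter names are made up.
class GenericWeighter:
    def __init__(self, corpus, local_weight, global_weight, normalize):
        self.local_weight = local_weight
        self.normalize = normalize
        # document frequency of each term, fed to the global weight function
        n_docs = len(corpus)
        doc_freq = {}
        for doc in corpus:
            for term_id, _ in doc:
                doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
        self.gw = {t: global_weight(df, n_docs) for t, df in doc_freq.items()}

    def __getitem__(self, doc):
        vec = [(t, self.local_weight(c) * self.gw.get(t, 0.0)) for t, c in doc]
        return self.normalize(vec)

def l2_norm(vec):
    # unit-length (cosine) normalization
    length = math.sqrt(sum(w * w for _, w in vec)) or 1.0
    return [(t, w / length) for t, w in vec]

# plugging in plain tf * log2(N/df) with L2 normalization:
tfidf = GenericWeighter(
    corpus=[[(0, 2), (1, 1)], [(0, 1)]],
    local_weight=lambda tf: tf,
    global_weight=lambda df, n: math.log(n / df, 2),
    normalize=l2_norm,
)
```

Swapping in a sublinear local weight is then just `local_weight=lambda tf: 1 + math.log(tf)`, with no change to the class itself.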
Question is, if we inject all functionality through parameters, what
will be left in the core? :)
If I recall correctly, there actually used to be such a weighting
framework in gensim, but I removed it.* The reason was that I wasn't
sure the gain in flexibility outweighed the added complexity
(=somebody looking at the API and thinking WTF).
In other words, modifying the models is so easy that I thought writing
the adjustment yourself is faster than studying API parameters and
documentation.
Having said that, providing good defaults for users who don't know how
to code is important. So I'm in favour of doing the modifications, as
long as we keep the code obfuscation level reasonably low for power
users.
Best,
Radim
* I believe I deleted more code from gensim than there remains :)
I also had something like this in mind. While attractive, even with
the local*global solution you described, there'll still be the problem
of composition. Take, e.g., the following tf.idf example from p. 246
of the aforementioned Croft et al.:
http://i.imgur.com/JbZ53.jpg (sorry, cell phone photo)
Thus, we'd have to add more and more and more... So I'm not really
sure it's a great idea.
I think the problem boils down to "there is no one way to tf.idf".
However, people have certain expectations, e.g. I totally expected
this tf/doc_length normalization to be available. Why? Maybe, because
no one told me what kind of tf.idf is implemented. ;-)
So I guess the documentation on TfidfModel should be more specific
with regard to the formula for tf and idf in order to set the right
expectations.
For advanced users, it's easy to work around this: just preprocess the
corpus before using TfidfModel. So I don't really need my pull request
included (though it would be handy ;)).
Maybe one could have a more complex TfidfModel that supports all kinds
of funky jazz, so that it's there for people who need it, and a
simpler one implemented by giving the complex one some sensible
defaults?
Another option would be to have an easily extensible base class for
tf.idf transformations that advanced users can use to implement their
own idea of tf.idf. Thus, there's no API bloat, but everyone can log/
sqrt/normalize/... the hell out of tf.idf ;).
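Roughly what I have in mind (all class and method names here are made up to illustrate the hook idea, not an actual gensim API):

```python
import math

# Sketch of the extensible base-class idea; the names are illustrative,
# not an actual gensim API.
class BaseTfidf:
    def __init__(self, corpus):
        self.n_docs = len(corpus)
        self.doc_freq = {}
        for doc in corpus:
            for term_id, _ in doc:
                self.doc_freq[term_id] = self.doc_freq.get(term_id, 0) + 1

    # overridable hooks: plain tf and log_2 idf by default
    def tf(self, count, doc):
        return count

    def idf(self, term_id):
        return math.log(self.n_docs / self.doc_freq[term_id], 2)

    def __getitem__(self, doc):
        return [(t, self.tf(c, doc) * self.idf(t)) for t, c in doc]

# a subclass that length-normalizes and sqrt-dampens tf:
class SqrtLengthNormTfidf(BaseTfidf):
    def tf(self, count, doc):
        return math.sqrt(count / sum(c for _, c in doc))
```

Each variant then only overrides the hook it cares about, so the base class stays small and there's no parameter explosion in the constructor.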
I'll have to give this some more thought, though.
All the best
Felix
Good point. Can you fix the documentation?
Btw, the way it's implemented in your patch, `normalize_tf` has no
effect if `normalize` is already set, because the uniform scaling by
`doc_length=sum(tf)` is lost when `normalize` sets the entire vector
to unit length afterward. So the combination of `normalize_tf=True`
and `normalize=True` is just a slow no-op.
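A tiny demonstration of why it's a no-op (assuming plain L2 unit-length normalization):

```python
import math

# Demonstration that dividing by document length before unit-length (L2)
# normalization changes nothing: the uniform 1/doc_length factor cancels.
def l2_unit(vec):
    length = math.sqrt(sum(w * w for _, w in vec))
    return [(t, w / length) for t, w in vec]

doc = [(0, 3.0), (1, 1.0)]
doc_length = sum(w for _, w in doc)
scaled = [(t, w / doc_length) for t, w in doc]  # the normalize_tf step

a = l2_unit(doc)       # without the extra tf scaling
b = l2_unit(scaled)    # with it
# a and b agree up to floating-point rounding
```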
Since vectors are L2-normalized later in the pipeline anyway (for
cosine similarity), `normalize=True` typically holds, so I wonder what
the use case for `normalize_tf=True` is. Maybe you meant to use a
different formula?
Best,
Radim