Merging frequency distributions

1,425 views
Skip to first unread message

Nitin Madnani

unread,
Sep 29, 2009, 10:00:19 PM9/29/09
to nltk-dev
Hi Guys,

I want to be able to merge frequency distributions together into one
frequency distribution. Given that the 'update()' method for FreqDist
doesn't seem to work, the best I can come up with is the following
(inelegant) solution:

from nltk import FreqDist as FD
fd1, fd2, fd3, bigfd = FD(), FD(), FD(), FD()
fd1.inc('a', count=7)
fd2.inc('b', count=5)
fd3.inc('a', count=10)
allitems = []
for fd in fd1, fd2, fd3:
allitems.extend(fd.items())
blah = [bigfd.inc(x, count=y) for x,y in allitems]

Is there a better/more efficient way to accomplish this?

Steven Bird

unread,
Sep 29, 2009, 10:23:14 PM9/29/09
to nltk...@googlegroups.com
Nitin,

Python 3.1 has a Counter object which has the functionality you
require. Please see:

http://docs.python.org/3.1/library/collections.html?highlight=counter#collections.Counter

In time, nltk.FreqDist should extend collections.Counter, instead of
dict. In the meantime, I suggest that we add a new method update() to
FreqDist. Comments, anyone?

-Steven


2009/9/30 Nitin Madnani <nmad...@gmail.com>:

Nitin Madnani

unread,
Sep 29, 2009, 11:31:47 PM9/29/09
to nltk...@googlegroups.com
I think adding an 'update()' method to FreqDist is a good idea. Here's
the simplest implementation:

def update(self, other):
for sample,count in other.iteritems():
self.inc(sample, count=count)

Couple of additional thoughts:
(1) Would it be worth writing another method like 'fromfds()' that
creates a new FD from a bunch of given frequency distributions?
(2) Another thing we could do is have an 'update()' method for a
ConditionalFreqDist that basically calls 'update()' on each of the
FreqDists for each condition.

I would be happy to write all of these up and commit them if
everyone's okay with the idea.

Nitin

Joel Nothman

unread,
Sep 30, 2009, 12:37:31 AM9/30/09
to nltk...@googlegroups.com
I don't think you should overload the update method name. It has the
familiar semantics of replacing some content of one dict with that of
another. If you use the name update, users may (a) expect dict's
functionality to be replicated; (b) not expect addition to be
accomplished by update().

Perhaps the name inc_many is appropriate (perhaps it is not), and I
would suggest the following implementation:

def inc_many(self, samples):
try:
sample_iter = samples.iteritems()
except:
sample_iter = ((sample, 1) for sample in samples)
for sample, count in sample_iter:
self.inc(sample, count)

def __add__(self, other):
return self.copy().inc_many(other)


This will also enable the logical use of:
fd1 += fd2

You should also use inc_many() in __init__ instead of:
for sample in samples:
self.inc(sample)

- Joel

Joel Nothman

unread,
Sep 30, 2009, 12:38:48 AM9/30/09
to nltk...@googlegroups.com
On Wed, Sep 30, 2009 at 2:37 PM, Joel Nothman
<jnot...@student.usyd.edu.au> wrote:
> I don't think you should overload the update method name. It has the
> familiar semantics of replacing some content of one dict with that of
> another. If you use the name update, users may (a) expect dict's
> functionality to be replicated; (b) not expect addition to be
> accomplished by update().

Oh. I didn't realise the python API counter overloaded the update() name... :s

Steven Bird

unread,
Sep 30, 2009, 2:08:08 AM9/30/09
to nltk...@googlegroups.com
2009/9/30 Joel Nothman <jnot...@student.usyd.edu.au>:

> Oh. I didn't realise the python API counter overloaded the update() name... :s

I guess they lack your finely-tuned semantic intuitions, or operate on
a deeper and more mysterious level than we do :)

Nitin Madnani

unread,
Sep 30, 2009, 10:38:56 AM9/30/09
to nltk...@googlegroups.com
I kinda see Joel's point that using 'update()' might not tell people
that counts are added as well. However, given that Counter does the
same thing natively and when we port the FreqDist to inherit from
Counter, we will most likely use the name 'update()', I think it might
be better to start using that name now only.

What say?

Nitin

Joel Nothman

unread,
Sep 30, 2009, 9:13:52 PM9/30/09
to nltk...@googlegroups.com

Yes that's what my comment was to imply. update() should be fine, to
follow Counter's semantics. But, given that the FreqDist is constructed
with just an iterator of samples, I still suggest that update() is able to
apply those same semantics.

I.e. the following should give the same value for fd:

option 1:
>> fd = FreqDist(words)
>> fd.update(more_words)

option 2:
>> fd = FreqDist(words)
>> fd2 = FreqDist(more_words)
>> fd.update(fd2)

option 3:
>> fd = FreqDist(words)
>> fd += FreqDist(more_words)


- Joel

Nitin Madnani

unread,
Oct 1, 2009, 9:46:04 PM10/1/09
to nltk...@googlegroups.com
Yeah, that makes sense to me. So I started combining your suggestions
and my code into an update() method but then I realized the fd.copy()
is a shallow copy and results only in a dictionary. I tried seeing if
copy.deepcopy() would work but it doesn't. So, we have the following
options:

(a) We implement a __copy__ method in FreqDist that creates a proper
FreqDist copy
(b) We define an __init__ method in FreqDist that can also take in an
existing FreqDist just like how we can do d2 = dict(d1), where d1 and
d2 are dictionaries.

I like (b) personally because it's a little more intuitive. So, here's
what the final solution would look like:

def update(self, samples):
try:
sample_iter = samples.iteritems()
except:
sample_iter = ((sample,1) for sample in samples)
for sample,count in sample_iter():
self.inc(sample, count=count)

def __add__(self, other):
return FreqDist(self).update(other)

Comments?

Nitin

Joel Nothman

unread,
Oct 1, 2009, 10:29:55 PM10/1/09
to Nitin Madnani, nltk...@googlegroups.com

> Yeah, that makes sense to me. So I started combining your suggestions
> and my code into an update() method but then I realized the fd.copy()
> is a shallow copy and results only in a dictionary. I tried seeing if
> copy.deepcopy() would work but it doesn't.

The distinction is not that between a shallow and a deep copy. In fact, we
want the copy to be shallow (we only want a copy of the FreqDist's
structure, not a copy of all the elements inside of it).

The problem is only that the method dict.copy() apparently only knows how
to produce another dict, (not a derived class).

So for correct functionality it is probably best to override the existing
copy function with something more generic:

def copy(self):
return self.__class__.__new__(self.__class__, self)


And of course this requires __init__ to be changed to handle being passed
its own type as an argument:


def __init__(self, samples=None):
dict.__init__(self)
self._N = 0
self._Nr_cache = None
self._max_cache = None
self._item_cache = None
if samples:
self.update(samples)


> def update(self, samples):
> try:
> sample_iter = samples.iteritems()
> except:
> sample_iter = ((sample,1) for sample in samples)
> for sample,count in sample_iter():
> self.inc(sample, count=count)

According to style guides, "sample,count" should be "sample, count".

>
> def __add__(self, other):
> return FreqDist(self).update(other)

This or using the newly fixed copy function would essentially be
equivalent. Using copy() is perhaps a little clearer to read; it also
means that if someone decides to build a class inheriting from FreqDist
they won't have the same problems we had with copy(), i.e. the result of
__add__ being a FreqDist rather than the derived type.

- Joel

Nitin Madnani

unread,
Oct 1, 2009, 10:51:26 PM10/1/09
to nltk...@googlegroups.com
Joel,

Ah, I see where I was wrong about the deep/shallow copy. Thanks for
the clarification! I agree that modifying the copy semantics makes
more sense. So, here are the suggested modifications to the code:

def __init__(self, samples=None):
dict.__init__(self)
self._N = 0
self._Nr_cache = None
self._max_cache = None
self._item_cache = None
if samples:
self.update(samples)

def copy(self):
return self.__class__.__new__(self.__class__, self)

def update(self, samples):
try:
sample_iter = samples.iteritems()
except:
sample_iter = ((sample, 1) for sample in samples)
for sample, count in sample_iter():
self.inc(sample, count=count)

def __add__(self, other):
return self.copy().update(other)

I will also add/update docstrings appropriately. If I don't hear any
objections from anyone, I will go ahead and commit the change. Is that
okay with you, Steven?

Thanks again, Joel!
Nitin

Steven Bird

unread,
Oct 1, 2009, 11:36:46 PM10/1/09
to nltk...@googlegroups.com
2009/10/2 Nitin Madnani <nmad...@umiacs.umd.edu>:

> If I don't hear any objections from anyone, I will go ahead and
> commit the change. Is that okay with you, Steven?

Its great to see this collaboration on the solution.

As a procedural matter, please first check (and revise) the test cases
I've added to probability.doctest, based on Joel's earlier post.

http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/test/probability.doctest

I think we might need some more cases. Once that's agreed, please
test and commit the proposed changes.

Also, please note that NLTK still needs to work with Python 2.4, so no
generator expressions sorry!

Thanks,
-Steven

Nitin Madnani

unread,
Oct 2, 2009, 12:01:20 PM10/2/09
to nltk...@googlegroups.com
Steven,

Good point! I looked at the doctests. First thing, shouldn't 'samples
()' be 'items()' actually?

I think what you have is pretty sufficient but we can probably add two
more doctests:

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(text2)
>>> fd3 = fd1 + fd2
>>> fd3.items()
[('fish', 3), ('good', 2), ('no', 2), ('anywhere', 2),
('porpoise', 2), ('a', 1),
('!', 1), ('.', 1), ('to', 1), ('without', 1), ('likes', 1),
('goes', 1)]

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)
>>> fd2.items()
[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('anywhere', 1),
('without', 1), ('goes', 1), ('porpoise', 1)]

Now, back to the code. I added all the suggested code to
probability.py (changing the generator expression to an iterator using
imap). However, for some reason, __init__ is not being called by
__new__ when creating a copy of the FreqDist. Here's a session showing
what happens:

>>> from nltk import FreqDist as FD

>>> fd1 = FD('a')
>>> fd1
<FreqDist with 1 outcomes>
>>> fd2 = fd1.copy()
>>> fd2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 488, in __repr__
return '<FreqDist with %d outcomes>' % self.N()
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 150, in N
return self._N
AttributeError: 'FreqDist' object has no attribute '_N'
>>> fd2.items()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 407, in items
self._sort_keys_by_value()
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 376, in _sort_keys_by_value
if not self._item_cache:
AttributeError: 'FreqDist' object has no attribute '_item_cache'
>>> dir(fd2)
['B', 'N', 'Nr', '__add__', '__class__', '__cmp__',
'__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__',
'__eq__', '__ge__', '__getattribute__', '__getitem__', '__gt__',
'__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setitem__', '__str__', '__weakref__',
'_cache_Nr_values', '_cumulative_frequencies', '_sort_keys_by_value',
'clear', 'copy', 'count', 'freq', 'fromkeys', 'get', 'hapaxes',
'has_key', 'inc', 'items', 'iteritems', 'iterkeys', 'itervalues',
'keys', 'max', 'plot', 'pop', 'popitem', 'samples', 'setdefault',
'sorted', 'sorted_samples', 'tabulate', 'update', 'values']

So, the attributes that are supposed to be initialized by __init__
(_N, _Nr_cache, _item_cache etc.) are not present in the copy at all!
Simple debugging shows that this is most likely because __init__ is
not being called by the __new__ method. Recall that __new__ is called
in the copy() method:

def copy(self):
return self.__class__.__new__(self.__class__, self)

Note that this also means that the '__add__' method doesn't work
properly since it relies on the 'copy()' method. I am not sure why
this should be happening since the Python reference clearly states that:

"If __new__() returns an instance of cls, then the new instance’s
__init__() method will be invoked like __init__(self[, ...]), where
self is the new instance and the remaining arguments are the same as
were passed to __new__()."

and here the __new__ method *does* return an instance of FreqDist. I
am stumped. Am I doing something wrong?

Thanks!
Nitin

Joel Nothman

unread,
Oct 4, 2009, 8:27:52 AM10/4/09
to nltk...@googlegroups.com

> Good point! I looked at the doctests. First thing, shouldn't 'samples
> ()' be 'items()' actually?

Yes, IMO.

A couple of comments:

(a) it is not clear from the tests there that the output is identical.
That needs to be done by inspection. You might also want to show that the
union of the FreqDists is the FreqDist of the union of the texts. I.e., I
think it would be better as:

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(text2)

>>> both = nltk.FreqDist(text1 + text2)
>>> fd1.update(fd2)
>>> fd1 == both
True

>>> fd1 = nltk.FreqDist(text1)
>>> fd1.update(text2)
>>> fd1 == both
True

(b) The third test with the += syntax should probably make clear its
distinction from the update syntax:

>>> fd1 = nltk.FreqDist(text1)
>>> both == fd1 + nltk.FreqDist(text2)
True
>>> fd1 == nltk.FreqDist(text1) # But fd1 is unchanged
True

This is suggested by Nitin's first additional test but, I'd suggest, is
more precise.

> >>> fd1 = nltk.FreqDist(text1)
> >>> fd2 = nltk.FreqDist(fd1)
> >>> fd2.items()
> [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('anywhere', 1),
> ('without', 1), ('goes', 1), ('porpoise', 1)]

Yes, that's an important constructor test.


> Now, back to the code. I added all the suggested code to
> probability.py (changing the generator expression to an iterator using
> imap). However, for some reason, __init__ is not being called by
> __new__ when creating a copy of the FreqDist.

Argh. Yes, the standard Python types can be a real pain to inherit from
for these sorts of reasons. You can define a default-style __new__ method
if need be:

def __new__(cls, *args, **kwargs):
return dict.__new__(cls)


Hope that works...

- Joel

Nitin Madnani

unread,
Oct 4, 2009, 5:17:02 PM10/4/09
to nltk...@googlegroups.com
> Argh. Yes, the standard Python types can be a real pain to inherit
> from
> for these sorts of reasons. You can define a default-style __new__
> method
> if need be:
>
> def __new__(cls, *args, **kwargs):
> return dict.__new__(cls)
>
>
> Hope that works…

No dice. __init__() still isn't being called.

BTW, Shouldn't that be:

def __new__(cls, *args, **kwargs):
return dict.__new__(cls, *args, *kwargs)

given that 'copy()' is defined as follows:

def copy(self):
return self.__class__.__new__(self.__class__, self)

I mean, don't we need to pass along the second argument from copy
(which is the freqdist instance being copied) to __new__ that it will
then be used in __init__?

Just curious.
Nitin

Steven Bird

unread,
Oct 4, 2009, 5:25:39 PM10/4/09
to nltk...@googlegroups.com
2009/10/4 Joel Nothman <jnot...@student.usyd.edu.au>

>
> (a) it is not clear from the tests there that the output is identical.
> That needs to be done by inspection. You might also want to show that the
> union of the FreqDists is the FreqDist of the union of the texts.  ...

I agree -- that's the right way to do it.

>      >>> fd1 = nltk.FreqDist(text1)
>      >>> fd2 = nltk.FreqDist(fd1)
>      >>> fd2.items()
>      [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('anywhere', 1),
>      ('without', 1), ('goes', 1), ('porpoise', 1)]

> Yes, that's an important constructor test.

It should also just do an equality check though:

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)

>>> fd2 == fd1

>     def __new__(cls, *args, **kwargs):
>         return dict.__new__(cls)

I agree with Nitin that this should pass on args and kwargs to dict.__new__()

-Steven

Joel Nothman

unread,
Oct 4, 2009, 9:24:21 PM10/4/09
to nltk...@googlegroups.com

Sorry, I messed things up.

No, one does not pass the args from FreqDist.__new__ to dict.__new__.
That's not how __new__ works. Not that I've understood well it until now.

The definition for __new__ was correct, but you don't need it.

BUT the definition for copy() was incorrect as you can see from the
snippet below.


>>> class B(dict):
>>> def __init__(self, *args, **kwargs):
>>> print 'init'
>>> dict.__init__(self, *args, **kwargs)
>>> def __new__(cls, *args, **kwargs):
>>> print 'new'
>>> return dict.__new__(cls)
>>> def copy(self):
>>> return self.__class__(self)
>>> def copy2(self):
>>> return self.__class__.__new__(self.__class__, self)
>>>b = B(a=1, b=2, c=3)
new
init
>>> print b, type(b)
{'a': 1, 'c': 3, 'b': 2} <class '__main__.B'>
>>> c = b.copy()
new
init
>>> print c, type(c)
{'a': 1, 'c': 3, 'b': 2} <class '__main__.B'>
>>> c = b.copy2()
new
>>> print c, type(c)
{} <class '__main__.B'>


What wasn't clear is that __new__ and __init__ are called correctly when
you instantiate an object using, e.g. obj.__class__(*args, **kwargs). The
new copy() method does that.

Remind me to bother testing things out in the future before sending them.

- Joel

Nitin Madnani

unread,
Oct 4, 2009, 9:03:14 PM10/4/09
to nltk...@googlegroups.com
Joel,

I think I understand now from the very useful example you provided,

(1) __new__() *always* calls __init() with the arguments and this
invocation happens automatically. So, my thinking that this happens
because of the dict.__new__() call was incorrect. It will happen with
or without that particular statement.

(2) Just calling self.__class__(self) will automatically invoke
__new__ which will automatically invoke __init__() with the right
argument.

Does this sound right?

This is great! Clarifies a lot of the confusion I had regarding
__new__ and __init__ in general.

Joel Nothman

unread,
Oct 4, 2009, 10:28:59 PM10/4/09
to nltk...@googlegroups.com

Not exactly correct. That's sort of what I'd thought too.

In particular, __new__ does *not* call magically __init__. This is why you
do not need to pass the arguments in __new__ up to the superclass.

There is a magic function (perhaps it is object.__class__.__call__) which
handles what happens when you go: MyClass(*args, **kwargs).

It looks like this:

def _____constructor_func_____(cls, *args, **kwargs):
inst = cls.__new__(cls, *args, **kwargs)
if isinstance(inst, cls):
inst.__init__(*args, **kwargs)
return inst


SO: while __new__ is passed the same arguments as __init__, it does not
call __init__.

If __new__ is being called by the magic constructor function, and it
returns an instance of the expected type, *then* __init__ will be called
to initialise the object's variables.

I hope that helps,

- Joel

Nitin Madnani

unread,
Oct 4, 2009, 9:40:21 PM10/4/09
to nltk...@googlegroups.com
Ahh, okay. Now all the stuff I have been reading about Callables,
Metaclasses etc. in Python seems to be coming together.

__call__ is a function that makes anything callable (very crudely
speaking). So, when we instantiate the class, its __call__ method is
called (unless there is a metaclass involved which we won't go into
now) and that __call__ function first calls __new__. If that __new__
call returns an instance of the class we were expecting, then __init__
is called with the correct arguments. So, the reason there doesn't
need to be a direct connection (in terms of the arguments) between
__new__ and __init__ because there's a higher level function
(__call___) providing the connection.

BTW, our own Steven Bethard provided a very explanation of __call__ of
this on comp.lang.python. It was in the context of metaclasses but
it's still a pretty useful read, IMO.

http://groups.google.com/group/comp.lang.python/browse_thread/thread/fe77dd2c97837541/


Nitin

Nitin Madnani

unread,
Oct 5, 2009, 8:59:52 AM10/5/09
to nltk...@googlegroups.com
So, I made the changes to the code as we had talked about. BTW, I
realized that we had defined __add__ incorrectly. Here's what we had:

def __add__(self, other):
return self.copy().update(other)

However, this won't work because update() doesn't return anything. So,
I changed this to:

def __add__(self, other):
clone = self.copy()
clone.update(other)
return clone

This seems to work fine. I then added the following doctests based on
our discussions:

>>> text1 = ['no', 'good', 'fish', 'goes', 'anywhere', 'without',
'a', 'porpoise', '!']
>>> text2 = ['no', 'good', 'porpoise', 'likes', 'to', 'fish',
'fish', 'anywhere', '.']

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(text2)
>>> both = nltk.FreqDist(text1 + text2)
>>> fd1.update(fd2)
>>> fd1 == both
True

>>> fd1 = nltk.FreqDist(text1)
>>> fd1.update(text2)
>>> fd1 == both
True

>>> fd1 = nltk.FreqDist(text1)


>>> both == fd1 + nltk.FreqDist(text2)
True
>>> fd1 == nltk.FreqDist(text1) # But fd1 is unchanged
True

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)
>>> fd2 == fd1
True

However, equality tests no. 3 and 5 fail. Here is what's going on with
no. 5 (I chose that one because it's easier to explain but I think the
problem is the same with number 3):

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)

>>> print fd1.items()


[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('anywhere', 1), ('without', 1), ('goes', 1), ('porpoise', 1)]

>>> print fd2.items()


[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),

('without', 1), ('anywhere', 1), ('goes', 1), ('porpoise', 1)]

It looks like during sorting the ties are not being broken the same
way for both the lists. To make sure that any of the new code was not
responsible for this, I reverted probability.py to trunk and tested
the same semantics again in a more direct way:

>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist()
>>> for s, c in fd1.items():
fd2.inc(s, count=c)
>>> fd2 == fd1
False
>>> print fd1.items()


[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('anywhere', 1), ('without', 1), ('goes', 1), ('porpoise', 1)]

>>> print fd2.items()


[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),

('without', 1), ('anywhere', 1), ('goes', 1), ('porpoise', 1)]

Same problem. I even ran the same code on another (linux) machine to
make sure that the problem is reproducible and it is. I guess this
issue has been there for a while and we didn't know about it. From the
Python Wiki page on Sorting (http://wiki.python.org/moin/HowTo/
Sorting), it looks like we need to sort twice in order to enforce a
uniform way of breaking ties:
<begin quote>
To reverse sort based on the second item in each tuple, but forward
sort based on the first item when a tie is met, then [sic] forward
sort first and then reverse sort based on 2nd item:
>>> L = [('a', 2), ('d', 4), ('b', 3), ('c', 2)]
>>> L.sort(); L
[('a', 2), ('b', 3), ('c', 2), ('d', 4)]
>>> sorted(L,key=operator.itemgetter(1),reverse=True)
[('d', 4), ('b', 3), ('a', 2), ('c', 2)]
</end quote>

So, may be we need to do the same double sorting in the
_sort_keys_by_value() method? Unless you guys know that there is
something else going on …

Nitin

Steven Bethard

unread,
Oct 5, 2009, 3:46:21 PM10/5/09
to nltk...@googlegroups.com
On Sun, Oct 4, 2009 at 7:28 PM, Joel Nothman
<jnot...@student.usyd.edu.au> wrote:
> There is a magic function (perhaps it is object.__class__.__call__) which
> handles what happens when you go: MyClass(*args, **kwargs).
>
> It looks like this:
>
> def _____constructor_func_____(cls, *args, **kwargs):
>     inst = cls.__new__(cls, *args, **kwargs)
>     if isinstance(inst, cls):
>         inst.__init__(*args, **kwargs)
>     return inst

Excellent explanation, and yes, this function is actually
type.__call__. If you're curious, you can see the code for it here:

http://svn.python.org/projects/python/trunk/Objects/typeobject.c

Just search for "type_call" -- it's less than 30 lines of C code that
do pretty much exactly what Joel shows above.

Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
--- The Hiphopopotamus

Steven Bird

unread,
Oct 31, 2009, 2:37:35 PM10/31/09
to nltk...@googlegroups.com
[Returning to this unresolved thread...]

2009/10/5 Nitin Madnani <nmad...@umiacs.umd.edu>:


> However, equality tests no. 3 and 5 fail. Here is what's going on with
> no. 5 (I chose that one because it's easier to explain but I think the
> problem is the same with number 3):
>
>     >>> fd1 = nltk.FreqDist(text1)
>     >>> fd2 = nltk.FreqDist(fd1)
>     >>> print fd1.items()
>     [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('anywhere', 1), ('without', 1), ('goes', 1), ('porpoise', 1)]
>     >>> print fd2.items()
>     [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('without', 1), ('anywhere', 1), ('goes', 1), ('porpoise', 1)]
>
> It looks like during sorting the ties are not being broken the same
> way for both the lists.

Thanks for diagnosing the problem. The FreqDist.__eq__() method does
a reverse sort on the values (the counts) and ignores the keys. Thus
there is no canonical order amongst items having the same count. The
problem is with line 336 in probability.py:

sorted(items, key=itemgetter(1), reverse=True)

We can trivially include the FreqDist key as a secondary sort key:

sorted(items, key=itemgetter(1, 0), reverse=True)

However, now FreqDist items having the same count are sorted in
reverse alphabetical order. Here's what I came up with instead:

sorted(items, key=lambda x:(-x[1], x[0]))

Now the FreqDist doctests that used == should behave as expected.

Nitin -- how about committing your implementations of update() and
__add__(), and letting us test them out.

Thanks,
-Steven

Reply all
Reply to author
Forward
0 new messages