Oh. I didn't realise the python API counter overloaded the update() name... :s
I guess they lack your finely-tuned semantic intuitions, or operate on
a deeper and more mysterious level than we do :)
Its great to see this collaboration on the solution.
As a procedural matter, please first check (and revise) the test cases
I've added to probability.doctest, based on Joel's earlier post.
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk/test/probability.doctest
I think we might need some more cases. Once that's agreed, please
test and commit the proposed changes.
Also, please note that NLTK still needs to work with Python 2.4, so no
generator expressions sorry!
Thanks,
-Steven
Good point! I looked at the doctests. First thing, shouldn't 'samples
()' be 'items()' actually?
I think what you have is pretty sufficient but we can probably add two
more doctests:
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(text2)
>>> fd3 = fd1 + fd2
>>> fd3.items()
[('fish', 3), ('good', 2), ('no', 2), ('anywhere', 2),
('porpoise', 2), ('a', 1),
('!', 1), ('.', 1), ('to', 1), ('without', 1), ('likes', 1),
('goes', 1)]
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)
>>> fd2.items()
[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('anywhere', 1),
('without', 1), ('goes', 1), ('porpoise', 1)]
Now, back to the code. I added all the suggested code to
probability.py (changing the generator expression to an iterator using
imap). However, for some reason, __init__ is not being called by
__new__ when creating a copy of the FreqDist. Here's a session showing
what happens:
>>> from nltk import FreqDist as FD
>>> fd1 = FD('a')
>>> fd1
<FreqDist with 1 outcomes>
>>> fd2 = fd1.copy()
>>> fd2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 488, in __repr__
return '<FreqDist with %d outcomes>' % self.N()
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 150, in N
return self._N
AttributeError: 'FreqDist' object has no attribute '_N'
>>> fd2.items()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 407, in items
self._sort_keys_by_value()
File "/Users/nmadnani/nltk-devel/nltk/nltk/probability.py",
line 376, in _sort_keys_by_value
if not self._item_cache:
AttributeError: 'FreqDist' object has no attribute '_item_cache'
>>> dir(fd2)
['B', 'N', 'Nr', '__add__', '__class__', '__cmp__',
'__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__',
'__eq__', '__ge__', '__getattribute__', '__getitem__', '__gt__',
'__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setitem__', '__str__', '__weakref__',
'_cache_Nr_values', '_cumulative_frequencies', '_sort_keys_by_value',
'clear', 'copy', 'count', 'freq', 'fromkeys', 'get', 'hapaxes',
'has_key', 'inc', 'items', 'iteritems', 'iterkeys', 'itervalues',
'keys', 'max', 'plot', 'pop', 'popitem', 'samples', 'setdefault',
'sorted', 'sorted_samples', 'tabulate', 'update', 'values']
So, the attributes that are supposed to be initialized by __init__
(_N, _Nr_cache, _item_cache etc.) are not present in the copy at all!
Simple debugging shows that this is most likely because __init__ is
not being called by the __new__ method. Recall that __new__ is called
in the copy() method:
def copy(self):
return self.__class__.__new__(self.__class__, self)
Note that this also means that the '__add__' method doesn't work
properly since it relies on the 'copy()' method. I am not sure why
this should be happening since the Python reference clearly states that:
"If __new__() returns an instance of cls, then the new instance’s
__init__() method will be invoked like __init__(self[, ...]), where
self is the new instance and the remaining arguments are the same as
were passed to __new__()."
and here the __new__ method *does* return an instance of FreqDist. I
am stumped. Am I doing something wrong?
Thanks!
Nitin
> Good point! I looked at the doctests. First thing, shouldn't 'samples
> ()' be 'items()' actually?
Yes, IMO.
A couple of comments:
(a) it is not clear from the tests there that the output is identical.
That needs to be done by inspection. You might also want to show that the
union of the FreqDists is the FreqDist of the union of the texts. I.e., I
think it would be better as:
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(text2)
>>> both = nltk.FreqDist(text1 + text2)
>>> fd1.update(fd2)
>>> fd1 == both
True
>>> fd1 = nltk.FreqDist(text1)
>>> fd1.update(text2)
>>> fd1 == both
True
(b) The third test with the += syntax should probably make clear its
distinction from the update syntax:
>>> fd1 = nltk.FreqDist(text1)
>>> both == fd1 + nltk.FreqDist(text2)
True
>>> fd1 == nltk.FreqDist(text1) # But fd1 is unchanged
True
This is suggested by Nitin's first additional test but, I'd suggest, is
more precise.
> >>> fd1 = nltk.FreqDist(text1)
> >>> fd2 = nltk.FreqDist(fd1)
> >>> fd2.items()
> [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('anywhere', 1),
> ('without', 1), ('goes', 1), ('porpoise', 1)]
Yes, that's an important constructor test.
> Now, back to the code. I added all the suggested code to
> probability.py (changing the generator expression to an iterator using
> imap). However, for some reason, __init__ is not being called by
> __new__ when creating a copy of the FreqDist.
Argh. Yes, the standard Python types can be a real pain to inherit from
for these sorts of reasons. You can define a default-style __new__ method
if need be:
def __new__(cls, *args, **kwargs):
return dict.__new__(cls)
Hope that works...
- Joel
No dice. __init__() still isn't being called.
BTW, Shouldn't that be:
def __new__(cls, *args, **kwargs):
return dict.__new__(cls, *args, *kwargs)
given that 'copy()' is defined as follows:
def copy(self):
return self.__class__.__new__(self.__class__, self)
I mean, don't we need to pass along the second argument from copy
(which is the freqdist instance being copied) to __new__ that it will
then be used in __init__?
Just curious.
Nitin
I agree -- that's the right way to do it.
> >>> fd1 = nltk.FreqDist(text1)
> >>> fd2 = nltk.FreqDist(fd1)
> >>> fd2.items()
> [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('anywhere', 1),
> ('without', 1), ('goes', 1), ('porpoise', 1)]
> Yes, that's an important constructor test.
It should also just do an equality check though:
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)
>>> fd2 == fd1
> def __new__(cls, *args, **kwargs):
> return dict.__new__(cls)
I agree with Nitin that this should pass on args and kwargs to dict.__new__()
-Steven
No, one does not pass the args from FreqDist.__new__ to dict.__new__.
That's not how __new__ works. Not that I've understood well it until now.
The definition for __new__ was correct, but you don't need it.
BUT the definition for copy() was incorrect as you can see from the
snippet below.
>>> class B(dict):
>>> def __init__(self, *args, **kwargs):
>>> print 'init'
>>> dict.__init__(self, *args, **kwargs)
>>> def __new__(cls, *args, **kwargs):
>>> print 'new'
>>> return dict.__new__(cls)
>>> def copy(self):
>>> return self.__class__(self)
>>> def copy2(self):
>>> return self.__class__.__new__(self.__class__, self)
>>>b = B(a=1, b=2, c=3)
new
init
>>> print b, type(b)
{'a': 1, 'c': 3, 'b': 2} <class '__main__.B'>
>>> c = b.copy()
new
init
>>> print c, type(c)
{'a': 1, 'c': 3, 'b': 2} <class '__main__.B'>
>>> c = b.copy2()
new
>>> print c, type(c)
{} <class '__main__.B'>
What wasn't clear is that __new__ and __init__ are called correctly when
you instantiate an object using, e.g. obj.__class__(*args, **kwargs). The
new copy() method does that.
Remind me to bother testing things out in the future before sending them.
- Joel
def __add__(self, other):
return self.copy().update(other)
However, this won't work because update() doesn't return anything. So,
I changed this to:
def __add__(self, other):
clone = self.copy()
clone.update(other)
return clone
This seems to work fine. I then added the following doctests based on
our discussions:
>>> text1 = ['no', 'good', 'fish', 'goes', 'anywhere', 'without',
'a', 'porpoise', '!']
>>> text2 = ['no', 'good', 'porpoise', 'likes', 'to', 'fish',
'fish', 'anywhere', '.']
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(text2)
>>> both = nltk.FreqDist(text1 + text2)
>>> fd1.update(fd2)
>>> fd1 == both
True
>>> fd1 = nltk.FreqDist(text1)
>>> fd1.update(text2)
>>> fd1 == both
True
>>> fd1 = nltk.FreqDist(text1)
>>> both == fd1 + nltk.FreqDist(text2)
True
>>> fd1 == nltk.FreqDist(text1) # But fd1 is unchanged
True
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)
>>> fd2 == fd1
True
However, equality tests no. 3 and 5 fail. Here is what's going on with
no. 5 (I chose that one because it's easier to explain but I think the
problem is the same with number 3):
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist(fd1)
>>> print fd1.items()
[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('anywhere', 1), ('without', 1), ('goes', 1), ('porpoise', 1)]
>>> print fd2.items()
[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('without', 1), ('anywhere', 1), ('goes', 1), ('porpoise', 1)]
It looks like during sorting the ties are not being broken the same
way for both the lists. To make sure that any of the new code was not
responsible for this, I reverted probability.py to trunk and tested
the same semantics again in a more direct way:
>>> fd1 = nltk.FreqDist(text1)
>>> fd2 = nltk.FreqDist()
>>> for s, c in fd1.items():
fd2.inc(s, count=c)
>>> fd2 == fd1
False
>>> print fd1.items()
[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('anywhere', 1), ('without', 1), ('goes', 1), ('porpoise', 1)]
>>> print fd2.items()
[('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
('without', 1), ('anywhere', 1), ('goes', 1), ('porpoise', 1)]
Same problem. I even ran the same code on another (linux) machine to
make sure that the problem is reproducible and it is. I guess this
issue has been there for a while and we didn't know about it. From the
Python Wiki page on Sorting (http://wiki.python.org/moin/HowTo/
Sorting), it looks like we need to sort twice in order to enforce a
uniform way of breaking ties:
<begin quote>
To reverse sort based on the second item in each tuple, but forward
sort based on the first item when a tie is met, then [sic] forward
sort first and then reverse sort based on 2nd item:
>>> L = [('a', 2), ('d', 4), ('b', 3), ('c', 2)]
>>> L.sort(); L
[('a', 2), ('b', 3), ('c', 2), ('d', 4)]
>>> sorted(L,key=operator.itemgetter(1),reverse=True)
[('d', 4), ('b', 3), ('a', 2), ('c', 2)]
</end quote>
So, may be we need to do the same double sorting in the
_sort_keys_by_value() method? Unless you guys know that there is
something else going on …
Nitin
Excellent explanation, and yes, this function is actually
type.__call__. If you're curious, you can see the code for it here:
http://svn.python.org/projects/python/trunk/Objects/typeobject.c
Just search for "type_call" -- it's less than 30 lines of C code that
do pretty much exactly what Joel shows above.
Steve
--
Where did you get that preposterous hypothesis?
Did Steve tell you that?
--- The Hiphopopotamus
2009/10/5 Nitin Madnani <nmad...@umiacs.umd.edu>:
> However, equality tests no. 3 and 5 fail. Here is what's going on with
> no. 5 (I chose that one because it's easier to explain but I think the
> problem is the same with number 3):
>
> >>> fd1 = nltk.FreqDist(text1)
> >>> fd2 = nltk.FreqDist(fd1)
> >>> print fd1.items()
> [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('anywhere', 1), ('without', 1), ('goes', 1), ('porpoise', 1)]
> >>> print fd2.items()
> [('a', 1), ('!', 1), ('good', 1), ('no', 1), ('fish', 1),
> ('without', 1), ('anywhere', 1), ('goes', 1), ('porpoise', 1)]
>
> It looks like during sorting the ties are not being broken the same
> way for both the lists.
Thanks for diagnosing the problem. The FreqDist.__eq__() method does
a reverse sort on the values (the counts) and ignores the keys. Thus
there is no canonical order amongst items having the same count. The
problem is with line 336 in probability.py:
sorted(items, key=itemgetter(1), reverse=True)
We can trivially include the FreqDist key as a secondary sort key:
sorted(items, key=itemgetter(1, 0), reverse=True)
However, now FreqDist items having the same count are sorted in
reverse alphabetical order. Here's what I came up with instead:
sorted(items, key=lambda x:(-x[1], x[0]))
Now the FreqDist doctests that used == should behave as expected.
Nitin -- how about committing your implementations of update() and
__add__(), and letting us test them out.
Thanks,
-Steven