def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

>>> f([0, 0, 1, 1, 2, 2, 3])
set([0, 1, 2])
def f(L):
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i,j in D.items() if j > 1]
That would be my way to do it; I'd need to test it over several
thousand iterations to see which one is more efficient, though.
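For a rough comparison, the timeit module does the job. Here's a quick harness; the names f_remove and f_count are just labels for the two versions above, since both were posted as f:

import timeit

def f_remove(L):
    '''Set-and-remove version.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

def f_count(L):
    '''Dict-counting version.'''
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i, j in D.items() if j > 1]

data = range(1000) * 2   # every value occurs exactly twice

for name in ('f_remove', 'f_count'):
    t = timeit.Timer('%s(data)' % name,
                     'from __main__ import %s, data' % name)
    print name, min(t.repeat(3, 10))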
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/502263
That's neat, but quadratic time because list.remove() requires
a linear search. We can make an efficient variant by using
remove on a set rather than a list:
def multiples(lst):
    singles = set(lst)
    mults = set()
    for x in lst:
        if x in singles:
            singles.remove(x)
        else:
            mults.add(x)
    return mults
Though probably better is:
def multiples(lst):
    seen = set()
    mults = set()
    for x in lst:
        if x in seen:
            mults.add(x)
        else:
            seen.add(x)
    return mults
I've typically used dicts for such things, as in:
def multiples(lst):
    h = {}
    for x in lst:
        h[x] = h.get(x, 0) + 1
    return set([x for x in h if h[x] > 1])
--
--Bryan
The dictionary version Chris suggests (and the essentially equivalent
set-based approach) is doing essentially the same thing in a way, but
using hashing rather than ordering to organise the list and spot
duplicates. This is *not* O(n) due to the rate of collisions
increasing as the hash table fills. If collisions are handled by
building a linked list for each hash table entry, the performance
overall is still O(n^2) since (for large amounts of data) there is
still a linear search on each collision. If collisions are handled by
building binary trees, the performance is back to O(n log n). That is,
for large enough datasets, the performance of hash tables is basically
the performance of the collision handling data structure (ignoring
scaling constants).
I suspect Python handles collisions using linked lists, since using
trees would require that all dictionary keys support ordered
comparisons. Using a dictionary or set to eliminate duplicates
therefore gives O(n^2) performance.
That said, hash tables have very good scaling constants to their
performance, so the dictionary technique will be a very good performer
in general. And if the lists are reasonably small the performance will
often seem like O(n) in practice.
The sort-then-filter approach should still be competitive, but of
course it requires that the contents of the list can be ordered
consistently. An inappropriate hash function can similarly cause
problems for the dictionary approach, causing that theoretical O(n^2)
performance to become apparent very quickly. This is probably one
reason why the cookbook recipe recommends an O(n^2) searching-based
approach.
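For concreteness, one way to spell the sort-then-filter approach is with itertools.groupby; this is only an illustration (duplicates_sorted is a made-up name), not the cookbook recipe linked above:

from itertools import groupby

def duplicates_sorted(lst):
    '''Items that occur more than once, assuming the items can be ordered.'''
    dups = []
    for key, group in groupby(sorted(lst)):
        # After sorting, equal items are adjacent, so groupby clusters them;
        # a cluster with more than one member means a duplicate.
        if len(list(group)) > 1:
            dups.append(key)
    return dups

print duplicates_sorted([0, 0, 1, 1, 2, 2, 3])   # prints [0, 1, 2]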
> The dictionary version Chris suggests (and the essentially
> equivalent set-based approach) is doing essentially the same thing
> in a way, but using hashing rather than ordering to organise the
> list and spot duplicates. This is *not* O(n) due to the rate of
> collisions increasing as the hash table fills. If collisions are
> handled by building a linked list for each hash table entry, the
> performance overall is still O(n^2)
This doesn't apply to Python, which implements dict storage as an
open-addressed table and automatically (and exponentially) grows the
table when the number of entries approaches 2/3 of the table size.
Assuming a good hash function, filling the dict should yield amortized
constant time for individual additions.
> I suspect Python handles collisions using linked lists,
Why suspect? It's trivial to check. dictobject.c says:
The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
Open addressing is preferred over chaining since the link overhead for
chaining would be substantial (100% with typical malloc overhead).
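To make the idea concrete, here is a toy open-addressed table with linear probing that grows once it passes about 2/3 full. It is only a sketch of the general technique (the class, its probe and its growth factor are made up here); CPython's actual probe sequence and resizing, described in dictobject.c, are more sophisticated:

class ToyDict(object):
    '''Toy open-addressed hash table with linear probing; illustration only.'''

    def __init__(self):
        self.size = 8
        self.slots = [None] * self.size   # each slot is None or a (key, value) pair
        self.used = 0

    def _probe(self, key):
        # Start at hash(key) and walk forward until we find the key
        # or an empty slot.  This scan is the collision handling.
        i = hash(key) % self.size
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size
        return i

    def __setitem__(self, key, value):
        i = self._probe(key)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (key, value)
        if self.used * 3 >= self.size * 2:   # more than ~2/3 full: grow
            self._resize()

    def __getitem__(self, key):
        entry = self.slots[self._probe(key)]
        if entry is None:
            raise KeyError(key)
        return entry[1]

    def _resize(self):
        entries = [e for e in self.slots if e is not None]
        self.size *= 4                       # grow geometrically
        self.slots = [None] * self.size
        self.used = 0
        for key, value in entries:           # rehash everything once
            self[key] = value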
> This doesn't apply to Python, which implements dict storage as an
> open-addressed table and automatically (and exponentially) grows the
> table when the number of entries approaches 2/3 of the table size.
> Assuming a good hash function, filling the dict should yield amortized
> constant time for individual additions.
OK. I obviously need to look up open-addressed tables. I thought this
was just, in effect, implicit linked listing - ie it still needs a
linear search to handle collisions, it just avoids the need for
explicitly stored link fields. Perhaps I'm mistaken.
As for the growth pattern, each time you grow the table you have to
redistribute all the items previously inserted to new locations.
Resizes would get rarer as more items are added due to the exponential
growth, but every table resize would take longer too since there are
more items to move. Inserting n items still intuitively looks like
O(n^2) to me.
That said, it does remind me of that old exponential realloc trick for
array resizing. Same thing, I suppose, since a hash table is basically
an array. Maybe my math "intuition" is just wrong.
def f(L):
    seen = set()
    dups = set()
    for e in L:
        if e in seen:
            dups.add(e)
        else:
            seen.add(e)
    return dups
Raymond
> As for the growth pattern, each time you grow the table you have to
> redistribute all the items previously inserted to new locations.
> Resizes would get rarer as more items are added due to the
> exponential growth, but every table resize would take longer too
> since there are more items to move. Inserting n items still
> intuitively looks like O(n^2) to me.
>
> That said, it does remind me of that old exponential realloc trick
> for array resizing. Same thing, I suppose, since a hash table is
> basically an array. Maybe my math "intuition" is just wrong.
That's it. While table resizes grow linearly in complexity, they
become geometrically rarer. This is exactly what happens when
resizing dynamic arrays such as Python lists. Applying your
intuition, appending n elements to a Python list would also have
O(n^2) complexity, which it doesn't. See, for example,
http://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost
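A quick sanity check of that: count the element copies a doubling table performs while appending n items (a toy model, not CPython's exact growth policy). The copies total less than 2n, i.e. constant amortized work per append:

def copies_for_doubling(n):
    '''Elements moved by resizes while appending n items to a table that
    doubles its capacity each time it fills up.'''
    capacity, used, copied = 1, 0, 0
    for _ in range(n):
        if used == capacity:
            copied += used       # a resize copies every existing item once
            capacity *= 2
        used += 1
    return copied

for n in (10, 100, 10000, 1000000):
    print n, copies_for_doubling(n), float(copies_for_doubling(n)) / n
# The last column stays below 2: O(1) amortized copies per append.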
> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)
def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : not(t[0]-t[1]), zip(slst[:-1],slst[1:]))]))
> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t : not(t[0]-t[1]), zip(slst[:-1],slst[1:]))]))
Or perhaps better:
def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))
The items are all comparable and you're willing to take them out of order?
from collections import defaultdict
def nonunique(lst):
    d = defaultdict(int)
    for x in lst: d[x] += 1
    return [x for x,n in d.iteritems() if n > 1]
> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))
Obviously that should be 'lambda t : t[0] == t[1]'. Instead of using
the set function, there is more structure to exploit when the list is
sorted:
def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
        filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
    return [dups[0]] + [s[1] for s in
        filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k
>>> list(duplicates([1, 2, 3, 2, 2, 4, 1]))
[1, 2]
--
Arnaud
> > def nonunique(lst):
> >     slst = sorted(lst)
> >     dups = [s[0] for s in
> >         filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
> >     return [dups[0]] + [s[1] for s in
> >         filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]
>
> Argh! What's wrong with something like:
>
> def duplicates(l):
>     i = j = object()
>     for k in sorted(l):
>         if i != j == k: yield k
>         i, j = j, k
Nice, and more readable. But I'd use Paul Rubin's solution. It is O(N)
as opposed to ours, which are O(N log N).
Isn't this average constant time rather than amortized?
> OK. I obviously need to look up open-addressed tables. I thought this
> was just, in effect, implicit linked listing - ie it still needs a
> linear search to handle collisions, it just avoids the need for
> explicitly stored link fields. Perhaps I'm mistaken.
I don't think you are mistaken, but if I'm wrong I'd be grateful for a
link to further details.
Thanks
--
Arnaud
Is there a known scheme where, for keys A and B with h( KeyA ) ==
h( KeyB ) under hash function h, the next probes differ, i.e.
h( KeyA, 1 ) != h( KeyB, 1 ), where slot h( x, i+1 ) is searched when
slot h( x, i ) is filled? That is, the probe sequence is unique (or at
least not determined solely by the initial hash value)? Could you just
scramble the bits and hash key -> hash table -> hash table?
Is deletion - resizing to a smaller table - a serious bottleneck? Could
you store an object's probe sequence as it looks for "a home", and when
one of its earlier slots becomes vacant, promote it?
This is expected amortized constant time. Is that really the same
thing as average constant time? Hmmm... that's subtle.
>> OK. I obviously need to look up open-addressed tables. I thought this
>> was just, in effect, implicit linked listing - ie it still needs a
>> linear search to handle collisions, it just avoids the need for
>> explicitly stored link fields. Perhaps I'm mistaken.
The amortized doubling breaks that.
> I don't think you are mistaken, but if I'm wrong I'd be grateful for a
> link to further details.
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
--
--Bryan
I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.
> I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
> and is IMHO more readable than Paul's.
Is a Python set implemented using a hash table?
It's not as much O(N)... Paul Rubin's uses a sort first, which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.
"""
from collections import defaultdict
def nonunique(lst):
    d = defaultdict(int)
    for x in lst: d[x] += 1
    return [x for x,n in d.iteritems() if n > 1]
"""
I see no sort here.
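For what it's worth, the prettying-up might amount to no more than laying it out conventionally (my reading of it, not the original poster's revision):

from collections import defaultdict

def nonunique(lst):
    counts = defaultdict(int)
    for x in lst:
        counts[x] += 1
    return [x for x, n in counts.iteritems() if n > 1]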
What don't you understand about the comments in the first two
screenfuls of Objects/setobject.c?
> What don't you understand about the comments in the first two
> screenfuls of Objects/setobject.c?
I had not looked at it, but now I have. It seems Hettinger is the
author :) OK, so sets are implemented as hash tables. Then I agree.
That comment was unnecessarily harsh, don't you think?
--
Gabriel Genellina
I am not sure what the difference between expected amortized time
complexity and average time complexity is (I know that they are
defined as different notions but I don't know whether they reduce to
the same thing or not). Anyway both are average case complexities and
AFAIK worst case time complexity of insertion / lookup in a hashtable
is still O(n).
> >> OK. I obviously need to look up open-addressed tables. I thought this
> >> was just, in effect, implicit linked listing - ie it still needs a
> >> linear search to handle collisions, it just avoids the need for
> >> explicitly stored link fields. Perhaps I'm mistaken.
>
> The amortized doubling breaks that.
>
> > I don't think you are mistaken, but if I'm wrong I'd be grateful for a
> > link to further details.
>
> Arnaud Delobelle offered a good Wikipedia link, and for more background
> look up "amortized analysis".
Hrvoje Niksic provided the link :). I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).
Isn't it the case that:
         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(n log n) | O(n log n)
Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.
--
Arnaud
Screenfuls is an awful lot, the way you guys write. Besides, what's
the answer? What don't you understand?
Oops, careless thread-following. Hrvoje Niksic it was.
> I still think two unrelated
> things are being compared in this thread when people say that method A
> (using dictionaries / sets) is O(n) and method B (sorting the list)
> O(nlogn).
>
> Isn't it the case that:
>
>          | Worst case | Average case
> ---------|------------|-------------
> Method A | O(n^2)     | O(n)
> Method B | O(n log n) | O(n log n)
>
> Which method is the best would then be determined by the distribution
> of the hash values of your data, and whether you want to have a
> guarantee the method will have a less-than-quadratic behaviour.
If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.
--
--Bryan
A collision sequence is not so rare.
>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
Some adversary. What, you mean, my boss or my customers?
>> Bryan did qualify his remarks: "If we exclude the case where an
>> adversary is choosing the keys, ..."
>
> Some adversary. What, you mean, my boss or my customers?
We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.
If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.
Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.
--
--Bryan
There are situations where this can happen, I guess.
> If you find real-world data sets that tend to induce bad-case
> behavior in Python's hash, please do tell. It would be reason
> enough to adjust the hash function. The hashes in popular
> software such as Python are already quite well vetted.
As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function. I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.
-------- badhash.py ----------
class BadHash(object):
    def __hash__(self):
        return 1 # that's a bad hash function!

from time import time

def test(n):
    t0 = time()
    # Create a hash table of size n
    s = set(BadHash() for x in xrange(n))
    t1 = time()
    # Add an object to it
    obj = BadHash()
    s.add(obj)
    t2 = time()
    # Find that object
    obj in s
    t3 = time()
    print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3-t2)/n)

print "n\tcreate/n^2\tadd/n\t\tfind/n"
for k in range(8, 17):
    # I'm hoping that adding an element to a set of size (1 << k) + 1
    # will not trigger a resize of the hash table.
    test((1 << k) + 1)
--------------------------------
marigold:junk arno$ python badhash.py
n create/n^2 add/n find/n
257 3.917e-08 6.958e-08 7.051e-08
513 3.641e-08 8.598e-08 7.018e-08
1025 3.522e-08 7.118e-08 6.722e-08
2049 3.486e-08 6.935e-08 6.982e-08
4097 3.480e-08 7.053e-08 6.931e-08
8193 3.477e-08 6.897e-08 6.981e-08
16385 3.441e-08 6.963e-08 7.075e-08
32769 3.720e-08 7.672e-08 7.885e-08
65537 3.680e-08 7.159e-08 7.510e-08
So theory and practice agree! In this case the hash table behaves
just like a linked list.
Lastly, if one deals with a totally ordered set of objects that are
not hashable (there isn't a good hash function), then Ninereeds' idea
of sorting first is still useful.
--
Arnaud
... and if not totally ordered, then ... I'll just stand aside and cue
the timbot :-)
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560
> Even a hash function that behaves as a random oracle has
> worst-case quadratic-time in the algorithm here
In which case inserts are not amortized to O(1).
Hey Net:
Where -we- left off, all real-world data sets were equally common,
unless you think in peckbird bartenders. Nuh uh at all, unless you've
got a characterize real-world data project underway. Did you know
letter frequency distributions vary by domain? And now for the 2-
squarebar spider.
raise ISay( "Python Science is unusually prone against non-power
multiples of 32 hashing integers", "Unknown value of I" )
parse( 'strangers are perfect' )
parse( 'i think i know someone' )
prevention.value() == 12* cure.value()
raise SolveFor
(raise CureValueSterling)
keep: time.value().encode() > money.value()
assert time.value().encode()== time.encode().value()
perch( Out )
compose( Fluctuate )
compose( Punctuate )
compose( Explain )
@partial( partial, prefix ) (ha, h)
raise Distracts( 'Well, would you look at that?' )
raise LivingConstraint( NoseFalls )
raise Superlative
voice Absolute
raise NonScalar( "Fit: It's apples -over- oranges." )
raise OverPull( "Not on Google." )
raise Drop( "It." )
moment+= instant
raise LeavesImpression
raise GoodMuleTwoHaystacks
raise ScaleDecomposition
raise Bias( 'Discrimination' )
raise BadFor
raise RequestedControlCurrencyExpiration
raise InsufficientTrust
raise Shock
raise EnoughTheory
raise EnoughCustomers
raise EyebrowTwo
raise DataKill
raise Reckless
raise ThenWhat( AfterYou )
raise NotImplemented( 'Offer.refuse' )
raise LaughterUnacceptable
raise Hell
raise DesignUnintelligent
raise HashesPerturbation
raise ProfessionException
yield Tick
yield Tally
yield Ho
yield Gee
yield Whoa
raise ActuallyActuallyActually
raise CurrencyAStream
If you can't fit an analogy...
Bee Dance : Hive Good :: Human This :: Community Good?
What and how is it when to who where?
What does it know? What does it do but sleep?
Does it specialize, momentarily fair?
Hey, what do you think of the spelling of this? "The group you are
posting to is a Usenet group. Messages posted to this group will make
your email address visible to anyone on the Internet." What do you
think of the wording of "spelling"? Take me to your lexicographer:
wake-up; you're up web, Europe web. Pass self to a generator.
Dust is sett'ling;
Tables turn:
Water's drying;
Sugars burn.
Phaser fire!
Shoulders square!
join( TenForward )?
Who is where?
Mushroom's growing;
Mighty... falls.
Cauldron's bubbling;
That is all.
>>> def f():
...     n= True
...     while 1:
...         k= yield( n )
...         if k is not None: n= k
...
>>> a= f()
>>> next( a )
True
>>> next( a )
True
>>> next( a )
True
>>> a.send( 2 )
2
>>> next( a )
2
>>> next( a )
2
>>> next( a )
2
>>>
Cormen-Leiserson-Rivest /Introduction to Algorithms/ discusses
it in section 12.3.3 (first edition). They suggest a family of
hash functions, where an individual hash is chosen at random
when creating the table. That still doesn't beat adversaries
who can get on-line timing information and figure out when
they induce bad cases.
More generally, security researchers have started to look at
"algorithmic complexity denial of service attacks".
http://www.cs.rice.edu/~scrosby/hash/
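As a rough sketch of that idea for integer keys: draw the multiplier and offset at random when the table is created, so an adversary who cannot see them cannot precompute collisions. The class name and constants below are made up for illustration, and keys larger than the prime lose the formal guarantee:

import random

class RandomizedIntHash(object):
    '''Carter-Wegman-style hash for ints: h(x) = ((a*x + b) % p) % m,
    with a and b chosen at random per table.  Illustration only.'''

    P = 2**61 - 1    # a Mersenne prime used as the modulus

    def __init__(self, table_size):
        self.m = table_size
        self.a = random.randrange(1, self.P)
        self.b = random.randrange(0, self.P)

    def __call__(self, x):
        return ((self.a * x + self.b) % self.P) % self.m

h = RandomizedIntHash(64)
print [h(2**i) for i in range(0, 256, 32)]   # very unlikely to be all equal now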
>> If you find real-world data sets that tend to induce bad-case
>> behavior in Python's hash, please do tell. It would be reason
>> enough to adjust the hash function. The hashes in popular
>> software such as Python are already quite well vetted.
>
> As Python allows you to make your own hash functions for your own
> classes, another danger is that you are dealing with objects with a
> bad hash function. I've actually tried it (as I am *not* a pro!) and
> FWIW here are the results.
>
> -------- badhash.py ----------
>
> class BadHash(object):
>     def __hash__(self):
>         return 1 # that's a bad hash function!
The sorting algorithm can have the same problem. With user-defined
methods, there's no telling how long __cmp__ will actually take.
--
--Bryan
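For symmetry with badhash.py, a toy class with a deliberately slow comparison shows the same effect on sorting (SlowCmp is made up here, in the same spirit as BadHash):

from time import sleep, time

class SlowCmp(object):
    '''Object whose comparisons are artificially expensive.'''
    def __init__(self, n):
        self.n = n
    def __cmp__(self, other):
        sleep(0.001)              # pretend each comparison does real work
        return cmp(self.n, other.n)

lst = [SlowCmp(i % 10) for i in range(100)]
t0 = time()
sorted(lst)
print "sorted 100 slow objects in %.2f seconds" % (time() - t0)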