
finding items that occur more than once in a list


Simon Forman

Mar 18, 2008, 5:57:06 AM
Is there a more efficient way to do this?

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)


>>> f([0, 0, 1, 1, 2, 2, 3])
set([0, 1, 2])

Chris

Mar 18, 2008, 6:18:20 AM

def f(L):
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i,j in D.items() if j > 1]

That would be my way to do it, would need to test it via several
thousand iterations to see which one is most efficient though.

Paul Rubin

Mar 18, 2008, 7:00:19 AM
Simon Forman <sajm...@gmail.com> writes:
> Is there a more efficient way to do this?

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/502263

Bryan Olson

Mar 18, 2008, 7:24:12 AM
Simon Forman wrote:
> Is there a more efficient way to do this?
>
> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)

That's neat, but quadratic time because list.remove() requires
a linear search. We can make an efficient variant by using
remove on a set rather than a list:

def multiples(lst):
    singles = set(lst)
    mults = set()
    for x in lst:
        if x in singles:
            singles.remove(x)
        else:
            mults.add(x)
    return mults

Though probably better is:

def multiples(lst):
    seen = set()
    mults = set()
    for x in lst:
        if x in seen:
            mults.add(x)
        else:
            seen.add(x)
    return mults


I've typically used dicts for such things, as in:

def multiples(lst):
    h = {}
    for x in lst:
        h[x] = h.get(x, 0) + 1
    return set([x for x in h if h[x] > 1])
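
For what it's worth, collections.Counter (added in Python 2.7, so after this thread) packages the same tallying idea; a minimal sketch with an illustrative name:

from collections import Counter

def multiples_counter(lst):
    # Counter does the tallying that the explicit dict loop above performs by hand.
    return set(x for x, n in Counter(lst).items() if n > 1)

# multiples_counter([0, 0, 1, 1, 2, 2, 3])  ->  set([0, 1, 2])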


--
--Bryan

Ninereeds

Mar 18, 2008, 8:25:15 AM

Just to throw in one more alternative, if you sort your list, you only
need to test adjacent items for equality rather than needing a search
for each unique item found. You should get O(n log n) rather than
O(n^2), since the performance bottleneck is now the sorting rather
than the searching for duplicates. Also, since the sort method is
built into the list, sorting performance should be very good.
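
A minimal sketch of that sort-then-scan idea (the function name and details are illustrative, not from the thread):

def duplicates_sorted(lst):
    # Sort, then walk the result comparing each item to its predecessor;
    # a run of equal items contributes one entry to the output.
    slst = sorted(lst)
    dups = []
    for prev, cur in zip(slst, slst[1:]):
        if prev == cur and (not dups or dups[-1] != cur):
            dups.append(cur)
    return dups

# duplicates_sorted([0, 0, 1, 1, 2, 2, 3])  ->  [0, 1, 2]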

The dictionary version Chris suggests (and the essentially equivalent
set-based approach) does much the same thing, but uses hashing rather
than ordering to organise the list and spot
duplicates. This is *not* O(n) due to the rate of collisions
increasing as the hash table fills. If collisions are handled by
building a linked list for each hash table entry, the performance
overall is still O(n^2) since (for large amounts of data) there is
still a linear search on each collision. If collisions are handled by
building binary trees, the performance is back to O(n log n). That is,
for large enough datasets, the performance of hash tables is basically
the performance of the collision handling data structure (ignoring
scaling constants).

I suspect Python handles collisions using linked lists, since using
trees would require that all dictionary keys support ordered
comparisons. Using a dictionary or set to eliminate duplicates
therefore gives O(n^2) performance.

That said, hash tables have very good scaling constants to their
performance, so the dictionary technique will be a very good performer
in general. And if the lists are reasonably small the performance will
often seem like O(n) in practice.

The sort-then-filter approach should still be competitive, but of
course it requires that the contents of the list can be ordered
consistently. An inappropriate hash function can similarly cause
problems for the dictionary approach, causing that theoretical O(n^2)
performance to become apparent very quickly. This is probably one
reason why the cookbook recipe recommends an O(n^2) searching-based
approach.

Hrvoje Niksic

Mar 18, 2008, 11:18:04 AM
Ninereeds <stephen...@aol.com> writes:

> The dictionary version Chris suggests (and the essentially
> equivalent set-based approach) is doing essentially the same thing
> in a way, but using hashing rather than ordering to organise the
> list and spot duplicates. This is *not* O(n) due to the rate of
> collisions increasing as the hash table fills. If collisions are
> handled by building a linked list for each hash table entry, the
> performance overall is still O(n^2)

This doesn't apply to Python, which implements dict storage as an
open-addressed table and automatically (and exponentially) grows the
table when the number of entries approaches 2/3 of the table size.
Assuming a good hash function, filling the dict should yield amortized
constant time for individual additions.

> I suspect Python handles collisions using linked lists,

Why suspect? It's trivial to check. dictobject.c says:

The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
Open addressing is preferred over chaining since the link overhead for
chaining would be substantial (100% with typical malloc overhead).
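
To make the open-addressing idea concrete, here is a toy table that borrows the j = (5*j) + 1 + perturb probe recurrence described in dictobject.c; everything else (fixed power-of-two size, no resizing, no deletion, tuple slots) is a simplified illustration, not CPython's implementation:

class ToyDict(object):
    """Toy open-addressed hash table, for illustration only.
    Assumes the table never fills completely."""

    def __init__(self, size=8):
        self.size = size                 # must be a power of two
        self.slots = [None] * size       # each slot is None or a (key, value) pair

    def _probe(self, key):
        # Generate the sequence of slot indices to try for this key.
        h = hash(key) & 0xFFFFFFFF       # treat the hash as unsigned, as the C code does
        perturb = h
        j = h & (self.size - 1)
        while True:
            yield j
            j = (5 * j + 1 + perturb) & (self.size - 1)
            perturb >>= 5                # eventually 0; then 5*j + 1 cycles through all slots

    def __setitem__(self, key, value):
        # A real table would also resize once about 2/3 full.
        for j in self._probe(key):
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return

    def __getitem__(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                raise KeyError(key)
            if self.slots[j][0] == key:
                return self.slots[j][1]

# d = ToyDict(); d['spam'] = 1; d['spam'] = 2; d['spam']  ->  2

Because the probe index soon depends on all of the hash bits via perturb, keys that collide on the low-order bits quickly diverge onto different probe sequences.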

Ninereeds

Mar 18, 2008, 11:59:04 AM

Hrvoje Niksic wrote:

> This doesn't apply to Python, which implements dict storage as an
> open-addressed table and automatically (and exponentially) grows the
> table when the number of entries approaches 2/3 of the table size.
> Assuming a good hash function, filling the dict should yield amortized
> constant time for individual additions.

OK. I obviously need to look up open-addressed tables. I thought this
was just, in effect, implicit linked listing - ie it still needs a
linear search to handle collisions, it just avoids the need for
explicitly stored link fields. Perhaps I'm mistaken.

As for the growth pattern, each time you grow the table you have to
redistribute all the items previously inserted to new locations.
Resizes would get rarer as more items are added due to the exponential
growth, but every table resize would take longer too since there are
more items to move. Inserting n items still intuitively looks like
O(n^2) to me.

That said, it does remind me of that old exponential realloc trick for
array resizing. Same thing, I suppose, since a hash table is basically
an array. Maybe my math "intuition" is just wrong.

Raymond Hettinger

Mar 18, 2008, 1:17:19 PM


def f(L):
    seen = set()
    dups = set()
    for e in L:
        if e in seen:
            dups.add(e)
        else:
            seen.add(e)
    return dups

Raymond

Hrvoje Niksic

Mar 18, 2008, 4:43:49 PM
Ninereeds <stephen...@aol.com> writes:

> As for the growth pattern, each time you grow the table you have to
> redistribute all the items previously inserted to new locations.
> Resizes would get rarer as more items are added due to the
> exponential growth, but every table resize would take longer too
> since there are more items to move. Inserting n items still
> intuitively looks like O(n^2) to me.
>
> That said, it does remind me of that old exponential realloc trick
> for array resizing. Same thing, I suppose, since a hash table is
> basically an array. Maybe my math "intuition" is just wrong.

That's it. While table resizes grow linearly in complexity, they
become geometrically rarer. This is exactly what happens when
resizing dynamic arrays such as Python lists. Applying your
intuition, appending n elements to a Python list would also have
O(n^2) complexity, which it doesn't. See, for example,
http://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost
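
A quick arithmetic sketch of that point, counting element moves under a simplified "double when full" policy (CPython's actual trigger is the 2/3 load factor mentioned above):

def total_copies(n, initial=8):
    """Count how many element moves a double-when-full table performs
    while n items are inserted."""
    size, items, copies = initial, 0, 0
    for _ in range(n):
        if items == size:       # table full: allocate a bigger one and move everything
            copies += items
            size *= 2
        items += 1
    return copies

# total_copies(10**6) is 1048568, i.e. fewer than 2 * 10**6 moves in total.

The copies form a geometric series bounded by 2n, which is where the O(1) amortized cost per insert comes from even though an individual resize is O(n).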

sturlamolden

Mar 18, 2008, 5:22:53 PM
On 18 Mar, 10:57, Simon Forman <sajmik...@gmail.com> wrote:

> def f(L):
>     '''Return a set of the items that occur more than once in L.'''
>     L = list(L)
>     for item in set(L):
>         L.remove(item)
>     return set(L)


def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : not(t[0]-t[1]), zip(slst[:-1],slst[1:]))]))


sturlamolden

Mar 18, 2008, 5:25:12 PM
On 18 Mar, 22:22, sturlamolden <sturlamol...@yahoo.no> wrote:

> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t : not(t[0]-t[1]), zip(slst[:-1],slst[1:]))]))


Or perhaps better:

def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))

Paul Rubin

Mar 18, 2008, 5:52:25 PM
sturlamolden <sturla...@yahoo.no> writes:
> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))

The items are all comparable and you're willing to take them out of order?

from collections import defaultdict
def nonunique(lst):
    d = defaultdict(int)
    for x in lst: d[x] += 1
    return [x for x,n in d.iteritems() if n > 1]

sturlamolden

Mar 18, 2008, 5:56:02 PM
On 18 Mar, 22:25, sturlamolden <sturlamol...@yahoo.no> wrote:

> def nonunique(lst):
>     slst = sorted(lst)
>     return list(set([s[0] for s in
>         filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))

Obviously that should be 'lambda t : t[0] == t[1]'. Instead of using
the set function, there is more structure to exploit when the list is
sorted:


def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
        filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
    return [dups[0]] + [s[1] for s in
        filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]

Simon Forman

Mar 18, 2008, 6:42:22 PM
I love you guys, thanks! ;-)
~Simon

Arnaud Delobelle

Mar 18, 2008, 6:45:42 PM

Argh! What's wrong with something like:

def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k

>>> list(duplicates([1, 2, 3, 2, 2, 4, 1]))
[1, 2]

--
Arnaud

sturlamolden

Mar 18, 2008, 7:08:04 PM
On 18 Mar, 23:45, Arnaud Delobelle <arno...@googlemail.com> wrote:

> > def nonunique(lst):
> >     slst = sorted(lst)
> >     dups = [s[0] for s in
> >         filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
> >     return [dups[0]] + [s[1] for s in
> >         filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]
>
> Argh! What's wrong with something like:
>
> def duplicates(l):
>     i = j = object()
>     for k in sorted(l):
>         if i != j == k: yield k
>         i, j = j, k


Nice, and more readable. But I'd use Paul Rubin's solution. It is O(N)
as opposed to ours, which are O(N log N).


Arnaud Delobelle

Mar 18, 2008, 7:38:45 PM
On Mar 18, 3:59 pm, Ninereeds <stephenhorne...@aol.com> wrote:
> Hrvoje Niksic wrote:
> > This doesn't apply to Python, which implements dict storage as an
> > open-addressed table and automatically (and exponentially) grows the
> > table when the number of entries approaches 2/3 of the table size.
> > Assuming a good hash function, filling the dict should yield amortized
> > constant time for individual additions.

Isn't this average constant time rather than amortized?

> OK. I obviously need to look up open-addressed tables. I thought this
> was just, in effect, implicit linked listing - ie it still needs a
> linear search to handle collisions, it just avoids the need for
> explicitly stored link fields. Perhaps I'm mistaken.

I don't think you are mistaken, but if I'm wrong I'd be grateful for a
link to further details.

Thanks

--
Arnaud

casti...@gmail.com

Mar 18, 2008, 9:11:32 PM

Is there a known algorithm which for Key A, Key B s.t. h( KeyA ) =
h( KeyB ) for hash function h, h( KeyA, 1 ) != h( KeyB, 1 ), where
h( x, i+ 1 ) is searched when h( x, i ) is filled? That is, the
search sequence is unique (or at least orthogonal to the h_value)?

Could you just scramble the bits and hash key -> hash table -> hash
table?

Is deletion - resize to small - a strong bottleneck? Can you store an
object's search sequence, as it looks for "a home", and when one of
its earlier slots vacantizes, promote it?

Bryan Olson

Mar 18, 2008, 11:08:11 PM
Arnaud Delobelle wrote:

> Ninereeds wrote:
>> Hrvoje Niksic wrote:
>>> This doesn't apply to Python, which implements dict storage as an
>>> open-addressed table and automatically (and exponentially) grows the
>>> table when the number of entries approaches 2/3 of the table size.
>>> Assuming a good hash function, filling the dict should yield amortized
>>> constant time for individual additions.
>
> Isn't this average constant time rather than amortized?

This is expected amortized constant time. Is that really the same
thing as average constant time? Hmmm... that's subtle.


>> OK. I obviously need to look up open-addressed tables. I thought this
>> was just, in effect, implicit linked listing - ie it still needs a
>> linear search to handle collisions, it just avoids the need for
>> explicitly stored link fields. Perhaps I'm mistaken.

The amortized doubling breaks that.

> I don't think you are mistaken, but if I'm wrong I'd be grateful for a
> link to further details.

Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis.


--
--Bryan

Message has been deleted

John Machin

Mar 19, 2008, 5:48:37 PM

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

sturlamolden

Mar 19, 2008, 6:14:25 PM
On 19 Mar, 22:48, John Machin <sjmac...@lexicon.net> wrote:

> I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
> and is IMHO more readable than Paul's.

Is a Python set implemented using a hash table?

Justin Bozonier

Mar 19, 2008, 6:57:16 PM

It's not as much O(N)... Paul Rubin's uses a sort first, which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.

John Machin

Mar 19, 2008, 7:06:46 PM

"""


from collections import defaultdict
def nonunique(lst):
d = defaultdict(int)
for x in lst: d[x] += 1
return [x for x,n in d.iterkeys() if n > 1]
"""

I see no sort here.

John Machin

Mar 19, 2008, 7:16:36 PM

What don't you understand about the comments in the first two
screenfuls of Objects/setobject.c?

sturlamolden

Mar 19, 2008, 9:03:24 PM
On 20 Mar, 00:16, John Machin <sjmac...@lexicon.net> wrote:

> What don't you understand about the comments in the first two
> screenfuls of Objects/setobject.c?

I had not looked at it, but now I have. It seems Hettinger is the
author :) OK, so sets are implemented as hash tables. Then I agree.

Gabriel Genellina

Mar 19, 2008, 9:50:20 PM
to pytho...@python.org
On Wed, 19 Mar 2008 20:16:36 -0300, John Machin <sjma...@lexicon.net>
wrote:

That comment was unnecessarily harsh, don't you think?

--
Gabriel Genellina

John Machin

Mar 20, 2008, 3:07:34 AM
On Mar 20, 12:50 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
> On Wed, 19 Mar 2008 20:16:36 -0300, John Machin <sjmac...@lexicon.net>
> wrote:
>
> > On Mar 20, 9:14 am, sturlamolden <sturlamol...@yahoo.no> wrote:
> >> Is a Python set implemented using a hash table?
>
> > What don't you understand about the comments in the first two
> > screenfuls of Objects/setobject.c?
>
> That comment was unnecessarily harsh, don't you think?
>
No, otherwise I wouldn't have made it.

Arnaud Delobelle

Mar 20, 2008, 5:43:26 AM
On Mar 19, 3:08 am, Bryan Olson <fakeaddr...@nowhere.org> wrote:
> Arnaud Delobelle wrote:
> > Ninereeds wrote:
> >> Hrvoje Niksic wrote:
> >>> This doesn't apply to Python, which implements dict storage as an
> >>> open-addressed table and automatically (and exponentially) grows the
> >>> table when the number of entries approaches 2/3 of the table size.
> >>> Assuming a good hash function, filling the dict should yield amortized
> >>> constant time for individual additions.
>
> > Isn't this average constant time rather than amortized?
>
> This is expected amortized constant time. Is that really the same
> thing as average constant time? Hmmm... that's subtle.

I am not sure what the difference between expected amortized time
complexity and average time complexity is (I know that they are
defined as different notions but I don't know whether they reduce to
the same thing or not). Anyway both are average case complexities and
AFAIK worst case time complexity of insertion / lookup in a hashtable
is still O(n).

> >> OK. I obviously need to look up open-addressed tables. I thought this
> >> was just, in effect, implicit linked listing - ie it still needs a
> >> linear search to handle collisions, it just avoids the need for
> >> explicitly stored link fields. Perhaps I'm mistaken.
>
> The amortized doubling breaks that.
>
> > I don't think you are mistaken, but if I'm wrong I'd be grateful for a
> > link to further details.
>
> Arnaud Delobelle offered a good Wikipedia link, and for more background
> look up "amortized analysis.

Hrvoje Niksic provided the link :). I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).

Isn't it the case that:

         | Worst case | Average case
---------|------------|--------------
Method A | O(n^2)     | O(n)
Method B | O(n log n) | O(n log n)

Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.

--
Arnaud

casti...@gmail.com

Mar 21, 2008, 2:47:50 AM

Screenfuls is an awful lot, the way you guys write. Besides, what's
the answer? What don't you understand?

Bryan Olson

Mar 21, 2008, 5:17:10 AM
Arnaud Delobelle wrote:
> Bryan Olson wrote:
[...]

>> Arnaud Delobelle offered a good Wikipedia link, and for more background
>> look up "amortized analysis.
>
> Hrvoje Niksic provided the link :).

Oops, careless thread-following. Hrvoje Niksic it was.

> I still think two unrelated
> things are being compared in this thread when people say that method A
> (using dictionaries / sets) is O(n) and method B (sorting the list)
> O(nlogn).
>
> Isn't it the case that:
>
>          | Worst case | Average case
> ---------|------------|--------------
> Method A | O(n^2)     | O(n)
> Method B | O(n log n) | O(n log n)
>
> Which method is the best would then be determined by the distribution
> of the hash values of your data, and whether you want to have a
> guarantee the method will have a less-than-quadratic behaviour.

If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.


--
--Bryan

casti...@gmail.com

Mar 21, 2008, 10:11:37 AM

A collision sequence is not so rare.

>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]

John Machin

Mar 21, 2008, 4:39:09 PM

Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."

casti...@gmail.com

Mar 22, 2008, 3:25:57 AM

Some adversary. What, you mean, my boss or my customers?

Bryan Olson

Mar 22, 2008, 4:31:45 AM
casti...@gmail.com wrote:

> John Machin wrote:
>> On Mar 22, 1:11 am, castiro...@gmail.com wrote:
>>> A collision sequence is not so rare.
>>>>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
>>> [1, 1, 1, 1, 1, 1, 1, 1]

>> Bryan did qualify his remarks: "If we exclude the case where an
>> adversary is choosing the keys, ..."
>
> Some adversary. What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.


--
--Bryan

Arnaud Delobelle

Mar 22, 2008, 6:58:15 AM
On Mar 22, 8:31 am, Bryan Olson <fakeaddr...@nowhere.org> wrote:

> castiro...@gmail.com wrote:
> > John Machin wrote:
> >> On Mar 22, 1:11 am, castiro...@gmail.com wrote:
> >>> A collision sequence is not so rare.
> >>>>>> [ hash( 2**i ) for i in range( 0, 256, 32 ) ]
> >>> [1, 1, 1, 1, 1, 1, 1, 1]
> >> Bryan did qualify his remarks: "If we exclude the case where an
> >> adversary is choosing the keys, ..."
>
> > Some adversary.  What, you mean, my boss or my customers?
>
> We mean that the party supplying the keys deliberately chose
> them to make the hash table inefficient. In this thread the goal
> is efficiency; a party working toward an opposing goal is an
> adversary.

There are situations where this can happen I guess

> If you find real-world data sets that tend to induce bad-case
> behavior in Python's hash, please do tell. It would be reason
> enough to adjust the hash function. The hashes in popular
> software such as Python are already quite well vetted.

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function. I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

-------- badhash.py ----------

class BadHash(object):
    def __hash__(self):
        return 1  # that's a bad hash function!

from time import time

def test(n):
    t0 = time()
    # Create a hash table of size n
    s = set(BadHash() for x in xrange(n))
    t1 = time()
    # Add an object to it
    obj = BadHash()
    s.add(obj)
    t2 = time()
    # Find that object
    obj in s
    t3 = time()
    print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3-t2)/n)

print "n\tcreate/n^2\tadd/n\t\tfind/n"
for k in range(8, 17):
    # I'm hoping that adding an element to a table of size (1 << k) + 1
    # will not trigger a resize of the hashmap.
    test((1 << k) + 1)

--------------------------------

marigold:junk arno$ python badhash.py
n        create/n^2    add/n         find/n
257      3.917e-08     6.958e-08     7.051e-08
513      3.641e-08     8.598e-08     7.018e-08
1025     3.522e-08     7.118e-08     6.722e-08
2049     3.486e-08     6.935e-08     6.982e-08
4097     3.480e-08     7.053e-08     6.931e-08
8193     3.477e-08     6.897e-08     6.981e-08
16385    3.441e-08     6.963e-08     7.075e-08
32769    3.720e-08     7.672e-08     7.885e-08
65537    3.680e-08     7.159e-08     7.510e-08

So theory and practice agree! In this case the hash table behaves
just like a linked list.

Lastly, if one deals with a totally ordered set of objects but they are
not hashable (there isn't a good hash function), then Ninereeds' idea
of sorting first is still useful.

--
Arnaud

John Machin

Mar 22, 2008, 7:48:17 AM
On Mar 22, 9:58 pm, Arnaud Delobelle <arno...@googlemail.com> wrote:
>
> Lastly, if one deals with a totally ordered set of objects but they are
> not hashable (there isn't a good hash function), then Ninereeds' idea
> of sorting first is still useful.

... and if not totally ordered, then ... I'll just stand aside and cue
the timbot :-)

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560


sturlamolden

Mar 22, 2008, 9:43:33 AM
On 22 Mar, 09:31, Bryan Olson <fakeaddr...@nowhere.org> wrote:

> Even a hash function that behaves as a random oracle has
> worst-case quadratic-time in the algorithm here

In which case inserts are not amortized to O(1).


casti...@gmail.com

Mar 22, 2008, 3:17:45 PM
On Mar 22, 3:31 am, Bryan Olson <fakeaddr...@nowhere.org> wrote:

Hey Net:

Where -we- left off, all real-world data sets were equally common,
unless you think in peckbird bartenders. Nuh uh at all, unless you've
got a characterize real-world data project underway. Did you know
letter frequency distributions vary by domain? And now for the 2-
squarebar spider.

raise ISay( "Python Science is unusually prone against non-power
multiples of 32 hashing integers", "Unknown value of I" )

parse( 'strangers are perfect' )
parse( 'i think i know someone' )
prevention.value() == 12* cure.value()
raise SolveFor
(raise CureValueSterling)
keep: time.value().encode() > money.value()
assert time.value().encode()== time.encode().value()
perch( Out )

compose( Fluctuate )
compose( Punctuate )
compose( Explain )

@partial( partial, prefix ) (ha, h)
raise Distracts( 'Well, would you look at that?' )
raise LivingConstraint( NoseFalls )
raise Superlative
voice Absolute
raise NonScalar( "Fit: It's apples -over- oranges." )
raise OverPull( "Not on Google." )
raise Drop( "It." )
moment+= instant
raise LeavesImpression
raise GoodMuleTwoHaystacks
raise ScaleDecomposition
raise Bias( 'Discrimination' )
raise BadFor
raise RequestedControlCurrencyExpiration
raise InsufficientTrust
raise Shock
raise EnoughTheory
raise EnoughCustomers
raise EyebrowTwo
raise DataKill
raise Reckless
raise ThenWhat( AfterYou )
raise NotImplemented( 'Offer.refuse' )
raise LaughterUnacceptable
raise Hell
raise DesignUnintelligent
raise HashesPerturbation
raise ProfessionException
yield Tick
yield Tally
yield Ho
yield Gee
yield Whoa
raise ActuallyActuallyActually
raise CurrencyAStream

If you can't fit an analogy...

Bee Dance : Hive Good :: Human This :: Community Good?
What and how is it when to who where?
What does it know? What does it do but sleep?
Does it specialize, momentarily fair?

Hey, what do you think of the spelling of this? "The group you are
posting to is a Usenet group. Messages posted to this group will make
your email address visible to anyone on the Internet." What do you
think of the wording of "spelling"? Take me to your lexicographer:
wake-up; you're up web, Europe web. Pass self to a generator.

Dust is sett'ling;
Tables turn:
Water's drying;
Sugars burn.

Phaser fire!
Shoulders square!
join( TenForward )?
Who is where?

Mushroom's growing;
Mighty... falls.
Cauldron's bubbling;
That is all.

>>> def f():
...     n= True
...     while 1:
...         k= yield( n )
...         if k is not None: n= k
...
>>> a= f()
>>> next( a )
True
>>> next( a )
True
>>> next( a )
True
>>> a.send( 2 )
2
>>> next( a )
2
>>> next( a )
2
>>> next( a )
2
>>>

Bryan Olson

Mar 23, 2008, 2:02:52 AM
Arnaud Delobelle wrote:
> Bryan Olson wrote:
>> We mean that the party supplying the keys deliberately chose
>> them to make the hash table inefficient. In this thread the goal
>> is efficiency; a party working toward an opposing goal is an
>> adversary.
>
> There are situations where this can happen I guess

Cormen-Leiserson-Rivest /Introduction to Algorithms/ discusses
it in section 12.3.3 (first edition). They suggest a family of
hash functions, where an individual hash is chosen at random
when creating the table. That still doesn't beat adversaries
who can get on-line timing information and figure out when
they induce bad cases.
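
A sketch of that pick-a-hash-at-random idea (universal hashing); the multiply-add family, the constants, and the names below are illustrative assumptions, not any particular library's API:

import random

class RandomHash(object):
    """One member of a simple multiply-add hash family: h(x) = (a*x + b) mod p,
    with a and b chosen at random when the table is created."""

    P = (1 << 61) - 1   # a Mersenne prime well above typical 32-bit hash values

    def __init__(self):
        self.a = random.randrange(1, self.P)
        self.b = random.randrange(0, self.P)

    def __call__(self, key):
        # Mix the built-in hash through the random affine map, then the
        # table would reduce the result modulo its own size.
        return (self.a * hash(key) + self.b) % self.P

A table built around its own RandomHash() instance makes it hard for an adversary who cannot observe a and b to precompute a colliding key set, though, as noted above, timing side channels can still leak enough to find bad cases.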

More generally, security researchers have started to look at
"algorithmic complexity denial of service attacks".
http://www.cs.rice.edu/~scrosby/hash/


>> If you find real-world data sets that tend to induce bad-case
>> behavior in Python's hash, please do tell. It would be reason
>> enough to adjust the hash function. The hashes in popular
>> software such as Python are already quite well vetted.
>
> As Python allows you to make your own hash functions for your own
> classes, another danger is that you are dealing with objects with a
> bad hash function. I've actually tried it (as I am *not* a pro!) and
> FWIW here are the results.
>
> -------- badhash.py ----------
>
> class BadHash(object):
>     def __hash__(self):
>         return 1  # that's a bad hash function!

The sorting algorithm can have the same problem. With user-defined
methods, there's no telling how long __cmp__ will actually take.


--
--Bryan
