[Python-ideas] Uniquify attribute for lists

31 views
Skip to first unread message

Robrecht De Rouck

unread,
Nov 16, 2012, 7:28:46 AM11/16/12
to python...@python.org
Hello,

I just wanted to bring to your attention that an attribute for removing duplicate elements for lists would be a nice feature. 

def uniquify(lis):
    seen = set()
    seen_add = seen.add
    return [ x for x in lis if x not in seen and not seen_add(x)]

The code is from this post. Also check out this performance comparison of uniquifying snippets.
It would be useful to have a uniquify attribute for containers in general. 

Best regards, Robrecht

Paul Moore

unread,
Nov 16, 2012, 7:39:48 AM11/16/12
to Robrecht De Rouck, python...@python.org
list(set(ls))

Paul

Michael Foord

unread,
Nov 16, 2012, 8:17:56 AM11/16/12
to Paul Moore, python...@python.org
This loses order. Both solutions suffer from the problem that they only work with hashable objects.

Michael
 

Paul

_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas




--
http://www.voidspace.org.uk/

May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html

Masklinn

unread,
Nov 16, 2012, 8:49:12 AM11/16/12
to python-ideas
On 2012-11-16, at 14:17 , Michael Foord wrote:

> On 16 November 2012 12:39, Paul Moore <p.f....@gmail.com> wrote:
>
>> On 16 November 2012 12:28, Robrecht De Rouck <de.rouck...@gmail.com>wrote:
>>
>>> Hello,
>>>
>>> I just wanted to bring to your attention that an *attribute for removing
>>> duplicate elements* for lists would be a nice feature.
>>>
>>> *def uniquify(lis):
>>> seen = set()
>>> seen_add = seen.add
>>> return [ x for x in lis if x not in seen and not seen_add(x)]*
>>> *
>>> *
>>> The code is from this post<http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order>. Also
>>> check out this performance comparison<http://www.peterbe.com/plog/uniqifiers-benchmark> of
>>> uniquifying snippets.
>>> It would be useful to have a uniquify attribute for containers in
>>> general.
>>>
>>
>>
>> list(set(ls))
>>
>
> This loses order. Both solutions suffer from the problem that they only
> work with hashable objects.

Though in both cases they also have the advantage that they work in
(roughly) O(n) where an eq-based uniquifier (such as Haskell's
nub/nubBy) works in O(n^2).

Serhiy Storchaka

unread,
Nov 16, 2012, 9:37:54 AM11/16/12
to python...@python.org
On 16.11.12 15:17, Michael Foord wrote:
> On 16 November 2012 12:39, Paul Moore
> <p.f....@gmail.com
> <mailto:p.f....@gmail.com>> wrote:

> list(set(ls))
>
>
> This loses order.

list(collections.OrderedDict.fromkeys(ls))

Jasper St. Pierre

unread,
Nov 16, 2012, 1:22:00 PM11/16/12
to Serhiy Storchaka, python-ideas
As long as we're giving terrible code suggestions:

[a for a in L if a not in locals("_[0]")]

--
  Jasper

Steven D'Aprano

unread,
Nov 16, 2012, 1:21:24 PM11/16/12
to python...@python.org
On 16/11/12 23:28, Robrecht De Rouck wrote:
> Hello,
>
> I just wanted to bring to your attention that an *attribute for removing
> duplicate elements* for lists would be a nice feature.
>
> *def uniquify(lis):
> seen = set()
> seen_add = seen.add
> return [ x for x in lis if x not in seen and not seen_add(x)]*

That won't work for a general purpose function, because lists can hold
unhashable items, and sets require hashable.

Here's an old recipe predating sets that solves the problem in a number
of different ways. Read the comments for versions that don't lose order.



--
Steven

Andrew Barnert

unread,
Nov 16, 2012, 2:12:43 PM11/16/12
to Robrecht De Rouck, python...@python.org
If I understand this right, the problem you want to solve is that there is no obvious way to uniquify lists that's order preserving and efficient, so you want a good implementation to be added as an attribute of the list type. Right?

As others have pointed out, your implementation only works for lists with hashable elements, so no lists of lists, for example.

Also, making it an attribute of list means you can't use it on, say, a tuple, or a dict key iterator, or a file. Why restrict it like that? I'd much rather have an itertools.uniquify(seq) than a list method. (If I'm just misreading your use of the word "attribute", I apologize.)

And, once it's a separate function rather than a member of list, why do you want it to return a list rather than a generator?

All that being said, if getting this right is difficult enough that a bunch of people working together on a blog over 6 years didn't come up with a good version that supports non-hashable elements, maybe a good implementation does belong in the standard library itertools.

Sent from my iPhone

Andrew Barnert

unread,
Nov 16, 2012, 3:14:17 PM11/16/12
to Andrew Barnert, Robrecht De Rouck, python...@python.org
From: Andrew Barnert <abar...@yahoo.com>
Sent: Fri, November 16, 2012 11:15:18 AM

>All that being said, if getting this right is difficult enough that a bunch of
>people working together on a blog over 6 years didn't come up with a good
>version that supports non-hashable elements, maybe a good implementation does
>belong in the standard library itertools.

Actually, it looks like it's already there. The existing unique_everseen
function in http://docs.python.org/3/library/itertools.html#itertools-recipes
(also available from the more-itertools PyPI module at
http://packages.python.org/more-itertools/api.html#more_itertools.unique_everseen)
is an improvement on this idea.

So, unless someone has done performance tests showing that the suggested
implementation is significantly faster than unique_everseen (I suppose the
"__contains__" vs. "in" might make a difference?), and this is a critical
bottleneck for your app, I think the right way to write this function is:

uniquify = more_itertools.unique_everseen

Unfortunately, it's still not going to work on non-hashable elements. Maybe
itertools (either the module or the documentation recipe list) needs a version
that does?

Jan Kaliszewski

unread,
Nov 16, 2012, 4:59:34 PM11/16/12
to python...@python.org
Here is my quick'n'dirty implementation of different variants
(with/without key, preserving/not preserving order, accepting
hashable-only/any items):

http://wklej.org/id/872623/

Timeit-ed on my machine:

$ python3.3 iteruniq.py
test1_nokey_noorder_hashonly [0.08257626800332218, 0.08304202905856073,
0.08718552498612553]
test2_nokey_noorder_universal [2.48601198696997, 2.4620621589710936,
2.453364996938035]
test3_nokey_withorder_hashonly [0.3661507030483335, 0.3646505419164896,
0.36500189593061805]
test4_nokey_withorder_universal [7.532308181049302, 7.397191203082912,
7.316833758028224]
test5_withkey_noorder_hashonly [0.9567891559563577, 0.9690931889927015,
0.9598639439791441]
test6_withkey_noorder_universal [3.142076837946661, 3.144917198107578,
3.150129645015113]
test7_withkey_withorder_hashonly [0.9493958179373294,
0.9514245060272515, 0.9517305289627984]
test8_withkey_withorder_universal [10.233501984039322,
10.404869885998778, 10.786898656049743]

Cheers.
*j

Andrew Barnert

unread,
Nov 16, 2012, 6:25:31 PM11/16/12
to Jan Kaliszewski, python...@python.org
Comparing your test3 and test7 to the equivalent calls with the itertools
recipe, it's about 32% slower in 2.7 and 46% slower in 3.3. But the added
flexibility might easily make up for the cost—it certainly does if you need it…

More interestingly, with this:

if hashable_only is None:
try:
return iteruniq(iterable, key, preserve_order, True)
except TypeError:
return iteruniq(iterable, key, preserve_order, False)

… hashable_only=None is only 8% slower than hashable_only=False when you have
non-hashables, and 91% faster when you don't. (And trying unique_everseen
instead of iteruniq if hashable_only is None and preserve_order makes that 7%
and 94%.)

The current unique_everseen is already by far the longest recipe on the
itertools docs page, but it still might be worth updating with a synthesized
best-of-all-options version, or actually adding to itertools instead of leaving
as a recipe.

----- Original Message ----

Steven D'Aprano

unread,
Nov 16, 2012, 8:12:31 PM11/16/12
to python...@python.org
On 17/11/12 05:21, Steven D'Aprano wrote:

> Here's an old recipe predating sets that solves the problem in a number
> of different ways. Read the comments for versions that don't lose order.

And it would help if I actually included the URL. Sorry about that.


http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/

Andrew Barnert

unread,
Nov 16, 2012, 10:17:59 PM11/16/12
to Steven D'Aprano, python...@python.org
I created a fork of more-itertools, with unique_everseen modified to work with
non-hashable elements. It starts off assuming everything is hashable, and using
a set; if it gets a TypeError, it converts the set to a list and carries on.
Very simple, and it's within 5% of the speed of the original in the best case.

From a quick glance at your recipe, it looks like you had the same idea long
ago.

Meanwhile, I've been thinking that you don't really have to fall all the way
back to a list if things aren't hashable; there are plenty of intermediate
steps:

There are plenty of types that aren't generally hashable, but you can be sure
that your algorithm won't mutate them, and you can devise a hash function that
guarantees the hashable requirement. For some easy examples, hash mutable
sequences as (type(x), tuple(x)), mutable sets as (type(x), frozenset(x)),
mutable mappings as (type(x), tuple(dict(x).items())), mutable buffers as
(type(x), bytes(x)), etc. Or, if you have a bunch of your own mutable classes,
maybe add a "fakehash" method to them and use that.

As another option, a set of pickle.dumps(x) would work for many types, and it's
still O(N), although with a huge multiplier, so it's not worth it
unless len(sequence) >> avg(len(element)). Also, it's not guaranteed that x==y
implies dumps(x)==dumps(y), so you'd need to restrict it to types for which this
is known to be true.

There are plenty of types that are not hashable, but are fully ordered, and
(type(x), x) is fully ordered as long as all of the types are, so in such cases
you can use a sorted collection (blist.sortedset?) and get O(N log N) time.

Of course none of these works for all types, so you'd still have to fall back to
linear searching through a list in some cases.

At any rate, I don't think any of these alternatives needs to be added to a
general-purpose uniquifier, but they should all be doable if your use case
requires better performance for, e.g., a list of lists or a generator of mutable
class objects or a huge collection of quickly-picklable objects.


----- Original Message ----
> From: Steven D'Aprano <st...@pearwood.info>
> To: python...@python.org
> Sent: Fri, November 16, 2012 5:18:02 PM
> Subject: Re: [Python-ideas] Uniquify attribute for lists
>

Joshua Landau

unread,
Nov 17, 2012, 2:37:38 PM11/17/12
to Andrew Barnert, python-ideas
Surely the best choice is two have *two* caches; one for hashables and another for the rest.

This might be improvable with a *third* chache if some non-hashables had total ordering, but that could also introduce bugs I think. It'd also be a lot harder and likely be slower anyway.

Timings (approximate and rough):
"""
Using 16 elements

% Hashable:    100     90     66     33      0
Original:     1.47   1.00   1.06   1.09   1.01
w/ key=str:   3.73   1.91   2.12   3.15   3.00
New:          1.20   1.46   1.81   2.13   3.38
w/ key=str:   1.72   2.00   2.48   2.76   3.01

Using 64 elements

% Hashable:    100     90     66     33      0
Original:     1.15   1.29   1.61   1.64   1.43
w/ key=str:   1.98   2.50   3.09   3.55   3.99
New:          1.00   1.47   2.18   3.01   3.60
w/ key=str:   1.87   2.30   2.79   3.41   3.84

Using 256 elements

% Hashable:    100     90     66     33      0
Original:     2.70   3.66   5.19   5.34   4.41
w/ key=str:   4.06   5.07   6.26   6.93   6.98
New:          1.00   1.65   2.92   5.28   7.62
w/ key=str:   2.28   2.71   3.76   4.36   4.93

Using 1024 elements

% Hashable:    100     90     66     33      0
Original:     9.30   12.4   18.8   21.4   16.9
w/ key=str:   11.1   13.1   16.3   17.5   13.9
New:          1.00   1.84   6.20   13.1   19.8
w/ key=str:   2.31   2.79   3.59   4.50   5.16

Using 4096 elements

% Hashable:    100     90     66     33      0
Original:     33.7   44.3   69.1   79.4   60.5
w/ key=str:   36.7   44.2   59.3   60.1   40.4
New:          1.00   3.73   18.1   42.2   63.7
w/ key=str:   2.23   2.56   3.33   4.19   4.93

Using 16384 elements

% Hashable:    100     90     66     33      0
Original:     126.   173.   265.   313.   243.
w/ key=str:   136.   164.   215.   213.   147.
New:          1.00   12.5   68.6   173.   263.
w/ key=str:   2.24   2.60   3.28   4.14   4.80
"""

--------------

Code attached, unless I forget ;).
No guarantees that it still works the same way, or works at all, or is the right file.

Every item is repeated 5 times on average for any length of the list being unique-ified. I'll try it with this changed later.

Basically, the new version is faster on all but ~100% non-hashable lists when there are more than ~1000 elements, and on more-hashable lists it's quadratically faster. When slower, it's by only about 10% to 20%. Of course, specifying whether or not your list is fully-hashable would be more efficient (probably 10%) but that's not as nice to use.

Additionally, if you use key="" to uniquify by hashable values you are able to get a good speed up with the new version.

Example of use:
iteruniq(<list of set>, key=frozenset)

With really small non-hashable lists, the original is significantly better (3x).
iteruniq.py

Andrew Barnert

unread,
Nov 21, 2012, 7:06:07 PM11/21/12
to Joshua Landau, python-ideas
From: Joshua Landau <joshua.l...@gmail.com>
Sent: Sat, November 17, 2012 11:38:22 AM

>Surely the best choice is two have *two* caches; one for hashables and another
>for the rest.

Your implementation does a try: hash() to decide whether to check the set or the
list, instead of just doing a try: item in set except: item in list. Is there a
reason for this? It's more complicated, and it's measurably slower.

>This might be improvable with a *third* chache if some non-hashables had total
>ordering, but that could also introduce bugs I think. It'd also be a lot harder

>and likely be slower anyway.

I agree that it's probably not worth adding to something in the standard
library, or a recipe given in the documentation (in fact, I think I already said
as much earlier in the thread), but I think you've got most of those facts
wrong.

It's not a lot harder. The same 4 lines you have to add to do a
try-set-except-list, you just do again, so it's
try-set-except-try-sortedlist-except-list. And it won't introduce any bugs. And
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
which is obviously better, but probably slower for tiny M, and another 5-10%
overhead for inappropriate values.

The problem is finding an appropriate sortedcollection type. There isn't one in
the standard library. There's a link to an external SortedCollection reference
in the bisect docs page, but that has O(N) insertion time, so it won't help. The
most popular library I know of is blist.sortedlist, and that works, but it has
quite a bit of overhead for searching tiny lists. As I understand it, the reason
there isn't a standard sorted collection is the assumption that people dealing
with huge sequences ought to expect to have some searching, comparing, and
profiling of algorithms in their future, while those people dealing with len<10
sequences shouldn't have to think at all.

At any rate, I tried a few different sorted collections. The performance for
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the
crossover was around M=100, and you get 10x faster by around M=100K. Deciding
whether this is appropriate, and which implementation to use, and so on… well,
that's exactly why there's no sorted list in the stdlib in the first place.

Joshua Landau

unread,
Nov 22, 2012, 5:33:49 PM11/22/12
to Andrew Barnert, python-ideas
On 22 November 2012 00:06, Andrew Barnert <abar...@yahoo.com> wrote:
From: Joshua Landau <joshua.l...@gmail.com>
Sent: Sat, November 17, 2012 11:38:22 AM

>Surely the best choice is two have *two* caches; one for hashables and another
>for the rest.

Your implementation does a try: hash() to decide whether to check the set or the
list, instead of just doing a try: item in set except: item in list. Is there a
reason for this? It's more complicated, and it's measurably slower.

I did not realise that "[] in set()" raised an error! I'd just assumed it returned False.

Thank you, this does make small but significant difference.
 
>This might be improvable with a *third* chache if some non-hashables had total
>ordering, but that could also introduce bugs I think. It'd also be a lot harder
>and likely be slower anyway.

I agree that it's probably not worth adding to something in the standard
library, or a recipe given in the documentation (in fact, I think I already said
as much earlier in the thread), but I think you've got most of those facts
wrong.

It's not a lot harder. The same 4 lines you have to add to do a
try-set-except-list, you just do again, so it's
try-set-except-try-sortedlist-except-list.

Well, I'd sort-of assumed that this included adding  sorted collection to the mix, as it isn't in the standard library.
 
And it won't introduce any bugs.

This took me a while to prove, so I'm proud of this:

>>> from blist import sortedlist
>>> {2} in sortedlist([{1, 2}, {1, 3}, {2}])
False

You *cannot* assume that a data set has total ordering on the basis that it's working so far.
 
And
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
which is obviously better, but probably slower for tiny M, and another 5-10%
overhead for inappropriate values.

Well yes... bar the fact that you may be using it on something with a non-even distribution of "things" where some types are not comparable to each-other:

[ {1, 2}, [3, 4], [1, 2], [7, 4], [2, 3], (5, 2), [2, 1] ... ]

Where you'll get nowhere near O(NlogM).

*And* then there's the fact that sorted collections have intrinsically more overhead, and so are likely to give large overhead.

The problem is finding an appropriate sortedcollection type. There isn't one in
the standard library. There's a link to an external SortedCollection reference
in the bisect docs page, but that has O(N) insertion time, so it won't help. The
most popular library I know of is blist.sortedlist, and that works, but it has
quite a bit of overhead for searching tiny lists. As I understand it, the reason
there isn't a standard sorted collection is the assumption that people dealing
with huge sequences ought to expect  to have some searching, comparing, and
profiling of algorithms in their  future, while those people dealing with len<10
sequences shouldn't have to think at all.

At any rate, I tried a few different sorted collections. The performance for
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the
crossover was around M=100, and you get 10x faster by around M=100K. Deciding
whether this is appropriate, and which implementation to use, and so on… well,
that's exactly why there's no sorted list in the stdlib in the first place.

Thank you for the numbers. May I ask what libraries you used?

Andrew Barnert

unread,
Nov 22, 2012, 6:57:33 PM11/22/12
to Joshua Landau, python-ideas
From: Joshua Landau <joshua.l...@gmail.com>
Sent: Thu, November 22, 2012 2:34:30 PM

>
>On 22 November 2012 00:06, Andrew Barnert <abar...@yahoo.com> wrote:
>
>From: Joshua Landau <joshua.l...@gmail.com>
>>Sent: Sat, November 17, 2012 11:38:22 AM
>>
>>
>>>Surely the best choice is two have *two* caches; one for hashables and
another
>>>for the rest.
>>
>>Your implementation does a try: hash() to decide whether to check the set or
>the
>>list, instead of just doing a try: item in set except: item in list. Is there
a
>>reason for this? It's more complicated, and it's measurably slower.

> I did not realise that "[] in set()" raised an error! I'd just assumed it
> returned False.

I just realized that this doesn't seem to be documented anywhere. It's obvious
that set.add would have to raise for a non-hashable, but x in set could be
implemented as either raising or returning False without violating any of the
requirements at http://docs.python.org/3/library/stdtypes.html#set or anywhere
else that I can see…

I did a quick test with PyPy and Jython built-in sets, the old sets module, and
the Python recipe that used to be linked for pre-2.3 compat, and they all do the
same thing as CPython. (The pure Python versions are all just implemented as a
dict where all the values are True.) But that still doesn't necessarily
guarantee that it's safe in all possible future implementations…

Maybe the documentation should be updated to guarantee this—it's a useful thing
to rely on, all current implementations provide it, and it's hard to think of a
good reason why breaking it could improve performance or implementation
simplicity.

> Well, I'd sort-of assumed that this included adding sorted collection to the
> mix, as it isn't in the standard library.

Yes, as I said later, that's the biggest reason not to consider it as a general
solution.



> You *cannot* assume that a data set has total ordering on the basis that it's
> working so far.

You're right. I was thinking that a sorted collection should reject adding
elements that aren't totally ordered with the existing elements… but now that I
think about it, there's obviously no way to do that in O(log N) time.



>> And
>>as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
>>which is obviously better, but probably slower for tiny M, and another 5-10%
>>overhead for inappropriate values.
>>
>
> Well yes... bar the fact that you may be using it on something with a non-even

> distribution of "things" where some types are not comparable to each-other:

I didn't speak very precisely here, because it's hard to be concise, but the
total performance is O(A) + O(BlogM) + O(CN), where A is the number of hashable
elements, B is the number of non-hashable but sortable elements that are
comparable to the first non-hashable but sortable element, M is the number of
unique elements within B, C is the number of elements that are neither hashable
nor comparable with the elements of B, and N is the number of unique elements
within C.

The point is that, if a significant subset of the elements are in B, this will
be better than O(A)+O(CN); otherwise, it'll be the same. Just as O(A)+O(CN) is
better than O(CN) if a significant subset of the elements are in A, otherwise
the same. So, it's an improvement in the same way that adding the set is an
improvement.

> *And* then there's the fact that sorted collections have intrinsically more
> overhead, and so are likely to give large overhead.

I mentioned that later (and you commented on it). When M is very small
(especially if B is also very large), there's a substantial added cost. Of
course the same is true for the set, but the crossover for the set happens
somewhere between 2-10 unique elements instead of 100, and the cost below that
crossover is much smaller.

>>At any rate, I tried a few different sorted collections. The performance for
>>single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist);
the
>>crossover was around M=100, and you get 10x faster by around M=100K. Deciding
>>whether this is appropriate, and which implementation to use, and so on… well,
>>that's exactly why there's no sorted list in the stdlib in the first place.

> Thank you for the numbers. May I ask what libraries you used?

* blist (PyPI): hybrid B+Tree/array
* pyavl (PyPI): AVL tree
* bintrees (PyPI): AVL tree and red-black tree
* opus7 (http://www.brpreiss.com/books/opus7/): AVL tree
* libc++ std::set (incomplete hacked-up Cython interface): red-black tree
* CHDataStructures (via PyObjC): not sure
* java.util.TreeSet (via jython): red-black tree
* java.util.concurrrent.ConcurrentSkipListSet: skip-list
* QtCore.QMap (via PyQt4): skip-list

Some of these are hacked-up implementations that only handle just enough of the
interface I need, in some cases even faking the comparisons, and in some cases
not even complete enough to run the real test (so I just tested the time to
test-and-insert B values M/B values). So, it's not a scientific test or
anything, but they were all in the expected ballpark (and the few that weren't
turned out not to be O(log N) insert time, so I threw them out).

The most thorough tests were with blist; I think I posted the complete numbers
before, but the short version is: 8.0x with 2 unique values; 1.1x with 64; 0.9x
with 128; 0.1x with 128K, all with 256K total values.

Reply all
Reply to author
Forward
0 new messages