Paul
_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas
http://www.voidspace.org.uk/
May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
-- the sqlite blessing http://www.sqlite.org/different.html
More interestingly, with this:
if hashable_only is None:
try:
return iteruniq(iterable, key, preserve_order, True)
except TypeError:
return iteruniq(iterable, key, preserve_order, False)
… hashable_only=None is only 8% slower than hashable_only=False when you have
non-hashables, and 91% faster when you don't. (And trying unique_everseen
instead of iteruniq if hashable_only is None and preserve_order makes that 7%
and 94%.)
The current unique_everseen is already by far the longest recipe on the
itertools docs page, but it still might be worth updating with a synthesized
best-of-all-options version, or actually adding to itertools instead of leaving
as a recipe.
----- Original Message ----
>Surely the best choice is two have *two* caches; one for hashables and another
>for the rest.
Your implementation does a try: hash() to decide whether to check the set or the
list, instead of just doing a try: item in set except: item in list. Is there a
reason for this? It's more complicated, and it's measurably slower.
>This might be improvable with a *third* chache if some non-hashables had total
>ordering, but that could also introduce bugs I think. It'd also be a lot harder
>and likely be slower anyway.
I agree that it's probably not worth adding to something in the standard
library, or a recipe given in the documentation (in fact, I think I already said
as much earlier in the thread), but I think you've got most of those facts
wrong.
It's not a lot harder. The same 4 lines you have to add to do a
try-set-except-list, you just do again, so it's
try-set-except-try-sortedlist-except-list. And it won't introduce any bugs. And
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
which is obviously better, but probably slower for tiny M, and another 5-10%
overhead for inappropriate values.
The problem is finding an appropriate sortedcollection type. There isn't one in
the standard library. There's a link to an external SortedCollection reference
in the bisect docs page, but that has O(N) insertion time, so it won't help. The
most popular library I know of is blist.sortedlist, and that works, but it has
quite a bit of overhead for searching tiny lists. As I understand it, the reason
there isn't a standard sorted collection is the assumption that people dealing
with huge sequences ought to expect to have some searching, comparing, and
profiling of algorithms in their future, while those people dealing with len<10
sequences shouldn't have to think at all.
At any rate, I tried a few different sorted collections. The performance for
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the
crossover was around M=100, and you get 10x faster by around M=100K. Deciding
whether this is appropriate, and which implementation to use, and so on… well,
that's exactly why there's no sorted list in the stdlib in the first place.
From: Joshua Landau <joshua.l...@gmail.com>
Sent: Sat, November 17, 2012 11:38:22 AM
Your implementation does a try: hash() to decide whether to check the set or the
>Surely the best choice is two have *two* caches; one for hashables and another
>for the rest.
list, instead of just doing a try: item in set except: item in list. Is there a
reason for this? It's more complicated, and it's measurably slower.
>This might be improvable with a *third* chache if some non-hashables had totalI agree that it's probably not worth adding to something in the standard
>ordering, but that could also introduce bugs I think. It'd also be a lot harder
>and likely be slower anyway.
library, or a recipe given in the documentation (in fact, I think I already said
as much earlier in the thread), but I think you've got most of those facts
wrong.
It's not a lot harder. The same 4 lines you have to add to do a
try-set-except-list, you just do again, so it's
try-set-except-try-sortedlist-except-list.
And it won't introduce any bugs.
And
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
which is obviously better, but probably slower for tiny M, and another 5-10%
overhead for inappropriate values.
The problem is finding an appropriate sortedcollection type. There isn't one in
the standard library. There's a link to an external SortedCollection reference
in the bisect docs page, but that has O(N) insertion time, so it won't help. The
most popular library I know of is blist.sortedlist, and that works, but it has
quite a bit of overhead for searching tiny lists. As I understand it, the reason
there isn't a standard sorted collection is the assumption that people dealing
with huge sequences ought to expect to have some searching, comparing, and
profiling of algorithms in their future, while those people dealing with len<10
sequences shouldn't have to think at all.
At any rate, I tried a few different sorted collections. The performance for
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the
crossover was around M=100, and you get 10x faster by around M=100K. Deciding
whether this is appropriate, and which implementation to use, and so on… well,
that's exactly why there's no sorted list in the stdlib in the first place.
> I did not realise that "[] in set()" raised an error! I'd just assumed it
> returned False.
I just realized that this doesn't seem to be documented anywhere. It's obvious
that set.add would have to raise for a non-hashable, but x in set could be
implemented as either raising or returning False without violating any of the
requirements at http://docs.python.org/3/library/stdtypes.html#set or anywhere
else that I can see…
I did a quick test with PyPy and Jython built-in sets, the old sets module, and
the Python recipe that used to be linked for pre-2.3 compat, and they all do the
same thing as CPython. (The pure Python versions are all just implemented as a
dict where all the values are True.) But that still doesn't necessarily
guarantee that it's safe in all possible future implementations…
Maybe the documentation should be updated to guarantee this—it's a useful thing
to rely on, all current implementations provide it, and it's hard to think of a
good reason why breaking it could improve performance or implementation
simplicity.
> Well, I'd sort-of assumed that this included adding sorted collection to the
> mix, as it isn't in the standard library.
Yes, as I said later, that's the biggest reason not to consider it as a general
solution.
> You *cannot* assume that a data set has total ordering on the basis that it's
> working so far.
You're right. I was thinking that a sorted collection should reject adding
elements that aren't totally ordered with the existing elements… but now that I
think about it, there's obviously no way to do that in O(log N) time.
>> And
>>as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
>>which is obviously better, but probably slower for tiny M, and another 5-10%
>>overhead for inappropriate values.
>>
>
> Well yes... bar the fact that you may be using it on something with a non-even
> distribution of "things" where some types are not comparable to each-other:
I didn't speak very precisely here, because it's hard to be concise, but the
total performance is O(A) + O(BlogM) + O(CN), where A is the number of hashable
elements, B is the number of non-hashable but sortable elements that are
comparable to the first non-hashable but sortable element, M is the number of
unique elements within B, C is the number of elements that are neither hashable
nor comparable with the elements of B, and N is the number of unique elements
within C.
The point is that, if a significant subset of the elements are in B, this will
be better than O(A)+O(CN); otherwise, it'll be the same. Just as O(A)+O(CN) is
better than O(CN) if a significant subset of the elements are in A, otherwise
the same. So, it's an improvement in the same way that adding the set is an
improvement.
> *And* then there's the fact that sorted collections have intrinsically more
> overhead, and so are likely to give large overhead.
I mentioned that later (and you commented on it). When M is very small
(especially if B is also very large), there's a substantial added cost. Of
course the same is true for the set, but the crossover for the set happens
somewhere between 2-10 unique elements instead of 100, and the cost below that
crossover is much smaller.
>>At any rate, I tried a few different sorted collections. The performance for
>>single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist);
the
>>crossover was around M=100, and you get 10x faster by around M=100K. Deciding
>>whether this is appropriate, and which implementation to use, and so on… well,
>>that's exactly why there's no sorted list in the stdlib in the first place.
> Thank you for the numbers. May I ask what libraries you used?
* blist (PyPI): hybrid B+Tree/array
* pyavl (PyPI): AVL tree
* bintrees (PyPI): AVL tree and red-black tree
* opus7 (http://www.brpreiss.com/books/opus7/): AVL tree
* libc++ std::set (incomplete hacked-up Cython interface): red-black tree
* CHDataStructures (via PyObjC): not sure
* java.util.TreeSet (via jython): red-black tree
* java.util.concurrrent.ConcurrentSkipListSet: skip-list
* QtCore.QMap (via PyQt4): skip-list
Some of these are hacked-up implementations that only handle just enough of the
interface I need, in some cases even faking the comparisons, and in some cases
not even complete enough to run the real test (so I just tested the time to
test-and-insert B values M/B values). So, it's not a scientific test or
anything, but they were all in the expected ballpark (and the few that weren't
turned out not to be O(log N) insert time, so I threw them out).
The most thorough tests were with blist; I think I posted the complete numbers
before, but the short version is: 8.0x with 2 unique values; 1.1x with 64; 0.9x
with 128; 0.1x with 128K, all with 256K total values.