[Python-Dev] Should set objects maintain insertion order too?


Larry Hastings

Dec 15, 2019, 9:51:52 PM
to Python-Dev


As of 3.7, dict objects are guaranteed to maintain insertion order.  But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either.  Should they?

I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready".  The existing set object would work fine here, except that it doesn't maintain insertion order.  That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult.  I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.
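In sketch form, the workaround looks like this (the job names here are made up):

ready = {}                   # a dict standing in for an insertion-ordered set
ready["job-a"] = None        # "add" a job
ready["job-b"] = None
ready.pop("job-a", None)     # "discard" a job marked "not ready"
for job in ready:            # iterates in insertion order, stable across runs
    print(job)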

ISTM that all the reasons why dicts should maintain insertion order also apply to sets, and so I think we should probably do this.  Your thoughts?


/arry

p.s. My dim recollection is that the original set object was a hacked-up copy of the dict object removing the ability to store values.  So there's some precedent for dicts and sets behaving similarly.  (Note: non-pejorative use of "hacked-up" here, that was a fine way to start.)

Tim Peters

Dec 15, 2019, 10:02:03 PM
to Larry Hastings, Python-Dev
[Larry Hastings <la...@hastings.org>]
> As of 3.7, dict objects are guaranteed to maintain insertion order. But set
> objects make no such guarantee, and AFAIK in practice they don't maintain
> insertion order either.

If they ever appear to, it's an accident you shouldn't rely on.

> Should they?

From Raymond, 22 Dec 2017:

https://twitter.com/raymondh/status/944454031870607360
"""
Sets use a different algorithm that isn't as amenable to retaining
insertion order. Set-to-set operations lose their flexibility and
optimizations if order is required. Set mathematics are defined in
terms of unordered sets. In short, set ordering isn't in the immediate
future.
"""

Which is more an answer to "will they?" than "should they?" ;-)

Larry Hastings

Dec 15, 2019, 10:50:27 PM
to Tim Peters, Python-Dev


That's not a reason why we shouldn't do it.  Python is the language where speed, correctness, and readability trump performance every time.  We should decide what semantics we want, then do that, period.  And anyway, it seems like some genius always figures out how to make it fast sooner or later ;-)

I can believe that, given the current implementation, sets might give up some of their optimizations if they maintained insertion order.  But I don't understand what "flexibility" they would lose.  Apart from being slower, what would set objects give up if they maintained insertion order?  Are there operations on set objects that would no longer be possible?


/arry

Raymond Hettinger

Dec 15, 2019, 11:29:08 PM
to Larry Hastings, Python-Dev@Python.Org


> On Dec 15, 2019, at 6:48 PM, Larry Hastings <la...@hastings.org> wrote:
>
> As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?

I don't think they should.

Several thoughts:

* The corresponding mathematical concept is unordered and it would be weird to impose such an order.

* You can already get membership testing while retaining insertion ordering by running dict.fromkeys(seq) (see the sketch after this list).

* Set operations have optimizations that preclude giving a guaranteed order (for example, set intersection loops over the smaller of the two input sets).

* To implement ordering, set objects would have to give up their current membership testing optimization that exploits cache locality in lookups (it looks at several consecutive hashes at a time before jumping to the next random position in the table).

* The ordering we have for dicts uses a hash table that indexes into a sequence. That works reasonably well for typical dict operations but is unsuitable for set operations where some common use cases make interspersed additions and deletions (that is why the LRU cache still uses a cheaply updated doubly-linked list rather than deleting and reinserting dict entries).

* This idea has been discussed a couple times before and we've decided not to go down this path. I should document this prominently because it is inevitable that it will be suggested periodically, since it is such an obvious thing to consider.
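For example, here's the dict.fromkeys() approach from the second bullet in action (the data is made up):

>>> members = dict.fromkeys(['red', 'green', 'red', 'blue'])
>>> 'green' in members     # fast membership testing
True
>>> list(members)          # insertion order retained, duplicates dropped
['red', 'green', 'blue']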


Raymond

Guido van Rossum

Dec 15, 2019, 11:33:41 PM
to Larry Hastings, Python-Dev, Tim Peters
Actually, for dicts the implementation came first.

--
--Guido (mobile)

Larry Hastings

Dec 16, 2019, 12:04:27 AM
to Raymond Hettinger, Python-Dev@Python.Org


On 12/15/19 8:25 PM, Raymond Hettinger wrote:
>> On Dec 15, 2019, at 6:48 PM, Larry Hastings <la...@hastings.org> wrote:
>> As of 3.7, dict objects are guaranteed to maintain insertion order.  But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either.  Should they?
> I don't think they should.
>
> Several thoughts:
>
> * The corresponding mathematical concept is unordered and it would be weird to impose such an order.

I disagree--I assert it's no more or less weird than imposing order on dicts, which we eventually decided to do.


> * You can already get membership testing while retaining insertion ordering by running dict.fromkeys(seq).

I'm not sure where you're going with this.  The argument "you can do this with other objects" didn't preclude us from, for example, adding ordering to the dict object.  We already had collections.OrderedDict, and we decided to add the functionality to the dict object.


> * Set operations have optimizations that preclude giving a guaranteed order (for example, set intersection loops over the smaller of the two input sets).
>
> * To implement ordering, set objects would have to give up their current membership testing optimization that exploits cache locality in lookups (it looks at several consecutive hashes at a time before jumping to the next random position in the table).
>
> * The ordering we have for dicts uses a hash table that indexes into a sequence.  That works reasonably well for typical dict operations but is unsuitable for set operations where some common use cases make interspersed additions and deletions (that is why the LRU cache still uses a cheaply updated doubly-linked list rather than deleting and reinserting dict entries).

These are all performance concerns.  As I mentioned previously in this thread, in my opinion we should figure out what semantics we want for the object, then implement those, and only after that should we worry about performance.  I think we should decide the question "should set objects maintain insertion order?" literally without any consideration about performance implications.


> * This idea has been discussed a couple times before and we've decided not to go down this path.  I should document this prominently because it is inevitable that it will be suggested periodically, since it is such an obvious thing to consider.

Please do!

In the email quoted by Tim Peters earlier in the thread, you stated that set objects "lose their flexibility" if they maintain order.  Can you be more specific about what you meant by that?


/arry

p.s. To be clear, I'm not volunteering to add this feature to the set object implementation--that's above my pay grade.  But I'm guessing that, if we decided we wanted these semantics for the set object, someone would volunteer to implement it.

Inada Naoki

Dec 16, 2019, 1:58:16 AM
to Guido van Rossum, Python-Dev, Tim Peters
On Mon, Dec 16, 2019 at 1:33 PM Guido van Rossum <gu...@python.org> wrote:
>
> Actually, for dicts the implementation came first.
>

I have tried implementing an ordered set. Here is the implementation:
https://github.com/methane/cpython/pull/23

Regards,

Serhiy Storchaka

Dec 16, 2019, 3:49:54 AM
to pytho...@python.org
16.12.19 04:48, Larry Hastings wrote:
> As of 3.7, dict objects are guaranteed to maintain insertion order.  But
> set objects make no such guarantee, and AFAIK in practice they don't
> maintain insertion order either.  Should they?
>
> I do have a use case for this. In one project I maintain a "ready" list
> of jobs; I need to iterate over it, but I also want fast lookup because
> I sometimes remove jobs when they're subsequently marked "not ready".
> The existing set object would work fine here, except that it doesn't
> maintain insertion order. That means multiple runs of the program with
> the same inputs may result in running jobs in different orders, and this
> instability makes debugging more difficult.  I've therefore switched
> from a set to a dict with all keys mapped to None, which provides all
> the set-like functionality I need.
>
> ISTM that all the reasons why dicts should maintain insertion order also
> apply to sets, and so I think we should probably do this.  Your thoughts?

The current dict implementation is called a "compact dict"
implementation because it saves memory in common cases; that was the
initial reason for writing it. At the same time there was a need for
ordered dicts for kwargs and class namespaces. We discovered that a
slightly modified compact dict implementation preserves order, and that
its possible drawback (a performance penalty) is negligible, if it
exists at all.

But an ordered set implementation would not be more compact than the
current set implementation (because a dict item is 3 words, while a set
item is only 2). Also, the current set implementation has some
optimizations that the dict implementation never had, which would be
lost in an ordered set implementation.

Take into account that sets are used far less than dicts, and that you
can use a dict if you need an ordered set-like object.

Petr Viktorin

Dec 16, 2019, 7:18:38 AM
to pytho...@python.org
On 2019-12-16 06:00, Larry Hastings wrote:
[...]
>
> These are all performance concerns.  As I mentioned previously in this
> thread, in my opinion we should figure out what semantics we want for
> the object, then implement those, and only after that should we worry
> about performance.  I think we should decide the question "should set
> objects maintain insertion order?" literally without any consideration
> about performance implications.

Then this thread is missing arguments saying *why* ordered dicts are
actually better, semantics-wise.

Originally, making dicts ordered was all about performance (or rather
memory efficiency, which falls in the same bucket.) It wasn't added
because it's better semantics-wise.
Here's one (very simplified and maybe biased) view of the history of dicts:

* 2.x: Dicts are unordered, please don't rely on the order.
* 3.3: Dict iteration order is now randomized. We told you not to rely
on it!
* 3.6: We now use an optimized implementation of dict that saves memory!
As a side effect it makes dicts ordered, but that's an implementation
detail, please don't rely on it.
* 3.7: Oops, people are now relying on dicts being ordered. Insisting on
people not relying on it is battling windmills. Also, it's actually
useful sometimes, and alternate implementations can support it pretty
easily. Let's make it a language feature! (Later it turns out
MicroPython can't support it easily. Tough luck.)


By itself, "we already made dicts do it" is not a great argument in the
set *semantics* debate.
Of course, it may turn out ordering was a great idea semantics-wise as
well, but if I'm reading correctly, so far this thread has one piece of
anecdotal evidence for that.


Guido van Rossum

Dec 16, 2019, 11:49:20 AM
to Petr Viktorin, Python-Dev
I don't know if this works the same for sets, but for dicts, this is the semantics that everyone has wanted for a long time. It makes doctests (and similar things) easier to write, reduces a source of nondeterministic failure, and removes a wart that everyone had to learn:

>>> {"foo": 1, "bar": 2, "baz": 3}
{'baz': 3, 'foo': 1, 'bar': 2}
>>>

It's essentially the same reason why Python specifies that expressions are (in most cases) evaluated from left to right. Users don't realize that f()+g() might call g() before f(), and write code that assumes f() is called first -- the language should not disappoint them, optimization opportunities be damned.
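For instance, this snippet is guaranteed to print f's message before g's:

def f():
    print("f called first")
    return 1

def g():
    print("g called second")
    return 2

print(f() + g())    # f called first, g called second, 3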
--
--Guido van Rossum (python.org/~guido)

Steve Dower

Dec 16, 2019, 12:25:35 PM
to pytho...@python.org
On 16Dec2019 0417, Petr Viktorin wrote:
> Originally, making dicts ordered was all about performance (or rather
> memory efficiency, which falls in the same bucket.) It wasn't added
> because it's better semantics-wise.
> Here's one (very simplified and maybe biased) view of the history of dicts:
>
> * 2.x: Dicts are unordered, please don't rely on the order.
> * 3.3: Dict iteration order is now randomized. We told you not to rely
> on it!
> * 3.6: We now use an optimized implementation of dict that saves memory!
> As a side effect it makes dicts ordered, but that's an implementation
> detail, please don't rely on it.
> * 3.7: Oops, people are now relying on dicts being ordered. Insisting on
> people not relying on it is battling windmills. Also, it's actually
> useful sometimes, and alternate implementations can support it pretty
> easily. Let's make it a language feature! (Later it turns out
> MicroPython can't support it easily. Tough luck.)

For the record, we missed out on a very memory efficient "frozendict"
implementation because it can't maintain insertion order - Yury is
currently proposing it as FrozenMap in PEP 603.
https://discuss.python.org/t/pep-603-adding-a-frozenmap-type-to-collections/2318

Codifying semantics isn't always the kind of future-proofing we necessarily
want to have :)

Cheers,
Steve

David Mertz

Dec 16, 2019, 1:01:16 PM
to Raymond Hettinger, Python-Dev@Python.Org
On Sun, Dec 15, 2019 at 11:28 PM Raymond Hettinger <raymond....@gmail.com> wrote:
> * The corresponding mathematical concept is unordered and it would be weird to impose such an order.

I'm with Raymond in not wanting sets to maintain insertion (or any) order.  Even though I don't doubt that Larry--and no doubt other folks, from time to time--have a use for an "ordered set," I feel like it is bad practice to encourage that way of thinking about sets and using them.

Admittedly, I was only lukewarm about making an insertion-order guarantee for dictionaries too.  But for sets I feel more strongly opposed.  Although it seems unlikely now, if some improved implementation of sets had the accidental side effects of making them ordered, I would still not want that to become a semantic guarantee.

That said, having OrderedSet in collections module would be fine by me.  It might have different performance characteristics, but so what? It would be a different class that folks could use or not, depending on how they felt about its behavior and performance profile.
 

--
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.

Barry Warsaw

Dec 16, 2019, 1:26:40 PM
to David Mertz, Raymond Hettinger, Python-Dev@Python.Org
I can’t think of a time when I’ve really needed sets to maintain insertion order, but thinking about it from a user’s perspective, it’s a natural leap from ordered dicts to ordered sets. At least I don’t immediately think about sets as their mathematical equivalent, but as sets were originally proposed: efficient collections of value-less keys. Before we had sets, it was common to use dicts with values of None to represent the same concept.

I’m fine with a decision not to change the ordering semantics of sets, but we should update the Language Reference to describe the language guarantees for both sets and dicts. The Library Reference does document the ordering semantics for both dicts and sets, but I wouldn’t say that information is easy to find. Maybe we can make the latter more clear too.

Cheers,
-Barry

Tim Peters

Dec 16, 2019, 2:03:46 PM
to Guido van Rossum, Python-Dev
[Guido]
> ...
> the language should not disappoint them, optimization opportunities be damned.

I would like to distinguish between two kinds of "optimization
opportunities": theoretical ones that may or may not be exploited
some day, and those that CPython has _already_ exploited.

That is, we don't have a blank slate here. As Raymond said, the set
implementation already diverged from the dict implementation in
fundamental ways for "go fast" reasons. How much are people willing
to see set operations slow down to get this new feature?

For me, "none" ;-) Really. I have no particular use for "ordered"
sets, but have set-slinging code that benefits _today_ from the "go
fast" work Raymond did for them.

Analogy: it was always obvious that list.sort() is "better" stable
than not, but I ended up crafting a non-stable samplesort variant that
ran faster than any stable sort anyone tried to write. For years. So
we stuck with that, to avoid slowing down sorting across releases.

The stable "timsort" finally managed to be _as_ fast as the older
samplesort in almost all cases, and was very much faster in many
real-world cases. "Goes faster" was the thing that really sold it,
and so much so that its increased memory use didn't count for much
against it in comparison.

Kinda similarly, "ordered dicts" were (as has already been pointed
out) originally introduced as a memory optimization ("compact" dicts),
where the "ordered" part fell out more-or-less for free. The same
approach wouldn't save memory for sets (or so I'm told), so the
original motivation for compact dicts doesn't carry over.

So I'm +1 on ordered sets if and only if there's an implementation
that's at least as fast as what we already have. If not now, then,
like ordered dicts evolved, offer a slower OrderedSet type in the
`collections` module for those who really want it, and wait for magic
;-)

BTW, what should

{1, 2} | {3, 4, 5, 6, 7}

return as ordered sets? Beats me. The imbalance between the
operands' cardinalities can be arbitrarily large, and "first make a
copy of the larger one, then loop over the smaller one" is the obvious
way to implement union to minimize the number of lookups needed. The
speed penalty for, e.g., considering "the left" operand's elements'
insertion orders to precede all "the right" operand's elements'
insertion orders can be arbitrarily large.

The concept of "insertion order" just doesn't make well-defined sense
to me for any operation that produces a set from multiple input sets,
unless it means no more than "whatever order happens to be used
internally to add elements to the result set". Dicts don't have
anything like that, although dict.update comes close, but in that case
the result is defined by mutating the dict via a one-at-a-time loop
over the argument dict.

Victor Stinner

Dec 16, 2019, 5:58:28 PM
to Inada Naoki, Python-Dev, Tim Peters
That looks quite interesting. It looks like the compact dict
optimization applied to sets. I had the same idea :-)

If it reduces the memory footprint, keeps insertion order, and has low
performance overhead, it would be an interesting idea!

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.


David Cuthbert via Python-Dev

Dec 16, 2019, 7:38:58 PM
to David Mertz, Raymond Hettinger, Python-Dev@Python.Org

On Mon 12/16/19, 9:59 AM, "David Mertz" <me...@gnosis.cx> wrote:

> Admittedly, I was only lukewarm about making an insertion-order guarantee for dictionaries too.  But for sets I feel more strongly opposed.  Although it seems unlikely now, if some improved implementation of sets had the accidental side effects of making them ordered, I would still not want that to become a semantic guarantee.

 

Eek… No accidental side effects whenever possible, please. People come to rely upon them (like that chemistry paper example[*]), and changing the assumptions results in a lot of breakage down the line. Changing the length of AWS identifiers (e.g. instances from i-1234abcd to i-0123456789abcdef) was a huge deal; even though the identifier length was never guaranteed, plenty of folks had created database schemata with VARCHAR(10) for instance ids, for example.

 

Break assumptions from day 1. If some platforms happen to return sorted results, sort the other platforms, or toss in a shuffle (say, sorted(results, key=lambda el: random.random())) on the platform that happens to sort. If you’re creating custom identifiers, allocate twice as many bits as you think you’ll need to store them.

 

Yes, this is bad user code, but I’m all for breaking bad user code in obvious ways sooner rather than subtle ways later, especially in a language like Python.

 

[*] https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/

 

David Mertz

Dec 16, 2019, 8:25:06 PM
to David Cuthbert, Raymond Hettinger, Python-Dev@Python.Org
On Mon, Dec 16, 2019, 7:35 PM David Cuthbert <da...@kanga.org> wrote:

> On Mon 12/16/19, 9:59 AM, "David Mertz" <me...@gnosis.cx> wrote:
>
>> If some improved implementation of sets had the accidental side effects of making them ordered, I would still not want that to become a semantic guarantee.
>
> Eek… No accidental side effects whenever possible, please. People come to rely upon them (like that chemistry paper example[*]), and changing the assumptions results in a lot of breakage down the line.

I'm not sure what point you are making really. Any particular implementation will have some behaviors that are not guaranteed by the language spec. I wouldn't want to PROHIBIT a set implementation that preserved insertion order. Nor, for example, would I want to prohibit one that stored string elements in alphabetical order, if that somehow had an independent performance advantage. But that's very different from wanting to REQUIRE sets to iterate in alphabetical order. If they did that, it wouldn't be *wrong*, but it also shouldn't be something we rely on.
This is interesting. But it's a programming error that Python, or any programming language, cannot protect against. glob.glob() does not promise to list matching files in any specific order. If the authors wrote code whose results vary based on the particular order files are processed in, that's a bug. It's their responsibility to order the files appropriately.

Obviously, glob COULD alphabetize its results. But that order is generally different from ordering by creation time. Which is, in turn, different from ordering by modification time or file size. I don't want to prohibit glob from doing any of these if the filesystem happens to make such easier (this is really a filesystem question more than an OS question). But I also don't want to make the order part of the semantics of the function... nor to do extra work to "randomize" the order to avoid some pattern that may happen to exist on some platform.
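The fix for code like that is to impose whatever order you need explicitly (the pattern here is hypothetical):

import glob

for path in sorted(glob.glob("*.csv")):   # sort explicitly; never rely on glob's order
    print(path)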

Larry Hastings

Dec 16, 2019, 8:55:44 PM
to pytho...@python.org


On 12/16/19 10:59 AM, Tim Peters wrote:
> BTW, what should
>
>     {1, 2} | {3, 4, 5, 6, 7}
>
> return as ordered sets?  Beats me.

The obvious answer is {1, 2, 3, 4, 5, 6, 7}.  Which is the result I got in Python 3.8 just now ;-)  But that's just a side-effect of how hashing numbers works, the implementation, etc.  It's rarely stable like this, and nearly any other input would have resulted in the scrambling we all (currently) expect to see.

>>> {"apples", "peaches", "pumpkin pie"} | {"who's", "not", "ready", "holler", "I" }
{'pumpkin pie', 'peaches', 'I', "who's", 'holler', 'ready', 'apples', 'not'}


/arry

Tim Peters

Dec 16, 2019, 9:34:21 PM
to Larry Hastings, Python Dev
[Tim]
> BTW, what should
>
> {1, 2} | {3, 4, 5, 6, 7}
>
> return as ordered sets? Beats me.

[Larry]
> The obvious answer is {1, 2, 3, 4, 5, 6, 7}.

Why? An obvious implementation that doesn't ignore performance entirely is:

def union(smaller, larger):
    # copy the larger operand, then loop over the smaller one
    if len(larger) < len(smaller):
        smaller, larger = larger, smaller
    result = larger.copy()
    for x in smaller:
        result.add(x)
    return result

In the example, that would first copy {3, 4, 5, 6, 7}, and then add 1
and 2 (in that order) to it, giving {3, 4, 5, 6, 7, 1, 2} as the
result.

If it's desired that "insertion order" be consistent across runs,
platforms, and releases, then what "insertion order" _means_ needs to
be rigorously defined & specified for all set operations. This was
comparatively trivial for dicts, because there are, e.g., no
commutative binary operators defined on dicts.

If you want to insist that `a | b` first list all the elements of a,
and then all the elements of b that weren't already in a, regardless
of cost, then you create another kind of unintuitive surprise: in
general the result of "a | b" will display differently than the result
of "b | a" (although the results will compare equal), and despite that
the _user_ didn't "insert" anything.

Tim Peters

Dec 16, 2019, 10:36:48 PM
to Raymond Hettinger, Python-Dev@Python.Org
[Raymond]
> ...
> * The ordering we have for dicts uses a hash table that indexes into a sequence.
> That works reasonably well for typical dict operations but is unsuitable for set
> operations where some common use cases make interspersed additions
> and deletions (that is why the LRU cache still uses a cheaply updated
> doubly-linked list rather than deleting and reinserting dict entries).

I'm going to take a stab at fleshing that out a bit: ideally, an
ordered dict stores hash+key+value records in a contiguous array with
no "holes". That's ("no holes") where the memory savings comes from.
The "holes" are in the hash table, which only hold indices _into_ the
contiguous array (and the smallest width of C integer indices
sufficient to get to every slot in the array).

"The problem" is that deletions leave _two_ kinds of holes: one in
the hash table, and another in the contiguous vector. The latter
holes cannot be filled with subsequent new hash+key+value records
because that would break insertion order.

So in an app that mixes additions and deletions, the ordered dict
needs to be massively rearranged at times to squash out the holes left
behind by deletions, effectively moving all the holes to "the end",
where they can again be used to reflect insertion order.

Unordered dicts never had to be rearranged unless the total size
changed "a lot", and that remains true of the set implementation. But
in apps mixing adds and deletes, ordered dicts can need massive
internal rearrangement at times even if the total size never changes
by more than 1.

Algorithms doing a lot of mixing of adds and deletes seem a lot more
common for sets than for dicts, and so the ordered dict's basic
implementation _approach_ is a lot less suitable for sets. Or, at
least, that's my best attempt to flesh out Raymond's telegraphic
thinking there.

Note: the holes left by deletions _wouldn't_ be "a problem" _except_
for maintaining insertion order. If we were only after the memory
savings, then on deletion "the last" record in the contiguous array
could be moved into the hole at once, leaving the array hole-free
again. But that also changes the order. IIRC, that's what the
original "compact dict" _did_ do.

Tim Peters

Dec 16, 2019, 10:46:57 PM
to Petr Viktorin, Python Dev
[Petr Viktorin <enc...@gmail.com>]
> ...
> Originally, making dicts ordered was all about performance (or rather
> memory efficiency, which falls in the same bucket.) It wasn't added
> because it's better semantics-wise.

As I tried to flesh out a bit in a recent message, the original
"compact dict" idea got all the memory savings, but did _not_ maintain
insertion order. Maintaining insertion order also complicated
deletions (see my recent message), but it was deliberately done because
people really did want "ordered dicts".

> Here's one (very simplified and maybe biased) view of the history of dicts:
>
> * 2.x: Dicts are unordered, please don't rely on the order.
> * 3.3: Dict iteration order is now randomized. We told you not to rely
> on it!
> * 3.6: We now use an optimized implementation of dict that saves memory!
> As a side effect it makes dicts ordered, but that's an implementation
> detail, please don't rely on it.
> * 3.7: Oops, people are now relying on dicts being ordered. Insisting on
> people not relying on it is battling windmills. Also, it's actually
> useful sometimes, and alternate implementations can support it pretty
> easily. Let's make it a language feature! (Later it turns out
> MicroPython can't support it easily. Tough luck.)

A very nice summary! My only quibble is as above: the "compact dict"
implementation doesn't maintain insertion order naturally, _unless_
there are no deletions (which is, e.g., true of dicts constructed to
pass keyword arguments). The code got hairier to maintain insertion
order in the presence of mixing insertions and deletions.

Larry Hastings

Dec 17, 2019, 12:11:03 AM
to Tim Peters, Python Dev


On 12/16/19 6:30 PM, Tim Peters wrote:
If it's desired that "insertion order" be consistent across runs,
platforms, and releases, then what "insertion order" _means_ needs to
be rigorously defined & specified for all set operations.  This was
comparatively trivial for dicts, because there are, e.g., no
commutative binary operators defined on dicts.

My intuition is that figuring out sensible semantics here is maybe not trivial, but hardly impossible.  If this proposal goes anywhere I'd be willing to contribute to figuring it out.


> If you want to insist that `a | b` first list all the elements of a,
> and then all the elements of b that weren't already in a, regardless
> of cost, then you create another kind of unintuitive surprise:  in
> general the result of "a | b" will display differently than the result
> of "b | a" (although the results will compare equal), and despite that
> the _user_ didn't "insert" anything.

Call me weird--and I won't disagree--but I find nothing unintuitive about that.  After all, that's already the world we live in: there are any number of sets that compare as equal but display differently.  In current Python:

>>> a = {'a', 'b', 'c'}
>>> d = {'d', 'e', 'f'}
>>> a | d
{'f', 'e', 'd', 'a', 'b', 'c'}
>>> d | a
{'f', 'b', 'd', 'a', 'e', 'c'}
>>> a | d == d | a
True

This is also true for dicts, in current Python, which of course do maintain insertion order.  Dicts don't have the | operator, so I approximate the operation by duplicating the dict (which AFAIK preserves insertion order) and using update.

>>> aa = {'a': 1, 'b': 1, 'c': 1}
>>> dd = {'d': 1, 'e': 1, 'f': 1}
>>> x = dict(aa)
>>> x.update(dd)
>>> y = dict(dd)
>>> y.update(aa)
>>> x
{'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1}
>>> y
{'d': 1, 'e': 1, 'f': 1, 'a': 1, 'b': 1, 'c': 1}
>>> x == y
True

Since dicts already behave in exactly that way, I don't think it would be too surprising if sets behaved that way too.  In fact, I think it's a little surprising that they behave differently, which I suppose was my whole thesis from the beginning.


Cheers,


/arry

Larry Hastings

Dec 17, 2019, 12:21:41 AM
to pytho...@python.org


On 12/16/19 7:43 PM, Tim Peters wrote:
> [Petr Viktorin <enc...@gmail.com>]
>> Here's one (very simplified and maybe biased) view of the history of dicts:
>>
>> * 2.x: Dicts are unordered, please don't rely on the order.
>> * 3.3: Dict iteration order is now randomized. We told you not to rely
>> on it!
>> * 3.6: We now use an optimized implementation of dict that saves memory!
>> As a side effect it makes dicts ordered, but that's an implementation
>> detail, please don't rely on it.
>> * 3.7: Oops, people are now relying on dicts being ordered. Insisting on
>> people not relying on it is battling windmills. Also, it's actually
>> useful sometimes, and alternate implementations can support it pretty
>> easily. Let's make it a language feature! (Later it turns out
>> MicroPython can't support it easily. Tough luck.)
>
> A very nice summary!  My only quibble is as above:  the "compact dict"
> implementation doesn't maintain insertion order naturally, _unless_
> there are no deletions (which is, e.g., true of dicts constructed to
> pass keyword arguments).  The code got hairier to maintain insertion
> order in the presence of mixing insertions and deletions.


Didn't some paths also get sliiiiightly slower as a result of maintaining insertion order when mixing insertions and deletions?  My recollection is that that was part of the debate--not only "are we going to regret inflicting these semantics on posterity, and on other implementations?", but also "are these semantics worth the admittedly-small performance hit, in Python's most important and most-used data structure?".

Also, I don't recall anything about us resigning ourselves to explicitly maintain ordering on dicts because people were relying on it, "battling windmills", etc.  Dict ordering had never been guaranteed, a lack of guarantee Python had always taken particularly seriously.  Rather, we decided it was a desirable feature, and one worth pursuing even at the cost of a small loss of performance.  One prominent Python core developer** wanted this feature for years, and I recall them saying something like:

Guido says, "When a programmer iterates over a dictionary and they see the keys shift around when the dictionary changes, they learn something!"  To that I say--"Yes!  They learn that Python is unreliable and random!"


/arry

** I won't give their name here because I fear I'm misquoting everybody involved.  Apologies in advance if that's the case!

Tim Peters

Dec 17, 2019, 1:02:55 AM
to Larry Hastings, Python Dev
[Tim]
>> If it's desired that "insertion order" be consistent across runs,
>> platforms, and releases, then what "insertion order" _means_ needs to
>> be rigorously defined & specified for all set operations. This was
>> comparatively trivial for dicts, because there are, e.g., no
>> commutative binary operators defined on dicts.

[Larry]
> My intuition is that figuring out sensible semantics here is maybe not
> trivial, but hardly impossible. If this proposal goes anywhere I'd be
> willing to contribute to figuring it out.

No, it _is_ easy. It's just tedious, adding piles of words to the
docs, and every constraint also constrains possible implementations.
You snipped my example of implementing union, which you should really
think about instead ;-)


>> If you want to insist that `a | b` first list all the elements of a,
>> and then all the elements of b that weren't already in a, regardless
>> of cost, then you create another kind of unintuitive surprise: in
>> general the result of "a | b" will display differently than the result
>> of "b | a" (although the results will compare equal), and despite that
>> the _user_ didn't "insert" anything.

> Call me weird--and I won't disagree--but I find nothing unintuitive about that.
> After all, that's already the world we live in: there are any number of sets
> that compare as equal but display differently. In current Python:
>
> >>> a = {'a', 'b', 'c'}
> >>> d = {'d', 'e', 'f'}
> >>> a | d
> {'f', 'e', 'd', 'a', 'b', 'c'}
> >>> d | a
> {'f', 'b', 'd', 'a', 'e', 'c'}

Yup, it happens. But under the sample union implementation I gave, it
would never happen when the sets had different cardinalities (the
different sizes are used to force a "standard" order then). For
mathematical sets, | is commutative (it makes no difference to the
result if the arguments are swapped - but can make a _big_ difference
to implementation performance unless the implementation is free to
pick the better order).

> ...
> This is also true for dicts, in current Python, which of course do maintain
> insertion order. Dicts don't have the | operator, so I approximate the
> operation by duplicating the dict (which AFAIK preserves insertion order)

Ya, it does, but I don't believe that's documented (it should be).

> and using update.

Too different to be interesting here - update() isn't commutative.
For sets, union, intersection, and symmetric difference are
commutative.

> ...
> Since dicts already behave in exactly that way, I don't think it would be too
> surprising if sets behaved that way too. In fact, I think it's a little surprising
> that they behave differently, which I suppose was my whole thesis from
> the beginning.

I appreciate that dicts and sets behave differently in visible ways
now. It just doesn't bother me enough that I'm cool with slowing set
operations to "fix that".

Tim Peters

Dec 17, 2019, 1:38:13 AM
to Larry Hastings, Python Dev
[Larry]
> Didn't some paths also get sliiiiightly slower as a result of maintaining
> insertion order when mixing insertions and deletions?

I paid no attention at the time. But in going from "compact dict" to
"ordered dict", deletion all by itself got marginally cheaper. The
downside was the need to rearrange the whole dict when too many
"holes" built up. "Compact (but unordered) dict" doesn't need that.

> My recollection is that that was part of the debate--not only "are we going to
> regret inflicting these semantics on posterity, and on other implementations?",
> but also "are these semantics worth the admittedly-small performance hit, in
> Python's most important and most-used data structure?".

There's a layer of indirection in compact dicts - lookups are
distributed across two arrays. In non-compact unordered dicts,
everything is in a single array. Cache effects may or may not make a
measurable difference then, depending on all sorts of things.

> Also, I don't recall anything about us resigning ourselves to explicitly
> maintain ordering on dicts because people were relying on it, "battling
> windmills", etc. Dict ordering had never been guaranteed, a lack
> of guarantee Python had always taken particularly seriously. Rather, we
> decided it was a desirable feature, and one worth pursuing even at the
> cost of a small loss of performance.

I'm not convinced there was a loss of performance. The delay between
the implementation supplying ordered dicts, and the language
guaranteeing it, was, I suspect, more a matter of getting extensive
real-world experience about whether the possible new need to massively
rearrange dict internals to remove "holes" would bite someone too
savagely to live with. But, again, I wasn't paying attention at the
time.

> One prominent Python core developer** wanted this feature for years, and I recall
> them saying something like:
>
> Guido says, "When a programmer iterates over a dictionary and they see the keys
> shift around when the dictionary changes, they learn something!" To that I say--"Yes!
> They learn that Python is unreliable and random!"

I never wanted ordered dicts, but never objected to them either. All
in all, given that _I_ haven't seen a performance degradation, I like
that they're more predictable, and love the memory savings.

But as Raymond said (& I elaborated on), there's reason to believe
that the implementation of ordered dicts is less suited to sets, where
high rates of mixing adds and deletes is more common (thus triggering
high rates of massive internal dict rearranging). Which is why I said
much earlier that I'd be +1 on ordered sets only when there's an
implementation that's as fast as what we have now.

Backing up:

> Python is the language where speed, correctness, and readability trump
> performance every time.

Speed trumping performance didn't make any sense ;-)

So assuming you didn't intend to type "speed", I think you must have,
say, Scheme or Haskell in mind there. "Practicality beats purity" is
never seen on forums for those languages. Implementation speed & pain
have played huge roles in many Python decisions. As someone who has
wrestled with removing the GIL, you should know that better than
anyone ;-)

Larry Hastings

Dec 17, 2019, 2:08:22 AM
to Tim Peters, Python Dev
On 12/16/19 10:32 PM, Tim Peters wrote:
> [Larry]
>> Python is the language where speed, correctness, and readability trump
>> performance every time.
>
> Speed trumping performance didn't make any sense ;-)


Sorry, that was super unclear!  I honestly meant speed of development.


D'oh,


/arry

Steven D'Aprano

Dec 17, 2019, 5:03:33 AM
to pytho...@python.org
On Sun, Dec 15, 2019 at 09:00:31PM -0800, Larry Hastings wrote:

> I think we should decide the question "should set
> objects maintain insertion order?" literally without any consideration
> about performance implications.

In your opening post:

> I also want FAST lookup because I sometimes remove jobs when they're
> subsequently marked "not ready".
[emphasis added]

So do you care about performance or not?

:-)

If you do care (a bit) about performance, what slowdown would you be
willing to take to get ordered sets? That would give us a good idea of
the potential trade-offs that might be acceptable.

Without being facetious[1], if you don't care about performance, you
don't need a set, you could use a list.

There's a serious point here: there's nothing sets can do that couldn't
be implemented using lists. The reason we have sets, rather than a
bunch of extra methods like intersection and symmetric_difference on
lists, is because we considered performance.



[1] Well... maybe a tiny bit... *wink*


--
Steven

Larry Hastings

Dec 17, 2019, 6:07:21 AM
to pytho...@python.org


On 12/17/19 2:02 AM, Steven D'Aprano wrote:
> Without being facetious[1], if you don't care about performance, you don't need a set, you could use a list.


Lists don't enforce uniqueness.  Apart from that a list would probably work fine for my needs; in my admittedly-modest workloads I would probably never notice a performance difference.  My anecdote was merely a jumping-off point for the discussion.


"I don't care about performance" is not because I'm aching for Python to run my code slowly.  It's because I'm 100% confident that the Python community will lovingly optimize the implementation.  So when I have my language designer hat on, I really don't concern myself with performance.  As I thought I said earlier in the thread, I think we should figure out the semantics we want first, and then we figure out how to make it fast.

I'll also cop to "a foolish consistency is the hobgoblin of little minds".  I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values".  I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.


/arry

Paul Moore

Dec 17, 2019, 6:50:49 AM
to Larry Hastings, Python Dev
On Tue, 17 Dec 2019 at 11:13, Larry Hastings <la...@hastings.org> wrote:
> I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values". I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.

I'll admit to a mathematical background, which probably influences my
views. But I view sets as "collections" of values - values are "in"
the set or not. I don't see them operationally, in the sense that you
add and remove things from the set. Adding and removing are mechanisms
for changing the set, but they aren't fundamentally what sets are
about (which for me is membership). So insertion order on sets is
largely meaningless for me (it doesn't matter *how* it got there, what
matters is whether it's there now).

Having said that, I also see dictionaries as mappings from key to
value, so insertion order is mostly not something I care about for
dictionaries either. I can see the use cases for ordered dictionaries,
and now we have insertion order, it's useful occasionally, but it's
not something that was immediately obvious to me based on my
intuition. Similarly, I don't see sets as *needing* insertion order,
although I concede that there probably are use cases, similar to
dictionaries. The question for me, as with dictionaries, is whether
the cost (in terms of clear semantics, constraints on future
implementation choices, and actual performance now) is justified by
the additional capabilities (remembering that personally, I'd consider
insertion order preservation as very much a niche requirement).

Paul

Ivan Levkivskyi

Dec 17, 2019, 7:08:59 AM
to Paul Moore, Python Dev
On Tue, 17 Dec 2019 at 11:48, Paul Moore <p.f....@gmail.com> wrote:
> On Tue, 17 Dec 2019 at 11:13, Larry Hastings <la...@hastings.org> wrote:
>> I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values".  I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.
>
> I'll admit to a mathematical background, which probably influences my
> views. But I view sets as "collections" of values - values are "in"
> the set or not. I don't see them operationally, in the sense that you
> add and remove things from the set. Adding and removing are mechanisms
> for changing the set, but they aren't fundamentally what sets are
> about (which for me is membership). So insertion order on sets is
> largely meaningless for me (it doesn't matter *how* it got there, what
> matters is whether it's there now).
>
> Having said that, I also see dictionaries as mappings from key to
> value, so insertion order is mostly not something I care about for
> dictionaries either. I can see the use cases for ordered dictionaries,
> and now we have insertion order, it's useful occasionally, but it's
> not something that was immediately obvious to me based on my
> intuition. Similarly, I don't see sets as *needing* insertion order,
> although I concede that there probably are use cases, similar to
> dictionaries. The question for me, as with dictionaries, is whether
> the cost (in terms of clear semantics, constraints on future
> implementation choices, and actual performance now) is justified by
> the additional capabilities (remembering that personally, I'd consider
> insertion order preservation as very much a niche requirement).

As a random data point, I often see the pattern where one needs to remove duplicates
from a list while preserving the order of first appearance.
This is for example needed to get stability in various type-checking situations (like union items, type variables in base classes, type queries etc.)

One can write a four-line helper to achieve this (one version is sketched below), but I can see that having an order-preserving set could be useful.
So again, it is "nice to have but not really critical".

--
Ivan


Serhiy Storchaka

Dec 17, 2019, 9:31:47 AM
to pytho...@python.org
17.12.19 14:05, Ivan Levkivskyi wrote:
> As a random data point, I often see the pattern where one needs to
> remove duplicates
> from a list while preserving the order of first appearance.
> This is for example needed to get stability in various type-checking
> situations (like union items, type variables in base classes, type
> queries etc.)
>
> One can write a four-line helper to achieve this, but I can see that
> having an order-preserving set could be useful.
> So again, it is "nice to have but not really critical".

This is a one-liner:

list(dict.fromkeys(iterable))
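
For example:

>>> list(dict.fromkeys(['b', 'a', 'b', 'c', 'a']))
['b', 'a', 'c']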

Tim Peters

Dec 17, 2019, 2:01:06 PM
to Larry Hastings, Python Dev
...

[Larry]
>> One prominent Python core developer** wanted this feature for years, and I recall
>> them saying something like:
>>
>> Guido says, "When a programmer iterates over a dictionary and they see the keys
>> shift around when the dictionary changes, they learn something!" To that I say--"Yes!
>> They learn that Python is unreliable and random!"

[Tim]
> I never wanted ordered dicts, but never objected to them either. All
> in all, given that _I_ haven't seen a performance degradation, I like
> that they're more predictable, and love the memory savings.

That reads like I was correcting Larry for misquoting _me_. Sorry!
He wasn't quoting me, and I didn't think he was. What he did quote
was delightfully snarky enough that I didn't want to snip it, but not
substantial enough to be worth the bother of addressing. So I just
moved on to summarize what I said at the time (i.e., nothing).

Tim Peters

Dec 17, 2019, 2:53:45 PM
to Larry Hastings, Python Dev
[Larry]
> "I don't care about performance" is not because I'm aching for Python to
> run my code slowly. It's because I'm 100% confident that the Python
> community will lovingly optimize the implementation.

I'm not ;-)

> So when I have my language designer hat on, I really don't concern myself
> with performance. As I thought I said earlier in the thread, I think we should
> figure out the semantics we want first, and then we figure out how to make it fast.

Pretty much the opposite of how Python started. The original lists
and dicts were deliberately designed to have dirt simple
implementations, in reaction against ABC's "theoretically optimal"
data structures that were a nightmare to maintain, and had so much
overhead to support "optimality" in all cases that they were, in fact,
much slower in common cases.

There isn't magic waiting to be uncovered here: if you want O(1)
deletion at arbitrary positions in an ordered sequence, a doubly
linked list is _the_ way to do it. That's why, e.g., Raymond said he
still uses a doubly linked list, instead of an ordered dict, in the
LRU cache implementation. If that isn't clear, a cache can be
maintained in least-to-most recently accessed order with an ordered
dict like so:

def lookup(cache, key):
    # `cache` is an ordered dict kept in least-to-most recently used
    # order; compute_value() is a stand-in for the real computation
    if key in cache:
        cached_value = cache.pop(key)  # remove key
    else:
        cached_value = compute_value(key)
    assert key not in cache
    cache[key] = cached_value  # most recently used at the end now
    return cached_value

and deleting the least recently used is just "del
cache[next(iter(cache))]" (off topic, just noting this is a fine use
for the "first()" function that's the subject of a different thread).

We _could_ structure an ordered dict's hash+key+value records as a
doubly linked list instead (and avoid the need for O(N)
reorganizations). But then we piss away much of the memory savings
(to store the new links) that was the _point_ of compact dicts to
begin with.

So there was a compromise. No links, deletion on its own is O(1), but
can periodically require O(N) under-the-covers table reorganizations
to squash out the holes. "Suitable enough" for ordered dicts, but
_thought_ to be unsuitable for ordered sets (which appear to have
higher rates of mixing deletions with insertions - the LRU cache
example being an exception).

But there are also other optimizations in the current set
implementation, so "fine, add the doubly linked list to sets but not
to dicts" is only part of it.

Which may or may not be possible to match, let alone beat, in an
ordered set implementation. A practical barrier now is that Python is
too mature to bank on loving optimizations _after_ a change to a core
feature is released. It's going to need a highly polished
implementation first.

I certainly don't object to anyone trying, but it won't be me ;-)

Nick Coghlan

Dec 19, 2019, 1:36:43 AM
to Tim Peters, Python Dev


On Wed., 18 Dec. 2019, 5:51 am Tim Peters, <tim.p...@gmail.com> wrote:
But there are also other optimizations in the current set
implementation, so "fine, add the doubly linked list to sets but not
to dicts" is only part of it.

Which may or may not be possible to match, let alone beat, in an
ordered set implementation.  A practical barrier now is that Python is
too mature to bank on loving optimizations _after_ a change to a core
feature is released.  It's going to need a highly polished
implementation first.


Starting with "collections.OrderedSet" seems like a reasonable idea, though - that way "like a built-in set, but insertion order preserving" will have an obvious and readily available answer, and it should also make performance comparisons easier.

Cheers,
Nick.


Tim Peters

Dec 19, 2019, 12:06:51 PM
to Nick Coghlan, Python Dev
[Nick Coghlan <ncog...@gmail.com>]
> Starting with "collections.OrderedSet" seems like a reasonable idea,
> though - that way "like a built-in set, but insertion order preserving" will
> have an obvious and readily available answer, and it should also
> make performance comparisons easier.

Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it.

The problem is that the "use case" here isn't really a matter of
functionality, but of consistency: "it would be nice if", like dicts
enjoy now, the iteration order of sets was defined in an
implementation-independent way. That itch isn't scratched unless the
builtin set type defines it.

Nick Coghlan

Dec 19, 2019, 5:45:08 PM
to Tim Peters, Python Dev
On Fri., 20 Dec. 2019, 2:58 am Tim Peters, <tim.p...@gmail.com> wrote:
[Nick Coghlan <ncog...@gmail.com>]
> Starting with "collections.OrderedSet" seems like a reasonable idea,
> though - that way "like a built-in set, but insertion order preserving" will
> have an obvious and readily available answer, and it should also
> make performance comparisons easier.

Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it.

The problem is that the "use case" here isn't really a matter of
functionality, but of consistency:  "it would be nice if", like dicts
enjoy now, the iteration order of sets was defined in an
implementation-independent way.  That itch isn't scratched unless the
builtin set type defines it.

I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built in sets aren't good), but combined with low cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good).

Organising a work queue that way seems common enough to me to be worthy of a stdlib solution that doesn't force you to either separate a "job id" from the "job" object in an ordered dict, or else use an ordered dict with "None" values.

So while it means answering "No" to the "Should builtin sets preserve order?" question (at least for now), adding collections.OrderedSet *would* address that "duplicate free pending task queue" use case.

Cheers,
Nick.

David Mertz

Dec 19, 2019, 6:23:13 PM
to Nick Coghlan, Tim Peters, Python Dev
On Thu, Dec 19, 2019 at 5:39 PM Nick Coghlan <ncog...@gmail.com> wrote:
I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built in sets aren't good), but combined with low cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough to me to be worthy of a stdlib solution that doesn't force you to either separate a "job id" from the "job" object in an ordered dict, or else use an ordered dict with "None" values.

It's not obvious to me that insertion order is even the most obvious or most commonly relevant sort order.  I'm sure it is for Larry's program, but often a work queue might want some other order.  Very often queues might instead, for example, have a priority number assigned to them.

The widely used third-party `sortedcollections` module maintains either "natural" order or some key order.  That could be engineered to be insertion order fairly easily.  Admittedly, 'OrderedSet' sounds a bit different from 'SortedSet'.

In [8]: sortedcontainers.SortedSet(['Wake Up', 'Drink Coffee', 'Make Breakfast'])
Out[8]: SortedSet(['Drink Coffee', 'Make Breakfast', 'Wake Up'])

In [9]: sortedcontainers.SortedSet(['Wake Up', 'Drink Coffee', 'Make Breakfast'], key=lambda s: s[-1])
Out[9]: SortedSet(['Drink Coffee', 'Wake Up', 'Make Breakfast'], key=<function <lambda> at 0x7f68f5985400>)

In [10]: 'Make Breakfast' in tasks
Out[10]: True

Tim Peters

Dec 19, 2019, 6:27:00 PM
to Nick Coghlan, Python Dev
[Nick]
> I took Larry's request a slightly different way:

Sorry, I was unclear: by "use case" I had in mind what appeared to me
to be the overwhelming thrust of the _entirety_ of this thread so far,
not Larry's original request.

> he has a use case where he wants order preservation (so built in sets aren't
> good), but combined with low cost duplicate identification and elimination and
> removal of arbitrary elements (so lists and collections.deque aren't good).

Sure.

> Organising a work queue that way seems common enough to me to be
> worthy of a stdlib solution that doesn't force you to either separate a
> "job id" from the "job" object in an ordered dict, or else use an ordered
> dict with "None" values.
>
> So while it means answering "No" to the "Should builtin sets preserve
> order?" question (at least for now), adding collections.OrderedSet *would*
> address that "duplicate free pending task queue" use case.

Only Larry can answer whether that would meet his perceived need. My
_guess_ is that he wouldn't know OrderedSet existed, and, even if he
did, he'd use a dict with None values anyway (because it's less hassle
and does everything he wanted).

But I have to add that I don't know enough about his original use case
to be sure it wasn't "an XY problem":

https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
"""
The XY problem is asking about your attempted solution rather than
your actual problem.

That is, you are trying to solve problem X, and you think solution Y
would work, but instead of asking about X when you run into trouble,
you ask about Y.
"""

Because I've never had a "job queue" problem where an OrderedSet would
have helped. Can't guess what's different about Larry's problem.

Tim Peters

Dec 19, 2019, 11:04:40 PM
to David Mertz, Nick Coghlan, Python Dev
[David Mertz <me...@gnosis.cx>]
> It's not obvious to me that insertion order is even the most obvious or
> most commonly relevant sort order. I'm sure it is for Larry's program, but
> often a work queue might want some other order. Very often queues
> might instead, for example, have a priority number assigned to them.

Indeed, and it makes me more cautious that claims for the _usefulness_
(rather than just consistency) of an ordered set are missing an XY
problem. The only "natural" value of insertion-order ordering for a
dynamic ordered set is that it supports FIFO queues. Well, that's one
thing that a deque already excels at, but much more obviously,
flexibly, and space- and time- efficiently.

For dicts the motivating use cases were much more "static", like
preserving textual order of keyword argument specifications, and
avoiding gratuitous changing of order when round-tripping key-value
encodings like much of JSON.

So what problem(s) would a dynamic ordered set really be aiming at? I
don't know. Consistency & predictability I understand and appreciate,
but little beyond that. Even FIFO ordering is somewhat a PITA, since
`next(iter(set))` is an overly elaborate way to spell "get the
first" if a FIFO queue _is_ a prime dynamic use case. And there's no
efficient way at all to access the other end (although I suppose
set.pop() would - like dict.popitem() - change to give a _destructive_
way to get at "the other end").

Nick Coghlan

Dec 19, 2019, 11:16:56 PM
to Tim Peters, Python Dev


On Fri., 20 Dec. 2019, 1:56 pm Tim Peters, <tim.p...@gmail.com> wrote:
So what problem(s) would a dynamic ordered set really be aiming at?  I
don't know.  Consistency & predictability I understand and appreciate,
but little beyond that.  Even FIFO ordering is somewhat a PITA, since
`next(iter(set))` is an overly elaborate way to spell "get the
first" if a FIFO queue _is_ a prime dynamic use case.  And there's no
efficient way at all to access the other end (although I suppose
set.pop() would - like dict.popitem() - change to give a _destructive_
way to get at "the other end").

I must admit that I was assuming without stating that a full OrderedSet implementation would support the MutableSequence interface.

Dictionaries can't do that because they already support the MutableMapping interface.

Cheers,
Nick.

Tim Peters

Dec 19, 2019, 11:42:55 PM
to Nick Coghlan, Python Dev
[Nick]
> I must admit that I was assuming without stating that a full OrderedSet
> implementation would support the MutableSequence interface.

Efficient access via index position too would be an enormous new
requirement. My bet: basic operations would need to change from O(1)
to O(log(N)).

BTW, in previous msgs there are links to various implementations
calling themselves "ordered sets". One of them supplies O(1)
indexing, but at the expense of making deletion O(N) (!):

https://pypi.org/project/ordered-set/

If efficient indexing is really wanted, then the original "use case"
Larry gave was definitely obscuring an XY problem ;-)

Wes Turner

Dec 20, 2019, 2:16:01 AM
to Tim Peters, Nick Coghlan, Python Dev
How slow and space-inefficient would it be to just implement the set methods on top of dict?

Do dicts lose insertion order when a key is deleted? AFAIU, OrderedDict does not lose insertion order on delete. Would this limit the utility of an ordered set as a queue? What set methods does a queue need to have?
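
For concreteness, a minimal sketch of "set methods on top of dict" (an
illustrative class, not an existing stdlib API; only a few methods shown):

class DictBackedOrderedSet:
    def __init__(self, iterable=()):
        self._d = dict.fromkeys(iterable)  # values are ignored

    def add(self, item):
        self._d[item] = None

    def discard(self, item):
        self._d.pop(item, None)

    def __contains__(self, item):
        return item in self._d

    def __iter__(self):
        return iter(self._d)  # dict iteration order == insertion order

    def __len__(self):
        return len(self._d)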

Inada Naoki

Dec 20, 2019, 2:42:32 AM
to Wes Turner, Tim Peters, Nick Coghlan, Python Dev
On Fri, Dec 20, 2019 at 4:15 PM Wes Turner <wes.t...@gmail.com> wrote:
>
> How slow and space-inefficient would it be to just implement the set methods on top of dict?

Speed: Dict doesn't cache the position of the first item. Calling
next(iter(D)) repeatedly is O(N) in the worst case.
Space: It wastes 8 bytes per member.

>
> Do dicts lose insertion order when a key is deleted? AFAIU, OrderedDict does not lose insertion order on delete.

Dict keeps insertion order after deletion too.

>Would this limit the utility of an ordered set as a queue? What set methods does a queue need to have?

I want O(1) D.popleft(). (The name is borrowed from deque; popfirst()
might be a better name.)

--
Inada Naoki <songof...@gmail.com>

Peter Wang

Dec 20, 2019, 8:18:49 AM
to Tim Peters, Nick Coghlan, Python Dev
On Thu, Dec 19, 2019 at 9:57 PM Tim Peters <tim.p...@gmail.com> wrote:
> It's not obvious to me that insertion order is even the most obvious or
> most commonly relevant sort order.  I'm sure it is for Larry's program, but
> often a work queue might want some other order.  Very often queues
> might instead, for example, have a priority number assigned to them.

Indeed, and it makes me more cautious that claims for the _usefulness_
(rather than just consistency) of an ordered set are missing an XY
problem.  The only "natural" value of insertion-order ordering for a
dynamic ordered set is that it supports FIFO queues.  Well, that's one
thing that a deque already excels at, but much more obviously,
flexibly, and space- and time- efficiently.

I agree with Tim and David.

Furthermore, I'd like to point out that the core built-in types are frequently archetypes or foundational paradigms for many derived concepts in 3rd-party libraries. Despite the fact that they have implementations, they also form a basis set of interfaces which are part of the lexicon of "Pythonic-ness".  Something like a "set" is a very very common and useful base interface for programming.  If we were to cloud or confound the (relatively clean) semantics around sets & set algebra by introducing ordinality, that same cloudiness will spread throughout the ecosystem.

Unlike dict()s, which are a more computer science-y data structure with lots of details, sets are simple things with mathematical algebraic properties. These allow the user to compactly specify *what* they want done, without delving into the details of *how*. In my experience, this is always preferable, for ease of use, ease of learning, and correctness.  Adding orderedness as a concern now imposes an additional cognitive burden onto the coder because they must be concerned with *how*.

Ultimately, even if we establish that there is major utility in insertion-ordered sets, unordered sets also clearly have utility as well (as a more relaxed interface, with fewer promises).  So in that case, which do we want as the default?  I think about it like this: in a future time when sets maintain insertion order by default, every time I see library docs that require casting sets via collections.unordered_set(), it will feel like sand in the semantic vaseline.  I would much rather have things the other way around.

I don't really have a horse in this race, and I'm sure the dev community will arrive at a good resolution to this issue.  However, I would just encourage folks to think less about "utility of the implementation", and more about "economy of semantics".  As Larry pointed out earlier, ordered dicts posed a problem for MicroPython. This actually isn't a minor issue -- in the long run, forked semantics are a Bad Thing for the language (even for a language that is incorrectly eponymized with snakes!).


-Peter

Wes Turner

Dec 20, 2019, 9:18:13 AM
to Inada Naoki, Tim Peters, Nick Coghlan, Python Dev
Got it. Thanks

Tim Peters

Dec 20, 2019, 1:25:05 PM
to Inada Naoki, Nick Coghlan, Python Dev
[Wes Turner <wes.t...@gmail.com>]
>> How slow and space-inefficient would it be to just implement the set methods
>> on top of dict?

[Inada Naoki <songof...@gmail.com>]
> Speed: Dict doesn't cache the position of the first item. Calling
> next(iter(D)) repeatedly is O(N) in worst case.
> ...

See also Raymond's (only) message in this thread. We would also lose
low-level speed optimizations specific to the current set
implementation. And we would need to define what "insertion order"
_means_ for operators (union, intersection, symmetric difference) that
build a new set out of other sets without a user-level "insert" in
sight. However that's defined, it may constrain possible
implementations in ways that are inherently slower (for example, may
require the implementation to iterate over the larger operand where
they currently iterate over the smaller).

And the current set implementation (like the older dict
implementation) never needs to "rebuild the table from scratch" unless
the cardinality of the set keeps growing. As Raymond telegraphically
hinted, the current dict implementation can, at times, in the presence
of deletions, require rebuilding the table from scratch even if the
dict's maximum size remains bounded.

That last can't be seen directly from Python code (rebuilding the
table is invisible at the Python level). Here's a short example:

d = dict.fromkeys(range(3))
while True:
    del d[0]
    d[0] = None

Run that with a breakpoint set on dictobject.c's `dictresize()`
function. You'll see that it rebuilds the table from scratch every
3rd time through the loop. In effect, for the purpose of deciding
whether it needs to rebuild, the current dict implementation sees no
difference between adding a new element and deleting an existing
element. Deleting leaves behind "holes" that periodically have to be
squashed out of existence so that insertion order can be maintained in
a dead simple way upon future additions.

Inada Naoki

Dec 20, 2019, 3:32:50 PM
to Tim Peters, Nick Coghlan, Python Dev
On Sat, Dec 21, 2019 at 3:17 AM Tim Peters <tim.p...@gmail.com> wrote:
>
> [Wes Turner <wes.t...@gmail.com>]
> >> How slow and space-inefficient would it be to just implement the set methods
> >> on top of dict?
>
> [Inada Naoki <songof...@gmail.com>]
> > Speed: Dict doesn't cache the position of the first item. Calling
> > next(iter(D)) repeatedly is O(N) in worst case.
> > ...
>
> See also Raymond's (only) message in this thread. We would also lose
> low-level speed optimizations specific to the current set
> implementation.

I read this thread, and I understand all of them.

I just meant that the performance of next(iter(D)) is the most critical
part when you implement an ordered set on top of the current dict and use it as a queue.

This code should be O(N), but it's O(N^2) if q is implemented on top
of the dict.

while q:
    item = q.popleft()

Sorry for the confusion.


>
> And the current set implementation (like the older dict
> implementation) never needs to "rebuild the table from scratch" unless
> the cardinality of the set keeps growing.

It is a bit misleading. If "the cardinality of the set" means len(S),
a set requires rebuilds only infrequently if its items are random.

Anyway, it is a smaller problem than next(iter(D)) because the
amortized cost is O(1).
The current dict is not optimized for queue usage.

Regards,
--
Inada Naoki <songof...@gmail.com>

Tim Peters

Dec 20, 2019, 5:44:42 PM
to Inada Naoki, Nick Coghlan, Python Dev
[Inada Naoki <songof...@gmail.com>]
> I just meant the performance of the next(iter(D)) is the most critical part
> when you implement orderdset on top of the current dict and use it as a queue.

Which is a good point. I added a lot more, though, because Wes didn't
even mention queues in his question:

[Wes Turner <wes.t...@gmail.com>]
>>>> How slow and space-inefficient would it be to just implement the set methods
>>>> on top of dict?

I can't guess what he was _really_ asking about, so now he's got lots
of answers ;-)

> This code should be O(N), but it's O(N^2) if q is implemented on top
> of the dict.
>
> while q:
>     item = q.popleft()
>
> Sorry for the confusion.

To flesh this out, the FIFO queue entries are all together in a
contiguous vector, with no "holes" in the middle. Pushing a new item
appends "to the right" (higher index in the vector), while popping an
item leaves "a hole" at the left. But there are no holes "in the
middle" in this case.

So the first real entry is at a higher index with every pop, but
iteration starts at index 0 every time. The more pops there are, the
more holes iteration has to skip over, one at a time, to find the
first real entry remaining.

Until pushing a new item would fall off the right end of the vector.
Then the table is rebuilt from scratch, moving all the real entries to
form a contiguous block starting at index 0 again.
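
Here is a small script (my own sketch, not from the thread) that makes
the resulting quadratic behavior visible; draining n keys this way
takes time that grows much faster than n:

import timeit

def drain(n):
    d = dict.fromkeys(range(n))
    while d:
        # next(iter(d)) rescans the holes from index 0 every time
        del d[next(iter(d))]

for n in (1_000, 10_000, 100_000):
    print(n, timeit.timeit(lambda: drain(n), number=1))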


>> And the current set implementation (like the older dict
>> implementation) never needs to "rebuild the table from scratch" unless
>> the cardinality of the set keeps growing.

> It is a bit misleading. If "the cardinality of the set" means len(S),

Yes.

> set requires the rebuild in low frequency if the its items are random.
>
> Anyway, it is a smaller problem than next(iter(D)) because of the
> amortized cost is O(1).

For the specific case of FIFO queues, sure. In general, though, "O(1),
period" is more desirable than "amortized O(1)". This is especially
true when sets get very large.


> Current dict is not optimized for queue usage.

Nor should it be :-) That's what collections.deque is for: pushes and
pops, at both ends, always O(1) time, period. No holes, no internal
rearrangements, and O(1) "wasted" space.
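
For reference, the deque operations in question (these are real
collections.deque APIs):

from collections import deque

q = deque()
q.append("right")     # push on the right end, O(1)
q.appendleft("left")  # push on the left end, O(1)
q.pop()               # pop from the right end, O(1)
q.popleft()           # pop from the left end, O(1)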

Larry Hastings

Dec 21, 2019, 7:57:57 AM
to pytho...@python.org


(mixing together messages in the thread, sorry threaded-view readers)

On 12/19/19 3:15 PM, Tim Peters wrote:
[Nick]
I took Larry's request a slightly different way:
[...]
Only Larry can answer whether that would meet his perceived need.  My
_guess_ is that he wouldn't know OrderedSet existed, and, even if he
did, he'd use a dict with None values anyway (because it's less hassle
and does everything he wanted).


At last, something I can speak knowledgeably about: Larry's use case!  Regarding Larry, I'd say

  1. his use case was small enough that almost anything maintaining order would have worked, even a list,
  2. he often does have a pretty good idea what goodies are salted away in the Python standard library, and
  3. a hypothetical collections.OrderedSet would probably work just fine.  And he'd probably use it too, as that makes the code clearer / easier to read.  If you read code using an OrderedSet, you know what the intent was and what the semantics are.  If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve.  Let's hope those comments are accurate!


Also, speaking personally, at least once (maybe twice?) in this thread folks have asked "would YOU, Mr Larry, really want ordered sets if it meant sets were slightly slower?"

The first thing I'd say is, I'm not sure why people should care about what's best for me.  That's sweet of you!  But you really shouldn't.

The second thing is, my needs are modest, so the speed hit we're talking about would likely be statistically insignificant, for me.

And the third thing is, I don't really put the set() API through much of a workout anyway.  I build sets, I add and remove items, I iterate over them, I do the occasional union, and only very rarely do I use anything else.  Even then, when I write code where I reach for a fancy operation like intersection or symmetric_difference--what tends to happen is, I have a rethink and a refactor, and after I simplify my code I don't need the fancy operations anymore.  I can't remember the last time I used one of these fancy operations where the code really stuck around for a long time.

So, personally?  Sure, I'd take that tradeoff.  As already established, I like that dict objects maintain their order, and I think it'd be swell if sets maintained their order too.  I suspect the performance hit wouldn't affect me in any meaningful way, and I could benefit from the order-maintaining semantics.  I bet it'd have other benefits too, like making regression tests more stable.  And my (admittedly uninformed!) assumption is, the loss of performance would mostly be in these sophisticated operations that I never seem to use anyway.

But I don't mistake my needs for the needs of the Python community at large.  I'd be mighty interested to read other folks coming forward and saying, for example, "please don't make set objects any slower!" and talking us through neat real-world use cases.  Bonus points if you include mind-boggling numbers and benchmarks!


On 12/20/19 5:09 AM, Peter Wang wrote:
As Larry pointed out earlier, ordered dicts posed a problem for MicroPython.

Just a minor correction: that wasn't me, that was Petr Viktorin.


Ten days left until we retire 2.7,


/arry

Kyle Stanley

Dec 23, 2019, 8:07:39 PM
to Larry Hastings, Python Dev
Larry Hastings wrote:
> a hypothetical collections.OrderedSet would probably work just fine.

I'm also in agreement with adding a collections.OrderedSet. There certainly seems to be a practical use case for a Set that preserves insertion order, but with significant push back to modifying the existing Set implementation (primarily due to performance concerns).

If later down the road an improvement to OrderedSet is made that gives it equal or better average performance across the board compared to Set, we can consider changing the default implementation if there's significant demand for it.

Larry Hastings wrote:
> And he'd probably use it too, as that makes the code clearer / easier to read.  If you read code using an OrderedSet, you know what the intent was and what the semantics are.  If you see code using a dict but setting the values to None, you think "that's crazy" and now you have a puzzle to solve.  Let's hope those comments are accurate!

This is an excellent point. My first thought if someone were to be using a Dict with all values set to None would be: why not use a Set here? As Larry said, the need for preservation of insertion order would have to be explicitly described in the code comments. Realistically speaking, code comments are not something that can be consistently relied upon, especially if it's part of some boilerplate format that gets replicated or if the container is used in multiple locations. I'd much rather have a dedicated data structure that clearly describes what it does based on the name alone, IMO that's a million times better for readability purposes.

Also, this is mostly speculation since I haven't ran any benchmarks for an OrderedSet implementation, but I strongly suspect that OrderedSet will end up having better average performance for add, pop, and update than trying to use a dictionary with None values (assuming it's implemented in C or at least with a C accelerator).

Not to mention allowing updates (for both addition and removal) based on intersections and unions across ordered sets, which currently isn't possible to do with dictionaries (at least not directly with |=, &=, -=, and ^=). I'm not sure how much of a practical use case this would have for ordered sets specifically, but it's a nice bonus.


Guido van Rossum

Dec 23, 2019, 8:19:05 PM
to Kyle Stanley, Python Dev
To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()).

--
--Guido (mobile)

David Mertz

Dec 23, 2019, 8:27:28 PM
to Kyle Stanley, Python Dev
Even though I was the first person in this thread to suggest collections.OrderedSet, I'm "meh" about it now. As I read more and played with the sortedcollections package, it seemed to me that while I might want a set that iterated in a determinate and meaningful order moderately often, insertion order would make up a small share of those use cases.

Kyle Stanley

Dec 23, 2019, 10:02:33 PM
to Guido van Rossum, Python Dev
> To begin with, please compare performance of dict-with-None-values to that of a set, for operations that can be expressed by both (e.g. both have update()).

Good idea. It would be useful to have a baseline comparison. The comparisons with a notable difference are flagged in the headings below.

Tested on:
Version: Python 3.8.0
OS: Linux kernel 5.4.5
CPU: Intel i5-4460

Initialization (about the same):
>>> timeit.timeit("set(i for i in range(1000))", number=100_000)
4.878508095000143
>>> timeit.timeit("{i: None for i in range(1000)}", number=100_000)
4.9192055170024105

Membership (about the same):
>>> timeit.timeit("random.randint(0, 999) in s", setup="import random; s = set(i for i in range(1000))", number=10_000_000)
7.992674231001729
>>> timeit.timeit("random.randint(0, 999) in d", setup="import random; d = {i: None for i in range(1000)}", number=10_000_000)
8.032404395999038

Add (much faster for dicts):
>>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
13.330938750001224
>>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
5.788865385999088

Update (much faster for updating sets):
>>> timeit.timeit("s.update(other_s)", setup="s = set(); other_s = set(i for i in range(1000))", number=1_000_000)
6.2428832090008655
>>> timeit.timeit("d.update(other_d)", setup="d = {}; other_d = {i: None for i in range(1000)}", number=1_000_000)
13.031371458000649

Create list from keys (faster for sets):
>>> timeit.timeit("list(s)", setup="s = set(i for i in range(1000))", number=10_00_000)
5.24364303600305
>>> timeit.timeit("list(d)", setup="d = {i: None for i in range(1000)}", number=10_00_000)
6.303017043999716

Removal isn't easily comparable, since s.pop(), s.discard(), d.pop(), and d.popitem() all behave quite differently. Also, I'm sure these benchmarks could be improved significantly, particularly with the "Add" ones. This should provide a decent general idea though.

Chris Angelico

Dec 23, 2019, 10:08:17 PM
to python-dev
On Tue, Dec 24, 2019 at 1:57 PM Kyle Stanley <aero...@gmail.com> wrote:
> Add (much faster for dicts):
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.330938750001224
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 5.788865385999088

I think this is an artifact of Python not having an empty set literal.

>>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
13.275540543720126
>>> timeit.timeit("d = dict(); d[0] = None", number=100_000_000)
13.044076398015022
>>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
6.088695731014013
>>> timeit.timeit("s = {1}; s.add(0)", number=100_000_000)
9.260965215042233
>>> timeit.timeit("d = {1:2}; d[0] = None", number=100_000_000)
8.75433829985559

When both use the constructor call or both use a literal, the
difference is far smaller. I'd call this one a wash.

ChrisA

Stephen J. Turnbull

Dec 23, 2019, 10:15:20 PM
to David Mertz, Kyle Stanley, Python Dev
David Mertz writes:

> Even though I was the first person in this thread to suggest
> collections.OrderedSet, I'm "meh" about it now. As I read more and played
> with the sortedcollections package, it seemed to me that while I might want
> a set that iterated in a determinate and meaningful order moderately often,
> insertion order would make up a small share of those use cases.

On the other hand, insertion order is one of the most prominent of the
determinate meaningful orders where you would have to do ugly things
to use "sorted" to get that order. Any application where you have an
unreliable message bus feeding a queue (so that you might get
duplicate objects but it's bad to process the same object twice) would
be a potential application of insertion-ordered sets.

Steve


Tim Peters

Dec 23, 2019, 10:59:16 PM
to Kyle Stanley, Python Dev

Tim Peters

Dec 23, 2019, 11:09:46 PM
to Kyle Stanley, Python Dev
Sorry! A previous attempt to reply got sent before I typed anything :-(

Very briefly:

> >>> timeit.timeit("set(i for i in range(1000))", number=100_000)

[and other examples using a range of integers]

The collision resolution strategy for sets evolved to be fancier than
for dicts, to reduce cache misses. This is important for sets because
the _only_ interesting thing about an element wrt a set is whether or
not the set contains it. Lookup speed is everything for sets.

If you use a contiguous range of "reasonable" integers for keys, the
integer hash function is perfect: there's never a collision. So any
such test misses all the work Raymond did to speed set lookups.
String keys have sufficiently "random" hashes to reliably create
collisions, though. Cache misses can, of course, have massive effects
on timing.

> Add (much faster for dicts):
> >>> timeit.timeit("s = set(); s.add(0)", number=100_000_000)
> 13.330938750001224
> >>> timeit.timeit("d = {}; d[0] = None", number=100_000_000)
> 5.788865385999088

In the former case you're primarily measuring the time to look up the
"add" method of sets (that's more expensive than adding 0 to the set).
A better comparison would, e.g., move `add = s.add` to a setup line,
and use plain "add(0)" in the loop.

That's it!

Kyle Stanley

Dec 23, 2019, 11:16:17 PM
to Chris Angelico, python-dev

Chris Angelico wrote:
> I think this is an artifact of Python not having an empty set literal.
> [snip]
> When both use the constructor call or both use a literal, the
> difference is far smaller. I'd call this one a wash.

Ah, good catch. I hadn't considered that it would make a substantial difference, but that makes sense. Here's an additional comparison between "{}" and "dict()" to confirm it:

>>> timeit.timeit("{}", number=100_000_000)
2.1038335599987477
>>> timeit.timeit("dict()", number=100_000_000)
10.225583500003268

Some more in-depth comparisons might be required for addition of single items to the containers. I might experiment further with this at some point in the next week or so, likely with implementing proper tests that omit the time to initialize the container.

Kyle Stanley

Dec 24, 2019, 12:32:33 AM
to Tim Peters, Python Dev
Tim Peters wrote:
> Sorry!  A previous attempt to reply got sent before I typed anything :-(

No worries, I only saw the link in the footer to the PSF CoC, and I was mildly concerned for a moment. My first thought was "Oh no, what did I do?". Thanks for clearing that up (:

> The collision resolution strategy for sets evolved to be fancier than
> for dicts, to reduce cache misses.  This is important for sets because
> the _only_ interesting thing about an element wrt a set is whether or
> not the set contains it.   Lookup speed is everything for sets.

Huh, that's quite interesting. For some reason, I had assumed in the back of my head (without giving it much thought) that the average collision rate would be the same for set items and dict keys. Thanks for the useful information.

> If you use a contiguous range of "reasonable" integers for keys, the
> integer hash function is perfect:  there's never a collision.  So any
> such test misses all the work Raymond did to speed set lookups.
> String keys have sufficiently "random" hashes to reliably create
> collisions, though.  Cache misses can, of course, have massive effects
> on timing.

Ah, I forgot to consider how the hash function actually works for continuous integers. A better comparison to demonstrate the collision differences would likely use random strings.

Also, I believe that max "reasonable" integer range of no collision is (-2305843009213693951, 2305843009213693951), as defined in Include/pyhash.h (https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pyhash.h#L28).


> In the former case you're primarily measuring the time to look up the
> "add" method of sets (that's more expensive than adding 0 to the set).
> A better comparison would, e.g., move `add = s.add` to a setup line,
> and use plain "add(0)" in the loop.

Good point; there's also what Chris Angelico pointed out: dicts have a significant advantage in this case from having a literal for initialization. From my understanding, this ends up having a minimal performance impact in most realistic code (similar to method lookup time), but it makes a very substantial difference here, since initialization of the containers takes up a significant portion of each iteration.

I suspect my initialization comparison might also be flawed for similar reasons, but I'm glad that at least 3/5 of my comparisons were mostly reasonable. So far, the performance difference between set.update() and dict.update() were the most notable.

I'll likely spend some more time experimenting in the next couple of weeks or so, and report back if I find any interesting results. In the meantime, anyone can certainly feel free to improve upon my existing comparisons.

Tim Peters

Dec 24, 2019, 1:00:41 AM
to Kyle Stanley, Python Dev
[Kyle]
> ...
> For some reason, I had assumed in the back of my head (without
> giving it much thought) that the average collision rate would be the
> same for set items and dict keys. Thanks for the useful information.

I know the theoretical number of probes for dicts, but not for sets
anymore. The latter use a mix of probe strategies now, "randomish"
jumps (same as for dicts) but also purely linear ("up by 1") probing
to try to exploit L1 cache.

It's not _apparent_ to me that the mix actually helps ;-) I just
trust that Raymond timed stuff until he was sure it does.

> ...
> Ah, I forgot to consider how the hash function actually works for continuous
> integers. A better comparison to demonstrate the collision differences would
> likely use random strings.

And, at an extreme, a class with a __hash__ that always returns the
same integer.
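
For instance, an illustrative worst-case key type:

class AlwaysCollides:
    def __hash__(self):
        return 42  # every instance lands in the same probe sequence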

> Also, I believe that max "reasonable" integer range of no collision
> is (-2305843009213693951, 2305843009213693951), ...

Any range that does _not_ contain both -2 and -1 (-1 is an annoying
special case, with hash(-1) == hash(-2) == -2), and spans no more than
sys.hash_info.modulus integers. Apart from that, the sign and
magnitude of the start of the range don't matter; e.g.,

>>> len(set(hash(i) for i in range(10**5000, 10**5000 + 1000000)))
1000000
>>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 1000000)))
1000000

Tim Peters

Dec 24, 2019, 1:07:05 AM
to Kyle Stanley, Python Dev
>> Also, I believe that max "reasonable" integer range of no collision
>> is (-2305843009213693951, 2305843009213693951), ...

> Any range that does _not_ contain both -2 and -1 (-1 is an annoying
> special case, with hash(-1) == hash(-2) == -2), and spans no more than
> sys.hash_info.modulus integers. Apart from that, the sign and
> magnitude of the start of the range don't matter; e.g.,
>
> >>> len(set(hash(i) for i in range(10**5000, 10**5000 + 1000000)))
> 1000000
> >>> len(set(hash(i) for i in range(-10**5000, -10**5000 + 1000000)))
> 1000000

Sorry again! That only shows that the hash codes are unique. Dicts
and sets use only the last N bits to determine the start of the probe
sequence, for some value of N >= 3 (depending on the table size). So
for a table of size a million, the last 20 bits (1000000 ~= 2**20) are
interesting. But same thing:

>>> MASK = (1 << 20) - 1
>>> len(set(hash(i) & MASK for i in range(10**5000, 10**5000 + 1000000)))
1000000
>>> len(set(hash(i) & MASK for i in range(-10**5000, -10**5000 + 1000000)))
1000000

Larry Hastings

Dec 24, 2019, 3:50:26 AM
to pytho...@python.org


On 12/23/19 8:09 PM, Kyle Stanley wrote:

Chris Angelico wrote:
> I think this is an artifact of Python not having an empty set literal.
> [snip]
> When both use the constructor call or both use a literal, the
> difference is far smaller. I'd call this one a wash.

Ah, good catch. I hadn't considered that it would make a substantial difference, but that makes sense. Here's an additional comparison between "{}"  and "dict()" to confirm it:

>>> timeit.timeit("{}", number=100_000_000)
2.1038335599987477
>>> timeit.timeit("dict()", number=100_000_000)
10.225583500003268


We could also go the other way with it: set vs. dict-mapped-to-don't-care, each with one element, because that way we can have a set literal:

>>> timeit.timeit("set()", number=100_000_000)
8.579900023061782
>>> timeit.timeit("dict()", number=100_000_000)
10.473437276901677
>>> timeit.timeit("{0}", number=100_000_000)
5.969995185965672
>>> timeit.timeit("{0:0}", number=100_000_000)
6.24465325800702

(ran all four of those just so you see a relative sense on my laptop, which I guess is only sliiiightly slower than Kyle's)


Happy holidays,


/arry

Wes Turner

Dec 24, 2019, 6:10:30 AM
to Larry Hastings, Python-Dev
A table with these columns might be helpful for performance comparisons: [
    structure,
    operation,
    big_o,
    keyset,  # signed ints, randstrs
    timeit_output,
    system_load,
    architecture,  # x86_64, aarch64
    os,
]

Does anyone have a recommendation for a tool that stores such benchmark data in JSON and merges the e.g. per-arch output?

Are there URIs for Big-O notation that could be added as e.g. structured attributes in docstrings or annotations?

I just added a ._repr_html_() method to the tabulate package so that, in addition to the many excellent ASCII formats tabulate prints, tabulate.tabulate(data, tablefmt='html') displays an HTML-escaped table in Jupyter notebooks. It's not yet released, so you'd need to install it from source.


Tim Peters

Dec 25, 2019, 6:03:07 PM
to Kyle Stanley, Python Dev
[Tim]
> I know the theoretical number of probes for dicts, but not for sets
> anymore. The latter use a mix of probe strategies now, "randomish"
> jumps (same as for dicts) but also purely linear ("up by 1") probing
> to try to exploit L1 cache.
>
> It's not _apparent_ to me that the mix actually helps ;-) I just
> trust that Raymond timed stuff until he was sure it does.

To illustrate, I contrived a case where the set optimizations clearly
pay off. Output (3.8.1, Win 10 Pro, quad core box with plenty of
RAM):

11184810 nice keys
dict build 0.54
dict iter 0.07
dict lookup 0.81
set build 0.54
set iter 0.08
set lookup 0.82

11184810 nasty keys
dict build 23.32
dict iter 0.09
dict lookup 13.26
set build 9.79
set iter 0.28
set lookup 8.46

I didn't make heroic efforts to keep the machine quiet, and there's
substantial variation across runs. For example, there's no real
difference between "0.07" and "0.08".

Some things to note:

- The "nice" keys act much the same regardless of whether dict or set.
Dicts have an advantage for iteration "in theory" in the absence of
deletions because there are no "holes" in the area storing the
hash+key+value records. But for these specific keys, the same is true
of sets: the initial hash probes are at indices 0, 1, 2, ...,
len(the_set)-1, and there are no holes in its hash+key records either
(all the empty slots are together "at the right end").

- Sets get a lot of value out of their fancier probe sequence for the
nasty keys. There are lots of collisions, and sets can much more
often resolve them from L1 cache. Better than a factor of 2 for build
time is nothing to sneeze at.

- Iteration for sets is substantially slower in this case, for two
reasons: (1) sets _do_ have holes for this collection of keys; and,
(2) while I don't think it's documented, the maximum load factor for
sets is lower than for dicts, and 11184810 was picked to give the
maximum load factor for a dict with 2**24 slots. The set in this case
has twice as many slots as the dict, and all the extras are empty
slots that iteration has to skip over.

- I don't have a theory for why dict build time is _so_ much higher
than dict lookup time for the nasty keys.

- If you don't know a lot about how these things are implemented, just
about everything is baffling ;-) Integer keys are important in
practice, but things about how modern dicts are implemented were
_driven_ by opportunities to exploit properties of the mostly-trivial
int hash function. For "general" timing, it's better to use string
keys, whose hash codes are "pretty randomish" (but can vary across
runs). When using int keys, it's really easy to _end up_ measuring
happy behaviors _specific_ to int keys.

Code:

from time import perf_counter as now
import sys
from collections import deque

def disp(text, f):
    print(f"{text:20s} {f:5.2f}")

M = sys.hash_info.modulus
targetlen = 2**24 * 2 // 3 # highest load factor for dict
hc = 1
badkeys = []
while len(badkeys) < targetlen:
    # add no more than 100 ints with hash code `hc`
    toadd = min(100, targetlen - len(badkeys))
    badkeys.extend(hc + i * M for i in range(toadd))
    hc += 713 # more than less arbitrary
goodkeys = list(range(targetlen))

for tag, keys in ("nice", goodkeys), ("nasty", badkeys):
    print()
    print(len(keys), tag, "keys")

    for thetype, builder in ("dict", dict.fromkeys), ("set", set):
        s = now()
        d = builder(keys)
        f = now()
        disp(thetype + " build", f-s)

        s = now()
        deque(d, maxlen=0)
        f = now()
        disp(thetype + " iter", f-s)

        s = now()
        for k in keys:
            assert k in d
        f = now()
        disp(thetype + " lookup", f-s)

        del d

Serhiy Storchaka

Dec 26, 2019, 3:09:35 AM
to pytho...@python.org
26.12.19 00:56, Tim Peters wrote:
> - I don't have a theory for why dict build time is _so_ much higher
> than dict lookup time for the nasty keys.

1/2 of the items were copied to a new hashtable when the dict grew. 1/4
of the items were copied twice. 1/8 of the items were copied three times, etc.
On average every item is copied once, so the build time is about twice
the lookup time when the cost is dominated by lookups.
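
Spelling out the arithmetic (my addition): the expected number of
internal copies per item is

    \sum_{k \ge 1} k \cdot 2^{-(k+1)} = \frac{1}{2} \sum_{k \ge 1} \frac{k}{2^k} = \frac{1}{2} \cdot 2 = 1

so, counting the user-level insertion, each item is inserted about
twice in total.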

But the lookup benchmarking includes the overhead of the Python-level
loop, which is more significant for "good" keys, so the lookup time
comes out larger than the build time.

Tim Peters

Dec 26, 2019, 2:46:31 PM
to Serhiy Storchaka, Python Dev
[Tim]
>> - I don't have a theory for why dict build time is _so_ much higher
>> than dict lookup time for the nasty keys.

To be clearer, in context this was meant to be _compared to_ the
situation for sets. These were the numbers:

11184810 nasty keys
dict build 23.32
dict lookup 13.26
set build 9.79
set lookup 8.46

Build time is higher than lookup time for both, which isn't
surprising. But it's _much_ higher (whether in absolute or relative
terms) for dicts than for sets.

[Serhiy Storchaka <stor...@gmail.com>]
> 1/2 of the items were copied to a new hashtable when the dict grew. 1/4
> of the items were copied twice. 1/8 of the items were copied three times, etc.
> On average every item is copied once, so the build time is about twice
> the lookup time when the cost is dominated by lookups.

I agree the average total number of table insertions per element
approaches 2 either way (counting insertions due to internal table
growth as well as insertions visible from Python code). That's why
it's expected that build time exceeds lookup time (the latter does
exactly one lookup per element).


> But the lookup benchmarking includes the overhead of the Python-level
> loop, which is more significant for "good" keys, so the lookup time
> comes out larger than the build time.

Iteration time is too minor to care much about. Even in the worst
case (sets with the nasty keys), the Python:

for k in d:
    pass

takes under a second here. Take a full second off the "set lookup"
time, and dict build time still exceeds dict lookup time (whether in
absolute or relative terms) much more so than for sets.

But I'm not going to pursue it. The high-order bit was just to
demonstrate that the set object's more complicated (than dicts) probe
sequence does buy highly significant speedups - in at least one highly
contrived case. Since the probe sequence changes were aimed at
reducing cache misses, a full account requires comparing cache misses
between the dict and set implementations. That's a mess, and slogging
through it appears unlikely to result in insight.

Antoine Pitrou

Dec 27, 2019, 5:14:41 AM
to pytho...@python.org
On Tue, 24 Dec 2019 12:08:33 +0900
"Stephen J. Turnbull" <turnbull....@u.tsukuba.ac.jp> wrote:
> David Mertz writes:
>
> > Even though I was the first person in this thread to suggest
> > collections.OrderedSet, I'm "meh" about it now. As I read more and played
> > with the sortedcollections package, it seemed to me that while I might want
> > a set that iterated in a determinate and meaningful order moderately often,
> > insertion order would make up a small share of those use cases.
>
> On the other hand, insertion order is one of the most prominent of the
> determinate meaningful orders where you would have to do ugly things
> to use "sorted" to get that order. Any application where you have an
> unreliable message bus feeding a queue (so that you might get
> duplicate objects but it's bad to process the same object twice) would
> be a potential application of insertion-ordered sets.

In that case you probably want a separate persistent "seen" set,
because your queue may have been drained by the time a duplicate object
arrives.

(which means you probably want something more efficient, such as a
sequence number)

Regards

Antoine.


Tim Peters

Dec 27, 2019, 10:51:14 PM
to Nick Coghlan, Python Dev
[Nick Coghlan <ncog...@gmail.com>]
> I took Larry's request a slightly different way: he has a use case where
> he wants order preservation (so built in sets aren't good), but combined
> with low cost duplicate identification and elimination and removal of
> arbitrary elements (so lists and collections.deque aren't good). Organising
> a work queue that way seems common enough ...

Is it? I confess I haven't thought of a plausible use case. Larry
didn't really explain his problem, just suggested that an ordered set
would be "a solution" to it.

The problem: whether there are duplicates "in the queue" is a
question about an implementation detail, hard for me to translate to a
question about the _task_ to be solved.

For example, suppose "the task" is to make a deep copy of a web site.
A "job" consists of following a URL, sucking down the page text, and
adding new jobs for contained URLs on the page.

We probably don't want to suck down the page text multiple times for a
given URL, but checking whether a URL is currently already in the job
queue is asking a question about an implementation detail that misses
the point: we want to know whether that URL has already been chased,
period. Whether the URL is currently in the queue is irrelevant to
that. The only logical relationship is that a job that has finished
_was_ in the queue at some time before the job finished.

So, ya, I've seen and implemented lots of work queues along these
lines - but an OrderedSet would be an "attractive nuisance"
(offhandedly appearing to solve a problem it doesn't actually
address):

jobs = some_kind_of_queue()
finished_jobs = set()
...
while jobs:
    job = jobs.get()
    if job in finished_jobs:
        continue
    try:
        work on the job
        possibly adding (sub)jobs to the `jobs` queue
    except TransientExceptions:
        jobs.put(job) # try again later
    else:
        finished_jobs.add(job)

Clear? There is a queue here, and a set, but they can't be combined
in a useful way: the set contents have nothing to do with what's
currently in the queue - the set consists of jobs that have been
successfully completed; the queue doesn't care whether it contains
duplicates, and merely skips over jobs that have been completed by the
time the queue spits them out (which strikes me as more robust design
than duplicating logic everywhere a job is added to a queue to try to
prevent adding already-done jobs to begin with).

Similarly, in other places I've used a "finished" flag on the object
instead of a set, or a dict mapping jobs to info about job status and
results. But in all these cases, the vital info associated with a job
really has little to do with whether the job is currently sitting on a
queue.

If "is it currently on the queue?" _is_ a common question to ask,
perhaps someone could give a semi-realistic example?
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/2XDEVSO5S5ZLVZTLX43UHBOJWMXYJIIB/

Brandt Bucher

unread,
Dec 28, 2019, 12:06:22 AM12/28/19
to Python Dev, Tim Peters
> On Dec 27, 2019, at 19:48, Tim Peters <tim.p...@gmail.com> wrote:
>
> So, ya, I've seen and implemented lots of work queues along these
> lines - but an OrderedSet would be an "attractive nuisance"
> (offhandedly appearing to solve a problem it doesn't actually
> address):
>
> jobs = some_kind_of_queue()
> finished_jobs = set()
> ...
> while jobs:
>     job = jobs.get()
>     if job in finished_jobs:
>         continue
>     try:
>         work on the job
>         possibly adding (sub)jobs to the `jobs` queue
>     except TransientExceptions:
>         jobs.put(job)  # try again later
>     else:
>         finished_jobs.add(job)

Well, if an OrderedSet were designed to gracefully handle resizes during iteration, something like this may make sense:

jobs = OrderedSet(initial_jobs)
for job in jobs:
    new_jobs = process(job)
    jobs |= new_jobs
...  # jobs is now a set of every job processed

A dictionary with None values comes close if you replace the union line with a jobs.update(new_jobs) call (and ignore resizing issues), but it breaks because repeated jobs are shuffled to the end of the sequence and would be processed again.
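
To make the "resizing issues" concrete: in CPython, growing a dict while
iterating over it raises RuntimeError, so the dict-with-None version
can't use the simple loop above. A sketch (illustration only;
`initial_jobs` and `process` are hypothetical):

jobs = dict.fromkeys(initial_jobs)    # dict used as an ordered set
# for job in jobs:                    # RuntimeError as soon as process()
#     jobs.update(dict.fromkeys(process(job)))   # adds a brand-new key

# One shape that survives growth: a list for order, a set for membership.
ordered = list(initial_jobs)
seen = set(initial_jobs)
i = 0
while i < len(ordered):
    job = ordered[i]
    i += 1
    for new_job in process(job):
        if new_job not in seen:       # each job is handled at most once
            seen.add(new_job)
            ordered.append(new_job)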

Brandt
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/SXGD5G5EY4YMXDG42AU7OSNVCUUU25DI/

Larry Hastings

unread,
Dec 28, 2019, 7:07:14 PM12/28/19
to pytho...@python.org


On 12/27/19 7:44 PM, Tim Peters wrote:
> [Nick Coghlan <ncog...@gmail.com>]
> > I took Larry's request a slightly different way: he has a use case where
> > he wants order preservation (so built in sets aren't good), but combined
> > with low cost duplicate identification and elimination and removal of
> > arbitrary elements (so lists and collections.deque aren't good). Organising
> > a work queue that way seems common enough ...
> Is it?  I confess I haven't thought of a plausible use case.  Larry
> didn't really explain his problem, just suggested that an ordered set
> would be "a solution" to it.


Here is the original description of my problem, from the original email in this thread.  I considered this an adequate explanation of my problem at the time.


On 12/15/19 6:48 PM, Larry Hastings wrote:

> I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready".  The existing set object would work fine here, except that it doesn't maintain insertion order.  That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult.  I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.


In subsequent emails I've clarified that my workloads are small enough--and computers are fast enough--that almost any data structure would work fine for me here, e.g. a list.  I don't think my needs should drive the decision making process regardless.  I only described my problem to get the conversation rolling.

I opine that anybody who iterates over set objects and has bugs in their code would appreciate set objects maintaining insertion order.  I suspect it would make their debugging easier, because given identical inputs their set iteration would behave identically, thus making their bugs that much more stable.  That's as much use case as I have for the feature.
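
A quick illustration of that instability (not Larry's code): with string
elements, set iteration order depends on per-process hash randomization,
so it can change from run to run unless PYTHONHASHSEED is pinned, while
a dict used as an ordered set is stable:

names = ["gamma", "alpha", "delta", "beta"]

print(list(set(names)))            # order may differ between runs
print(list(dict.fromkeys(names)))  # always insertion order:
                                   # ['gamma', 'alpha', 'delta', 'beta']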


/arry

Tim Peters

unread,
Dec 28, 2019, 9:31:28 PM12/28/19
to Larry Hastings, Python Dev
[Larry]
> Here is the original description of my problem, from the original email in
> this thread. I considered this an adequate explanation of my problem
> at the time.

>> I do have a use case for this. In one project I maintain a "ready" list of
>> jobs; I need to iterate over it, but I also want fast lookup because I
>> sometimes remove jobs when they're subsequently marked "not ready".

Which is a description not of "the problem", but of the operations you
want to perform. It's the first time in my life I've heard of a
"ready list of jobs" where the programmer sometimes decides later "oh,
no - this one and that one aren't ready after all" ;-)

Not that it matters much - you want the operations you want. The lack
of "oh - of course - that's a problem we all face at times"
motivation, though, works against it being _compelling_ on its own.

> ...
> In subsequent emails I've clarified that my workloads are small enough--and
> computers are fast enough--that almost any data structure would work fine
> for me here, e.g. a list.
>
> I don't think my needs should drive the decision making process regardless.
> I only described my problem to get the conversation rolling.

Which it did :-) Others went on to, explicitly or implicitly, suggest
that an ordered set must also, e.g., support the entire
MutableSequence interface, and even define what happens if you mutate
the ordered set _while_ iterating over it (lists define the results in
such cases, but ordered dicts raise an exception).


> I opine that anybody who iterates over set objects and has bugs in their
> code would appreciate set objects maintaining insertion order. I suspect
> it would make their debugging easier, because given identical inputs their
> set iteration would behave identically, thus making their bugs that much
> more stable. That's as much use case as I have for the feature.

That's much stronger to me than some weird FIFO-only queue supporting
fast deletion of interior elements (which - sorry - I've never had a
use for).

And fine by me if such a thing is added. All else being equal, I'd
prefer that the builtin set type be ordered, simply because it's less
surprising, and dicts are ordered now (although, as Inada Naoki said,
their current implementation isn't suitable for a FIFO queue, due
primarily to the O(N) time needed to find "the first" key+value record).

I believe we agree that adding _some_ flavor of OrderedSet to
`collections` would be a fine start. I'm just not cool with replacing
the builtin set before there's an implementation fast and
memory-efficient enough that _current_ heavy set users won't feel
betrayed by major slowdowns or memory bloat when the switch is made.

Toward that end, my opinion - which I won't defend - is that
OrderedSet should not promise better than O(N)-time indexing (if it
supports indexing at all), and should - like current sets and dicts -
try to raise an exception if it can detect that the container has been
mutated during iteration.
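
For the record, the current behavior being referred to:

s = {1, 2, 3}
try:
    for x in s:
        s.add(x + 100)    # grows the set mid-iteration
except RuntimeError as exc:
    print(exc)            # "Set changed size during iteration"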
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/AH4LQB3OCMCE22FEWUALRND5P6AHUWE6/

Paddy McCarthy

unread,
Dec 29, 2019, 9:30:33 AM12/29/19
to Larry Hastings, python-dev
Hi Larry, sets (and mappings, i.e. dicts) have common functionality across many languages and libraries, and that functionality does not include ordering. Already, in CPython, there is a need to somehow indicate that insertion ordering is being relied on in dicts, or else to use OrderedDict. I am quite happy with keeping sets as they are and coding any extra functionality required on top of this. This aids me, as I work in many languages that have their own sets and mappings, usually without any insertion ordering.
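
For what it's worth, the "on top of this" layer can be small. A sketch
of an insertion-ordered set built from a plain dict (illustration only,
covering just the operations discussed in this thread):

class InsertionOrderedSet:
    def __init__(self, iterable=()):
        self._data = dict.fromkeys(iterable)

    def add(self, item):
        self._data[item] = None      # keeps position if already present

    def discard(self, item):
        self._data.pop(item, None)

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)      # insertion order

    def __len__(self):
        return len(self._data)

    def __or__(self, other):
        result = InsertionOrderedSet(self)
        result._data.update(dict.fromkeys(other))
        return result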

Cheers, Paddy. 


_______________________________________________

Larry Hastings

unread,
Dec 29, 2019, 10:59:00 AM12/29/19
to Tim Peters, Python Dev


On 12/28/19 6:24 PM, Tim Peters wrote:
> [Larry]
> > Here is the original description of my problem, from the original email in
> > this thread.  I considered this an adequate explanation of my problem
> > at the time.
>
> >> I do have a use case for this. In one project I maintain a "ready" list of
> >> jobs; I need to iterate over it, but I also want fast lookup because I
> >> sometimes remove jobs when they're subsequently marked "not ready".
>
> Which is a description not of "the problem", but of the operations you
> want to perform.  It's the first time in my life I've heard of a
> "ready list of jobs" where the programmer sometimes decides later "oh,
> no - this one and that one aren't ready after all" ;-)


It's a lightweight abstract dependency graph.  Its nodes are opaque, only required to be hashable.  And it doesn't require that you give it all the nodes in strict dependency order.

When you add a node, you can also optionally specify dependencies, and those dependencies aren't required to be nodes that the graph has previously seen.  So you can add a node A that depends on B and C, without showing it B or C first.  When that happens it creates placeholder nodes for B and C, and naturally these nodes have no dependencies.  You can then later inform the graph that node B has dependencies on X Y and Z.

(My specific use case is a package manager.  Packages are identified with unique strings.  So the nodes I give the graph are simply those package names.  It's pretty common for me to get a package with dependencies on packages I haven't seen yet.  Designing the graph to create placeholders let me skip making two passes over the package list, pre-creating the package objects, etc etc.  This made the code simpler and presumably faster.)

The graph API maintains an externally-visible "ready" iterable of nodes.  And since B can go from ready to not-ready, it can get added to "ready" and subsequently removed.

Also, when a node is satisfied, you simply remove it from the graph.  If A depends on B and C, and they all represent jobs, and B and C have no dependencies, they'll be in the "ready" iterable.  You iterate over "ready", and execute those jobs, and assuming they are successful you then remove them from the graph.  When A's dependencies are all satisfied, it'll get added to the "ready" iterable.  And naturally when B and C were removed from the graph, they were removed from the "ready" iterable too.

Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and c) support fast removal.  I add and remove nodes from that set a lot, which I realized would perturb the iteration order from run to run, etc etc etc, and here we are.
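
A rough reconstruction of that design (a sketch under the constraints
described above, not Larry's actual code), using a dict with None
values as the insertion-ordered "ready" collection:

class DependencyGraph:
    def __init__(self):
        self.waiting_on = {}   # node -> set of unsatisfied dependencies
        self.dependents = {}   # dependency -> set of nodes waiting on it
        self.ready = {}        # dict-as-ordered-set of ready nodes

    def add(self, node, dependencies=()):
        for dep in dependencies:
            if dep not in self.waiting_on:
                self.add(dep)          # placeholder node, no dependencies
            self.dependents.setdefault(dep, set()).add(node)
        pending = self.waiting_on.get(node, set()) | set(dependencies)
        self.waiting_on[node] = pending
        if pending:
            self.ready.pop(node, None) # node can go ready -> not ready
        else:
            self.ready[node] = None

    def remove(self, node):
        # Mark `node` satisfied; its dependents may become ready.
        self.ready.pop(node, None)
        del self.waiting_on[node]
        for dependent in self.dependents.pop(node, ()):
            self.waiting_on[dependent].discard(node)
            if not self.waiting_on[dependent]:
                self.ready[dependent] = None

g = DependencyGraph()
g.add("A", ["B", "C"])        # B and C become ready placeholders
g.add("B", ["X", "Y", "Z"])   # B drops out of "ready" again
for job in list(g.ready):     # iterate a stable snapshot
    g.remove(job)             # pretend each job succeeded
print(list(g.ready))          # ['B'] - ready now that X, Y, Z are done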


I grant you maybe it's a weird approach.  But after two false starts, where I was starting to boil the oceans with ABCs and "node is satisfied" APIs and separate iterator objects, I had a re-think and hit on this super lightweight design.  I'm actually quite pleased with it--it's short, it has a minimal API, it's readable, it was easy to get right, and it solves my problem.


Happy new year,


/arry

Tim Peters

unread,
Dec 29, 2019, 5:12:51 PM12/29/19
to Larry Hastings, Python Dev
[Larry]
OK! You're doing a topological sort. There are natural & simple ways
to do that right now that scale efficiently to very large graphs (time
& space linear in the sum of the number of nodes and the number of
dependencies). Curiously, the difficulties are mostly driven by the
quality of _error_ messages you want (in case of, e.g., cycles in the
dependency graph).

Lots of those implementations became deterministic "by magic" when
ordered dicts were introduced. This is why: a bare bones
implementation needs to associate two pieces of info with each node:
a list of its immediate successors, and an integer count of the number
of its immediate predecessors. A dict is the natural way to implement
that association.

When all the dependency info has been entered, then:

- First all nodes are emitted whose predecessor count is 0. Iterating
over the association dict once is the natural way to find them, and
that order is defined now.

- Then, as each emitted node N is marked done:
    for child in N.successors:
        assert child.predcount > 0
        child.predcount -= 1
        if child.predcount == 0:
            emit(child)

That's about all there is to it. But there's no end to the cruft that
_can_ be added to, e.g., verify that a node is marked done no more
than once, or to compute & display strongly connected components if
not all nodes are emitted, etc.

Ordered sets could improve this "natural" implementation if the
successor lists were successor sets instead, simply because then -
like lists - their iteration order would be defined, which can in turn
affect the order in which nodes are emitted in the loop just above.
But lists are adequate to the task, are cheaper in both time and
space, and can't _not_ be ordered ;-)
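
One way to fill in that sketch as runnable code (not Tim's actual code;
`deps` maps each node to its immediate predecessors):

def topsort(deps):
    successors = {}    # node -> list of immediate successors
    predcount = {}     # node -> count of immediate predecessors
    for node, predecessors in deps.items():
        predcount.setdefault(node, 0)
        for pred in predecessors:
            successors.setdefault(pred, []).append(node)
            predcount.setdefault(pred, 0)
            predcount[node] += 1
    # First emit all nodes with no predecessors; dict order makes
    # the result deterministic.
    order = [node for node, count in predcount.items() if count == 0]
    for node in order:                  # the list grows as we iterate
        for child in successors.get(node, ()):
            predcount[child] -= 1
            if predcount[child] == 0:
                order.append(child)
    if len(order) != len(predcount):
        raise ValueError("cycle in the dependency graph")
    return order

print(topsort({"A": ["B", "C"], "B": ["X", "Y"]}))
# -> ['C', 'X', 'Y', 'B', 'A']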


> Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and
> c) support fast removal. I add and remove nodes from that set a lot, which I
> realized would perturb the iteration order from run to run, etc etc etc, and here
> we are.

The sketch above (which is a bog-common way to implement topsorts)
doesn't build a data structure recording the final order, and, in
fact, never deletes anything. You might protest that the initial
iteration step (scanning the association dict to find nodes with no
predecessors) is an expense you skip because nodes with predecessors
are deleted from your "ready" structure all along. But nodes with
predecessors are _touched_ either way, and merely skipping over them
is bound to be cheaper than deleting them.

After that initial step, no search of any kind is needed again.

> I grant you maybe it's a weird approach. But after two false starts, where I
> was starting to boil the oceans with ABCs and "node is satisfied" APis and
> separate iterator objects, I had a re-think and hit on this super lightweight
> design. I'm actually quite pleased with it--it's short, it has a minimal API,
> it's readable, it was easy to get right, and it solves my problem.

Whereas I would have started with a topsort and finished it while I
was sleeping ;-) Seriously, I've written such a thing many times, but
never reuse the code because it's so easy to redo from scratch. Every
application has _some_ unique twist, which makes a general-purpose API
so bloated it's harder to figure out how to use than to write custom
code.

In your application, I'm guessing (but don't know) that when a package
name is emitted, it's possible you're not able to find the package,
and so the process has to stop. That's why I inserted an "as each
emitted node N is marked done" step into my description. That is, I'm
picturing an API that adds a (say)

topsorter.record_succeeded(node)

method to say that - yes - the node was processed successfully. Else,
by default, its successors (if any) must not be emitted.

But if that isn't needed, the whole topsort order can be materialized
into (say) a list in one gulp, or delivered by a generator.


> Happy new year,

And to you too! :-)
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/HRKQPTPQQNIU3DYMYUYRC7USVRJGMRFC/

Wes Turner

unread,
Dec 29, 2019, 9:00:22 PM12/29/19
to Tim Peters, Python Dev
Due to version and platform constraints, a SAT solver is necessary for conda and (now) pip. Libsolv is one of the fastest around.

 https://github.com/QuantStack/mamba is conda implemented with libsolv (and parallel downloading of *declarative* dependency metadata).

For general purpose graphs with implicit node instantiation from edge declaration, NetworkX has implementations of many graph algorithms.
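
For instance, a deterministic topological sort with NetworkX (assuming
the package is installed; edges point from dependency to dependent):

import networkx as nx

g = nx.DiGraph()
g.add_edge("B", "A")   # A depends on B; B is created implicitly
g.add_edge("C", "A")   # A depends on C
g.add_edge("X", "B")   # B depends on X

print(list(nx.topological_sort(g)))   # e.g. ['C', 'X', 'B', 'A']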

CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from Python) with a "NetworkX-like API", but doesn't yet have a topological sort (though it does have BFS).

"pip needs a dependency resolver" + End (Fn+Right) links to the latest work on the new pip code (that must require declarative dependency metadata)

That said, this implementation of topo sort could have a deterministic output given an OrderedSet.

A table including Big-O and memory usage for the necessary data structure and methods would be useful.

pylang

unread,
Dec 30, 2019, 6:11:01 PM12/30/19
to Wes Turner, Tim Peters, Python Dev
A set that preserves insertion order was requested.  Mathematically, such a structure relates more to an Oset (having order) than to a Set.  See the relationships in the image below:

[image: libo.png - diagram of relationships among set-like structures]

Each of the mentioned structure types has a dedicated role.  Python's sets work well and correlate with existing math principles.  I would sooner see a new collection introduced to Python than the existing sets altered.

-1: preserving insertion-order in sets
 0: adding `OrderedSet` (where `dict.fromkeys()` does not suffice, i.e. needing set operations)
+1: implementing an analogous abstract collection, e.g. "`HashList`" -> Oset, just as `Counter` -> Mset

Brett Cannon

unread,
Jan 2, 2020, 7:08:06 PM1/2/20
to pytho...@python.org

Kyle Stanley

unread,
Jan 8, 2020, 1:01:06 AM1/8/20
to Brett Cannon, Python Dev
> The only `OrderedSet` use I have seen in the wild is https://github.com/python-hyper/uritemplate/search?q=orderedset&unscoped_q=orderedset .

A more generalized Python code search across GitHub of "orderedset" returns ~500k results: https://github.com/search?l=Python&q=orderedset&type=Code .

Of course, a general code search across GitHub is by no means a clear or reliable measure of actual usage, considering the number of duplicate results. But it's still useful for a rough estimate.

Victor Stinner

unread,
Jan 8, 2020, 5:36:37 AM1/8/20
to Kyle Stanley, Python Dev
On Wed, Jan 8, 2020 at 07:02, Kyle Stanley <aero...@gmail.com> wrote:
> A more generalized Python code search across GitHub of "orderedset" returns ~500k results: https://github.com/search?l=Python&q=orderedset&type=Code .

Sadly this search seems mostly to count how many projects put their virtual
environment on GitHub. Example of a result:
venv/Lib/site-packages/setuptools/_vendor/ordered_set.py

It's a vendored copy of the https://pypi.org/project/ordered-set/
project used by setuptools.

Libraries.io counts 100 repositories and 20 packages which depend on
ordered-set:
https://libraries.io/pypi/ordered-set/dependents

Victor

_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/pytho...@python.org/message/5EYOBOSU6XJYUZRIQ2XX2ICODNA77FBJ/