As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?
I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready". The existing set object would work fine here, except that it doesn't maintain insertion order. That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult. I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.
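The dict-with-None-values workaround described above can be sketched like this (the job names are invented for illustration):

```python
# Using a dict with every value set to None as an insertion-ordered "set".
# The keys act as set members; dicts guarantee insertion order as of 3.7.
ready = {}
for job in ("build", "test", "deploy"):
    ready[job] = None                # "add" a member

assert "test" in ready               # fast membership test
del ready["test"]                    # fast removal when a job goes "not ready"
assert list(ready) == ["build", "deploy"]   # stable, reproducible iteration order
```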
ISTM that all the reasons why dicts should maintain insertion
order also apply to sets, and so I think we should probably do
this. Your thoughts?
/arry
p.s. My dim recollection is that the original set object was a
hacked-up copy of the dict object removing the ability to store
values. So there's some precedent for dicts and sets behaving
similarly. (Note: non-pejorative use of "hacked-up" here, that
was a fine way to start.)
That's not a reason why we shouldn't do it. Python is the language where speed, correctness, and readability trump performance every time. We should decide what semantics we want, then do that, period. And anyway, it seems like some genius always figures out how to make it fast sooner or later ;-)
I can believe that, given the current implementation, sets might
give up some of their optimizations if they maintained insertion
order. But I don't understand what "flexibility" they would
lose. Apart from being slower, what would set objects give up if
they maintained insertion order? Are there operations on set
objects that would no longer be possible?
/arry
_______________________________________________
Python-Dev mailing list -- pytho...@python.org
To unsubscribe send an email to python-d...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Code of Conduct: http://python.org/psf/codeofconduct/
On Dec 15, 2019, at 6:48 PM, Larry Hastings <la...@hastings.org> wrote:
> As of 3.7, dict objects are guaranteed to maintain insertion order. But set objects make no such guarantee, and AFAIK in practice they don't maintain insertion order either. Should they?
I don't think they should. Several thoughts:
* The corresponding mathematical concept is unordered and it would be weird to impose such an order.
I disagree--I assert it's no more or less weird than imposing order on dicts, which we eventually decided to do.
* You can already get membership testing while retaining insertion ordering by running dict.fromkeys(seq).
I'm not sure where you're going with this. The argument "you can do this with other objects" didn't preclude us from, for example, adding ordering to the dict object. We already had collections.OrderedDict, and we decided to add the functionality to the dict object.
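For reference, the dict.fromkeys() idiom mentioned in the quoted bullet works like this; it also deduplicates while preserving first-seen order:

```python
# dict.fromkeys builds an insertion-ordered mapping with every value None,
# deduplicating the input while preserving first-seen order.
seq = ["b", "a", "b", "c", "a"]
ordered = dict.fromkeys(seq)
assert list(ordered) == ["b", "a", "c"]
assert "c" in ordered                # membership testing, like a set
```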
* Set operations have optimizations that preclude giving a guaranteed order (for example, set intersection loops over the smaller of the two input sets).
* To implement ordering, set objects would have to give up their current membership testing optimization that exploits cache locality in lookups (it looks at several consecutive hashes at a time before jumping to the next random position in the table).
* The ordering we have for dicts uses a hash table that indexes into a sequence. That works reasonably well for typical dict operations but is unsuitable for set operations where some common use cases make interspersed additions and deletions (that is why the LRU cache still uses a cheaply updated doubly-linked list rather than deleting and reinserting dict entries).
These are all performance concerns. As I mentioned previously in
this thread, in my opinion we should figure out what semantics we
want for the object, then implement those, and only after that
should we worry about performance. I think we should decide the
question "should set objects maintain insertion order?" literally
without any consideration about performance implications.
* This idea has been discussed a couple of times before and we've decided not to go down this path. I should document this decision prominently, because it is such an obvious thing to consider that it will inevitably be suggested periodically.
Please do!
In the email quoted by Tim Peters earlier in the thread, you
stated that set objects "lose their flexibility" if they maintain
order. Can you be more specific about what you meant by that?
/arry
p.s. To be clear, I'm not volunteering to add this feature to the
set object implementation--that's above my pay grade. But I'm
guessing that, if we decided we wanted these semantics for the
set object, someone would volunteer to implement it.
On Mon 12/16/19, 9:59 AM, "David Mertz" <me...@gnosis.cx> wrote:
Admittedly, I was only lukewarm about making an insertion-order guarantee for dictionaries too. But for sets I feel more strongly opposed. Although it seems unlikely now, if some improved implementation of sets had the accidental side effects of making them ordered, I would still not want that to become a semantic guarantee.
Eek… No accidental side effects whenever possible, please. People come to rely upon them (like that chemistry paper example[*]), and changing the assumptions results in a lot of breakage down the line. Changing the length of AWS identifiers (e.g. instances from i-1234abcd to i-0123456789abcdef) was a huge deal; even though the identifier length was never guaranteed, plenty of folks had created database schemata with VARCHAR(10) for instance ids, for example.
Break assumptions from day 1. If some platforms happen to return sorted results, sort the other platforms or toss in a sorted(seq, key=lambda el: random.random()) call on the sorting platform. If you're creating custom identifiers, allocate twice as many bits as you think you'll need to store them.
Yes, this is bad user code, but I’m all for breaking bad user code in obvious ways sooner rather than subtle ways later, especially in a language like Python.
BTW, what should {1, 2} | {3, 4, 5, 6, 7} return as ordered sets? Beats me.
The obvious answer is {1, 2, 3, 4, 5, 6, 7}. Which is the result
I got in Python 3.8 just now ;-) But that's just a side-effect of
how hashing numbers works, the implementation, etc. It's rarely
stable like this, and nearly any other input would have resulted
in the scrambling we all (currently) expect to see.
>>> {"apples", "peaches", "pumpkin pie"} | {"who's", "not", "ready", "holler", "I" }
{'pumpkin pie', 'peaches', 'I', "who's", 'holler', 'ready', 'apples', 'not'}
/arry
If it's desired that "insertion order" be consistent across runs, platforms, and releases, then what "insertion order" _means_ needs to be rigorously defined & specified for all set operations. This was comparatively trivial for dicts, because there are, e.g., no commutative binary operators defined on dicts.
If you want to insist that `a | b` first list all the elements of a, and then all the elements of b that weren't already in a, regardless of cost, then you create another kind of unintuitive surprise: in general the result of "a | b" will display differently than the result of "b | a" (although the results will compare equal), and despite that the _user_ didn't "insert" anything.
Call me weird--and I won't disagree--but I find nothing
unintuitive about that. After all, that's already the world we
live in: there are any number of sets that compare as equal but
display differently. In current Python:
>>> a = {'a', 'b', 'c'}
>>> d = {'d', 'e', 'f'}
>>> a | d
{'f', 'e', 'd', 'a', 'b', 'c'}
>>> d | a
{'f', 'b', 'd', 'a', 'e', 'c'}
>>> a | d == d | a
True
This is also true for dicts, in current Python, which of course do maintain insertion order. Dicts don't have the | operator, so I approximate the operation by duplicating the dict (which AFAIK preserves insertion order) and using update.
>>> aa = {'a': 1, 'b': 1, 'c': 1}
>>> dd = {'d': 1, 'e': 1, 'f': 1}
>>> x = dict(aa)
>>> x.update(dd)
>>> y = dict(dd)
>>> y.update(aa)
>>> x
{'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1, 'f': 1}
>>> y
{'d': 1, 'e': 1, 'f': 1, 'a': 1, 'b': 1, 'c': 1}
>>> x == y
True
Since dicts already behave in exactly that way, I don't think it would be too surprising if sets behaved that way too. In fact, I think it's a little surprising that they behave differently, which I suppose was my whole thesis from the beginning.
Cheers,
/arry
Here's one (very simplified and maybe biased) view of the history of dicts:
* 2.x: Dicts are unordered, please don't rely on the order.
* 3.3: Dict iteration order is now randomized. We told you not to rely on it!
* 3.6: We now use an optimized implementation of dict that saves memory! As a side effect it makes dicts ordered, but that's an implementation detail, please don't rely on it.
* 3.7: Oops, people are now relying on dicts being ordered. Insisting on people not relying on it is battling windmills. Also, it's actually useful sometimes, and alternate implementations can support it pretty easily. Let's make it a language feature! (Later it turns out MicroPython can't support it easily. Tough luck.)
A very nice summary! My only quibble is as above: the "compact dict" implementation doesn't maintain insertion order naturally, _unless_ there are no deletions (which is, e.g., true of dicts constructed to pass keyword arguments). The code got hairier to maintain insertion order in the presence of mixing insertions and deletions.
Didn't some paths also get sliiiiightly slower as a result of maintaining insertion order when mixing insertions and deletions? My recollection is that that was part of the debate--not only "are we going to regret inflicting these semantics on posterity, and on other implementations?", but also "are these semantics worth the admittedly-small performance hit, in Python's most important and most-used data structure?".
Also, I don't recall anything about us resigning ourselves to explicitly maintain ordering on dicts because people were relying on it, "battling windmills", etc. Dict ordering had never been guaranteed, a lack of guarantee Python had always taken particularly seriously. Rather, we decided it was a desirable feature, and one worth pursuing even at the cost of a small loss of performance. One prominent Python core developer** wanted this feature for years, and I recall them saying something like:
Guido says, "When a programmer iterates over a dictionary and they see the keys shift around when the dictionary changes, they learn something!" To that I say--"Yes! They learn that Python is unreliable and random!"
/arry
** I won't give their name here because I fear I'm misquoting
everybody involved. Apologies in advance if that's the case!
[Larry]
> Python is the language where speed, correctness, and readability trump performance every time.
Speed trumping performance didn't make any sense ;-)
Sorry, that was super unclear! I honestly meant speed of
development.
D'oh,
/arry
Without being facetious[1], if you don't care about performance you don't need a set; you could use a list.
Lists don't enforce uniqueness. Apart from that a list would
probably work fine for my needs; in my admittedly-modest workloads
I would probably never notice a performance difference. My
anecdote was merely a jumping-off point for the discussion.
"I don't care about performance" is not because I'm aching for
Python to run my code slowly. It's because I'm 100% confident
that the Python community will lovingly optimize the
implementation. So when I have my language designer hat on, I
really don't concern myself with performance. As I thought I said
earlier in the thread, I think we should figure out the semantics
we want first, and then we figure out how to make
it fast.
I'll also cop to "a foolish consistency is the hobgoblin of little minds". I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values". I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.
/arry
On Tue, 17 Dec 2019 at 11:13, Larry Hastings <la...@hastings.org> wrote:
> I lack this strongly mathematical view of sets others have espoused; instead I view them more like "dicts without values". I'm therefore disgruntled by this inconsistency between what I see as closely related data structures, and it makes sense to me that they'd maintain their insertion order the same way that dictionaries now do.
I'll admit to a mathematical background, which probably influences my
views. But I view sets as "collections" of values - values are "in"
the set or not. I don't see them operationally, in the sense that you
add and remove things from the set. Adding and removing are mechanisms
for changing the set, but they aren't fundamentally what sets are
about (which for me is membership). So insertion order on sets is
largely meaningless for me (it doesn't matter *how* it got there, what
matters is whether it's there now).
Having said that, I also see dictionaries as mappings from key to
value, so insertion order is mostly not something I care about for
dictionaries either. I can see the use cases for ordered dictionaries,
and now we have insertion order, it's useful occasionally, but it's
not something that was immediately obvious to me based on my
intuition. Similarly, I don't see sets as *needing* insertion order,
although I concede that there probably are use cases, similar to
dictionaries. The question for me, as with dictionaries, is whether
the cost (in terms of clear semantics, constraints on future
implementation choices, and actual performance now) is justified by
the additional capabilities (remembering that personally, I'd consider
insertion order preservation as very much a niche requirement).
But there are also other optimizations in the current set
implementation, so "fine, add the doubly linked list to sets but not
to dicts" is only part of it.
Which may or may not be possible to match, let alone beat, in an
ordered set implementation. A practical barrier now is that Python is
too mature to bank on loving optimizations _after_ a change to a core
feature is released. It's going to need a highly polished
implementation first.
[Nick Coghlan <ncog...@gmail.com>]
> Starting with "collections.OrderedSet" seems like a reasonable idea,
> though - that way "like a built-in set, but insertion order preserving" will
> have an obvious and readily available answer, and it should also
> make performance comparisons easier.
Ya, I suggested starting with collections.OrderedSet earlier, but gave up on it.
The problem is that the "use case" here isn't really a matter of
functionality, but of consistency: "it would be nice if", like dicts
enjoy now, the iteration order of sets was defined in an
implementation-independent way. That itch isn't scratched unless the
builtin set type defines it.
I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built in sets aren't good), but combined with low cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough to me to be worthy of a stdlib solution that doesn't force you to either separate a "job id" from the "job" object in an ordered dict, or else use an ordered dict with "None" values.
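A collections.OrderedSet along those lines could be sketched on top of a dict. This is a hypothetical illustration (no such class exists in the stdlib), not a proposed implementation:

```python
class OrderedSet:
    """Hypothetical sketch: an insertion-ordered set backed by a dict."""

    def __init__(self, iterable=()):
        self._items = dict.fromkeys(iterable)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def __or__(self, other):
        # "a | b": all of a's elements, then b's elements not already in a.
        result = OrderedSet(self)
        for item in other:
            result.add(item)
        return result

    def __eq__(self, other):
        # Order-insensitive equality, matching built-in set semantics.
        return set(self._items) == set(other)
```

With this sketch, a | b and b | a compare equal but iterate in different orders, which is exactly the dict behavior shown elsewhere in the thread.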
So what problem(s) would a dynamic ordered set really be aiming at? I
don't know. Consistency & predictability I understand and appreciate,
but little beyond that. Even FIFO ordering is somewhat a PITA, since
`next(iter(set))` is an overly elaborate way to need to spell "get the
first" if a FIFO queue _is_ a prime dynamic use case. And there's no
efficient way at all to access the other end (although I suppose
set.pop() would - like dict.popitem() - change to give a _destructive_
way to get at "the other end").
> It's not obvious to me that insertion order is even the most obvious or
> most commonly relevant sort order. I'm sure it is for Larry's program, but
> often a work queue might want some other order. Very often queues
> might instead, for example, have a priority number assigned to them.
Indeed, and it makes me more cautious that claims for the _usefulness_
(rather than just consistency) of an ordered set are missing an XY
problem. The only "natural" value of insertion-order ordering for a
dynamic ordered set is that it supports FIFO queues. Well, that's one
thing that a deque already excels at, but much more obviously,
flexibly, and space- and time- efficiently.
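For comparison, collections.deque already gives cheap FIFO access at both ends:

```python
from collections import deque

# A deque supports O(1) appends and pops at both ends,
# which is what a FIFO work queue actually needs.
q = deque()
q.append("job1")               # enqueue at the right
q.append("job2")
q.appendleft("job0")           # the left end is just as cheap
assert q.popleft() == "job0"   # FIFO: first in, first out
assert q.pop() == "job2"
assert list(q) == ["job1"]
```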
(mixing together messages in the thread, sorry threaded-view
readers)
[Nick]
I took Larry's request a slightly different way:
[...]
Only Larry can answer whether that would meet his perceived need. My _guess_ is that he wouldn't know OrderedSet existed, and, even if he did, he'd use a dict with None values anyway (because it's less hassle and does everything he wanted).
At last, something I can speak knowledgeably about: Larry's use case!
Also, speaking personally, at least once (maybe twice?) in this thread folks have asked "would YOU, Mr Larry, really want ordered sets if it meant sets were slightly slower?"
The first thing I'd say is, I'm not sure why people should care about what's best for me. That's sweet of you! But you really shouldn't.
The second thing is, my needs are modest, so the speed hit we're talking about would likely be statistically insignificant, for me.
And the third thing is, I don't really put the set() API through
much of a workout anyway. I build sets, I add and remove items, I
iterate over them, I do the occasional union, and only very rarely
do I use anything else. Even then, when I write code where I
reach for a fancy operation like intersection or
symmetric_difference--what tends to happen is, I have a rethink
and a refactor, and after I simplify my code I don't need the
fancy operations anymore. I can't remember the last time I used
one of these fancy operations where the code really stuck around
for a long time.
So, personally? Sure, I'd take that tradeoff. As already
established, I like that dict objects maintain their order, and I
think it'd be swell if sets maintained their order too. I suspect
the performance hit wouldn't affect me in any meaningful
way, and I could benefit from the order-maintaining semantics. I
bet it'd have other benefits too, like making regression tests
more stable. And my (admittedly uninformed!) assumption is, the
loss of performance would mostly be in these sophisticated
operations that I never seem to use anyway.
But I don't mistake my needs for the needs of the Python
community at large. I'd be mighty interested to read other folks
coming forward and saying, for example, "please don't make set
objects any slower!" and talking us through neat real-world use
cases. Bonus points if you include mind-boggling numbers and
benchmarks!
As Larry pointed out earlier, ordered dicts posed a problem for MicroPython.
Just a minor correction: that wasn't me, that was Petr Viktorin.
Ten days left until we retire 2.7,
/arry
Chris Angelico wrote:
> I think this is an artifact of Python not having an empty set literal.
> [snip]
> When both use the constructor call or both use a literal, the
> difference is far smaller. I'd call this one a wash.
Ah, good catch. I hadn't considered that it would make a substantial difference, but that makes sense. Here's an additional comparison between "{}" and "dict()" to confirm it:
>>> timeit.timeit("{}", number=100_000_000)
2.1038335599987477
>>> timeit.timeit("dict()", number=100_000_000)
10.225583500003268
We could also go the other way with it: compare a one-element set against a one-element dict (mapping the key to a don't-care value), because that way we can use a set literal:
>>> timeit.timeit("set()", number=100_000_000)
8.579900023061782
>>> timeit.timeit("dict()", number=100_000_000)
10.473437276901677
>>> timeit.timeit("{0}", number=100_000_000)
5.969995185965672
>>> timeit.timeit("{0:0}", number=100_000_000)
6.24465325800702
(I ran all four of those so you can get a relative sense on my
laptop, which I guess is only sliiiightly slower than Kyle's.)
Happy holidays,
/arry
[Nick Coghlan <ncog...@gmail.com>]
> I took Larry's request a slightly different way: he has a use case where he wants order preservation (so built in sets aren't good), but combined with low cost duplicate identification and elimination and removal of arbitrary elements (so lists and collections.deque aren't good). Organising a work queue that way seems common enough ...
Is it? I confess I haven't thought of a plausible use case. Larry didn't really explain his problem, just suggested that an ordered set would be "a solution" to it.
Here is the original description of my problem, from the original
email in this thread. I considered this an adequate explanation
of my problem at the time.
I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready". The existing set object would work fine here, except that it doesn't maintain insertion order. That means multiple runs of the program with the same inputs may result in running jobs in different orders, and this instability makes debugging more difficult. I've therefore switched from a set to a dict with all keys mapped to None, which provides all the set-like functionality I need.
In subsequent emails I've clarified that my workloads are small enough--and computers are fast enough--that almost any data structure would work fine for me here, e.g. a list. I don't think my needs should drive the decision making process regardless. I only described my problem to get the conversation rolling.
I opine that anybody who iterates over set objects and has bugs
in their code would appreciate set objects maintaining insertion
order. I suspect it would make their debugging easier, because
given identical inputs their set iteration would behave
identically, thus making their bugs that much more stable. That's
as much use case as I have for the feature.
/arry
[Larry]
> Here is the original description of my problem, from the original email in this thread. I considered this an adequate explanation of my problem at the time.
> I do have a use case for this. In one project I maintain a "ready" list of jobs; I need to iterate over it, but I also want fast lookup because I sometimes remove jobs when they're subsequently marked "not ready".
Which is a description not of "the problem", but of the operations you want to perform. It's the first time in my life I've heard of a "ready list of jobs" where the programmer sometimes decides later "oh, no - this one and that one aren't ready after all" ;-)
It's a lightweight abstract dependency graph. Its nodes are opaque, only required to be hashable. And it doesn't require that you give it all the nodes in strict dependency order.
When you add a node, you can also optionally specify dependencies, and those dependencies aren't required to be nodes that the graph has previously seen. So you can add a node A that depends on B and C, without showing it B or C first. When that happens it creates placeholder nodes for B and C, and naturally these nodes have no dependencies. You can then later inform the graph that node B has dependencies on X Y and Z.
(My specific use case is a package manager. Packages are
identified with unique strings. So the nodes I give the
graph are simply those package names. It's pretty common for me
to get a package with dependencies on packages I haven't seen
yet. Designing the graph to create placeholders let me skip
making two passes over the package list, pre-creating the package
objects, etc etc. This made the code simpler and presumably
faster.)
The graph API maintains an externally-visible "ready" iterable of nodes. And since B can go from ready to not-ready, it can get added to "ready" and subsequently removed.
Also, when a node is satisfied, you simply remove it from the
graph. If A depends on B and C, and they all represent jobs, and
B and C have no dependencies, they'll be in the "ready" iterable.
You iterate over "ready", and execute those jobs, and assuming
they are successful you then remove them from the graph. When A's
dependencies are all satisfied, it'll get added to the "ready"
iterable. And naturally when B and C were removed from the graph,
they were removed from the "ready" iterable too.
Thus it's this "ready" collection that I want to a) be iterable,
b) be stable, and c) support fast removal. I add and remove nodes
from that set a lot, which I realized would perturb the iteration
order from run to run, etc etc etc, and here we are.
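The design described above might look roughly like this. It's a hypothetical sketch (the class and method names are invented here), not Larry's actual code:

```python
class DependencyGraph:
    """Hypothetical sketch of the lightweight graph described above."""

    def __init__(self):
        self._deps = {}    # node -> set of unsatisfied dependencies
        self._rdeps = {}   # node -> nodes waiting on it
        self.ready = {}    # insertion-ordered "ready" set (dict keys)

    def add(self, node, dependencies=()):
        self._deps.setdefault(node, set())
        self._rdeps.setdefault(node, set())
        for dep in dependencies:
            if dep not in self._deps:
                # Placeholder node: never seen before, so no deps yet.
                self._deps[dep] = set()
                self._rdeps[dep] = set()
                self.ready[dep] = None
            self._deps[node].add(dep)
            self._rdeps[dep].add(node)
        if self._deps[node]:
            self.ready.pop(node, None)   # e.g. B goes ready -> not ready
        else:
            self.ready.setdefault(node, None)

    def remove(self, node):
        """Mark node satisfied; dependents with no deps left become ready."""
        for parent in self._rdeps.pop(node, ()):
            self._deps[parent].discard(node)
            if not self._deps[parent]:
                self.ready[parent] = None
        del self._deps[node]
        self.ready.pop(node, None)
```

Here self.ready is the dict-as-ordered-set from the start of the thread: iterable, stable across identical runs, and with fast removal.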
I grant you maybe it's a weird approach. But after two false
starts, where I was starting to boil the oceans with ABCs and
"node is satisfied" APIs and separate iterator objects, I had a
re-think and hit on this super lightweight design. I'm actually
quite pleased with it--it's short, it has a minimal API, it's
readable, it was easy to get right, and it solves my problem.
Happy new year,
/arry