What happens when a key is deleted? Will that leave an empty slot in
the entry array?
If so, will the array be compacted if the proportion
of empty slots grows beyond a certain limit?
Adding a key would involve simply appending to the entry array (filling
the last empty slot), but if there could be empty slots in other
places, would it look for the first one?
Actually, as I think about it, you could form a linked list (using the
'hash' field) of those empty slots that aren't part of the final
contiguous block and fill those preferentially.
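A minimal sketch of that freelist idea, in the style of the pure-Python model
(the EntryArray name and FREE marker are illustrative, not from the recipe):

FREE = object()   # marker stored in the key/value fields of a freed slot

class EntryArray:
    def __init__(self):
        self.entries = []     # dense [hash, key, value] triples
        self.free_head = -1   # index of the most recently freed slot, -1 if none

    def free(self, idx):
        # The freed slot becomes a freelist node: its hash field now holds
        # the index of the next free slot.
        self.entries[idx] = [self.free_head, FREE, FREE]
        self.free_head = idx

    def allocate(self, hashvalue, key, value):
        if self.free_head != -1:
            idx = self.free_head                     # refill a hole first
            self.free_head = self.entries[idx][0]    # follow the link
            self.entries[idx] = [hashvalue, key, value]
        else:
            idx = len(self.entries)                  # no holes: append
            self.entries.append([hashvalue, key, value])
        return idx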
Hi Raymond,

On Mon, Dec 10, 2012 at 2:44 AM, Raymond Hettinger
<raymond....@gmail.com> wrote:
> Instead, the data should be organized as follows:
>
> indices = [None, 1, None, None, None, 0, None, 2]
> entries = [[-9092791511155847987, 'timmy', 'red'],
>            [-8522787127447073495, 'barry', 'green'],
>            [-6480567542315338377, 'guido', 'blue']]

As a side note, your suggestion also enables order-preserving
dictionaries: iter() would automatically yield items in the order they
were inserted, as long as there was no deletion. People will
immediately start relying on this "feature"... and be confused by the
behavior of deletion. :-/

A bientôt,

Armin.
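To make the observation concrete, here is a minimal sketch reusing the example
data above: the dense entries list is append-only as long as nothing is
deleted, so walking it front to back is insertion order, and iteration never
needs to touch the sparse index table.

def iter_keys(entries):
    # Walk the dense entry array; storage order == insertion order.
    for _hashvalue, key, _value in entries:
        yield key

entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]
print(list(iter_keys(entries)))   # ['timmy', 'barry', 'guido'] -- insertion order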
If that's really an issue, we can deliberately scramble the iteration
order a bit :-) (of course it might negatively impact HW prefetching)
Regards
Antoine.
> On Mon, Dec 10, 2012 at 10:48 AM, Antoine Pitrou
> <soli...@pitrou.net> wrote:
> > Le Mon, 10 Dec 2012 10:40:30 +0100,
> > Armin Rigo <ar...@tunes.org> a écrit :
> >> Hi Raymond,
> >>
> >> On Mon, Dec 10, 2012 at 2:44 AM, Raymond Hettinger
> >> <raymond....@gmail.com> wrote:
> >> > Instead, the data should be organized as follows:
> >> >
> >> > indices = [None, 1, None, None, None, 0, None, 2]
> >> > entries = [[-9092791511155847987, 'timmy', 'red'],
> >> > [-8522787127447073495, 'barry', 'green'],
> >> > [-6480567542315338377, 'guido', 'blue']]
> >>
> >> As a side note, your suggestion also enables order-preserving
> >> dictionaries: iter() would automatically yield items in the order
> >> they were inserted, as long as there was no deletion. People will
> >> immediately start relying on this "feature"... and be confused by
> >> the behavior of deletion. :-/
> >
> > If that's really an issue, we can deliberately scramble the
> > iteration order a bit :-) (of course it might negatively impact HW
> > prefetching)
>
> On the other hand, this would also make a fast ordered dictionary
> subclass possible, just by not using the free list for additions,
> combined with periodic compaction before adds or after deletes.
I suspect that's what Raymond has in mind :-)
Regards
Antoine.
On Dec 10, 2012, at 08:48 AM, Christian Heimes wrote:

>It's hard to predict how the extra CPU instructions are going to affect
>the performance on different CPUs and scenarios. My gut tells me that
>your proposal is going to perform better on modern CPUs with lots of L1
>and L2 cache and in an average case scenario. A worst case scenario with
>lots of collisions might be measurably slower for large dicts and small
>CPU cache.

I'd be interested to see what effect this has on memory constrained platforms
such as many current ARM applications (most likely 32-bit for now). Python's
memory consumption is an oft-heard complaint for folks developing for those
platforms.
Hi Philip,

On Mon, Dec 10, 2012 at 5:16 PM, PJ Eby <p...@telecommunity.com> wrote:
> On the other hand, this would also make a fast ordered dictionary
> subclass possible, just by not using the free list for additions,
> combined with periodic compaction before adds or after deletes.

Technically, I could see Python switching to ordered dictionaries
everywhere. Raymond's insight suddenly makes it easy for CPython and
PyPy, and at least Jython could use the LinkedHashMap class (although
this would need checking with the Jython guys). I'd vaguely argue that
dictionary order is one of the few non-reproducible factors in a Python
program, so making it deterministic might be a good thing. But only
vaguely --- maybe I got far too used to random orders over time...
On the other hand, every lookup and collision check needs an additional
multiplication, addition and pointer dereference:

entry = entries_baseaddr + sizeof(PyDictKeyEntry) * idx
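In the pure-Python model, a lookup with that extra indirection looks roughly
like this (a sketch only: plain linear probing, a power-of-two table that is
never completely full, and illustrative names):

def find_slot(indices, entries, key, hashvalue):
    mask = len(indices) - 1
    i = hashvalue & mask
    while True:
        idx = indices[i]                  # first fetch: sparse index table
        if idx is None:
            return i                      # unused slot: key is absent
        h, k, _value = entries[idx]       # second fetch: dense entry array
        if h == hashvalue and k == key:   # (in C: base + sizeof(entry) * idx)
            return i
        i = (i + 1) & mask                # keep probing on collision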
Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.
What minimum size and resizing factor do you propose for the entries array?
Without numbers it's hard to say for certain, but the advantage to
keeping ordered dictionaries a distinct type is that the standard
dictionary type can then get that extra bit of speed in exchange for
dropping the ordering requirement.
Another approach is to pre-allocate the entries array to its two-thirds
maximum. (This is simple and fast but gives the smallest space savings.)
What do you mean by maximum?
On 10.12.12 05:30, Raymond Hettinger wrote:
> On Dec 9, 2012, at 10:03 PM, MRAB <pyt...@mrabarnett.plus.com> wrote:
>> What happens when a key is deleted? Will that leave an empty slot in
>> the entry array?
> Yes.  See the __delitem__() method in the pure python implementation
> at http://code.activestate.com/recipes/578375

You can move the last entry into the freed slot. This preserves the
compactness of the entries array and simplifies and speeds up iteration
and some other operations.
def __delitem__(self, key, hashvalue=None):
    if hashvalue is None:
        hashvalue = hash(key)
    found, i = self._lookup(key, hashvalue)
    if found:
        index = self.indices[i]
        self.indices[i] = DUMMY
        self.size -= 1
        size = self.size          # position of the last live entry
        if index != size:
            # Move the last entry into the freed slot so the entry
            # arrays stay contiguous.
            lasthash = self.hashlist[size]
            lastkey = self.keylist[size]
            found, j = self._lookup(lastkey, lasthash)
            assert found
            assert i != j
            self.indices[j] = index
            self.hashlist[index] = lasthash
            self.keylist[index] = lastkey
            self.valuelist[index] = self.valuelist[size]
        # Clear the now-unused last slot.
        self.hashlist[size] = UNUSED
        self.keylist[size] = UNUSED
        self.valuelist[size] = UNUSED
    else:
        raise KeyError(key)
An indirect function call is technically a branch, as seen from the CPU
(and not necessarily a very predictable one, although recent Intel
CPUs are said to be quite good at that).
Regards
Antoine.
Why would you want to avoid that?
If we want to allocate the dict's data as a single memory block (which
saves a bit in memory consumption and also makes dict allocations
faster), we need to resize both arrays at the same time.
Regards
Antoine.
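In pure-Python terms, the combined resize amounts to something like the sketch
below (illustrative only; the real code would reallocate a single block in C,
copy the dense entries into it, and rebuild the sparse index table):

def rebuild_indices(entries, new_size):
    # new_size is a power of two; entries stay in insertion order.
    indices = [None] * new_size
    mask = new_size - 1
    for idx, (hashvalue, _key, _value) in enumerate(entries):
        i = hashvalue & mask
        while indices[i] is not None:     # linear probing for a free slot
            i = (i + 1) & mask
        indices[i] = idx
    return indices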
OTOH changing certain dictionaries in IronPython (such as keyword args) to be
ordered would certainly be possible. Personally I just wouldn't want to see it
be the default as that seems like unnecessary overhead when the specialized
class exists.
On Dec 10, 2012, at 2:48 AM, Christian Heimes <chri...@python.org> wrote:

> On the other hand, every lookup and collision check needs an
> additional multiplication, addition and pointer dereference:
>
>    entry = entries_baseaddr + sizeof(PyDictKeyEntry) * idx
Currently, the dict implementation allows alternative lookup
functions based on whether the keys are all strings.
The choice of lookup function is stored in a function pointer.
That lets each lookup use the currently active lookup
function without having to make any computations or branches.
An indirect function call is technically a branch, as seen from the CPU
(and not necessarily a very predictable one, although recent Intel
CPUs are said to be quite good at that).
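A pure-Python analogue of the stored-lookup-function idea Raymond describes
above (illustrative only; the real mechanism is a C function pointer inside
the dict, and the class and method names here are made up):

class DictSketch:
    def __init__(self):
        self._lookup = self._lookup_unicode   # start on the string-only path

    def _lookup_unicode(self, key, hashvalue):
        if not isinstance(key, str):
            # First non-string key: swap the stored function once, for good.
            self._lookup = self._lookup_general
            return self._lookup(key, hashvalue)
        return ('unicode path', key, hashvalue)

    def _lookup_general(self, key, hashvalue):
        return ('general path', key, hashvalue)

d = DictSketch()
print(d._lookup('spam', hash('spam')))   # unicode path
print(d._lookup(42, hash(42)))           # switches to the general path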
Hello everyone.
Thanks Raymond for writing down a pure Python version ;-)
I did an initial port to RPython for experiments. The results (on
large dicts only) are inconclusive - it's either a bit faster or a bit
slower, depending what exactly you do. There is a possibility I messed
something up too (there is a branch rdict-experiments in PyPy, in a
very sorry state though).
One effect we did not think about is that besides extra indirection,
there is an issue with data locality - having to look in two different
large lists seems to be a hit.
The memory part is also why I am interested in this approach.
Another thing has been bothering me: the fact that with the default
implementation, the small table is only ever populated up to a certain
percentage, I can't recall exactly, perhaps 50%.
Since the small table is small by definition, I think it ought to be worth
investigating whether we cannot allow it to fill to 100% before growing, even
if it costs some collisions. A linear lookup in a handful of slots can't be
that much of a bother; it is only with larger numbers of entries that the O(1)
property starts to matter.
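The trade-off in one small sketch (a hypothetical helper with illustrative
thresholds; the real constants live in the C implementation and have changed
between versions):

def should_grow(used, table_size, allow_full_small_table=False):
    if allow_full_small_table:
        return used >= table_size          # let the small table fill to 100%
    return used * 3 >= table_size * 2      # conventional ~2/3 load-factor cap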
If you're very interested in the memory footprint you should do what
PyPy does. It gives you no speed benefits without the JIT, but it
makes all the instances behave as if they had slots. The trick
is to keep a dictionary mapping name -> index into a list on the class
(you go through a hierarchy of such dicts as you add or remove
attributes) and a list of attribute values on the instance.

We call it maps, V8 calls it hidden classes; it's well documented on
the PyPy blog.
Cheers,
fijal
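For readers who haven't seen the technique, a toy sketch of the idea (names
are illustrative; PyPy's actual machinery is considerably more elaborate):

class Map:
    def __init__(self, attrs=()):
        self.positions = {name: i for i, name in enumerate(attrs)}
        self.transitions = {}   # attribute name -> the Map reached by adding it

    def with_attr(self, name):
        if name in self.positions:
            return self
        if name not in self.transitions:
            ordered = sorted(self.positions, key=self.positions.get)
            self.transitions[name] = Map(tuple(ordered) + (name,))
        return self.transitions[name]

class Instance:
    def __init__(self, map_):
        self.map = map_
        self.storage = []        # flat list of attribute values, no per-instance dict

    def set(self, name, value):
        new_map = self.map.with_attr(name)
        if new_map is not self.map:
            self.map = new_map
            self.storage.append(value)     # a new attribute always lands at the end
        else:
            self.storage[self.map.positions[name]] = value

    def get(self, name):
        return self.storage[self.map.positions[name]]

root = Map()
p = Instance(root)
p.set('x', 3)
p.set('y', 4)
print(p.get('x'), p.get('y'))   # 3 4 -- instances built the same way share Maps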
On 07/01/13 10:55, Maciej Fijalkowski wrote:
> On Sun, Jan 6, 2013 at 11:40 PM, Kristján Valur Jónsson
> <kris...@ccpgames.com> wrote:
>> The memory part is also why I am interested in this approach.
>> Another thing has been bothering me: the fact that with the default
>> implementation, the small table is only ever populated up to a certain
>> percentage, I can't recall exactly, perhaps 50%. Since the small table is
>> small by definition, I think it ought to be worth investigating whether we
>> cannot allow it to fill to 100% before growing, even if it costs some
>> collisions. A linear lookup in a handful of slots can't be that much of a
>> bother; it is only with larger numbers of entries that the O(1) property
>> starts to matter.
>> K
In 3.3 dicts no longer have a small table.
>
> If you're very interested in the memory footprint you should do what
> PyPy does. It gives you no speed benefits without the JIT, but it
> makes all the instances behave as if they had slots. The trick
> is to keep a dictionary mapping name -> index into a list on the class
> (you go through a hierarchy of such dicts as you add or remove
> attributes) and a list of attribute values on the instance.
>
> We call it maps, V8 calls it hidden classes; it's well documented on
> the PyPy blog.
They are called shared keys in CPython 3.3 :)
shared keys don't go to the same lengths, do they?
What is the current status of this discussion?
I'd like to know whether it is being considered as an alternative implementation.
There is also a discussion on python-ideas right now where this
alternative is mentioned, and I think that especially for small dicts
such as **kwargs, it could be a cheap way to introduce ordering.
On 1 January 2015 at 14:52, Maciej Fijalkowski <fij...@gmail.com> wrote:
> PS. I wonder who came up with the idea first, PHP or rhettinger and
> who implemented it first (I'm pretty sure it was used in hippy before
> it was used in Zend PHP)
We'd need to look at that question in more detail, but a quick look
turned up this Java code from 2012:

which implements almost exactly Raymond's original idea. (It has
a twist because Java doesn't support arrays of (int, int, Object,
Object), and instead encodes it as one array of longs and one array of
Objects. It also uses a chain of buckets instead of open addressing.)
A bientôt,
Armin.