PEP 372 -- Adding an ordered directory to collections

2 views
Skip to first unread message

Armin Ronacher

unread,
Jun 16, 2008, 4:37:44 AM6/16/08
to
Abstract
========

This PEP proposes an ordered dictionary as a new data structure for
the ``collections`` module, called "odict" in this PEP for short. The
proposed API incorporates the experiences gained from working with
similar implementations that exist in various real-world applications
and other programming languages.


Rationale
=========

In current Python versions, the widely used built-in dict type does
not specify an order for the key/value pairs stored. This makes it
hard to use dictionaries as data storage for some specific use cases.

Some dynamic programming languages like PHP and Ruby 1.9 guarantee a
certain order on iteration. In those languages, and existing Python
ordered-dict implementations, the ordering of items is defined by the
time of insertion of the key. New keys are appended at the end, keys
that are overwritten and not moved.

The following example shows the behavior for simple assignments:

>>> d = odict()
>>> d['parrot'] = 'dead'
>>> d['penguin'] = 'exploded'
>>> d.items()
[('parrot', 'dead'), ('penguin', 'exploded')]

That the ordering is preserved makes an odict useful for a couple of
situations:

- XML/HTML processing libraries currently drop the ordering of
attributes, use a list instead of a dict which makes filtering
cumbersome, or implement their own ordered dictionary. This affects
ElementTree, html5lib, Genshi and many more libraries.

- There are many ordererd dict implementations in various libraries
and applications, most of them subtly incompatible with each other.
Furthermore, subclassing dict is a non-trivial task and many
implementations don't override all the methods properly which can
lead to unexpected results.

Additionally, many ordered dicts are implemented in an inefficient
way, making many operations more complex then they have to be.

- PEP 3115 allows metaclasses to change the mapping object used for
the class body. An ordered dict could be used to create ordered
member declarations similar to C structs. This could be useful, for
example, for future ``ctypes`` releases as well as ORMs that define
database tables as classes, like the one the Django framework ships.
Django currently uses an ugly hack to restore the ordering of
members in database models.

- Code ported from other programming languages such as PHP often
depends on a ordered dict. Having an implementation of an
ordering-preserving dictionary in the standard library could ease
the transition and improve the compatibility of different libraries.


Ordered Dict API
================

The ordered dict API would be mostly compatible with dict and existing
ordered dicts. (Note: this PEP refers to the Python 2.x dictionary
API; the transfer to the 3.x API is trivial.)

The constructor and ``update()`` both accept iterables of tuples as
well as mappings like a dict does. The ordering however is preserved
for the first case:

>>> d = odict([('a', 'b'), ('c', 'd')])
>>> d.update({'foo': 'bar'})
>>> d
collections.odict([('a', 'b'), ('c', 'd'), ('foo', 'bar')])

If ordered dicts are updated from regular dicts, the ordering of new
keys is of course undefined again unless ``sort()`` is called.

All iteration methods as well as ``keys()``, ``values()`` and
``items()`` return the values ordered by the the time the key-value
pair was inserted:

>>> d['spam'] = 'eggs'
>>> d.keys()
['a', 'c', 'foo', 'spam']
>>> d.values()
['b', 'd', 'bar', 'eggs']
>>> d.items()
[('a', 'b'), ('c', 'd'), ('foo', 'bar'), ('spam', 'eggs')]

New methods not available on dict:

``odict.byindex(index)``

Index-based lookup is supported by ``byindex()`` which returns
the key/value pair for an index, that is, the "position" of a
key in the ordered dict. 0 is the first key/value pair, -1
the last.

>>> d.byindex(2)
('foo', 'bar')

``odict.sort(cmp=None, key=None, reverse=False)``

Sorts the odict in place by cmp or key. This works exactly
like ``list.sort()``, but the comparison functions are passed
a key/value tuple, not only the value.

>>> d = odict([(42, 1), (1, 4), (23, 7)])
>>> d.sort()
>>> d
collections.odict([(1, 4), (23, 7), (42, 1)])

``odict.reverse()``

Reverses the odict in place.

``odict.__reverse__()``

Supports reverse iteration by key.


Questions and Answers
=====================

What happens if an existing key is reassigned?

The key is not moved but assigned a new value in place. This is
consistent with existing implementations and allows subclasses to
change the behavior easily::

class movingcollections.odict):
def __setitem__(self, key, value):
self.pop(key, None)
odict.__setitem__(self, key, value)

What happens if keys appear multiple times in the list passed to the
constructor?

The same as for regular dicts: The latter item overrides the
former. This has the side-effect that the position of the first
key is used because the key is actually overwritten:

>>> odict([('a', 1), ('b', 2), ('a', 3)])
collections.odict([('a', 3), ('b', 2)])

This behavior is consistent with existing implementations in
Python, the PHP array and the hashmap in Ruby 1.9.

Why is there no ``odict.insert()``?

There are few situations where you really want to insert a key at
an specified index. To avoid API complication, the proposed
solution for this situation is creating a list of items,
manipulating that and converting it back into an odict:

>>> d = odict([('a', 42), ('b', 23), ('c', 19)])
>>> l = d.items()
>>> l.insert(1, ('x', 0))
>>> odict(l)
collections.odict([('a', 42), ('x', 0), ('b', 23), ('c', 19)])


Example Implementation
======================

A poorly performing example implementation of the odict written in
Python is available:

`odict.py <http://dev.pocoo.org/hg/sandbox/raw-file/tip/
odict.py>`_

The version for ``collections`` should be implemented in C and use a
linked list internally.

Other implementations of ordered dicts in various Python projects or
standalone libraries, that inspired the API proposed here, are:

- `odict in Babel`_
- `OrderedDict in Django`_
- `The odict module`_
- `ordereddict`_ (a C implementation of the odict module)
- `StableDict`_
- `Armin Rigo's OrderedDict`_


.. _odict in Babel: http://babel.edgewall.org/browser/trunk/babel/util.py?rev=374#L178
.. _OrderedDict in Django:
http://code.djangoproject.com/browser/django/trunk/django/utils/datastructures.py?rev=7140#L53
.. _The odict module: http://www.voidspace.org.uk/python/odict.html
.. _ordereddict: http://www.xs4all.nl/~anthon/Python/ordereddict/
.. _StableDict: http://pypi.python.org/pypi/StableDict/0.2
.. _Armin Rigo's OrderedDict: http://codespeak.net/svn/user/arigo/hack/pyfuse/OrderedDict.py


Future Directions
=================

With the availability of an ordered dict in the standard library,
other libraries may take advantage of that. For example, ElementTree
could return odicts in the future that retain the attribute ordering
of the source file.


Copyright
=========

This document has been placed in the public domain.

Michele Simionato

unread,
Jun 16, 2008, 5:01:13 AM6/16/08
to
On Jun 16, 10:37 am, Armin Ronacher <armin.ronac...@active-4.com>
wrote:

> Abstract
> ========
>
> This PEP proposes an ordered dictionary as a new data structure for
> the ``collections`` module, called "odict" in this PEP for short.  The
> proposed API incorporates the experiences gained from working with
> similar implementations that exist in various real-world applications
> and other programming languages.

+1, I have been waiting for an odict in the library for at least 5
years.


Michele Simionato

Ben Finney

unread,
Jun 16, 2008, 5:26:12 AM6/16/08
to
Armin Ronacher <armin.r...@active-4.com> writes:

> This PEP proposes an ordered dictionary as a new data structure for
> the ``collections`` module, called "odict" in this PEP for short.

A welcome addition.

Since we're not proposing a built-in type, could we choose a name that
is more explicit, and consistent with the types in 'collections'
already. I'd prefer one of the following:

collections.ordereddict
collections.ordered_dict


Other (minor) comments:

> The key is not moved but assigned a new value in place. This is
> consistent with existing implementations and allows subclasses to
> change the behavior easily::
>
> class movingcollections.odict):

Something is missing in the above line of code. It's invalid as-is,
but I don't know which of the many possible replacements is intended.

> Why is there no ``odict.insert()``?

Thank you for this design decision; I agree entirely that the correct
solution is to resequence and create a new ordered dict from the
result.

> A poorly performing example implementation of the odict written in
> Python is available:
>
> `odict.py <http://dev.pocoo.org/hg/sandbox/raw-file/tip/
> odict.py>`_

This seems to be a victim of errant line-breaking, resulting in the
wrong URL.

--
\ “The reason we come up with new versions is not to fix bugs. |
`\ It's absolutely not.” —Bill Gates, 1995-10-23 |
_o__) |
Ben Finney

Wolfgang Grafen

unread,
Jun 16, 2008, 5:25:50 AM6/16/08
to
Armin Ronacher schrieb:

> Other implementations of ordered dicts in various Python projects or
> standalone libraries, that inspired the API proposed here, are:
>
> - `odict in Babel`_
> - `OrderedDict in Django`_
> - `The odict module`_
> - `ordereddict`_ (a C implementation of the odict module)
> - `StableDict`_
> - `Armin Rigo's OrderedDict`_
>
>
> .. _odict in Babel: http://babel.edgewall.org/browser/trunk/babel/util.py?rev=374#L178
> .. _OrderedDict in Django:
> http://code.djangoproject.com/browser/django/trunk/django/utils/datastructures.py?rev=7140#L53
> .. _The odict module: http://www.voidspace.org.uk/python/odict.html
> .. _ordereddict: http://www.xs4all.nl/~anthon/Python/ordereddict/
> .. _StableDict: http://pypi.python.org/pypi/StableDict/0.2
> .. _Armin Rigo's OrderedDict: http://codespeak.net/svn/user/arigo/hack/pyfuse/OrderedDict.py
I want add to this list my seqdict package, maybe the first
implementation of an ordered dict in Python?
http://home.arcor.de/wolfgang.grafen/Python/Modules/Modules.html

I have never seen a real world dictionary which wasn't in order, so I
ever felt the Python dictionary was not complete. I support the idea of
an ordered dictionary in Python.

Best regards

Wolfgang

bearoph...@lycos.com

unread,
Jun 16, 2008, 6:24:16 AM6/16/08
to
Oh, very good, better late than never.
This is my pure Python version, it performs get, set and del
operations too in O(1):
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498195
Working with C such data structure becomes much faster, because it can
use true pointers.

Then another data structure that may be named SortedDict can be
created, that keeps items in their natural order (instead of the
insertion order).

Bye,
bearophile

Jeroen Ruigrok van der Werven

unread,
Jun 16, 2008, 8:23:36 AM6/16/08
to Armin Ronacher, pytho...@python.org
-On [20080616 10:41], Armin Ronacher (armin.r...@active-4.com) wrote:
>This PEP proposes an ordered dictionary as a new data structure for
>the ``collections`` module, called "odict" in this PEP for short.

I fully support adding this. In my work I have had to create this datatype a
few times already.

The suggested design choices seem logical to me.

+1

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Of all that is to come, the Dream has just begun...

Calvin Spealman

unread,
Jun 16, 2008, 9:20:35 AM6/16/08
to Armin Ronacher, pytho...@python.org
This gets my +1, for what its worth.

I don't really see a good reason not to include the insert() method,
however. I don't see that it would complicate things much, if at all.

>>> d = odict([('a', 42), ('b', 23)])
>>> d.insert(1, ('c', 19))
>>> d
collections.odict([('a', 42), ('c', 19), ('b', 23)])

> --
> http://mail.python.org/mailman/listinfo/python-list

Paul McGuire

unread,
Jun 16, 2008, 9:46:59 AM6/16/08
to
1. With the current dict, the following code

a = { "A" : 1, "B" : 2 }
b = { "B" : 2, "A" : 1 }

a==b evaluates to True. I assume that if these were odicts, this
would evaluate to False.

2. Will odict.byindex support slicing?

3. Will odict inherit from dict?

4. The current dict API (as of Python 2.5) is given by dir as:

['__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__',
'__doc__', '__eq__', '__ge__', '__getattribute__', '__getitem__',
'__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__',
'__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setitem__', '__str__', 'clear', 'copy',
'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys',
'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update',
'values']

If I read your PEP correctly, you propose adding:

byindex
sort
reverse
__reverse__ (should be '__reversed__', to match list's reverse
iterator)

(This isn't really a question, your PEP was not completely explicit
about what odict's API would be, so I thought enumerating what dict
does currently might illuminate some other unresolved aspects of
odict. For instance, since odict is somewhat list-like in its notion
of keys and values, will pop support list-like popping as well as what
dict currently provides? Will you need a 'popbyindex' for the same
reason you need 'byindex', so that you have unambiguous lookup of
items by index vs. by key. Perhaps doing dir(list) will help you to
see other items to resolve/clarify.)

5. The more I think and write about this, the more struck I am at the
similarity of odict and the ParseResults class in pyparsing. For
instance, in ParseResults, there is also support for dict-style and
list-style item referencing, and I chose to restrict some cases so
that using [] notation would not be ambiguous. You might want to add
pyparsing.ParseResults as another reference of current "odicts in the
wild" (although ParseResults implements quite a bit of additional
behavior that would not be required or necessarily appropriate for
odict).

I vote +0 on this PEP - I've never personally needed an odict, but I
can see how some of the XML and ORM coding would be simplified by one.

-- Paul

(I could go either way on the name 'odict' or 'ordereddict'. Given
the desire for this to evenutally become a built-in, keep the name
short. On the other hand, collections already has 'defaultdict', so
is 'ordereddict' so bad?)

Paul McGuire

unread,
Jun 16, 2008, 10:54:19 AM6/16/08
to Shane Geiger, pytho...@python.org
:)

Yes, I thought about that even as I was writing that post. But I also said,


"ParseResults implements quite a bit of additional behavior that would not

be required or necessarily appropriate for odict." Even if odict existed,
I think I would have needed ParseResults anyway (but using an odict
internally might have simplified things for me, instead of the combined list
and dict that I have now).

-- Paul


-----Original Message-----
From: Shane Geiger [mailto:sge...@ncee.net]
Sent: Monday, June 16, 2008 9:20 AM
To: Paul McGuire
Cc: pytho...@python.org
Subject: Re: PEP 372 -- Adding an ordered directory to collections

Paul,

You seem to be contradicting yourself. You may have never needed an odict,
yet you seem to have implemented one on your own. Maybe you needed one but
did not realize it?

Shane


> 5. The more I think and write about this, the more struck I am at the
> similarity of odict and the ParseResults class in pyparsing. For
> instance, in ParseResults, there is also support for dict-style and
> list-style item referencing, and I chose to restrict some cases so
> that using [] notation would not be ambiguous. You might want to add
> pyparsing.ParseResults as another reference of current "odicts in the
> wild" (although ParseResults implements quite a bit of additional
> behavior that would not be required or necessarily appropriate for
> odict).
>
> I vote +0 on this PEP - I've never personally needed an odict, but I
> can see how some of the XML and ORM coding would be simplified by one.
>
>

--
Shane Geiger
IT Director
National Council on Economic Education
sge...@ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

ivarne...@gmail.com

unread,
Jun 16, 2008, 12:54:42 PM6/16/08
to
I'm very happy to see this PEP. I have needed to use ordered
dictionaries many times, and this has always felt to me like a
surprising omission from Python.

bruno.des...@gmail.com

unread,
Jun 16, 2008, 2:41:44 PM6/16/08
to
On 16 juin, 10:37, Armin Ronacher <armin.ronac...@active-4.com> wrote:
> Abstract
> ========
>
> This PEP proposes an ordered dictionary as a new data structure for
> the ``collections`` module, called "odict" in this PEP for short. The
> proposed API incorporates the experiences gained from working with
> similar implementations that exist in various real-world applications
> and other programming languages.
>
(snip)

+1

"insertion-ordered" dicts are something I often need, and while there
are usable workarounds, they either require boilerplate code,
introduce external dependancies to (possibly unoptimised) code or make
you reinvent the (square) wheel, all of which is (IMHO) below average
Python's standard regarding data structures.

"Martin v. Löwis"

unread,
Jun 16, 2008, 6:24:12 PM6/16/08
to
> ``odict.byindex(index)``
>
> Index-based lookup is supported by ``byindex()`` which returns
> the key/value pair for an index, that is, the "position" of a
> key in the ordered dict. 0 is the first key/value pair, -1
> the last.
>
> >>> d.byindex(2)
> ('foo', 'bar')

For this API, I think it's important to make some performance
guarantees. It seems fairly difficult to make byindex O(1), and
simultaneously also make insertion/deletion better than O(n).

IOW, the PEP should somehow specify which operations are efficient,
and which ones aren't.

Regards,
Martin

bearoph...@lycos.com

unread,
Jun 16, 2008, 7:12:03 PM6/16/08
to
Martin v. L.:

> For this API, I think it's important to make some performance guarantees.

I may appreciate them for all Python collections :-)


> It seems fairly difficult to make byindex O(1), and
> simultaneously also make insertion/deletion better than O(n).

It may be possible to make both of them O(log n), but I may appreciate
byindex O(n) and insertion/deletion O(1).

Bye,
bearophile

Paul McGuire

unread,
Jun 16, 2008, 11:23:13 PM6/16/08
to

My guess is that the two main memory allocate/deallocate cases are 1)
appending a new item to the end, and 2) GC'ing the entire data
structure. I would optimize these 2 at the expense of all others.

-- Paul

"Martin v. Löwis"

unread,
Jun 17, 2008, 1:27:29 AM6/17/08
to
>> For this API, I think it's important to make some performance guarantees.
>
> I may appreciate them for all Python collections :-)

See

http://wiki.python.org/moin/TimeComplexity

>> It seems fairly difficult to make byindex O(1), and
>> simultaneously also make insertion/deletion better than O(n).
>
> It may be possible to make both of them O(log n), but I may appreciate
> byindex O(n) and insertion/deletion O(1).

If so, what's the advantage of using that method over d.items[n]?

Regards,
Martin

"Martin v. Löwis"

unread,
Jun 17, 2008, 1:28:45 AM6/17/08
to Paul McGuire
> My guess is that the two main memory allocate/deallocate cases are 1)
> appending a new item to the end, and 2) GC'ing the entire data
> structure. I would optimize these 2 at the expense of all others.

Does that include dictionary lookups?

Regards,
Martin

Paul McGuire

unread,
Jun 17, 2008, 2:20:51 AM6/17/08
to


Well, I did (try to) qualify my guess as pertaining specifically to
memory allocate/deallocate cases, the other topics in the nearby posts
were talking about updates to the odict, so I hope I can be forgiven
this oversight. But you're right, the keyed access is one of the main
reasons this is being done with an odict instead of just a list of
tuples.

The point I was trying to make was that sometimes, the GC case occurs
far more frequently than other update methods, yet is overlooked
because it happens invisibly to most Python users.

I worked on a project a while ago where there was a huge performance
penalty in releasing a list structure, because each node was freed
individually, causing surrounding freelist entries to be updated for
every node. In typical worst-case scenario fashion, a second list was
built at the same time as the first, and the default system memory
allocator interleaved the list nodes. The truly frustrating part was
that at this point in the program, we were trying to free *all* the
nodes in *both* lists, and would have preferred to just release the
whole chunk of memory, if it had just been allocated in a large chunk
instead of a node at a time. I implemented a zone-based memory
allocator, so that nodes were allocated within a memory zone, but when
we were done, we just released the entire zone, with tremendous
improvement in system performance.

But what kind of keyed access is likely to be used on an odict? For
those cases where all of the items are desired, then I would recommend
that the odict iterator be preferred, for this type of retrieval:

for k,v in odictvar.iteritems():
# do something with key k and value v

and discourage this usage:

for k in odictvar.keys(): # or iterkeys()
# do something with key k and value odictvar[k]

How about this as a simple first approach? Let's assume that the
nominal usage pattern for an odict is 1) create all entries in the
odict, probably using append, 2) access some or all of the entries,
either by iterating over the keys or values, or by keyed access to
specific values, and 3) delete the odict. Have the odict keep an
internal dict representing the keyed lookups, and have this internal
dict set to null whenever the odict is updated. When the odict is
accessed by specific key, the internal dict is used - if it null, it
is built using dict(odict.iteritems()). In the nominal usage, this
dict will only be built once, since the keyed accesses wont begin
until after all of the key-value pairs have been added to the odict.

Or as a first-first approach (to avoid premature optimization), just
implement the odict as a list of tuples, and do linear search for
matching key if accessed by key.

I would bias any performance choices about keyed access toward cases
where the queried key exists in the odict - I think that in those
cases where odicts are used, such as ORMs and XML, the keys for a
particular odict are known and expected to exist. Those applications
where the keys are not known are likely to be meta-type apps, like
debuggers and introspection GUIs. I contend that the meat-and-
potatoes production apps will be those like database queries - when I
get an odict back after querying an EMPLOYEES table, I really should
reasonably expect odictvar["EMPNO"] to exist.

And what would be the conditions of an abusive form of odict? How
about an application log kept in an odict, keyed by timestamp? This
could have many 1000's of entries, and yet, I would guess that an
odict would be the wrong data structure for this, and that a list of
tuples or LogMessage objects would be more appropriate. But someone
is likely to do it - if they do, what will happen? Perhaps trying to
anticipate "abusive" or degenerate uses of an odict might shed some
light on ways to avoid, discourage, or maybe even accommodate these
uses in odict.

Well, this has become something of a rant, and a speculative one at
that, but I think the "delete the entire data structure" memory case
should be given reasonably high design attention. (And maybe this is
already the norm in Python development - I'm really quite ignorant of
this process, not being in the Python development circle.)

Always nice to hear from you Martin - cheers!

-- Paul

bearoph...@lycos.com

unread,
Jun 17, 2008, 7:21:20 AM6/17/08
to
Martin v. L.:
> http://wiki.python.org/moin/TimeComplexity

Thank you, I think that's not a list of guarantees, while a list of
how things are now in CPython.


> If so, what's the advantage of using that method over d.items[n]?

I think I have lost the thread here, sorry. So I explain again what I
mean. I think for this data structure it's important to keep all the
normal dict operations at the same speed. If you use a C
implementation vaguely similar to my pure python recipe you can
perform the del in O(1) too, because pairs are joined in (double)
linked list. But such data structure is O(n) to find the n-th item
inserted into the sequence.

Bye,
bearophile

"Martin v. Löwis"

unread,
Jun 17, 2008, 6:34:32 PM6/17/08
to
> I think I have lost the thread here, sorry. So I explain again what I
> mean. I think for this data structure it's important to keep all the
> normal dict operations at the same speed. If you use a C
> implementation vaguely similar to my pure python recipe you can
> perform the del in O(1) too, because pairs are joined in (double)
> linked list. But such data structure is O(n) to find the n-th item
> inserted into the sequence.

Right. So byindex(n) would be O(n) then, right? If so, what's the
purpose of having that method in the first place?

The PEP doesn't give a rationale, but just proposes that the method
be there. My guess is that it includes it for performance reasons.
However, I think the PEP (author) is misguided in assuming that
making byindex() a method of odict, you get better performance than
directly doing .items()[n] - which, as you say, you won't.

Regards,
Martin

"Martin v. Löwis"

unread,
Jun 17, 2008, 6:42:56 PM6/17/08
to
> Well, this has become something of a rant,

Indeed - and I was only asking about .byindex(n) :-)

I don't know why that method is included in the PEP. Speculating
myself, I assume that the PEP author wants it to be O(1). As
bearophile explains, that's not possible/not a good idea.

As for releasing the odict - that will likely have mostly the
same cost as releasing a regular dictionary, except that you
might have to look at each key twice (once in the regular dict,
once in the structure preserving the order). It might be
that the odict uses a linked-list style approach, which indeed
would nto be so good from a allocation/deallocation overhead
point of view. With the approach bearophile suggests (i.e.
each slot in the dict has a reference to the predecessor and
successor key also), a pure Python implementation would normally
allocate additional memory per key, whereas a pure C implementation
wouldn't (adding the space for the prev/next pointer right into
the slot's struct definition.

Regards,
Martin

bearoph...@lycos.com

unread,
Jun 18, 2008, 6:35:06 AM6/18/08
to
Martin v. L.:

> However, I think the PEP (author) is misguided in assuming that
> making byindex() a method of odict, you get better performance than
> directly doing .items()[n] - which, as you say, you won't.

In Python 2.5 .items()[n] creates a whole list, and then takes one
item of such list.
An O(n) byindex() just scans the items to return the n-th.
So while being both O(n) in time, the .items()[n] may allocate quite
more memory.

Bye,
bearophile

Armin Ronacher

unread,
Jun 18, 2008, 6:47:44 AM6/18/08
to pytho...@python.org
Martin v. Löwis <martin <at> v.loewis.de> writes:

>
> > I think I have lost the thread here, sorry. So I explain again what I
> > mean. I think for this data structure it's important to keep all the
> > normal dict operations at the same speed. If you use a C
> > implementation vaguely similar to my pure python recipe you can
> > perform the del in O(1) too, because pairs are joined in (double)
> > linked list. But such data structure is O(n) to find the n-th item
> > inserted into the sequence.
>
> Right. So byindex(n) would be O(n) then, right? If so, what's the
> purpose of having that method in the first place?

What's the purpose of having list.insert?

> The PEP doesn't give a rationale, but just proposes that the method
> be there. My guess is that it includes it for performance reasons.
> However, I think the PEP (author) is misguided in assuming that
> making byindex() a method of odict, you get better performance than
> directly doing .items()[n] - which, as you say, you won't.

Without byindex the only way to cherry pick an item is either doing
something like

i = od.iteritems()
for idx in xrange(offset):
value = idx.next()
return value

Or

return od.items()[offset]

One creates tons of unnecessary method calls, the other creates a full
blown list object just to throw it away later. Both less than optimal
solutions that can be implemented in a more efficient way on the C
layer where one only has to iterate over the linked list offset times
and return the item. And iteration for that linked list is most likely
something like "for (n = 0; n != offset; ++n) iter = iter->next".

Regards,
Armin

"Martin v. Löwis"

unread,
Jun 18, 2008, 4:25:29 PM6/18/08
to
> What's the purpose of having list.insert?

It's a convenience function: you don't have to write a loop to move all
items to a later index. Any reformulation of it is easy to get wrong,
and difficult to read.

> One creates tons of unnecessary method calls, the other creates a full
> blown list object just to throw it away later. Both less than optimal
> solutions that can be implemented in a more efficient way on the C
> layer where one only has to iterate over the linked list offset times
> and return the item. And iteration for that linked list is most likely
> something like "for (n = 0; n != offset; ++n) iter = iter->next".

Ok, so it is about performance, and intended to provide a speedup by
a constant factor (over the trivial reformulation without it).

What are the use cases for this function? I.e. in what specific
applications did you use it, or would you want to use it?

In these applications, could you instead also have used a (hypothetical)
function

def nth(iter, n):
while n:
iter.next()
n-=1
return iter.next()

instead?

Regards,
Martin

dbpo...@gmail.com

unread,
Jun 18, 2008, 5:58:04 PM6/18/08
to
Wow, I was completely wrong about sorted dicts and odicts.

On Jun 17, 4:21 am, bearophileH...@lycos.com wrote:
> mean. I think for this data structure it's important to keep all the
> normal dict operations at the same speed. If you use a C

Why keep the normal dict operations at the same speed? There is a
substantial cost this entails.

It seems that some odict solutions take up 4n words per entry in the
limit (ignoring the fixed percentage growth space for lists and
dicts). This is n words for _keys and 3n words for the underlying
dict. What about storing the key:value pairs as a pair of lists (maybe
interleaved)? Key lookup then becomes an O(n) operation, but the
storage requirements are reduced to 2n from 4n.

Here is a hypothetical (somewhat contrived) example: odicts are used
to store the attributes of XML elements in a medium-sized XML document
(say 250K). If the required munging does some surgery on elements 3-4
layers deep in a few places, then the number of key lookups may be
relatively small. (Assuming that very few odicts will have more than
30 entries, each individual lookup will be fairly quick as well). The
document has to be loaded into memory anyway, so the extra cost of key
lookups is more than compensated for by the substantial reduction in
the number of cache line misses.

The main concern is that doubling the size of the data structure in
order to turn key lookup into an O(1) operation may give worse
performance in the common case (this is a personal impression from
reading the use case list in the rationale), and a data structure like
this seems to be way off by itself in the time-space complexity
landscape.

In this case the odict no longer inherits from the dict.

Cheers,
David

bearoph...@lycos.com

unread,
Jun 18, 2008, 6:15:35 PM6/18/08
to
dbpoko...:

> Why keep the normal dict operations at the same speed? There is a
> substantial cost this entails.

I presume now we can create a list of possible odict usages, because I
think that despite everyone using it for different purposes, we may
find some main groups of its usage. I use odicts is situations where
an dict is nearly the right data structure, so keeping all operations
close to the time complexity of dicts has a purpose.


> but the storage requirements are reduced to 2n from 4n.

In Python 2.5 a dict(int:None) needs about 36.2 bytes/element. I am
suggesting to add 2 pointers, to create a linked list, so it probably
becomes (on 32 bit systems) about 44.2 bytes/pair.

Note that computer science is full of strange data structures, so
maybe a skip list can be used here, to increase some operation
timings, and reduce other ones... :-)

Bye,
bearophile

Matimus

unread,
Jun 18, 2008, 7:20:59 PM6/18/08
to
On Jun 16, 1:37 am, Armin Ronacher <armin.ronac...@active-4.com>
wrote:
>    http://code.djangoproject.com/browser/django/trunk/django/utils/datas...

> .. _The odict module:http://www.voidspace.org.uk/python/odict.html
> .. _ordereddict:http://www.xs4all.nl/~anthon/Python/ordereddict/
> .. _StableDict:http://pypi.python.org/pypi/StableDict/0.2
> .. _Armin Rigo's OrderedDict:http://codespeak.net/svn/user/arigo/hack/pyfuse/OrderedDict.py
>
> Future Directions
> =================
>
> With the availability of an ordered dict in the standard library,
> other libraries may take advantage of that.  For example, ElementTree
> could return odicts in the future that retain the attribute ordering
> of the source file.
>
> Copyright
> =========
>
> This document has been placed in the public domain.

In Python 3.0 dict.items(), dict.keys() and dict.values() return set-
like views of the dictionary. http://www.python.org/dev/peps/pep-3106/.
You should consider doing the same thing for odict. Of course, the
views would be list-like, not set-like.

At the very least, I would like to see a discussion about it.

Matt

dbpo...@gmail.com

unread,
Jun 18, 2008, 9:14:44 PM6/18/08
to
On Jun 18, 3:15 pm, bearophileH...@lycos.com wrote:

> In Python 2.5 a dict(int:None) needs about 36.2 bytes/element. I am
> suggesting to add 2 pointers, to create a linked list, so it probably
> becomes (on 32 bit systems) about 44.2 bytes/pair.

PyDictEntry is
typedef struct {
Py_ssize_t me_hash;
PyObject *me_key;
PyObject *me_value;
} PyDictEntry;

Which should be 12 bytes on a 32-bit machine. I thought the space for
growth factor for dicts was about 12% but it is really 100%. In any
case, a pair of lists will take up less space than a dict and a list.
Or the storage could be an array of PyDictEntrys (to cache the hash
values of the keys), an approach that is in some sense halfway between
the others.

There is one advantage of this last approach - I think the amount of
hacking on dictobject.c that would have to take place is minimal. In
fact it almost seems like you could get the desired result by setting
mp->ma_lookup to a new function (and keep most of the rest of the
methods as they are). This seems too easy though, so there might be a
catch.

David

Armin Ronacher

unread,
Jun 19, 2008, 7:47:38 AM6/19/08
to pytho...@python.org
Small Status update of the changes incorporated in the PEP:

- The PEP Title was fixed. Of course it's a dictionary not a
directory :-)

- A questions and answers section was added that covers some
of the questions raised here and on the original thread on
python-devel.

- Comparison behavor was documented

- a note on Python 3 support was added (about keys-views)


Regards,
Armin

bearoph...@lycos.com

unread,
Jun 19, 2008, 8:53:18 AM6/19/08
to
dbpoko...:

> Which should be 12 bytes on a 32-bit machine. I thought the space for
> growth factor for dicts was about 12% but it is really 100%.

(Please ignore the trailing ".2" in my number in my last post, such
precision is silly).
My memory value comes from experiments, I have created a little
program like this:

from memory import memory

def main(N):
m1 = memory()
print m1

d = {}
for i in xrange(N):
d[i] = None

m2 = memory()
print m2
print float((m2 - m1) * 1024) / N
main(20000000)

Where memory is a small module of mine that calls a little known
program that tells how much memory is used by the current Python
process. The results for that run n=20000000 are (first two numbers
are kilobytes, the third number is byte/pair):

1876
633932
32.3612672

It means to store 20_000_000 pairs it requires about 647_000_000
bytes, Python 2.5.2, on Win.

Bye,
bearophile

Duncan Booth

unread,
Jun 19, 2008, 9:09:47 AM6/19/08
to
bearoph...@lycos.com wrote:

> My memory value comes from experiments, I have created a little
> program like this:
>
> from memory import memory
>
> def main(N):
> m1 = memory()
> print m1
>
> d = {}
> for i in xrange(N):
> d[i] = None
>
> m2 = memory()
> print m2
> print float((m2 - m1) * 1024) / N
> main(20000000)
>
> Where memory is a small module of mine that calls a little known
> program that tells how much memory is used by the current Python
> process. The results for that run n=20000000 are (first two numbers
> are kilobytes, the third number is byte/pair):
>
> 1876
> 633932
> 32.3612672
>
> It means to store 20_000_000 pairs it requires about 647_000_000
> bytes, Python 2.5.2, on Win.
>

What do you get if you change the output to exclude the integers from
the memory calculation so you are only looking at the dictionary
elements themselves? e.g.

def main(N):
keys = range(N)


m1 = memory()
print m1

d = {}
for i in keys:
d[i] = None

m2 = memory()
print m2
print float((m2 - m1) * 1024) / N
main(20000000)


--
Duncan Booth http://kupuguy.blogspot.com

bearoph...@lycos.com

unread,
Jun 19, 2008, 9:53:20 AM6/19/08
to
Duncan Booth:

> What do you get if you change the output to exclude the integers from
> the memory calculation so you are only looking at the dictionary
> elements themselves? e.g.

The results:

318512 (kbytes)
712124 (kbytes)
20.1529344 (bytes)

Bye,
bearophile

Reply all
Reply to author
Forward
0 new messages