sorteddict PEP proposal [started off as orderedict]

Mark Summerfield

unread,

Sep 25, 2007, 4:53:22 AM9/25/07

to

Hi,

Below is a PEP proposal for a sorteddict. It arises out of a
discussion on this list that began a few weeks ago with the subject of
"An ordered dictionary for the Python library?", and a similar one on
the P3K mailing list with the subject "ordered dict for p3k
collections?".

If there is positive feedback I will submit the PEP to the reviewers,
so if you think it is a good idea please say so. (I'm sure that if you
_don't_ like it you'll tell me anyway:-)

PEP: XXX
Title: Sorted Dictionary
Version: $Revision$
Last-Modified: $Date$
Author: Mark Summerfield
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 25-Sep-2007
Python-Version: 2.6
Post-History:

Abstract

This PEP proposes the addition of a sorted dictionary class to the
standard library's collections module.

Rationale

When handling collections of key-value data, it is often
convenient to
access the data in key order. One way to do this is to use an
unsorted
data structure (e.g., a Python dict), and then sort the keys as
needed.
Another way is to use a sorted data structure (e.g., a sorteddict
as
proposed in this PEP), and simply access the keys, knowing that
they
are always in sorted order.

Both approaches make sense in the right circumstances, but Python
currently only supports the first approach out of the box. A
sorteddict never needs sorting and is ideal for creating indexes
to
data where the keys are based on some property of the data.
Adding a
sorteddict to the collections module would not add significantly
to the
size of Python's standard library, but would provide a very useful
data
structure.

Specification

The proposed sorteddict has the same API to the builtin dict, so
can be
used as a drop-in replacement. The only behavioural difference
being
that any list returned by the sorteddict (whether of keys, values,
or
items) will be in key order, and similarly any iterator returned
will
iterate in key order.

In addition, the keys() method has two optional arguments:

keys(firstindex : int = None, secondindex : int = None) -> list of
keys

The parameter names aren't nice, but using say "start" and "end"
would
be misleading since the intention is for the parameters to work
like
they do in range(), e.g.

sd.keys() # returns a list of all the sorteddict's keys
sd.keys(10) # returns a list of the first 10 keys
sd.keys(19, 35) # returns a list of the 19th-34th keys
inclusive

If an out of range index is given, an IndexError exception is
raised.

Since the sorteddict's data is always kept in key order, indexes
(integer offsets) into the sorteddict make sense. Five additional
methods are proposed to take advantage of this:

key(index : int) -> value

item(index : int) -> (key, value)

value(index : int) -> key

set_value(index : int, value)

delete(index : int)

Items and values can still be accessed using the key, e.g.,
sd[key],
since all the dict's methods are available.

Examples

To keep a collection of filenames on a case-insensitive file
system in
sorted order, we could use code like this:

files = collections.sorteddict.sorteddict()
for name in os.listdir("."):
files[name.lower()] = name

We can add new filenames at any time, safe in the knowledge that
whenever we call files.values(), we will get the files in
case-insensitive alphabetical order.

To be able to iterate over a collection of data items in various
predetermined orders, we could maintain two or more sorteddicts:

itemsByDate = collections.sorteddict.sorteddict()
itemsByName = collections.sorteddict.sorteddict()
itemsBySize = collections.sorteddict.sorteddict()
for item in listOfItems:
itemsByDate["%s\t%17X" % (item.date, id(item))] = item
itemsByName["%s\t%17X" % (item.name, id(item))] = item
itemsBySize["%s\t%17X" % (item.size, id(item))] = item

Here we have used the item IDs to ensure key uniqueness.

Now we can iterate in date or name order, for example:

for item in itemsByDate:
print item.name, item.date

Implementation

A pure Python implementation is available at:

http://pypi.python.org/pypi/sorteddict

Copyright

This document has been placed in the public domain.

A.T.Hofkamp

unread,

Sep 25, 2007, 5:53:50 AM9/25/07

to

On 2007-09-25, Mark Summerfield <m.n.sum...@googlemail.com> wrote:
> If there is positive feedback I will submit the PEP to the reviewers,
> so if you think it is a good idea please say so. (I'm sure that if you
> _don't_ like it you'll tell me anyway:-)

I like the idea, ie +1.

> This PEP proposes the addition of a sorted dictionary class to the
> standard library's collections module.

You don't seem to mention the sort criterium. I'd suggest to use the __lt__
operator, which you probably intended since it is commonly used.

Also, can I specify a custom sort function, as with list.sort() ?

> In addition, the keys() method has two optional arguments:
>
> keys(firstindex : int = None, secondindex : int = None) -> list of

Not sure this is a good idea. Wouldn't simply

mysorteddict.keys()[firstindex:secondindex]

be much better? It can do all you propose, and more (with iterators/generators
and such). I see no need to implement it inside the sorteddict as well.

> Since the sorteddict's data is always kept in key order, indexes
> (integer offsets) into the sorteddict make sense. Five additional
> methods are proposed to take advantage of this:
>
> key(index : int) -> value
>
> item(index : int) -> (key, value)
>
> value(index : int) -> key
>
> set_value(index : int, value)
>
> delete(index : int)

I wouldn't do this. It breaks compability with the normal dict. A beginning
user will expect dict and sorteddict to behave the same (except for
sortedness), including their set of supporting functions imho.

Otherwise, please make a case for them (and maybe even a new PEP to get them in
both types of dicts).

> Examples
>
> To keep a collection of filenames on a case-insensitive file
> system in
> sorted order, we could use code like this:
>
> files = collections.sorteddict.sorteddict()
> for name in os.listdir("."):
> files[name.lower()] = name

The more interesting case would be to you preserve case of the files within the
keys.

I usually need this data structure with an A* algorithm implementation, where I
need to obtain the cheapest solution found so far.

> for item in listOfItems:
> itemsByDate["%s\t%17X" % (item.date, id(item))] = item
> itemsByName["%s\t%17X" % (item.name, id(item))] = item
> itemsBySize["%s\t%17X" % (item.size, id(item))] = item

Wouldn't a line like "itemsBySize[(item.size, id(item))] = item" do as well?

(or with a custom sort function on items?)

> Now we can iterate in date or name order, for example:
>
> for item in itemsByDate:
> print item.name, item.date

Hmm, not with dict:

>>> for x in {1:10, 2:20}:
... print x
...
1
2

So you should get your "%s\t%17X" strings here.

Sincerely,
Albert

Andrew Durdin

unread,

Sep 25, 2007, 6:13:10 AM9/25/07

to Mark Summerfield, pytho...@python.org

On 9/25/07, Mark Summerfield <m.n.sum...@googlemail.com> wrote:
>
> Since the sorteddict's data is always kept in key order, indexes
> (integer offsets) into the sorteddict make sense. Five additional
> methods are proposed to take advantage of this:
>
> key(index : int) -> value
>
> item(index : int) -> (key, value)
>
> value(index : int) -> key
>
> set_value(index : int, value)
>
> delete(index : int)

But what about using non-sequential integer keys (something I do quite often)?

e.g. sorteddict({1:'a', 3:'b': 5:'c', 99:'d'})[3] should return 'b', not 'd'.

Andrew

Mark Summerfield

unread,

Sep 25, 2007, 6:24:15 AM9/25/07

to

On 25 Sep, 10:53, "A.T.Hofkamp" <h...@se-162.se.wtb.tue.nl> wrote:

> On 2007-09-25, Mark Summerfield <m.n.summerfi...@googlemail.com> wrote:
>
> > If there is positive feedback I will submit the PEP to the reviewers,
> > so if you think it is a good idea please say so. (I'm sure that if you
> > _don't_ like it you'll tell me anyway:-)
>
> I like the idea, ie +1.
>
> > This PEP proposes the addition of a sorted dictionary class to the
> > standard library's collections module.
>
> You don't seem to mention the sort criterium. I'd suggest to use the __lt__
> operator, which you probably intended since it is commonly used.

You are right that I implicitly use __lt__ (or __cmp__ if there's no
__lt__).

> Also, can I specify a custom sort function, as with list.sort() ?

No. That would make comparing two sorteddicts problematic. You can
always use munged string keys as my second example shows.

> > In addition, the keys() method has two optional arguments:
>
> > keys(firstindex : int = None, secondindex : int = None) -> list of
>
> Not sure this is a good idea. Wouldn't simply
>
> mysorteddict.keys()[firstindex:secondindex]
>
> be much better? It can do all you propose, and more (with iterators/generators
> and such). I see no need to implement it inside the sorteddict as well.

The reason for making it part of th

Mark Summerfield

unread,

Sep 25, 2007, 6:34:23 AM9/25/07

to Andrew Durdin, pytho...@python.org

On 2007-09-25, Andrew Durdin wrote:
> On 9/25/07, Mark Summerfield <m.n.sum...@googlemail.com> wrote:

> > Since the sorteddict's data is always kept in key order, indexes
> > (integer offsets) into the sorteddict make sense. Five additional
> > methods are proposed to take advantage of this:
> >
> > key(index : int) -> value
> >
> > item(index : int) -> (key, value)
> >
> > value(index : int) -> key
> >
> > set_value(index : int, value)
> >
> > delete(index : int)
>

> But what about using non-sequential integer keys (something I do quite
> often)?
>
> e.g. sorteddict({1:'a', 3:'b': 5:'c', 99:'d'})[3] should return 'b', not
> 'd'.
>
> Andrew

The sorteddict really does work in key order, so:

d = sorteddict({1:'a', 3:'b', 5:'c', 99:'d'})
d.items()
[(1, 'a'), (3, 'b'), (5, 'c'), (99, 'd')]

If you do d[3] you will get 'd' since that is the 4th sequential item.
If you want to get the item with _key_ 3 then use value():

d.value(3)
'd'

PS In my previous reply I got my example wrong a second time: should have
been:
for item in itemsByDate.values():...

--
Mark Summerfield, Qtrac Ltd., www.qtrac.eu

Mark Summerfield

unread,

Sep 25, 2007, 6:40:53 AM9/25/07

to Andrew Durdin, pytho...@python.org, Mark Summerfield

On 2007-09-25, Andrew Durdin wrote:
> On 9/25/07, Mark Summerfield <m.n.sum...@googlemail.com> wrote:

> > Since the sorteddict's data is always kept in key order, indexes
> > (integer offsets) into the sorteddict make sense. Five additional
> > methods are proposed to take advantage of this:
> >
> > key(index : int) -> value
> >
> > item(index : int) -> (key, value)
> >
> > value(index : int) -> key
> >
> > set_value(index : int, value)
> >
> > delete(index : int)
>

> But what about using non-sequential integer keys (something I do quite
> often)?
>
> e.g. sorteddict({1:'a', 3:'b': 5:'c', 99:'d'})[3] should return 'b', not
> 'd'.
>
> Andrew

Hmmm, managed to confuse myself with 'b' and 'd'!

d = sorteddict({1:"one", 3:"three", 5:"five", 99:"ninetynine"})

d.items()

[(1, 'one'), (3, 'three'), (5, 'five'), (99, 'ninetynine')]

d[3], d.value(3)

('three', 'ninetynine')

So using [] returns the value for the given key and using value()
returns the value for the given index position.

Duncan Booth

unread,

Sep 25, 2007, 6:51:42 AM9/25/07

to

Mark Summerfield <m.n.sum...@googlemail.com> wrote:

> When handling collections of key-value data, it is often
> convenient to access the data in key order. One way to do this is
> to use an unsorted data structure (e.g., a Python dict), and then
> sort the keys as needed. Another way is to use a sorted data
> structure (e.g., a sorteddict as proposed in this PEP), and simply
> access the keys, knowing that they are always in sorted order.

Please define 'sorted order' and make sure it is customisable.

> keys(firstindex : int = None, secondindex : int = None) ->
> list of keys

I don't really see the point of this, but general support for slicing
might be interesting:

sd[start:end] -> list of values for all keys such that start <= key <
end (using the appropriate definition of 'sorted order').

> key(index : int) -> value
>
> item(index : int) -> (key, value)
>
> value(index : int) -> key

I'm confused: the key method returns a value and the value method
returns a key???

>
> set_value(index : int, value)
>
> delete(index : int)

All of those can of course be replaced with a single method that returns
the key at a specified index and then using that key. Yes that would be
less efficient, but I'd rather have as close to a standard dictionary
interface as possible.

> Examples
>
> To keep a collection of filenames on a case-insensitive file
> system in sorted order, we could use code like this:
>
> files = collections.sorteddict.sorteddict()
> for name in os.listdir("."):
> files[name.lower()] = name
>
> We can add new filenames at any time, safe in the knowledge that
> whenever we call files.values(), we will get the files in
> case-insensitive alphabetical order.

I don't see this as a terribly useful example since you don't explain
why we need to keep filenames in case-insensitive alphabetical order at
all. If we want to print a list of filenames we can sort them once when
printing them, why do we need to keep them 'always sorted'?

> To be able to iterate over a collection of data items in various
> predetermined orders, we could maintain two or more sorteddicts:
>
> itemsByDate = collections.sorteddict.sorteddict()
> itemsByName = collections.sorteddict.sorteddict()
> itemsBySize = collections.sorteddict.sorteddict()
> for item in listOfItems:
> itemsByDate["%s\t%17X" % (item.date, id(item))] = item
> itemsByName["%s\t%17X" % (item.name, id(item))] = item
> itemsBySize["%s\t%17X" % (item.size, id(item))] = item

I think perl might make you use strings as keys, fortunately Python
doesn't. What I really don't like about this is that you've copied the
date, so if you mutate item.date the itemsByDate are no longer sorted.

Again it sounds like sorting for printing will be less overhead that
keeping them always sorted. You need to come up with examples where
there is an actual benefit to not sorting for output.

> Here we have used the item IDs to ensure key uniqueness.

If you are going to do this then it really does need to be:

itemsByDate = sorteddict(items, key=lambda self, k: self[k].date)

so that you can maintain sorted order when mutating an item. How the
dictionary knows that its contents have been mutated is another question
entirely.

>
> Now we can iterate in date or name order, for example:
>
> for item in itemsByDate:
> print item.name, item.date

Which we can do just as well with an ordinary dictionary:

for item in sorted(items, key=byDate):
...

given an appropriate definition of byDate.

The cases I can think of where a sorteddict might be useful would be
things like collecting web server statistics and maintaining a regularly
updated display of the top n pages. You need a combination of factors
for this to be useful: frequent updates, and frequent extraction of
sorted elements + the requirement to access elements as a dictionary.

Also, none of your use cases target any of the additional fripperies you
threw into your proposal.

Jeremy Sanders

unread,

Sep 25, 2007, 7:11:48 AM9/25/07

to

Mark Summerfield wrote:

> If there is positive feedback I will submit the PEP to the reviewers,
> so if you think it is a good idea please say so. (I'm sure that if you
> _don't_ like it you'll tell me anyway:-)

It would be nice to have the ability to use numerical indexes and the key,
but I don't think they should share the same methods.

A useful use case would be to make a LRU (least recently used) dictionary,
where the keys are time-based (e.g. an incrementing counter). You should be
able to identify the least recently used object for discarding by just
accessing the last item in the dictionary.

By the way, I think a LRU cache dictionary would be a great addition to the
standard library.

Is there any speed advantage from implementing the sorteddict as a red-black
tree or something similar in C?

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/

Paul Hankin

unread,

Sep 25, 2007, 7:19:24 AM9/25/07

to

Recall sorted...
sorted(iterable, cmp=None, key=None, reverse=False) --> new sorted
list

So why not construct sorteddicts using the same idea of sortedness?

sorteddict((mapping | sequence | nothing), cmp=None, key=None,
reverse=None)

Then we can specify the exact behaviour of sorteddicts: it's the same
as a regular dict, except when the order of keys is important they're
ordered as if they were sorted by sorted(keys, cmp=sd._cmp,
key=sd._key, reverse=sd._reverse). Here I imagine cmp, key and reverse
are stored in the new sorteddict as attributes _cmp, _key, _reverse -
obviously the actual implementation may differ.

This has the benefit of making sorteddict's behaviour explicit and
easy to understand, and doesn't introduce a new sorting API when we
already have a perfectly decent one.

The only problem here is the **kwargs form of the dict constructor
doesn't translate as we're using keyword args to pass in the sort
criterion. Perhaps someone has an idea of how this can be solved. If
nothing else, this constructor could be dropped for sorteddicts, and
sorteddict(dict(**kwargs), cmp=..., key=..., reverse=...) when that
behaviour is wanted.

I don't like the integral indexing idea or the slicing: indexing and
slicing are already part of python and it's bad to have different
syntax for the same concept on sorteddicts. I'd say it's not an
important enough for sorteddicts anyway.

--
Paul Hankin

Mark Summerfield

unread,

Sep 25, 2007, 7:25:39 AM9/25/07

to

On 25 Sep, 12:11, Jeremy Sanders <jeremy

+complangpyt...@jeremysanders.net> wrote:
> Mark Summerfield wrote:
> > If there is positive feedback I will submit the PEP to the reviewers,
> > so if you think it is a good idea please say so. (I'm sure that if you
> > _don't_ like it you'll tell me anyway:-)
>
> It would be nice to have the ability to use numerical indexes and the key,
> but I don't think they should share the same methods.

They don't; you use [] for key index and key(), value(), or item() for
index access.

> A useful use case would be to make a LRU (least recently used) dictionary,
> where the keys are time-based (e.g. an incrementing counter). You should be
> able to identify the least recently used object for discarding by just
> accessing the last item in the dictionary.

You could do that with a sorteddict simply by using datetime.date
objects or timestamp strings for keys. You could get the first or last
item (no matter what its key). For example:

d = sorteddict({1200:"lunch", 1100:"elevensies", 1600:"tea",
1900:"dinner"})
d.items()
[(1100, 'elevensies'), (1200, 'lunch'), (1600, 'tea'), (1900,
'dinner')]
d.item(0), d.item(-1)
((1100, 'elevensies'), (1900, 'dinner'))

So you could remove the last item using d.delete(-1).

> By the way, I think a LRU cache dictionary would be a great addition to the
> standard library.
>
> Is there any speed advantage from implementing the sorteddict as a red-black
> tree or something similar in C?

I'm sure that a C-based implementation using red-black or AVL trees or
skiplists would be faster, but right now I'm just trying to get an
acceptable API.

Corrections to my original PEP:

value(index : int) -> value # I incorrectly had the return value as
key

Also, now in the second example I use a tuple of string, int ID rather
than strings.

> Jeremy
>
> --
> Jeremy Sandershttp://www.jeremysanders.net/

Mark Summerfield

unread,

Sep 25, 2007, 7:51:01 AM9/25/07

to

This makes a lot of sense to me. But how do you envisage it would be
implemented?

Steven Bethard

unread,

Sep 25, 2007, 12:47:48 PM9/25/07

to

Mark Summerfield wrote:
> PEP: XXX
> Title: Sorted Dictionary

[snip]

> In addition, the keys() method has two optional arguments:
>
> keys(firstindex : int = None, secondindex : int = None) -> list of keys
>
> The parameter names aren't nice, but using say "start" and "end" would
> be misleading since the intention is for the parameters to work like
> they do in range(), e.g.
>
> sd.keys() # returns a list of all the sorteddict's keys
> sd.keys(10) # returns a list of the first 10 keys
> sd.keys(19, 35) # returns a list of the 19th-34th keys inclusive

You should use the range() names, "start" and "stop":

>>> help(range)
Help on built-in function range in module __builtin__:

range(...)
range([start,] stop[, step]) -> list of integers

But I also agree that this is probably not necessary given the existence
of slicing and itertools.islice().

> Since the sorteddict's data is always kept in key order, indexes
> (integer offsets) into the sorteddict make sense. Five additional
> methods are proposed to take advantage of this:
>
> key(index : int) -> value
>
> item(index : int) -> (key, value)
>
> value(index : int) -> key
>
> set_value(index : int, value)
>
> delete(index : int)
>
> Items and values can still be accessed using the key, e.g., sd[key],
> since all the dict's methods are available.

I agree with others that these APIs are really not necessary. Simply
call keys(), values() or items() and then use that list if you really
want indexing. If you're dealing with a lot of these kind of indexing
behaviors, working with the plain list object will be faster anyway.

> Examples
>
> To keep a collection of filenames on a case-insensitive file system in
> sorted order, we could use code like this:
>
> files = collections.sorteddict.sorteddict()

The type should probably just be in collections directly. No need for a
sub-package. So that would look like::

files = collections.sorteddict()

I also strongly support Paul Hankin's proposal for the constructor API::

sorteddict((mapping|sequence|..), cmp=None, key=None, reverse=False)

Being able to sort by a key= function would make sorteddict()
dramatically more useful. I almost never use the builtin heap() because
it doesn't support a key= function and I always need one. I suspect use
cases for sorteddict() will be similar. For example::

files = collections.sorteddict(key=lambda s: s.lower())

for name in os.listdir("."):

files[name] = ... some calculated value ...

Now the file names are in lower-case sorted order, but if you actually
retrieve the keys, e.g. through .items(), you'll see the real filename,
not the lower-cased ones.

STeVe

thebjorn

unread,

Sep 25, 2007, 1:35:03 PM9/25/07

to

On Sep 25, 10:53 am, Mark Summerfield <m.n.summerfi...@googlemail.com>
wrote:

> Hi,
>
> Below is a PEP proposal for a sorteddict. It arises out of a
> discussion on this list that began a few weeks ago with the subject of
> "An ordered dictionary for the Python library?", and a similar one on
> the P3K mailing list with the subject "ordered dict for p3k
> collections?".
>
> If there is positive feedback I will submit the PEP to the reviewers,
> so if you think it is a good idea please say so. (I'm sure that if you
> _don't_ like it you'll tell me anyway:-)

I can't see much advantage over:

for key in sorted(mydict):
...

A much simpler data-structure, that is also very useful, is a
dictionary that keeps insertion order -- it's also something that's a
tad bit more difficult to implement yourself than the above. My
personal implementation is documented at http://blog.tkbe.org/archive/python-property-set/
It's very tempting to add value access by numerical index (I did as
well), however I now think it's a mistake.

-1 from me.

-- bjorn

Paul Hankin

unread,

Sep 25, 2007, 2:22:03 PM9/25/07

to

On Sep 25, 12:51 pm, Mark Summerfield <m.n.summerfi...@googlemail.com>
wrote:

Here's a first go. Sorting occurs when the keys are iterated over,
making it fast (almost as a dict) for construction, insertion, and
deletion, but slow if you're iterating a lot. You should look at some
use cases to decide if this approach is best, or if a sorted
datastructure should be used instead, but my instinct is that this is
a decent approach. Certainly, you're unlikely to get a simpler
implementation :)

class sorteddict(dict):
"A sorted dictionary"
def __init__(self, arg=None, cmp=None, key=None, reverse=False):
if arg:
super(sorteddict, self).__init__(arg)
else:
super(sorteddict, self).__init__()
self._cmp = cmp
self._key = key
self._reverse = reverse
def keys(self):
return sorted(super(sorteddict, self).keys(), cmp=self._cmp,
key=self._key, reverse=self._reverse)
def iter_keys(self):
return (s for s in self.keys())
def items(self):
return [(key, self[key]) for key in self.keys()]
def iter_items(self):
return ((key, self[key]) for key in self.keys())
def values(self):
return [self[key] for key in self.keys()]
def iter_values(self):
return (self[key] for key in self.keys())
def __str__(self):
return '{' + ', '.join('%s: %s' % (repr(k), repr(v))
for k, v in self.iter_items()) + '}'
def __repr__(self):
return str(self)
def __iter__(self):
return self.iter_keys()

--
Paul Hankin

Hamilton, William

unread,

Sep 25, 2007, 2:43:45 PM9/25/07

to pytho...@python.org

> From: Paul Hankin

You could speed up keys() at the cost of memory if you maintained a list
of keys in the instance. Doing so would let you use an "unsorted" flag
that gets set when a new key is added and checked when keys() is called.
If the flag is unset, just return a copy of the list. Otherwise, sort
the list in place, return a copy, and unset the flag. (Copies because
you don't want the master key list to be modified by code using the
class.)

The use case for this seems to be when you have a dictionary that you
need to often work through in sorted order. Sorting the keys every time
keys() is called isn't an improvement over using a regular dict and
sorting the keys normally. So the extra memory cost of maintaining an
internal keys list looks reasonable to me.

--
-Bill Hamilton

Steven Bethard

unread,

Sep 25, 2007, 2:58:36 PM9/25/07

to

With this is the implementation, I'm definitely -1. Not because it's a
bad implementation, but because if the iteration is always doing a sort,
then there's no reason for a separate data structure. Just use sorted()
directly on a regular dict. That has the advantage of being much more
obvious about where the expensive operations are::

for key in sorted(d, ...):
... whatever you want to do ...

IMHO, the only reason to introduce a sorteddict() is if it minimizes the
cost of sorting the keys.

STeVe

Hrvoje Niksic

unread,

Sep 25, 2007, 3:14:16 PM9/25/07

to

Steven Bethard <steven....@gmail.com> writes:

> With this is the implementation, I'm definitely -1. Not because it's a
> bad implementation, but because if the iteration is always doing a
> sort, then there's no reason for a separate data structure.

Agreed. A true sorted dict would keep its keys sorted in the first
place, a la C++ std::map.

Paul Hankin

unread,

Sep 25, 2007, 3:28:29 PM9/25/07

to

On Sep 25, 7:58 pm, Steven Bethard <steven.beth...@gmail.com> wrote:
> > Paul Hankin wrote:
> > ...

> > class sorteddict(dict):
> > "A sorted dictionary"
> > def __init__(self, arg=None, cmp=None, key=None, reverse=False):
> > if arg:

> > ...

>
> With this is the implementation, I'm definitely -1. Not because it's a
> bad implementation, but because if the iteration is always doing a sort,
> then there's no reason for a separate data structure. Just use sorted()
> directly on a regular dict. That has the advantage of being much more
> obvious about where the expensive operations are::
>
> for key in sorted(d, ...):
> ... whatever you want to do ...
>
> IMHO, the only reason to introduce a sorteddict() is if it minimizes the
> cost of sorting the keys.

I disagree with your reason for introducing a sorteddict - the main
reason should be if it makes code that uses it simpler and more
readable, with efficiency (within reasonable limits) being of
secondary importance. Unfortunately for sorteddict, your code is
simpler and more explicit for most obvious use cases.

But, with Bill's cached sorted key list it'll be reasonably efficient,
and it's likely possible to update the sorted cache more efficiently
than resorting if there's only been a small number of insertions since
the last sort, but I haven't the time to do a thorough job.

--
Paul Hankin

chris.m...@gmail.com

unread,

Sep 25, 2007, 3:51:43 PM9/25/07

to

On Sep 25, 1:35 pm, thebjorn <BjornSteinarFjeldPetter...@gmail.com>
wrote:

> On Sep 25, 10:53 am, Mark Summerfield <m.n.summerfi...@googlemail.com>
> wrote:
>
> > Hi,
>
> > Below is a PEP proposal for a sorteddict. It arises out of a
> > discussion on this list that began a few weeks ago with the subject of
> > "An ordered dictionary for the Python library?", and a similar one on
> > the P3K mailing list with the subject "ordered dict for p3k
> > collections?".
>
> > If there is positive feedback I will submit the PEP to the reviewers,
> > so if you think it is a good idea please say so. (I'm sure that if you
> > _don't_ like it you'll tell me anyway:-)
>
> I can't see much advantage over:
>
> for key in sorted(mydict):
> ...
>
> A much simpler data-structure, that is also very useful, is a
> dictionary that keeps insertion order -- it's also something that's a
> tad bit more difficult to implement yourself than the above. My

> personal implementation is documented athttp://blog.tkbe.org/archive/python-property-set/

> It's very tempting to add value access by numerical index (I did as
> well), however I now think it's a mistake.
>
> -1 from me.
>
> -- bjorn

I agree on both counts. When I saw this topic, I was hoping it would
be about a structure that remembers insertion order. Now if you
drafted a pep on THAT, I would stand behind it.

James Stroud

unread,

Sep 25, 2007, 4:28:25 PM9/25/07

to Mark Summerfield, Andrew Durdin, pytho...@python.org

Mark Summerfield wrote:
> On 2007-09-25, Andrew Durdin wrote:
>> e.g. sorteddict({1:'a', 3:'b': 5:'c', 99:'d'})[3] should return 'b', not
>> 'd'.
>

> The sorteddict really does work in key order, so:
>
> d = sorteddict({1:'a', 3:'b', 5:'c', 99:'d'})
> d.items()
> [(1, 'a'), (3, 'b'), (5, 'c'), (99, 'd')]
>
> If you do d[3] you will get 'd' since that is the 4th sequential item.
> If you want to get the item with _key_ 3 then use value():
>
> d.value(3)
> 'd'

If a quorum opinion might sway you, I have found, after prolonged and
almost metaphysical type meditation on and experimentation with the
subject, that it would be undesirable to support the type of ambiguous
item access that you suggest.

1. It would break symmetry with __setitem__:

>>> d = sorteddict({1:'a', 3:'b', 5:'c', 99:'d'})

>>> d[2] = 'x'
>>> d[2]
'b'

The need for a work-around in this case (i.e. value()) sends
a flag that something is clumsy about the interface. It seems that
the presence of the value() method exists simply to compensate
for the "broken" design. More symmetrical would be to make value()
or some other well named method return a value at an index.

2. Such ambiguity breaks 100% compatiblity with the built-in
dict. I think the success of a sorted dict would be tied quite
closely to its being a superset of dict. This compatibility
not only makes it easy for people to start using sorteddict
without remembering the distinctions but should also reduce
the need for explicit type checking in libraries that use
sorteddict.

James

James Stroud

unread,

Sep 25, 2007, 4:30:04 PM9/25/07

to

Mark Summerfield wrote:
> Hmmm, managed to confuse myself with 'b' and 'd'!
>
> d = sorteddict({1:"one", 3:"three", 5:"five", 99:"ninetynine"})
>
> d.items()
>
> [(1, 'one'), (3, 'three'), (5, 'five'), (99, 'ninetynine')]
>
> d[3], d.value(3)
>
> ('three', 'ninetynine')
>
> So using [] returns the value for the given key and using value()
> returns the value for the given index position.
>

I didn't read enough messages. My last post is redundant.

Mark Summerfield

unread,

Sep 25, 2007, 4:55:22 PM9/25/07

to

I think that keeping a sorted list of keys is worthwhile, but because
the key or cmp functions could access keys or values or both you have
to invalidate whenever a key or value is added or changed. Here's my
version (v. similar to yours of course, but sorting when necessary):

class sorteddict(dict):

def __init__(self, iterable=None, cmp=None, key=None,
reverse=False):
if iterable is None:
iterable = []
dict.__init__(self, iterable)
self.__cmp = cmp
self.__key = key
self.__reverse = reverse
self.__keys = None

def update(self, *args, **kwargs):
dict.update(self, *args, **kwargs)
self.__keys = None

@classmethod
def fromkeys(cls, iterable, value=None):
dictionary = cls()
for key in iterable:
dictionary[key] = value
return dictionary

def copy(self):
dictionary = sorteddict(dict.copy(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
dictionary.__keys = None if self.__keys is None else
self.__keys[:]
return dictionary

def clear(self):
self.__keys = None
dict.clear(self)

def setdefault(self, key, value):
self.__keys = None
return dict.setdefault(self, key, value)

def pop(self, key, value=None):
if key not in self:
return value
self.__keys = None
return dict.pop(self, key, value)

def popitem(self):
self.__keys = None
return dict.popitem(self)

def keys(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
return self.__keys[:]

def values(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
return [self[key] for key in self.__keys]

def items(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
return [(key, self[key]) for key in self.__keys]

def __iter__(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
return iter(self.__keys)

def iterkeys(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
return iter(self.__keys)

def itervalues(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
for key in self.__keys:
yield self[key]

def iteritems(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
for key in self.__keys:
yield key, self[key]

def __delitem__(self, key):
self.__keys = None
dict.__delitem__(self, key)

def __setitem__(self, key, value):
self.__keys = None
dict.__setitem__(self, key, value)

def __repr__(self):
return str(self)

def __str__(self):
if self.__keys is None:
self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
key=self.__key,
reverse=self.__reverse)
return "{%s}" % ", ".join(
["%r: %r" % (key, self[key]) for key in self.__keys])

Paul Hankin

unread,

Sep 25, 2007, 5:33:23 PM9/25/07

to

On Sep 25, 9:55 pm, Mark Summerfield <m.n.summerfi...@googlemail.com>
wrote:
> ...
> class sorteddict(dict):
>
> ...

> if self.__keys is None:
> self.__keys = sorted(dict.keys(self), cmp=self.__cmp,
> key=self.__key,
> reverse=self.__reverse)

You'd be better defining __keys like this:

def __getkeys(self):
if self.__keycache is None:
self.__keycache = dict.keys(self)
self.__keycache.sort(cmp=...)
return self.__keycache

__keys = property(__getkeys)

and where you have 'self.__keys = None', either replace with
'self.__keycache = None' or more cleanly with a call to:

def __invalidate_key_cache(self):
self.__keycache = None

to improve the code quality.

--
Paul Hankin

Mark Summerfield

unread,

Sep 26, 2007, 2:59:41 AM9/26/07

to

Yes, that's much better.

As you've no doubt realised, this particular implementation gives best
performance when the pattern of use is: lots of edits, lots of
lookups, ..., and gives worst performance when the pattern of use is:
edit, lookup, edit, lookup (in which case using a dict and sorted() is
probably better).

So there is lots of scope for someone to do a version that has good
performance for all patterns of use:-)

I've put a full version with doctests & docstrings on PyPI:
http://pypi.python.org/pypi/sorteddict

Here's the stripped version (just 92 lines):

class sorteddict(dict):

def __init__(self, iterable=None, cmp=None, key=None,
reverse=False):
if iterable is None:
iterable = []
dict.__init__(self, iterable)
self.__cmp = cmp
self.__key = key
self.__reverse = reverse

self.__keycache = None

@property
def __keys(self):

if self.__keycache is None:
self.__keycache = dict.keys(self)

self.__keycache.sort(cmp=self.__cmp, key=self.__key,
reverse=self.__reverse)
return self.__keycache

def __invalidate_key_cache(self):
self.__keycache = None

def update(self, *args, **kwargs):
self.__invalidate_key_cache()
dict.update(self, *args, **kwargs)

@classmethod
def fromkeys(cls, iterable, value=None):
dictionary = cls()
for key in iterable:
dictionary[key] = value
return dictionary

def copy(self):
return sorteddict(dict.copy(self), cmp=self.__cmp,
key=self.__key, reverse=self.__reverse)

def clear(self):
self.__invalidate_key_cache()
dict.clear(self)

def setdefault(self, key, value):
self.__invalidate_key_cache()
return dict.setdefault(self, key, value)

def pop(self, key, value=None):
if key not in self:
return value

self.__invalidate_key_cache()
return dict.pop(self, key, value)

def popitem(self):
self.__invalidate_key_cache()
return dict.popitem(self)

def keys(self):
return self.__keys[:]

def values(self):

return [self[key] for key in self.__keys]

def items(self):

return [(key, self[key]) for key in self.__keys]

def __iter__(self):
return iter(self.__keys)

def iterkeys(self):
return iter(self.__keys)

def itervalues(self):

for key in self.__keys:
yield self[key]

def iteritems(self):

for key in self.__keys:
yield key, self[key]

def __delitem__(self, key):
self.__invalidate_key_cache()
dict.__delitem__(self, key)

def __setitem__(self, key, value):
self.__invalidate_key_cache()
dict.__setitem__(self, key, value)

def __repr__(self):
raise NotImplementedError()

def __str__(self):

Duncan Booth

unread,

Sep 26, 2007, 4:31:15 AM9/26/07

to

Mark Summerfield <m.n.sum...@googlemail.com> wrote:

> As you've no doubt realised, this particular implementation gives best
> performance when the pattern of use is: lots of edits, lots of
> lookups, ..., and gives worst performance when the pattern of use is:
> edit, lookup, edit, lookup (in which case using a dict and sorted() is
> probably better).
>
> So there is lots of scope for someone to do a version that has good
> performance for all patterns of use:-)

I that's the point though: you can't write one implementation that has good
performance for all patterns of use: you have to either compromise on
performance somewhere, or use an implementation specifically matched to
your use case.

For example, if the use case was some updates followed by iterating over
the first few keys only, it may be faster to use a heap for iteration.

I suspect that for many use patterns you could improve performance
significantly by having two levels of invalidation for the the key list: in
__setitem__, if the key already exists you don't need to discard the list
in __keycache (although if a key or cmp function was provided it may no
longer be sorted). In that case it might be worthwhile keeping __keycache
but flagging it as needing to be sorted again next time it is used. Given
that Python's sort is very fast on sorted or nearly sorted lists this could
provide a worthwhile speedup for cases where the values change but the keys
don't change often.

Hrvoje Niksic

unread,

Sep 26, 2007, 4:51:26 AM9/26/07

to

Duncan Booth <duncan...@invalid.invalid> writes:

> I that's the point though: you can't write one implementation that has good
> performance for all patterns of use

An implementation of sorted dict using a balanced tree as the
underlying data structure would give decent performance in all the
mentioned use cases. For example, red-black trees search, insert, and
delete in O(log n) time.

Mark Summerfield

unread,

Sep 26, 2007, 5:42:39 AM9/26/07

to Antoon Pardon

On 26 Sep, 09:51, Hrvoje Niksic <hnik...@xemacs.org> wrote:

Basically, as implemented, I have to invalidate if there is any change
to a key _or_ to a value because if cmp or key functions are specified
they could be using the key, the value, or even both. (If key and cmp
were both None and reverse was False, I could use the bisect module
and maintain the keys in sorted order at all times in that case, but
then performance would vary considerably depending on use which I
think would be v. confusing for users.)

You are quite right that using a balanced tree (red-black, AVL, or
b*tree) would provide decent performance in all the use cases. There
appear to be two on PyPI (rbtree and pyavl), but both seem to be C
extensions. Antoon Pardon has a pure Python AVL tree, but whether he
could or would produce a version that had the sorteddict API, I don't
know.

Hrvoje Niksic

unread,

Sep 26, 2007, 6:27:01 AM9/26/07

to

Mark Summerfield <m.n.sum...@googlemail.com> writes:

> On 26 Sep, 09:51, Hrvoje Niksic <hnik...@xemacs.org> wrote:
>> Duncan Booth <duncan.bo...@invalid.invalid> writes:
>> > I that's the point though: you can't write one implementation
>> > that has good performance for all patterns of use
>>
>> An implementation of sorted dict using a balanced tree as the
>> underlying data structure would give decent performance in all the
>> mentioned use cases. For example, red-black trees search, insert, and
>> delete in O(log n) time.
>
> Basically, as implemented, I have to invalidate if there is any

> change [...]

No argument here, as long as the limitation is understood to be a
consequence of the current implementation model. Seriously proposing
a sorteddict that is a mere syntactic sugar over dict dooms the PEP to
rejection.

Major programming language libraries have included sorted mapping and
set types for a while now, making the performance and complexity
constraints generally well understood. We should make use of that
knowledge when designing sorteddict.

Mark Summerfield

unread,

Sep 26, 2007, 7:02:18 AM9/26/07

to chris.g...@newcenturycomputers.net

On 26 Sep, 11:27, Hrvoje Niksic <hnik...@xemacs.org> wrote:

> Mark Summerfield <m.n.summerfi...@googlemail.com> writes:
> > On 26 Sep, 09:51, Hrvoje Niksic <hnik...@xemacs.org> wrote:
> >> Duncan Booth <duncan.bo...@invalid.invalid> writes:
> >> > I that's the point though: you can't write one implementation
> >> > that has good performance for all patterns of use
>
> >> An implementation of sorted dict using a balanced tree as the
> >> underlying data structure would give decent performance in all the
> >> mentioned use cases. For example, red-black trees search, insert, and
> >> delete in O(log n) time.
>
> > Basically, as implemented, I have to invalidate if there is any
> > change [...]
>
> No argument here, as long as the limitation is understood to be a
> consequence of the current implementation model. Seriously proposing
> a sorteddict that is a mere syntactic sugar over dict dooms the PEP to
> rejection.

On the Py3K list GvR made it clear that sorteddict wouldn't be in
Python until an implementation had been in popular use for some
time... so I will not put forward a PEP for it.

> Major programming language libraries have included sorted mapping and
> set types for a while now, making the performance and complexity
> constraints generally well understood. We should make use of that
> knowledge when designing sorteddict.

I know! I use Python and C++ and Qt, so what with the STL and Qt's
containers I'm used to having far more containers to choose from than
Python offers. I think Python could do with both sorteddict and
sortedset in the collections module, but to get decent performance
will require a balanced tree (or a skiplist) implementation, and I
don't have the time to do that. Apart from Antoon Pardon's AVL tree,
there's also Chris Gonnerman's RBTree, also pure Python, with a dict
API and that seems to accept a cmp function, but again, I don't know
whether it could be adapted to the sorteddict((mapping | sequence |
nothing), cmp=None, key=None, reverse=None) API that Paul Hankin
described.

Paul Hankin

unread,

Sep 26, 2007, 7:14:57 AM9/26/07

to

On Sep 26, 9:31 am, Duncan Booth <duncan.bo...@invalid.invalid> wrote:

More flexibly, keep a set of inserted keys that haven't yet been
included in the sorted list, and a set of deleted keys that haven't
yet been removed from the sorted list. The cache is invalid if either
of these sets are empty - and to make it valid you can choose what to
do based on the sizes of the two sets (and the sorted list). For
instance, if there's just been one insertion you're probably better
doing an insertion rather than a full resort. Of course, there's a few
nasty cases here but it's always possible to just throw away the
sorted list and reconstruct it from the dict keys when things get too
hairy (eg, the user deletes a key that's in the inserted-but-not-yet-
sorted set).

--
Paul Hankin

Paul Hankin

unread,

Sep 26, 2007, 7:23:39 AM9/26/07

to

On Sep 26, 9:51 am, Hrvoje Niksic <hnik...@xemacs.org> wrote:

But dicts do search, insert and delete in O(1) time, so using some
variety of balanced tree will give you much worse performance when
you're doing regular dict operations.

--
Paul Hankin

Antoon Pardon

unread,

Sep 26, 2007, 8:22:34 AM9/26/07

to

Well you should decide what you want. In a previous exchange one of the
things that was wanted was that you could already seed a tree by using
key word arguments so that you could do the following:

t=Tree(a=15, b=30)

which would be equivallent to:

t=Tree()
t.update(a=15,b=30)

But with the above proposal

t=Tree(cmp=cmp)

is not equivallent to

t=Tree()
t.update(cmp=cmp)

So it seems the API needs some more sorting out.

--
Antoon Pardon

Hrvoje Niksic

unread,

Sep 26, 2007, 8:42:30 AM9/26/07

to

Paul Hankin <paul....@gmail.com> writes:

>> An implementation of sorted dict using a balanced tree as the
>> underlying data structure would give decent performance in all the
>> mentioned use cases. For example, red-black trees search, insert,
>> and delete in O(log n) time.
>
> But dicts do search, insert and delete in O(1) time, so using some
> variety of balanced tree will give you much worse performance when
> you're doing regular dict operations.

I wouldn't call it "much worse"; while O(log(n)) is worse than O(1),
it's still very fast, which is why popular programming language
libraries have an ordered mapping type based on balanced trees. Also
note that dict performance can degrade with hash collisions, while
trees can maintain complexity guarantees on all operations.

In the end, it's a tradeoff. Hash tables offer O(1) access, but lack
ordering. Balanced trees offer ordering at the price of O(log n)
access. Both have their uses, but neither is syntactic sugar for the
other.

Mark Summerfield

unread,

Sep 26, 2007, 9:44:23 AM9/26/07

to

On 26 Sep, 13:22, Antoon Pardon <apar...@forel.vub.ac.be> wrote:

I think you missed some of the previous postings.

The sorteddict API that has emerged so far is (1) apart from the
constructor, everything is identical to dict, (2) the constructor
takes the same args as sorted(), so if you want to seed with a dict or
with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
create a sorteddict and use update() since that takes the same args as
dict's constructor).

Could your AVLTree support cmp and keys and reverse?

Duncan Booth

unread,

Sep 26, 2007, 9:46:03 AM9/26/07

to

Paul Hankin <paul....@gmail.com> wrote:

>> I suspect that for many use patterns you could improve performance
>> significantly by having two levels of invalidation for the the key
>> list: in __setitem__, if the key already exists you don't need to
>> discard the list in __keycache (although if a key or cmp function was
>> provided it may no longer be sorted). In that case it might be
>> worthwhile keeping __keycache but flagging it as needing to be sorted
>> again next time it is used. Given that Python's sort is very fast on
>> sorted or nearly sorted lists this could provide a worthwhile speedup
>> for cases where the values change but the keys don't change often.
>
> More flexibly, keep a set of inserted keys that haven't yet been
> included in the sorted list, and a set of deleted keys that haven't
> yet been removed from the sorted list. The cache is invalid if either
> of these sets are empty - and to make it valid you can choose what to
> do based on the sizes of the two sets (and the sorted list). For
> instance, if there's just been one insertion you're probably better
> doing an insertion rather than a full resort. Of course, there's a few
> nasty cases here but it's always possible to just throw away the
> sorted list and reconstruct it from the dict keys when things get too
> hairy (eg, the user deletes a key that's in the inserted-but-not-yet-
> sorted set).
>

Yes that sounds good. Did you mean 'The cache is invalid if either of
these sets is not empty'?

If you delete a key which is in the inserted set you can simply delete
it from the inserted set. The inserted set can be added to sorted list
simply by appending and sorting, and the deleted keys can be removed
with a list comprehension. The only other thing you need is to catch
when the sort key has changed: a public method to request a resort would
handle this (but of course it wouldn't actually resort until it needed
the sorted keys again.)

Mark's sortteddict.py modified to avoid calling __invalidate_key (and
__repr__ also implemented):

------------ sorteddict.py -------------
#!/usr/bin/env python
# Copyright (c) 2007 Qtrac Ltd. All rights reserved.
# This module is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
# The API and implementation are based on ideas from Paul Hankin writing
# on comp.lang.python.

"""
A dictionary that is sorted by key or by the given cmp or key function.

Provides a dictionary with the same methods and behavior as a standard
dict and that can be used as a drop-in replacement for a dict (apart
from the constructor), but which always returns iterators and lists
(whether of keys or values) in sorted order. It does not matter when
items are inserted or when removed, the items in the sorteddict are
always returned in sorted order. The ordering is implicitly based on the
key's __lt__() (or failing that __cmp__()) method if no cmp or key
function is given.

The main benefit of sorteddicts is that you never have to explicitly
sort.

One use case is where you have a set of objects that you want to make
available in various sorted orders. You could, of course, sort on demand,
but if the number of objects is very large (and if you're prepared to
sacrifice some memory), it may be faster to use a sorteddict. For
example:

>>> class PrimeMinister:
... def __init__(self, forename, surname, elected):
... self.forename = forename
... self.surname = surname
... self.elected = elected
>>> byForename = sorteddict(key=lambda pm: (pm.forename, pm.surname))
>>> bySurname = sorteddict(key=lambda pm: (pm.surname, pm.forename))
>>> byName = sorteddict(key=lambda pm: (pm.surname, pm.forename,
... pm.elected), reverse=True)
>>> byElected = sorteddict(key=lambda pm: pm.elected)
>>> for forename, surname, elected in (
... ("Ramsay", "MacDonald", "1924-01-22"),
... ("Arthur", "Balfour", "1902-07-11"),
... ("Herbert Henry", "Asquith", "1908-04-07"),
... ("Stanley", "Baldwin", "1924-11-04"),
... ("David", "Lloyd George", "1916-12-07"),
... ("Andrew", "Bonar Law", "1922-10-23"),
... ("Henry", "Campbell-Bannerman", "1905-12-05"),
... ("Stanley", "Baldwin", "1923-05-23"),
... ("Ramsay", "MacDonald", "1929-06-05")):
... pm = PrimeMinister(forename, surname, elected)
... byForename[pm] = pm
... bySurname[pm] = pm
... byName[pm] = pm
... byElected[pm] = pm
>>> [pm.forename for pm in byForename.values()]
['Andrew', 'Arthur', 'David', 'Henry', 'Herbert Henry', 'Ramsay', \
'Ramsay', 'Stanley', 'Stanley']
>>> [pm.surname for pm in bySurname.values()]
['Asquith', 'Baldwin', 'Baldwin', 'Balfour', 'Bonar Law', \
'Campbell-Bannerman', 'Lloyd George', 'MacDonald', 'MacDonald']
>>> ["%s %s %s" % (pm.forename, pm.surname, pm.elected) \
for pm in byName.values()]
['Ramsay MacDonald 1929-06-05', 'Ramsay MacDonald 1924-01-22', \
'David Lloyd George 1916-12-07', 'Henry Campbell-Bannerman 1905-12-05', \
'Andrew Bonar Law 1922-10-23', 'Arthur Balfour 1902-07-11', \
'Stanley Baldwin 1924-11-04', 'Stanley Baldwin 1923-05-23', \
'Herbert Henry Asquith 1908-04-07']
>>> ["%s %s %s" % (pm.forename, pm.surname, pm.elected) \
for pm in byElected.values()]
['Arthur Balfour 1902-07-11', 'Henry Campbell-Bannerman 1905-12-05', \
'Herbert Henry Asquith 1908-04-07', 'David Lloyd George 1916-12-07', \
'Andrew Bonar Law 1922-10-23', 'Stanley Baldwin 1923-05-23', \
'Ramsay MacDonald 1924-01-22', 'Stanley Baldwin 1924-11-04', \
'Ramsay MacDonald 1929-06-05']

Thanks to Python's object references, even though there are four
sorteddicts referring to the same PrimeMinister objects, only one instance
of each object is held in memory.

>>> files = ["README.txt", "readme", "MANIFEST", "test.py"]
>>> d = sorteddict([(name, name) for name in files],
... cmp=lambda a, b: cmp(a.lower(), b.lower()))
>>> d.keys()
['MANIFEST', 'readme', 'README.txt', 'test.py']

Here are a few tests for some of the base class methods that are not
reimplemented:

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.get("X", 21)
21
>>> d.get("i")
4
>>> d.has_key("a")
True
>>> d.has_key("x")
False
>>> "a" in d
True
>>> "x" in d
False
>>> len(d)
6
>>> del d["n"]
>>> del d["y"]
>>> len(d)
4
>>> d.clear()
>>> len(d)
0
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d["i"]
4
>>> d["y"]
6
>>> d["z"]
Traceback (most recent call last):
...
KeyError: 'z'
"""

__author__ = "Mark Summerfield"
__version__ = "1.1.0"

class sorteddict(dict):
"""A dictionary that is sorted by key or by the given cmp or key function.

The sorteddict always returns any list or iterator in sorted order.
For best performance prefer a key argument to a cmp one. If neither
is given __lt__() (falling back to __cmp__()) will be used for
ordering.

This particular implementation has reasonable performance if the
pattern of use is: lots of edits, lots of lookups, ..., but gives
its worst performance if the pattern of use is: edit, lookup, edit,
lookup, ..., in which case using a plain dict and sorted() will
probably be better.

If you want to initialize with a dict, either use
sorteddict(dict(...), ...), or create the sorteddict and then call
update with the arguments normally passed to a dict constructor.
"""

def __init__(self, iterable=None, cmp=None, key=None, reverse=False):

"""Initializes the sorteddict using the same arguments as sorted()

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.items()
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
>>> str(sorteddict())
'{}'
>>> e = sorteddict(d)
>>> e.items()
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]

"""
if iterable is None:
iterable = []
dict.__init__(self, iterable)
self.__cmp = cmp
self.__key = key
self.__reverse = reverse
self.__keycache = None

self.__addkeys = set()
self.__delkeys = set()

@property
def __keys(self):
if self.__keycache is None:
self.__keycache = dict.keys(self)

self.__addkeys = set()
self.__delkeys = set()
self.__keycache.sort(cmp=self.__cmp, key=self.__key,
reverse=self.__reverse)
else:
if self.__delkeys:
delkeys = self.__delkeys
self.__keycache = [key for key in self.__keycache if key not in delkeys]
self.__delkeys = set()

if self.__addkeys:
self.__keycache += list(self.__addkeys)
self.__addkeys = set()

self.__keycache.sort(cmp=self.__cmp, key=self.__key,
reverse=self.__reverse)
return self.__keycache

def __invalidate_key_cache(self):
self.__keycache = None

self.__addkeys = set()
self.__delkeys = set()

def __addkey(self, key):
if key in self.__delkeys:
del self.__delkeys[key]
else:
self.__addkeys.add(key)

def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
else:
self.__delkeys.add(key)

def update(self, *args, **kwargs):
"""Updates the sorteddict using the same arguments as dict

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5))
>>> d.update(a=4, z=-4)
>>> d.items()
[('a', 4), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('z', -4)]
>>> del d["a"]
>>> del d["i"]
>>> d.update({'g': 9}, a=1, z=3)
>>> d.items()
[('a', 1), ('g', 9), ('n', 3), ('s', 1), ('t', 5), ('z', 3)]
>>> e = sorteddict(dict(p=4, q=5))
>>> del d["a"]
>>> del d["n"]
>>> e.update(d)
>>> e.items()
[('g', 9), ('p', 4), ('q', 5), ('s', 1), ('t', 5), ('z', 3)]
"""
self.__invalidate_key_cache()
dict.update(self, *args, **kwargs)

@classmethod
def fromkeys(cls, iterable, value=None):

"""A class method that returns an sorteddict whose keys are
from the iterable and each of whose values is value

>>> d = sorteddict()
>>> e = d.fromkeys("KYLIE", 21)
>>> e.items()
[('E', 21), ('I', 21), ('K', 21), ('L', 21), ('Y', 21)]
>>> e = sorteddict.fromkeys("KYLIE", 21)
>>> e.items()
[('E', 21), ('I', 21), ('K', 21), ('L', 21), ('Y', 21)]

"""
dictionary = cls()
for key in iterable:
dictionary[key] = value
return dictionary

def copy(self):
"""Returns a shallow copy of this sorteddict

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> e = d.copy()
>>> e.items()
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]
>>> d = sorteddict()
>>> e = d.copy()
>>> e.items()
[]

"""
return sorteddict(dict.copy(self), cmp=self.__cmp,
key=self.__key, reverse=self.__reverse)

def clear(self):
"""Removes every item from this sorteddict

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> len(d)
6
>>> d.clear()
>>> len(d)
0
>>> d["m"] = 3
>>> d["a"] = 5
>>> d["z"] = 7
>>> d["e"] = 9
>>> d.keys()
['a', 'e', 'm', 'z']
"""
self.__invalidate_key_cache()
dict.clear(self)

def setdefault(self, key, value):
"""If key is in the dictionary, returns its value;
otherwise adds the key with the given value which is also
returned

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.setdefault("n", 99)
3
>>> d.values()
[2, 4, 3, 1, 5, 6]
>>> d.setdefault("r", -20)
-20
>>> d.items()[2:]
[('n', 3), ('r', -20), ('s', 1), ('t', 5), ('y', 6)]
>>> d.setdefault("@", -11)
-11
>>> d.setdefault("z", 99)
99
>>> d.setdefault("m", 50)
50
>>> d.keys()
['@', 'a', 'i', 'm', 'n', 'r', 's', 't', 'y', 'z']

"""
if key not in self:

self.__addkey(key)
return dict.setdefault(self, key, value)

def pop(self, key, value=None):
"""If key is in the dictionary, returns its value and removes it
from the dictionary; otherwise returns the given value

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.pop("n")
3
>>> "n" in d
False
>>> d.pop("q", 41)
41
>>> d.keys()
['a', 'i', 's', 't', 'y']
>>> d.pop("a")
2
>>> d.pop("t")
5
>>> d.keys()
['i', 's', 'y']

"""
if key not in self:
return value

self.__removekey(key)
return dict.pop(self, key, value)

def popitem(self):
"""Returns and removes an arbitrary item from the dictionary

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> len(d)
6
>>> item = d.popitem()
>>> item = d.popitem()
>>> item = d.popitem()
>>> len(d)
3
"""
item = dict.popitem(self)
self.__removekey(item[0])
return item

def keys(self):
"""Returns the dictionary's keys in sorted order

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.keys()
['a', 'i', 'n', 's', 't', 'y']
"""
return self.__keys[:]

def values(self):
"""Returns the dictionary's values in key order

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.values()
[2, 4, 3, 1, 5, 6]
>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6),
... reverse=True)
>>> d.values()
[6, 5, 1, 3, 4, 2]
>>> d = sorteddict(dict(S=1, a=2, N=3, i=4, T=5, y=6),
... cmp=lambda a, b: cmp(a.lower(), b.lower()))
>>> d.keys()
['a', 'i', 'N', 'S', 'T', 'y']

"""
return [self[key] for key in self.__keys]

def items(self):
"""Returns the dictionary's items in key order

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.items()
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]

"""
return [(key, self[key]) for key in self.__keys]

def __iter__(self):
"""Returns an iterator over the dictionary's keys

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> list(d)
['a', 'i', 'n', 's', 't', 'y']
"""
return iter(self.__keys)

def iterkeys(self):
"""Returns an iterator over the dictionary's keys

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> list(d)
['a', 'i', 'n', 's', 't', 'y']
"""
return iter(self.__keys)

def itervalues(self):
"""Returns an iterator over the dictionary's values in key order

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> list(d.itervalues())
[2, 4, 3, 1, 5, 6]

"""
for key in self.__keys:
yield self[key]

def iteritems(self):
"""Returns an iterator over the dictionary's values in key order

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> list(d.iteritems())
[('a', 2), ('i', 4), ('n', 3), ('s', 1), ('t', 5), ('y', 6)]

"""
for key in self.__keys:
yield key, self[key]

def __delitem__(self, key):
"""Deletes the item with the given key from the dictionary

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> del d["s"]
>>> del d["y"]
>>> del d["a"]
>>> d.keys()
['i', 'n', 't']
"""
self.__removekey(key)
dict.__delitem__(self, key)

def __setitem__(self, key, value):
"""If key is in the dictionary, sets its value to value;
otherwise adds the key to the dictionary with the given value

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d["t"] = -17
>>> d["z"] = 43
>>> d["@"] = -11
>>> d["m"] = 22
>>> d["r"] = 5
>>> d.keys()
['@', 'a', 'i', 'm', 'n', 'r', 's', 't', 'y', 'z']

"""
if key not in self:

self.__addkey(key)
dict.__setitem__(self, key, value)

def __repr__(self):
"""
>>> sorteddict()
sorteddict()
>>> sorteddict({'a':1})
sorteddict({'a': 1})
>>> def comparison(a,b): return cmp(a,b)
>>> def keyfn(a): return a
>>> sorteddict({'a':1}, cmp=comparison) #doctest: +ELLIPSIS
sorteddict({'a': 1}, cmp=<function comparison at ...>)
>>> sorteddict({'a':1}, key=keyfn, reverse=True) #doctest: +ELLIPSIS
sorteddict({'a': 1}, key=<function keyfn at ...>, reverse=True)
"""
args = []
if self:

args.append(dict.__repr__(self))
if self.__cmp is not None:
args.append('cmp=%r' % self.__cmp)
if self.__key is not None:
args.append('key=%r' % self.__key)
if self.__reverse is not False:
args.append('reverse=%r' % self.__reverse)
return "%s(%s)" % (self.__class__.__name__, ', '.join(args))

def __str__(self):
"""Returns a human readable string representation of the
dictionary

The returned string is proportional in size to the number of
items so could be very large.

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5))
>>> str(d)
"{'a': 2, 'i': 4, 'n': 3, 's': 1, 't': 5}"
>>> d = sorteddict({2: 'a', 3: 'm', 1: 'x'})
>>> str(d)
"{1: 'x', 2: 'a', 3: 'm'}"

"""
return "{%s}" % ", ".join(
["%r: %r" % (key, self[key]) for key in self.__keys])

if __name__ == "__main__":
import doctest
doctest.testmod()

----------------------------------------

Paul Hankin

unread,

Sep 26, 2007, 9:59:40 AM9/26/07

to

On Sep 26, 2:46 pm, Duncan Booth <duncan.bo...@invalid.invalid> wrote:

> Paul Hankin <paul.han...@gmail.com> wrote:
> > More flexibly, keep a set of inserted keys that haven't yet been
> > included in the sorted list, and a set of deleted keys that haven't
> > yet been removed from the sorted list. The cache is invalid if either
> > of these sets are empty - and to make it valid you can choose what to
> > do based on the sizes of the two sets (and the sorted list). For
> > instance, if there's just been one insertion you're probably better
> > doing an insertion rather than a full resort. Of course, there's a few
> > nasty cases here but it's always possible to just throw away the
> > sorted list and reconstruct it from the dict keys when things get too
> > hairy (eg, the user deletes a key that's in the inserted-but-not-yet-
> > sorted set).
>
> Yes that sounds good. Did you mean 'The cache is invalid if either of
> these sets is not empty'?

Yes :)

> If you delete a key which is in the inserted set you can simply delete
> it from the inserted set.

No, in case it was already in the sorted list before the insert. You
have to remove it from the inserted set AND add it to the deleted set.

--
Paul Hankin

Mark Summerfield

unread,

Sep 26, 2007, 10:24:17 AM9/26/07

to

So should Duncan's

def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
else:
self.__delkeys.add(key)

be changed to:

def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]

self.__delkeys.add(key)

?

(BTW I'm happy to put up a new version on PyPI, but I'm v. conscious
of the fact that the key ideas and code have come from you and Duncan,
so if either of you would like to adopt it, I will gladly step aside.
I just want a sorteddict in Python, and I don't mind who does it:-)

Steve Holden

unread,

Sep 26, 2007, 11:17:33 AM9/26/07

to pytho...@python.org

Mark Summerfield wrote:
> On 26 Sep, 13:22, Antoon Pardon <apar...@forel.vub.ac.be> wrote:

[...]

>> So it seems the API needs some more sorting out.
>
> I think you missed some of the previous postings.
>
> The sorteddict API that has emerged so far is (1) apart from the
> constructor, everything is identical to dict, (2) the constructor
> takes the same args as sorted(), so if you want to seed with a dict or
> with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
> create a sorteddict and use update() since that takes the same args as
> dict's constructor).
>
> Could your AVLTree support cmp and keys and reverse?
>

I don't see how you can guarantee ordering of the seeded components,
since you are creating an inherently unpredictable dict as the component
to specify the initial contents of the sorteddict.

Or is an arbitrary initial ordering acceptable?

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden

Sorry, the dog ate my .sigline

Paul Hankin

unread,

Sep 26, 2007, 11:20:26 AM9/26/07

to

On Sep 26, 3:24 pm, Mark Summerfield <m.n.summerfi...@googlemail.com>
wrote:

Yes, and the same in __addkey: if it's in __delkeys it should be
removed from there, and added to __addkeys. There's an invariant: any
key is in at most one of __addkeys and __delkeys.

> (BTW I'm happy to put up a new version on PyPI, but I'm v. conscious
> of the fact that the key ideas and code have come from you and Duncan,
> so if either of you would like to adopt it, I will gladly step aside.
> I just want a sorteddict in Python, and I don't mind who does it:-)

Speaking for myself, I'm happy to have helped out, but you should
carry on.

--
Paul Hankin

Jeremy Sanders

unread,

Sep 26, 2007, 11:21:19 AM9/26/07

to

Mark Summerfield wrote:

> The sorteddict API that has emerged so far is (1) apart from the
> constructor, everything is identical to dict, (2) the constructor
> takes the same args as sorted(), so if you want to seed with a dict or
> with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
> create a sorteddict and use update() since that takes the same args as
> dict's constructor).

first() and last() would also be nice on a sorted dict.

Jeremy

--
Jeremy Sanders
http://www.jeremysanders.net/

Mark Summerfield

unread,

Sep 26, 2007, 11:40:39 AM9/26/07

to

OK, I have changed both:

def __addkey(self, key):
if key in self.__delkeys:
del self.__delkeys[key]

self.__addkeys.add(key)

def __removekey(self, key):
if key in self.__addkeys:
del self.__addkeys[key]
self.__delkeys.add(key)

> > (BTW I'm happy to put up a new version on PyPI, but I'm v. conscious

> > of the fact that the key ideas and code have come from you and Duncan,
> > so if either of you would like to adopt it, I will gladly step aside.
> > I just want a sorteddict in Python, and I don't mind who does it:-)
>
> Speaking for myself, I'm happy to have helped out, but you should
> carry on.

OK, I will if Duncan doesn't want it:-)

> --
> Paul Hankin

Raymond Hettinger

unread,

Sep 26, 2007, 1:59:52 PM9/26/07

to

[Mark Summerfield]

> Below is a PEP proposal for a sorteddict. It arises out of a
> discussion on this list that began a few weeks ago with the subject of
> "An ordered dictionary for the Python library?"

It is worth remembering that the long sought after datatype was a dict
that could loop over keys in *insertion* order, not based on some
function of the key itself.

> If there is positive feedback I will submit the PEP to the reviewers,
> so if you think it is a good idea please say so.

As one who was submitted many PEPs, I can advise that the quickest way
to irrevocably kill an idea is to submit a PEP to early (we almost
didn't get genexps because the PEP was submitted pre-maturely).
Instead, I advise posting an ASPN recipe so the details can be
hammered-out and a fan club can be grown.

For this particular idea, I am -1 on inclusion in the standard
library. The PEP focuses mainly on implementation but is weakest in
the rationale section. Its advantages over "for k in sorted(d)" are
miniscule. To be considered for the collections module, there needs
to be compelling use case advantages either in terms of usability or
performance.

After grepping through my accumulated code base for "sorted", I see
remarkably few places where sortdict would have be preferred and in
those few cases, the advantage was so slight that the code would not
be worth changing. The majority of dictionary sorts are characterized
by fully building a dictionary and then sorting it. In those cases,
sortdict merely offers a different spelling and adds a small
performance cost during key insertion. The sortdict has some
potential in the comparatively rare case of a long lived dictionary
where periodic sorted key requests are interleaved with operations
that mutate the dictionary. In those cases, the separate tracking of
added/removed keys may save some sorting effort; however, there is an
offsetting cost of many __hash__/__cmp__ calls in the list
comprehension update and the modest savings was in all cases
trivialized by cost of whatever was initiating the dictionary
mutations (user inputs or web updates).

Also, there were cases when the sortdict was at a distinct
disadvantage. In those cases, there were multiple key functions
(key=record.name or key=record.date or key=record.size). There
standard approach using sorted() creates three separate sorted lists
and only one dict. In contrast, the sortdict approach would need
three sortdicts each with their own dict and sorted list.

IMHO, the collections module should be saved for something much more
powerful and generic like a dictionary that generates a callback
whenever it is mutated. A datatype like that would have it much
easier for you to implement something like this sortdict or the long
sought after dict that remembers its key insertion order.

Raymond

Paul Rubin

unread,

Sep 26, 2007, 2:51:49 PM9/26/07

to

Mark Summerfield <m.n.sum...@googlemail.com> writes:
> The sorteddict API that has emerged so far is (1) apart from the
> constructor, everything is identical to dict,

I don't see this as necessary, it's ok if it resembles dict as much
as dbm does.

> (2) the constructor takes the same args as sorted(), so if you want
> to seed with a dict or with keywords you write
> sorteddict(dict(a=1,b=2), ...), (or you could create a sorteddict
> and use update() since that takes the same args as dict's
> constructor).

sorteddict could itself return a constructor:

d = sorteddict(key=int).new((('12', 'a'), ('3', 'b')))

gives sorted version of {'3':'b', '12':'a'}

James Stroud

unread,

Sep 26, 2007, 3:28:51 PM9/26/07

to

Raymond Hettinger wrote:
> As one who was submitted many PEPs, I can advise that the quickest way
> to irrevocably kill an idea is to submit a PEP to early (we almost
> didn't get genexps because the PEP was submitted pre-maturely).

Maybe its the PEP killers who act prematurely because I friggin' love
genexps and the little generators they generate. LC can be crude by
comparison and is usually the construct that requires a compelling
reason for usage.

James

Raymond Hettinger

unread,

Sep 26, 2007, 4:06:36 PM9/26/07

to

[James Stroud ]

> Maybe its the PEP killers who act prematurely because I friggin' love
> genexps and the little generators they generate.

No, it's the PEP author who controls when they ask for pronouncement.
The natural instinct is to ask for pronouncement as soon as you have
an implementation and a couple of advocates, but the smarter play is
to put the code in the wild and let it mature so that the kinks get
worked out, a fan club can grow, the patterns of usage become
established, and acceptance/rejection becomes self-evident.

Genexps got hung-up on details like whether the loop variable should
be visible (like it is with list comps), was it faster or slower than
list comps (unknown until an implementation was provided), could it be
fit into the grammar unambiguously (the answer was yes *after*
introducing a requirement for enclosing parentheses in some cases).
Also, there were some cases that had (and still have) surprising
behaviors -- in the end, those were addressed by adding advice to
consumer genexps right away and not pass them around like regular
generators where the scoping was more self-evident.

The other disadvantage of asking for pronouncement prematurely is the
risk that everyone gives it a nod without deep thought about whether
it is truly necessary and useful. I got dict.fromkeys() accepted
without much fuss but now am now sure it was a good idea -- it added
to an already heavy dict API but did not offer compelling usability
advantages or performance advantages. Had the set() builtin already
been in place, it is much less likely that dict.fromkeys would have
been accepted.

Raymond

Duncan Booth

unread,

Sep 26, 2007, 4:52:17 PM9/26/07

to

Paul Hankin <paul....@gmail.com> wrote:

>> So should Duncan's
>>
>> def __removekey(self, key):
>> if key in self.__addkeys:
>> del self.__addkeys[key]
>> else:
>> self.__delkeys.add(key)
>>
>> be changed to:
>>
>> def __removekey(self, key):
>> if key in self.__addkeys:
>> del self.__addkeys[key]
>> self.__delkeys.add(key)
>
> Yes, and the same in __addkey: if it's in __delkeys it should be
> removed from there, and added to __addkeys. There's an invariant: any
> key is in at most one of __addkeys and __delkeys.
>

No, don't do that. The code was fine as I had written it, except for the
minor point that sets don't support 'del'! Also there need to be some tests
which actually exercise those two branches of the code: once you add
appropriate tests it will become obvious that you really do need the else.

A key which is in dict must be either in __keycache or in __addkeys, but
never in both.

A key which is not in the dict is either in none of __keycache, __addkeys,
and __delkeys, or it is in both __keycache and __delkeys.

If you don't have the else in both __removekey and __addkey, then deleting
a key leaves the key present in both __keycache and __delkeys (as it
should), and re-adding the key then leaves it in both __keycache and
__addkeys which breaks the first condition and means iterating will see
the key twice. Alternatively adding a key not in the dictionary and then
deleting it leaves it only in __delkeys, which isn't as serious. Actually
maybe you only need it in __addkey, or maybe I still don't have enough
tests.

Here's the code with set removal fixed and the extra test added: try
removing the else from __addkey and you'll find it gives repeated indexes
on iteration.

However, what we really need now is some code which actually uses both
versions of sorteddict, and an ordinary dict sorted when needed, and
demonstrates a real benefit from one or both sorteddict over the ordinary
dict. I think my version should be faster, but it is more complicated and
often in Python the fastest way is also the simplest.

------------- sorteddict.py -----------------------

Check that a mix of adding and removing keys doesn't cause confusion:

>>> d = sorteddict(dict(s=1, a=2, n=3, i=4, t=5, y=6))
>>> d.keys()
['a', 'i', 'n', 's', 't', 'y']

>>> del d["t"]
>>> d["t"]=6

>>> d.keys()
['a', 'i', 'n', 's', 't', 'y']

"""

__author__ = "Mark Summerfield"
__version__ = "1.1.0"

self.__delkeys.remove(key)
else:
self.__addkeys.add(key)

def __removekey(self, key):
if key in self.__addkeys:

self.__addkeys.remove(key)
else:
self.__delkeys.add(key)

----------------------------------------------------

Paul Hankin

unread,

Sep 26, 2007, 6:44:16 PM9/26/07

to

On Sep 26, 9:52 pm, Duncan Booth <duncan.bo...@invalid.invalid> wrote:

> Paul Hankin <paul.han...@gmail.com> wrote:
> >> So should Duncan's
>
> >> def __removekey(self, key):
> >> if key in self.__addkeys:
> >> del self.__addkeys[key]
> >> else:
> >> self.__delkeys.add(key)
>
> >> be changed to:
>
> >> def __removekey(self, key):
> >> if key in self.__addkeys:
> >> del self.__addkeys[key]
> >> self.__delkeys.add(key)
>
> > Yes, and the same in __addkey: if it's in __delkeys it should be
> > removed from there, and added to __addkeys. There's an invariant: any
> > key is in at most one of __addkeys and __delkeys.
>
> No, don't do that. The code was fine as I had written it, except for the
> minor point that sets don't support 'del'! Also there need to be some tests
> which actually exercise those two branches of the code: once you add
> appropriate tests it will become obvious that you really do need the else.
>
> A key which is in dict must be either in __keycache or in __addkeys, but
> never in both.

Yes, I'm sorry: you're right.

But there's a different bug: if you delete a key that's not in the
dict, you'll add it to the deleted list before the exception for the
missing key is raised.

sd = sorteddict.sorteddict()
sd['a'] = 'a'
print sd.keys() = ['a']
try:
del sd['b']
except:
pass
sd['b'] = 'b'
print sd.keys()

The second print statement produces ['a'] rather than ['a', 'b']

--
Paul Hankin

Antoon Pardon

unread,

Sep 27, 2007, 2:49:31 AM9/27/07

to

On 2007-09-26, Mark Summerfield <m.n.sum...@googlemail.com> wrote:

> On 26 Sep, 13:22, Antoon Pardon <apar...@forel.vub.ac.be> wrote:
>>
>> Well you should decide what you want. In a previous exchange one of the
>> things that was wanted was that you could already seed a tree by using
>> key word arguments so that you could do the following:
>>
>> t=Tree(a=15, b=30)
>>
>> which would be equivallent to:
>>
>> t=Tree()
>> t.update(a=15,b=30)
>>
>> But with the above proposal
>>
>> t=Tree(cmp=cmp)
>>
>> is not equivallent to
>>
>> t=Tree()
>> t.update(cmp=cmp)
>>
>> So it seems the API needs some more sorting out.
>
> I think you missed some of the previous postings.

That is possible.

>
> The sorteddict API that has emerged so far is (1) apart from the
> constructor, everything is identical to dict, (2) the constructor
> takes the same args as sorted(), so if you want to seed with a dict or
> with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
> create a sorteddict and use update() since that takes the same args as
> dict's constructor).
>
> Could your AVLTree support cmp and keys and reverse?

It already supports cmp. Although for the moment it is done by
subclassing. Like the following.

class myTree(AVLTree):
@staticmethod
def cmp(a, b);
return ....

I don't think adding support for keys will be difficult but it may be
tedious.

I would prefer not to support reverse because the effect can be acquired
by the other two and in contradiction with sorting, I think using it
will slow things down instead of speeding things up.

--
Antoon Pardon

Duncan Booth

unread,

Sep 27, 2007, 3:32:03 AM9/27/07

to

Paul Hankin <paul....@gmail.com> wrote:

>> A key which is in dict must be either in __keycache or in __addkeys, but
>> never in both.
>
> Yes, I'm sorry: you're right.
>
> But there's a different bug: if you delete a key that's not in the
> dict, you'll add it to the deleted list before the exception for the
> missing key is raised.
>
> sd = sorteddict.sorteddict()
> sd['a'] = 'a'
> print sd.keys() = ['a']
> try:
> del sd['b']
> except:
> pass
> sd['b'] = 'b'
> print sd.keys()
>
> The second print statement produces ['a'] rather than ['a', 'b']

Yes, I think there are probably several cases where I need to account for
exceptions.

There's another serious problem: if Mark wants this module to be used and
stand any chance at all of eventually being incorporated into the standard
library, then he will have to change the license on his code: the GPL
simply isn't an option here.

Bryan Olson

unread,

Sep 27, 2007, 4:11:29 AM9/27/07

to

Mark Summerfield wrote:
> The sorteddict API that has emerged so far is (1) apart from the
> constructor, everything is identical to dict, (2) the constructor
> takes the same args as sorted(), so if you want to seed with a dict or
> with keywords you write sorteddict(dict(a=1,b=2), ...), (or you could
> create a sorteddict and use update() since that takes the same args as
> dict's constructor).
>
> Could your AVLTree support cmp and keys and reverse?

I think you are expending too much effort on the no-brainers.
Yes, AVL-trees can support cmp and keys and reverse. I cannot
speak for Antoon, but have no doubt that adapting his code to
do so is within his ability.

Hrvoje Niksic is right that a simple layer over existing
hashed dicts "dooms the PEP to rejection". Thus the candidate
algorithms are the approximately-balanced tree structures:
AVL trees, red-black trees, 2-3 trees, b-trees, splay trees,
or skip-lists.

Implementing these is non-trivial, but easily within the
ability of many people here. Don't worry about the coding.

--
--Bryan

Mark Summerfield

unread,

Sep 27, 2007, 4:13:21 AM9/27/07

to

On 27 Sep, 08:32, Duncan Booth <duncan.bo...@invalid.invalid> wrote:

I have fixed __addkey() and __removekey():

def __addkey(self, key):
if key in self.__delkeys:
self.__delkeys.remove(key)
else:
self.__addkeys.add(key)

def __removekey(self, key):
if key in self.__addkeys:

self.__addkeys.remove(key)
else:
self.__delkeys.add(key)

I have also fixed __delitem__():

try:
dict.__delitem__(self, key)
except KeyError:
raise
else:
self.__removekey(key)

As for the license, while it is on PyPI, I'll leave it as GPL v 3. If
it was wanted for the standard library (and I can't see that ever
happening), I will happily change it to the one that is preferred for
Python modules.

Mark Summerfield

unread,

Sep 27, 2007, 4:26:16 AM9/27/07

to

On 26 Sep, 18:59, Raymond Hettinger <pyt...@rcn.com> wrote:
> [Mark Summerfield]
>
> > Below is a PEP proposal for a sorteddict. It arises out of a
> > discussion on this list that began a few weeks ago with the subject of
> > "An ordered dictionary for the Python library?"
>
> It is worth remembering that the long sought after datatype was a dict
> that could loop over keys in *insertion* order, not based on some
> function of the key itself.
>
> > If there is positive feedback I will submit the PEP to the reviewers,
> > so if you think it is a good idea please say so.
>
> As one who was submitted many PEPs, I can advise that the quickest way
> to irrevocably kill an idea is to submit a PEP to early (we almost
> didn't get genexps because the PEP was submitted pre-maturely).
> Instead, I advise posting an ASPN recipe so the details can be
> hammered-out and a fan club can be grown.
>
> For this particular idea, I am -1 on inclusion in the standard
> library. The PEP focuses mainly on implementation but is weakest in
> the rationale section. Its advantages over "for k in sorted(d)" are
> miniscule. To be considered for the collections module, there needs
> to be compelling use case advantages either in terms of usability or
> performance.

[snip]

I gave up on the idea of submitting it as a PEP many postings ago:-)
I've now changed the subject line to reflect that.

I will leave the pure python sorteddict on PyPI for anyone who wants
it.

I won't put it on the cookbook site for the time being, simply because
it is eating up much more time than I thought!

Duncan Booth

unread,

Sep 27, 2007, 5:13:42 AM9/27/07

to

Mark Summerfield <m.n.sum...@googlemail.com> wrote:

> As for the license, while it is on PyPI, I'll leave it as GPL v 3. If
> it was wanted for the standard library (and I can't see that ever
> happening), I will happily change it to the one that is preferred for
> Python modules.

Ok, your choice, just be aware that by using such a restrictive license you
will dissuade a lot of people from using your code. You've prevented your
module being used in any existing projects with Python license, GPL v2, or
probably any license other than GPL v3.

Paul Rubin

unread,

Sep 27, 2007, 5:24:33 AM9/27/07

to

Duncan Booth <duncan...@invalid.invalid> writes:
> Ok, your choice, just be aware that by using such a restrictive license you
> will dissuade a lot of people from using your code. You've prevented your
> module being used in any existing projects with Python license, GPL v2, or
> probably any license other than GPL v3.

GPL2 projects seem to be migrating nicely to GPL3 so GPL2 shouldn't be
too much of a problem.

Since one of the Python license's goals is to allow re-use in
proprietary programs, and the GPL family's main goal is the exact
opposite, I'd say if someone chooses a GPL, then preventing the code
from being used in a Python-licensed project was part of the intention
and not an undesired consequence. While Python itself is unlikely to
switch to a new license in order to import GPL code, other projects
can and have done exactly that.

ga...@dsdata.it

unread,

Sep 27, 2007, 8:50:45 AM9/27/07

to

I don't see a focused discussion of computational complexity of a
sorted dict; its API cannot be simpler than sorting a dictionary and
it has issues and complications that have already been discussed
without completely satisfactory solutions, so the only possible reason
to adopt a sorted dict is that some important use case for mapping
types becomes significantly cheaper.

With n entries, the size of a non-sorted hashtable, of a hashtable
plus lists or sets of keys, and of reasonable sorted dict
implementations with trees are all O(n). No substantial space
advantage can be obtained by sorting dictionaries.

Iterating through all n entries of a mapping, once, in sorted order,
is O(n) time and O(1) space with an unsorted hash table, a hash table
with a sorted list of the keys and all types of tree that I know of.
If there is a performance gain, it must come from amortizing
insertions, deletions and index-building. (The other operation, value
updates for an existing key, doesn't matter: updates cause no
structural changes and they must not invalidate any iterator.)

Let's consider a very simple use case: n insertions followed by x
iterations through all entries and n*y lookups by key.

Cost for a hashtable and an ad hoc sorted list of the keys,
fundamentally equivalent to sorting a Python dict:

O(n) for insertions
O(n log n) for indexing
O(nx) for iterations
O(ny) for lookups

Cost for a tree:

O(n log n) for insertions
no indexing
O(nx) for iterations
O(ny log n) for lookups

The hashtable comes out ahead because of cheaper lookups, for any x
and y; note that without lookups there is no reason to use a mapping
instead of a list of (key,value) tuples.

With an equal number k of insertions and deletions between the
iterations, the hashtable must be reindexed x times:

O(n) for insertions
O(kx) for updates and deletions
O(nx log n) for indexing and reindexing
O(nx) for iterations
O(ny) for lookups

The tree might be cheaper:

O(n log n) for insertions
O(kx log n) for updates and deletions
no indexing and reindexing
O(nx) for iterations
O(ny log n) for lookups

For a fixed small k, or with k proportional to n, reindexing the
hashtable and lookups in the tree are equally mediocre.

Maybe we could make k changes in the middle of each iteration. For a
naively reindexed hashtable:

O(n) for insertions
O(kx) for updates and deletions
O(knx log n) for indexing and reindexing
O(nx) for iterations
O(ny) for lookups

For a tree, the costs remain as above: the new factor of n for the
hashtable is fatal. Clever updates of the existing index or use of a
heap would lower the cost, but they would need to be encapsulated as a
sorteddict implementation.

Is this a practical use case? When are sequential visits of all
elements in order frequently suspended to make insertions and
deletions, with a need for efficient lookup by key?
- Priority queues; but given the peculiar pattern of insertions and
deletions there are more suitable specialized data structures.
- A* and similar best-first algorithms.
It's a small but important niche; maybe it isn't important enough for
the standard library. Other absent datatypes like heaps, an immutable
mapping type similar to frozenset and tuple, or disjoint sets with
would be more fundamental and general, and a mapping that remembers
the order of insertion regardless of keys would be equally useful.
In the Java collections framework all these kinds of mapping and
others coexist peacefully, but Python doesn't have the same kitchen
sink approach to basic libraries.

Regarding the API, a sorted dict should not expose random access by an
entry's position in the sequence: it is a gratuitous difficulty for
the implementor and, more importantly, a perversion of the mapping
data type. For that purpose there are lists and tuples, or explicit
indices like those of the Boost multi-index containers (http://
www.boost.org/libs/multi_index).
The only differences with dict should be the constraint that items(),
keys(), values(), iteritems(), iterkeys(), itervalues() return entries
sorted by key.

Regards,

Lorenzo Gatti

Antoon Pardon

unread,

Sep 29, 2007, 9:49:45 AM9/29/07

to

On 2007-09-27, ga...@dsdata.it <ga...@dsdata.it> wrote:

> Is this a practical use case? When are sequential visits of all
> elements in order frequently suspended to make insertions and
> deletions, with a need for efficient lookup by key?

Does it need to be a sequential visit of *all* elements?

Suppose you have a mapping of start times to tasks. You can then want to
iterate over all tasks that need to be started between noon en 4 pm next
monday. If you have a hashtable you still will need to sort all the keys
even if you will visit only 10%. If you have a tree you can just visit the
specified keys.

--
Antoon Pardon

Duncan Booth

unread,

Sep 29, 2007, 10:23:06 AM9/29/07

to

Antoon Pardon <apa...@forel.vub.ac.be> wrote:

It still depends on the exact pattern of use. If you have an implementation
which tracks additions and deletions and sorts them into the list when
required (as we came up with earlier) then this is much more efficient that
re-sorting the whole list every time. Sorting a large sorted list with a
few unsorted elements on the end is comparable to inserting the elements
individually into a tree, and you still have the hashtable benefits on
accessing elements to help level the playing field.

For me though, the most convincing use-case for a sorted dictionary is one
that I don't think has been mentioned yet:
There are situations when you want to use a dictionary with existing
library code that doesn't care about the random key ordering, but you have
additional requirements which the original code didn't know about.

For example, in the sorteddict code I added an implementation for the
__repr__ method and an associated doctest. Unlike iteration over sorteddict
itself, I didn't bother to make __repr__ stable, so in that particular
doctest it only tests the repr of a sorteddict with a single element.
If that was fixed to make the repr stable it would be a real benefit for
writing any doctests which want to produce a dictionary as a result.

Another example would be if you had a library which serialised a dictionary
to xml. There is nothing wrong with the library if it doesn't care about
order, but if you have some other reason why you want the xml to be stable
(e.g. because you store it in a version control system and want to compare
revisions) then a sorteddict would allow you to impose that behaviour on
the library from outside.

Contrary to my earlier insistence that sorteddict is only really useful if
you can have a key parameter, both of these examples simply want an
arbitrary but defined order of iteration for dictionary keys. A much
simpler sorteddict that has been discussed earlier would be sufficient.

thebjorn

unread,

Sep 29, 2007, 11:11:36 AM9/29/07

to

On Sep 29, 4:23 pm, Duncan Booth <duncan.bo...@invalid.invalid> wrote:
[...]

> Another example would be if you had a library which serialised a dictionary
> to xml. There is nothing wrong with the library if it doesn't care about
> order, but if you have some other reason why you want the xml to be stable
> (e.g. because you store it in a version control system and want to compare
> revisions) then a sorteddict would allow you to impose that behaviour on
> the library from outside.
>
> Contrary to my earlier insistence that sorteddict is only really useful if
> you can have a key parameter, both of these examples simply want an
> arbitrary but defined order of iteration for dictionary keys. A much
> simpler sorteddict that has been discussed earlier would be sufficient.

In fact, a dictionary that maintains insertion order would work in
this case too.

-- bjorn

Duncan Booth

unread,

Sep 29, 2007, 1:13:46 PM9/29/07

to

thebjorn <BjornSteinarF...@gmail.com> wrote:

It would be better than the current situation, but say I have elements 'a',
'b', and 'c'. Next run of the program I delete 'b', then the run after that
I add 'b' back into the list so now I might get 'a', 'c', 'b'. Sorting
would ensure that I can tell that I actually reverted a change.

Right now I think there are probably three dict variants needed: sorteddict
(still waiting for a convincing use case), ordereddict (lots of use cases),
and this one: stabledict.

thebjorn

unread,

Sep 29, 2007, 1:58:11 PM9/29/07

to

On Sep 29, 7:13 pm, Duncan Booth <duncan.bo...@invalid.invalid> wrote:
[...]

> Right now I think there are probably three dict variants needed: sorteddict
> (still waiting for a convincing use case), ordereddict (lots of use cases),
> and this one: stabledict.

What's stabledict? I'm assuming that ordereddict is a mapping that
maintains insertion order(?)

The only other mapping type I use very frequently is a dict where the
keys are limited to valid identifiers, and where attribute lookup
(d.foo) is defined as key lookup (d['foo']). It makes lots of code
easier to read (and write).

In the Smalltalk collection hierarchy SortedCollection is a subclass
of OrderedCollection, which implies to me that it'd be better to add
an ordereddict first.

-- bjorn

Hamilton, William

unread,

Oct 1, 2007, 8:24:57 AM10/1/07

to pytho...@python.org

> From: thebjorn

> What's stabledict? I'm assuming that ordereddict is a mapping that
> maintains insertion order(?)

Yes, ordereddict is a dict that maintains insertion order. Stabledict
is probably a dict that maintains _an_ order, so that repr() and the
like return the same value when used on dicts containing the same data.

> In the Smalltalk collection hierarchy SortedCollection is a subclass
> of OrderedCollection, which implies to me that it'd be better to add
> an ordereddict first.

That depends entirely on how ordereddict and sorteddict function. If
they are similar there might be a benefit. However, an ordereddict
would probably be best implemented with an internal list of keys,
whereas the consensus seems to be using a tree for sorteddict. In this
case, trying to build sorteddict from ordereddict is going to give you
extra baggage and overhead for no benefit.

--
-Bill Hamilton

Duncan Booth

unread,

Oct 1, 2007, 8:51:12 AM10/1/07

to

"Hamilton, William " <wha...@entergy.com> wrote:

>> From: thebjorn
>> What's stabledict? I'm assuming that ordereddict is a mapping that
>> maintains insertion order(?)
>
> Yes, ordereddict is a dict that maintains insertion order. Stabledict
> is probably a dict that maintains _an_ order, so that repr() and the
> like return the same value when used on dicts containing the same data.

Yes, that was what I was suggesting in an unclear manner: stabledict would
be a dictionary which returned the same repr for the same contents no
matter what order the elements were inserted.

>> In the Smalltalk collection hierarchy SortedCollection is a subclass
>> of OrderedCollection, which implies to me that it'd be better to add
>> an ordereddict first.
>
> That depends entirely on how ordereddict and sorteddict function. If
> they are similar there might be a benefit. However, an ordereddict
> would probably be best implemented with an internal list of keys,
> whereas the consensus seems to be using a tree for sorteddict. In this
> case, trying to build sorteddict from ordereddict is going to give you
> extra baggage and overhead for no benefit.
>

Subclassing doesn't have to imply a common implementation, just a common
interface.

Terry Reedy

unread,

Oct 2, 2007, 10:41:11 AM10/2/07

to pytho...@python.org

"Duncan Booth" <duncan...@invalid.invalid> wrote in message
news:Xns99BC8BF9F1...@127.0.0.1...

| Subclassing doesn't have to imply a common implementation, just a common
| interface.

True, but in Python, subclassing is usually done to reuse implementation.
Interface subclassing is usually from a common abstract base class (which
will be better supported in Py3)..

Paul Rubin

unread,

Oct 12, 2007, 4:17:35 AM10/12/07

to

Mark Summerfield <m.n.sum...@googlemail.com> writes:
> Below is a PEP proposal for a sorteddict. ...

Is this proposal dead? I'd been meaning to post some thoughts which I
still haven't gotten around to writing up, and am wondering whether to
keep it on my todo list.

Mark Summerfield

unread,

Oct 12, 2007, 11:07:09 AM10/12/07

to

On 12 Oct, 09:17, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

Yes, it's dead.

I've left the sorteddict implementation on PyPI for those that want
it, but there will be no PEP. GvR's view is that people should use
dict + sorted().