Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

mini-howto: sorting

23 views
Skip to first unread message

Andrew Dalke

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to how-to editor
I know, it's somewhat silly, but I went through a little tutorial
here showing a half dozen ways to sort a list with the built-in
"sort" method. It was short so I wrote it up in case it is
deserving of a mini HOWTO. (Yeah, I still haven't written up the
exception classes HOWTO ): I think it's deserving because there's
not one spot in the manual or Lutz' book talking about the different
ways.

To be more specific, there's nothing that reminds you that the
default sort on classes can call the object's __cmp__ method as
another way to define your own sort comparison function. Or at
least, I couldn't find it.

For attribution, I believe I got the idea of sorting through an
intermediate list from a posting of Christian Tismer's some months
back.

Andrew


Python lists have a built-in sort method. There are many ways to use
it to sort a list and there doesn't appear to be a single, central
place in the various manuals describing them, so I'll do so here.


Basic data types:

>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> print a
[1, 2, 3, 4, 5]

Sort takes an optional function which can be called for doing the
comparisons. The default sort routine is equivalent to

>>> a = [5, 2, 3, 1, 4]
>>> a.sort(cmp)
>>> print a
[1, 2, 3, 4, 5]

where "cmp" is a built-in function which compares two objects, x and
y, and returns -1, 0 or 1 depending on if x<y, x==y, or x>y. During
the course of the sort the relationships must stay the same for the
final list to make sense.

If you want, you can define your own function for the comparison. For
integers (and numbers in general) we can do:

>>> def numeric_compare(x, y):
>>> return x-y
>>>
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(numeric_compare)
>>> print a
[1, 2, 3, 4, 5]

Or, if you don't want to define a new named function you can create an
anonymous one using lamba, as in:

>>> a = [5, 2, 3, 1, 4]
>>> a.sort(lambda x, y: x-y)
>>> print a
[5, 4, 3, 2, 1]

If you want the numbers sorted in reverse you can do

>>> a = [5, 2, 3, 1, 4]
>>> def reverse_numeric(x, y):
>>> return y-x
>>>
>>> a.sort(reverse_numeric)
>>> print a
[5, 4, 3, 2, 1]

(a more general implementation could return cmp(y,x) or -cmp(x,y)).

However, it's faster if Python doesn't have to call a function for
every comparison so if you want a reversed sorted list of basic data
types do the forward sort first, then use the "reverse" method.

>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a.reverse()
>>> print a
[1, 2, 3, 4, 5]

Here's a case-insensitive string comparison using a lambda function:

>>> import string
>>> a = string.split("This is a test string from Andrew.")
>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
>>> print a
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']


This goes through the overhead of converting a word to lower case
every time it must be compared. At times it may be faster to compute
these once and use those values, and the following example shows how.

>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for i in range(len(words)):
>>> offsets.append( (string.lower(words[i]), i) )
>>>
>>> offsets.sort()
>>> new_words = []
>>> for i in range(len(words)):
>>> new_words.append(words[offsets[i][1]])
>>>
>>> print new_words

The "offsets" list is initialized to a tuple of the lower-case string
and its position in the "words" list. It is then sorted. Python's
sort method sorts tuples by comparing terms; given x and y, compare
x[0] to y[0], then x[1] to y[1], etc. until there is a difference.

The result is the "offsets" list is ordered by its first term, and the
second term can be used to figure out where the original data was
stored.

Another way to implement this is to store the original data as the
second term in the "offsets" list, as in:

>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for word in words:
>>> offsets.append( (string.lower(word), word) )
>>>
>>> offsets.sort()
>>> new_words = []
>>> for word in offsets:
>>> new_words.append(word[1])
>>>
>>> print new_words

This isn't always appropriate because the second terms in the list
(the "word" in this example) will be compared when the first terms are
the same. If this happens many times, then there will be the unneeded
performance hit of comparing the two objects. This can be a large
cost if most terms are the same and the objects define their own
__cmp__ method, but there will still be some overhead to determine if
__cmp__ is defined.

Still, for large lists, or for lists where the comparison information
is expensive to calculate, the last two examples are likely to be the
fastest way to sort a list. It will not work on weakly sorted data,
like complex numbers, but if you don't know what that means, you
probably don't need to worry about it.


Comparing classes

The comparison for two basic data type, like ints to ints or string to
string, is built in to Python and makes sense. There is a default way
to compare class instances, but the default manner isn't usually very
useful. You can define your own comparison with the __cmp__ method,
as in:

>>> class Spam:
>>> def __init__(self, spam, eggs):
>>> self.spam = spam
>>> self.eggs = eggs
>>> def __cmp__(self, other):
>>> return cmp(self.spam+self.eggs, other.spam+other.eggs)
>>> def __str__(self):
>>> return str(self.spam + self.eggs)
>>>
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort()
>>> for spam in a:
>>> print str(spam)
5
10
12

Sometimes you may want to sort by a specific attribute of a class. If
appropriate you should just define the __cmp__ method to compare
those values, but you cannot do this if you want to compare between
different attributes at different times. Instead, you'll need to go
back to passing a comparison function to sort, as in:

>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
>>> for spam in a:
>>> print spam.eggs, str(spam)
3 12
4 5
6 10

If you want to compare two arbitrary attributes (and aren't overly
concerned about performance) you can even define your own comparison
function object. This uses the ability of a class instance to emulate
an function by defining the __call__ method, as in:

>>> class CmpAttr:
>>> def __init__(self, attr):
>>> self.attr = attr
>>> def __call__(self, x, y):
>>> return cmp(getattr(x, self.attr), getattr(y, self.attr))
>>>
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(CmpAttr("spam")) # sort by the "spam" attribute
>>> for spam in a:
>>> print spam.spam, spam.eggs, str(spam)
1 4 5
4 6 10
9 3 12

>>> a.sort(CmpAttr("eggs")) # re-sort by the "eggs" attribute
>>> for spam in a:
>>> print spam.spam, spam.eggs, str(spam)
9 3 12
1 4 5
4 6 10


So, there you have it; about a half-dozen different ways to define how
to sort a list:
sort using the default method
sort using a comparison function
reverse sort not using a comparison function
sort on an intermediate list (two forms)
sort using class defined __cmp__ method
sort using a sort function object

Kevin Dahlhausen

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
Andrew Dalke <da...@bioreason.com> wrote:

>I know, it's somewhat silly, but I went through a little tutorial

No it's not at all silly. Actually, if anything, it's not silly enough!
Just kidding. Around here, silliness is an art form.
Thanks for putting this together.

Have you ever looked at sgml and linuxdoc? In case not, linuxdoc is the
definition of a markup language. If you add the linuxdoc tags (like html) to
a text file, you can print very nice postscript, text, latex, and windows help
(what an oxymoron) files.


Michael Scharf

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
Andrew Dalke wrote:
>
> I know, it's somewhat silly, but I went through a little tutorial
> here showing a half dozen ways to sort a list with the built-in
> "sort" method. It was short so I wrote it up in case it is
> deserving of a mini HOWTO. (Yeah, I still haven't written up the
> exception classes HOWTO ): I think it's deserving because there's
> not one spot in the manual or Lutz' book talking about the different
> ways.
[snip]

> >>> class CmpAttr:
> >>> def __init__(self, attr):
> >>> self.attr = attr
> >>> def __call__(self, x, y):
> >>> return cmp(getattr(x, self.attr), getattr(y, self.attr))
> >>>
[snip]

I would like to add yet another way of sorting: a little
sorting framework using design patterns [1]. A few days
ago, a colleague asked me how to sort tables on different
columns (some of them might be sorted inverse). In a
few minutes I created a little framework in python
(which he then implemented in C++). This is yet another
example of python as prototyping language. But I think,
the framework can also be used in python directly.

Let's take the Spam class:

>>> class Spam:
>>> def __init__(self, spam, eggs):
>>> self.spam = spam
>>> self.eggs = eggs

>>> def __repr__(self):
>>> return 'Spam(s=%s,e=%s)' %(repr(self.spam),repr(self.eggs))
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6),Spam(4,4)]

If you want to sort on two fields (e.g columns in tables or
two attributes) you might use the composite comparer:

>>> class CmpComposite:
>>> """Takes a list of compare functions and sorts in that order."""
>>> def __init__(self,*comparers):
>>> self.comparers=comparers
>>> def __call__(self,a,b):
>>> for cmp in self.comparers:
>>> c=cmp(a,b)
>>> if c:
>>> return c
>>> return 0

Now you can sort on Spam.spam, eggs.spam:

>>> a.sort(CmpComposite(CmpAttr('spam'),CmpAttr('eggs')))
>>> print a
[Spam(s=1,e=4), Spam(s=4,e=4), Spam(s=4,e=6), Spam(s=9,e=3)]

Or you can sort on Spam.eggs, Spam.spam:

>>> a.sort(CmpComposite(CmpAttr('eggs'),CmpAttr('spam')))
>>> print a
[Spam(s=9,e=3), Spam(s=1,e=4), Spam(s=4,e=4), Spam(s=4,e=6)

But what if you want to sort inverse on spam? Define a CompInverse:

>>> class CmpInverse:
>>> """Inverses the effect of a cmp."""
>>> def __init__(self,cmp):
>>> self.cmp=cmp
>>> def __call__(self,a,b):
>>> return -self.cmp(a,b)

Let's try it:

>>> a.sort(CmpInverse(CmpAttr('spam')))
[Spam(s=9,e=3), Spam(s=4,e=6), Spam(s=4,e=4), Spam(s=1,e=4)

OK, and now sort on eggs and inverse on spam:

>>> a.sort(CmpComposite(CmpAttr('eggs'),CmpInverse(CmpAttr('spam'))))
[Spam(s=9,e=3), Spam(s=4,e=4), Spam(s=1,e=4), Spam(s=4,e=6)]

We can use our little framework to sort tables as well:

>>> class CmpColumn:
>>> """Sorts on an index of a sequence."""
>>> def __init__(self,column):
>>> self.column=column
>>> def __call__(self,a,b):
>>> return cmp(a[self.column],b[self.column])
>>>
>>> table=[(1,2), (1,3), (1,0), (2,0), (0,3), (3,2)]

Let's now sort our table on column 0 and inverse on column 1:

>>> table.sort(CmpComposite(CmpInverse(CmpColumn(1)),CmpColumn(0)))
>>> print table
[(0, 3), (1, 3), (1, 2), (1, 0), (2, 0), (3, 2)]


In terms of design patterns, we can identify CmpInverse
as a Decorator and CmpComposite as Composite pattern.
CmpColumn and CmpAttr are Strategy patterns. It is now
relatively easy to sort on arbitrary combinations of
attributes. The disadvantage is: it can become quite
slow for big lists.


[1] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides:
Design Patterns: Elements of Reusable Object-Oriented Software.
Addison-Wesley Pub Co; ISBN: 0201633612; 1994.
http://www.amazon.com/exec/obidos/ASIN/0201633612

Michael
--
''''\ Michael Scharf
` c-@@ TakeFive Software
` > http://www.TakeFive.com
\_ V mailto:Michael...@TakeFive.co.at

Comparers.py

Evan Simpson

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
Michael Scharf wrote in message <36C82A80...@gmx.de>...

>Let's now sort our table on column 0 and inverse on column 1:
>
>>>> table.sort(CmpComposite(CmpInverse(CmpColumn(1)),CmpColumn(0)))
>>>> print table
>[(0, 3), (1, 3), (1, 2), (1, 0), (2, 0), (3, 2)]

Forgive me for taking your excellent exposition of design patterns off on a
tangent, but this sort of construction is exactly the sort of area which
could benefit from the function-building tools which John Skaller has been
requesting. Imagine writing your Cmp* classes like this:

def _composite(comparers, a,b):
for comp in comparers:
c=comp(a,b)


if c:
return c
return 0

def _invert(a, b):
return b, a

def _column(column, a):
return a[column]

def CmpComposite(*comparers):
# make a closure of _composite with argument comparers
return _composite + (comparers,)

def CmpInverse(compare):
# make a composition of compare with _invert
return compare * _invert

def CmpColumn(compare, column):
# make a composition of compare with two closures of _column
ccol = _column + (column,)
return compare * (ccol, ccol)

and have CmpComposite(CmpInverse(CmpColumn(cmp, 1)),CmpColumn(cmp, 0)))
return a constructed code object which, instead of having nested function
calls with attribute lookups, is equivalent to:

def _(a, b, comparers=(_1, _2)):
for comp in comparers:
c=comp(a,b)


if c:
return c
return 0

where:

def _1(a, b):
a, b = a[1], b[1]
a, b = b, a
return cmp(a, b)

def _2(a, b):
a, b = a[0], b[0]
return cmp(a, b)

except that there isn't really an optional third argument to the function.
All of this can be (and has been) simulated with classes, but wouldn't it be
nice to have our Python first-class function objects treated as truly
first-class in this way? For anyone who had trouble parsing the notation,
"(f * g)(x)" == "f(g(x))" is composition, "(f + (x,))(y)" == "f(x,y)" is
closure, and "(f, g)(x)" == "(f(x), g(x))" is product (only when composed or
applied, of course, otherwise it's just a tuple). Summation isn't used
above, but could be written as "{1: f, 2: g}(x, y)" == "if x==1: f(y); elif
x==2: g(y)" (again, only when composed or applied). There would also be an
identity function idfun, which could be used like "idfun * (f, g)" when a
neutral closure is needed.

Mmmm... off topic posts
Evan Simpson

Gaetan Corneau

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
Sorry for posting it to the list: I could not reach Andrew at his e-mail
address,

Thanks for the sorting "mini-howto".

I'm writing to ask permission to translate it to french and include it in my
Python crash course document. I'm writing this document for my colleages at
work who want to learn Python to do "basic" scripting stuff. I'm not getting
paid for this (I do it on my own time).

As soon as the document is finished, I will modify it (it is full of inside
jokes specific to our company) and post it at Python.org (or some other
Python foreing-language documentation repository).

Of course, I would give you full credit for the sorting section. I have
nothing written on this subject, and translating your text, which is clear
and concise, would be easier than starting from scratch.
______________________________________________________
Gaetan Corneau
Software Developer (System integration Team)
BaaN Supply Chain Solutions
E-mail: Gaetan_...@baan.com
Compuserve: Gaetan_...@compuserve.com
ICQ Number: 7395494
Tel: (418) 654-1454 ext. 252
______________________________________________________
"Profanity is the one language all programmers know best"

Andrew Dalke

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
Summerizing responses:

Kevin Dahlhausen <mo...@harborcom.net> said:
> No it's not at all silly. Actually, if anything, it's not silly
> enough!

Alas, I haven't reached the zen point where I can be both clear and
funny, so I try to stick with clear.

Gaetan Corneau <Gaetan_...@baan.com>


> I could not reach Andrew at his e-mail address,

Strange. I wonder what happened. If da...@bioreason.com doesn't
work then da...@acm.org should be fine.

> I'm writing to ask permission to translate it to french and include
> it in my Python crash course document.

Oops. Forgot to give the copyright statement. Umm... "Permission
to use, translate, modify, redistribute, sort and fold this
documentation is hereby granted."


Michael Scharf <Michael...@gmx.de>


> I would like to add yet another way of sorting: a little
> sorting framework using design patterns [1]

That, of course, generalizes what I was doing, but I can guarantee
you that most of the people I was targeting the writeup for would
have a hard time following what you were doing.


Andrew
da...@acm.org

Aahz Maruch

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
In article <36C891CB...@bioreason.com>,

Andrew Dalke <da...@bioreason.com> wrote:
>
>Oops. Forgot to give the copyright statement. Umm... "Permission
>to use, translate, modify, redistribute, sort and fold this
>documentation is hereby granted."

You forgot "spindle or mutilate".
--
--- Aahz (@netcom.com)

Hugs and backrubs -- I break Rule 6 <*> NEW! http://www.rahul.net/aahz/
Androgynous poly kinky vanilla queer het

I guess I mostly see Life [tm] as a process of closing doors that you
might want to go through.

Andrew M. Kuchling

unread,
Feb 15, 1999, 3:00:00 AM2/15/99
to
Kevin Dahlhausen writes:

>Andrew Dalke <da...@bioreason.com> wrote:
>Have you ever looked at sgml and linuxdoc? In case not, linuxdoc is the
>definition of a markup language. If you add the linuxdoc tags (like html) to
>a text file, you can print very nice postscript, text, latex, and windows help
>(what an oxymoron) files.

The Python docs (and that includes the HOWTOs) are currently
maintained in LaTeX, and we have a toolchain that can produce PDF,
PostScript, or HTML from the LaTeX. It will, eventually, be converted
to SGML (or XML? Only Fred Drake knows for sure...), and some work
has been done on that. For the moment, I'll work on doing a quick
reformat of Andrew D.'s document into LaTeX.

BTW, the follow-ups by Michael Scharf and Evan Simpson, doing
more elaborate patterns for sorting, are quite interesting. I'd
recommend adding them to the document in a section described as
'advanced'. Anyone volunteer to write something? (Actually,
Michael's post might be suitable as it stands.)

--
A.M. Kuchling http://starship.python.net/crew/amk/
"Can I have a name?"
"Don't you have one? ... If you don't have a name, what do people call
you? I mean, do they just wave and smile, or jingle little silver bells or
what?"
-- The receptionist and Delirium, in SANDMAN #43: "Brief Lives:3"


Michael Scharf

unread,
Feb 16, 1999, 3:00:00 AM2/16/99
to
"Andrew M. Kuchling" wrote:
> (Actually,
> Michael's post might be suitable as it stands.)

Add it, or modify it as you like :-)

Tim Peters

unread,
Feb 16, 1999, 3:00:00 AM2/16/99
to
[Andrew Dalke]

> I know, it's somewhat silly, but I went through a little tutorial
> here showing a half dozen ways to sort a list with the built-in
> "sort" method.

Good idea!

> ...


> where "cmp" is a built-in function which compares two objects, x and
> y, and returns -1, 0 or 1 depending on if x<y, x==y, or x>y.

Not quite: cmp is defined to return <0, ==0, or >0; -1 and +1 aren't the
only possible non-zero return values.

> ...


> >>> def numeric_compare(x, y):
> >>> return x-y

This isn't good. Can e.g. overflow if x and y are ints. That's why cmp
isn't defined in terms of subtraction <wink>. A safe emulation is:

def numeric_compare(x, y):
if x == y:
return 0
elif x < y:
return -1
else:
return 1

> ...


> Or, if you don't want to define a new named function you can create an
> anonymous one using lamba, as in:
>
> >>> a = [5, 2, 3, 1, 4]
> >>> a.sort(lambda x, y: x-y)
> >>> print a
> [5, 4, 3, 2, 1]

Either the lambda or the output is wrong in this one (the output is reversed
or the subtract is).

> ...


> >>> words = string.split("This is a test string from Andrew.")
> >>> offsets = []
> >>> for i in range(len(words)):
> >>> offsets.append( (string.lower(words[i]), i) )
> >>>
> >>> offsets.sort()
> >>> new_words = []
> >>> for i in range(len(words)):
> >>> new_words.append(words[offsets[i][1]])
> >>>
> >>> print new_words
>
> The "offsets" list is initialized to a tuple of the lower-case string
> and its position in the "words" list. It is then sorted. Python's
> sort method sorts tuples by comparing terms; given x and y, compare
> x[0] to y[0], then x[1] to y[1], etc. until there is a difference.
>
> The result is the "offsets" list is ordered by its first term, and the
> second term can be used to figure out where the original data was
> stored.

The 2nd loop seems more natural as e.g.

for dontcare, i in offsets:
new_words.append(words[i])


Note that a similar trick works for obtaining a stable stort, as in

# create [(stuff[0], 0), (stuff[1], 1), ...]
pairs = map(None, stuff, range(len(stuff)))
pairs.sort()
for i in range(len(stuff)):
stuff[i] = pairs[i][0]

Tricking sort into stability isn't frequent enough to be a FAQ, but is
irritating enough to be a FIQ <wink>.

> ...


> If you want to compare two arbitrary attributes (and aren't overly
> concerned about performance) you can even define your own comparison
> function object. This uses the ability of a class instance to emulate
> an function by defining the __call__ method, as in:
>
> >>> class CmpAttr:
> >>> def __init__(self, attr):
> >>> self.attr = attr
> >>> def __call__(self, x, y):
> >>> return cmp(getattr(x, self.attr), getattr(y, self.attr))
> >>>
> >>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
> >>> a.sort(CmpAttr("spam")) # sort by the "spam" attribute
> >>> for spam in a:
> >>> print spam.spam, spam.eggs, str(spam)
> 1 4 5
> 4 6 10
> 9 3 12
>
> >>> a.sort(CmpAttr("eggs")) # re-sort by the "eggs" attribute
> >>> for spam in a:
> >>> print spam.spam, spam.eggs, str(spam)
> 9 3 12
> 1 4 5
> 4 6 10

I'd backtrack here and give one more way:

def sortbyattr(seq, attrname):
pairs = []
for i in range(len(seq)):
pairs.append( (getattr(seq[i], attrname), i) )
pairs.sort()
result = []
for attr, i in pairs:
result.append(seq[i])
seq[:] = result

followed by:

sortbyattr(a, "eggs")

Faster for large lists and simple attrs, & stable (that's especially useful
when repeatedly sorting by a number of attrs). It's an application of stuff
you covered before, but somehow people seem to forget what they once knew
when it comes to sorting class instances <0.7 wink>.

Also worth mentioning that it's sometimes better to avoid sort() entirely,
keeping a list sorted as it's built up via bisect.insort (esp. handy if
you're interleaving accumulation with searching).

sordidly y'rs - tim

0 new messages