Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

No trees in the stdlib?

19 views
Skip to first unread message

Tom Reed

unread,
Jun 26, 2009, 1:29:30 AM6/26/09
to pytho...@python.org
Why no trees in the standard library, if not as a built in? I searched
the archive but couldn't find a relevant discussion. Seems like a
glaring omission considering the batteries included philosophy,
particularly balanced binary search trees. No interest, no good
implementations, something other reason? Seems like a good fit for the
collections module. Can anyone shed some light?

Thanks.
--
Tom

Aahz

unread,
Jun 26, 2009, 1:32:13 AM6/26/09
to
In article <mailman.2139.1245994...@python.org>,

What do you want such a tree for? Why are dicts and the bisect module
inadequate? Note that there are plenty of different tree implementations
available from either PyPI or the Cookbook.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha

João Valverde

unread,
Jun 26, 2009, 1:55:50 AM6/26/09
to pytho...@python.org
Aahz wrote:
> In article <mailman.2139.1245994...@python.org>,
> Tom Reed <tomr...@gmail.com> wrote:
>
>> Why no trees in the standard library, if not as a built in? I searched
>> the archive but couldn't find a relevant discussion. Seems like a
>> glaring omission considering the batteries included philosophy,
>> particularly balanced binary search trees. No interest, no good
>> implementations, something other reason? Seems like a good fit for the
>> collections module. Can anyone shed some light?
>>
>
> What do you want such a tree for? Why are dicts and the bisect module
> inadequate? Note that there are plenty of different tree implementations
> available from either PyPI or the Cookbook.
>
A hash table is very different to a BST. They are both useful. The
bisect module I'm not familiar with, I'll have to look into that, thanks.

I have found pyavl on the web, it does the job ok, but there no
implementations for python3 that I know of.

Simple example usage case: Insert string into data structure in sorted
order if it doesn't exist, else retrieve it.

João Valverde

unread,
Jun 26, 2009, 2:05:18 AM6/26/09
to pytho...@python.org
Jo�o Valverde wrote:

> Aahz wrote:
>> In article <mailman.2139.1245994...@python.org>,
>> Tom Reed <tomr...@gmail.com> wrote:
>>
>>> Why no trees in the standard library, if not as a built in? I
>>> searched the archive but couldn't find a relevant discussion. Seems
>>> like a glaring omission considering the batteries included
>>> philosophy, particularly balanced binary search trees. No interest,
>>> no good implementations, something other reason? Seems like a good
>>> fit for the collections module. Can anyone shed some light?
>>>
>>
>> What do you want such a tree for? Why are dicts and the bisect module
>> inadequate? Note that there are plenty of different tree
>> implementations
>> available from either PyPI or the Cookbook.
>>
> A hash table is very different to a BST. They are both useful. The
> bisect module I'm not familiar with, I'll have to look into that, thanks.
>
> I have found pyavl on the web, it does the job ok, but there no
> implementations for python3 that I know of.
>
> Simple example usage case: Insert string into data structure in sorted
> order if it doesn't exist, else retrieve it.
>
Crap, sorry about the mixed identities, I'm not using my own machine.
The original post is mine also.

João Valverde

unread,
Jun 26, 2009, 2:21:05 AM6/26/09
to pytho...@python.org
Jo�o Valverde wrote:
> Aahz wrote:
>> In article <mailman.2139.1245994...@python.org>,
>> Tom Reed <tomr...@gmail.com> wrote:
>>
>>> Why no trees in the standard library, if not as a built in? I
>>> searched the archive but couldn't find a relevant discussion. Seems
>>> like a glaring omission considering the batteries included
>>> philosophy, particularly balanced binary search trees. No interest,
>>> no good implementations, something other reason? Seems like a good
>>> fit for the collections module. Can anyone shed some light?
>>>
>>
>> What do you want such a tree for? Why are dicts and the bisect module
>> inadequate? Note that there are plenty of different tree
>> implementations
>> available from either PyPI or the Cookbook.
>>
> A hash table is very different to a BST. They are both useful. The
> bisect module I'm not familiar with, I'll have to look into that, thanks.
>
> I have found pyavl on the web, it does the job ok, but there no
> implementations for python3 that I know of.
The main problem with pyavl by the way is that it doesn't seem to be
subclassable (?). Besides some interface glitches, like returning None
on delete if I recall correctly.

There's also rbtree, which I didn't try. And I think that's it. On the
whole not a lot of choice and not as practical for such a common data
structure.

Chris Rebert

unread,
Jun 26, 2009, 2:23:41 AM6/26/09
to João Valverde, pytho...@python.org
On Thu, Jun 25, 2009 at 10:55 PM, João Valverde<back...@netcabo.pt> wrote:
> Aahz wrote:
>>
>> In article <mailman.2139.1245994...@python.org>,
>> Tom Reed  <tomr...@gmail.com> wrote:
>>
>>>
>>> Why no trees in the standard library, if not as a built in? I searched
>>> the archive but couldn't find a relevant discussion. Seems like a glaring
>>> omission considering the batteries included philosophy, particularly
>>> balanced binary search trees. No interest, no good implementations,
>>> something other reason? Seems like a good fit for the collections module.
>>> Can anyone shed some light?
>>>
>>
>> What do you want such a tree for?  Why are dicts and the bisect module
>> inadequate?  Note that there are plenty of different tree implementations
>> available from either PyPI or the Cookbook.
>>
>
> A hash table is very different to a BST.  They are both useful. The bisect
> module I'm not familiar with, I'll have to look into that, thanks.
>
> I have found pyavl on the web, it does the job ok, but there no
> implementations for python3 that I know of.

> Simple example usage case: Insert string into data structure in sorted order
> if it doesn't exist, else retrieve it.

That's pretty much the bisect module in a nutshell. It manipulates a
sorted list using binary search.

Cheers,
Chris
--
http://blog.rebertia.com

Stefan Behnel

unread,
Jun 26, 2009, 2:48:52 AM6/26/09
to
Jo�o Valverde wrote:
> Besides some interface glitches, like returning None
> on delete if I recall correctly.

That's actually not /that/ uncommon. Operations that change an object are
not (side-effect free) functions, so it's just purity if they do not have a
return value.

Although practicality beats purity, sometimes... ;)

Stefan

Miles Kaufmann

unread,
Jun 26, 2009, 2:54:45 AM6/26/09
to pytho...@python.org
On Jun 26, 2009, at 2:23 AM, Chris Rebert wrote:

> On Thu, Jun 25, 2009 at 10:55 PM, João Valverde<back...@netcabo.pt>
> wrote:
>> Aahz wrote:
>>>

>>> In article <mailman.2139.1245994...@python.org>,
>>> Tom Reed <tomr...@gmail.com> wrote:
>>>
>>>>
>>>> Why no trees in the standard library, if not as a built in? I
>>>> searched
>>>> the archive but couldn't find a relevant discussion. Seems like a
>>>> glaring
>>>> omission considering the batteries included philosophy,
>>>> particularly
>>>> balanced binary search trees. No interest, no good implementations,
>>>> something other reason? Seems like a good fit for the collections
>>>> module.
>>>> Can anyone shed some light?
>>>>
>>>
>>> What do you want such a tree for? Why are dicts and the bisect
>>> module
>>> inadequate? Note that there are plenty of different tree
>>> implementations
>>> available from either PyPI or the Cookbook.
>>>
>>

>> Simple example usage case: Insert string into data structure in
>> sorted order
>> if it doesn't exist, else retrieve it.
>
> That's pretty much the bisect module in a nutshell. It manipulates a
> sorted list using binary search.

With O(n) insertions and removals, though. A decent self-balancing
binary tree will generally do those in O(log n).

-Miles

Jason Scheirer

unread,
Jun 26, 2009, 3:09:06 AM6/26/09
to
On Jun 25, 10:32 pm, a...@pythoncraft.com (Aahz) wrote:
> In article <mailman.2139.1245994218.8015.python-l...@python.org>,

> Tom Reed  <tomree...@gmail.com> wrote:
>
>
>
> >Why no trees in the standard library, if not as a built in? I searched
> >the archive but couldn't find a relevant discussion. Seems like a
> >glaring omission considering the batteries included philosophy,
> >particularly balanced binary search trees. No interest, no good
> >implementations, something other reason? Seems like a good fit for the
> >collections module. Can anyone shed some light?
>
> What do you want such a tree for?  Why are dicts and the bisect module
> inadequate?  Note that there are plenty of different tree implementations
> available from either PyPI or the Cookbook.
> --
> Aahz (a...@pythoncraft.com)           <*>        http://www.pythoncraft.com/

>
> "as long as we like the same operating system, things are cool." --piranha

...And heapq is more-or-less an emulation of a tree structure in its
underlying model. I once wrote a binary sorted tree data structure for
Python in C. It performed anywhere from 15-40% worse than dicts. I
think with optimization it will only perform 10% worse than dicts at
best.

Oh hey maybe that is why trees aren't an emphasized part of the
standard. They are going to be much slower than the ultra-optimized
dicts already in the standard lib.

João Valverde

unread,
Jun 26, 2009, 3:09:36 AM6/26/09
to pytho...@python.org
Jo�o Valverde wrote:
> Aahz wrote:
>> In article <mailman.2139.1245994...@python.org>,
>> Tom Reed <tomr...@gmail.com> wrote:
>>
>>> Why no trees in the standard library, if not as a built in? I
>>> searched the archive but couldn't find a relevant discussion. Seems
>>> like a glaring omission considering the batteries included
>>> philosophy, particularly balanced binary search trees. No interest,
>>> no good implementations, something other reason? Seems like a good
>>> fit for the collections module. Can anyone shed some light?
>>>
>>
>> What do you want such a tree for? Why are dicts and the bisect module
>> inadequate? Note that there are plenty of different tree
>> implementations
>> available from either PyPI or the Cookbook.
>>
> A hash table is very different to a BST. They are both useful. The
> bisect module I'm not familiar with, I'll have to look into that, thanks.
>
> I have found pyavl on the web, it does the job ok, but there no
> implementations for python3 that I know of.
>
> Simple example usage case: Insert string into data structure in sorted
> order if it doesn't exist, else retrieve it.
>
After browsing the bisect module I don't think it is the complete
answer. Please correct me if I'm mistaken but...

Ignoring for a moment that subjectively I feel this is not very pythonic
for my use case, if I get back the insertion position, doesn't that mean
I have to go over on average N/2 items on a linked list to insert the
item in position? Maybe less for sophisticated implementations but still
O(n)? That doesn't compare favorably to the cost of (possibly) having to
rebalance a tree on insertion.

João Valverde

unread,
Jun 26, 2009, 3:48:49 AM6/26/09
to pytho...@python.org
But a dict can't be used to implement a (sorted) table ADT.

João Valverde

unread,
Jun 26, 2009, 3:49:45 AM6/26/09
to pytho...@python.org
I didn't know that. But in this case I think purity gets pummeled every
time :) It's still not making sense to force a lookup to fetch something
before deleting (another lookup operation). If that were imposed by
python's internal machinery I'd suggest fixing that instead.

João Valverde

unread,
Jun 26, 2009, 3:58:29 AM6/26/09
to pytho...@python.org
Jo�o Valverde wrote:
> I didn't know that. But in this case I think purity gets pummeled
> every time :) It's still not making sense to force a lookup to fetch
> something before deleting (another lookup operation). If that were
> imposed by python's internal machinery I'd suggest fixing that instead.
>
To be clear what I mean by that is that it's just reference passing so
it can't generate programmatic errors.

Chris Rebert

unread,
Jun 26, 2009, 4:00:40 AM6/26/09
to João Valverde, pytho...@python.org
On Fri, Jun 26, 2009 at 12:09 AM, João Valverde<back...@netcabo.pt> wrote:
> João Valverde wrote:

>>
>> Aahz wrote:
>>>
>>> In article <mailman.2139.1245994...@python.org>,
>>> Tom Reed  <tomr...@gmail.com> wrote:
>>>
>>>>
>>>> Why no trees in the standard library, if not as a built in? I searched
>>>> the archive but couldn't find a relevant discussion. Seems like a glaring
>>>> omission considering the batteries included philosophy, particularly
>>>> balanced binary search trees. No interest, no good implementations,
>>>> something other reason? Seems like a good fit for the collections module.
>>>> Can anyone shed some light?
>>>>
>>>
>>> What do you want such a tree for?  Why are dicts and the bisect module
>>> inadequate?  Note that there are plenty of different tree implementations
>>> available from either PyPI or the Cookbook.
>>>
>>
>> A hash table is very different to a BST.  They are both useful. The bisect
>> module I'm not familiar with, I'll have to look into that, thanks.
>>
>> I have found pyavl on the web, it does the job ok, but there no
>> implementations for python3 that I know of.
>>
>> Simple example usage case: Insert string into data structure in sorted
>> order if it doesn't exist, else retrieve it.
>>
> After browsing the bisect module I don't think it is the complete answer.
> Please correct me if I'm mistaken but...
>
> Ignoring for a moment that subjectively I feel this is not very pythonic for
> my use case, if I get back the insertion position, doesn't that mean I have
> to go over on average N/2 items on a linked list to insert the item in

Linked list?! Python lists are *array-based*.
<"You must be new here" joke redacted>

João Valverde

unread,
Jun 26, 2009, 4:28:56 AM6/26/09
to pytho...@python.org
Jo�o Valverde wrote:
> Jo�o Valverde wrote:
>> Aahz wrote:
>>> In article <mailman.2139.1245994...@python.org>,
>>> Tom Reed <tomr...@gmail.com> wrote:
>>>
>>>> Why no trees in the standard library, if not as a built in? I
>>>> searched the archive but couldn't find a relevant discussion. Seems
>>>> like a glaring omission considering the batteries included
>>>> philosophy, particularly balanced binary search trees. No interest,
>>>> no good implementations, something other reason? Seems like a good
>>>> fit for the collections module. Can anyone shed some light?
>>>>
>>>
>>> What do you want such a tree for? Why are dicts and the bisect module
>>> inadequate? Note that there are plenty of different tree
>>> implementations
>>> available from either PyPI or the Cookbook.
>>>
>> A hash table is very different to a BST. They are both useful. The
>> bisect module I'm not familiar with, I'll have to look into that,
>> thanks.
>>
>> I have found pyavl on the web, it does the job ok, but there no
>> implementations for python3 that I know of.
>>
>> Simple example usage case: Insert string into data structure in
>> sorted order if it doesn't exist, else retrieve it.
>>
> After browsing the bisect module I don't think it is the complete
> answer. Please correct me if I'm mistaken but...
>
> Ignoring for a moment that subjectively I feel this is not very
> pythonic for my use case, if I get back the insertion position,
> doesn't that mean I have to go over on average N/2 items on a linked
> list to insert the item in position? Maybe less for sophisticated
> implementations but still O(n)? That doesn't compare favorably to the
> cost of (possibly) having to rebalance a tree on insertion.
I was indeed corrected on this shameful blunder, but in my defense array
based lists are implementation dependent and the case for trees can be
made on deletion.

Steven D'Aprano

unread,
Jun 26, 2009, 6:08:38 AM6/26/09
to
On Fri, 26 Jun 2009 00:09:06 -0700, Jason Scheirer wrote:

> I once wrote a binary sorted tree data structure for Python in C. It
> performed anywhere from 15-40% worse than dicts. I think with
> optimization it will only perform 10% worse than dicts at best.
>
> Oh hey maybe that is why trees aren't an emphasized part of the
> standard. They are going to be much slower than the ultra-optimized
> dicts already in the standard lib.

But dicts require hashable keys, and are in arbitrary order. You can't
(for example) traverse a dict in post-order.

Hash tables (dicts) are useful for many of the same things that trees are
useful for, but they are different data structures with different
strengths and weaknesses, and different uses. To argue that we don't need
trees because we have dicts makes as much sense as arguing that we don't
need dicts because we have lists.

--
Steven

Paul Rubin

unread,
Jun 26, 2009, 7:14:18 AM6/26/09
to
Stefan Behnel <stef...@behnel.de> writes:
> > Besides some interface glitches, like returning None
> > on delete if I recall correctly.
>
> That's actually not /that/ uncommon. Operations that change an object are
> not (side-effect free) functions, so it's just purity if they do not have a
> return value.

But deletes in an AVL tree should not cause mutation. They should
just allocate a new root and path up to where the deleted node was.
That allows having references to the old and new versions of the tree,
etc.

Paul Rubin

unread,
Jun 26, 2009, 7:15:48 AM6/26/09
to
Chris Rebert <cl...@rebertia.com> writes:
> > Simple example usage case: Insert string into data structure in
> > sorted order if it doesn't exist, else retrieve it.
>
> That's pretty much the bisect module in a nutshell. It manipulates a
> sorted list using binary search.

That lets you find an existing entry in log(n) time, but inserting
or deleting an entry still takes linear time. Also, you don't
get a persistent structure in the sense of functional programming.

Stefan Behnel

unread,
Jun 26, 2009, 7:37:25 AM6/26/09
to
Paul Rubin wrote:

I doubt that there are many AVL implementations that do that. Plus, if
deletion doesn't delete, I'd happily consider that a bug.

Stefan

Hallvard B Furuseth

unread,
Jun 26, 2009, 10:35:59 AM6/26/09
to
Stefan Behnel writes:
>Jo�o Valverde wrote:
>> Besides some interface glitches, like returning None
>> on delete if I recall correctly.
>
> That's actually not /that/ uncommon. Operations that change an object are
> not (side-effect free) functions, so it's just purity if they do not have a
> return value.

It's purity that they don't return the modified tree/dict/whatever.
They can still return the deleted element and remain pure.

--
Hallvard

Stefan Behnel

unread,
Jun 26, 2009, 10:42:42 AM6/26/09
to

Fair enough.

Stefan

Paul Rubin

unread,
Jun 26, 2009, 1:13:25 PM6/26/09
to
Stefan Behnel <stef...@behnel.de> writes:
> > But deletes in an AVL tree should not cause mutation. They should
> > just allocate a new root and path up to where the deleted node was.
> I doubt that there are many AVL implementations that do that. Plus, if
> deletion doesn't delete, I'd happily consider that a bug.

I've never seen one that DIDN'T do that, but maybe it's just the crowd
I hang out with.

See "Purely Functional Data Structures" by Chris Okasaki for more
detailed descriptions of such methods.

Aahz

unread,
Jun 26, 2009, 1:15:58 PM6/26/09
to
In article <006078f0$0$9721$c3e...@news.astraweb.com>,

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>
>Hash tables (dicts) are useful for many of the same things that trees are
>useful for, but they are different data structures with different
>strengths and weaknesses, and different uses. To argue that we don't need
>trees because we have dicts makes as much sense as arguing that we don't
>need dicts because we have lists.

The problem is that trees are like standards: there are so many to choose
from. Each has its own tradeoffs, and because Python dicts and lists can
substitute for many of the traditionals uses of trees, the tradeoffs are
less clear. I think nobody would object to adding trees to the standard
library, but it will certainly require a clear PEP, preferably with a
pointer to an existing PyPI library that has acquired real-world use.

(In particular, WRT the bisect module, although insertion and deletion
are O(N), the constant factor for doing a simple memory move at C speed
swamps bytecode until N gets very large -- and we already have
collections.deque() for some other common use cases.)
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

Paul Rubin

unread,
Jun 26, 2009, 1:25:17 PM6/26/09
to
aa...@pythoncraft.com (Aahz) writes:
> (In particular, WRT the bisect module, although insertion and deletion
> are O(N), the constant factor for doing a simple memory move at C speed
> swamps bytecode until N gets very large -- and we already have
> collections.deque() for some other common use cases.)

Again, at least in my case, I'd hope for an immutable structure.

Also, "N very large" is likely to mean no more than a few thousand,
which is small by today's standards. And the C speed difference goes
away completely if the tree implementation is also in C.

Hedgehog Lisp has a good functional AVL tree implementation written in
C, that I might try to wrap with the Python C API sometime. I think
it is LGPL licensed though, not so good for the Python stdlib.

João Valverde

unread,
Jun 26, 2009, 2:57:42 PM6/26/09
to pytho...@python.org
Aahz wrote:
> In article <006078f0$0$9721$c3e...@news.astraweb.com>,
> Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>
>> Hash tables (dicts) are useful for many of the same things that trees are
>> useful for, but they are different data structures with different
>> strengths and weaknesses, and different uses. To argue that we don't need
>> trees because we have dicts makes as much sense as arguing that we don't
>> need dicts because we have lists.
>>
>
> The problem is that trees are like standards: there are so many to choose
> from. Each has its own tradeoffs, and because Python dicts and lists can
> substitute for many of the traditionals uses of trees, the tradeoffs are
> less clear. I think nobody would object to adding trees to the standard
> library, but it will certainly require a clear PEP, preferably with a
> pointer to an existing PyPI library that has acquired real-world use.
>
>
>
Wish I had asked this before this year's GSoC started.

What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)). The particular algorithm to achieve this is a secondary
issue. It's a BST for sure, AVL vs RBT vs something else. It's my fault
for having opened the topic with simply "trees" instead, it would have
avoided this vagueness problem, but I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python. And I really enjoy the using this language.

Stefan Behnel

unread,
Jun 26, 2009, 3:14:25 PM6/26/09
to
Jo�o Valverde wrote:
> What's lacking is an associative array that preserves ordering, doesn't
> require a hash function and has fast insertions and deletions in
> O(log(n)).
> [...]

> I'm genuinely surprised to know
> there are no data structures that efficiently support such a common need
> in Python.

That's because it's simply not that a common need (in the sense that there
isn't a suitable alternative). I know that Trees have their use cases where
they really shine, but in surprisingly many cases a dict, a set, a list or
a combination of them will do just fine. And if you find a case where those
just don't fit, you may need a database anyway.

Stefan

Aahz

unread,
Jun 26, 2009, 3:14:42 PM6/26/09
to
In article <mailman.2170.1246042...@python.org>,

=?ISO-8859-1?Q?Jo=E3o_Valverde?= <back...@netcabo.pt> wrote:
>
>What's lacking is an associative array that preserves ordering, doesn't
>require a hash function and has fast insertions and deletions in
>O(log(n)). The particular algorithm to achieve this is a secondary
>issue. It's a BST for sure, AVL vs RBT vs something else. It's my fault
>for having opened the topic with simply "trees" instead, it would have
>avoided this vagueness problem, but I'm genuinely surprised to know
>there are no data structures that efficiently support such a common need
>in Python. And I really enjoy the using this language.

Why AVL/RBT instead of B*? It's not that simple.... Another problem is
that unless the tree is coded in C, the constant factors are going to
swamp algorithmic complexity for many use cases -- that in turn makes it
more difficult to deploy a PyPI library for real-world testing.

Anyway, I'm *not* trying to discourage you, just explain some of the
roadblocks to acceptance that likely are why it hasn't already happened.

If you're serious about pushing this through, you have two options:

* Write the code and corresponding PEP yourself (which leads to the
second option, anyway)

* Lobby on the python-ideas mailing list

Carl Banks

unread,
Jun 26, 2009, 3:32:59 PM6/26/09
to
On Jun 26, 7:35 am, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
wrote:
> Stefan Behnel writes:

Correct, dict.pop does exactly this.


Carl Banks

Raymond Hettinger

unread,
Jun 26, 2009, 7:14:14 PM6/26/09
to
[Tom Reed]

> Why no trees in the standard library, if not as a built in?

The sqlite3 module is built on a binary-tree structure.
It can be used with persistent data or kept in-memory.
The gdbm module has similar flexibility (see the F mode).
FWIW, there are some ASPN implementing various types of
trees (red-black, pairing heaps, etc).


Raymond

Raymond Hettinger

unread,
Jun 26, 2009, 7:22:18 PM6/26/09
to
[João Valverde]

> What's lacking is an associative array that preserves ordering, doesn't
> require a hash function and has fast insertions and deletions in
> O(log(n)).

FWIW, Py3.1 has an OrderedDict() that preserves insertion order.
It has O(1) lookup, deletion, insertion, and popping; and O(n)
iteration. The ASPN Cookbook has equivalent code that runs on
earlier versions of Python.


> in Python. And I really enjoy the using this language.

Am glad you like it.


Raymond

greg

unread,
Jun 26, 2009, 9:43:40 PM6/26/09
to
João Valverde wrote:

> What's lacking is an associative array that preserves ordering, doesn't
> require a hash function and has fast insertions and deletions in
> O(log(n)).

Careful here -- you can't get away from the need for
hashability just by using a tree. Even if you don't
need to actually hash the values, it's still important
that the criterion for ordering between objects doesn't
change while they're in the tree, otherwise they'll
be in the wrong place and won't be found by subsequent
lookups.

> I'm genuinely surprised to know
> there are no data structures that efficiently support such a common need
> in Python.

Is it really all that common? If it truly were common,
there probably *would* be something for it in the
stdlib by now.

What sort of things are you doing that you want such
a structure for? Maybe we can suggest a way of using
the existing data structures to achieve the same
goal.

--
Greg

João Valverde

unread,
Jun 27, 2009, 1:03:11 AM6/27/09
to pytho...@python.org
greg wrote:
> João Valverde wrote:
>
>> What's lacking is an associative array that preserves ordering,
>> doesn't require a hash function and has fast insertions and deletions
>> in O(log(n)).
>
> Careful here -- you can't get away from the need for
> hashability just by using a tree. Even if you don't
> need to actually hash the values, it's still important
> that the criterion for ordering between objects doesn't
> change while they're in the tree, otherwise they'll
> be in the wrong place and won't be found by subsequent
> lookups.

I'm aware :) Technically it's necessary to define a total ordering on
the set of keys.


>
> > I'm genuinely surprised to know
>> there are no data structures that efficiently support such a common
>> need in Python.
>
> Is it really all that common? If it truly were common,
> there probably *would* be something for it in the
> stdlib by now.

Obviously my experience differs, but those were my expectations.


>
> What sort of things are you doing that you want such
> a structure for? Maybe we can suggest a way of using
> the existing data structures to achieve the same
> goal.
>

To answer the question of what I need the BSTs for, without getting into
too many boring details it is to merge and sort IP blocklists, that is,
large datasets of ranges in the form of (IP address, IP address,
string). Originally I was also serializing them in a binary format (but
no more after a redesign). I kept the "merge and sort" part as a helper
script, but that is considerably simpler to implement.

Please note that I'm happy with my code, it works well. I intended to
implement it in C all along, even before trying Python. The python code
was a side thing for testing/curiosity/fun. It prompted my original
question but that was really about Python and the standard library
itself, and I don't wish to waste anyone's time.

As an anecdotal data point (honestly not trying to raise the "Python is
slow" strawman), I implemented the same algorithm in C and Python, using
pyavl. Round numbers were 4 mins vs 4 seconds, against Python (plus
pyavl). Even considering I'm a worse Python programmer than C
programmer, it's a lot. I know many will probably think I tried to do "C
in Python" but that's not the case, at least I don' t think so. Anyway
like I said, not really relevant to this discussion.

CTO

unread,
Jun 27, 2009, 1:05:56 AM6/27/09
to
On Jun 26, 1:29 am, Tom Reed <tomree...@gmail.com> wrote:
> Whynotrees in the standard library, if not as a built in? I searched

> the archive but couldn't find a relevant discussion. Seems like a
> glaring omission considering the batteries included philosophy,
> particularly balanced binary search trees.Nointerest,nogood
> implementations, something other reason? Seems like a good fit for the
> collections module. Can anyone shed some light?
>
> Thanks.
> --
> Tom

I've written a graph library (trees being rooted connected acyclic
graphs)
called Graphine that you could try. Obviously not a part of the
standard
library, but it (or networkx) will almost certainly do what you're
looking
for. You can find graphine at graphine.org, or networkx at
networkx.lanl.gov.

Geremy Condra

João Valverde

unread,
Jun 27, 2009, 1:06:53 AM6/27/09
to pytho...@python.org
João Valverde wrote:

> greg wrote:
>> João Valverde wrote:
>>
>>> What's lacking is an associative array that preserves ordering,
>>> doesn't require a hash function and has fast insertions and
>>> deletions in O(log(n)).
>>
>> Careful here -- you can't get away from the need for
>> hashability just by using a tree. Even if you don't
>> need to actually hash the values, it's still important
>> that the criterion for ordering between objects doesn't
>> change while they're in the tree, otherwise they'll
>> be in the wrong place and won't be found by subsequent
>> lookups.
>
> I'm aware :) Technically it's necessary to define a total ordering on
> the set of keys.
>>
>> > I'm genuinely surprised to know
>>> there are no data structures that efficiently support such a common
>>> need in Python.
>>
>> Is it really all that common? If it truly were common,
>> there probably *would* be something for it in the
>> stdlib by now.
> Obviously my experience differs, but those were my expectations.
>>
>> What sort of things are you doing that you want such
>> a structure for? Maybe we can suggest a way of using
>> the existing data structures to achieve the same
>> goal.
>>
> To answer the question of what I need the BSTs for, without getting
> into too many boring details it is to merge and sort IP blocklists,
> that is, large datasets of ranges in the form of (IP address, IP
> address, string). Originally I was also serializing them in a binary
> format (but no more after a redesign). I kept the "merge and sort"
> part as a helper script, but that is considerably simpler to implement.
Crap, this sentence is totally confusing. I meant kept the merge code as
a helper script and moved the rest to C, see next paragraph.

> Please note that I'm happy with my code, it works well. I intended to

> implement it in C all along (for a system daemon), even before trying

João Valverde

unread,
Jun 27, 2009, 1:42:21 AM6/27/09
to pytho...@python.org
Aahz wrote:
> In article <mailman.2170.1246042...@python.org>,
> =?ISO-8859-1?Q?Jo=E3o_Valverde?= <back...@netcabo.pt> wrote:
>
>> What's lacking is an associative array that preserves ordering, doesn't
>> require a hash function and has fast insertions and deletions in
>> O(log(n)). The particular algorithm to achieve this is a secondary
>> issue. It's a BST for sure, AVL vs RBT vs something else. It's my fault
>> for having opened the topic with simply "trees" instead, it would have
>> avoided this vagueness problem, but I'm genuinely surprised to know
>> there are no data structures that efficiently support such a common need
>> in Python. And I really enjoy the using this language.
>>
>
> Why AVL/RBT instead of B*? It's not that simple.... Another problem is
> that unless the tree is coded in C, the constant factors are going to
> swamp algorithmic complexity for many use cases -- that in turn makes it
> more difficult to deploy a PyPI library for real-world testing.
>

I wouldn't consider anything other than C for such a module on
efficiency alone, unless it was a prototype of course. But I have little
knowledge about the Python C API.

About B* trees, again not an expert but I don't see how the added
complexity brings any noticeable advantage to implement the associative
array data structure I mentioned. Simple is better.

> Anyway, I'm *not* trying to discourage you, just explain some of the
> roadblocks to acceptance that likely are why it hasn't already happened.
>
> If you're serious about pushing this through, you have two options:
>
> * Write the code and corresponding PEP yourself (which leads to the
> second option, anyway)
>
> * Lobby on the python-ideas mailing list
>

Currently I don't have a strong need for this. I just believe it would
be a benefit to a language I like a lot. Lobbying isn't my thing. I'd
rather write code, but neither am I the most qualified person for the
job. It would certainly be interesting and fun and challenging in a good
way and a great way to learn some new stuff. But I would definitely need
mentoring or asking some silly questions on the mailing list. Maybe I'll
seriously consider it some other time.

Stefan Behnel

unread,
Jun 27, 2009, 1:52:26 AM6/27/09
to
Jo�o Valverde wrote:
> I wouldn't consider anything other than C for such a module on
> efficiency alone, unless it was a prototype of course. But I have little
> knowledge about the Python C API.

Cython is your true friend, if only for rapid prototyping.

http://cython.org/

Stefan

João Valverde

unread,
Jun 27, 2009, 2:41:39 AM6/27/09
to pytho...@python.org
Jo�o Valverde wrote:
> Aahz wrote:
> Currently I don't have a strong need for this. I just believe it would
> be a benefit to a language I like a lot. Lobbying isn't my thing. I'd
> rather write code, but neither am I the most qualified person for the
> job. It would certainly be interesting and fun and challenging in a
> good way and a great way to learn some new stuff. But I would
> definitely need mentoring or asking some silly questions on the
> mailing list. Maybe I'll seriously consider it some other time.
There's also another issue raise by Paul Rubin I wasn't even aware of,
that the LGPL is not suitable for the standard library. Having to write
a complete BST implementation in C is a drag. There are already good C
libraries available.

alex23

unread,
Jun 27, 2009, 9:03:23 PM6/27/09
to
João Valverde <backu...@netcabo.pt> wrote:
> Currently I don't have a strong need for this.

And clearly neither has anyone else, hence the absence from the
stdlib. As others have pointed out, there are alternative approaches,
and plenty of recipes on ActiveState, which seem to have scratched
whatever itch there is for the data structure you're advocating.

While Python's motto is "batteries included" I've always felt there
was an implicit "but not the kitchen sink" following it. Just because
something "could" be useful shouldn't be grounds for inclusion. That's
what pypi & the recipes are for. Ideally, little should be created
wholesale for the stdlib, what should be added are the existing 3rd
party modules that have become so ubiquitous that their presence on
any python platform is just expected.

João Valverde

unread,
Jun 27, 2009, 10:03:48 PM6/27/09
to pytho...@python.org
alex23 wrote:

> Jo�o Valverde <backu...@netcabo.pt> wrote:
>
>> Currently I don't have a strong need for this.
>>
>
> And clearly neither has anyone else, hence the absence from the
> stdlib. As others have pointed out, there are alternative approaches,
> and plenty of recipes on ActiveState, which seem to have scratched
> whatever itch there is for the data structure you're advocating.
>

Propose such alternative then. There are none that offer the same
performance. At best they're workarounds.

I don't care about recipes. That's called research.

If people don't find it to be useful, that's fine. Surprising, but fine.

And I don't have a need because I'm not using Python for my project. If
I wanted to I couldn't, without implementing myself or porting to Python
3 a basic computer science data structure.

> While Python's motto is "batteries included" I've always felt there
> was an implicit "but not the kitchen sink" following it. Just because
> something "could" be useful shouldn't be grounds for inclusion. That's
> what pypi & the recipes are for. Ideally, little should be created
> wholesale for the stdlib, what should be added are the existing 3rd
> party modules that have become so ubiquitous that their presence on
> any python platform is just expected.
>

Agreed.

Miles Kaufmann

unread,
Jun 27, 2009, 10:56:32 PM6/27/09
to pytho...@python.org
João Valverde wrote:
> To answer the question of what I need the BSTs for, without getting
> into too many boring details it is to merge and sort IP blocklists,
> that is, large datasets of ranges in the form of (IP address, IP
> address, string). Originally I was also serializing them in a binary
> format (but no more after a redesign). I kept the "merge and sort"
> part as a helper script, but that is considerably simpler to
> implement.
>
> ...

>
> As an anecdotal data point (honestly not trying to raise the "Python
> is slow" strawman), I implemented the same algorithm in C and
> Python, using pyavl. Round numbers were 4 mins vs 4 seconds, against
> Python (plus pyavl). Even considering I'm a worse Python programmer
> than C programmer, it's a lot. I know many will probably think I
> tried to do "C in Python" but that's not the case, at least I don' t
> think so. Anyway like I said, not really relevant to this discussion.

What format were you using to represent the IP addresses? (Is it a
Python class?) And why wouldn't you use a network address/subnet mask
pair to represent block ranges? (It seems like being able to
represent ranges that don't fit into a subnet's 2^n block wouldn't be
that common of an occurrence, and that it might be more useful to make
those ranges easier to manipulate.)

One of the major disadvantages of using a tree container is that
usually multiple comparisons must be done for every tree operation.
When that comparison involves a call into Python bytecode (for custom
cmp/lt methods) the cost can be substantial. Compare that to Python's
hash-based containers, which only need to call comparison methods in
the event of hash collisions (and that's hash collisions, not hash
table bucket collisions, since the containers cache each object's hash
value). I would imagine that tree-based containers would only be
worth using with objects with comparison methods implemented in C.

Not that I'm trying to be an apologist, or reject your arguments; I
can definitely see the use case for a well-implemented, fast tree-
based container for Python. And so much the better if, when you need
one, there was a clear consensus about what package to use (like PIL
for image manipulation--it won't meet every need, and there are others
out there, but it's usually the first recommended), rather than having
to search out and evaluate a dozen different ones.

-Miles

João Valverde

unread,
Jun 27, 2009, 11:44:02 PM6/27/09
to pytho...@python.org, Miles Kaufmann
Miles Kaufmann wrote:

> Jo�o Valverde wrote:
>> To answer the question of what I need the BSTs for, without getting
>> into too many boring details it is to merge and sort IP blocklists,
>> that is, large datasets of ranges in the form of (IP address, IP
>> address, string). Originally I was also serializing them in a binary
>> format (but no more after a redesign). I kept the "merge and sort"
>> part as a helper script, but that is considerably simpler to implement.
>>
>> ...
>>
>> As an anecdotal data point (honestly not trying to raise the "Python
>> is slow" strawman), I implemented the same algorithm in C and Python,
>> using pyavl. Round numbers were 4 mins vs 4 seconds, against Python
>> (plus pyavl). Even considering I'm a worse Python programmer than C
>> programmer, it's a lot. I know many will probably think I tried to do
>> "C in Python" but that's not the case, at least I don' t think so.
>> Anyway like I said, not really relevant to this discussion.
>
> What format were you using to represent the IP addresses? (Is it a
> Python class?) And why wouldn't you use a network address/subnet mask
> pair to represent block ranges? (It seems like being able to
> represent ranges that don't fit into a subnet's 2^n block wouldn't be
> that common of an occurrence, and that it might be more useful to make
> those ranges easier to manipulate.)
I was using a bytes subclass. I'm not free to choose CIDR notation,
range boundaries must be arbitrary.

>
> One of the major disadvantages of using a tree container is that
> usually multiple comparisons must be done for every tree operation.
> When that comparison involves a call into Python bytecode (for custom
> cmp/lt methods) the cost can be substantial. Compare that to Python's
> hash-based containers, which only need to call comparison methods in
> the event of hash collisions (and that's hash collisions, not hash
> table bucket collisions, since the containers cache each object's hash
> value). I would imagine that tree-based containers would only be
> worth using with objects with comparison methods implemented in C.

I would flip your statement and say one of the advantages of using trees
is that they efficiently keep random input sorted. Obviously no
algorithm can do that with single comparisons. And not requiring a hash
function is a desirable quality for non-hashable objects. There's a
world beyond dicts. :)

I profiled the code and indeed the comparisons dominated the execution
time. Trimming the comparison function to the bare minimum, a single
python operation, almost doubled the program's speed.

>
> Not that I'm trying to be an apologist, or reject your arguments; I
> can definitely see the use case for a well-implemented, fast

> tree-based container for Python. And so much the better if, when you

> need one, there was a clear consensus about what package to use (like
> PIL for image manipulation--it won't meet every need, and there are
> others out there, but it's usually the first recommended), rather than
> having to search out and evaluate a dozen different ones.
>

Thanks, and I'm not trying to start a religion either. ;)

Lie Ryan

unread,
Jun 28, 2009, 3:25:36 AM6/28/09
to
João Valverde wrote:
> alex23 wrote:

>> João Valverde <backu...@netcabo.pt> wrote:
>>
>>> Currently I don't have a strong need for this.
>>>
>>
>> And clearly neither has anyone else, hence the absence from the
>> stdlib. As others have pointed out, there are alternative approaches,
>> and plenty of recipes on ActiveState, which seem to have scratched
>> whatever itch there is for the data structure you're advocating.
>>
>
> Propose such alternative then. There are none that offer the same
> performance. At best they're workarounds.
>
> I don't care about recipes. That's called research.
>
> If people don't find it to be useful, that's fine. Surprising, but fine.

Python devs, based on my observation, tend to choose a data structure
based on the interface and not its implementation. Binary Sorted Tree is
an implementation, its interface can be a sorted dict (sorted key-value
mapping) or a list (not the most natural interface for a tree).

Basically, python already have all the common interfaces, i.e. list,
set, and mapping/dict.

Let's see it like this. In how many ways can a list be implemented?
- Array (python's builtin list)
- Linked List
- Binary Tree
- and the list goes on...

All of them can expose their interface as list, but only array
implementation is available as builtin.

João Valverde

unread,
Jun 28, 2009, 11:23:47 PM6/28/09
to pytho...@python.org
Paul Rubin wrote:
> aa...@pythoncraft.com (Aahz) writes:
>
>> (In particular, WRT the bisect module, although insertion and deletion
>> are O(N), the constant factor for doing a simple memory move at C speed
>> swamps bytecode until N gets very large -- and we already have
>> collections.deque() for some other common use cases.)
>>
>
> Again, at least in my case, I'd hope for an immutable structure.
>
>
Could you clarify what you mean by immutable? As in... not mutable? As
in without supporting insertions and deletions? That's has the same
performance as using binary search on a sorted list. What's the point of
using a tree for that?

Paul Rubin

unread,
Jun 28, 2009, 11:54:11 PM6/28/09
to
Jo�o Valverde <back...@netcabo.pt> writes:
> Could you clarify what you mean by immutable? As in... not mutable? As
> in without supporting insertions and deletions?

Correct.

> That's has the same performance as using binary search on a sorted
> list. What's the point of using a tree for that?

The idea is you can accomplish the equivalent of insertion or deletion
by allocating a new root, along with the path down to the place you
want to insert, i.e. O(log n) operations. So instead of mutating an
existing tree, you create a new tree that shares most of its structure
with the old tree, and switch over to using the new tree. This
trivially lets you maintain snapshots of old versions of the tree,
implement an "undo" operation, have a background thread do a complex
operation on a snapshot while the foreground thread does any number of
update-and-replace operations, etc.

This is very standard stuff. See:

http://en.wikipedia.org/wiki/Persistent_data_structure

The wikipedia article on AVL trees makes it pretty obvious how an
implementation would work.

João Valverde

unread,
Jun 29, 2009, 1:17:01 AM6/29/09
to pytho...@python.org
Paul Rubin wrote:
> Jo�o Valverde <back...@netcabo.pt> writes:
>
>> Could you clarify what you mean by immutable? As in... not mutable? As
>> in without supporting insertions and deletions?
>>
>
> Correct.
>
>
>> That's has the same performance as using binary search on a sorted
>> list. What's the point of using a tree for that?
>>
>
> The idea is you can accomplish the equivalent of insertion or deletion
> by allocating a new root, along with the path down to the place you
> want to insert, i.e. O(log n) operations. So instead of mutating an
> existing tree, you create a new tree that shares most of its structure
> with the old tree, and switch over to using the new tree. This
> trivially lets you maintain snapshots of old versions of the tree,
> implement an "undo" operation, have a background thread do a complex
> operation on a snapshot while the foreground thread does any number of
> update-and-replace operations, etc.
>
>
>
Interesting, thanks. The concept is not difficult to understand but I'm
not sure it would be preferable. A copy operation should have the same
cost as a "snapshot", undo is kind of redundant and multithreading...
don't see a compelling use that would justify it. Also the interface is
a mapping so it'd be rather nice to emulate dict's.

Have you considered how the syntax would work in Python by the way? This:

new_tree = old_tree.insert(object)

Just looks wrong. The interface should support non functional idioms too.

João Valverde

unread,
Jun 29, 2009, 1:24:36 AM6/29/09
to pytho...@python.org
Jo�o Valverde wrote:

> Paul Rubin wrote:
>> Jo�o Valverde <back...@netcabo.pt> writes:
>>
>>> Could you clarify what you mean by immutable? As in... not mutable? As
>>> in without supporting insertions and deletions?
>>
>> Correct.
>>
>>> That's has the same performance as using binary search on a sorted
>>> list. What's the point of using a tree for that?
>>>
>>
>> The idea is you can accomplish the equivalent of insertion or deletion
>> by allocating a new root, along with the path down to the place you
>> want to insert, i.e. O(log n) operations. So instead of mutating an
>> existing tree, you create a new tree that shares most of its structure
>> with the old tree, and switch over to using the new tree. This
>> trivially lets you maintain snapshots of old versions of the tree,
>> implement an "undo" operation, have a background thread do a complex
>> operation on a snapshot while the foreground thread does any number of
>> update-and-replace operations, etc.
>>
>>
>>
> Interesting, thanks. The concept is not difficult to understand but
> I'm not sure it would be preferable. A copy operation should have the
> same cost as a "snapshot", undo is kind of redundant and
> multithreading... don't see a compelling use that would justify it.
> Also the interface is a mapping so it'd be rather nice to emulate dict's.
>
> Have you considered how the syntax would work in Python by the way? This:
>
> new_tree = old_tree.insert(object)
>

Heh, that's a poor example for a mapping. But:

bst[key] = object

is even dicier for immutable structures no?

Paul Rubin

unread,
Jun 29, 2009, 1:41:46 AM6/29/09
to
Jo�o Valverde <back...@netcabo.pt> writes:
> Interesting, thanks. The concept is not difficult to understand but
> I'm not sure it would be preferable. A copy operation should have the
> same cost as a "snapshot",

You mean a deep-copy? That is unnecessarily expensive; with a
functional structure you can snapshot (or copy) by copying a single
pointer.

> undo is kind of redundant and multithreading... don't see a
> compelling use that would justify it.

Here is one:
http://groups.google.com/group/comp.lang.python/msg/1fbe66701e4bc65b

> Have you considered how the syntax would work in Python by the way? This:
> new_tree = old_tree.insert(object)
> Just looks wrong.

It looks fine to me. Obviously you could support a wrapper with
a mutating slot that holds a pointer to the tree.

Paul Rubin

unread,
Jun 29, 2009, 1:42:19 AM6/29/09
to
Jo�o Valverde <back...@netcabo.pt> writes:
> bst[key] = object
> is even dicier for immutable structures no?

bst = bst.insert(key, object)

João Valverde

unread,
Jun 29, 2009, 1:56:07 AM6/29/09
to pytho...@python.org
Paul Rubin wrote:
> Jo�o Valverde <back...@netcabo.pt> writes:
>
>> Interesting, thanks. The concept is not difficult to understand but
>> I'm not sure it would be preferable. A copy operation should have the
>> same cost as a "snapshot",
>>
>
> You mean a deep-copy? That is unnecessarily expensive; with a
> functional structure you can snapshot (or copy) by copying a single
> pointer.
>
>

Shallow copy...

>> undo is kind of redundant and multithreading... don't see a
>> compelling use that would justify it.
>>
>
> Here is one:
> http://groups.google.com/group/comp.lang.python/msg/1fbe66701e4bc65b
>
>

I just skimmed that but if someone really needs multithreading for such
intensive processing without wanting a database, fair enough I guess.

>> Have you considered how the syntax would work in Python by the way? This:
>> new_tree = old_tree.insert(object)
>> Just looks wrong.
>>
>
> It looks fine to me. Obviously you could support a wrapper with
> a mutating slot that holds a pointer to the tree.
>

I didn't get the last part, sorry. But I think you'd have a lot of users
annoyed that the interface is similar to a list yet their objects
mysteriously disappear. To me, tree.insert() implies mutability but I
defer that to others like yourself with more experience in Python than me.

João Valverde

unread,
Jun 29, 2009, 2:16:21 AM6/29/09
to pytho...@python.org
Jo�o Valverde wrote:
> Paul Rubin wrote:
>> Jo�o Valverde <back...@netcabo.pt> writes:
>>
>>> Interesting, thanks. The concept is not difficult to understand but
>>> I'm not sure it would be preferable. A copy operation should have the
>>> same cost as a "snapshot",
>>
>> You mean a deep-copy? That is unnecessarily expensive; with a
>> functional structure you can snapshot (or copy) by copying a single
>> pointer.
>>
>>
>
> Shallow copy...
>
>
Actually I meant whatever that snapshot operation is, minus the
insertion, if that makes sense.

Nobody

unread,
Jun 29, 2009, 4:59:54 AM6/29/09
to
On Sun, 28 Jun 2009 20:54:11 -0700, Paul Rubin wrote:

> Jo�o Valverde <back...@netcabo.pt> writes:
>> Could you clarify what you mean by immutable? As in... not mutable? As
>> in without supporting insertions and deletions?
>
> Correct.
>
>> That's has the same performance as using binary search on a sorted
>> list. What's the point of using a tree for that?
>
> The idea is you can accomplish the equivalent of insertion or deletion
> by allocating a new root, along with the path down to the place you
> want to insert, i.e. O(log n) operations. So instead of mutating an
> existing tree, you create a new tree that shares most of its structure
> with the old tree, and switch over to using the new tree.

The main issue here is that you need to be a bit smarter when it comes to
"modifying" the tree.

If you want to insert, delete or replace multiple elements, using repeated
insert()s (etc) on the root is sub-optimal, as you will end up repeatedly
duplicating the upper levels. Ideally you want to provide operations which
will add/remove/replace multiple elements in a single traversal.

Tim Wintle

unread,
Jun 29, 2009, 8:01:19 AM6/29/09
to João Valverde, pytho...@python.org
On Sat, 2009-06-27 at 06:03 +0100, João Valverde wrote:
> To answer the question of what I need the BSTs for, without getting
> into too many boring details it is to merge and sort IP blocklists,
> that is, large datasets of ranges in the form of (IP address, IP
> address, string).

<snip>

> As an anecdotal data point (honestly not trying to raise the "Python
> is slow" strawman), I implemented the same algorithm in C and Python,
> using pyavl. Round numbers were 4 mins vs 4 seconds, against Python
> (plus pyavl).

Out of interest, I recently wrote something similar that imported (a
class of) snort rules for blacklisting ip traffic. I could only use the
standard library.

I ended up writing a simple tree using dict-like objects [1]. Afraid I
haven't got a taught CS background to know the name of the structure.

(note insertion wasn't the focus, and I didn't bother writing it to
handle updates/merges - this is a quick script I run every now and then,
so I'm sure it could be done better - I just liked having the standard
dict interface for each node)

I only had three levels of branching, using the first octet to branch at
the root node, the second octet to branch as the second node, and the
final two to branch at the third node's depth (since even then that's
normally sparse relative to the first two nodes).

It works well enough for me - I'm IO bound reading in ip addresses from
logs to check against the blacklist, and there is a fair bit of other
processing going on for each line.

(Obviously I converted the ip addresses to integers before doing all
this to avoid hashing strings etc)


[1]
(As rules could be for any subnet I overloaded some of the dict methods
to check against rules on unusual subnets etc. before checking
individual ips in the final part)

Terry Reedy

unread,
Jun 29, 2009, 3:06:52 PM6/29/09
to pytho...@python.org
Paul Rubin wrote:

> The idea is you can accomplish the equivalent of insertion or deletion
> by allocating a new root, along with the path down to the place you
> want to insert, i.e. O(log n) operations. So instead of mutating an
> existing tree, you create a new tree that shares most of its structure
> with the old tree, and switch over to using the new tree.

Now I get what your have been talking about over several posts.
Update and mutate are kind of synonymous in my mind, but the above
explains how they can be different.

This
> trivially lets you maintain snapshots of old versions of the tree,
> implement an "undo" operation, have a background thread do a complex
> operation on a snapshot while the foreground thread does any number of
> update-and-replace operations, etc.
>
> This is very standard stuff.

Now for someone raised on arrays, iteration, and procedural programming ;-).

> http://en.wikipedia.org/wiki/Persistent_data_structure

Reading it now.

João Valverde

unread,
Jun 29, 2009, 5:54:38 PM6/29/09
to pytho...@python.org
Jo�o Valverde wrote:
> Paul Rubin wrote:
>> Jo�o Valverde <back...@netcabo.pt> writes:
>>
>>> Interesting, thanks. The concept is not difficult to understand but
>>> I'm not sure it would be preferable. A copy operation should have the
>>> same cost as a "snapshot",
>>
>> You mean a deep-copy? That is unnecessarily expensive; with a
>> functional structure you can snapshot (or copy) by copying a single
>> pointer.
>>
>>
>
> Shallow copy...

>
>>> undo is kind of redundant and multithreading... don't see a
>>> compelling use that would justify it.
>>
>> Here is one:
>> http://groups.google.com/group/comp.lang.python/msg/1fbe66701e4bc65b
>>
>>
>
> I just skimmed that but if someone really needs multithreading for
> such intensive processing without wanting a database, fair enough I
> guess.
>
>>> Have you considered how the syntax would work in Python by the way?
>>> This:
>>> new_tree = old_tree.insert(object)
>>> Just looks wrong.
>>
>> It looks fine to me. Obviously you could support a wrapper with
>> a mutating slot that holds a pointer to the tree.
>>
> I didn't get the last part, sorry. But I think you'd have a lot of
> users annoyed that the interface is similar to a list yet their
> objects mysteriously disappear. To me, tree.insert() implies
> mutability but I defer that to others like yourself with more
> experience in Python than me.
>

Rereading this I got what you meant by "wrapper with mutating slot".
But that is (like I think you implied) functionally equivalent to a
mutating data structure, with worse performance because of additional
memory allocation and such. Is it faster to rebalance the tree with a
persistent data structure?

João Valverde

unread,
Jun 29, 2009, 7:27:33 PM6/29/09
to pytho...@python.org
alex23 wrote:

> Jo�o Valverde <backu...@netcabo.pt> wrote:
>
>> Currently I don't have a strong need for this.
>>
>
> And clearly neither has anyone else, hence the absence from the
> stdlib. As others have pointed out, there are alternative approaches,
> and plenty of recipes on ActiveState, which seem to have scratched
> whatever itch there is for the data structure you're advocating.
>
>

I can't resist quoting Sedgewick here. Then I'll shut up about it.

[quote=http://www.cs.princeton.edu/~rs/talks/LLRB/LLRB.pdf]
Abstract
The red-black tree model for implementing balanced search trees,
introduced by Guibas and Sedge-
wick thirty years ago, is now found throughout our computational
infrastructure. Red-black trees
are described in standard textbooks and are the underlying data
structure for symbol-table imple-
mentations within C++, Java, Python, BSD Unix, and many other modern
systems.
[/quote]

You'd think so, but no. You should correct him that in Python a balanced
search tree is the useless cross between a dict and a database.

Paul Rubin

unread,
Jun 29, 2009, 10:40:56 PM6/29/09
to

If you're going to use a self-balancing tree you will have to do some
tree rotations even if the tree is mutable. My guess is that it's
still O(log n) updates, but with a smaller proportionality constant.

Lawrence D'Oliveiro

unread,
Jul 1, 2009, 4:39:06 AM7/1/09
to
In message <mailman.2140.1245996...@python.org>, João
Valverde wrote:

> Simple example usage case: Insert string into data structure in sorted
> order if it doesn't exist, else retrieve it.

the_set = set( ... )

if str in the_set :
... "retrieval" case ...
else :
the_set.add(str)
#end if

Want sorted order?

sorted(tuple(the_set))

What could be simpler?

Lawrence D'Oliveiro

unread,
Jul 1, 2009, 4:40:19 AM7/1/09
to
In message <mailman.2144.1245999...@python.org>, Miles
Kaufmann wrote:

> On Jun 26, 2009, at 2:23 AM, Chris Rebert wrote:
>
>> That's pretty much the bisect module in a nutshell. It manipulates a
>> sorted list using binary search.
>
> With O(n) insertions and removals, though. A decent self-balancing
> binary tree will generally do those in O(log n).

For values of n below, say, a million, would you even notice the difference
on modern hardware?

Lawrence D'Oliveiro

unread,
Jul 1, 2009, 4:43:10 AM7/1/09
to
In message <mailman.2146.1246002...@python.org>, João Valverde wrote:

> But a dict can't be used to implement a (sorted) table ADT.

for key in sorted(the_dict.keys(), cmp = ... whatever ordering criteria you like ...) :
... do something with the_dict[key] ...
#end for


Paul Rubin

unread,
Jul 1, 2009, 4:43:49 AM7/1/09
to
Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> writes:
> > With O(n) insertions and removals, though. A decent self-balancing
> > binary tree will generally do those in O(log n).
>
> For values of n below, say, a million, would you even notice the difference
> on modern hardware?

Huh? Yes, of course you'd notice, unless you were doing the operation
just once or something like that.

Steven D'Aprano

unread,
Jul 1, 2009, 7:41:07 AM7/1/09
to
On Wed, 01 Jul 2009 20:39:06 +1200, Lawrence D'Oliveiro wrote:

> In message <mailman.2140.1245996...@python.org>, João
> Valverde wrote:
>
>> Simple example usage case: Insert string into data structure in sorted
>> order if it doesn't exist, else retrieve it.
>
> the_set = set( ... )
>
> if str in the_set :
> ... "retrieval" case ...

Not terribly useful, because there's no way of attaching any data to
retrieve to the key. Apart from the case of interning strings (or other
keys), you'd probably want a dict rather than a set:

if str in the_dict:
return the_dict[str]
else:
the_dict[str] = something_or_other


> else :
> the_set.add(str)
> #end if


> Want sorted order?
>
> sorted(tuple(the_set))
>
> What could be simpler?

Dropping the entirely superfluous creation of a tuple:

sorted(the_set)

That's a reasonable approach for small sets of data, but it isn't very
useful if the_set contains 4GB of keys, because you have to duplicate the
keys, then sort them. More effective would be a data structure that was
always sorted. This has the further advantage that it's even simpler than
sorted(the_set):

the_set

--
Steven

Tim Golden

unread,
Jul 1, 2009, 12:20:38 PM7/1/09
to pytho...@python.org
Lawrence D'Oliveiro wrote:
> Want sorted order?
>
> sorted(tuple(the_set))

Not sure why you want the tuple () there, but
you probably knew that...

TJG

MRAB

unread,
Jul 1, 2009, 12:21:54 PM7/1/09
to pytho...@python.org
Why not just "sorted(the_set)"?

João Valverde

unread,
Jul 1, 2009, 2:45:26 PM7/1/09
to pytho...@python.org

Try putting that inside a loop with thousands of iterations and you'll
see what the problem is.

zooko

unread,
Jul 3, 2009, 12:48:10 AM7/3/09
to

Lawrence D'Oliveiro

unread,
Jul 3, 2009, 4:08:20 AM7/3/09
to
In message <mailman.2446.1246491...@python.org>, João
Valverde wrote:

You could apply the same argument to anything. E.g. why create a tree
structure with a million elements? Try putting that inside a loop with

Paul Rubin

unread,
Jul 3, 2009, 4:33:27 AM7/3/09
to
Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> writes:
> >> Want sorted order?
> >> sorted(tuple(the_set))
> >> What could be simpler?
> >
> > Try putting that inside a loop with thousands of iterations and you'll
> > see what the problem is.
>
> You could apply the same argument to anything. E.g. why create a tree
> structure with a million elements? Try putting that inside a loop with
> thousands of iterations and you'll see what the problem is.

The idea is you're going to insert or delete something in the tree
structure at each iteration, an O(log n) operation, making the whole
loop O(n log n). If you sort a set (which is O(n log n)) inside the
loop then you end up with O(n**2 log n) which is impractical. A
reason you might sort inside the loop might be to find the nearby
neighbors of the new element or traverse some elements in the middle.
This is trivial with a tree structure but messy with something like
sets.

Steven D'Aprano

unread,
Jul 3, 2009, 4:39:19 AM7/3/09
to
On Fri, 03 Jul 2009 20:08:20 +1200, Lawrence D'Oliveiro wrote:

> In message <mailman.2446.1246491...@python.org>, João
> Valverde wrote:
>
>> Lawrence D'Oliveiro wrote:

[...]


>>> Want sorted order?
>>>
>>> sorted(tuple(the_set))
>>>
>>> What could be simpler?
>>
>> Try putting that inside a loop with thousands of iterations and you'll
>> see what the problem is.
>
> You could apply the same argument to anything. E.g. why create a tree
> structure with a million elements? Try putting that inside a loop with
> thousands of iterations and you'll see what the problem is.

The difference is, it's vanishingly rare to want to build a tree with
millions of elements thousands of times over and over again, but it is
not unreasonable to want to access sorted data thousands of times.
Needing to re-sort it over and over and over again is wasteful, slow and
stupid. Binary trees were invented, in part, specifically to solve this
use-case.

--
Steven

0 new messages