Red-Black Trees & Actor centrality

360 views
Skip to first unread message

SirVer

unread,
Aug 16, 2012, 1:40:08 PM8/16/12
to juli...@googlegroups.com
Hello,

I decided to spent some more time with Julia. The basic idea here is to learn the idiomatic way to program it 'good'. I implemented a red-black tree, a basic Set and Dict based on them and applied them to the Actor centrality problem I posted on this list earlier this week.

The gist for the red-black tree is here: https://gist.github.com/3371829  (put into examples/rbtree.jl)
The (slow) actor centrality implementation is here: https://gist.github.com/3371833

I am not (so much) concerned with speed, but while programming this, I ran into a number of newbie problems which I could not answer with the documentation alone. Maybe someone can help.

1) I started with the binary tree that comes in the examples. I hope that I could reuse a lot of it, but was unable to "subclass" it in any useful way (I know, the concept is not available in Julia, but code reuse likely is). For example, a Node in a RedBlack is the same as one in a normal binary tree except that it also keeps track of its color and its parent. I was unable to express this nicely though.
2) the constants RED and BLACK should be Int8 and not Int64. I was unable to force them to. It would have been even better if I could have introduced a enum like type. for example print(BLACK) now prints 1, and it would be cool if it would print "BLACK". and BLACK should be != 1. Is this possible?
3) Conversion/promotion is difficult to understand for me. For example,

    k = RBTree{Int64, UTF8String}()
    d[64] = "Hi" 

chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.
4) Iteration is quite awful for this case. I understand the start, done, next approach and its functional school of thought. But the python way is much easier to read and express in this case: iterate left subtree, yield me, iterate right subtree, done. I would love to see both ways supported because each one is sometimes better and sometimes not.
5) what is the difference between 'for n = <iteratable>'  and 'for n in <iteratable>'? 
6) Speed. This implementation is quite slow compared to my naive python implementation I did a year ago [1]. I am not so much interested in speed personally, I just wonder if there are obvious implementation nono that cost too much speed.
7) Would a OrderedDict/OrderedSet be wanted in the standard library? If so, this implementation could maybe be used as base.


I am sure I forgot something. I am looking forward to your suggestions.

Regards,
Holger

Stefan Karpinski

unread,
Aug 16, 2012, 4:38:58 PM8/16/12
to juli...@googlegroups.com
On Thu, Aug 16, 2012 at 1:40 PM, SirVer <sir...@gmx.de> wrote:

I decided to spent some more time with Julia. The basic idea here is to learn the idiomatic way to program it 'good'. I implemented a red-black tree, a basic Set and Dict based on them and applied them to the Actor centrality problem I posted on this list earlier this week.

Excellent! Good way to start. Tor is working on an AVL tree implementation, so we can compare the two at some point and maybe make them both available in a collections framework with a uniform interface. (Or maybe have a death match and let the best tree win.)

1) I started with the binary tree that comes in the examples. I hope that I could reuse a lot of it, but was unable to "subclass" it in any useful way (I know, the concept is not available in Julia, but code reuse likely is). For example, a Node in a RedBlack is the same as one in a normal binary tree except that it also keeps track of its color and its parent. I was unable to express this nicely though.

Julia allows sharing behavior by subclassing a common abstract type since code can "target" abstract types (in this sense, abstract types in Julia are really traits). There is no way to share structure, which is what happens in C++ or Java inheritance. You could, if you needed the same structure in a bunch of types, use a macro to define a bunch of fields. As soon as we have inline immutable composite types, the shared structure bit could just be a single field with a bunch of subfields. That's probably the best design for intentionally shared complex structure.

In this case, I would say the answer is just cut-and-paste [audience gasps]. Yep, I said it. The Node and RBTreeNode types *happen* to share some structure, but there's really no deep reason for that. Having them tied together couples in that changing the Node implementation to break the RBTreeNode implementation. Even worse, with inheritance, there's no indication that something is inherited from, so you might change Node and not realize that you have to look for and change RBTreeNode. With the macro or sub-struct approach, you know if you change the macro or the sub-struct definition that you need to look at all the places that it is used.

2) the constants RED and BLACK should be Int8 and not Int64. I was unable to force them to. It would have been even better if I could have introduced a enum like type. for example print(BLACK) now prints 1, and it would be cool if it would print "BLACK". and BLACK should be != 1. Is this possible?

You could create a custom bits type that prints differently. We could make an enum macro that does that for you, but I haven't felt the need for it yet. It would take this:

@enum NodeColor BLACK RED

and turn it into something like this:

bitstype 8 NodeColor
const BLACK = box(NodeColor,0x0)
const RED = box(NodeColor,0x1)
isequal(x::NodeColor, y::NodeColor) =
    eq_int(unbox(NodeColor,x),unbox(NodeColor,y))
show(io::IO, x::NodeColor) =
    print(io, x==BLACK ? "BLACK" : "RED")

This code works and after running it, you can show and compare RED and BLACK, which are elements of type NodeColor.

In this case, however, I don't think you should be using an enum at all. You should have two different types of RBTreeNodes: RedNode and BlackNode. That would save you a byte in representation and make coding things significantly cleaner since some methods that do things based on node color could be written with dispatch. The only fly in that ointment is that there are places where you change the color of a node during the algorithm. You could, of course, create a new node instead of the other color. Especially once composite types become immutable, this would definitely be the way to go instead.

3) Conversion/promotion is difficult to understand for me. For example,

    k = RBTree{Int64, UTF8String}()
    d[64] = "Hi" 

chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.

Conversion and promotion isn't the issue here — its that you don't have a method that handles assignment when the RHS isn't a UTF8String. This is the same problem Tor encountered — check out this thread.

4) Iteration is quite awful for this case. I understand the start, done, next approach and its functional school of thought. But the python way is much easier to read and express in this case: iterate left subtree, yield me, iterate right subtree, done. I would love to see both ways supported because each one is sometimes better and sometimes not.

Another really handy way to define iterators is with coroutines. For example:

thecount(what) = @task begin
    produce("one $what")
    produce("two $what")
    produce("three $what")
end

julia> for x in thecount("ha, ha, ha")
           println(x)
       end
one ha, ha, ha
two ha, ha, ha
three ha, ha, ha

Using coroutines (aka tasks) is not as efficient as using the functional iteration design, but in situations where you just want to get the job done and don't care that much about speed, I think it's a great way to go since you can write a custom iterator without creating new types or defining start/next/done methods. For something like a red/black tree, of course, you probably want to write the bullet and do the start/next/done dance so that it's fast and doesn't have to keep swapping stacks around (once segmented stack support is complete in LLVM the overhead of coroutines will go way down).

5) what is the difference between 'for n = <iteratable>'  and 'for n in <iteratable>'?

No difference. Just two syntaxes for it. The convention is that you use = when the RHS is a range like 1:n and in when the RHS is an object do iterate over. The in form was added because writing for x = y looks like an assignment and feels kind of weird.
 
6) Speed. This implementation is quite slow compared to my naive python implementation I did a year ago [1]. I am not so much interested in speed personally, I just wonder if there are obvious implementation nono that cost too much speed.

No idea, but you can imagine from the other thread that there's lots of things that can be tried.

7) Would a OrderedDict/OrderedSet be wanted in the standard library? If so, this implementation could maybe be used as base.


Absolutely.

tor

unread,
Aug 17, 2012, 10:23:07 AM8/17/12
to juli...@googlegroups.com
The basic idea here is to learn the idiomatic way to program it 'good'. I implemented a red-black tree, a basic Set and Dict based on them and applied them to the Actor centrality problem I posted on this list earlier this week.
Great! I read about the actor centrality problem and it was very interesting.

Excellent! Good way to start. Tor is working on an AVL tree implementation, so we can compare the two at some point and maybe make them both available in a collections framework with a uniform interface. (Or maybe have a death match and let the best tree win.)
If we agreed on an some abstracts, we could replace one implementation with another, without changing all the high level code, such as tests, convenience functions ans so on.
If Julia had interfaces, I would suggest simply:
interface Btree{K, D
	key :: K
	data :: D
	left :: Btree{K, D} 	# I will ditch my 'inline' child array :D
	right :: Btree{K, D}
end

This isn't valid code, the point is only the field names (yours are already compliant, how lucky! :) 
Then RBtrees can have parent pointers and I can have my count fields and front end functions should work with very little change: they only need to throw exceptions if an unsupported operation is attempted For instance, my AVL trees have 'rank' and 'select' functions which require count fields to be efficient. If the underlying tree doesn't have count fields I'd rather throw an exception than implement correct, but slow code.
 
There is a third candidate that also must be considered: http://en.wikipedia.org/wiki/AA_tree
Just look at the two main routines, skew and split, isn't that neat? I guess one good reason to have trees written in Julia rather than wrapping external implementation is that people can then augment the code for special purposes. In that case the simple implementation of AA trees makes them stand out.

In this case, however, I don't think you should be using an enum at all. You should have two different types of RBTreeNodes: RedNode and BlackNode. That would save you a byte in representation and make coding things significantly cleaner since some methods that do things based on node color could be written with dispatch. The only fly in that ointment is that there are places where you change the color of a node during the algorithm. You could, of course, create a new node instead of the other color. Especially once composite types become immutable, this would definitely be the way to go instead.
I hadn't thought of that. This can be used for AVL trees as well: there are balanced trees, left-heavy trees and right-heavy trees. Inserting on the left of a left-heavy tree does so and so... Since Julia must store a type field with each object, anyway, why not put it to use?
    k = RBTree{Int64, UTF8String}()
    d[64] = "Hi" 

chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.
I still don't understand the type system. Or the modules. I'm waiting for the book (“The Complete Idiot's Guide to Julia Programming”. Stefan and Jeff, you are writing on a book, right? Where can I preorder? :)
6) Speed. This implementation is quite slow compared to my naive python implementation I did a year ago [1]. I am not so much interested in speed personally, I just wonder if there are obvious implementation nono that cost too much speed.
No idea, but you can imagine from the other thread that there's lots of things that can be tried.
Jeff pointed at allocation as the likely culprit. Python implementations likely have highly fine-tuned meomory management by now. If you want c-like speed for a tree in Julia you would probably have to write a custom allocator, not so much for the general allocation overhead (which will likely be improved down the road), but to avoid the type fields.
Some code improvement can be done without regard to tree implementation. Did you for instance see this old paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8991

SirVer

unread,
Aug 17, 2012, 2:42:09 PM8/17/12
to juli...@googlegroups.com
Here it goes,

[Subclassing Binary Tree by Copy & Paste] 
Okay, I think I got this. I can understand this in this case because you are right, the AVL treenodes do not share a lot of common data with the Binary Search tree code.

[Defining NodeColor]
This is nice. Thank you, this was precisely what I wanted to do.

In this case, however, I don't think you should be using an enum at all. You should have two different types of RBTreeNodes: RedNode and BlackNode. That would save you a byte in representation and make coding things significantly cleaner since some methods that do things based on node color could be written with dispatch. The only fly in that ointment is that there are places where you change the color of a node during the algorithm. You could, of course, create a new node instead of the other color. Especially once composite types become immutable, this would definitely be the way to go instead.
I do not really understand this reasoning - or rather I have trouble transferring this into code. For example, there is even more special casing for nodes that have only one child, two childs or none. Together we have 6 combinations of traits (Children+Color: 0R, 1R, 2R, 0B, 1B, 2B) which can't even be expressed in Julia because traits (or inheritance) cannot be mixed. There is also the RootNode which has no parent and is always black. This is also some kind of trait. Why do you feel that Black and Red should be subtypes of node and not the other cases? What is wrong with if/ifelse clauses in the code? I am asking to understand Julia code design better - I do not question your approach, I just want to understand it better.

3) Conversion/promotion is difficult to understand for me. For example,

    k = RBTree{Int64, UTF8String}()
    d[64] = "Hi" 

chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.

Conversion and promotion isn't the issue here — its that you don't have a method that handles assignment when the RHS isn't a UTF8String. This is the same problem Tor encountered — check out this thread.
I didn't advance in this problem. I understand that I should not restrict the type in the assignement to hard: assign{K,V}(tree::RBTree{K,V}, val::V, key::K), rather I should use convert as early as possible. Doing this just pushes the error further up in the call chain (because I typed pretty much everything). Can I take this as a sign that I have typed too much in my functions and that I have somehow killed the flexibility of Julia with it?

And I also have a few new questions:

- I saw that there are some tests in test/, but they are all stateless. Stateful tests (with initializers and finalizers and so on) would be useful to me. Is there some kind of unit test module already available?
- There is a Heap implementation in examples/heap.jl. It duplicates a lot of code: either if a direct or indirect(hash based) heap is used or if the Heap is a MaxHeap or a MinHeap. That reads quite awful. Is there no proper way to parametrize Heap in some kind of way? (Of course I am not asking for the Heap, but for my Red-Black Trees. I want to have the ordering function as parameter. Also applying some function to the key is useful in times.
- Is there a way to define new infix operators in julia? 
- Is there a best-practice to document code? I love the python docstring approach - keeping the source code and the documentation together. I realize that pretty much nothing in the code is documented in Julia which is quite awful indeed.
- Is there a way to catch specific errors only? Something like
try
   throw(KeyError)
catch # But only KeyError please!
end
- Is there a way to do printf like string interpolation in Julia? I know of "$var" but this just does not cut it from time to time.

Thanks a lot!
Holger

SirVer

unread,
Aug 17, 2012, 2:48:38 PM8/17/12
to juli...@googlegroups.com
Hello,

If we agreed on an some abstracts, we could replace one implementation with another, without changing all the high level code, such as tests, convenience functions ans so on.
I agree that we should do this. Also, if we provide a Dict and a Set implementation based on the trees, we should the container be a parameter somehow.
There is a third candidate that also must be considered: http://en.wikipedia.org/wiki/AA_tree
Just look at the two main routines, skew and split, isn't that neat? I guess one good reason to have trees written in Julia rather than wrapping external implementation is that people can then augment the code for special purposes. In that case the simple implementation of AA trees makes them stand out.
 Not a fan of AA Trees really. In my (limited) experience, Red-Black Trees outperformed them easily. 
Jeff pointed at allocation as the likely culprit. Python implementations likely have highly fine-tuned meomory management by now. If you want c-like speed for a tree in Julia you would probably have to write a custom allocator, not so much for the general allocation overhead (which will likely be improved down the road), but to avoid the type fields.
I would be interested in checking this out. Where can I find examples of custom allocators? 
Some code improvement can be done without regard to tree implementation. Did you for instance see this old paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8991
I do not know this paper. However, I am currently more interested in what the language can deliver. I agree that for a standard library container, every path to more speed must be investigated.

Tor, I wipped together a quick benchmark: inserting 1_000_000 unique integers (where key=value) and deleting half of them again takes ~2 secs in my current implementation. If you want to, I can share the benchmark and code with you.

Regards,
Holger
 

Harlan Harris

unread,
Aug 17, 2012, 3:00:56 PM8/17/12
to juli...@googlegroups.com
Just responding to a few things I know the answers to...

On Fri, Aug 17, 2012 at 2:42 PM, SirVer <sir...@gmx.de> wrote:
- I saw that there are some tests in test/, but they are all stateless. Stateful tests (with initializers and finalizers and so on) would be useful to me. Is there some kind of unit test module already available?

There's a useful-but-still-rough unit test implementation in extras/test.jl. See test/test_test.jl for the test suite for the test suite.
 
- Is there a way to define new infix operators in julia? 

Not currently. You can certainly overload or abuse the existing operators, though.
 
- Is there a way to do printf like string interpolation in Julia? I know of "$var" but this just does not cut it from time to time.

Yes. Stefan's printf implementation was recently refactored to be a macro (I think to avoid having to repeatedly compile the parameter string in some cases):
@printf("%3.4f", pi)
3.1416

@sprintf works too. I think the docs still refer to printf as a function...

 -Harlan
 

Thanks a lot!
Holger

--
 
 
 

Stefan Karpinski

unread,
Aug 17, 2012, 3:06:43 PM8/17/12
to juli...@googlegroups.com
On Fri, Aug 17, 2012 at 3:00 PM, Harlan Harris <har...@harris.name> wrote:
 
@sprintf works too. I think the docs still refer to printf as a function...

Doh. I'd better fix that... 

Tor Gausen

unread,
Aug 18, 2012, 12:28:15 PM8/18/12
to juli...@googlegroups.com
 
 Not a fan of AA Trees really. In my (limited) experience, Red-Black Trees outperformed them easily. 

I thought you didn't care about performance ;)
Jeff pointed at allocation as the likely culprit. Python implementations likely have highly fine-tuned meomory management by now. If you want c-like speed for a tree in Julia you would probably have to write a custom allocator, not so much for the general allocation overhead (which will likely be improved down the road), but to avoid the type fields.
I would be interested in checking this out. Where can I find examples of custom allocators? 

Custom memory allocation, especially for storing references, is hairy stuff that should make one shudder. One basically writes one's own memory manager, trying to exploit special knowledge of the data that a general manager cannot have. I don't know of any Julia examples, but Julia does not bar you from accessing malloc and other c 'goodies'. I just mentioned it to stress how hard it would be to beat c in a dynamically typed, garbage collected language. I wouldn't want to write one myself, at least not at the moment.
 
Tor, I wipped together a quick benchmark: inserting 1_000_000 unique integers (where key=value) and deleting half of them again takes ~2 secs in my current implementation. If you want to, I can share the benchmark and code with you.

Sure. Can you please put it somewhere on GitHub? Do you have a repository somewhere for the RedBlack tree?

 
Regards,
Holger
 

--
 
 
 

Reply all
Reply to author
Forward
0 new messages