I decided to spent some more time with Julia. The basic idea here is to learn the idiomatic way to program it 'good'. I implemented a red-black tree, a basic Set and Dict based on them and applied them to the Actor centrality problem I posted on this list earlier this week.
1) I started with the binary tree that comes in the examples. I hope that I could reuse a lot of it, but was unable to "subclass" it in any useful way (I know, the concept is not available in Julia, but code reuse likely is). For example, a Node in a RedBlack is the same as one in a normal binary tree except that it also keeps track of its color and its parent. I was unable to express this nicely though.
2) the constants RED and BLACK should be Int8 and not Int64. I was unable to force them to. It would have been even better if I could have introduced a enum like type. for example print(BLACK) now prints 1, and it would be cool if it would print "BLACK". and BLACK should be != 1. Is this possible?
@enum NodeColor BLACK RED
bitstype 8 NodeColorconst BLACK = box(NodeColor,0x0)const RED = box(NodeColor,0x1)isequal(x::NodeColor, y::NodeColor) =eq_int(unbox(NodeColor,x),unbox(NodeColor,y))show(io::IO, x::NodeColor) =print(io, x==BLACK ? "BLACK" : "RED")
3) Conversion/promotion is difficult to understand for me. For example,k = RBTree{Int64, UTF8String}()d[64] = "Hi"chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.
4) Iteration is quite awful for this case. I understand the start, done, next approach and its functional school of thought. But the python way is much easier to read and express in this case: iterate left subtree, yield me, iterate right subtree, done. I would love to see both ways supported because each one is sometimes better and sometimes not.
thecount(what) = @task beginproduce("one $what")produce("two $what")produce("three $what")endjulia> for x in thecount("ha, ha, ha")println(x)endone ha, ha, hatwo ha, ha, hathree ha, ha, ha
5) what is the difference between 'for n = <iteratable>' and 'for n in <iteratable>'?
6) Speed. This implementation is quite slow compared to my naive python implementation I did a year ago [1]. I am not so much interested in speed personally, I just wonder if there are obvious implementation nono that cost too much speed.
7) Would a OrderedDict/OrderedSet be wanted in the standard library? If so, this implementation could maybe be used as base.
The basic idea here is to learn the idiomatic way to program it 'good'. I implemented a red-black tree, a basic Set and Dict based on them and applied them to the Actor centrality problem I posted on this list earlier this week.
Great! I read about the actor centrality problem and it was very interesting.
Excellent! Good way to start. Tor is working on an AVL tree implementation, so we can compare the two at some point and maybe make them both available in a collections framework with a uniform interface. (Or maybe have a death match and let the best tree win.)
If we agreed on an some abstracts, we could replace one implementation with another, without changing all the high level code, such as tests, convenience functions ans so on.
If Julia had interfaces, I would suggest simply:
interface Btree{K, D key :: K data :: D left :: Btree{K, D} # I will ditch my 'inline' child array :D right :: Btree{K, D} end This isn't valid code, the point is only the field names (yours are already compliant, how lucky! :) Then RBtrees can have parent pointers and I can have my count fields and front end functions should work with very little change: they only need to throw exceptions if an unsupported operation is attempted For instance, my AVL trees have 'rank' and 'select' functions which require count fields to be efficient. If the underlying tree doesn't have count fields I'd rather throw an exception than implement correct, but slow code.
There is a third candidate that also must be considered: http://en.wikipedia.org/wiki/AA_tree
Just look at the two main routines, skew and split, isn't that neat? I guess one good reason to have trees written in Julia rather than wrapping external implementation is that people can then augment the code for special purposes. In that case the simple implementation of AA trees makes them stand out.
In this case, however, I don't think you should be using an enum at all. You should have two different types of RBTreeNodes: RedNode and BlackNode. That would save you a byte in representation and make coding things significantly cleaner since some methods that do things based on node color could be written with dispatch. The only fly in that ointment is that there are places where you change the color of a node during the algorithm. You could, of course, create a new node instead of the other color. Especially once composite types become immutable, this would definitely be the way to go instead.
I hadn't thought of that. This can be used for AVL trees as well: there are balanced trees, left-heavy trees and right-heavy trees. Inserting on the left of a left-heavy tree does so and so... Since Julia must store a type field with each object, anyway, why not put it to use?
k = RBTree{Int64, UTF8String}()
d[64] = "Hi"
chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.
I still don't understand the type system. Or the modules. I'm waiting for the book (“The Complete Idiot's Guide to Julia Programming”. Stefan and Jeff, you are writing on a book, right? Where can I preorder? :)
6) Speed. This implementation is quite slow compared to my naive python implementation I did a year ago [1]. I am not so much interested in speed personally, I just wonder if there are obvious implementation nono that cost too much speed.
No idea, but you can imagine from the other thread that there's lots of things that can be tried.
Jeff pointed at allocation as the likely culprit. Python implementations likely have highly fine-tuned meomory management by now. If you want c-like speed for a tree in Julia you would probably have to write a custom allocator, not so much for the general allocation overhead (which will likely be improved down the road), but to avoid the type fields.
Some code improvement can be done without regard to tree implementation. Did you for instance see this old paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8991
In this case, however, I don't think you should be using an enum at all. You should have two different types of RBTreeNodes: RedNode and BlackNode. That would save you a byte in representation and make coding things significantly cleaner since some methods that do things based on node color could be written with dispatch. The only fly in that ointment is that there are places where you change the color of a node during the algorithm. You could, of course, create a new node instead of the other color. Especially once composite types become immutable, this would definitely be the way to go instead.
3) Conversion/promotion is difficult to understand for me. For example,k = RBTree{Int64, UTF8String}()d[64] = "Hi"chokes because "Hi" is an ASCII String. The Dict{} implementation somehow copes with this better. I want that too, please.Conversion and promotion isn't the issue here — its that you don't have a method that handles assignment when the RHS isn't a UTF8String. This is the same problem Tor encountered — check out this thread.
If we agreed on an some abstracts, we could replace one implementation with another, without changing all the high level code, such as tests, convenience functions ans so on.
There is a third candidate that also must be considered: http://en.wikipedia.org/wiki/AA_treeJust look at the two main routines, skew and split, isn't that neat? I guess one good reason to have trees written in Julia rather than wrapping external implementation is that people can then augment the code for special purposes. In that case the simple implementation of AA trees makes them stand out.
Jeff pointed at allocation as the likely culprit. Python implementations likely have highly fine-tuned meomory management by now. If you want c-like speed for a tree in Julia you would probably have to write a custom allocator, not so much for the general allocation overhead (which will likely be improved down the road), but to avoid the type fields.
Some code improvement can be done without regard to tree implementation. Did you for instance see this old paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8991
- I saw that there are some tests in test/, but they are all stateless. Stateful tests (with initializers and finalizers and so on) would be useful to me. Is there some kind of unit test module already available?
- Is there a way to define new infix operators in julia?
- Is there a way to do printf like string interpolation in Julia? I know of "$var" but this just does not cut it from time to time.
--Thanks a lot!Holger
@sprintf works too. I think the docs still refer to printf as a function...
Not a fan of AA Trees really. In my (limited) experience, Red-Black Trees outperformed them easily.
Jeff pointed at allocation as the likely culprit. Python implementations likely have highly fine-tuned meomory management by now. If you want c-like speed for a tree in Julia you would probably have to write a custom allocator, not so much for the general allocation overhead (which will likely be improved down the road), but to avoid the type fields.
I would be interested in checking this out. Where can I find examples of custom allocators?
Tor, I wipped together a quick benchmark: inserting 1_000_000 unique integers (where key=value) and deleting half of them again takes ~2 secs in my current implementation. If you want to, I can share the benchmark and code with you.
--Regards,Holger