I would like to share with you an OCaml implementation of extensible
arrays. The implementation is functional, based on balanced trees (and on
the code for Set and Map); I called the module Vec (for vector - I like
short names). You can find it at http://www.dealfaro.com/home/vec.html
Module Vec provides, in log-time:
- Access and modification to arbitrary elements (Vec.put n el v puts
element el in position n of vector v, for instance).
- Concatenation
- Insertion and removal of elements from arbitrary positions
(auto-enlarging and auto-shrinking the vector).
as well as:
- All kinds of iterators and some visitor functions.
- Efficient translation to/from lists and arrays.
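
To make the log-time access claim concrete, here is a minimal sketch of a
size-annotated binary tree with positional get and put. It is not the actual
Vec source: the type, the helper names (size, node, of_array) and the error
messages are assumptions made for illustration, and only the argument order
of put follows the description above.

```ocaml
(* Hypothetical sketch, not the real Vec module. *)
type 'a t =
  | Empty
  | Node of 'a t * 'a * 'a t * int   (* left, element, right, subtree size *)

let size = function
  | Empty -> 0
  | Node (_, _, _, s) -> s

let node l x r = Node (l, x, r, size l + 1 + size r)

(* get n v: element at position n; one root-to-node path is followed. *)
let rec get n = function
  | Empty -> invalid_arg "get: index out of bounds"
  | Node (l, x, r, _) ->
    let sl = size l in
    if n < sl then get n l
    else if n = sl then x
    else get (n - sl - 1) r

(* put n el v: new vector with el at position n; v itself is unchanged
   and every subtree off the path is shared with it. *)
let rec put n el = function
  | Empty -> invalid_arg "put: index out of bounds"
  | Node (l, x, r, s) ->
    let sl = size l in
    if n < sl then Node (put n el l, x, r, s)
    else if n = sl then Node (l, el, r, s)
    else Node (l, x, put (n - sl - 1) el r, s)

(* Build a balanced tree from an array by splitting at the midpoint. *)
let of_array a =
  let rec build lo hi =            (* covers a.(lo) .. a.(hi - 1) *)
    if lo >= hi then Empty
    else
      let mid = lo + (hi - lo) / 2 in
      node (build lo mid) a.(mid) (build (mid + 1) hi)
  in
  build 0 (Array.length a)
```

Because put rebuilds only the nodes on one path, the old and new versions
share the rest of the tree, which is what makes the structure persistent.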
An advantage of Vec over List, for very large data structures, is that
iterating over a Vec of size n never needs a stack deeper than O(log n),
whereas non-tail-recursive functions on lists can overflow the stack.
I have been using this data structure for some months, and it has been very
handy in a large number of occasions. I hope it can be as useful to you.
I would appreciate all advice and feedback. Also, is there a repository
where I should upload it? Do you think it is worth it?
All the best,
Luca
Very interesting. I always felt uneasy about the presence of
imperative arrays without a functional counterpart. I can't wait to
try it.
Looking at your array type definition, I assume that the timings you
specified are worst-case? Is it possible to achieve better (but
amortized) bounds? Do you think it would be worth the trouble?
I didn't see in your specs the complexity of your iterators. Do
these work in linear time, like those of the List and Array modules?
Regards,
Loup
_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
For those of you interested in functional arrays, consider Sylvain Conchon
and Jean-Christophe Filliâtre's paper [1]. Their Union-Find (UF) uses
persistent arrays as its base data structure. I have run tests with the
UF using the code provided, an implementation of the k-BUF data structure
(delayed backtracking), and an altered version of the persistent array (fat
nodes + delayed backtracking). The tests I did show that this version of
persistent arrays is very efficient (especially for single-threaded
backtracking).
Maybe Luca could pit his implementation against that of the article and
report on how they fare?
Regards,
Hugo Ferreira.
[1] http://www.lri.fr/~filliatr/ftp/publis/puf-wml07.ps
I did. :)
> ;-). I am open to new ideas. In part, I wanted a simple data structure
> (easier to extend, among other things). Also, I use Set, Map, etc, quite
> often, and those are also balanced trees: I thought that if I can live with
> those, I can probably live with Vec as well.
So can I. Your current implementation is already very attractive, and
looks very usable. For the new idea, have you thought of making (or
specifying) syntactic sugar to use your array?
About improving performance, here is my guess: there is no way to
lower the bounds on get and set. However, the average cost of insert
may already be O(1), provided you use your array the same way you
would use an imperative version of it (more accurately, not inserting
an element to an old version of your array). The same may be true for
remove.
Therefore, if I guess right, to take advantage of persistence AND have
insert perform in O(1) average, you would have to use (and pay for)
lazy evaluation. How, I don't know (yet).
(Note that I have stolen this idea from Okasaki's book)
> For an iterator, the worst case is as follows, where n is the size of the
> Vec:
>
> if you iterate on the whole Vec, then O(n)
> if you iterate over m elements (you can iterate on a subrange), then O(m +
> log n).
> That's why I have iterators: you can also iterate via a for loop, using get
> to access the elements, but then the time becomes O(n log n) for the first
> case, and O(m log n) for the second case.
That is why I wondered if lazy evaluation was worth the trouble at all:
most of the time, we iterate rather than insert or remove elements.
I only regret the absence of filter. Is there a way to obtain an
efficient filter? (Well, if my guess above is right, a naive
implementation of filter would already be quite efficient...)
For get/set, the worst case and the average case are both logarithmic: it's
a balanced tree (if you are lucky, you can find your answer at the root!
;-). I am open to new ideas. In part, I wanted a simple data structure
(easier to extend, among other things). Also, I use Set, Map, etc, quite
often, and those are also balanced trees: I thought that if I can live with
those, I can probably live with Vec as well.
For an iterator, the worst case is as follows, where n is the size of the
Vec:
- if you iterate on the whole Vec, then O(n)
- if you iterate over m elements (you can iterate on a subrange), then
O(m + log n).
That's why I have iterators: you can also iterate via a for loop, using get
to access the elements, but then the time becomes O(n log n) for the first
case, and O(m log n) for the second case.
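
The difference between the two styles of traversal can be sketched as
follows. This is an illustration on a hypothetical size-annotated tree, not
the Vec code itself; the names iter, iter_range, get and of_array and the
half-open range convention [i, j) are assumptions.

```ocaml
(* Hypothetical sketch, not the real Vec module. *)
type 'a t =
  | Empty
  | Node of 'a t * 'a * 'a t * int   (* left, element, right, subtree size *)

let size = function Empty -> 0 | Node (_, _, _, s) -> s
let node l x r = Node (l, x, r, size l + 1 + size r)

(* Whole-vector iteration: each node is visited exactly once, O(n). *)
let rec iter f = function
  | Empty -> ()
  | Node (l, x, r, _) -> iter f l; f x; iter f r

(* Iterate over positions i .. j-1 only. A subtree is entered only if it
   overlaps the range, so the cost is O(m + height) for m elements. *)
let rec iter_range f i j = function
  | Empty -> ()
  | Node (l, x, r, _) ->
    let sl = size l in
    if i < sl && j > 0 then iter_range f i j l;
    if i <= sl && sl < j then f x;
    if j > sl + 1 then iter_range f (i - sl - 1) (j - sl - 1) r

(* Positional lookup: one root-to-node path per call. Using it inside a
   for loop costs O(log n) per element on a balanced tree. *)
let rec get n = function
  | Empty -> invalid_arg "get: index out of bounds"
  | Node (l, x, r, _) ->
    let sl = size l in
    if n < sl then get n l
    else if n = sl then x
    else get (n - sl - 1) r

(* Balanced construction from an array, used only to set up examples. *)
let of_array a =
  let rec build lo hi =
    if lo >= hi then Empty
    else
      let mid = lo + (hi - lo) / 2 in
      node (build lo mid) a.(mid) (build (mid + 1) hi)
  in
  build 0 (Array.length a)
```

Both iter_range f 5 9 v and a loop calling get k v for k = 5 to 8 visit the
same four elements; the first descends into the tree once, the second
restarts from the root for every index.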
Luca
This is the beginning of an awesome data structure!
> So can I. Your current implementation is already very attractive, and
> looks very usable. For the new idea, have you thought of making (or
> specifying) syntactic sugar to use your array?
Should be very easy using the new camlp4. You might like to add a slicing
notation as well. :-)
> About improving performance...
I have two suggestions:
1. Add an extra node representing single elements that replaces Node(Empty, _,
Empty). This reduces GC stress enormously and makes the whole thing ~30%
faster.
2. Allow unbalanced sub trees. Balancing is slow and folds and maps don't need
to rebalance, but "get" should force rebalancing. Extracting subarrays should
return an unbalanced result.
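
One possible reading of suggestion 1 is sketched below. The constructor name
Single, the size field and the helper functions are assumptions, since the
actual Vec type is not shown in this thread; the sketch only illustrates the
shape of the change and makes no claim about the speed-up.

```ocaml
(* Hypothetical sketch, not the real Vec module. *)
type 'a t =
  | Empty
  | Single of 'a                     (* stands for Node (Empty, x, Empty, 1) *)
  | Node of 'a t * 'a * 'a t * int   (* left, element, right, subtree size *)

let size = function
  | Empty -> 0
  | Single _ -> 1
  | Node (_, _, _, s) -> s

(* Smart constructor: a node with two empty children becomes a Single,
   which is a smaller heap block than a full Node. *)
let node l x r =
  match l, r with
  | Empty, Empty -> Single x
  | _ -> Node (l, x, r, size l + 1 + size r)

let rec get n = function
  | Empty -> invalid_arg "get: index out of bounds"
  | Single x -> if n = 0 then x else invalid_arg "get: index out of bounds"
  | Node (l, x, r, _) ->
    let sl = size l in
    if n < sl then get n l
    else if n = sl then x
    else get (n - sl - 1) r

(* Count how many leaves ended up as Single, to see the effect. *)
let rec count_singles = function
  | Empty -> 0
  | Single _ -> 1
  | Node (l, _, r, _) -> count_singles l + count_singles r

let of_array a =
  let rec build lo hi =
    if lo >= hi then Empty
    else
      let mid = lo + (hi - lo) / 2 in
      node (build lo mid) a.(mid) (build (mid + 1) hi)
  in
  build 0 (Array.length a)
```

In a balanced tree roughly half of the elements sit in leaves, so a compact
leaf representation affects a large share of the allocations. Every function
that pattern-matches on the tree needs an extra case, which is the cost of
the change.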
> Is there a way to obtain a efficient filter?
Yes. I discovered a most-excellent way to do this. It requires arbitrary
metadata in every node, a constructor that composes subnodes to create the
metadata for the parent and a filter function that can cull branches from the
search tree.
I used this in my Mathematica implementation to provide asymptotically fast
filtering based upon lazily evaluated sets of symbols in each subnode. This
gave huge performance improvements with no significant performance overhead.
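
The idea of metadata-driven culling can be sketched roughly as follows. This
is a guess at the general shape, not the Mathematica implementation mentioned
above: the metadata here is computed eagerly rather than lazily, the result is
returned as a list rather than as a new tree, and all names (node, measure,
combine, may_match, keep, filter_to_list) are hypothetical.

```ocaml
(* Hypothetical sketch of filtering with per-node metadata. *)
type ('a, 'm) t =
  | Empty
  | Node of ('a, 'm) t * 'a * ('a, 'm) t * 'm  (* 'm summarises the subtree *)

let meta = function
  | Empty -> None
  | Node (_, _, _, m) -> Some m

(* Constructor that composes the children's metadata with the element's.
   measure gives the metadata of one element, combine merges two summaries. *)
let node ~measure ~combine l x r =
  let m = measure x in
  let m = match meta l with None -> m | Some ml -> combine ml m in
  let m = match meta r with None -> m | Some mr -> combine m mr in
  Node (l, x, r, m)

(* Collect, in order, the elements satisfying keep. A whole subtree is
   skipped as soon as its metadata says no element in it can match. *)
let filter_to_list ~may_match ~keep t =
  let rec go t acc =
    match t with
    | Empty -> acc
    | Node (l, x, r, m) ->
      if not (may_match m) then acc
      else
        let acc = go r acc in
        let acc = if keep x then x :: acc else acc in
        go l acc
  in
  go t []

let of_array ~measure ~combine a =
  let rec build lo hi =
    if lo >= hi then Empty
    else
      let mid = lo + (hi - lo) / 2 in
      node ~measure ~combine (build lo mid) a.(mid) (build (mid + 1) hi)
  in
  build 0 (Array.length a)
```

As a usage example, with integer elements one could take measure x = (x, x)
and combine (a, b) (c, d) = (min a c, max b d), so that each node stores the
minimum and maximum of its subtree; a filter for x >= 12 can then discard any
subtree whose maximum is below 12 without visiting it.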
--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
OCaml for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists/?e
Luca
Thanks for the pointer to the excellent paper. First, let me say that my
Vec data structure was born to fill a need I had while working on a project:
while it has been useful to me, I certainly do not claim it is the best that
can be done, so I am very grateful for these suggestions!
My Vec data structure is different from persistent arrays. It is likely to
be less efficient for get/set use.
However, it offers at logarithmic cost insertion/removal operations that are
not present in the standard persistent arrays.
Consider a Vec a of size 10.
- Vec.insert 3 d a inserts value d in position 3 of a, shifting
elements 3..9 of a to positions 4..10.
- Vec.remove 3 a removes the element in position 3 of a, shifting
elements 4..9 to positions 3..8. Vec.pop is similar and returns the
removed element as well.
- Vec.concat works in log-time.
These operations are necessary if you want to use a Vec as a FIFO, for
example (you append elements at the end, and you get the first element via
Vec.pop 0 a). In many algorithms, it is often handy to be able to
remove/insert elements in the middle of a list.
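
The index-shifting behaviour of insert, remove and pop can be sketched on a
size-annotated tree as below. This is not the Vec source: rebalancing is
deliberately left out to keep the sketch short, so the log-time bound does
not hold for this code. The helper names and the (element, vector) result
order of pop are assumptions; only the argument orders follow the text above.

```ocaml
(* Hypothetical sketch, not the real Vec module; no rebalancing is done. *)
type 'a t =
  | Empty
  | Node of 'a t * 'a * 'a t * int   (* left, element, right, subtree size *)

let size = function Empty -> 0 | Node (_, _, _, s) -> s
let node l x r = Node (l, x, r, size l + 1 + size r)

(* insert n d v: d lands at position n; old elements n.. move up by one. *)
let rec insert n d = function
  | Empty ->
    if n = 0 then node Empty d Empty
    else invalid_arg "insert: index out of bounds"
  | Node (l, x, r, _) ->
    let sl = size l in
    if n <= sl then node (insert n d l) x r
    else node l x (insert (n - sl - 1) d r)

(* Remove and return the leftmost element of a non-empty tree. *)
let rec pop_first = function
  | Empty -> invalid_arg "pop_first: empty"
  | Node (Empty, x, r, _) -> (x, r)
  | Node (l, x, r, _) ->
    let (y, l') = pop_first l in
    (y, node l' x r)

(* pop n v: (element at position n, v without it); later elements move
   down by one position. *)
let rec pop n = function
  | Empty -> invalid_arg "pop: index out of bounds"
  | Node (l, x, r, _) ->
    let sl = size l in
    if n < sl then
      let (y, l') = pop n l in (y, node l' x r)
    else if n > sl then
      let (y, r') = pop (n - sl - 1) r in (y, node l x r')
    else
      (match r with
       | Empty -> (x, l)
       | _ -> let (y, r') = pop_first r in (x, node l y r'))

let remove n v = snd (pop n v)

(* FIFO use: append at the end, take from the front with pop 0. *)
let append d v = insert (size v) d v

let to_list v =
  let rec go t acc =
    match t with
    | Empty -> acc
    | Node (l, x, r, _) -> go l (x :: go r acc)
  in
  go v []
```

A real implementation would rebalance on the way back up from insert and pop,
in the same way Set and Map do, which is what keeps each operation
logarithmic.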
In summary, I don't think the Vec data structure is a replacement for arrays
or persistent arrays in numerically-intensive work. But if you want a
flexible data structure for the 90% of the code that is not performance
critical, it can be useful.
Now the question is: can one get better get/set efficiency while retaining
the ability to insert/remove elements? (I am sure that there is something
better to be done...).
Luca
On 7/19/07, Hugo Ferreira <h...@inescporto.pt> wrote:
>
Thanks!
> > So can I. Your current implementation is already very attractive, and
> > looks very usable. For the new idea, have you thought of making (or
> > specifying) syntactic sugar to use your array?
>
> Should be very easy using the new camlp4. You might like to add a slicing
> notation as well. :-)
I have to study how to do it ... this would be very interesting.
Would you be interested in helping?
> > About improving performance...
>
> I have two suggestions:
>
> 1. Add an extra node representing single elements that replaces
> Node(Empty, _, Empty). This reduces GC stress enormously and makes the
> whole thing ~30% faster.
This is easy. I can give it a try soon, and see if I get something
reasonable, or if the code blows up.
> 2. Allow unbalanced sub trees. Balancing is slow and folds and maps don't
> need to rebalance, but "get" should force rebalancing. Extracting
> subarrays should return an unbalanced result.
This is almost easy. I would need to add a bit to each node to keep track
of whether it's balanced...
The penalty would be that the balancing function would need to do slightly
more work to find out what has to be balanced.
So perhaps it's not a good idea for append, insert, but it could make sense
for concat (?), and especially for filter and sub...
But I am hesitant. If one does concat, or one does sub to extract a
sub-array, I wrote the code already so that sharing is maximized. What is
the percentage of cases in which you get a Vec, but then don't do any
get/set on it, and only iterate?
Especially since you already have iterators on subranges? Do you think it's
worth it? Does anyone have advice?
> > Is there a way to obtain a efficient filter?
>
> Yes. I discovered a most-excellent way to do this. It requires arbitrary
> metadata in every node, a constructor that composes subnodes to create the
> metadata for the parent and a filter function that can cull branches from
> the search tree.
>
> I used this in my Mathematica implementation to provide asymptotically
> fast filtering based upon lazily evaluated sets of symbols in each subnode.
> This gave huge performance improvements with no significant performance
> overhead.
I don't provide filter because..., well, I guess because I forgot: of all
iterators, filter is the one I need most rarely.
I should at least provide a simple implementation of it...
Another operation I would like to implement is splice:
splice i v1 v2
replaces the element in position i of vec v2 with vec v1. A sort of
generalized insert.
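
To pin down the intended meaning of splice, here is a small model of it on
ordinary lists. It only illustrates the semantics described above; it runs in
linear time and says nothing about how a tree-based Vec would implement the
operation. The error behaviour for an out-of-range index is an assumption.

```ocaml
(* List-based model of splice: the element at position i of v2 is
   replaced by all the elements of v1, in order. *)
let splice i v1 v2 =
  let rec go k = function
    | [] -> invalid_arg "splice: index out of bounds"
    | x :: rest ->
      if k = 0 then v1 @ rest          (* drop x, put v1 in its place *)
      else x :: go (k - 1) rest
  in
  if i < 0 then invalid_arg "splice: index out of bounds"
  else go i v2
```

With this reading, splicing a one-element vector is the same as put, and
splicing an empty vector is the same as remove.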
BTW, Jon (and anyone else as well), let me know if you would like to help...
I could create a Google Code project so that we get a svn repository for the
code.
Luca
>
>
> This is almost easy. I would need to add a bit to each node to keep
> track of whether it's balanced...
> The penalty would be that the balancing function would need to do
> slightly more work to find out what has to be balanced.
> So perhaps it's not a good idea for append, insert, but it could make
> sense for concat (?), and especially for filter and sub...
> But I am hesitant. If one does concat, or one does sub to extract a
> sub-array, I wrote the code already so that sharing is maximized. What
> is the percentage of cases in which you get a Vec, but then don't do
> any get/set on it, and only iterate?
> Especially since you already have iterators on subranges? Do you
> think it's worth it? Anyone has advice?
I don't think that with laziness you can avoid enough work to make
inserts O(1).
On the other hand, sub and filter can be done in O(M + log N) easily
enough, see:
http://citeseer.ist.psu.edu/236207.html
The paper is about red-black trees, but it's applicable to all
rotation-balanced trees.
Brian