Truelength

71 views
Skip to first unread message

Matthew Dowle

unread,
Jun 12, 2012, 9:47:48 PM6/12/12
to juli...@googlegroups.com
Enter code here...

Hi,

I've just started to look at Julia. I'm one of the authors of data.table in R, btw, which has been mentioned here before. Nice job on Julia! One thing that looks especially attractive in julia.h is that *data is, well, a pointer.

typedef struct {
    JL_STRUCT_TYPE
    size_t length
;
    jl_value_t
*data[1];
} jl_tuple_t;

In R we don't have that.  In R the general type SEXP is a header (such as type and length) followed by the data; i.e., DATAPTR() is an offset, not a pointer.  So in data.table the way it has to implement fast grouping, for contiguous groups, is :
  1. Find largest group
  2. Allocate memory for that largest group in a static environment
  3. Loop through groups doing a memcpy of each subset of each column to the static environment 
  4. Eval j
  5. Accumulate results
  6. Next group step 3,  reusing the same static memory.

This works well and is fast (the fastest grouping available in R for single runs on large data such as 1e9 rows), but could be faster. Possibly much faster in Julia due to less overhead and freedom in the julia code we eval in each group (such as mean, sum, sd, etc etc). In Julia we wouldn't need the memcpy as the *data could just be pointed to the start of the group, set length as long as the group, and that's it.  Am I understanding correctly?  If so, that's quite attractive. That's also how Python is arranged internally I believe?  I don't know how Pandas does that, but would be interesting.

However, and this is the main point of this post, what we do have in R is truelength as well as length.  This allows data.table to over allocate column pointer slots (100 by default, but some people set that to 1000 or more).  A 2 column data.table is length 2 and truelength 100.  Adding a new column is just assignment to the 3rd slot and updating length to 3 (no copy of existing 2 columns).   Deleting a column is instant regardless of size, using memmove to budge up the column pointers after the one to be deleted. These ops are extremely fast in data.table since they don't copy any part of memory at all. Afaik, no other package (or indeed base R) uses truelength for this purpose or similar. So it's something that R has in its internals that only data.table uses, afaik.

Have I missed truelength in Julia, and if not would adding it to julia.h make sense?

Another aspect is over-allocating each column, so that fast insert() can take place after the last row of the table. Think real-time streaming data, time ordered.  Different schemes can be used to over allocate and reallocate when full,  but all of them would need truelength in the vector header as well as length, iiuc.


typedef struct {
    JL_STRUCT_TYPE
    size_t length;
    size_t truelength;
    jl_value_t *data[1];
} jl_tuple_t;

Jeff Bezanson

unread,
Jun 12, 2012, 10:09:12 PM6/12/12
to juli...@googlegroups.com
Hi Matt, great to hear from you.

That particular struct just describes the layout of tuples, which are
one kind of builtin object (an immutable collection of a few other
values). Our arrays (jl_array_t) allow an arbitrary data pointer. But,
most of the action is in user defined composite types, which are like
structs and are no less efficient than "built in" types. Those can be
used to put your own abstractions over arrays, as in base/subarray.jl
which adds in-place indexing.

Our 1-d Arrays internally perform over-allocation to support efficient
resizing. That can be used to build other growable things, or you can
perform your own over-allocation inside some new type (e.g. keeping an
array of some arbitrary size, and an implementation of length() that
reports the publicly-visible length).

-Jeff

Matthew Dowle

unread,
Jun 13, 2012, 5:05:55 AM6/13/12
to juli...@googlegroups.com
Very interesting. Thanks for the pointers ...

On Wednesday, 13 June 2012 03:09:12 UTC+1, Jeff Bezanson wrote:
Hi Matt, great to hear from you.

That particular struct just describes the layout of tuples, which are
one kind of builtin object (an immutable collection of a few other
values). Our arrays (jl_array_t) allow an arbitrary data pointer. But,
most of the action is in user defined composite types, which are like
structs and are no less efficient than "built in" types. Those can be
used to put your own abstractions over arrays, as in base/subarray.jl
which adds in-place indexing.

Our 1-d Arrays internally perform over-allocation to support efficient
resizing. That can be used to build other growable things, or you can
perform your own over-allocation inside some new type (e.g. keeping an
array of some arbitrary size, and an implementation of length() that
reports the publicly-visible length).

-Jeff

Reply all
Reply to author
Forward
0 new messages