Hi,
I've just started to look at Julia. I'm one of the authors of data.table in R, btw, which has been mentioned here before. Nice job on Julia! One thing that looks especially attractive in julia.h is that *data is, well, a pointer.
typedef struct {
JL_STRUCT_TYPE
size_t length;
jl_value_t *data[1];
} jl_tuple_t;
In R we don't have that. In R the general type SEXP is a header (such as type and length) followed by the data; i.e., DATAPTR() is an offset, not a pointer. So in data.table the way it has to implement fast grouping, for contiguous groups, is :
1. Find largest group
2. Allocate memory for that largest group in a static environment
3. Loop through groups doing a memcpy of each subset of each column to the static environment
4. Eval j
5. Accumulate results
6. Next group step 3, reusing the same static memory.
This works well and is fast (the fastest grouping available in R for single runs on large data such as 1e9 rows), but could be faster. Possibly much faster in Julia due to less overhead and freedom in the julia code we eval in each group (such as mean, sum, sd, etc etc). In Julia we wouldn't need the memcpy as the *data could just be pointed to the start of the group, set length as long as the group, and that's it. Am I understanding correctly? If so, that's quite attractive. That's also how Python is arranged internally I believe? I don't know how Pandas does that, but would be interesting.
However, and this is the main point of this post, what we do have in R is truelength as well as length. This allows data.table to over allocate column pointer slots (100 by default, but some people set that to 1000 or more). A 2 column data.table is length 2 and truelength 100. Adding a new column is just assignment to the 3rd slot and updating length to 3 (no copy of existing 2 columns). Deleting a column is instant regardless of size, using memmove to budge up the column pointers after the one to be deleted. These ops are extremely fast in data.table since they don't copy any part of memory at all. Afaik, no other package (or indeed base R) uses truelength for this purpose or similar. So it's something that R has in its internals that only data.table uses, afaik.
Have I missed truelength in Julia, and if not would adding it to julia.h make sense?
Another aspect is over-allocating each column, so that fast insert() can take place after the last row of the table. Think real-time streaming data, time ordered. Different schemes can be used to over allocate and reallocate when full, but all of them would need truelength in the vector header as well as length, iiuc.
typedef struct {
JL_STRUCT_TYPE
size_t length;
size_t truelength;
jl_value_t *data[1];
} jl_tuple_t;