Hi,
I was scaling some code up to handle larger data with datadr using kvMemory objects (in-memory datadr), and I ran into this line of code:
byProt = divide(ddf(data), by="Protein")
Here data is a data.frame. That line was taking forever on a data.frame with 80 columns and 400k rows.
So... I went into divide to look around at how things worked. I timed pieces of getDivideDF, which is called in "divide", and much to my surprise Ryan's getDivideDF finished in 80 seconds. So what happened?
Well, the code the user likely wants to execute is in "divide" at:
if (by$type == "condDiv" && is.data.frame(data)) {
  res <- getDivideDF(data, by = by, postTransFn = postTransFn,
    bsvFn = bsvFn, update = update)
  suppressMessages(output <- output)
  return(convert(res, output, overwrite = overwrite))
}
That is the speedy data.table conversion in "getDivideDF". But wrapping the data in ddf(data) converts it to something that is not a data.frame before divide sees it, so the is.data.frame(data) check fails, the getDivideDF line never executes, and divide drops into the quite computationally intensive MapReduce path.
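To make the mechanism concrete, here is a minimal base-R sketch. It does not use datadr at all: wrap_like_ddf and takes_fast_path are hypothetical stand-ins I made up for ddf() and for the is.data.frame(data) test in "divide", and the real ddf object is structured differently. It only shows that a wrapped data.frame no longer passes an is.data.frame check.

```r
# Hypothetical stand-in for ddf(): store the data.frame inside a list
# carrying a different class. This is NOT the real datadr structure.
wrap_like_ddf <- function(df) {
  structure(list(list(key = NULL, value = df)),
            class = c("ddf_like", "list"))
}

# Hypothetical stand-in for the data.frame half of the condition in divide()
takes_fast_path <- function(data) is.data.frame(data)

df <- data.frame(Protein = c("A", "A", "B"), x = 1:3)

plain_result   <- takes_fast_path(df)                 # TRUE
wrapped_result <- takes_fast_path(wrap_like_ddf(df))  # FALSE

print(c(plain = plain_result, wrapped = wrapped_result))
```

The data inside the wrapper is unchanged; only the class of the outer object differs, and that is all the check looks at.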
I am not sure there is a solution that needs implementing, though one imagines that peeking at the ddf object could help. This is just a public service announcement about the in-memory kvMemory idiom. For the world: if you are going to "divide" into a kvMemory object, don't wrap the data in a "ddf" call. It will bypass an optimization Ryan wrote for us =)
If you want a sense of the timings I am dealing with, they look like this for subsets, but I don't think they scale linearly:
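If you want to guard against doing this by accident in your own scripts, something like the following might help. It is only a sketch: divide_plain_df is a hypothetical helper name of mine, not part of datadr, and all it does is refuse anything that is not a plain data.frame before handing off to datadr's divide.

```r
# Hypothetical helper (not part of datadr): insist on a plain data.frame
# so that divide() can take its data.frame branch.
divide_plain_df <- function(data, by, ...) {
  if (!is.data.frame(data)) {
    stop("pass the data.frame itself, not ddf(data); got class ",
         paste(class(data), collapse = "/"))
  }
  if (!requireNamespace("datadr", quietly = TRUE)) {
    stop("the datadr package is not installed")
  }
  datadr::divide(data, by = by, ...)
}

# Anything that is not a data.frame is rejected before datadr is touched:
msg <- tryCatch(divide_plain_df(list(1, 2), by = "Protein"),
                error = function(e) conditionMessage(e))
print(msg)
```

With a real data.frame you would call it as divide_plain_df(data, by = "Protein"), which is the same as calling divide(data, by = "Protein") directly.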
> subset_data = data[1:20000,]
> ddf(subset_data)
Distributed data frame backed by 'kvMemory' connection
attribute | value
----------------+-----------------------------------------------------------------------------------------------------------------------------------------
names | Peptide(fac), Protein(fac), X13.1489(num), X42.2590(num), X36.2529(num), X24.1105(num), X29.1785(num), X24.2290(num), and 78 more
nrow | 20000
size (stored) | 37.83 MB
size (object) | 37.83 MB
# subsets | 1
* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
#Making kvMemory with ddf before calling divide:
> system.time({byProSlow <- divide(ddf(subset_data), by = c("Protein"))})
* Verifying parameters...
* Applying division...
   user  system elapsed
 45.884   0.198  46.102
#Letting divide execute and then make kvMemory:
> system.time({byProFast <- divide(subset_data, by = c("Protein"))})
   user  system elapsed
 20.881   0.067  20.950