The hazard of getting a kvMemory object with divide(ddf(df))

jeremiah rounds

Aug 8, 2015, 8:19:44 PM

to Tessera-Users

Hi,

I was ramping some code up to handle larger data with datadr using kvMemory objects (in memory datadr), and I ran into this line of code:

byProt = divide(ddf(data), by="Protein")

Here data is a data.frame. That line was taking forever on a data.frame with 80 columns and 400k rows.

So I went into divide to see how things worked. I timed pieces of getDivideDF, which is called from divide, and much to my surprise Ryan's getDivideDF finished in 80 seconds. So what happened?

Well, the code the user likely wants to execute is in divide at:

    if (by$type == "condDiv" && is.data.frame(data)) {
      res <- getDivideDF(data, by = by, postTransFn = postTransFn,
        bsvFn = bsvFn, update = update)
      suppressMessages(output <- output)
      return(convert(res, output, overwrite = overwrite))
    }

That branch is the speedy data.table conversion in getDivideDF. But wrapping the data in ddf(data) converts it to something that is not a data.frame before divide sees it, so the getDivideDF branch never executes and divide drops into the much more computationally intensive MapReduce path.
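To make the branching concrete, here is a toy sketch in base R. It is not datadr's code: wrap_like_ddf and toy_divide are made-up stand-ins, and the wrapper's internal layout is an assumption chosen only so that is.data.frame() returns FALSE on it. It shows how a type check on the input silently routes the wrapped object to the generic path.

```r
# Hypothetical stand-in for ddf(): wraps the data.frame in a classed
# list of key-value pairs, so it no longer inherits from "data.frame".
wrap_like_ddf <- function(df) {
  structure(list(list(key = "", value = df)), class = c("toy_ddf", "list"))
}

# Hypothetical stand-in for divide(): the fast path is taken only
# when the input is a plain data.frame.
toy_divide <- function(data, by) {
  if (is.data.frame(data)) {
    path <- "fast"
    pieces <- split(data, data[[by]], drop = TRUE)
  } else {
    path <- "slow"
    # Generic path: visit each key-value pair, split its value,
    # then merge the parts that share a key.
    pieces <- list()
    for (kv in data) {
      parts <- split(kv$value, kv$value[[by]], drop = TRUE)
      for (k in names(parts)) {
        pieces[[k]] <- rbind(pieces[[k]], parts[[k]])
      }
    }
  }
  list(path = path, pieces = pieces)
}

df <- data.frame(Protein = c("a", "b", "a"), x = 1:3)
fast <- toy_divide(df, "Protein")
slow <- toy_divide(wrap_like_ddf(df), "Protein")
```

Both calls produce the same groups, but only the first one goes through the branch guarded by is.data.frame().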

I am not sure there is a solution that needs implementing, though one can imagine that peeking inside the ddf object could help. This is just a public service announcement about the in-memory kvMemory idiom. For the world: if you are going to divide into a kvMemory object, don't wrap the data in a ddf call first. It will break an optimization Ryan wrote for us =)

If you want a sense of the timings I am dealing with, they look like this for a subset, though I don't think they scale linearly:
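One way to guard against the wrapped call slipping back in is a small check before divide. The helper below is a sketch; require_plain_df is a made-up name and not part of datadr.

```r
# Hypothetical guard: stops early if the input is not a plain
# data.frame, otherwise returns it unchanged.
require_plain_df <- function(x) {
  if (!is.data.frame(x)) {
    stop("expected a plain data.frame, got an object of class: ",
         paste(class(x), collapse = ", "), call. = FALSE)
  }
  x
}

# Intended use with datadr (not run here):
# byProt <- divide(require_plain_df(data), by = "Protein")
```

With this in place, passing ddf(data) by mistake should fail immediately instead of quietly taking the slow path.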

> subset_data = data[1:20000,]

> ddf(subset_data)

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+-----------------------------------------------------------------------------------------------------------------------------------------
 names          | Peptide(fac), Protein(fac), X13.1489(num), X42.2590(num), X36.2529(num), X24.1105(num), X29.1785(num), X24.2290(num), and 78 more
 nrow           | 20000
 size (stored)  | 37.83 MB
 size (object)  | 37.83 MB
 # subsets      | 1

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary

# Making kvMemory with ddf before calling divide:

> system.time({byProSlow <- divide(ddf(subset_data), by = c("Protein"))})
* Verifying parameters...
* Applying division...
   user  system elapsed
 45.884   0.198  46.102

# Letting divide execute and then make kvMemory:

> system.time({byProFast <- divide(subset_data, by = c("Protein"))})
   user  system elapsed
 20.881   0.067  20.950

Ryan Hafen

Aug 14, 2015, 6:16:43 PM

to Jeremiah Rounds, Tessera-Users

Thanks Jeremiah. That is indeed a good observation, and thanks for sharing. The general idea for datadr is to always use MapReduce for all computations, since this makes it very easy to abstract away different back ends. But this (data frame input to divide) is the one case where I made an exception and wrote custom non-MapReduce code to do it faster.
