Merge-like operation for DDFs


Aleksander Eskilson

Jun 2, 2015, 10:15:07 AM6/2/15
to tesser...@googlegroups.com
Hello,

R's data frames support an operation similar to a table join via merge. Looking at the guide, datadr's divide and drJoin don't seem to provide quite the same semantics when the use case is combining the features of two DDFs that share a unique key. Attempting to divide the data on a unique key would also seem wasteful in job-management overhead, since the data would be chunked into singletons. Is there a semantic in datadr similar to data frame merging if one wants to row-join data sets currently contained in separate DDFs? I suppose divide and drJoin could work, but it looks like that would require listing all the columns we want joined, as opposed to just the key to join on. It seems to me that divide and drJoin were designed for divisions where each key is associated with more than one value, as exemplified in the doc [1].
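To make the workaround concrete, here is a rough Python sketch of the divide-into-singletons-then-join idea (a language-agnostic illustration only, not actual datadr code; divide_by_key and dr_join are hypothetical stand-ins for divide and drJoin):

```python
from collections import defaultdict

# Hypothetical stand-ins for two DDFs that share the unique key "id".
left  = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
right = [{"id": 1, "y": "a"}, {"id": 2, "y": "b"}]

def divide_by_key(rows, key):
    """Mimic divide(): one chunk per key value -- singleton chunks
    when the key is unique, which is where the overhead comes from."""
    chunks = defaultdict(list)
    for row in rows:
        chunks[row[key]].append(row)
    return chunks

def dr_join(*chunked):
    """Mimic drJoin(): gather chunks sharing a key into one merged record."""
    merged = {}
    for chunks in chunked:
        for k, rows in chunks.items():
            merged.setdefault(k, {}).update(rows[0])  # singleton chunk assumed
    return merged

result = dr_join(divide_by_key(left, "id"), divide_by_key(right, "id"))
# Each key now maps to the union of columns from both sources.
```

The point of the sketch is that every row becomes its own chunk before the join, which is the overhead concern raised above.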

Thanks,
Alek

Ryan Hafen

Jun 2, 2015, 2:03:40 PM6/2/15
to Aleksander Eskilson, tesser...@googlegroups.com
Hi Alek,

You are correct that this isn't currently supported, aside from the inefficient approach of using drJoin on singleton chunks. Our main goal with datadr is to provide a simple interface and analysis paradigm for key/value-type large data sets. Data frames are a special and common case of data in this paradigm, but we aren't planning full support for the type of operations you might get in an RDBMS or in data-frame-specific packages like dplyr.

However, I don't think it would be too difficult to add this, and it would be a good thing to support. The approach I am thinking of would accept input DDFs of arbitrary chunking and run a custom map/reduce procedure: the map breaks the data up, either row-wise or according to a hash function applied to the columns to merge on, and the reduce performs the merge. If you have a use case you'd like to see this feature for, please file an issue on GitHub.
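In rough terms, the idea could look something like this Python sketch (an illustration of the hash-partitioned merge only, not an actual or planned datadr implementation; hash_partition_merge is a hypothetical name):

```python
from collections import defaultdict

def hash_partition_merge(left, right, on, n_partitions=4):
    """Hash-partitioned merge: the 'map' routes each row to a partition
    by hashing the merge key, so matching rows from both inputs land in
    the same partition; the 'reduce' merges them there."""
    # Map phase: assign rows to partitions by key hash.
    parts = defaultdict(lambda: ([], []))
    for row in left:
        parts[hash(row[on]) % n_partitions][0].append(row)
    for row in right:
        parts[hash(row[on]) % n_partitions][1].append(row)

    # Reduce phase: within each partition, join rows sharing the same key.
    out = []
    for l_rows, r_rows in parts.values():
        by_key = defaultdict(dict)
        for row in l_rows + r_rows:
            by_key[row[on]].update(row)
        out.extend(by_key.values())
    return out

left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
right = [{"k": 1, "b": 3}, {"k": 2, "b": 4}]
merged = sorted(hash_partition_merge(left, right, "k"), key=lambda r: r["k"])
```

Because the partitioning depends only on the merge key, the input chunking can be arbitrary, which is the advantage over the singleton-chunk drJoin approach.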

Thanks,

Ryan


--
You received this message because you are subscribed to the Google Groups "Tessera-Users" group.

Aleksander Eskilson

Jun 2, 2015, 2:29:34 PM6/2/15
to tesser...@googlegroups.com, aleksa...@gmail.com
Hi Ryan,

Thanks for the reply. It makes sense to have other elements of the Hadoop stack handle RDBMS operations when they're necessary, and then have those systems export to a file type R can grok (Hive could do this quite naturally if one is truly working with data sets already too big for dplyr, since they'd already be in HDFS at that point). Still, I think it's common enough for data from one cohesive set to be housed in separate sources with a common key that having access to MapReduce-implemented joins in the datadr API would help developers get their data to the dr phase a little faster. I found that this question has already been poked at a bit in the context of Rhipe [1]. Your outlined implementation seems right on the money. I'll file an issue for it.

Regards,
Alek Eskilson
