plyrmr::dcast memory issues


amitai golub

Feb 18, 2015, 6:53:47 AM
to rha...@googlegroups.com
Hi, 

I've been working with the RHadoop packages for about two months now, and I keep running into problems that relate to memory: loads that I feel my cluster should be able to handle, but for some reason doesn't. 

In particular, I have a melted table (in HDFS, in a Hive-style format) with 5 columns, like so:

> head(my.tbl)

          loc   id          date     pivot    cat
1          aa    1    2013-04-07       d60      1
2          aa    1    2013-04-07       d90      0
3          aa    2    2013-04-07       d60      0
4          aa    2    2013-04-07       d90      0
5          aa    3    2013-04-07       d60      3

The full table has ~100K rows. I'm trying to run the following command:

> dcast(my.tbl, formula = loc + id + date ~ pivot, value.var = 'cat')
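
The wide shape I'm after (from the sample rows above, with NA where a pivot level is missing for a key) looks like this:

  loc id       date d60 d90
1  aa  1 2013-04-07   1   0
2  aa  2 2013-04-07   0   0
3  aa  3 2013-04-07   3  NA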

For 100 rows, this works fine. For 10K rows it breaks down: the map job runs for a while and then hangs at some percentage. Increasing the memory via 'plyrmr.options' does help, in that it gets the 10K job through, but not enough to handle the full table. How does this scale? 

I've looked at the source code for "dcast.pipe" and tried to run it line by line; the function seemed to seize up at the grouping stage. Could it be connected to the data format (when I read the data in, R parses it correctly)?

Thanks for the help!!!
Amitai

Antonio Piccolboni

Feb 18, 2015, 12:25:48 PM
to RHadoop Google Group
Could you share the plyrmr.options call that improved things? The current implementation of plyrmr::dcast requires all the rows that share the same values of the id variables to fit in memory at once. I don't see how that could be a problem with only ~100K rows, so I suspect it's the usual container/Java/R memory allocation problem.
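
To illustrate with a minimal local sketch (using reshape2, whose dcast plyrmr's version is modeled on; this is not the actual plyrmr code), here is what has to happen in memory for each key:

library(reshape2)

# everything that shares loc + id + date is collected
# and cast in memory in one go
one.group <- data.frame(
  loc   = "aa",
  id    = 1,
  date  = "2013-04-07",
  pivot = c("d60", "d90"),
  cat   = c(1, 0))

dcast(one.group, loc + id + date ~ pivot, value.var = "cat")
#   loc id       date d60 d90
# 1  aa  1 2013-04-07   1   0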


Antonio


amitai golub

Feb 24, 2015, 4:07:44 AM
to rha...@googlegroups.com
Thanks for the quick answer, and sorry for taking a while to get back to you. The following settings alleviate the problem somewhat:

list(hadoop = list(
  D = 'mapreduce.map.java.opts=-Xmx3072m',
  D = 'mapreduce.reduce.java.opts=-Xmx3072m',
  D = 'mapreduce.map.memory.mb=3072',
  D = 'mapreduce.reduce.memory.mb=3072'))
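
One caveat I'm not sure about: mapreduce.*.memory.mb is the total YARN container size, while -Xmx only bounds the JVM heap, and streaming runs the R process inside the same container. Setting both to 3072 leaves R no headroom, so something along these lines (an untested guess on my part) might behave better:

list(hadoop = list(
  # keep the container larger than the JVM heap,
  # leaving room for the streaming R process
  D = 'mapreduce.map.java.opts=-Xmx3072m',
  D = 'mapreduce.map.memory.mb=4096',
  D = 'mapreduce.reduce.java.opts=-Xmx3072m',
  D = 'mapreduce.reduce.memory.mb=4096'))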
Additionally, the system admin tells me that once launched, the Hadoop streaming job gradually takes up more and more of the available memory on the nodes. I don't know whether it matters, but for completeness I thought I'd share. If you need the exact technical details of what is going on server side, I can get them.

Cheers,
Amitai

Antonio Piccolboni

Feb 24, 2015, 10:26:33 PM
to RHadoop Google Group
Take a look at help(hadoop.settings)
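
That is, in an R session with plyrmr loaded:

library(plyrmr)
help(hadoop.settings)  # documents the available Hadoop settings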
