Hi,
I've been working with the RHadoop packages for about two months now, and I keep running into problems that seem to come down to memory or space: jobs that I feel my cluster should be able to handle, but for some reason can't.
In particular, I have a melted table (in HDFS, in a hive-ish format) with 5 columns, like so:
> head(my.tbl)
  loc id       date pivot cat
1  aa  1 2013-04-07   d60   1
2  aa  1 2013-04-07   d90   0
3  aa  2 2013-04-07   d60   0
4  aa  2 2013-04-07   d90   0
5  aa  3 2013-04-07   d60   3
The full table has ~100K rows. I'm trying to run the following command:
> dcast(my.tbl, formula = loc + id + date ~ pivot, value.var = 'cat')
For 100 lines this works fine. For 10K lines it breaks down: the map job runs for a while and then hangs at some percentage. Increasing the memory via 'plyrmr.options' does help, in that it gets the 10K-line job through, but not by enough to handle the full table. How does this scale?
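Just to be concrete about what I'm expecting, here is the same pivot run locally on the sample rows above with plain reshape2 (a small in-memory copy I made for illustration, not the HDFS table):

library(reshape2)

# in-memory copy of the five rows shown above, for illustration only
sample.tbl <- data.frame(
  loc   = rep("aa", 5),
  id    = c(1, 1, 2, 2, 3),
  date  = as.Date("2013-04-07"),
  pivot = c("d60", "d90", "d60", "d90", "d60"),
  cat   = c(1, 0, 0, 0, 3)
)

# same formula as above; this is the wide shape I'm after
dcast(sample.tbl, loc + id + date ~ pivot, value.var = 'cat')
#   loc id       date d60 d90
# 1  aa  1 2013-04-07   1   0
# 2  aa  2 2013-04-07   0   0
# 3  aa  3 2013-04-07   3  NA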
I've also looked at the source code for "dcast.pipe" and tried running it line by line; the function seems to seize up at the grouping stage. Could this be connected to the input format? (When I simply read the data in, R parses it correctly.)
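In case the read side matters, here is roughly how I'm pointing plyrmr at the table. The path, separator and column names below are placeholders for my actual layout, and I'm assuming input() accepts an rmr2 input-format object:

library(plyrmr)

# placeholder path and separator; the real table is a delimited hive-style dump in HDFS
hive.format <- rmr2::make.input.format(
  "csv",
  sep = "\t",
  col.names = c("loc", "id", "date", "pivot", "cat")
)

my.tbl <- input("/user/amitai/my_tbl", format = hive.format)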
Thanks for the help!!!
Amitai