Follow up to "plyrmr::dcast memory issues"


amitai golub

Mar 18, 2015, 11:20:33 AM
to rha...@googlegroups.com
Hi,

First thing: Thanks for the help and direction to the right documentation. 

Again, the data is the same as the previous case:

> my.tbl <- input('/user/hive/warehouse/ag_temp', input = 'pig.hive')
> head(my.tbl)

           V1   V2            V3        V4     V5
1          aa    1    2013-04-07       d60      1
2          aa    1    2013-04-07       d90      0
3          aa    2    2013-04-07       d60      0
4          aa    2    2013-04-07       d90      0
5          aa    3    2013-04-07       d60      3

As you can see, the data comes from a Hive table. When I try to dcast it (with no modifications to the standard memory settings) using the following command:

dcast(temp.ag, formula = V1 + V3 ~ V4, value.var = 'V2')

the MapReduce job completes fine, only to crash with a segmentation fault while writing the result to the /tmp location. Oddly enough, this works fine for up to 5002 rows; beyond that point the problem shows up. Here is the traceback I get when the job (and R) crashes:


_ _
 |
 |
<REDACTED MAPREDUCE OUTPUT>
 |
_|_

15/03/18 14:46:11 INFO streaming.StreamJob: Output directory: /tmp/file7cd1436ba89b

 *** caught segfault ***
address 0x60000010, cause 'memory not mapped'
Traceback:
1: unlist(x)
2: .class1(object)
3: as(unlist(x), class(template))
4: rmr.coerce(x[[i]], template[[i]])
5: FUN(1:63[[5L]], ...)
6: lapply(seq_along(template), function(i) rmr.coerce(x[[i]], template[[i]]))
7: to.data.frame(x, template)
8: from.list(vv, template[[2]])
9: rmr.length(val)
10: keyval(from.list(kk, template[[1]]), from.list(vv, template[[2]]))
11: format$format(con)
12: keyval.reader()
13: read.file(tmp)
14: from.dfs(input = x$data, format = x$format)
15: values(from.dfs(input = x$data, format = x$format))
16: as.data.frame.big.data(as.big.data(x))
17: as.data.frame(as.big.data(x))
18: as.data.frame.pipermr(sample(x, method = "any", n = 100))
19: as.data.frame(sample(x, method = "any", n = 100))
20: print(as.data.frame(sample(x, method = "any", n = 100)))
21: print.pipe(list(input = list(data = "/user/hive/warehouse/ag_temp",     
                                 format = "pig.hive",
                                 digest = "65c9a86b943a643d5177854fc8ec2a22"),
                    ungroup = FALSE, group = function (.x){
                      .y = do.call(f, c(list(.x), list(...)))
                      if (is.null(.y))
                        NULL
                      else {
                        if (is.data.frame(.y))
                          .y            
                        else {
                          if (is.matrix(.y))
                            as.data.frame(.y, stringsAsFactors = F)
                          else 
                            data.frame(x = .y, stringsAsFactors = F) 
                          }        
                        }    
                      }, 
                    reduce = function (.x){
                      .y = do.call(f, c(list(.x), list(...)))
                      if (is.null(.y))
                        NULL
                      else{
                        if (is.data.frame(.y))
                          .y
                        else {
                          if (is.matrix(.y))
                            as.data.frame(.y, stringsAsFactors = F)
                          else 
                            data.frame(x = .y, stringsAsFactors = F)
                          }        
                        }    
                      },
                    vectorized = FALSE
                    )
               )
22: print(list(input = list(data = "/user/hive/warehouse/ag_temp",     
                            format = "pig.hive", 
                            digest = "65c9a86b943a643d5177854fc8ec2a22"),     
               ungroup = FALSE, 
               group = function (.x){
                 .y = do.call(f, c(list(.x), list(...)))
                 if (is.null(.y))
                   NULL        
                 else {
                   if (is.data.frame(.y))
                     .y
                   else {                
                     if (is.matrix(.y))
                       as.data.frame(.y, stringsAsFactors = F)
                     else data.frame(x = .y, stringsAsFactors = F)
                     }        
                   }    
                 }, 
               reduce = function (.x){
                 .y = do.call(f, c(list(.x), list(...)))
                 if (is.null(.y))
                   NULL        
                 else {
                   if (is.data.frame(.y))
                     .y
                   else {
                     if (is.matrix(.y))
                       as.data.frame(.y, stringsAsFactors = F)
                     else data.frame(x = .y, stringsAsFactors = F)
                     }        
                   }    
                 },
               vectorized = FALSE
               )
          )


Any ideas? My table does generate many columns (about 63 after the cast), in case that is relevant.

Thanks a lot!
Amitai

Antonio Piccolboni

Mar 18, 2015, 11:45:53 AM
to rha...@googlegroups.com
Hi,
head is not a plyrmr function, so you are on your own trying this. Why don't you instead just type

input('/user/hive/warehouse/ag_temp', input = 'pig.hive')

and let the group know what happens? I suspect it will reproduce the problem, in which case it would be a plyrmr bug. Then, with the same 5002-row data set, I would try this:

as.data.frame(input('/user/hive/warehouse/ag_temp', input = 'pig.hive'))


and see if that reproduces the problem again. It would be best if I could reproduce the problem myself. Can you share the minimal data set that makes this happen? You can't post it here, but maybe there is a way for you to upload it somewhere? Thanks

amitai golub

Mar 19, 2015, 11:57:52 AM
to rha...@googlegroups.com
Hi,

input('/user/hive/warehouse/ag_temp', input = 'pig.hive')

works fine. It runs a short MR job and then writes to a /tmp file on the first call. Subsequent calls run no MR job; they seem to read from the /tmp file and print to the console.

The problem arises when I take that data and use it as input for the dcast function.

dcast(input('/user/hive/warehouse/ag_temp', input = 'pig.hive'), formula = V1 + V3 ~ V4, value.var = 'V2')

yields the segmentation fault I mentioned. When I first use

temp.ag <- as.data.frame(input('/user/hive/warehouse/ag_temp', input = 'pig.hive'))
dcast(temp.ag, formula = V1 + V3 ~ V4, value.var = 'V2')

it works fine. In the 5002-line case (which works in all cases) this gives the right result. With the extra 5003rd line, the second case yields a dcast result where the last line has some NAs in it. I've tried using the "fill = 0" option in both cases. The first case (5002) leads to a new error:

  Error: !is.null(template) is not TRUE
  No traceback available
  Error during wrapup:
  Execution halted 

This seems similar to issue



Sorry for being so long-winded... I hope I managed to explain the issues OK. I'll see if I can get some data to you, along with a code snippet that generates the error.

Best and Thanks,
Amitai

Antonio Piccolboni

Mar 23, 2015, 1:32:21 PM
to RHadoop Google Group
You are right! It's very similar to #26. The argument you passed as "input" should be named "format":

input('/user/hive/warehouse/ag_temp', format = 'pig.hive')

It's a problem with R generics: they force you to add the ... argument, which then swallows simple mistakes like this instead of catching them. I could fix it with boilerplate code, but that seems wrong. Because the format is not set correctly, the wrong parser is used the first time the data actually needs to be read (from.dfs or dcast, it doesn't matter).
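For readers unfamiliar with this pitfall, here is a minimal base R sketch of how ... can silently absorb a misnamed argument. The functions read_input and read_input_strict are hypothetical stand-ins written for illustration; they are not plyrmr's code, and the "native" default is an assumption made only for this example.

```r
# Hypothetical S3 generic; generics need ... so methods can add arguments.
read_input <- function(x, ...) UseMethod("read_input")

# Hypothetical method with a "format" argument and an assumed default.
read_input.character <- function(x, format = "native", ...) {
  list(path = x, format = format)
}

# Correct call: the format is picked up.
ok <- read_input("/some/path", format = "pig.hive")

# Misnamed argument: "input" matches no formal, so it lands in ...
# and is ignored. No error, and the default format is used instead.
bad <- read_input("/some/path", input = "pig.hive")

# The kind of boilerplate check that would catch the mistake:
# reject anything left over in ...
read_input_strict <- function(x, format = "native", ...) {
  extra <- list(...)
  if (length(extra) > 0) {
    stop("unused argument(s): ", paste(names(extra), collapse = ", "))
  }
  list(path = x, format = format)
}
```

In this sketch, bad$format is "native" rather than "pig.hive", which mirrors how the wrong parser ended up being used, while read_input_strict("/some/path", input = "pig.hive") stops with an error.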

--
post: rha...@googlegroups.com ||
unsubscribe: rhadoop+u...@googlegroups.com ||
web: https://groups.google.com/d/forum/rhadoop?hl=en-US
---
You received this message because you are subscribed to the Google Groups "RHadoop" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhadoop+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Antonio Piccolboni

Apr 17, 2015, 11:58:42 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Thanks for your private message providing a data sample. I can't respond privately, but here is what I found. First, your CSV uses "," as the separator, which is not the default for csv in plyrmr (plyrmr follows the default of read.table). Second, your CSV had a header row, which is not supported, so I removed it. After that, as far as I can tell, it works fine. I'm not sure that helps with your original data set, since it was in Pig format, if I remember correctly.


temp.ag <- input('~/Downloads/temp.ag.csv', format = make.input.format("csv", sep = ","))
dcast(temp.ag, formula = V1 + V2 ~ V3, value.var = 'V4')
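As a rough sketch of the header-removal step mentioned above, the first line of a CSV can be dropped in base R before handing the file to input. The file contents below are made-up stand-in rows, not the actual sample, and the temporary file paths are placeholders.

```r
# Write a small CSV with a header row to a temporary file (stand-in data).
src <- tempfile(fileext = ".csv")
writeLines(c("V1,V2,V3,V4",
             "aa,2013-04-07,d60,1",
             "aa,2013-04-07,d90,0"), src)

# Drop the first line (the header) and write the remaining rows to a new file.
lines <- readLines(src)
dst <- tempfile(fileext = ".csv")
writeLines(lines[-1], dst)

# Sanity check with base R: read the result as comma-separated, header-less data.
no_header <- read.table(dst, sep = ",", header = FALSE, stringsAsFactors = FALSE)
```

The file at dst then holds only data rows, and read.table assigns the default column names V1 to V4, matching the names used in the dcast formula.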



amitai golub

Jul 7, 2015, 4:44:53 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Awesome! Thanks for solving this issue! I didn't get a chance to answer sooner or really try it out myself, as this direction of the project was axed. But once I get some free time, I'll have a look and see what else I can do with it.

Thanks again!
