Using as.POSIXlt in a Mapper or Reducer


Sean Willson

Jul 26, 2013, 2:52:55 PM
to rha...@googlegroups.com
I'm running R 3.0.0 on RHadoop using these library versions:

[1] rmr2_2.2.1     reshape2_1.2.2 plyr_1.8       stringr_0.6.2  functional_0.4
[6] digest_0.6.3   bitops_1.0-5   RJSONIO_1.0-3  Rcpp_0.10.3   

We're running in a Hadoop cluster using CDH 4.1.2.

Now on to my question: when I write a mapper and/or reducer that calls as.POSIXlt, the job blows up with both backend = "local" and backend = "hadoop". The same call runs fine outside of mapreduce. Has anyone else seen this? Here is some example code you can run to reproduce the issue. If you uncomment the output = "YES" line in the mapper and comment out the line above it, the job works fine. It fails even if you hard-code the value passed to as.POSIXlt instead of taking it from the inbound values.

#--
library(rmr2)

mapreduce.with.dateTime = function(input, output = NULL) {
  test.mapper = function(key, value) {
    output = strptime(as.POSIXlt(as.numeric(value), origin="1970-01-01"), format = "%Y-%m-%d")
#    output = "YES"
    keyval(NULL, output)
  }
  
  mapreduce(input = input, output = output, map = test.mapper, verbose = T)
}

my.data = c(1373263236, 1373263200)

# this works great
print(strptime(as.POSIXlt(my.data, origin="1970-01-01"), format = "%Y-%m-%d"))

# setup to run local or on hadoop
rmr.options(backend = "local")
# copy the data as usual
hdfs.data = to.dfs(my.data)

# run on the cluster and it will fail, usually with weird errors not related to this method call
results = mapreduce.with.dateTime(hdfs.data)

print(from.dfs(results))
#--

Thoughts?

Sean

Antonio Piccolboni

Jul 26, 2013, 6:00:55 PM
to RHadoop Google Group
I can reproduce your problem. I am traveling right now, but I will look into it as soon as I can.


Antonio


Antonio Piccolboni

Jul 26, 2013, 6:51:55 PM
to RHadoop Google Group
Hi,
keyval arguments can be a list, matrix, data.frame, or vector, but not, for instance, a POSIXlt object. If you change your function this way:

mapreduce.with.dateTime = function(input, output = NULL) {
  test.mapper = function(key, value) {
    output = strptime(as.POSIXlt(as.numeric(value), origin="1970-01-01"), format = "%Y-%m-%d")
#    output = "YES"
    keyval(NULL, list(output))
  }
  mapreduce(input = input, output = output, map = test.mapper, verbose = T)
}

it seems to work fine. The crash is pretty bad when you don't do the right thing; that is a side effect of some speed-related changes. I need to put some checks here and there to at least make the errors less drastic and more informative. Thanks
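
To make the point about argument types concrete, here is a minimal base R sketch (no rmr2 required) of what the original mapper was handing to keyval. The helper is.plain.value is hypothetical and not part of rmr2; it only mirrors the four types listed above and is not the check rmr2 itself performs. The fixed date string and tz = "UTC" are assumptions chosen to keep the example deterministic.

```r
# strptime returns a POSIXlt object, not a character string.
x = strptime("2013-07-08", format = "%Y-%m-%d", tz = "UTC")
class(x)             # "POSIXlt" "POSIXt"
is.list(unclass(x))  # TRUE: underneath, a POSIXlt is a list of fields (sec, min, hour, ...)

# Hypothetical helper (NOT an rmr2 function): TRUE only for the kinds of
# values named in the reply above (list, matrix, data.frame, plain vector).
is.plain.value = function(v) {
  is.data.frame(v) || is.matrix(v) || is.vector(v)
}

is.plain.value(x)                # FALSE: the class attribute means it is not a plain vector
is.plain.value(list(x))          # TRUE: wrapped in a list, as in the fix above
is.plain.value(as.character(x))  # TRUE: a plain character vector
```

A check like this at the top of a mapper, run on a small sample in the local backend, may surface the type problem before it turns into an unrelated-looking crash.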

Sean Willson

Jul 27, 2013, 12:28:53 PM
to rha...@googlegroups.com, ant...@piccolboni.info
For some reason I thought strptime returned a string and not a POSIXlt object, that's weird. I don't see how to get just a string, rather than that POSIX object (which I'm sure is more expensive to serialize), from mapper to reducer. Ideas?

As you said though, if it can handle any structure then why does it fail if you add it as a column to an existing data frame? If you take the same example, passing a data frame in place of value, and then change the code to be the following:

    value$day = strptime(as.POSIXlt(value$epochTime, origin="1970-01-01"), format = "%Y-%m-%d")

It blows up again which seems counter to what you said because the argument is a data.frame that is being passed around, the mapper just adds a column.

Thoughts,
Sean

Sean Willson

Jul 27, 2013, 12:59:35 PM
to rha...@googlegroups.com, ant...@piccolboni.info
If I wrap the result of strptime in as.character(...), it seems to work. I guess that is a solution, though based on what you said I still don't get why the data.frame doesn't serialize right.
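
A minimal sketch of that workaround, for anyone landing here later: convert the epoch seconds straight to a character day with format(), so the mapper never emits a POSIXlt object. The helper name epoch.to.day and the tz = "UTC" argument are assumptions for illustration (the original code used the session's local time zone), and the mapper is only defined here, not run.

```r
# Hypothetical helper: epoch seconds -> "YYYY-MM-DD" as a plain character vector.
# tz = "UTC" is an assumption; drop it to use the local time zone as in the
# original post.
epoch.to.day = function(epoch) {
  format(as.POSIXct(as.numeric(epoch), origin = "1970-01-01", tz = "UTC"),
         "%Y-%m-%d")
}

epoch.to.day(c(1373263236, 1373263200))  # "2013-07-08" "2013-07-08"

# Mapper that emits character values only. Defining it does not require rmr2;
# calling it does.
test.mapper = function(key, value) {
  rmr2::keyval(NULL, epoch.to.day(value))
}
```

Going through format() skips the round trip of parsing a printed date back with strptime, and the result is already the character vector that as.character(...) was producing.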

Antonio Piccolboni

Jul 27, 2013, 4:10:44 PM
to RHadoop Google Group


On Jul 27, 2013 12:28 PM, "Sean Willson" <sean.w...@gmail.com> wrote:
>
> For some reason I thought strptime returned a string and not a POSIXlt object, that's weird. I don't see how to get it to just a string and not that POSIX object (which i'm sure is more expensive to serialize) from mapper to reducer. Ideas?
>

Yes, character vectors use a faster serialization path than R's built-in one.

> As you said though if it can handle any structure then why does it fail if you add it as a column to an existing data frame. If you take the same example, passing a data frame in place of value, and then change the code to be the following:
>
>     value$day = strptime(as.POSIXlt(value$epochTime, origin="1970-01-01"), format = "%Y-%m-%d")
>
> It blows up again which seems counter to what you said because the argument is a data.frame that is being passed around, the mapper just adds a column.

It may be that I've never tested it with more than atomic columns. I will look into it.

Antonio

Antonio Piccolboni

Jul 27, 2013, 6:46:48 PM
to RHadoop Google Group
I need to see a simple, reproducible example of the problem, like your first report. I tried something with data frames with a POSIXlt column and it worked just fine. Thanks

Sean Willson

Jul 28, 2013, 10:56:23 PM
to rha...@googlegroups.com, ant...@piccolboni.info
I found that my problem was that I was calling as.numeric(...) to convert a string column of the data frame into a number (or so I thought), but it wasn't working. I had to call strtoi(...) instead, and that worked a lot better. Thanks for the help with this! When a mapper or reducer crashes because of an API-related issue, it is currently very hard to diagnose: the error sometimes points at things like Rcpp when the real problem is that the results could not be serialized.

Thanks again.

Sean
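
One plausible explanation for the as.numeric(...) versus strtoi(...) difference, sketched below, is that the string column was stored as a factor, which was the data.frame default in R 3.0. The thread does not confirm this, so treat it as an assumption; the column name epochTime is taken from the earlier message and the data frame itself is made up for illustration.

```r
# Assumption: the string column arrived as a factor (stringsAsFactors = TRUE
# was the default in R 3.0; it is set explicitly here so the example behaves
# the same on newer R versions).
df = data.frame(epochTime = c("1373263236", "1373263200"),
                stringsAsFactors = TRUE)

# as.numeric on a factor returns the internal level codes, not the values.
codes = as.numeric(df$epochTime)

# Going through as.character first recovers the actual numbers.
values = as.numeric(as.character(df$epochTime))

# strtoi coerces its argument with as.character internally, which would
# explain why it "worked a lot better".
ints = strtoi(df$epochTime)
```

Note that strtoi returns integers, so it yields NA for values beyond the 32-bit integer range; as.numeric(as.character(...)) does not have that limit.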