Debugging count.cols in plyrmr

Kirsti Laurila

unread,

Sep 25, 2014, 8:13:47 AM9/25/14

to rha...@googlegroups.com

Hello,

I have a problem with count.cols function. It fails pretty randomly with my data. I.e I have a data.frame and when I run
xx=to.dfs(aa)
a=as.data.frame(count.cols(input(xx))) it sometimes fails and sometimes doesn't file (when aa is the same).

When I check the stderr, it doesn't give any error, in syslog there is an error:

2014-09-25 09:51:15,905 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: java.io.EOFException
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:344)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:543)
	at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
	at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1577)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1631)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1482)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRawVector(TypedBytesInput.java:412)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:144)
	at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:56)
	at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:386)
2014-09-25 09:51:15,910 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.io.EOFException
	at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:344)
	at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:543)
	at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
	at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1577)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1631)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1482)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRawBytes(TypedBytesInput.java:218)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:152)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRawVector(TypedBytesInput.java:412)
	at org.apache.hadoop.typedbytes.TypedBytesInput.readRaw(TypedBytesInput.java:144)
	at org.apache.hadoop.streaming.io.TypedBytesOutputReader.readKeyValue(TypedBytesOutputReader.java:56)
	at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:386)

Don't really know how to proceed, is there a way how I could debug count.cols function? 

Best,
Kirsti

Antonio Piccolboni

unread,

Sep 25, 2014, 3:01:37 PM9/25/14

to RHadoop Google Group

What's in aa? Does a=as.data.frame(count.cols(input(aa))) work? Are you using 0.4.0? Thanks

--
post: rha...@googlegroups.com ||
unsubscribe: rhadoop+u...@googlegroups.com ||
web: https://groups.google.com/d/forum/rhadoop?hl=en-US
---
You received this message because you are subscribed to the Google Groups "RHadoop" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhadoop+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kirsti Laurila

unread,

Sep 29, 2014, 3:47:27 AM9/29/14

to rha...@googlegroups.com, ant...@piccolboni.info

Hi,

aa is data.frame. I tested a=as.data.frame(count.cols(input(aa))) and sometimes it works, sometime it fails (with the same data.frame). I was using 0.3.0, but tried this with 0.4.0.

Apparently in 0.4.0 there is not anymore count.cols but just countn there I couldn't get

xx=to.dfs(aa)
a=as.data.frame(plyrmr::count(input(xx))) to work at all, complains that "no applicable method for 'as.pipermr' applied to an object of class "function". Additionally, without converting to dfs

a=as.data.frame(plyrmr::count(input(aa))), I got some memory problems and when tested, I couldn't run any mapreduce there as I got error:

MAP capability required is more than the supported max container capability in the cluster. Killing the Job. mapResourceReqt: 4096 maxContainerCapability:3072
Job received Kill while in RUNNING state.
REDUCE capability required is more than the supported max container capability in the cluster. Killing the Job. reduceResourceReqt: 4096 maxContainerCapability:3072

Anything I could do? How to use objects in dfs with count function in with 0.4.0?

Best,
Kirsti

Antonio Piccolboni

unread,

Sep 29, 2014, 12:16:52 PM9/29/14

to rha...@googlegroups.com, ant...@piccolboni.info

On Monday, September 29, 2014 12:47:27 AM UTC-7, Kirsti Laurila wrote:

Hi,

aa is data.frame. I tested a=as.data.frame(count.cols(input(aa))) and sometimes it works, sometime it fails (with the same data.frame). I was using 0.3.0, but tried this with 0.4.0.

Sorry, we need to focus on one release, the latest.

Apparently in 0.4.0 there is not anymore count.cols but just countn there I couldn't get

Sorry again, while a package is in 0.x.x release backward compatibility requirements are waived or at least relaxed (see semver.org for details). I would suggest reading help(count) and see if that gets you going.

xx=to.dfs(aa)
a=as.data.frame(plyrmr::count(input(xx))) to work at all, complains that "no applicable method for 'as.pipermr' applied to an object of class "function".

That's a bug and it's fixed in 0.5.0 (upcoming), but you are really complicating your life by mixing rmr2 and plyrmr when you don' have to. count allows you to specify which columns or combination thereof you need to count, for instance

count(mtcars, carb:cyl, carb:gear)

So it's a lot more flexible than count.cols, but you need to give it additional arguments. I don't have 0.4.0 installed right now, but I hope there is a help(count) entry that tells you as much. If you are disappointed that we had a backward incompatible change, you'd be better off waiting for a 1.0 release, but I don't have a timetable for that. What we are doing right now is making an early release available to get feedback and develop based on what people do with it or not. So it's very valuable to us that you want to try it out, at the same time you need to set your expectation accordingly. If you have feedback on count or whatever other function I am interested.

Additionally, without converting to dfs

a=as.data.frame(plyrmr::count(input(aa))), I got some memory problems and when tested, I couldn't run any mapreduce there as I got error:

MAP capability required is more than the supported max container capability in the cluster. Killing the Job. mapResourceReqt: 4096 maxContainerCapability:3072
Job received Kill while in RUNNING state.
REDUCE capability required is more than the supported max container capability in the cluster. Killing the Job. reduceResourceReqt: 4096 maxContainerCapability:3072

Anything I could do? How to use objects in dfs with count function in with 0.4.0?

That's a separate problem and has to deal with how Hadoop YARN allocates memory. What Hadoop and what rmr are you running please?

Antonio

Reply all

Reply to author

Forward