Cascading 3.0 Group by Throws Null pointer exception


varun reddy

Aug 24, 2015, 2:28:07 PM
to cascading-user

Hello Team

I am trying to run a GroupBy followed by a Count aggregation, but I am getting a NullPointerException. It is straightforward code, similar to the Impatient part 2 tutorial. Am I missing anything?

My code:

Scheme schIn2 = new TextDelimited(new Fields("id_2", "oddeven", "name"), ",");
Scheme schOut1 = new TextDelimited(new Fields("id_2", "name"), true, ",");

Tap srctap = new Hfs(schIn2, "/user/hive/warehouse/pokernew/poker_1.csv");
Tap sinkTap = new Hfs(new TextDelimited(true, ","), "/user/hashoutput/", SinkMode.REPLACE);

Pipe lhs = new Pipe("lhs");
lhs = new GroupBy(lhs, new Fields("oddeven"));
lhs = new Every(lhs, new Fields("oddeven"), new Count(), Fields.ALL);

FlowDef flowDef = FlowDef.flowDef().addSource(lhs, srctap).addTailSink(lhs, sinkTap);

Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, cascadClient.class);
Hadoop2MR1FlowConnector flowConnector = new Hadoop2MR1FlowConnector(properties);

Flow flow = flowConnector.connect(flowDef);
flow.writeDOT("dot/Segment.dot");
flow.complete();

Stack trace:
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) /user/hashoutput, with job id: job_1440408860150_0008, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:261)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:162)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Cluster Logs:

2015-08-24 16:22:46,320 INFO [main] cascading.tap.hadoop.io.MultiInputSplit: current split input path: hdfs://localhost:9000/user/hive/warehouse/pokernew/poker_1.csv
2015-08-24 16:22:46,322 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@5c8504fd
2015-08-24 16:22:46,348 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 10
2015-08-24 16:22:46,352 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException
        at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:414)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:442)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)



Here is the input data:

1999977,1,B1999977
1999978,0,B1999978
1999979,1,B1999979
1999980,0,B1999980
1999981,1,B1999981
1999982,0,B1999982
1999983,1,B1999983


Thx
Varun

varun reddy

Aug 24, 2015, 5:51:57 PM
to cascading-user
Resolved. I am new to Hadoop and was playing around with configurations. I wanted to override some mapred config params, so I created a file named mapred-default.xml in etc/hadoop [instead of adding them to mapred-site.xml]. After deleting the mapred-default.xml file from etc/hadoop, the GroupBy code worked.
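For reference, the supported place for overrides like these is etc/hadoop/mapred-site.xml, which Hadoop layers on top of the bundled defaults property by property, so anything not listed there keeps its stock value. A minimal sketch (one override shown; the rest follow the same pattern):

```xml
<!-- etc/hadoop/mapred-site.xml: site overrides layer on top of the bundled
     defaults, so unlisted properties keep their stock values. -->
<configuration>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
    </property>
</configuration>
```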


What still puzzles me is that I created mapred-default.xml a couple of days back. For two days, Cascading code with Each pipes (filters and functions) worked fine; the NullPointerException was thrown only today, when I tried GroupBy and Every pipes. Why is that? Below is a snapshot of the params I had added to mapred-default.xml:

<configuration>

    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xms512m -Xmx1024m</value>
    </property>

    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xms512m -Xmx1024m</value>
    </property>
    
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xms512m -Xmx1024m</value>
    </property>
            
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
    </property>
    
    <property>
        <name>mapred.reduce.tasks</name>
        <value>10</value>
    </property>
    
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>2048</value>
    </property>
    
    <property>
        <name>mapred.min.split.size</name>
        <value>268435456</value>
    </property>
</configuration>


Thx
Varun

Ken Krugler

Aug 24, 2015, 6:06:54 PM
to cascadi...@googlegroups.com
Some of the parameters you specify are only used when there's a reduce phase to the job (which a GroupBy would trigger).

So one guess: you've got 10 reduce tasks specified (mapred.reduce.tasks), each with 2 GB of memory (mapreduce.reduce.memory.mb); depending on the size of your cluster, I could see this running out of memory.

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





varun reddy

Aug 24, 2015, 6:57:37 PM
to cascading-user
Thanks, Ken, for your reply.

The data I ran on is around 30 MB, so OOM is probably not the cause in this case: 512 MB is the default JVM size, so 512 MB x 10 ≈ 5 GB, and my machine has 12 GB of memory.

I suspect some mandatory params required for the reduce phase are missing, since the file I created has only 3 reduce-related params.
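Elaborating that guess (a toy sketch, not Hadoop code; I have not verified this against the Hadoop source): Configuration loads the first mapred-default.xml it finds on the classpath as its base layer. Since etc/hadoop is on the classpath, a handmade file there shadows the copy bundled in the MapReduce jar, and every property it omits ends up with no value at all instead of its stock default. Map-only flows (Each pipes) never read the affected settings; a GroupBy adds a reduce phase, and the map-side sorting collector (the MapTask.createSortingCollector frame in the cluster log) then looks up properties that now resolve to null. Property names below are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration (plain Java, no Hadoop APIs) of why shadowing
 * mapred-default.xml surfaces only on the reduce path: any property
 * the handmade file omits has no value at all, because the bundled
 * defaults file was never loaded.
 */
public class DefaultsShadowing {
    // Stand-in for the defaults file bundled inside the MapReduce jar.
    static final Map<String, String> BUNDLED = new HashMap<>();
    // Stand-in for a handmade mapred-default.xml listing only a few overrides.
    static final Map<String, String> HANDMADE = new HashMap<>();
    static {
        BUNDLED.put("mapreduce.task.io.sort.mb", "100");   // map-output sort buffer
        BUNDLED.put("mapreduce.map.memory.mb", "1024");
        HANDMADE.put("mapreduce.map.memory.mb", "2048");   // the override is present
        // ...but the sort-related settings were never copied over.
    }

    /** Resolve a property against whichever defaults file the classpath found. */
    static String get(Map<String, String> defaults, String key) {
        return defaults.get(key); // no fallback layer once the bundled file is shadowed
    }

    public static void main(String[] args) {
        // Map-only jobs never ask for the sort settings, so the gap goes
        // unnoticed. A GroupBy adds a reduce phase, and the sorting collector
        // reads properties that now resolve to null:
        System.out.println(get(HANDMADE, "mapreduce.task.io.sort.mb")); // prints: null
    }
}
```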

Thx
Varun 