Cascading 3.0 Group by Throws Null pointer exception


varun reddy

Aug 24, 2015, 2:28:07 PM
to cascading-user

Hello Team

I am trying to run a GroupBy followed by a Count aggregation, but I am getting a NullPointerException. It is straightforward code, similar to the Impatient part 2 tutorial. Am I missing anything?

My code:

Scheme schIn2 = new TextDelimited(new Fields("id_2", "oddeven", "name"), ",");
Scheme schOut1 = new TextDelimited(new Fields("id_2", "name"), true, ",");

Tap srctap = new Hfs(schIn2, "/user/hive/warehouse/pokernew/poker_1.csv");
Tap sinkTap = new Hfs(new TextDelimited(true, ","), "/user/hashoutput/", SinkMode.REPLACE);

Pipe lhs = new Pipe("lhs");
lhs = new GroupBy(lhs, new Fields("oddeven"));
lhs = new Every(lhs, new Fields("oddeven"), new Count(), Fields.ALL);

FlowDef flowDef = FlowDef.flowDef().addSource(lhs, srctap).addTailSink(lhs, sinkTap);

Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, cascadClient.class);
Hadoop2MR1FlowConnector flowConnector = new Hadoop2MR1FlowConnector(properties);

Flow flow = flowConnector.connect(flowDef);
flow.writeDOT("dot/Segment.dot");
flow.complete();

Stack trace:
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) /user/hashoutput, with job id: job_1440408860150_0008, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:261)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:162)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Cluster Logs:

2015-08-24 16:22:46,320 INFO [main] cascading.tap.hadoop.io.MultiInputSplit: current split input path: hdfs://localhost:9000/user/hive/warehouse/pokernew/poker_1.csv
2015-08-24 16:22:46,322 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: cascading.tap.hadoop.io.MultiInputSplit@5c8504fd
2015-08-24 16:22:46,348 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 10
2015-08-24 16:22:46,352 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException
        at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:414)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:442)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)



Here is the input data:

1999977,1,B1999977
1999978,0,B1999978
1999979,1,B1999979
1999980,0,B1999980
1999981,1,B1999981
1999982,0,B1999982
1999983,1,B1999983


Thx
Varun

varun reddy

Aug 24, 2015, 5:51:57 PM
to cascading-user
Resolved. I am new to Hadoop and was playing around with configurations. I wanted to override some mapred config params, so I created a file named mapred-default.xml in etc/hadoop [instead of adding them to mapred-site.xml]. After deleting the mapred-default.xml file from etc/hadoop, the GroupBy code worked.
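For reference, the supported place for overrides like these is etc/hadoop/mapred-site.xml, which Hadoop layers on top of the bundled defaults property by property, so anything not listed there keeps its stock value. A minimal sketch (one override shown; the rest follow the same pattern):

```xml
<!-- etc/hadoop/mapred-site.xml: site overrides layer on top of the bundled
     defaults, so unlisted properties keep their stock values. -->
<configuration>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
    </property>
</configuration>
```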


What still puzzles me is that I created mapred-default.xml a couple of days back. For two days, Cascading code with Each pipes (filters and functions) worked fine; the NullPointerException was thrown only today, when I tried GroupBy and Every pipes. Why is that? Below is a snapshot of the params I had added to mapred-default.xml:

<configuration>

    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xms512m -Xmx1024m</value>
    </property>

    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xms512m -Xmx1024m</value>
    </property>
    
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xms512m -Xmx1024m</value>
    </property>
            
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
    </property>
    
    <property>
        <name>mapred.reduce.tasks</name>
        <value>10</value>
    </property>
    
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>2048</value>
    </property>
    
    <property>
        <name>mapred.min.split.size</name>
        <value>268435456</value>
    </property>
</configuration>


Thx
Varun

Ken Krugler

Aug 24, 2015, 6:06:54 PM
to cascadi...@googlegroups.com
Some of the parameters you specify are only used when there's a reduce phase to the job (which a GroupBy would trigger).

So one guess: you've got 10 reduce tasks specified (mapred.reduce.tasks), each with 2 GB of memory (mapreduce.reduce.memory.mb); depending on the size of your cluster, I could see this running out of memory.

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





varun reddy

Aug 24, 2015, 6:57:37 PM
to cascading-user
Thanks, Ken, for your reply.

The data I ran on is around 30 MB, so OOM is probably not the cause in this case: 512 MB is the default JVM size, so 512 MB x 10 ≈ 5 GB, and my machine has 12 GB of memory.

I suspect some mandatory params required for the reduce phase are missing, since the file I created has only 3 reduce-related params.
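Elaborating that guess (a toy sketch, not Hadoop code; I have not verified this against the Hadoop source): Configuration loads the first mapred-default.xml it finds on the classpath as its base layer. Since etc/hadoop is on the classpath, a handmade file there shadows the copy bundled in the MapReduce jar, and every property it omits ends up with no value at all instead of its stock default. Map-only flows (Each pipes) never read the affected settings; a GroupBy adds a reduce phase, and the map-side sorting collector (the MapTask.createSortingCollector frame in the cluster log) then looks up properties that now resolve to null. Property names below are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration (plain Java, no Hadoop APIs) of why shadowing
 * mapred-default.xml surfaces only on the reduce path: any property
 * the handmade file omits has no value at all, because the bundled
 * defaults file was never loaded.
 */
public class DefaultsShadowing {
    // Stand-in for the defaults file bundled inside the MapReduce jar.
    static final Map<String, String> BUNDLED = new HashMap<>();
    // Stand-in for a handmade mapred-default.xml listing only a few overrides.
    static final Map<String, String> HANDMADE = new HashMap<>();
    static {
        BUNDLED.put("mapreduce.task.io.sort.mb", "100");   // map-output sort buffer
        BUNDLED.put("mapreduce.map.memory.mb", "1024");
        HANDMADE.put("mapreduce.map.memory.mb", "2048");   // the override is present
        // ...but the sort-related settings were never copied over.
    }

    /** Resolve a property against whichever defaults file the classpath found. */
    static String get(Map<String, String> defaults, String key) {
        return defaults.get(key); // no fallback layer once the bundled file is shadowed
    }

    public static void main(String[] args) {
        // Map-only jobs never ask for the sort settings, so the gap goes
        // unnoticed. A GroupBy adds a reduce phase, and the sorting collector
        // reads properties that now resolve to null:
        System.out.println(get(HANDMADE, "mapreduce.task.io.sort.mb")); // prints: null
    }
}
```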

Thx
Varun 