Problem with parquet-cascading


james

Jan 1, 2014, 12:05:12 PM
to cascadi...@googlegroups.com
Hi,

I'm trying to run this code:

Main.java:
public static void main(String[] args) {
...
..
Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

Scheme sourceScheme = new queries.ParquetTupleScheme(new Fields("a", "b", "c"));
Tap inTap = new Hfs(sourceScheme, inPath);
...
...
...
}


And I'm getting this error:
java.lang.NoClassDefFoundError: cascading/scheme/Scheme

Here is what I've tried so far:


1)
When I replace this:
Scheme sourceScheme = new ParquetTupleScheme(new Fields("a", "b", "c"));
with this:
Scheme sourceScheme = null;
the error goes away.

2)
When I create a class that extends Scheme&lt;JobConf, RecordReader, OutputCollector, Object[], Object[]&gt;, like ParquetTupleScheme does,
the error goes away.

3)
When I try to check whether this is a parquet-cascading-specific error with:
Object a = new PigCombiner();
the error goes away.

I'm using:
cascading         2.5.1
parquet-cascading 1.3.0
hadoop-core       1.2.1

What am I doing wrong?
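One way to debug a NoClassDefFoundError like this is to ask the JVM at runtime where (or whether) it can load the class. A minimal diagnostic sketch; the WhereIs class name is mine, and on the cluster you would pass cascading.scheme.Scheme as the argument (the default here is a JDK class only so the sketch runs anywhere):

```java
// Diagnostic sketch: report where a class is loaded from, or that it is
// missing entirely. On the cluster, run it inside your job jar with
// "cascading.scheme.Scheme" as the argument.
public class WhereIs {
    static String describe(String name) {
        try {
            Class<?> c = Class.forName(name);
            Object src = c.getProtectionDomain().getCodeSource();
            // Bootstrap-loaded classes (JDK internals) have no code source.
            return name + " loaded from: "
                    + (src == null ? "bootstrap classpath" : src);
        } catch (ClassNotFoundException e) {
            return name + " is NOT on the classpath";
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "java.util.Properties";
        System.out.println(describe(name));
    }
}
```

If this prints a jar you don't expect, or "NOT on the classpath" inside the task JVM, the problem is how the job jar is assembled or ordered rather than your code.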

Andre Kelpe

Jan 7, 2014, 4:39:50 AM
to cascadi...@googlegroups.com
Hi,

how are you building your project? Which exact dependencies are you using?

- André


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/30119812-263f-409e-951c-88d6966fbb00%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
André Kelpe
an...@concurrentinc.com
http://concurrentinc.com

Soren Macbeth

Mar 30, 2014, 9:52:47 PM
to cascadi...@googlegroups.com
I'm hitting this same issue trying to use parquet-cascading 1.3.2 with Cascalog 2.1.0 and cascading-hadoop2-mr1 2.5.3 on CDH 4.6.0.

I can see the cascading.scheme.Scheme class file in my uberjar. I can import cascading.scheme.Scheme in a REPL on the cluster. I can run other Cascalog queries fine as long as I don't use parquet-cascading.

The only thing I can think might possibly be happening is that parquet-cascading is using a different classloader, or something equally bizarre?

Any help appreciated!

Soren Macbeth

Mar 31, 2014, 12:08:41 AM
to cascadi...@googlegroups.com
It turns out this was caused by CDH 4.6.0 having an older version of Parquet on the classpath.
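For conflicts like this, it can help to enumerate every copy of a class the classloader can see: more than one URL means two jars are fighting over it. A small sketch (the FindCopies class name is mine; the default resource is a JDK class so the sketch runs anywhere, and on the cluster you would pass parquet/cascading/ParquetTupleScheme.class):

```java
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Sketch: list every copy of a class resource visible on the classpath.
// Multiple URLs indicate a jar conflict like the CDH one described above.
public class FindCopies {
    static List<URL> copiesOf(String resourcePath) throws Exception {
        return Collections.list(
                FindCopies.class.getClassLoader().getResources(resourcePath));
    }

    public static void main(String[] args) throws Exception {
        // e.g. "parquet/cascading/ParquetTupleScheme.class" on the cluster
        String resource = args.length > 0 ? args[0] : "java/lang/Object.class";
        for (URL url : copiesOf(resource)) {
            System.out.println(url);
        }
    }
}
```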

Deepak Subhramanian

Jun 9, 2014, 7:14:30 AM
to cascadi...@googlegroups.com

I am getting a similar error, parquet.cascading.ParquetTupleScheme method not found, while trying to use Scalding with Parquet. I am using CDH 4.5. Is there a way to override the CDH jars? My code works in hdfs mode in IntelliJ, but when I run on the cluster I get the error.


Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at com.twitter.scalding.Job$.apply(Job.scala:49)
	at com.twitter.scalding.Tool.getJob(Tool.scala:51)
	at com.twitter.scalding.Tool.run(Tool.scala:71)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at JobRunner$.main(JobRunner.scala:28)
	at RawLogsJobRunner$delayedInit$body.apply(RawLogsJobRunner.scala:21)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:71)
	at scala.App$$anonfun$main$1.apply(App.scala:71)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
	at scala.App$class.main(App.scala:71)

	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NoSuchMethodError: parquet.cascading.ParquetTupleScheme.<init>(Lcascading/tuple/Fields;Lcascading/tuple/Fields;Ljava/lang/String;)V

Andre Kelpe

Jun 10, 2014, 4:52:25 AM
to cascadi...@googlegroups.com
Try setting mapreduce.job.user.classpath.first=true to put your jars first on the classpath.
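If it helps, the flag can also be set programmatically: Cascading should copy the Properties handed to HadoopFlowConnector into the job configuration, so a sketch along these lines (the helper name is mine, and whether a given Hadoop build honors the flag may vary):

```java
import java.util.Properties;

// Sketch: set the classpath-ordering flag in the same Properties object
// that is later passed to new HadoopFlowConnector(properties).
public class ClasspathFirst {
    static Properties withUserClasspathFirst(Properties props) {
        // Ask Hadoop to put the user's jars ahead of the distro's jars
        // when building the task classpath.
        props.setProperty("mapreduce.job.user.classpath.first", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties properties = withUserClasspathFirst(new Properties());
        System.out.println(properties.getProperty("mapreduce.job.user.classpath.first"));
        // These properties would then go to new HadoopFlowConnector(properties).
    }
}
```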

- André


Deepak Subhramanian

Jun 12, 2014, 4:57:29 AM
to cascadi...@googlegroups.com
Thanks, Andre. I tried that Hadoop setting, but for some reason it is not working.

Thanks, Deepak

Andre Kelpe

Jun 12, 2014, 5:44:58 AM
to cascadi...@googlegroups.com
You will have to talk to your Hadoop vendor then. Sorry about that.

- André




Antonios Chalkiopoulos

Jun 12, 2014, 10:01:07 AM
to cascadi...@googlegroups.com
Andre, you are right: parquet-cascading 1.5.0 seems to be working OK in both local and HDFS mode.

This is a vendor issue, and we will try to get it resolved with Cloudera.

In case someone visits this thread because they can't get parquet-cascading to work in HDFS mode, our quick fix until the vendor chips in is:

echo "Fixing CDH lib/parquet"
echo "----------------------"
cd /opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/parquet
for artifact in parquet-cascading parquet-column parquet-common parquet-encoding \
                parquet-format parquet-generator parquet-hadoop parquet-hadoop-bundle \
                parquet-jackson parquet-pig parquet-pig-bundle parquet-hive \
                parquet-thrift parquet-avro
do
  wget http://central.maven.org/maven2/com/twitter/$artifact/1.5.0/$artifact-1.5.0.jar
done

mkdir BACKUP
mv -f *-cdh4.5.0.jar BACKUP/

echo "Fixing CDH lib/hadoop"
echo "---------------------"
cd /opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/hadoop/
mkdir BACKUP
mv original-parquet-* BACKUP/
mv parquet-* BACKUP/
for artifact in parquet-cascading parquet-column parquet-common parquet-encoding \
                parquet-format parquet-generator parquet-hadoop parquet-hadoop-bundle \
                parquet-jackson parquet-pig parquet-pig-bundle parquet-hive \
                parquet-thrift parquet-avro
do
  wget http://central.maven.org/maven2/com/twitter/$artifact/1.5.0/$artifact-1.5.0.jar
done

After getting the proper libraries in place, all we need to do is:

$ hadoop jar uber-jar.jar com.twitter.scalding.Tool com.foo.MyJob --hdfs 
 
and it reads and writes Parquet files in HDFS nicely :)

- Antonios

Andre Kelpe

Jun 12, 2014, 10:28:25 AM
to cascadi...@googlegroups.com
Thanks for sharing. I wonder why vendors keep adding jars by default. I would prefer distros to ship with fewer (outdated) jars by default, but the opposite seems to be the case...

- André




Deepak Subhramanian

Jul 2, 2014, 11:09:26 AM
to cascadi...@googlegroups.com
We got it working by manually replacing the old CDH Parquet jars with the 1.5 Parquet jars.

We also tried using the Hadoop parameter (export HADOOP_USER_CLASSPATH_FIRST=true) to override the vendor jars. When we try to override the CDH jars with the latest jars that way, we get a different error; it is using DeprecatedParquetInputFormat. Could it be because the input data was created using an old version of the Parquet SerDe for Hive?

2014-07-02 15:49:49,128 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readUTF(DataInputStream.java:592)
	at java.io.DataInputStream.readUTF(DataInputStream.java:547)
	at parquet.hadoop.ParquetInputSplit.readFields(ParquetInputSplit.java:177)
	at parquet.hadoop.mapred.DeprecatedParquetInputFormat$ParquetInputSplitWrapper.readFields(DeprecatedParquetInputFormat.java:196)
	at cascading.tap.hadoop.io.MultiInputSplit.readFields(MultiInputSplit.java:151)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
	at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:388)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
2014-07-02 15:49:49,132 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task