How to update Avro version


Alfy

Apr 13, 2016, 12:55:41 AM
to Google Cloud Dataproc Discussions
I am trying to use a Spark application that, based on this post, requires an Avro version higher than 1.7.5, but I found that the Hadoop setup on the Dataproc cluster ships Avro 1.7.4. Is there a simple way to update the Avro version, please?

Below is some info for debugging purposes:

$ find /usr/lib/hadoop/ -iname '*avro*'

/usr/lib/hadoop/lib/avro-1.7.4.jar

/usr/lib/hadoop/client/avro-1.7.4.jar

/usr/lib/hadoop/client/avro.jar


Versions:

spark-shell: 1.6.1
Hadoop 2.7.2
openjdk version "1.8.0_72-internal"

Patrick Clay

Apr 13, 2016, 9:41:54 PM
to Google Cloud Dataproc Discussions
Hi Alfy,

I assume you are using spark-avro, but the following applies regardless. This is a known issue with Hadoop 2.7 (or earlier) and spark-avro. This GitHub issue has some proposed solutions. We are currently working on a Dataproc-specific fix.

We actually already have a copy of Avro 1.7.7 on Hadoop's classpath, so if you remove /usr/lib/hadoop/lib/avro-1.7.4.jar, /usr/lib/hadoop-mapreduce/avro-1.7.4.jar, and /usr/lib/hadoop-mapreduce/lib/avro-1.7.4.jar using an initialization action, you should be good to go. It is possible this could cause an issue with Hadoop, but the developers on this JIRA don't seem to believe it will.
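A minimal sketch of such an initialization action, assuming the three paths above are the ones that matter on your image:

#!/bin/bash
# Hypothetical cleanup init action: remove the stale Avro 1.7.4 jars so Hadoop
# falls back to the Avro 1.7.7 copy already on its classpath.
rm -f /usr/lib/hadoop/lib/avro-1.7.4.jar \
      /usr/lib/hadoop-mapreduce/avro-1.7.4.jar \
      /usr/lib/hadoop-mapreduce/lib/avro-1.7.4.jar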

If you would rather not use an initialization action, you could set the spark-submit flag '--packages' or the Spark property 'spark.jars.packages' (e.g. with the gcloud --properties flag) to 'org.apache.avro:avro:1.7.7'. This worked for me. You could also just try setting 'spark.executor.userClassPathFirst=true' (to load your jars before Spark's), but this often causes strange version mismatches unrelated to your code.
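As a rough illustration of both options (the application jar, main class, and cluster name are placeholders):

# Via spark-submit directly on the cluster (placeholder application jar):
spark-submit --packages org.apache.avro:avro:1.7.7 my-app.jar

# Or via gcloud when submitting the job (placeholder cluster, class, and jar):
gcloud dataproc jobs submit spark --cluster my-cluster \
    --properties spark.jars.packages=org.apache.avro:avro:1.7.7 \
    --class com.example.MyApp --jars my-app.jar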

Hope that helps,
-Patrick

Alfy

Apr 14, 2016, 12:56:31 AM
to Google Cloud Dataproc Discussions
Hi Patrick,

Thank you for your reply. An initialization action doesn't apply to a cluster that has already been created, does it?

For example, I already have a cluster with 1 master node and 2 worker nodes and have done some work on it, so I would prefer not to delete it and create a new cluster. In that case, what if I delete the three files

/usr/lib/hadoop/lib/avro-1.7.4.jar

/usr/lib/hadoop/client/avro-1.7.4.jar

/usr/lib/hadoop/client/avro.jar


manually by SSHing to all 3 nodes (1 master + 2 workers)? Will that have the same effect as an initialization action?

Patrick Clay

Apr 14, 2016, 2:16:05 AM
to Google Cloud Dataproc Discussions
You are correct that you cannot run an initialization action on an existing cluster. SSHing in manually should do the trick.

A slight correction: /usr/lib/hadoop/client/avro*.jar are symlinks to /usr/lib/hadoop/lib/avro-1.7.4.jar. You can delete them, but you shouldn't have to (if you delete the real jar). There are also two other real copies of avro-1.7.4.jar, in /usr/lib/hadoop-mapreduce/ and /usr/lib/hadoop-mapreduce/lib/, which you should also delete to be safe.
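A hedged sketch of doing that cleanup on an existing cluster (the instance names follow the default Dataproc naming and the zone is a placeholder):

# Hypothetical: remove the real Avro 1.7.4 jars on every node of an existing cluster.
for node in my-cluster-m my-cluster-w-0 my-cluster-w-1; do
  gcloud compute ssh "$node" --zone us-central1-a --command \
    "sudo rm -f /usr/lib/hadoop/lib/avro-1.7.4.jar \
                /usr/lib/hadoop-mapreduce/avro-1.7.4.jar \
                /usr/lib/hadoop-mapreduce/lib/avro-1.7.4.jar"
done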

-Patrick

Alfy

Apr 14, 2016, 2:55:16 PM
to Google Cloud Dataproc Discussions
It seems to work now. Thank you!

Joshua Ewer

Jun 21, 2016, 1:18:58 PM
to Google Cloud Dataproc Discussions
Hey, I am suddenly having the same problem with Dataproc and spark-avro.

It was working with no bootstrap options and spark-avro_2.10:2.0.1, but I think when the Dataproc image moved to Spark 1.6.1 I suddenly started seeing Avro version mismatches. Has anyone experienced something similar?

Attempting to create a bootstrap action that removes the 1.7.4 jars mentioned above had no positive effect.

Patrick Clay

Jun 21, 2016, 7:41:15 PM
to Google Cloud Dataproc Discussions
Sorry, I shouldn't have recommended relying on the copy of avro-1.7.7 on the classpath. It's incomplete due to shading minimization. Here is an initialization action that both cleans and installs the new jars and should work with the latest 1.0 image.

#!/bin/bash
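# Remove every existing Avro jar (including the incomplete 1.7.7 copy) from all
# Hadoop* and Hive* directories and their lib/ subdirectories.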
rm -rf /usr/lib/h{adoop,ive}*/{,lib/}*avro*.jar
# Consider staging these jars in GCS to avoid being throttled & be nice to Maven Central.
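The snippet above is the cleanup half; a minimal sketch of what the install half might look like, assuming the plain avro-1.7.7.jar from Maven Central is sufficient and that these target directories are the right ones (neither is confirmed in the thread):

# Hypothetical install step: fetch Avro 1.7.7 (ideally from a GCS copy you staged)
# and drop it where the old jars were removed. Target directories are a guess.
wget -q https://repo1.maven.org/maven2/org/apache/avro/avro/1.7.7/avro-1.7.7.jar -O /tmp/avro-1.7.7.jar
cp /tmp/avro-1.7.7.jar /usr/lib/hadoop/lib/
cp /tmp/avro-1.7.7.jar /usr/lib/hadoop-mapreduce/lib/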

If that doesn't work, you can go back to the old Spark 1.6.0 image by specifying '--image-version 1.0.1' in gcloud, but you'll miss some important fixes, so it is not recommended. I have no idea how that older image works with spark-avro, as there are still definitely classpath collisions. We have been working on standardizing all components on Avro 1.7.7, but we will probably not release that in the 1.0 branch, because we don't want to break users who depend on Avro 1.7.4 (like we accidentally broke you with spark-avro).
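For reference, a hedged example of both choices at cluster-creation time (the cluster name and GCS path are placeholders):

# Create a cluster on the latest 1.0 image with the init action staged in GCS:
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/update-avro.sh

# Or, not recommended, pin the older Spark 1.6.0 image instead:
gcloud dataproc clusters create my-cluster --image-version 1.0.1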

Let me know if that initialization action doesn't work.
-Patrick

Joshua Ewer

Jun 24, 2016, 1:41:11 PM
to Google Cloud Dataproc Discussions
That worked perfectly!  The help, as always, is incredibly appreciated.