read avro file using pyspark dataproc


R1

Aug 2, 2022, 9:15:48 AM
to Google Cloud Dataproc Discussions

I am using a Dataproc cluster (2.0 ubuntu image) with Hadoop 3.2 and Spark 3.1. I have Python code to read Avro files from GCS, so I used 'spark-avro_2.12-3.1.0.jar', but it gives an error like method not found (java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.DataSourceUtils$.creteDateRebaseFuncInRead). How do I decide which library version is compatible?

I am reading using :
df = spark.read.format("avro").load([ 'file1.avro' , 'file2.avro' ])
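For context, a minimal sketch of matching the spark-avro artifact to the cluster's own Spark build instead of pinning a downloaded jar by hand. The version strings below are assumptions for a Dataproc 2.0 image; on a live cluster you would read them from `spark.version`:

```python
# Sketch: build the spark-avro package coordinate that matches the
# cluster's Spark build, rather than hard-coding a jar file.
# Assumed versions for a Dataproc 2.0 image; verify on your cluster.
spark_version = "3.1.3"   # e.g. spark.version on the cluster
scala_binary = "2.12"     # Scala binary version of the Spark build

avro_package = f"org.apache.spark:spark-avro_{scala_binary}:{spark_version}"
print(avro_package)  # org.apache.spark:spark-avro_2.12:3.1.3

# With PySpark available, the package can be resolved at session start:
# spark = (SparkSession.builder
#          .config("spark.jars.packages", avro_package)
#          .getOrCreate())
# df = spark.read.format("avro").load(["file1.avro", "file2.avro"])
```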

Nicolas Porter

Oct 29, 2022, 5:58:36 PM
to Google Cloud Dataproc Discussions
To find the versions, use this page:

Find your image version, click on the link, and you'll see the Scala/Spark versions to use for the JAR. You can cross-reference against:

Side note: you can pull the jar at runtime with this property:
spark:spark.jars.packages=org.apache.spark:spark-avro_{{scalaVersion}}:{{sparkVersion}}

For the latest 2.0-ubuntu18 image, you would use:
spark:spark.jars.packages=org.apache.spark:spark-avro_2.12:3.1.3

Notice that the Scala version omits the last (patch) number, while the Spark version keeps all three.
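That split can be sketched in a couple of lines. The full Scala version here (2.12.14) is an assumption for illustration; use the one your image actually reports:

```python
# Derive the coordinate pieces from the full version strings.
scala_version = "2.12.14"  # full Scala version (assumed for this image)
spark_version = "3.1.3"    # full Spark version reported by the image

# Scala binary version: drop the patch number (2.12.14 -> 2.12).
scala_binary = ".".join(scala_version.split(".")[:2])

package = f"org.apache.spark:spark-avro_{scala_binary}:{spark_version}"
print(package)  # org.apache.spark:spark-avro_2.12:3.1.3
```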

Cheers,
- nick