Reading BigQuery data from Spark in Google Dataproc


mich.ta...@gmail.com

Oct 5, 2020, 5:56:55 AM
to Google Cloud Dataproc Discussions
I have tested a few JDBC BigQuery providers, such as Progress DataDirect and Simba, but none of them seem to work properly through Spark.

The only way I can read from and write to BigQuery is through the Spark BigQuery connector, using the following approach:

spark-shell --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar


Using the following JDBC-style options to read:

val BQDF = spark.read.
  format("bigquery").
  option("credentialsFile", jsonKeyFile).
  option("project", projectId).
  option("parentProject", projectId).
  option("dataset", targetDataset).
  option("table", targetTable).
  option("partitionColumn", partitionColumn).
  option("lowerBound", lowerBound).
  option("upperBound", upperBound).
  option("numPartitions", numPartitions).
  load()


and the following to write:

    rsBatch.
      write.
      format("bigquery").
      mode(org.apache.spark.sql.SaveMode.Append).
      option("table", fullyQualifiedOutputTableId).
      save()
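
Note that, depending on the connector version, writes are staged through Google Cloud Storage, so a staging bucket option may also be needed; a sketch, with a hypothetical bucket name:

    rsBatch.
      write.
      format("bigquery").
      mode(org.apache.spark.sql.SaveMode.Append).
      option("temporaryGcsBucket", "my-staging-bucket"). // hypothetical bucket for the indirect (GCS-staged) write
      option("table", fullyQualifiedOutputTableId).
      save()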

I would appreciate any comments from anyone who has managed to make this work through a third-party JDBC driver.

Regards,

Mich

davi...@google.com

Oct 5, 2020, 10:52:29 PM
to Google Cloud Dataproc Discussions
Hi Mich,

Using the Spark BigQuery connector is indeed the recommended way to read data from BigQuery into Spark and to write data back. It does not use JDBC, but rather relies on the BigQuery Storage Read API, which provides a more efficient and distributed data transfer. You can read more about it, and view all the configuration options, at https://github.com/GoogleCloudDataproc/spark-bigquery-connector
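
For reference, the simplest read requires only the fully qualified table name; a minimal sketch, using a public sample table purely as an illustration:

val df = spark.read.
  format("bigquery").
  load("bigquery-public-data.samples.shakespeare")
df.printSchema()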

I hope this answers your question.

Regards,
David

mich.ta...@gmail.com

Oct 6, 2020, 4:07:10 AM
to Google Cloud Dataproc Discussions
Hi David,

Many thanks for the info.

The Spark BigQuery API is very good, but I have an issue that I did not have back in February 2019.

Back then I used the following package:

libraryDependencies += "com.github.samelamin" %% "spark-bigquery" % "0.2.6"

and imported it in my Scala code:

import com.samelamin.spark.bigquery._

This used to work and allowed me to run DDL statements (CREATE TABLE/VIEW, etc.) from Spark Scala code:

val sqltext = """CREATE TABLE IF NOT EXISTS `axial-glow-224522.test.weights_ML_RESULTS` AS
SELECT * FROM ML.EVALUATE(MODEL `axial-glow-224522.test.weights_MODEL`,
(SELECT * FROM `axial-glow-224522.test.weights_RS`))"""

and used the following to execute it:

spark.sqlContext.runDMLQuery(sqltext)

runDMLQuery comes from the same package. However, I cannot use it anymore; it now fails with:

<console>:33: error: value runDMLQuery is not a member of org.apache.spark.sql.SQLContext

So I am keen to find alternatives for running DDL/DML statements against BigQuery.

For now, I can create that table through the BigQuery web UI as a workaround.

Remember, I am using this Google-supplied jar in spark-shell:

gs://spark-lib/bigquery/spark-bigquery-latest.jar

What can I call to run DDL statements like this through Spark code?
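
In case it helps, one alternative I am looking at is calling the BigQuery client library directly from Scala; a rough sketch, assuming the google-cloud-bigquery Java client is on the classpath and application-default credentials are configured:

import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration}

// Runs the DDL as a standard BigQuery query job using default credentials
val bigquery = BigQueryOptions.getDefaultInstance.getService
val ddl = """CREATE TABLE IF NOT EXISTS `axial-glow-224522.test.weights_ML_RESULTS` AS
SELECT * FROM ML.EVALUATE(MODEL `axial-glow-224522.test.weights_MODEL`,
(SELECT * FROM `axial-glow-224522.test.weights_RS`))"""
bigquery.query(QueryJobConfiguration.newBuilder(ddl).build())

though I have not yet confirmed this works from within spark-shell with the BigQuery connector jar on the classpath.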

Many thanks,

Mich