Reading BigQuery data from Spark in Google Dataproc


mich.ta...@gmail.com

Oct 5, 2020, 5:56:55 AM
to Google Cloud Dataproc Discussions
I have tested a few JDBC BigQuery providers, such as Progress DataDirect and Simba, but none of them seem to work properly through Spark.

The only way I can read from and write to BigQuery is through the Spark BigQuery connector, using the following approach:

spark-shell --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar


Using the following JDBC-style options to read:

val BQDF = spark.read.
  format("bigquery").
  option("credentialsFile", jsonKeyFile).
  option("project", projectId).
  option("parentProject", projectId).
  option("dataset", targetDataset).
  option("table", targetTable).
  option("partitionColumn", partitionColumn).
  option("lowerBound", lowerBound).
  option("upperBound", upperBound).
  option("numPartitions", numPartitions).
  load()


and the following to write:

    rsBatch.
      write.
      format("bigquery").
      mode(org.apache.spark.sql.SaveMode.Append).
      option("table", fullyQualifiedOutputTableId).
      save()
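
Note that, depending on the connector version, writes are staged through Google Cloud Storage, so a staging bucket option may also be needed; a sketch, with a hypothetical bucket name:

    rsBatch.
      write.
      format("bigquery").
      mode(org.apache.spark.sql.SaveMode.Append).
      option("temporaryGcsBucket", "my-staging-bucket"). // hypothetical bucket for the indirect (GCS-staged) write
      option("table", fullyQualifiedOutputTableId).
      save()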

I would appreciate any comments from anyone who has managed to make this work through a third-party JDBC driver.

Regards,

Mich

davi...@google.com

Oct 5, 2020, 10:52:29 PM
to Google Cloud Dataproc Discussions
Hi Mich,

Using the Spark BigQuery connector is indeed the recommended way to read data from BigQuery into Spark and to write data back. It does not use JDBC, but rather relies on the BigQuery Storage Read API, which provides a more efficient and distributed data transfer. You can read more about it, and view all the configuration options, at https://github.com/GoogleCloudDataproc/spark-bigquery-connector
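
For reference, the simplest read requires only the fully qualified table name; a minimal sketch, using a public sample table purely as an illustration:

val df = spark.read.
  format("bigquery").
  load("bigquery-public-data.samples.shakespeare")
df.printSchema()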

I hope this answers your question.

Regards,
David

mich.ta...@gmail.com

Oct 6, 2020, 4:07:10 AM
to Google Cloud Dataproc Discussions
Hi David,

Many thanks for the info.

The Spark BigQuery API is very good, but I have an issue that I did not have back in February 2019.

Back then I used the following package:

libraryDependencies += "com.github.samelamin" %% "spark-bigquery" % "0.2.6"

and imported it in my Scala code:

import com.samelamin.spark.bigquery._

This used to work and allowed me to run DDL statements (CREATE TABLE/VIEW, etc.) from Spark Scala code:

val sqltext = """CREATE TABLE IF NOT EXISTS `axial-glow-224522.test.weights_ML_RESULTS` AS
SELECT * FROM ML.EVALUATE(MODEL `axial-glow-224522.test.weights_MODEL`,
(SELECT * FROM `axial-glow-224522.test.weights_RS`))"""

and used the following to execute it:

spark.sqlContext.runDMLQuery(sqltext)

runDMLQuery comes from the same package. However, I cannot use it anymore; it now fails with:

<console>:33: error: value runDMLQuery is not a member of org.apache.spark.sql.SQLContext

So I am keen to find alternatives for running DDL/DML statements against BigQuery.

For now, I can create that table through the BigQuery web UI as a workaround.

Remember, I am using this Google-supplied jar in spark-shell:

gs://spark-lib/bigquery/spark-bigquery-latest.jar

What can I call to run DDL statements like this through Spark code?
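
In case it helps, one alternative I am looking at is calling the BigQuery client library directly from Scala; a rough sketch, assuming the google-cloud-bigquery Java client is on the classpath and application-default credentials are configured:

import com.google.cloud.bigquery.{BigQueryOptions, QueryJobConfiguration}

// Runs the DDL as a standard BigQuery query job using default credentials
val bigquery = BigQueryOptions.getDefaultInstance.getService
val ddl = """CREATE TABLE IF NOT EXISTS `axial-glow-224522.test.weights_ML_RESULTS` AS
SELECT * FROM ML.EVALUATE(MODEL `axial-glow-224522.test.weights_MODEL`,
(SELECT * FROM `axial-glow-224522.test.weights_RS`))"""
bigquery.query(QueryJobConfiguration.newBuilder(ddl).build())

though I have not yet confirmed this works from within spark-shell with the BigQuery connector jar on the classpath.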

Many thanks,

Mich