Read from Bigtable using Dataproc Serverless Python


Adam Scott

May 18, 2023, 5:26:08 PM
to Google Cloud Bigtable Discuss
I have a Cassandra to Bigtable Dataproc Serverless script working, and the write side works great:
        output_data.write.format("org.apache.hadoop.hbase.spark").options(
            catalog=bt_catalog
        ).option("hbase.spark.use.hbasecontext", "false").mode("overwrite").save()
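
For context, bt_catalog is the standard hbase-spark JSON catalog string, built roughly like this (the table and column names here are placeholders, not my real schema):

    import json

    # Sketch of the hbase-spark catalog passed above as bt_catalog;
    # "my_table" and the columns below are placeholders.
    bt_catalog = json.dumps({
        "table": {"namespace": "default", "name": "my_table"},
        "rowkey": "key",
        "columns": {
            "key": {"cf": "rowkey", "col": "key", "type": "string"},
            "name": {"cf": "cf", "col": "name", "type": "string"},
        },
    })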


Now, I would like to read from Bigtable (again, using Python), and there are no examples to be found.

I've tried this among many other iterations.

    df = (
        spark.read.options(catalog=catalog)
        .format("org.apache.hadoop.hbase.spark")
        .option("hbase.spark.use.hbasecontext", "false")
        .load()
    )
    print(f"Count: {df.count()}")

For some reason it tries to connect to localhost, even though we are submitting a container with the hbase-site.xml file set correctly.


Here's some of the output we get:
INFO ZooKeeper: Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$2533/0x0000000801216840@337930b7
...
23/05/18 20:56:06 WARN ClientCnxn: Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)

What am I missing?

TIA


Mattie Fu

May 23, 2023, 7:47:35 AM
to Google Cloud Bigtable Discuss
Hi Adam,

The error message shows that the application is still trying to connect to ZooKeeper instead of Bigtable.

1. Are you using the same container with the hbase-site.xml file that you used for writes? Make sure that in your XML file "hbase.client.connection.impl" is set to "com.google.cloud.bigtable.hbase2_x.BigtableConnection":

  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase2_x.BigtableConnection</value>
  </property>
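
For the connector to know which Bigtable to talk to, the same file typically also carries the project and instance IDs (these are the standard bigtable-hbase client properties; the values below are placeholders):

  <property>
    <name>google.bigtable.project.id</name>
    <value>YOUR_PROJECT_ID</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>YOUR_INSTANCE_ID</value>
  </property>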

2. Are you specifying the container with "--container-image=.." option when you submit your job? https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#submit_a_spark_batch_workload_using_a_custom_container_image
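
Something along these lines (the script name, region, and image are placeholders for your setup):

  gcloud dataproc batches submit pyspark my_job.py \
      --region=us-central1 \
      --container-image=gcr.io/my-project/hbase-spark-image:latest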

Adam Scott

May 24, 2023, 1:55:20 PM
to Google Cloud Bigtable Discuss
Thank you Mattie!

It turns out it was a simple oversight: the submit command was missing

--properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/'
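
So the full submit looks roughly like this (script, region, and image are placeholders; if the executors also need the HBase config on their classpath, Spark's standard spark.executorEnv.SPARK_EXTRA_CLASSPATH may be needed as well):

  gcloud dataproc batches submit pyspark bigtable_read.py \
      --region=us-central1 \
      --container-image=gcr.io/my-project/hbase-spark-image:latest \
      --properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/'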

Cheers,
Adam