Read from Bigtable using Dataproc Serverless Python


Adam Scott

May 18, 2023, 5:26:08 PM
to Google Cloud Bigtable Discuss
I have a Cassandra to Bigtable Dataproc Serverless script working, and the write side works great:
        output_data.write.format("org.apache.hadoop.hbase.spark").options(
            catalog=bt_catalog
        ).option("hbase.spark.use.hbasecontext", "false").mode("overwrite").save()
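
For context, bt_catalog is the standard hbase-spark JSON catalog string, built roughly like this (the table and column names here are placeholders, not my real schema):

    import json

    # Sketch of the hbase-spark catalog passed above as bt_catalog;
    # "my_table" and the columns below are placeholders.
    bt_catalog = json.dumps({
        "table": {"namespace": "default", "name": "my_table"},
        "rowkey": "key",
        "columns": {
            "key": {"cf": "rowkey", "col": "key", "type": "string"},
            "name": {"cf": "cf", "col": "name", "type": "string"},
        },
    })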


Now, I would like to read from Bigtable (again, using Python), and there are no examples to be found.

I've tried this among many other iterations.

    df = (
        spark.read.options(catalog=catalog)
        .format("org.apache.hadoop.hbase.spark")
        .option("hbase.spark.use.hbasecontext", "false")
        .load()
    )
    print(f"Count: {df.count()}")

For some reason it tries to connect to localhost, even though we are submitting a container with the hbase-site.xml file set correctly.


Here's some of the output we get:
INFO ZooKeeper: Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$2533/0x0000000801216840@337930b7
...
23/05/18 20:56:06 WARN ClientCnxn: Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:344)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290)

What am I missing?

TIA


Mattie Fu

May 23, 2023, 7:47:35 AM
to Google Cloud Bigtable Discuss
Hi Adam,

The error message shows that the application is still trying to connect to ZooKeeper instead of Bigtable.

1. Are you using the same container with the hbase-site.xml file that you used for writes? Make sure that in your XML file "hbase.client.connection.impl" is set to "com.google.cloud.bigtable.hbase2_x.BigtableConnection":

  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase2_x.BigtableConnection</value>
  </property>
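
For the connector to know which Bigtable to talk to, the same file typically also carries the project and instance IDs (these are the standard bigtable-hbase client properties; the values below are placeholders):

  <property>
    <name>google.bigtable.project.id</name>
    <value>YOUR_PROJECT_ID</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>YOUR_INSTANCE_ID</value>
  </property>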

2. Are you specifying the container with "--container-image=.." option when you submit your job? https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#submit_a_spark_batch_workload_using_a_custom_container_image
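
Something along these lines (the script name, region, and image are placeholders for your setup):

  gcloud dataproc batches submit pyspark my_job.py \
      --region=us-central1 \
      --container-image=gcr.io/my-project/hbase-spark-image:latest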

Adam Scott

May 24, 2023, 1:55:20 PM
to Google Cloud Bigtable Discuss
Thank you Mattie!

It turns out it was a simple oversight: the submit command was missing

--properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/'
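
So the full submit looks roughly like this (script, region, and image are placeholders; if the executors also need the HBase config on their classpath, Spark's standard spark.executorEnv.SPARK_EXTRA_CLASSPATH may be needed as well):

  gcloud dataproc batches submit pyspark bigtable_read.py \
      --region=us-central1 \
      --container-image=gcr.io/my-project/hbase-spark-image:latest \
      --properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/'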

Cheers,
Adam