HBase to Bigtable: Spark job failing


Neeraj Verma

May 26, 2018, 7:21:31 PM
to Google Cloud Dataproc Discussions
Hi ,

I have an HBase Spark job running on an AWS EMR cluster. Recently we moved to GCP, and I transferred all the HBase data to Bigtable. Now I am running the same Spark Java/Scala job on Dataproc. The Spark job is failing because it is looking for the spark.hbase.zookeeper.quorum setting. Please let me know how I can make my Spark job run successfully against Bigtable without code changes.

Regards,
Neeraj Verma

Karthik Palaniappan

May 29, 2018, 12:31:40 AM
to Google Cloud Dataproc Discussions
Hi Neeraj,

You need the HBase client libraries for Bigtable: https://cloud.google.com/bigtable/docs/bigtable-and-hbase. Essentially, to your Spark job, Bigtable will appear as HBase.

Assuming you're using the Spark HBase connector, this is a good example to start from: https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc.

--Karthik

Neeraj Verma

May 29, 2018, 12:44:22 PM
to Google Cloud Dataproc Discussions
Hi Karthik,

Thanks for the information. Is it possible to pass HBase properties like the ones below?

Properties:
spark.time.interval.date = 2018/03/30/14
spark.boost.input.path = gs://bigdata-sandbox/ml-sort-boosts-prepared-json/year=2018/month=03/day=30/hour=14
spark.input.path = gs://bigdata-sandbox/ml-sort-boosts-prepared-json/year=2018/month=03/day=30/hour=14
spark.output.path = gs://bigdata-sandbox/latest/temp
spark.query.input.path = gs://bigdata-sandbox/ml-sort-search-prepared-json/original/year=2018/month=03/day=30/hour=14
spark.search.filter = requestUriStem = '/api/queryresults/browse/womens-shoes'
spark.app.name.suffix = Test_spark-hbase_2018/03/30/14
spark.time.interval.format = yyyy/MM/dd/HH
spark.output.delete.previous.trigger = true
spark.hbase.zookeeper.quorum = cluster-bigdata-test2-m-2,cluster-bigdata-test2-m-0,cluster-bigdata-test2-m-1
spark.hbase.client.connection.impl = com.google.cloud.bigtable.hbase1_x.BigtableConnection

Karthik Palaniappan

May 29, 2018, 3:44:31 PM
to Google Cloud Dataproc Discussions
The Bigtable-related properties (in particular spark.hbase.client.connection.impl) look correct to me, assuming that the Bigtable client is on the classpath.

I assume you shouldn't need to pass in the spark.hbase.zookeeper.quorum property, since you're not actually using HBase's ZooKeeper. Have you tried running the job without it?

Karthik Palaniappan

May 29, 2018, 8:14:46 PM
to Google Cloud Dataproc Discussions
Actually, which library are you using for Spark <-> HBase? How does it accept properties to pass through? I realize that spark.google.* probably won't get passed through, since a library would likely look for spark.hbase.*. Instead of passing them in via spark, can you instead set those properties in hbase-site.xml? Take a look at the in-progress initialization action for Bigtable for example: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/pull/267.
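For reference, the Bigtable settings discussed in this thread could be placed in hbase-site.xml like this (the project and instance IDs are the ones from the messages below; the property keys are the same ones the poster later sets in code):

```xml
<configuration>
  <!-- Route HBase client connections to Bigtable instead of ZooKeeper/HBase. -->
  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase1_x.BigtableConnection</value>
  </property>
  <property>
    <name>google.bigtable.project.id</name>
    <value>fbfblok3vejkytb5bdbbzxkr3krxw3</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>bigdata-hbase</value>
  </property>
</configuration>
```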

Neeraj Verma

May 29, 2018, 9:57:04 PM
to Google Cloud Dataproc Discussions
Hi Karthik,
Issue resolved. I modified the existing Spark job by adding the Bigtable jar to pom.xml as a dependency and making the code change below. Thanks for your help.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

case class HBaseConnectionWithClosure(zookeeperQuorum: String, zookeeperPort: String) {
  def getConnection(): Connection = {
    val c = HBaseConfiguration.create
    // ZooKeeper is not used by Bigtable, so these settings are no longer needed:
    // c.set("hbase.zookeeper.quorum", zookeeperQuorum)
    // c.set("hbase.zookeeper.property.clientPort", zookeeperPort)
    c.set("hbase.meta.replicas.use", "true")
    c.set("google.bigtable.project.id", "fbfblok3vejkytb5bdbbzxkr3krxw3")
    c.set("google.bigtable.instance.id", "bigdata-hbase")
    c.set("hbase.client.connection.impl", "com.google.cloud.bigtable.hbase1_x.BigtableConnection")
    ConnectionFactory.createConnection(c)
  }
}
and the pom.xml dependency:

<dependency>
  <groupId>com.google.cloud.bigtable</groupId>
  <artifactId>bigtable-hbase-1.x-shaded</artifactId>
  <version>1.3.0</version>
</dependency>


