PythonConverter classes are missing for HBase [Bigtable] in spark-examples.jar of Dataproc 1.1

re...@indosakura.com

Nov 2, 2016, 5:24:48 AM
to Google Cloud Dataproc Discussions
I want to read/write data from Bigtable in PySpark. To do that, I am trying the example below:

from __future__ import print_function

import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 9:
        print("""
        hbase_bigtable_output <project> <zone> <cluster> <table> <row> <family> <qualifier> <value>
        Assumes you have created <table> with column family <family> in Bigtable cluster <cluster>
        """, file=sys.stderr)
        exit(-1)

    project = sys.argv[1]
    zone = sys.argv[2]
    cluster = sys.argv[3]
    table = sys.argv[4]

    sc = SparkContext(appName="HBaseOutputFormat")

    conf = {"hbase.client.connection.impl": "com.google.cloud.bigtable.hbase1_1.BigtableConnection",
            "google.bigtable.project.id": project,
            "google.bigtable.zone.name": zone,
            "google.bigtable.cluster.name": cluster,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    sc.parallelize([sys.argv[5:]]).map(lambda x: (x[0], x)).saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=keyConv,
        valueConverter=valueConv)

    sc.stop()

Command:
gcloud dataproc jobs submit pyspark HBaseOutputFormat.py --cluster <clustername> --properties=^#^spark.jars.packages=com.google.cloud.bigtable:bigtable-hbase-1.1:0.2.2,org.apache.hbase:hbase-server:1.1.1,org.apache.hbase:hbase-common:1.1.1#spark.jars=/usr/lib/spark/examples/jars/spark-examples.jar

But it's throwing an error:

16/11/02 07:57:46 ERROR org.apache.spark.api.python.Converter: Failed to load converter: org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter
Traceback (most recent call last):
  File "/tmp/97ee9dc4-6ad7-490b-888a-8ecc7421a438/hbase_bigtable_input.py", line 48, in <module>
    conf=conf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 646, in newAPIHadoopRDD
  File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
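
For context, the traceback above comes from my read-side script, hbase_bigtable_input.py, which is modeled on Spark's bundled hbase_inputformat.py example. A rough sketch (the HBaseResultToStringConverter value converter and the TableInputFormat usage come from that example, not from the traceback):

from __future__ import print_function

import sys

from pyspark import SparkContext

if __name__ == "__main__":
    project = sys.argv[1]
    zone = sys.argv[2]
    cluster = sys.argv[3]
    table = sys.argv[4]

    sc = SparkContext(appName="HBaseInputFormat")

    # Same Bigtable connection settings as the write side, plus the input table.
    conf = {"hbase.client.connection.impl": "com.google.cloud.bigtable.hbase1_1.BigtableConnection",
            "google.bigtable.project.id": project,
            "google.bigtable.zone.name": zone,
            "google.bigtable.cluster.name": cluster,
            "hbase.mapreduce.inputtable": table}
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

    # This is the call that fails: Spark cannot load the converter classes.
    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)
    for (k, v) in rdd.collect():
        print((k, v))

    sc.stop()

It is submitted the same way as the write-side script above, and it fails at the newAPIHadoopRDD call because the converter classes cannot be found.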



Then I unzipped the jar to look at the classes, and here is the list I got:

> unzip /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.1.jar

Output for the pythonconverters classes:

org/apache/spark/examples/pythonconverters/
AvroConversionUtil.class
AvroConversionUtil$.class
AvroConversionUtil$$anonfun$unpackArray$1.class
AvroConversionUtil$$anonfun$unpackArray$2.class
AvroConversionUtil$$anonfun$unpackMap$1.class
AvroConversionUtil$$anonfun$unpackRecord$1.class
AvroWrapperToJavaConverter.class
IndexedRecordToJavaConverter.class


That means the examples jar shipped with Google Cloud Dataproc 1.1 does not contain the Python converters for HBase.

Does anyone know a workaround to get a jar with these converters on Dataproc and access Bigtable from a PySpark job?


Any help is highly appreciated!

Thank You,
Revan

Patrick Clay

Nov 2, 2016, 12:05:17 PM
to Google Cloud Dataproc Discussions
Hi Revan,

The HBase converter examples were deleted from the Spark 2 source in favor of the official HBase Spark support, which has yet to be distributed (https://github.com/apache/hbase/tree/master/hbase-spark). They should probably still compile against Spark 2, if you wish to build them yourself. I covered some other alternatives in your StackOverflow question earlier.

Spark HBase connectivity is a work in progress in the community, and Dataproc and Bigtable engineers are looking into better solutions going forward.

Sorry for the inconvenience,
-Patrick