Python google.cloud BigTable Scans - best practice?


jo...@apester.com

Nov 27, 2016, 7:19:13 PM
to Google Cloud Bigtable Discuss
I see there are two table-scanning APIs in Google Cloud's Bigtable sample code:

1) using google.cloud module bigtable object
from google.cloud import bigtable
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)
partial_rows = table.read_rows(...)
partial_rows.consume_all()
for row_key, row in partial_rows.rows.items():
    print(row_key, row.cells)

2) using google.cloud module bigtable and happybase objects
from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan():
    print(key, row)

What are the differences between these two approaches in terms of performance and suitability for Spark jobs?

Gary Elliott

Nov 27, 2016, 8:35:33 PM
to Google Cloud Bigtable Discuss
Hi,

In general I would recommend approach #1 over happybase, unless you need HBase compatibility for some reason. Using the google.cloud bigtable objects directly avoids the overhead of converting Bigtable rows to happybase objects, gives you more direct access to the features of the Cloud Bigtable service, and is more likely to gain access to new features in the service as they arrive.
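To make that conversion overhead concrete, here is an illustrative, pure-Python sketch of the kind of per-row flattening a happybase-style layer has to perform on every scanned row. The function name and the in-memory cell layout below are hypothetical stand-ins for illustration, not the library's actual internals:

```python
# Hypothetical sketch: a happybase-style scan() must flatten Bigtable's
# nested cells (family -> qualifier -> versioned values) into the flat
# {b"family:qualifier": value} dict that HBase-style code expects,
# once per row returned by the scan.

def to_happybase_row(cells):
    """Flatten {family: {qualifier: [(value, timestamp), ...]}} into
    {b"family:qualifier": latest_value}."""
    flat = {}
    for family, columns in cells.items():
        for qualifier, versions in columns.items():
            # Keep only the most recent version, as a plain scan would.
            latest_value, _ts = max(versions, key=lambda v: v[1])
            flat[family.encode() + b":" + qualifier] = latest_value
    return flat

# Example row with two columns, one of which has two versions.
cells = {
    "cf1": {
        b"greeting": [(b"hello", 1), (b"hi", 2)],
        b"count": [(b"42", 1)],
    }
}
print(to_happybase_row(cells))
```

Reading via the google.cloud bigtable objects directly skips this per-row reshaping and hands you the cells as the service returns them.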

I don't have any specific recommendations for Spark jobs but here's a pointer to a relevant stackoverflow question: http://stackoverflow.com/questions/40371827/how-to-read-and-write-data-in-google-cloud-bigtable-in-pyspark-application

gary