Python google.cloud BigTable Scans - best practice?


jo...@apester.com

Nov 27, 2016, 7:19:13 PM
to Google Cloud Bigtable Discuss
I see there are two table-scanning APIs in Google Cloud's Bigtable sample code:

1) using google.cloud module bigtable object
from google.cloud import bigtable
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_id)
partial_rows = table.read_rows(...)
partial_rows.consume_all()
for row_key, row in partial_rows.rows.items():
    print(row_key, row.cells)

2) using google.cloud module bigtable and happybase objects
from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan():
    print(key, row)

What are the differences between these two approaches in terms of performance and suitability for Spark jobs?

Gary Elliott

Nov 27, 2016, 8:35:33 PM
to Google Cloud Bigtable Discuss
Hi,

In general I would recommend approach #1 over happybase, unless you need HBase compatibility for some reason. Using the google.cloud bigtable objects directly avoids the overhead of converting Bigtable rows to happybase objects, gives you more direct access to the features of the Cloud Bigtable service, and is more likely to gain access to new features in the service as they arrive.
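To make that conversion overhead concrete, here is an illustrative, pure-Python sketch of the kind of per-row flattening a happybase-style layer has to perform on every scanned row. The function name and the in-memory cell layout below are hypothetical stand-ins for illustration, not the library's actual internals:

```python
# Hypothetical sketch: a happybase-style scan() must flatten Bigtable's
# nested cells (family -> qualifier -> versioned values) into the flat
# {b"family:qualifier": value} dict that HBase-style code expects,
# once per row returned by the scan.

def to_happybase_row(cells):
    """Flatten {family: {qualifier: [(value, timestamp), ...]}} into
    {b"family:qualifier": latest_value}."""
    flat = {}
    for family, columns in cells.items():
        for qualifier, versions in columns.items():
            # Keep only the most recent version, as a plain scan would.
            latest_value, _ts = max(versions, key=lambda v: v[1])
            flat[family.encode() + b":" + qualifier] = latest_value
    return flat

# Example row with two columns, one of which has two versions.
cells = {
    "cf1": {
        b"greeting": [(b"hello", 1), (b"hi", 2)],
        b"count": [(b"42", 1)],
    }
}
print(to_happybase_row(cells))
```

Reading via the google.cloud bigtable objects directly skips this per-row reshaping and hands you the cells as the service returns them.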

I don't have any specific recommendations for Spark jobs but here's a pointer to a relevant stackoverflow question: http://stackoverflow.com/questions/40371827/how-to-read-and-write-data-in-google-cloud-bigtable-in-pyspark-application

gary