Greetings, users of Hadoop on the Google Cloud Platform!
We’re excited to announce the latest versions of bdutil, gcs-connector, and bigquery-connector with bug fixes and new features.
Highlights for connector updates:
Added handling for certain rateLimitExceeded (429) errors which occasionally caused Spark jobs to fail when many workers concurrently tried to create the same directory
Added a GCS connector key fs.gs.reported.permissions as a way to work around some tools/frameworks (including Hive 0.14.0/1.0+) which require certain permissions to be reported
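If a tool expects a different reported permission set, the key can be set in core-site.xml. A minimal sketch of such an entry is below; the value 733 is the one the release notes say bdutil configures for Spark 1.5.0 deployments, while the connector default is 700:

```xml
<!-- Sketch of a core-site.xml entry. The connector reports (but does not
     enforce) this permission value for all files and directories; adjust
     the value to whatever your tooling expects. -->
<property>
  <name>fs.gs.reported.permissions</name>
  <value>733</value>
</property>
```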
Highlights for bdutil:
Support for running Spark against Google Cloud Bigtable, with thanks to Tara Gu
Using spark_env.sh with bigtable_env.sh will automatically add the necessary configuration for Spark to access the Bigtable client
Convenient wrapper scripts bigtable-spark-shell and bigtable-spark-submit
New plugin for deploying Apache Hama, with thanks to Edward Yoon
Added plugin standalone_nfs_cache_env.sh to help deploy a shared list-consistency cache server to use with multiple clusters; the GCS_CACHE_MASTER_HOSTNAME variable can be set in a bdutil config to ensure multiple clusters share the same consistent view of GCS
Updated the default Spark version to Spark 1.5.0, along with extra logic for properly deploying the new version. Deploying Spark 1.5.0 with older versions of bdutil resulted in Spark SQL not working and in Spark failing to auto-recover on reboot; with this latest version of bdutil, all Spark 1.5.0 functionality should work as expected.
Updated default Hadoop 2 version to Hadoop 2.7.1. The inclusion of MAPREDUCE-4815 in 2.7.1 drastically improves job-commit time when outputting to a GCS directory.
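As an illustrative sketch of the new Bigtable support, the following shows a deployment and job submission; the project, bucket, jar, and class names are placeholders, and the exact extension file paths may differ in your bdutil checkout:

```shell
# Deploy a cluster with both Spark and Cloud Bigtable configured.
# Note the ordering: spark_env.sh must be imported before bigtable_env.sh.
./bdutil -p my-project -b my-bucket \
    -e extensions/spark/spark_env.sh \
    -e extensions/bigtable/bigtable_env.sh \
    deploy

# On the cluster, the wrapper scripts invoke spark-shell/spark-submit with
# the Bigtable client configuration already applied (placeholder jar/class):
bigtable-spark-submit --class com.example.MyBigtableJob my-job.jar
```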
Special thanks go out to all the community contributors who helped with this latest set of features and bug fixes!
Download bdutil-1.3.2.tar.gz or bdutil-1.3.2.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version. For manual installation or local library usage, download the jar directly.
Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features.
As always, please send any questions or comments to gcp-hadoo...@google.com, or post a question on stackoverflow.com with the tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
Release Notes
bdutil-1.3.2: CHANGES.txt
1.3.2 - 2015-09-12
1. Updated Spark configurations to make Cloud Bigtable work with Spark.
2. Added wrappers bigtable-spark-shell and bigtable-spark-submit to use
with bigtable plugin; only installed if bigtable_env.sh is used.
3. Updated default Hadoop 2 version to 2.7.1.
4. Added support for Apache Hama.
5. Added support for setting up a standalone NFS cache server for GCS
consistency using standalone_nfs_cache_env.sh, along with configurable
GCS_CACHE_MASTER_HOSTNAME to point subsequent clusters at the shared
NFS cache server. See standalone_nfs_cache_env.sh for usage.
6. Added an explicit check for the import ordering of spark_env.sh relative
   to bigtable_env.sh; spark_env.sh must come before bigtable_env.sh.
7. Fixed spelling of "amount" in some documentation.
8. Fixed directory resolution to bdutil when using symlinks.
9. Added Dockerfile for bdutil.
10. Updated default Spark version to 1.5.0; for Spark 1.5.0+, core-site.xml
    will also set 'fs.gs.reported.permissions' to 733, since otherwise Hive
    1.2.1 errors out when using Spark SQL. Hadoop MapReduce will print a
    harmless warning in this case, but otherwise works fine. Additionally,
    the Spark auto-restart configuration now contains logic to use the
    correct syntax for start-slave.sh depending on whether it's Spark 1.4+,
    and Spark auto-restarted daemons now correctly run under user 'hadoop'.
gcs-connector-1.4.2: CHANGES.txt
1.4.2 - 2015-09-12
1. Added checking in GoogleCloudStorageImpl.createEmptyObject(s) to handle
rateLimitExceeded (429) errors by fetching the fresh underlying info
and ignoring the error if the object already exists with the intended
metadata and size. This fixes an issue which mostly affects Spark:
https://github.com/GoogleCloudPlatform/bigdata-interop/issues/10
2. Added logging in GoogleCloudStorageReadChannel for high-level retries.
3. Added support for configuring the permissions reported to the Hadoop
FileSystem layer; the permissions are still fixed per FileSystem instance
and aren't actually enforced, but can now be set with:
fs.gs.reported.permissions [default = "700"]
This allows working around some clients like Hive-related daemons and tools
which pre-emptively check for certain assumptions about permissions.
bigquery-connector-0.7.2: CHANGES.txt
0.7.2 - 2015-09-12
1. Misc updates in gcs-related library dependencies.