Greetings, users of Hadoop on the Google Cloud Platform!
We’re excited to announce the latest versions of bdutil, gcs-connector, and bigquery-connector with bug fixes and new features.
Highlights for connector updates:
Added handling for certain rateLimitExceeded (429) errors which occasionally caused Spark jobs to fail when many workers concurrently tried to create the same directory
Added a GCS connector key fs.gs.reported.permissions as a way to work around some tools/frameworks (including Hive 0.14.0/1.0+) which require certain permissions to be reported
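If a tool expects a different reported permission set, the key can be set in core-site.xml. A minimal sketch of such an entry is below; the value 733 is the one the release notes say bdutil configures for Spark 1.5.0 deployments, while the connector default is 700:

```xml
<!-- Sketch of a core-site.xml entry. The connector reports (but does not
     enforce) this permission value for all files and directories; adjust
     the value to whatever your tooling expects. -->
<property>
  <name>fs.gs.reported.permissions</name>
  <value>733</value>
</property>
```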
Highlights for bdutil:
Support for running Spark against Google Cloud Bigtable, with thanks to Tara Gu
Using spark_env.sh with bigtable_env.sh will automatically add the necessary configuration for Spark to access the Bigtable client
Convenient wrapper scripts bigtable-spark-shell and bigtable-spark-submit
New plugin for deploying Apache Hama, with thanks to Edward Yoon
Added plugin standalone_nfs_cache_env.sh to help deploy a shared list-consistency cache server to use with multiple clusters; the GCS_CACHE_MASTER_HOSTNAME variable can be set in a bdutil config to ensure multiple clusters share the same consistent view of GCS
Updated the default Spark version to Spark 1.5.0, along with extra logic for properly deploying the new version. Deploying Spark 1.5.0 with older versions of bdutil resulted in Spark SQL not working and in Spark failing to auto-recover on reboot; with this latest version of bdutil, all Spark 1.5.0 functionality should work as expected.
Updated default Hadoop 2 version to Hadoop 2.7.1. The inclusion of MAPREDUCE-4815 in 2.7.1 drastically improves job-commit time when outputting to a GCS directory.
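As an illustrative sketch of the new Bigtable support, the following shows a deployment and job submission; the project, bucket, jar, and class names are placeholders, and the exact extension file paths may differ in your bdutil checkout:

```shell
# Deploy a cluster with both Spark and Cloud Bigtable configured.
# Note the ordering: spark_env.sh must be imported before bigtable_env.sh.
./bdutil -p my-project -b my-bucket \
    -e extensions/spark/spark_env.sh \
    -e extensions/bigtable/bigtable_env.sh \
    deploy

# On the cluster, the wrapper scripts invoke spark-shell/spark-submit with
# the Bigtable client configuration already applied (placeholder jar/class):
bigtable-spark-submit --class com.example.MyBigtableJob my-job.jar
```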
Special thanks go out to all the community contributors who helped with this latest set of features and bug fixes!
Download bdutil-1.3.2.tar.gz or bdutil-1.3.2.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version. For manual installation or local library usage, download the jar directly.
Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features.
As always, please send any questions or comments to gcp-hadoo...@google.com, or post a question on stackoverflow.com with the tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
Release Notes
bdutil-1.3.2: CHANGES.txt
1.3.2 - 2015-09-12
1. Updated Spark configurations to make Cloud Bigtable work with Spark.
2. Added wrappers bigtable-spark-shell and bigtable-spark-submit to use
with bigtable plugin; only installed if bigtable_env.sh is used.
3. Updated default Hadoop 2 version to 2.7.1.
4. Added support for Apache Hama.
5. Added support for setting up a standalone NFS cache server for GCS
consistency using standalone_nfs_cache_env.sh, along with configurable
GCS_CACHE_MASTER_HOSTNAME to point subsequent clusters at the shared
NFS cache server. See standalone_nfs_cache_env.sh for usage.
6. Added an explicit check for the import ordering of spark_env.sh relative
   to bigtable_env.sh; spark_env.sh must come before bigtable_env.sh.
7. Fixed spelling of "amount" in some documentation.
8. Fixed directory resolution to bdutil when using symlinks.
9. Added Dockerfile for bdutil.
10. Updated default Spark version to 1.5.0; for Spark 1.5.0+, core-site.xml
    will also set 'fs.gs.reported.permissions' to 733, since otherwise Hive
    1.2.1 errors out when using Spark SQL. Hadoop MapReduce will print a
    harmless warning in this case, but otherwise works fine. Additionally,
    the Spark auto-restart configuration now contains logic to use the
    correct syntax for start-slave.sh depending on whether it's Spark 1.4+,
    and Spark auto-restarted daemons now correctly run under user 'hadoop'.
gcs-connector-1.4.2: CHANGES.txt
1.4.2 - 2015-09-12
1. Added checking in GoogleCloudStorageImpl.createEmptyObject(s) to handle
rateLimitExceeded (429) errors by fetching the fresh underlying info
and ignoring the error if the object already exists with the intended
metadata and size. This fixes an issue which mostly affects Spark:
https://github.com/GoogleCloudPlatform/bigdata-interop/issues/10
2. Added logging in GoogleCloudStorageReadChannel for high-level retries.
3. Added support for configuring the permissions reported to the Hadoop
FileSystem layer; the permissions are still fixed per FileSystem instance
and aren't actually enforced, but can now be set with:
fs.gs.reported.permissions [default = "700"]
This allows working around some clients like Hive-related daemons and tools
which pre-emptively check for certain assumptions about permissions.
bigquery-connector-0.7.2: CHANGES.txt
0.7.2 - 2015-09-12
1. Misc updates in gcs-related library dependencies.