Announcing bdutil-1.3.2, gcs-connector-1.4.2, and bigquery-connector-0.7.2

Hadoop on Google Cloud Platform Team

Sep 12, 2015, 7:09:07 PM
to Hadoop on Google Cloud Platform Team, gcp-hadoo...@googlegroups.com

Greetings, users of Hadoop on the Google Cloud Platform!


We’re excited to announce the latest versions of bdutil, gcs-connector, and bigquery-connector with bug fixes and new features.


Highlights for connector updates:


  • Added handling for certain rateLimitExceeded (429) errors which occasionally caused Spark jobs to fail when many workers concurrently tried to create the same directory

  • Added a GCS connector key fs.gs.reported.permissions as a way to work around some tools/frameworks (including Hive 0.14.0/1.0+) which require certain permissions to be reported
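
For example, a deployment hitting this with Hive could raise the reported permissions in core-site.xml along these lines (a sketch; the value 777 is purely illustrative, and the default is "700" as noted in the release notes below):

    <property>
      <name>fs.gs.reported.permissions</name>
      <value>777</value>
    </property>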


Highlights for bdutil:


  • Support for running Spark against Google Cloud Bigtable, with thanks to Tara Gu

    • Using spark_env.sh with bigtable_env.sh will automatically add the necessary configuration for Spark to access the Bigtable client (see the deployment sketch after this list)

    • Convenient wrapper scripts bigtable-spark-shell and bigtable-spark-submit

  • New plugin for deploying Apache Hama, with thanks to Edward Yoon

  • Added plugin standalone_nfs_cache_env.sh to help deploy a shared list-consistency cache server for use with multiple clusters; the GCS_CACHE_MASTER_HOSTNAME variable can be set in a bdutil config to ensure multiple clusters share the same consistent view of GCS (see the sketch after this list)

  • Updated the default Spark version to Spark 1.5.0, along with extra logic for properly deploying the new version. Deploying Spark 1.5.0 with older versions of bdutil resulted in Spark SQL not working and Spark failing to auto-recover on reboot; with this latest version of bdutil, all Spark 1.5.0 functionality should work as expected.

  • Updated default Hadoop 2 version to Hadoop 2.7.1. The inclusion of MAPREDUCE-4815 in 2.7.1 drastically improves job-commit time when outputting to a GCS directory.
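
As a sketch of the Bigtable-enabled Spark deployment mentioned above (the extension file paths here are assumptions and may differ in your bdutil checkout, but the ordering of Spark before Bigtable is required):

    # Deploy a cluster with Spark listed before Bigtable (paths illustrative):
    ./bdutil -e extensions/spark/spark_env.sh,bigtable_env.sh deploy

    # Once deployed, the wrapper scripts take care of the Bigtable client
    # classpath for you:
    bigtable-spark-shell

And a sketch of the shared consistency cache; the override file name and hostname below are hypothetical:

    # my_shared_cache_env.sh -- hypothetical bdutil config pointing a new
    # cluster at a cache server previously deployed with
    # standalone_nfs_cache_env.sh:
    GCS_CACHE_MASTER_HOSTNAME=nfs-cache-server

    # Deploy additional clusters with this override so they share the same
    # consistent view of GCS:
    ./bdutil -e my_shared_cache_env.sh deploy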


Special thanks goes out to all the community contributors who helped with this latest set of features and bug fixes!


Download bdutil-1.3.2.tar.gz or bdutil-1.3.2.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version. For manual installation or local library usage, download the jar directly.



Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features.


As always, please send any questions or comments to gcp-hadoo...@google.com, or post a question on stackoverflow.com with the tag ‘google-hadoop’ for additional assistance.


All the best,

Your Google Team


Release Notes

bdutil-1.3.2: CHANGES.txt

1.3.2 - 2015-09-12


 1. Updated Spark configurations to make Cloud Bigtable work with Spark.
 2. Added wrappers bigtable-spark-shell and bigtable-spark-submit to use
    with the bigtable plugin; only installed if bigtable_env.sh is used.
 3. Updated default Hadoop 2 version to 2.7.1.
 4. Added support for Apache Hama.
 5. Added support for setting up a standalone NFS cache server for GCS
    consistency using standalone_nfs_cache_env.sh, along with a configurable
    GCS_CACHE_MASTER_HOSTNAME to point subsequent clusters at the shared
    NFS cache server. See standalone_nfs_cache_env.sh for usage.
 6. Added an explicit check for the import ordering of spark_env.sh relative
    to bigtable_env.sh; Spark must come before Bigtable.
 7. Fixed spelling of "amount" in some documentation.
 8. Fixed directory resolution for bdutil when invoked through symlinks.
 9. Added a Dockerfile for bdutil.
 10. Updated default Spark version to 1.5.0; for Spark 1.5.0+, core-site.xml
     will also set 'fs.gs.reported.permissions' to 733, since otherwise Hive
     1.2.1 will error out when using Spark SQL. Hadoop MapReduce will print
     a harmless warning in this case, but otherwise works fine. Additionally,
     the Spark auto-restart configuration now contains logic to use the
     correct syntax for start-slave.sh depending on whether it's Spark 1.4+,
     and Spark auto-restarted daemons now correctly run under user 'hadoop'.
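
For reference, the start-slave.sh difference that the auto-restart logic accounts for is roughly the following; this is recalled from Spark's standalone scripts rather than taken from the release notes, so treat it as an assumption:

    # Spark 1.3 and earlier: a worker instance number preceded the master URL
    ./sbin/start-slave.sh 1 spark://spark-master:7077

    # Spark 1.4+: only the master URL is passed
    ./sbin/start-slave.sh spark://spark-master:7077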




gcs-connector-1.4.2: CHANGES.txt

1.4.2 - 2015-09-12


 1. Added checking in GoogleCloudStorageImpl.createEmptyObject(s) to handle
    rateLimitExceeded (429) errors by fetching the fresh underlying info
    and ignoring the error if the object already exists with the intended
    metadata and size. This fixes an issue which mostly affects Spark:
    https://github.com/GoogleCloudPlatform/bigdata-interop/issues/10
 2. Added logging in GoogleCloudStorageReadChannel for high-level retries.
 3. Added support for configuring the permissions reported to the Hadoop
    FileSystem layer; the permissions are still fixed per FileSystem instance
    and aren't actually enforced, but can now be set with:

       fs.gs.reported.permissions [default = "700"]

    This allows working around some clients, such as Hive-related daemons
    and tools, which preemptively check certain assumptions about
    permissions.
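
For ad-hoc testing, the same key can also be supplied per-command through Hadoop's standard generic -D option rather than editing core-site.xml; a minimal illustration (the bucket name and value are made up):

    hadoop fs -D fs.gs.reported.permissions=777 -ls gs://my-bucket/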




bigquery-connector-0.7.2: CHANGES.txt

0.7.2 - 2015-09-12


 1. Misc updates in gcs-related library dependencies.

