Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest version of bdutil, which fixes several bugs, improves Spark support, and introduces several new features. Special thanks go to Data Artisans for adding bdutil support for Apache Flink, to Hortonworks for continued work on documentation and features for bdutil’s Ambari plugin, and to Google intern Andy Butler for adding an Apache Storm plugin for bdutil. This release also adds basic HBase and CDH plugins for bdutil.
Additionally, the BigQuery connector for Hadoop now uses BigQueryAsyncWriteChannel, which improves performance and reduces the number of "load" jobs over the course of a MapReduce job, in addition to fixing a bug that occasionally caused extraneous records to be written.
Download bdutil-1.2.0.tar.gz or bdutil-1.2.0.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.
Abridged highlights for bdutil updates:
New bdutil extensions: Apache Flink, HBase, CDH, Apache Storm, and Spark on YARN
Fixed memory allocation for Spark executors running on YARN
Misc Spark updates:
  Automatic restart of Spark processes on reboot
  Spark driver memory settings now scale with VM size
  Default Spark version is now 1.2.1
Added support for using import_env in generate_config; for example, a best practice is to store a local config file my_spark_cluster_env.sh specifying a Spark cluster of size 6:
./bdutil -b my-bucket -p my-project -z us-central1-a -e spark -n 6 generate_config my_spark_cluster_env.sh
./bdutil -e my_spark_cluster_env.sh deploy
Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features. Updated connector javadocs are available for Hadoop 1 and Hadoop 2.
As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com with the tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
bdutil-1.2.0: CHANGES.txt
1.2.0 - 2015-02-26
1. Fixed reboots on CentOS 7.
2. Fixed Ambari-plugin support for reusing persistent disks across deployments.
3. Added support for Apache Flink.
4. Made all UPLOAD_FILES paths relative to the bdutil directory.
5. Added basic HBase, CDH, and Storm plugins for bdutil:
   ./bdutil -e hbase deploy
   ./bdutil -e cdh deploy
   ./bdutil -e storm deploy
6. Only symlink the GCS connector on client nodes.
7. Fixed memory allocation for Spark executors running on YARN; created
   the extension extensions/spark/spark_on_yarn_env.sh to support Spark
   on YARN without Spark daemons. The combination of spark_env.sh and
   hadoop2_env.sh allows the user to submit Spark jobs to either the
   Spark master or YARN.
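   For illustration (not from the release notes), a minimal sketch of
   using the new extension; the comma-separated -e list and the
   MyApp / my-app.jar names are assumptions:

   # Deploy Hadoop 2 plus the Spark-on-YARN extension (no Spark daemons).
   ./bdutil -b my-bucket -p my-project -z us-central1-a \
       -e hadoop2_env.sh,extensions/spark/spark_on_yarn_env.sh deploy

   # From the master node, submit a job to YARN instead of a Spark master
   # (MyApp and my-app.jar are placeholders):
   spark-submit --master yarn-client --class MyApp my-app.jar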
8. Enabled automatic restart of Spark processes on reboot.
9. Added support for the GCS connector with ambari_manual_env.sh. See
the "Can I deploy HDP manually using Ambari" section in
platforms/hdp/README.md
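   A hedged example (the env-file path under platforms/hdp/ is an
   assumption based on the README location above):

   # Bring up a cluster for manual HDP deployment through Ambari;
   # the GCS connector is now supported with this env file.
   ./bdutil -b my-bucket -p my-project \
       -e platforms/hdp/ambari_manual_env.sh deploy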
10. Added an experimental env file to enable cluster resizing.
See extensions/google/experimental/resize_env.sh.
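   A sketch of the assumed workflow (see the comments inside
   resize_env.sh for the authoritative steps); layering env files via a
   comma-separated -e list and the NEW_NUM_WORKERS-style variable name
   are assumptions:

   # Set the target cluster size inside resize_env.sh (variable name
   # assumed), then re-run deploy layering the extension over the
   # original cluster config so only the new workers are created.
   ./bdutil \
       -e my_spark_cluster_env.sh,extensions/google/experimental/resize_env.sh \
       deploy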
11. Updated default Spark version to 1.2.1.
12. Updated Spark driver memory settings to scale with VM size.
13. Added import_env support to generate_config. For example:
"./bdutil -e spark -b my-bucket -p my-project generate_config my_env.sh"
makes my_env.sh contain "import_env /path/to/spark_env.sh".
14. Use default mount options to avoid SELinux issues on reboot.
gcs-connector-1.3.3: CHANGES.txt
1.3.3 - 2015-02-26
1. When performing a retry in GoogleCloudStorageReadChannel, attempts to
close() the underlying channel are now performed explicitly instead of
waiting for performLazySeek() to do it, so that SSLException can be
   caught and ignored; broken SSL sockets cannot be closed normally and
   are already responsible for cleaning themselves up on error.
2. Added an explicit check of currentPosition == size when -1 is read from the
underlying stream in GoogleCloudStorageReadChannel, in case the
stream fails to identify an error case and prematurely reaches
end-of-stream.
bigquery-connector-0.6.0: CHANGES.txt
0.6.0 - 2015-02-26
1. Fixed a bug in BigQueryOutputFormat which could occasionally cause extraneous
   records to be written due to low-level retries not being de-duplicated on the
   server side; temporary tables are now written with WRITE_TRUNCATE instead of
   WRITE_APPEND to work around the lack of de-duplication of low-level retries.
2. Removed the BigQueryBatchedWriteChannel, which was previously the default
   behavior, controlled by mapred.bq.output.async.write.enabled defaulting
   to 'false'; the default is now 'true' and the key is ignored, as the
   batched channel is fundamentally vulnerable to low-level retries causing
   extraneous records to be written. BigQueryAsyncWriteChannel is now always
   used, which improves performance and reduces the number of "load" jobs
   over the course of a MapReduce job; the number of load jobs is now equal
   to the number of reduce tasks, rather than to the total output size
   divided by the batch size.
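   Concretely, jobs that previously opted in to the async channel via a
   Hadoop generic option no longer need to; as of 0.6.0 the key is
   accepted but has no effect (my-job.jar and MyJob are placeholders):

   # This flag is now ignored; BigQueryAsyncWriteChannel is always used.
   hadoop jar my-job.jar MyJob \
       -D mapred.bq.output.async.write.enabled=true input output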
3. All BigQuery job insertions now supply client-generated job_ids and
   handle HTTP 409 Conflict responses as duplicate jobs caused by low-level
   retries.