Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest version of bdutil, which fixes several bugs, improves Spark support, and introduces several new features. Special thanks go to Data Artisans for adding bdutil support for Apache Flink, to Hortonworks for continued work on documentation and features for bdutil’s Ambari plugin, and to Google intern Andy Butler for adding an Apache Storm plugin for bdutil. This release also adds basic HBase and CDH plugins for bdutil.
Additionally, the BigQuery connector for Hadoop now uses BigQueryAsyncWriteChannel, which improves performance and reduces the number of "load" jobs over the course of a MapReduce job, in addition to fixing a bug that occasionally caused extraneous records to be written.
Download bdutil-1.2.0.tar.gz or bdutil-1.2.0.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.
Abridged highlights for bdutil updates:
New bdutil extensions: Apache Flink, HBase, CDH, Apache Storm, and Spark on YARN
Fixed memory allocation for Spark executors running on YARN
Misc Spark updates:
  Automatic restart of Spark processes on reboot
  Spark driver memory settings now scale with VM size
  Default Spark version is now 1.2.1
Added support for using import_env in generate_config; for example, a best practice is to store a local config file my_spark_cluster_env.sh specifying a Spark cluster of size 6:
./bdutil -b my-bucket -p my-project -z us-central1-a -e spark -n 6 generate_config my_spark_cluster_env.sh
./bdutil -e my_spark_cluster_env.sh deploy
Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features. Updated connector javadocs are available for Hadoop 1 and Hadoop 2.
As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com with the tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
bdutil-1.2.0: CHANGES.txt
1.2.0 - 2015-02-26
1. Fixed reboots on CentOS 7.
2. Fixed Ambari-plugin support for reusing persistent disks across deployments.
3. Added support for Apache Flink.
4. Made all UPLOAD_FILES paths relative to the bdutil directory.
5. Added basic HBase, CDH, and Storm plugins for bdutil:
   ./bdutil -e hbase deploy
   ./bdutil -e cdh deploy
   ./bdutil -e storm deploy
6. Only symlink the GCS connector on client nodes.
7. Fixed memory allocation for Spark executors running on YARN; created
   the extension extensions/spark/spark_on_yarn_env.sh to support Spark
   on YARN without Spark daemons. The combination of spark_env.sh and
   hadoop2_env.sh allows the user to submit Spark jobs to either the
   Spark master or YARN.
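   For illustration (not from the release notes), a minimal sketch of
   using the new extension; the comma-separated -e list and the
   MyApp / my-app.jar names are assumptions:

   # Deploy Hadoop 2 plus the Spark-on-YARN extension (no Spark daemons).
   ./bdutil -b my-bucket -p my-project -z us-central1-a \
       -e hadoop2_env.sh,extensions/spark/spark_on_yarn_env.sh deploy

   # From the master node, submit a job to YARN instead of a Spark master
   # (MyApp and my-app.jar are placeholders):
   spark-submit --master yarn-client --class MyApp my-app.jar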
8. Enabled automatic restart of Spark processes on reboot.
9. Added support for the GCS connector with ambari_manual_env.sh. See
the "Can I deploy HDP manually using Ambari" section in
platforms/hdp/README.md
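   A hedged example (the env-file path under platforms/hdp/ is an
   assumption based on the README location above):

   # Bring up a cluster for manual HDP deployment through Ambari;
   # the GCS connector is now supported with this env file.
   ./bdutil -b my-bucket -p my-project \
       -e platforms/hdp/ambari_manual_env.sh deploy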
10. Added an experimental env file to enable cluster resizing.
See extensions/google/experimental/resize_env.sh.
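   A sketch of the assumed workflow (see the comments inside
   resize_env.sh for the authoritative steps); layering env files via a
   comma-separated -e list and the NEW_NUM_WORKERS-style variable name
   are assumptions:

   # Set the target cluster size inside resize_env.sh (variable name
   # assumed), then re-run deploy layering the extension over the
   # original cluster config so only the new workers are created.
   ./bdutil \
       -e my_spark_cluster_env.sh,extensions/google/experimental/resize_env.sh \
       deploy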
11. Updated default Spark version to 1.2.1.
12. Updated Spark driver memory settings to scale with VM size.
13. Added import_env support to generate_config. For example:
"./bdutil -e spark -b my-bucket -p my-project generate_config my_env.sh"
makes my_env.sh contain "import_env /path/to/spark_env.sh".
14. Use default mount options to avoid SELinux issues on reboot.
gcs-connector-1.3.3: CHANGES.txt
1.3.3 - 2015-02-26
1. When performing a retry in GoogleCloudStorageReadChannel, attempts to
close() the underlying channel are now performed explicitly instead of
waiting for performLazySeek() to do it, so that SSLException can be
   caught and ignored; broken SSL sockets cannot be closed normally and
   are already responsible for cleaning themselves up on error.
2. Added an explicit check of currentPosition == size when -1 is read from the
underlying stream in GoogleCloudStorageReadChannel, in case the
stream fails to identify an error case and prematurely reaches
end-of-stream.
bigquery-connector-0.6.0: CHANGES.txt
0.6.0 - 2015-02-26
1. Fixed a bug in BigQueryOutputFormat which could occasionally cause extraneous
   records to be written due to low-level retries not being de-duplicated on the
   server side; temporary tables are now written with WRITE_TRUNCATE instead of
   WRITE_APPEND to work around the lack of de-duplication of low-level retries.
2. Removed the BigQueryBatchedWriteChannel, which was previously the default
   behavior, controlled by mapred.bq.output.async.write.enabled defaulting
   to 'false'; the default is now 'true' and the key is ignored, as the
   batched channel is fundamentally vulnerable to low-level retries causing
   extraneous records to be written. BigQueryAsyncWriteChannel is now always
   used, which improves performance and reduces the number of "load" jobs
   over the course of a MapReduce job; the number of load jobs is now equal
   to the number of reduce tasks, rather than to the total output size
   divided by the batch size.
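   Concretely, jobs that previously opted in to the async channel via a
   Hadoop generic option no longer need to; as of 0.6.0 the key is
   accepted but has no effect (my-job.jar and MyJob are placeholders):

   # This flag is now ignored; BigQueryAsyncWriteChannel is always used.
   hadoop jar my-job.jar MyJob \
       -D mapred.bq.output.async.write.enabled=true input output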
3. All BigQuery job insertions now supply client-generated job_ids and
   handle HTTP 409 Conflict responses as duplicate jobs caused by low-level
   retries.