Greetings, users of Hadoop on Google Cloud Platform!
We’re pleased to announce the latest version of bdutil, which adds support for Hortonworks HDP 2.2, improves the default Hadoop 2 configuration, and improves Spark deployments.
Download bdutil-1.1.0.tar.gz or bdutil-1.1.0.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.
Abridged highlights for bdutil updates:
bdutil now includes an extension for installing HDP via Apache Ambari, which can be used by adding the following flag to your existing bdutil invocations: "-e platforms/hdp/ambari_env.sh" (see the example below).
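For example, a new HDP cluster can be deployed as follows (a minimal sketch; combine with whatever flags your existing bdutil invocations already use):

   # Deploy a cluster with Hortonworks HDP installed and managed via Apache Ambari.
   ./bdutil -e platforms/hdp/ambari_env.sh deploy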
Abridged highlights for gcs-connector updates:
Zero-length file creation markers, which are used for fast-failing when two concurrent writers attempt to create the same file, are now disabled by default. A configuration option has been added to re-enable them, as shown below. See the gcs-connector CHANGES.txt for details.
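If your application depends on failing fast when two concurrent writers race to create the same file, marker files can be re-enabled in core-site.xml (a minimal sketch; the property name is taken from the gcs-connector CHANGES.txt below):

   <!-- Re-enable zero-length marker files so concurrent creates fail early. -->
   <property>
     <name>fs.gs.create.marker.files.enable</name>
     <value>true</value>
   </property>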
Abridged highlights for bigquery-connector updates:
Various bug fixes in both the input and output formats.
Please see the detailed release notes below for more information about the new bdutil and connector features.
You may download each of the connectors directly via the following links, or use the latest bdutil to install them on a new cluster.
gcs-connector: gcs-connector-1.3.2-hadoop1.jar and gcs-connector-1.3.2-hadoop2.jar
bigquery-connector: bigquery-connector-0.5.1-hadoop1.jar and bigquery-connector-0.5.1-hadoop2.jar
As always, please send any questions or comments to gcp-hadoo...@google.com, or post a question on stackoverflow.com with the tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
bdutil-1.1.0: CHANGES.txt
1.1.0 - 2015-01-22
1. Added plugin for deploying Ambari/HDP with:
./bdutil -e platforms/hdp/ambari_env.sh deploy
2. Set dfs.replication to 2 under conf/hadoop*/hdfs-template.xml; this suits
   PD deployments better than r=3, but if deploying with HDFS residing on
   non-PD storage, the value should be reverted to 3 (see the hdfs-site.xml
   sketch after this list).
3. Enabled the Spark EventLog for Spark deployments, logging to
   gs://${CONFIGBUCKET}/spark-eventlog-base/${MASTER_HOSTNAME} (see the
   spark-defaults.conf sketch after this list).
4. Migrated off of miscellaneous deprecated fields in favor of using
   spark-defaults.conf for Spark 1.0+; this cleans up warnings on
   spark-submit.
5. Moved SPARK_LOG_DIR from default of ${SPARK_HOME}/logs into
/hadoop/spark/logs so that they reside on the large PD if it exists.
6. Upgraded default Spark version to 1.2.0.
7. Added bdutil_env option INSTALL_JDK_DEVEL to optionally install the full
   JDK with compiler/tools instead of just the minimal JRE; set to 'true' in
   single_node_env.sh and ambari_env.sh (see the env-file sketch after this
   list).
8. Added a Python script to allocate memory more intelligently in Hadoop 2.
9. Upgraded Hadoop 2 version to 2.5.2.
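For item 2, deployments with HDFS on non-PD storage can revert to the stock replication factor by overriding the standard Hadoop property, for example in hdfs-site.xml (a minimal sketch):

   <!-- Restore the default HDFS replication factor of 3 for non-PD storage. -->
   <property>
     <name>dfs.replication</name>
     <value>3</value>
   </property>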
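For items 3 and 4, the deployed configuration corresponds to spark-defaults.conf entries along these lines (a sketch using the standard Spark 1.x property names; bdutil expands ${CONFIGBUCKET} and ${MASTER_HOSTNAME} at deploy time):

   # Enable the Spark EventLog, writing to the deployment's GCS bucket.
   spark.eventLog.enabled  true
   spark.eventLog.dir      gs://${CONFIGBUCKET}/spark-eventlog-base/${MASTER_HOSTNAME}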
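For item 7, one way to flip the option is a small custom env file passed with -e (a sketch; "jdk_env.sh" is a hypothetical file name, and this assumes bdutil accepts a user-supplied env file via -e just as it does for the bundled extensions):

   # jdk_env.sh: install the full JDK (compiler/tools) instead of the minimal JRE.
   INSTALL_JDK_DEVEL=true

   ./bdutil -e jdk_env.sh deploy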
gcs-connector-1.3.2: CHANGES.txt
1.3.2 - 2015-01-22
1. In the file-creation path, marker file creation is now configurable. By
   default, marker files will not be created; this default is most suitable
   for MapReduce applications. Setting fs.gs.create.marker.files.enable to
   true in core-site.xml will re-enable marker files. Marker files should be
   considered for applications that depend on failing early when two
   concurrent writers attempt to create the same file. Note that file
   overwrite semantics are preserved with or without marker files, but
   failures will occur sooner with marker files present.
bigquery-connector-0.5.1: CHANGES.txt
0.5.1 - 2015-01-22
1. Added enforcement of a maximum number of export shards (currently 500)
   when calculating splits for BigQueryInputFormat.
2. Fixed a bug where BigQueryOutputCommitter.needsTaskCommit() incorrectly
   depended on a Bigquery.Tables.list() call; table listing is only
   eventually consistent, so occasionally a task would erroneously fail to
   commit data.
3. Removed an extraneous table deletion in BigQueryOutputCommitter.abortTask();
   cleanup occurs during job cleanup anyway, and this would incorrectly
   (but harmlessly) try to delete a nonexistent table for map tasks.