Greetings, users of Hadoop on the Google Cloud Platform!
We’re excited to announce the latest versions of bdutil, gcs-connector, and bigquery-connector with several bug fixes and new features.
Abridged highlights for connector updates:
Upgraded Google API client libraries and Guava to latest versions (see detailed notes below for exact version numbers)
Removed an obsolete configuration setting for a 250GB upload limit on objects; objects larger than 250GB can now be uploaded without modifying config settings
Added better configurability and usability for directly calling the lower-level GoogleCloudStorage libraries when not going through the Hadoop FileSystem interface
Abridged highlights for bdutil:
Thanks to the generous contributions of MapR, bdutil now includes a plugin for deploying MapR clusters on GCE
Added support for running Storm on Cloud Bigtable using the Cloud Bigtable connector for HBase, with an end-to-end example under cloud-bigtable-examples
Default Flink version is now 0.9.0
Download bdutil-1.3.1.tar.gz or bdutil-1.3.1.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version. For manual installation or local library usage, download the jar directly.
Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features.
As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com tagged ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
Release Notes
bdutil-1.3.1: CHANGES.txt
1.3.1 - 2015-07-09
1. Added plugin for deploying MapR under platforms/mapr/mapr_env.sh; see
platforms/mapr/README.md for details.
2. Changed mapreduce.fileoutputcommitter.algorithm.version to "2"; this should
only have an effect when running with Hadoop 2.7+, where it significantly
speeds up job-commit time when using the GCS connector (a minimal sketch of
setting this per job follows this list).
See https://issues.apache.org/jira/browse/MAPREDUCE-4815 for more details.
3. Added an option ENABLE_STORM_BIGTABLE to extensions/storm/storm_env.sh to
set up use of Google Cloud Bigtable from Apache Storm.
4. Updated Flink version to 0.9.0.
5. Switched from using SPARK_CLASSPATH to using SPARK_DIST_CLASSPATH pointed
at the Hadoop classpath to inherit gcs-connector and other Hadoop libraries
on the default Spark classpath. This gets rid of a warning message about
SPARK_CLASSPATH deprecation when running Spark, and improves access to
related Hadoop libraries from Spark jobs.
6. Fixed reboot recovery for single-node clusters; this includes the ability
for single-node clusters to recover from issuing "Stop" and then "Start"
commands via the GCE API.
7. Added explicit value for mapreduce.job.working.dir in Ambari config; this
works around a bug in PigInputFormat where an exception is thrown with
"Wrong FS scheme" when the default filesystem doesn't have the same scheme
as the filesystem of the input file(s) (e.g. when reading GCS files and
the default FS is HDFS). Pig reading from GCS should now work in Ambari-based
bdutil deployments.
8. Fixed a bug where Hive deployed under ambari_env.sh was unable to
LOAD DATA INPATH 'gs://<...>' because the Hive server needed to be restarted
after GCS connector installation to pick the connector up on its classpath.
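As a companion to item 2 above, here is a minimal sketch of opting into the
version 2 commit algorithm for a single job rather than cluster-wide. The
property key is the standard Hadoop 2.7+ setting named in that note; the class
and job names are just placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CommitV2Example {
      public static Job newJob() throws java.io.IOException {
        Configuration conf = new Configuration();
        // Algorithm version 2 lets tasks commit output directly into the final
        // output directory, avoiding the serial rename pass that is slow on GCS.
        conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2");
        return Job.getInstance(conf, "gcs-output-job");
      }
    }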
gcs-connector-1.4.1: CHANGES.txt
1.4.1 - 2015-07-09
1. Switched from the custom SeekableReadableByteChannel to
Java 7's java.nio.channels.SeekableByteChannel.
2. Removed the configurable but default-constrained 250GB upload limit;
uploads can now exceed 250GB without needing to modify config settings.
3. Added helper classes related to GCS retries.
4. Added workaround support for read retries on objects with content-encoding
set to gzip; such a content encoding isn't generally correct to use, since it
means the byte count reported by the filesystem will not match the bytes
actually read, but for cases that accept the mismatch, the read channel can
now manually seek to where it left off on a retry rather than having a
GZIPInputStream throw an exception for a malformed partial stream.
5. Added an option for enabling "direct uploads" in
GoogleCloudStorageWriteChannel; the option is not used by the Hadoop layer,
but can be used by clients which directly access the lower GoogleCloudStorage
layer.
6. Added CreateBucketOptions to the GoogleCloudStorage interface so that
clients using the low-level GoogleCloudStorage directly can create buckets
with different locations and storage classes (a hedged sketch follows this
list).
7. Fixed https://github.com/GoogleCloudPlatform/bigdata-interop/issues/5 where
stale cache entries caused stuck phantom directories if the directories
were deleted using non-Hadoop-based GCS clients.
8. Fixed a bug which prevented the Apache HTTP transport from working with
Hadoop 2 when no proxy was set.
9. Misc updates in library dependencies; google.api.version
(com.google.http-client, com.google.api-client) updated from 1.19.0 to
1.20.0, google-api-services-storage from v1-rev16-1.19.0 to
v1-rev35-1.20.0, google-api-services-bigquery from v2-rev171-1.19.0 to
v2-rev217-1.20.0, and Guava from 17.0 to 18.0.
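As a companion to item 6 above, here is a minimal sketch of creating a bucket
with an explicit location and storage class through the low-level
GoogleCloudStorage interface. The CreateBucketOptions constructor arguments
and the create(bucket, options) overload follow the wording of that note, but
treat the exact signatures, along with the bucket name, location, and storage
class values, as illustrative assumptions and check the 1.4.1 javadoc;
constructing the GoogleCloudStorage instance itself is elided because it
depends on your credential setup.

    import java.io.IOException;

    import com.google.cloud.hadoop.gcsio.CreateBucketOptions;
    import com.google.cloud.hadoop.gcsio.GoogleCloudStorage;

    public class CreateBucketSketch {
      // 'gcs' is assumed to be an already-constructed GoogleCloudStorage
      // instance wired up with credentials.
      static void createNearlineBucket(GoogleCloudStorage gcs) throws IOException {
        // The location and storage class values here are only examples.
        CreateBucketOptions options = new CreateBucketOptions("EU", "NEARLINE");
        gcs.create("my-example-bucket", options);
      }
    }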
bigquery-connector-0.7.1: CHANGES.txt
0.7.1 - 2015-07-09
1. Misc updates in library dependencies; google.api.version
(com.google.http-client, com.google.api-client) updated from 1.19.0 to
1.20.0, google-api-services-storage from v1-rev16-1.19.0 to
v1-rev35-1.20.0, google-api-services-bigquery from v2-rev171-1.19.0 to
v2-rev217-1.20.0, and Guava from 17.0 to 18.0.