Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest version of bdutil, which now uses the gcloud interface from the Cloud SDK; the latest version of the bigquery-connector, which adds support for Hadoop 2 MapReduce and Avro-based exports; and the latest version of the datastore-connector, which also adds support for Hadoop 2 MapReduce.
Download bdutil-1.0.1.tar.gz or bdutil-1.0.1.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.
Abridged highlights for bdutil updates:
bdutil now uses gcloud compute for interacting with GCE.
The default zone for bdutil was changed to us-central1-a.
Abridged highlights for bigquery-connector updates:
Support for Hadoop 2 was added for Java MapReduce.
Adds support for Avro-based exports in MapReduce applications.
Abridged highlights for gcs-connector updates:
Several improvements to NFS-cache handling and improved handling of 500-level errors.
Abridged highlights for datastore-connector updates:
Adds support for Hadoop 2.
Please see the detailed release notes below for more information about the new bdutil and connector features.
You may download each of the connectors directly via the following links, or use the latest bdutil to install them on a new cluster.
gcs-connector: gcs-connector-1.3.1-hadoop1.jar and gcs-connector-1.3.1-hadoop2.jar
bigquery-connector: bigquery-connector-0.5.0-hadoop1.jar and bigquery-connector-0.5.0-hadoop2.jar
datastore-connector: datastore-connector-0.14.9-hadoop1.jar and datastore-connector-0.14.9-hadoop2.jar
As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com with tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
bdutil-1.0.1: CHANGES.txt
1.0.1 - 2014-12-16
1. Replaced usage of deprecated gcutil with gcloud compute.
2. Changed GCE_SERVICE_ACCOUNT_SCOPES from a comma-separated list to a bash
   array.
3. Fixed cleanup of pig-validate-setup.sh, hive-validate-setup.sh and
spark-validate-setup.sh.
4. Upgraded default Spark version to 1.1.1.
5. The default zone for instances is now us-central1-a.
gcs-connector-1.3.1: CHANGES.txt
1.3.1 - 2014-12-16
1. Fixed a rare NullPointerException in FileSystemBackedDirectoryListCache
which can occur if a directory being listed is purged from the cache
between a call to "exists()" and "listFiles()".
2. Fixed a bug in GoogleHadoopFileSystemCacheCleaner where the cache cleaner
   fails to clean any contents when a bucket is non-empty but expired.
3. Fixed a bug in FileSystemBackedDirectoryListCache which caused garbage
collection to require several passes for large directory hierarchies;
now we can successfully garbage-collect an entire expired tree in a
single pass, and cache files are also processed in-place without having
to create a complete in-memory list.
4. Updated handling of new file creation, file copying, and file deletion
   so that all object modification requests sent to GCS contain preconditions
   that should prevent race conditions in the face of retried operations.
bigquery-connector-0.5.0: CHANGES.txt
0.5.0 - 2014-12-16
1. BigQueryInputFormat has been renamed GsonBigQueryInputFormat to better
   reflect its nature as a Gson-based format. A forwarding declaration
   was left in place to maintain compatibility.
2. JsonTextBigQueryInputFormat was added to provide lines of JSON text as
they appear in the BigQuery export.
3. When using sharded BigQuery exports (the default), the keys will no
   longer be in increasing order per mapper. Instead, the keys will be
   whatever the delegate RecordReader reports, which is generally the byte
   position within the current file. However, the sharded export creates
   many files per mapper, so this position will appear to reset to 0 when
   the reader switches between files. The record reader's getProgress()
   will still report progress across the entire dataset that the record
   reader is responsible for.
4. The BigQuery connector can now ingest Avro-based BigQuery exports. Using
   an Avro-based export should result in less data transferred between your
   MapReduce job and Google Cloud Storage and should require less CPU time
   to parse the data files. To use Avro, set the input format to
   AvroBigQueryInputFormat and update your map code to expect LongWritable
   keys and Avro GenericData.Record values (a minimal sketch follows these
   notes).
5. Hadoop 2 support was added for Java MapReduce. Streaming support for
   Hadoop 2 will be included in a future release.
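To illustrate note 4 above, here is a minimal sketch of a Hadoop 2 MapReduce job that reads an Avro-based BigQuery export. The package name for the connector class, the "word" field name, and the word-count logic are assumptions made for this example, and the BigQuery project/dataset/table configuration for the export is omitted.

import java.io.IOException;

import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Assumed package for the connector's input format class.
import com.google.cloud.hadoop.io.bigquery.AvroBigQueryInputFormat;

public class AvroExportWordCount {

  // Per release note 4, the Avro input format delivers LongWritable keys
  // and Avro GenericData.Record values to the mapper.
  public static class RecordMapper
      extends Mapper<LongWritable, GenericData.Record, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, GenericData.Record record, Context context)
        throws IOException, InterruptedException {
      // "word" is a hypothetical field in the exported table's schema.
      Object word = record.get("word");
      if (word != null) {
        context.write(new Text(word.toString()), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The BigQuery project, dataset, and table to export must also be set
    // on 'conf' via the connector's configuration; that setup is omitted here.
    Job job = Job.getInstance(conf, "avro-bigquery-export-example");
    job.setJarByClass(AvroExportWordCount.class);
    job.setInputFormatClass(AvroBigQueryInputFormat.class);
    job.setMapperClass(RecordMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}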
datastore-connector-0.14.9: CHANGES.txt
0.14.9 - 2014-12-16
1. Added support for Hadoop 2.