Announcing bdutil-1.3.0, gcs-connector-1.4.0, and bigquery-connector-0.7.0


Hadoop on Google Cloud Platform Team

May 27, 2015

Greetings, users of Hadoop on Google Cloud Platform!


We’re excited to announce the latest versions of bdutil, gcs-connector, and bigquery-connector with several bug fixes and new features.


Abridged highlights for connector updates:


  • Moved all logging to call slf4j interfaces directly, standardizing logging semantics with other Google service libraries and connectors; removed the old LogUtil wrapper class.

  • The BigQuery connector now supports concurrently running multiple jobs writing into the same output dataset.

  • The GCS connector now provides an option to select the HttpTransport type to use, which can be either APACHE or JAVA_NET. This is set with the fs.gs.http.transport.type configuration key.

  • The GCS connector now supports routing through a proxy server by setting the fs.gs.proxy.address configuration key.

  • The GCS connector now reports fake “inferred” directory objects when objects exist without corresponding parent-directory object placeholders. This behavior is in line with the behavior of other object-store based file systems used in Hadoop, and makes it easier to use the GCS connector in a “read-only” mode on public datasets, which may not have well-formed directory placeholders. This behavior can be controlled with the fs.gs.implicit.dir.infer.enable configuration key, and defaults to true.
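
These connector options are ordinary Hadoop configuration keys, so they can be set cluster-wide in core-site.xml or passed per-command via generic -D options. A minimal sketch, assuming a Hadoop installation with gcs-connector-1.4.0 on the classpath; the proxy host and bucket name are hypothetical placeholders:

```shell
# Sketch only: proxy.example.com:3128 and gs://some-public-dataset are
# hypothetical placeholders; requires Hadoop with gcs-connector-1.4.0.
hadoop fs \
  -D fs.gs.http.transport.type=APACHE \
  -D fs.gs.proxy.address=proxy.example.com:3128 \
  -D fs.gs.implicit.dir.infer.enable=true \
  -ls gs://some-public-dataset/
```

The same keys can instead be added to core-site.xml to apply to every job on the cluster.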


Abridged highlights for bdutil:


  • New flag --preemptible (corresponding to the bdutil_env.sh variable PREEMPTIBLE_FRACTION) lets you run a fraction (between 0.0 and 1.0) of the worker VMs as preemptible VMs. Preemptible VMs are priced at a 70% discount over standard pricing, so supplementing a GCS-based Hadoop cluster with them may reduce costs.

  • Default Hadoop 2 version is now 2.6.0.

  • Default Spark version is now 1.3.1.

  • Added support for deploying onto ubuntu-12-04 and ubuntu-14-04 images.

  • Added flags --master_boot_disk_size_gb and --worker_boot_disk_size_gb (corresponding to MASTER_BOOT_DISK_SIZE_GB and WORKER_BOOT_DISK_SIZE_GB) for setting boot disk sizes.

  • Fixed a bug that caused incomplete Ambari deployments by failing to copy mapreduce.tar.gz, pig.tar.gz, etc., into hdfs:///hdp/apps/...
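
Combining the new flags, a deployment invocation might look like the sketch below; the flag values are illustrative, and your usual bdutil arguments (project, bucket, number of workers, etc.) are still required:

```shell
# Sketch only: values are illustrative; requires bdutil-1.3.0 and
# authenticated Google Cloud credentials.
./bdutil deploy \
  --preemptible 0.5 \
  --master_boot_disk_size_gb 100 \
  --worker_boot_disk_size_gb 500
```

Equivalently, PREEMPTIBLE_FRACTION, MASTER_BOOT_DISK_SIZE_GB, and WORKER_BOOT_DISK_SIZE_GB can be set in bdutil_env.sh.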


Download bdutil-1.3.0.tar.gz or bdutil-1.3.0.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version. For manual installation or local library usage, download gcs-connector-1.4.0-hadoop1.jar, gcs-connector-1.4.0-hadoop2.jar, bigquery-connector-0.7.0-hadoop1.jar, or bigquery-connector-0.7.0-hadoop2.jar directly.


Please see the detailed release notes below for more information about the new bdutil, GCS connector, and BigQuery connector features.


As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com with tag ‘google-hadoop’ for additional assistance.


All the best,

Your Google Team


Release Notes

bdutil-1.3.0: CHANGES.txt

1.3.0 - 2015-05-27


 1. Upgraded default Hadoop 2 version to 2.6.0.

 2. Added support for making a portion of worker VMs run as preemptible VMs
    by setting --preemptible or PREEMPTIBLE_FRACTION to a value between
    0.0 and 1.0 to specify the fraction of workers to run as preemptible.

 3. Added support for deploying onto ubuntu-12-04 or ubuntu-14-04 images.

 4. Added support for specifying boot disk sizes via --master_boot_disk_size_gb
    and --worker_boot_disk_size_gb or MASTER_BOOT_DISK_SIZE_GB and
    WORKER_BOOT_DISK_SIZE_GB; uses default derived from base image if unset.

 5. Upgraded default Spark version to 1.3.1.

 6. Removed datastore-connector installation options and samples; the connector
    has been deprecated since February 17th, 2015. For alternatives see:
    https://groups.google.com/forum/#!topic/gcp-hadoop-announce/D3_OZuqn4_o

 7. Added workaround for a bug where ambari_env.sh and ambari_manual_env.sh
    would fail to copy mapreduce.tar.gz, pig.tar.gz, etc., into
    hdfs:///hdp/apps/... during setup. Ambari should now work out-of-the-box.



gcs-connector-1.4.0: CHANGES.txt

1.4.0 - 2015-05-27


 1. The new inferImplicitDirectories option to GoogleCloudStorage tells
    it to infer the existence of a directory (such as foo) when that
    directory node does not exist in GCS but there are GCS files
    that start with that path (such as foo/bar). This allows
    the GCS connector to be used on read-only filesystems where
    those intermediate directory nodes cannot be created by the
    connector. The value of this option can be controlled by the
    Hadoop boolean config option "fs.gs.implicit.dir.infer.enable".
    The default value is true.

 2. Increased Hadoop dependency version to 2.6.0.

 3. Fixed a bug introduced in 1.3.2 where, during marker file creation,
    file info was not properly updated between attempts. This led
    to backoff-retry-exhaustion with 412-precondition-not-met errors.

 4. Added support for changing the HttpTransport implementation to use,
    via fs.gs.http.transport.type = [APACHE | JAVA_NET].

 5. Added support for setting a proxy of the form "host:port" via
    fs.gs.proxy.address, which works for both APACHE and JAVA_NET
    HttpTransport options.

 6. All logging converted to use slf4j instead of the previous
    org.apache.commons.logging.Log; removed the LogUtil wrapper which
    previously wrapped org.apache.commons.logging.Log.

 7. Added automatic retries for premature end-of-stream errors; the previous
    behavior was to throw an unrecoverable exception in such cases.

 8. Made close() idempotent for GoogleCloudStorageReadChannel.

 9. Added a low-level method for setting Content-Type metadata in the
    GoogleCloudStorage interface.

 10. Increased default DirectoryListCache TTL to 4 hours and exposed the TTL
     settings as top-level config params:
     fs.gs.metadata.cache.max.age.entry.ms
     fs.gs.metadata.cache.max.age.info.ms
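
Both TTL keys take values in milliseconds; a small sketch deriving the millisecond value for the 4-hour default:

```shell
# Convert the 4-hour default DirectoryListCache TTL into the millisecond
# value expected by the two config keys.
TTL_HOURS=4
TTL_MS=$((TTL_HOURS * 60 * 60 * 1000))  # 14400000
echo "fs.gs.metadata.cache.max.age.entry.ms=${TTL_MS}"
echo "fs.gs.metadata.cache.max.age.info.ms=${TTL_MS}"
```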



bigquery-connector-0.7.0: CHANGES.txt

0.7.0 - 2015-05-27


 1. All logging converted to use slf4j instead of the previous
    org.apache.commons.logging.Log; removed the LogUtil wrapper which
    previously wrapped org.apache.commons.logging.Log.

 2. Added exponential-backoff automatic retries in waitForJobCompletion.

 3. Added support for running multiple concurrent jobs writing to the same
    output dataset by including the JobID as part of the temporary datasetId.

