Hello everyone,
This week we’re releasing a new version of Google Cloud Dataproc - 0.2. This release includes a new bundle of components for Cloud Dataproc clusters (Spark 1.5.2, Hive 1.2.1, Pig 0.15.0), several new features, and numerous optimizations and bug fixes.
Starting with this release, we are staging rollouts to occur over several days. You will see these new features as they become available in your region. You can expect to see all new features live by the end of Friday (11/20).
New features
Version selection - Now that there are multiple versions of Cloud Dataproc, you can choose which version a new cluster uses. Our documentation describes our policy for supporting older versions, and our version detail page lists the software components included in each version. You can select a version when creating a cluster through the API, the Cloud SDK (with the flag “--image-version”), or the Google Developers Console; be sure to update your gcloud installation with gcloud components update to pick up the latest available flags. Importantly, when we release a new version, it becomes the default for new clusters once it is available in your region (1-4 days after release).
OSS upgrades - The new version 0.2 of Dataproc now contains Spark 1.5.2, Hive 1.2.1, and Pig 0.15.0.
Connector updates - Last week we released updates to our BigQuery and Google Cloud Storage connectors (0.7.3 and 1.4.3, respectively). These updates fix a number of bugs, and the new connector versions are now included in Cloud Dataproc 0.2.
Hive Metastore - Introduced a MySQL-based per-cluster persistent metastore which is shared between Hive and SparkSQL. This also fixes the “hive” command.
More Native Libraries - Cloud Dataproc now includes native Snappy libraries. It also includes native BLAS, LAPACK and ARPACK libraries for Spark’s MLlib.
Clusters “Diagnose” command - The Cloud SDK now includes a “--diagnose” command for gathering logging and diagnostic information about your cluster. More details about this command are available in the Cloud Dataproc support documentation.
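As a rough illustration of the version selection feature above, the sketch below assembles a Cloud SDK invocation that pins a new cluster to a specific version. Only the “--image-version” flag comes from this announcement; the `gcloud beta dataproc clusters create` command path, the helper name `build_create_command`, and the cluster name are assumptions made for the example, so check the help output of your installed SDK for the exact syntax.

```python
import shlex


def build_create_command(cluster_name, image_version):
    """Return the argv list for creating a cluster pinned to image_version.

    Assumption: the command group is `gcloud beta dataproc clusters create`;
    only the --image-version flag is taken from the announcement.
    """
    return [
        "gcloud", "beta", "dataproc", "clusters", "create", cluster_name,
        "--image-version", image_version,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_create_command("example-cluster", "0.2")))
```

Leaving the flag off would give you whatever version is the current default in your region, so pinning the version is mainly useful when a job depends on a particular component bundle.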
Bugfixes
Fixed the ability to delete jobs which fast-failed before some cluster and staging directories had ever been created
Fixed a rare bug where underlying Compute Engine issues could leave VM instances undeleted even after the Cloud Dataproc cluster had been successfully deleted
Hive command is fixed
Fixed error reporting when updating the number of workers (standard and preemptible) in a cluster
Fixed some cases where “Rate Limit Exceeded” errors would occur when creating large clusters
The maximum cluster name length is now correctly 55 rather than 56
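If you generate cluster names in scripts, a client-side length check along these lines may help you avoid a rejected request. This is a minimal sketch: the 55-character limit is the one stated above, while the function name is hypothetical and the check covers length only, not any other naming rules the service may enforce.

```python
MAX_CLUSTER_NAME_LENGTH = 55  # limit stated in this release's bug fixes


def is_valid_name_length(name):
    """Return True if name is non-empty and within the 55-character limit."""
    return 0 < len(name) <= MAX_CLUSTER_NAME_LENGTH
```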
Developers Console fixes
The cluster list now includes a “Created” column, and the cluster configuration tab now includes a “Created” field showing when the cluster was created
In the cluster-create screen cluster memory sizes greater than 999 GB are now displayed in TB
A few fields that were missing from the PySpark and Hive job configuration tabs (“Additional Python Files” and “Jar Files”) have been added
The option to add preemptible nodes when creating a cluster is now in the “expander” at the bottom of the form
Machine types with too little memory (less than 3.5 GB) are no longer displayed in the list of machine types (previously, selecting one of these small machine types would lead to an error from the backend)
The placeholder text in the Arguments field of the submit-job form has been corrected
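The memory display rule in the cluster-create screen can be pictured with a small sketch like the one below. The 999 GB threshold is from the note above; the function name, the 1 TB = 1024 GB conversion, and the two-decimal rounding are assumptions, since the console's exact formatting is not described here.

```python
def format_memory(size_gb):
    """Format a memory size given in GB, switching to TB above 999 GB."""
    if size_gb > 999:
        # Assumption: 1 TB = 1024 GB and two decimal places; the console's
        # actual conversion factor and rounding are not stated.
        return f"{size_gb / 1024:.2f} TB"
    return f"{size_gb:g} GB"
```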
Core service improvements
A project's default zone setting, if set, is now used as the default value for the zone in the create-cluster form in the Developers Console
Optimizations
Hive performance has been greatly increased, especially for partitioned tables that have a large number of partitions
Multithreaded listStatus has now been enabled, which speeds up job startup time for FileInputFormats reading large numbers of files and directories in GCS
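To give a feel for why multithreaded listing helps, here is a small local-filesystem analogy. It is not the Hadoop or connector implementation; the function name and thread count are made up for the example. The idea is that each listing call mostly waits on I/O, so issuing several at once shortens the total wait when there are many directories.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_directories(paths, num_threads=8):
    """List several directories concurrently and return {path: sorted entries}."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order, so results line up with paths.
        listings = pool.map(os.listdir, paths)
        return {path: sorted(entries) for path, entries in zip(paths, listings)}
```

With remote storage such as GCS, each listing is a network round trip, so the benefit of overlapping requests grows with the number of directories a job has to scan.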
The Cloud Dataproc release notes will serve as a consolidated list of all release notes for the Cloud Dataproc service from our beta launch forward. The Cloud Dataproc version list will serve as the reference for listing which software components are available in each Cloud Dataproc version. You can learn more about Cloud Dataproc on the Google Cloud Platform site.
Best,