Google Cloud Dataproc - December 16th Release


James Malone

Dec 17, 2015, 4:43:14 PM
to gcp-hadoo...@googlegroups.com

Hello everyone,


This week we’re releasing a set of new features and fixes for Google Cloud Dataproc. This release started on Tuesday and will be complete by the end of today.


New features

  • Dataproc clusters now have vim, git, and bash-completion installed by default

  • The Cloud Dataproc API now has an official Maven artifact, Javadocs, and a downloadable .zip

  • Google Cloud Platform Console

    • Properties can now be specified when submitting a job, and can be seen in the Configuration tab of a job

    • A “Clone” button has been added that allows you to easily copy all information about a job to a new job submission form

    • The left-side icons for Clusters and Jobs are now custom icons rather than generic ones

    • An “Image version” field has been added to the bottom of the create cluster form that allows you to select a specific Cloud Dataproc image version when creating a cluster

    • A “VM Instances” tab has been added on the cluster detail page where you can see all VMs in a cluster and easily SSH into the master node

    • An “Initialization Actions” field has been added to the bottom of the create cluster form that allows you to specify initialization actions when creating a cluster

    • Paths to Google Cloud Storage buckets that are displayed in error messages are now clickable links


Bugfixes

  • Forced distcp settings to match mapred-site.xml settings to provide additional fixes for the distcp command (see this related JIRA)

  • Ensured that workers created during an update do not join the cluster until after custom initialization actions are complete

  • Ensured that workers always disconnect from a cluster when the Cloud Dataproc agent is shut down

  • Fixed a race condition in the API front end that occurred when validating a request and marking a cluster as updating

  • Enhanced validation checks for quota, Cloud Dataproc image, and initialization actions when updating clusters

  • Improved handling of jobs when the Cloud Dataproc agent is restarted

  • Google Cloud Platform Console

    • Allowed duplicate arguments when submitting a job

    • Replaced the generic “Failed to load” message with details about the cause when an error occurs that is not related to Cloud Dataproc

    • When a single jar file is submitted for a job, it can now be listed only in the Main class or jar field on the Submit a Job form and is no longer required to also be listed in the Jar files field


Connectors and documentation

If you use or plan to use the BigQuery connector with Spark, we recommend reviewing the latest updates to the BigQuery connector and Spark example. In addition to making the example easier to run without editing the sample code, it now includes additional code to help clean up temporary files created by the BigQueryInputFormat, along with details explaining the need to manually handle temporary Google Cloud Storage files and BigQuery datasets in the event of a job failure.
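
For reference, here is a minimal sketch of the read-and-cleanup pattern the example describes, written for spark-shell in Scala; the project ID, bucket name, and table are placeholders, and the cleanup path assumes the connector's default gs://<bucket>/hadoop/tmp/bigquery/ layout:

  import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
  import com.google.gson.JsonObject
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.LongWritable

  val conf = sc.hadoopConfiguration

  // Placeholder project and bucket; the bucket receives the connector's temporary exports.
  conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "your-project-id")
  conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "your-temp-bucket")

  // Table to read; a public sample table is used here.
  BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

  // Each record is a (row number, JSON object) pair exported through Cloud Storage.
  val tableData = sc.newAPIHadoopRDD(
    conf,
    classOf[GsonBigQueryInputFormat],
    classOf[LongWritable],
    classOf[JsonObject])
  println(s"Rows read: ${tableData.count()}")

  // The export leaves temporary files behind; delete them once the data has been consumed
  // (this path assumes the default gs://<bucket>/hadoop/tmp/bigquery/ layout).
  val tmpPath = new Path("gs://your-temp-bucket/hadoop/tmp/bigquery/")
  tmpPath.getFileSystem(conf).delete(tmpPath, true)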


If you've already been using the BigQuery connector in Spark without cleaning up export files, you may want to review your gs://bucket/hadoop/tmp/bigquery/ directory for forgotten temporary files that may be incurring monthly storage charges. As an alternative to periodically cleaning such directories, you can instead set BigQueryConfiguration.GCS_BUCKET_KEY to use a separate bucket configured with Object Lifecycle Management. This would, for example, automatically delete temporary export files after 1 day, as long as your Spark jobs do not need to run longer than the lifecycle period.
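
As a sketch of that alternative, assuming a placeholder bucket name and a 1-day delete rule already configured on that bucket through Object Lifecycle Management, the connector can simply be pointed at the lifecycle-managed bucket:

  import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration

  // Direct the connector's temporary exports to a dedicated, lifecycle-managed bucket
  // so stray export files are deleted automatically after the configured period.
  sc.hadoopConfiguration.set(BigQueryConfiguration.GCS_BUCKET_KEY, "your-lifecycle-managed-bucket")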


The Cloud Dataproc release notes also contain these notes and all past release notes.


Best,


Google Cloud Dataproc / Google Cloud Spark & Hadoop Team

