Hello everyone,
This week we’re releasing a set of new features and fixes for Google Cloud Dataproc. The rollout started on Tuesday and will be complete by the end of today.
New features
Dataproc clusters now have vim, git, and bash-completion installed by default
The Cloud Dataproc API now has an official Maven artifact, Javadocs, and a downloadable .zip
Google Cloud Platform Console
Properties can now be specified when submitting a job, and can be seen in the Configuration tab of a job
A “Clone” button has been added that allows you to easily copy all information about a job to a new job submission form
The left-side icons for Clusters and Jobs are now custom icons rather than generic ones
An “Image version” field has been added to the bottom of the create cluster form that allows you to select a specific Cloud Dataproc image version when creating a cluster
A “VM Instances” tab has been added on the cluster detail page where you can see all VMs in a cluster and easily SSH into the master node
An “Initialization Actions” field has been added to the bottom of the create cluster form that allows you to specify initialization actions when creating a cluster
Paths to Google Cloud Storage buckets that are displayed in error messages are now clickable links
Bugfixes
Forced distcp settings to match those in mapred-site.xml, providing additional fixes for the distcp command (see this related JIRA)
Ensured that workers created during an update do not join the cluster until after custom initialization actions are complete
Ensured that workers always disconnect from a cluster when the Cloud Dataproc agent is shut down
Fixed a race condition in the API front-end that occurred when validating a request and marking a cluster as updating
Enhanced validation checks for quota, Cloud Dataproc image, and initialization actions when updating clusters
Improved handling of jobs when the Cloud Dataproc agent is restarted
Google Cloud Platform Console
Allowed duplicate arguments when submitting a job
Replaced the generic “Failed to load” message with details about the cause when an error occurs that is not related to Cloud Dataproc
Allowed a single jar file for a job to be listed only in the Main class or jar field on the Submit a Job form, no longer requiring it to also appear in the Jar files field
Connectors and documentation
If you use or plan to use the BigQuery connector with Spark, we recommend reviewing the latest updates to the BigQuery connector and Spark example. In addition to making the examples easier to run without editing the sample code, there is now additional code to help clean up temporary files created by the BigQueryInputFormat, along with details explaining the need to manually handle temporary Google Cloud Storage files and BigQuery datasets if a job fails.
If you've already been using the BigQuery connector in Spark without cleaning up export files, you may want to review your gs://bucket/hadoop/tmp/bigquery/ directory for any forgotten temporary files that may be incurring monthly storage charges. As an alternative to periodically cleaning such directories, you can also consider setting BigQueryConfiguration.GCS_BUCKET_KEY to use a separate bucket configured with Object Lifecycle Management. This would, for example, auto-delete temporary export files after 1 day, as long as your Spark jobs do not need to run longer than the lifecycle period.
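To make that concrete, here is a minimal Scala sketch of the approach, not the official sample: it assumes a placeholder scratch bucket named my-spark-scratch-bucket and the default gs://bucket/hadoop/tmp/bigquery/ export layout described above, routing temporary exports through BigQueryConfiguration.GCS_BUCKET_KEY and deleting them when the job finishes.

    import org.apache.hadoop.fs.Path
    import org.apache.spark.{SparkConf, SparkContext}
    import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration

    object BigQueryTempCleanupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("bq-temp-cleanup-sketch"))
        val hadoopConf = sc.hadoopConfiguration

        // Route the connector's temporary exports to a bucket that has an Object
        // Lifecycle Management rule (for example, delete objects older than 1 day).
        // "my-spark-scratch-bucket" is a placeholder name.
        hadoopConf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-spark-scratch-bucket")

        // Default export location described above; adjust if you override it.
        val tempExportDir = new Path("gs://my-spark-scratch-bucket/hadoop/tmp/bigquery/")

        try {
          // ... read from BigQuery via BigQueryInputFormat and run the Spark job ...
        } finally {
          // Best-effort cleanup so a failed job does not leave exports behind
          // and quietly accrue monthly storage charges.
          val fs = tempExportDir.getFileSystem(hadoopConf)
          if (fs.exists(tempExportDir)) {
            fs.delete(tempExportDir, true) // recursive delete
          }
        }
        sc.stop()
      }
    }

If the scratch bucket also carries an Object Lifecycle Management rule as described above, the explicit delete becomes a safety net for failed jobs rather than the only cleanup mechanism.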
The Cloud Dataproc release notes page also contains these notes, along with all past release notes.
Best,
Google Cloud Dataproc / Google Cloud Spark & Hadoop Team