Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to unveil our latest features and improvements to bdutil and the GCS connector for Hadoop, now with support for YARN/Hadoop 2. Additionally, bdutil now supports deploying onto CentOS images. Download bdutil-0.34.0.tar.gz or bdutil-0.34.0.zip to try it out, or visit the developer documentation, where the download links now point to the latest version.
The new GCS connector contains significant performance improvements to “list” and “glob” operations, and fixes the usage of ‘?’ in glob expressions. Detailed change notes are listed below.
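For example, a glob containing the single-character “?” wildcard now matches as expected; as a quick illustration (the bucket and paths here are just placeholders):
hadoop fs -ls 'gs://<bucket>/logs/2014-05-0?/part-*'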
bdutil now handles command-line flags and exposes several new commands that help you manage and interact with your Hadoop clusters. We’ve also implemented several features to improve bdutil’s debuggability and stability when deploying larger clusters. Basic usage remains mostly backwards compatible, except that environment-variable overrides are now passed through a comma-separated “-e” flag instead of being listed directly after the “deploy” command:
Old command                     | New command
./bdutil deploy                 | <unchanged>
./bdutil delete                 | <unchanged>
./bdutil deploy env1.sh         | ./bdutil deploy -e env1.sh
./bdutil deploy env1.sh env2.sh | ./bdutil deploy -e env1.sh,env2.sh
You also no longer have to specify the project for your bdutil deployment if you’ve already configured a default project for gcloud. With command-line flags, you can now deploy out of the box without ever editing a config file:
./bdutil -b <configbucket> deploy
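If you haven’t set a default project yet, you can do so once with gcloud, and every subsequent bdutil invocation will pick it up:
gcloud config set project <your-project-id>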
If you want a different cluster name or size while still using command-line flags, e.g. the prefix “foo-cluster” and 10 nodes, simply type:
./bdutil -b <configbucket> -n 10 -P foo-cluster deploy
Though you can deploy directly with your command-line flag settings, note that you’ll then need to provide the same flag values to delete the same cluster correctly:
./bdutil -b <configbucket> -n 10 -P foo-cluster deploy
./bdutil -b <configbucket> -n 10 -P foo-cluster delete
For your production clusters, you’ll likely want a reproducible deploy/delete with a single config file; to create one, you can simply use the generate_config command, which will create a local env file instead of deploying:
./bdutil -b <configbucket> -n 10 -P foo-cluster generate_config prod1_env.sh
./bdutil -e prod1_env.sh deploy
./bdutil -e prod1_env.sh -u hadoop-validate-setup.sh -v run_command -- sudo -u hadoop ./hadoop-validate-setup.sh
./bdutil -e prod1_env.sh delete
If you have a preferred set of Hadoop installation software or scripts, you can now use them in conjunction with bdutil and still get your VMs and the connectors. For example, with Apache Ambari you could run:
./bdutil -e my_custom_ambari_env.sh create
./bdutil -e my_custom_ambari_env.sh run_command_group install_ambari
<Install Hadoop using Ambari, however you like>
./bdutil -e my_custom_ambari_env.sh run_command_group install_connectors
<Restart Hadoop using Ambari>
To get a more complete summary of the new capabilities of bdutil, type:
./bdutil --help
We now distribute two different jarfiles of the GCS connector, gcs-connector-1.2.5-hadoop1.jar for use with Hadoop 1 (and other versions of the same series, like 0.20.205.0), and gcs-connector-1.2.5-hadoop2.jar for use with Hadoop 2. Feel free to download the connector jarfiles directly for advanced use cases and custom configuration, or allow bdutil to perform the installation and configuration for you without having to deal with connector jarfiles at all.
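If you do take the manual route, the rough shape is to put the jar on Hadoop’s classpath and point Hadoop at the GCS filesystem implementation. A minimal sketch, assuming a typical Hadoop 2 directory layout (adapt the lib path to your installation):
cp gcs-connector-1.2.5-hadoop2.jar $HADOOP_HOME/share/hadoop/common/lib/
# Then, in core-site.xml, set at minimum:
#   fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
#   fs.gs.project.id = <your-project-id>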
To deploy a standard Hadoop 2.2.0 installation, you can use bdutil with the included hadoop2_env.sh file:
./bdutil deploy -e hadoop2_env.sh -b <configbucket>
Once the cluster is deployed, you can visit http://<master-node external IP address>:8088 to see the YARN resource manager’s Web UI; there will no longer be a JobTracker UI at http://<master-node external IP address>:50030 like there is with Hadoop 1.
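A quick way to verify the connector end-to-end on a freshly deployed cluster is to SSH into the master node (named <prefix>-m by default) and list your bucket:
gcutil ssh <prefix>-m
hadoop fs -ls gs://<configbucket>/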
As always, please send any questions or comments to gcp-hadoo...@google.com
All the best,
Your Google Team
bdutil-0.34.0: CHANGES.txt
0.34.0 - 2014-05-08
1. Changed sample applications and tools to use GenericOptionsParser instead
of creating a new Configuration object directly.
2. Added printout of bdutil version number alongside "usage" message.
3. Added sleeps between async invocations of GCE API calls during deployment,
configurable with GCUTIL_SLEEP_TIME_BETWEEN_ASYNC_CALLS_SECONDS (see the
tuning example after this list).
4. Added tee'ing of client-side console output into debuginfo.txt with better
delineation of where the error is likely to have occurred.
5. Just for extensions/querytools/querytools_env.sh, added an explicit
mapred.working.dir to fix a bug where PigInputFormat crashes whenever the
default FileSystem is different from the input FileSystem. This fix allows
using GCS input paths in Pig with DEFAULT_FS='hdfs'.
6. Added a retry-loop around "apt-get -y -qq update" since it may flake under
high load.
7. Significantly refactored bdutil into better-isolated helper functions, and
added basic support for command-line flags and several new commands. The old
command "./bdutil env1.sh env2.sh" is now "./bdutil -e env1.sh,env2.sh".
Type ./bdutil --help for an overview of all the new functionality.
8. Added better checking of env and upload files before starting deployment.
9. Reorganized bdutil_env.sh into logical sections with better descriptions.
10. Significantly reduced amount of console output; printed dots indicate
progress of async subprocesses. Controllable with VERBOSE_MODE or '-v'.
11. Script and file dependencies are now staged through GCS rather than using
gcutil push; this drastically decreases bandwidth usage and improves scalability.
12. Added MAX_CONCURRENT_ASYNC_PROCESSES to split the async loops into
multiple smaller batches, to avoid OOMing (see the tuning example after
this list).
13. Made delete_cluster continue on error, still reporting a warning at the
end if errors were encountered. This way, previously-failed cluster
creations or deletions with partial resources still present can be
cleaned up by retrying the "delete" command.
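As a quick illustration of items 3 and 12 above, both knobs can be overridden from a custom env file; the file name and values below are purely illustrative:
# my_tuning_env.sh, used via: ./bdutil -e my_tuning_env.sh deploy
GCUTIL_SLEEP_TIME_BETWEEN_ASYNC_CALLS_SECONDS=2
MAX_CONCURRENT_ASYNC_PROCESSES=50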
gcs-connector-1.2.5: CHANGES.txt
1.2.5 - 2014-05-08
1. Fixed a bug where fs.gs.auth.client.file was unconditionally being
overwritten by a default value.
2. Enabled direct upload for directory creation to save one round-trip call.
3. Added wiring for GoogleHadoopFileSystem.close() to call through to close()
its underlying helper classes as well.
4. Added a new batch mode for creating directories in parallel which requires
manually parallelizing in the client. Speeds up nested directory creation
and repairing large numbers of implicit directories in listStatus.
5. Eliminated redundant API calls in listStatus, roughly halving its running time.
6. Fixed a bug where globStatus didn't correctly handle globs containing '?'.
7. Implemented a new version of globStatus which first performs a flat
listing and then applies the recursive glob logic in-memory, dramatically
speeding up globs over many directories; the new behavior is the
default, but can be disabled by setting fs.gs.glob.flatlist.enable = false.
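As an example of item 7, the flat-listing behavior can also be toggled for a single invocation through Hadoop’s generic options (the path is illustrative):
hadoop fs -D fs.gs.glob.flatlist.enable=false -ls 'gs://<bucket>/data/part-?????'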
bigquery-connector-0.4.1: CHANGES.txt
0.4.1 - 2014-05-08
1. Removed mapred.bq.output.num.records.batch in favor of
mapred.bq.output.buffer.size (see the example after this list).
2. Misc updates in library dependencies.
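If your jobs set the removed property, switch them to the replacement; for example, via generic options (the jar name, class name, and 64 MB value below are illustrative, and this assumes your main class runs through ToolRunner):
hadoop jar your_job.jar YourJobClass -D mapred.bq.output.buffer.size=67108864 <job args>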
datastore-connector-0.14.4: CHANGES.txt
0.14.4 - 2014-05-08
1. Misc updates in library dependencies.