Announcing the new bdutil-0.34.0 with Hadoop 2 support and gcs-connector-1.2.5


Hadoop on Google Cloud Platform Team

May 9, 2014, 5:14:51 PM
to gcp-had...@google.com, gcp-hadoo...@googlegroups.com

Greetings, users of Hadoop on Google Cloud Platform!


We’re excited to unveil our latest features and improvements to bdutil and the GCS connector for Hadoop, now with support for YARN/Hadoop 2. Additionally, bdutil now supports deploying onto CentOS images. Download bdutil-0.34.0.tar.gz or bdutil-0.34.0.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.


The new GCS connector contains significant performance improvements to “list” and “glob” operations, and fixes the handling of ‘?’ in glob expressions. Detailed change notes are listed below.
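
For example, a glob using ‘?’ (which matches exactly one character) is now expanded correctly; the bucket and paths here are purely illustrative:

hadoop fs -ls 'gs://mybucket/logs/2014-05-0?/*.log'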


New bdutil flags and commands


bdutil now handles command-line flags and exposes several new commands to help you manage and interact with your Hadoop clusters. Additionally, several features have been implemented to improve the debuggability and stability of bdutil when deploying larger clusters. Basic usage is still mostly backwards compatible, except that environment-variable overrides are now passed through a comma-separated “-e” flag instead of being listed directly after the “deploy” command:


Old command                          New command
./bdutil deploy                      <unchanged>
./bdutil delete                      <unchanged>
./bdutil deploy env1.sh              ./bdutil deploy -e env1.sh
./bdutil deploy env1.sh env2.sh      ./bdutil deploy -e env1.sh,env2.sh


Now you no longer have to specify the project for your bdutil deployment if you’ve already configured a default project for gcloud. With command-line flags, you can deploy out-of-the-box without ever editing a config file:


./bdutil -b <configbucket> deploy


If you want a different cluster name or size while still using only command-line flags, e.g. with the prefix “foo-cluster” and 10 nodes, simply type:


./bdutil -b <configbucket> -n 10 -P foo-cluster deploy


Though you can deploy directly with your command-line flag settings, note that you’ll then need to provide the same flag values to delete the same cluster correctly:

./bdutil -b <configbucket> -n 10 -P foo-cluster deploy

./bdutil -b <configbucket> -n 10 -P foo-cluster delete


For your production clusters, you’ll likely want a reproducible deploy/delete with a single config file; to create one, you can simply use the generate_config command, which will create a local env file instead of deploying:


./bdutil -b <configbucket> -n 10 -P foo-cluster generate_config prod1_env.sh

./bdutil -e prod1_env.sh deploy

./bdutil -e prod1_env.sh -u hadoop-validate-setup.sh -v run_command -- sudo -u hadoop ./hadoop-validate-setup.sh

./bdutil -e prod1_env.sh delete


If you have a preferred set of Hadoop installation software or scripts, you can now use it in conjunction with bdutil to still get your VMs and the connectors. For example, with Apache Ambari you could run:


./bdutil -e my_custom_ambari_env.sh create

./bdutil -e my_custom_ambari_env.sh run_command_group install_ambari

<Install Hadoop using Ambari, however you like>

./bdutil -e my_custom_ambari_env.sh run_command_group install_connectors

<Restart Hadoop using Ambari>


To get a more complete summary of the new capabilities of bdutil, type:


./bdutil --help


Hadoop 2 and YARN support


We now distribute two different jarfiles of the GCS connector, gcs-connector-1.2.5-hadoop1.jar for use with Hadoop 1 (and other versions of the same series, like 0.20.205.0), and gcs-connector-1.2.5-hadoop2.jar for use with Hadoop 2. Feel free to download the connector jarfiles directly for advanced use cases and custom configuration, or allow bdutil to perform the installation and configuration for you without having to deal with connector jarfiles at all.
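
If you do install a connector jarfile by hand, the wiring looks roughly like the following sketch; the copy destination is illustrative (it just needs to be on Hadoop’s classpath), and the property values are placeholders for your own project and bucket:

cp gcs-connector-1.2.5-hadoop2.jar <hadoop install dir>/share/hadoop/common/lib/

Then, in core-site.xml, set:

fs.gs.impl = com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

fs.gs.project.id = <your-project-id>

fs.gs.system.bucket = <configbucket>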


To deploy a standard Hadoop 2.2.0 installation, you can use bdutil with the included hadoop2_env.sh file:


./bdutil deploy -e hadoop2_env.sh -b <configbucket>


Once the cluster is deployed, you can visit http://<master-node external IP address>:8088 to see the YARN resource manager’s Web UI; there will no longer be a JobTracker UI at http://<master-node external IP address>:50030 like there is with Hadoop 1.
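
For a quick smoke test of the new YARN setup, you can SSH to the master and submit one of the example jobs bundled with Hadoop 2.2.0; the instance name and jar path below are illustrative and will depend on your chosen prefix and installation directory:

gcutil ssh <master instance name>

sudo -u hadoop hadoop jar <hadoop install dir>/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 100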


As always, please send any questions or comments to gcp-hadoo...@google.com.


All the best,

Your Google Team


bdutil-0.34.0: CHANGES.txt

0.34.0 - 2014-05-08


 1. Changed sample applications and tools to use GenericOptionsParser instead

    of creating a new Configuration object directly.

 2. Added printout of bdutil version number alongside "usage" message.

 3. Added sleeps between async invocations of GCE API calls during deployment,

    configurable with GCUTIL_SLEEP_TIME_BETWEEN_ASYNC_CALLS_SECONDS (see the

    example after this list).

 4. Added tee'ing of client-side console output into debuginfo.txt with better

    delineation of where the error is likely to have occurred.

 5. Just for extensions/querytools/querytools_env.sh, added an explicit

    mapred.working.dir to fix a bug where PigInputFormat crashes whenever the

    default FileSystem is different from the input FileSystem. This fix allows

    using GCS input paths in Pig with DEFAULT_FS='hdfs'.

 6. Added a retry-loop around "apt-get -y -qq update" since it may flake under

    high load.

 7. Significantly refactored bdutil into better-isolated helper functions, and

    added basic support for command-line flags and several new commands. The old

    command "./bdutil env1.sh env2.sh" is now "./bdutil -e env1.sh,env2.sh".

    Type ./bdutil --help for an overview of all the new functionality.

 8. Added better checking of env and upload files before starting deployment.

 9. Reorganized bdutil_env.sh into logical sections with better descriptions.

 10. Significantly reduced amount of console output; printed dots indicate

    progress of async subprocesses. Controllable with VERBOSE_MODE or '-v'.

 11. Script and file dependencies are now staged through GCS rather than using

    gcutil push; drastically decreases bandwidth and improves scalability.

 12. Added MAX_CONCURRENT_ASYNC_PROCESSES to split the async loops into

    multiple smaller batches, to avoid OOMing (see the example after this

    list).

 13. Made delete_cluster continue on error, still reporting a warning at the

    end if errors were encountered. This way, previously-failed cluster

    creations or deletions with partial resources still present can be

    cleaned up by retrying the "delete" command.
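
As a sketch of how the tuning knobs from items 3 and 12 above fit together, you could put overrides into a small env file and pass it with -e; the values here are arbitrary examples, not recommendations:

# my_tuning_env.sh

GCUTIL_SLEEP_TIME_BETWEEN_ASYNC_CALLS_SECONDS=1

MAX_CONCURRENT_ASYNC_PROCESSES=50

./bdutil -e my_tuning_env.sh -b <configbucket> deploy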



gcs-connector-1.2.5: CHANGES.txt

1.2.5 - 2014-05-08


 1. Fixed a bug where fs.gs.auth.client.file was unconditionally being

    overwritten by a default value.

 2. Enabled direct upload for directory creation to save one round-trip call.

 3. Added wiring for GoogleHadoopFileSystem.close() to call through to close()

    its underlying helper classes as well.

 4. Added a new batch mode for creating directories in parallel which requires

    manually parallelizing in the client. Speeds up nested directory creation

    and repairing large numbers of implicit directories in listStatus.

 5. Eliminated redundant API calls in listStatus, roughly halving its running time.

 6. Fixed a bug where globStatus didn't correctly handle globs containing '?'.

 7. Implemented a new version of globStatus which initially performs a flat

    listing before performing the recursive glob logic in-memory to

    dramatically speed up globs with lots of directories; the new behavior is

    the default, but can be disabled by setting fs.gs.glob.flatlist.enable =

    false (see the example below).
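
For instance, to compare the old and new globbing behavior on a single invocation, the setting can be passed as a generic option to the Hadoop shell (the path is illustrative):

hadoop fs -Dfs.gs.glob.flatlist.enable=false -ls 'gs://mybucket/*/part-*'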



bigquery-connector-0.4.1: CHANGES.txt

0.4.1 - 2014-05-08


 1. Removed mapred.bq.output.num.records.batch in favor of

    mapred.bq.output.buffer.size.

 2. Misc updates in library dependencies.



datastore-connector-0.14.4: CHANGES.txt

0.14.4 - 2014-05-08


 1. Misc updates in library dependencies.

