Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest version of bdutil, which fixes several bugs, adds convenient new features, and provides automatic recovery after unexpected reboots.
Additionally, the GCS connector for Hadoop now implements cluster-wide immediate “list after create” consistency, making it safe to run multistage pipelines directly on GCS in which later stages depend on listing the files produced by earlier stages. The feature is implemented by storing supplemental metadata in NFS; bdutil will automatically set up the necessary dependencies and configure the GCS connector to use this feature out of the box.
Download bdutil-0.36.4.tar.gz or bdutil-0.36.4.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.
Abridged highlights for bdutil updates:
Patched several minor bugs for YARN/Hadoop2 such as logging/local directories and running the Job History Server, significantly improving the hadoop2_env.sh experience.
Several new startup-recovery settings now allow nodes to automatically rejoin the cluster after a reboot; occasional unplanned reboots can occur when the underlying machine goes down.
Support for passing relative paths to env_var_files for bdutil when not running directly inside the bdutil directory; new shorthand notation and adjustable BDUTIL_EXTENSIONS_PATH.
Abridged highlights for gcs-connector updates:
Cluster-wide immediate “list after create” consistency is now enforced by the GCS connector in the default bdutil/gcs-connector installation.
Directory modification timestamps, introduced in gcs-connector-1.2.9, have been fine-tuned and optimized, and can significantly speed up use cases involving a large number of files within a single directory.
Performance and reliability improvements, such as low-level retries on credential refresh and optimized “seeks”; use cases like LZO indexing see a significant speedup.
Please see the detailed release notes below for more information about the new bdutil and GCS connector features.
Manual installation of the GCS connector without using bdutil will still function as it always has, but requires additional setup if you wish to adopt the new “immediate list consistency” functionality. You may download gcs-connector-1.3.0-hadoop1.jar or gcs-connector-1.3.0-hadoop2.jar directly, and view bdutil-0.36.4/libexec/setup_master_nfs.sh, bdutil-0.36.4/libexec/setup_client_nfs.sh, and bdutil-0.36.4/libexec/install_and_configure_gcs_connector.sh as examples of how to set up and configure the list-consistency feature.
As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com with tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
bdutil-0.36.4: CHANGES.txt
0.36.4 - 2014-10-17
1. Added bdutil flags --worker_attached_pds_size_gb and
--master_attached_pd_size_gb corresponding to the bdutil_env variables of
the same names.
2. Added bdutil_env.sh variables and corresponding flags:
--worker_attached_pds_type and --master_attached_pd_type to specify
   the type of PD to create, 'pd-standard' or 'pd-ssd'. Default: pd-standard.
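A minimal sketch of combining the new flags in a deploy command; the flag names come from the notes above, while the size and type values are purely illustrative:

```shell
# Hypothetical example: deploy with an SSD-backed master PD and
# standard worker PDs (sizes in GB are illustrative values).
./bdutil deploy \
    --master_attached_pd_size_gb 500 \
    --master_attached_pd_type pd-ssd \
    --worker_attached_pds_size_gb 1500 \
    --worker_attached_pds_type pd-standard
```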
3. Fixed a bug where we forgot to actually add
extensions/querytools/setup_profiles.sh to the COMMAND_GROUPS under
extensions/querytools/querytools_env.sh; now it's actually possible to
run 'pig' or 'hive' directly with querytools_env.sh installed.
4. Fixed a bug affecting Hadoop 1.2.1 HDFS persistence across deployments
where dfs.data.dir directories inadvertently had their permissions
modified to 775 from the correct 755, and thus caused datanodes to fail to
recover the data. Only applies in the use case of setting:
CREATE_ATTACHED_PDS_ON_DEPLOY=false
DELETE_ATTACHED_PDS_ON_DELETE=false
after an initial deployment to persist HDFS across a delete/deploy command.
The explicit directory configuration is now set in bdutil_env.sh with
the variable HDFS_DATA_DIRS_PERM, which is in turn wired into
hdfs-site.xml.
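The persistence scenario above can be sketched as the following bdutil_env.sh overrides; the variable names are taken from the notes, and the comment about the corresponding hdfs-site.xml property is an assumption based on standard Hadoop configuration:

```shell
# Sketch: persist HDFS data across delete/deploy cycles after an
# initial deployment (set in bdutil_env.sh or a custom env file).
CREATE_ATTACHED_PDS_ON_DEPLOY=false
DELETE_ATTACHED_PDS_ON_DELETE=false
# Explicit permissions for dfs.data.dir directories; wired into
# hdfs-site.xml so datanodes can recover existing data.
HDFS_DATA_DIRS_PERM=755
```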
5. Added mounted disks to /etc/fstab to re-mount them on boot.
6. bdutil now uses a search path mechanism to look for env files to reduce
the amount of typing necessary to specify env files. For each argument
to the -e (or --env_var_files) command line option, if the argument
specifies just a base filename without a directory, bdutil will use
the first file of that name that it finds in the following directories:
1. The current working directory (.).
2. Directories specified as a colon-separated list of directories in
the environment variable BDUTIL_EXTENSIONS_PATH.
3. The bdutil directory (where the bdutil script is located).
4. Each of the extensions directories within the bdutil directory.
If the base filename is not found, it will try appending "_env.sh" to
the filename and look again in the same set of directories.
This change allows the following:
1. You can specify standard extensions succinctly, such as
"-e spark" for the spark extension, or "-e hadoop2" to use Hadoop 2.
2. You can put the bdutil directory in your PATH and run bdutil
from anywhere, and it will still find all its own files.
3. You can run bdutil from a directory containing your custom env
files and use filename completion to add them to a bdutil command.
4. You can collect your custom env files into one directory, set
BDUTIL_EXTENSIONS_PATH to point to that directory, run bdutil
from anywhere, and specify your custom env files by name only.
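The four usage patterns above can be sketched as follows; the custom env file names and directory are hypothetical, and the exact argument ordering is illustrative:

```shell
# 1. Shorthand for standard extensions ("_env.sh" is appended automatically):
./bdutil -e spark deploy
./bdutil -e hadoop2 deploy

# 2. With the bdutil directory on your PATH, run it from anywhere:
export PATH="$PATH:$HOME/bdutil-0.36.4"

# 3. Run from a directory containing a custom env file (hypothetical name):
./bdutil -e my_cluster_env.sh deploy

# 4. Collect custom env files in one place and reference them by name only:
export BDUTIL_EXTENSIONS_PATH="$HOME/my_bdutil_envs"
./bdutil -e my_cluster deploy   # finds $HOME/my_bdutil_envs/my_cluster_env.sh
```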
7. Added new boolean setting to bdutil_env.sh, ENABLE_NFS_GCS_FILE_CACHE,
which defaults to 'true'. When true, the GCS connector will be configured
to use its new "FILESYSTEM_BACKED" DirectoryListCache for immediate
cluster-wide list consistency, allowing multi-stage pipelines in e.g. Pig
and Hive to safely operate with DEFAULT_FS=gs. With this setting, bdutil
will install and configure an NFS export point on the master node, to
be mounted as the shared metadata cache directory for all cluster nodes.
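Since ENABLE_NFS_GCS_FILE_CACHE defaults to 'true', opting out can be sketched as a one-line override in a custom env file; the file name here is hypothetical:

```shell
# Sketch: disable the NFS-backed metadata cache via a custom env file
# (my_env.sh is a hypothetical name), then deploy with it.
echo 'ENABLE_NFS_GCS_FILE_CACHE=false' > my_env.sh
./bdutil -e hadoop2 -e my_env.sh deploy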
8. Fixed a bug where the datastore-to-bigquery sample neglected to set a
'filter' in its query based on its ancestor entities.
9. YARN local directories are now set to spread IO across all directories
under /mnt.
10. YARN container logs will be written to /hadoop/logs/.
11. The Hadoop 2 MR Job History Server will now be started on the master node.
12. Added /etc/init.d entries for Hadoop daemons to restart them after
VM restarts.
13. Moved "hadoop fs -test" of gcs-connector to end of Hadoop setup, after
starting Hadoop daemons.
14. The spark_env.sh extension will now install numpy.
gcs-connector-1.3.0: CHANGES.txt
1.3.0 - 2014-10-17
1. Directory timestamp updating can now be controlled via user-settable
properties "fs.gs.parent.timestamp.update.enable",
   "fs.gs.parent.timestamp.update.substrings.excludes", and
"fs.gs.parent.timestamp.update.substrings.includes" in core-site.xml. By
default, timestamp updating is enabled for the YARN done and intermediate
done directories and excluded for everything else. Strings listed in
includes take precedence over excludes.
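An illustrative core-site.xml fragment using these properties; the property names come from the notes above, while the substring values are example placeholders rather than the connector's actual defaults:

```xml
<!-- Sketch: enable directory timestamp updating only for paths
     containing the listed substrings (values are examples). -->
<property>
  <name>fs.gs.parent.timestamp.update.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.parent.timestamp.update.substrings.includes</name>
  <value>/done/,/done_intermediate/</value>
</property>
<property>
  <name>fs.gs.parent.timestamp.update.substrings.excludes</name>
  <value>/</value>
</property>
```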
2. Directory timestamp updating will now occur on a background thread inside
GoogleCloudStorageFileSystem.
3. Attempting to acquire an OAuth access token will now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
4. Added FileSystemBackedDirectoryListCache, extracting a common interface
for it to share with the (InMemory)DirectoryListCache; instead of using
an in-memory HashMap to enforce only same-process list consistency, the
FileSystemBacked version mirrors GCS objects as empty files on a local
FileSystem, which may itself be an NFS mount for cluster-wide or even
potentially cross-cluster consistency groups. This allows a cluster to
be configured with a "consistent view", making it safe to use GCS as the
DEFAULT_FS for arbitrary multi-stage or even multi-platform workloads.
This is now enabled by default for machine-wide consistency, but it is
strongly recommended to configure clusters with an NFS directory for
cluster-wide strong consistency. Relevant configuration settings:
fs.gs.metadata.cache.enable [default: true]
fs.gs.metadata.cache.type [IN_MEMORY (default) | FILESYSTEM_BACKED]
fs.gs.metadata.cache.directory [default: /tmp/gcs_connector_metadata_cache]
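A core-site.xml sketch of the recommended cluster-wide setup, using the three settings listed above; the NFS-mounted cache directory path is an example, not a prescribed location:

```xml
<!-- Sketch: cluster-wide "consistent view" via a FILESYSTEM_BACKED
     cache on a shared NFS mount (mount path is illustrative). -->
<property>
  <name>fs.gs.metadata.cache.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.metadata.cache.type</name>
  <value>FILESYSTEM_BACKED</value>
</property>
<property>
  <name>fs.gs.metadata.cache.directory</name>
  <value>/mnt/gcs_metadata_cache</value>
</property>
```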
5. Optimized seeks in GoogleHadoopFSDataInputStream which fit within
   the pre-fetched memory buffer by simply repositioning the buffer in-place
   rather than delegating to the underlying channel at all.
6. Fixed a performance-hindering bug in globStatus where "foo/bar/*" would
flat-list "foo/bar" instead of "foo/bar/"; causing the "candidate matches"
to include things like "foo/bar1" and "foo/bar1/baz", even though the
results themselves would be correct due to filtering out the proper glob
client-side in the end.
7. The versions of the Java API clients were updated to 1.19-derived versions.
bigquery-connector-0.4.5: CHANGES.txt
0.4.5 - 2014-10-17
1. Attempting to acquire an OAuth access token will now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
datastore-connector-0.14.8: CHANGES.txt
0.14.8 - 2014-10-17