Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest version of bdutil, which fixes several bugs, adds convenient new features, and provides automatic recovery after unexpected reboots.
Additionally, the GCS connector for Hadoop now implements cluster-wide immediate “list after create” consistency, making it safe to run multistage pipelines directly on GCS in which later stages depend on listing the files produced by earlier stages. The feature is implemented by storing supplemental metadata in NFS; bdutil will automatically set up the necessary dependencies and configure the GCS connector to use this feature out of the box.
Download bdutil-0.36.4.tar.gz or bdutil-0.36.4.zip now to try it out, or visit the developer documentation where the download links now point to the latest version.
Abridged highlights for bdutil updates:
Patched several minor bugs for YARN/Hadoop2 such as logging/local directories and running the Job History Server, significantly improving the hadoop2_env.sh experience.
Several new startup-recovery settings now allow nodes to automatically rejoin the cluster after a reboot; occasional unplanned reboots can occur when the underlying machine goes down.
Support for passing relative paths to env_var_files for bdutil when not running directly inside the bdutil directory; new shorthand notation and adjustable BDUTIL_EXTENSIONS_PATH.
Abridged highlights for gcs-connector updates:
Cluster-wide immediate “list after create” consistency is now enforced by the GCS connector in the default bdutil/gcs-connector installation.
Directory modification timestamps, introduced in gcs-connector-1.2.9, have been fine-tuned and optimized, and can significantly speed up use cases involving a large number of files within a single directory.
Performance and reliability improvements, such as low-level retries on credential refresh and optimized “seeks”; use cases like LZO indexing see a significant speedup.
Please see the detailed release notes below for more information about the new bdutil and GCS connector features.
Manual installation of the GCS connector without using bdutil will still function as it always has, but requires additional setup if you wish to adopt the new “immediate list consistency” functionality. You may download gcs-connector-1.3.0-hadoop1.jar or gcs-connector-1.3.0-hadoop2.jar directly, and view bdutil-0.36.4/libexec/setup_master_nfs.sh, bdutil-0.36.4/libexec/setup_client_nfs.sh, and bdutil-0.36.4/libexec/install_and_configure_gcs_connector.sh as examples of how to set up and configure the list-consistency feature.
As always, please send any questions or comments to gcp-hadoo...@google.com or post a question on stackoverflow.com with tag ‘google-hadoop’ for additional assistance.
All the best,
Your Google Team
bdutil-0.36.4: CHANGES.txt
0.36.4 - 2014-10-17
1. Added bdutil flags --worker_attached_pds_size_gb and
--master_attached_pd_size_gb corresponding to the bdutil_env variables of
the same names.
2. Added bdutil_env.sh variables and corresponding flags:
--worker_attached_pds_type and --master_attached_pd_type to specify
   the type of PD to create, 'pd-standard' or 'pd-ssd'. Default: pd-standard.
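A minimal sketch of combining the new flags in a deploy command; the flag names come from the notes above, while the size and type values are purely illustrative:

```shell
# Hypothetical example: deploy with an SSD-backed master PD and
# standard worker PDs (sizes in GB are illustrative values).
./bdutil deploy \
    --master_attached_pd_size_gb 500 \
    --master_attached_pd_type pd-ssd \
    --worker_attached_pds_size_gb 1500 \
    --worker_attached_pds_type pd-standard
```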
3. Fixed a bug where we forgot to actually add
extensions/querytools/setup_profiles.sh to the COMMAND_GROUPS under
extensions/querytools/querytools_env.sh; now it's actually possible to
run 'pig' or 'hive' directly with querytools_env.sh installed.
4. Fixed a bug affecting Hadoop 1.2.1 HDFS persistence across deployments
where dfs.data.dir directories inadvertently had their permissions
modified to 775 from the correct 755, and thus caused datanodes to fail to
recover the data. Only applies in the use case of setting:
CREATE_ATTACHED_PDS_ON_DEPLOY=false
DELETE_ATTACHED_PDS_ON_DELETE=false
after an initial deployment to persist HDFS across a delete/deploy command.
The explicit directory configuration is now set in bdutil_env.sh with
the variable HDFS_DATA_DIRS_PERM, which is in turn wired into
hdfs-site.xml.
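The persistence scenario above can be sketched as the following bdutil_env.sh overrides; the variable names are taken from the notes, and the comment about the corresponding hdfs-site.xml property is an assumption based on standard Hadoop configuration:

```shell
# Sketch: persist HDFS data across delete/deploy cycles after an
# initial deployment (set in bdutil_env.sh or a custom env file).
CREATE_ATTACHED_PDS_ON_DEPLOY=false
DELETE_ATTACHED_PDS_ON_DELETE=false
# Explicit permissions for dfs.data.dir directories; wired into
# hdfs-site.xml so datanodes can recover existing data.
HDFS_DATA_DIRS_PERM=755
```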
5. Added mounted disks to /etc/fstab to re-mount them on boot.
6. bdutil now uses a search path mechanism to look for env files to reduce
the amount of typing necessary to specify env files. For each argument
to the -e (or --env_var_files) command line option, if the argument
specifies just a base filename without a directory, bdutil will use
the first file of that name that it finds in the following directories:
1. The current working directory (.).
2. Directories specified as a colon-separated list of directories in
the environment variable BDUTIL_EXTENSIONS_PATH.
3. The bdutil directory (where the bdutil script is located).
4. Each of the extensions directories within the bdutil directory.
If the base filename is not found, it will try appending "_env.sh" to
the filename and look again in the same set of directories.
This change allows the following:
1. You can specify standard extensions succinctly, such as
"-e spark" for the spark extension, or "-e hadoop2" to use Hadoop 2.
2. You can put the bdutil directory in your PATH and run bdutil
from anywhere, and it will still find all its own files.
3. You can run bdutil from a directory containing your custom env
files and use filename completion to add them to a bdutil command.
4. You can collect your custom env files into one directory, set
BDUTIL_EXTENSIONS_PATH to point to that directory, run bdutil
from anywhere, and specify your custom env files by name only.
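The four usage patterns above can be sketched as follows; the custom env file names and directory are hypothetical, and the exact argument ordering is illustrative:

```shell
# 1. Shorthand for standard extensions ("_env.sh" is appended automatically):
./bdutil -e spark deploy
./bdutil -e hadoop2 deploy

# 2. With the bdutil directory on your PATH, run it from anywhere:
export PATH="$PATH:$HOME/bdutil-0.36.4"

# 3. Run from a directory containing a custom env file (hypothetical name):
./bdutil -e my_cluster_env.sh deploy

# 4. Collect custom env files in one place and reference them by name only:
export BDUTIL_EXTENSIONS_PATH="$HOME/my_bdutil_envs"
./bdutil -e my_cluster deploy   # finds $HOME/my_bdutil_envs/my_cluster_env.sh
```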
7. Added new boolean setting to bdutil_env.sh, ENABLE_NFS_GCS_FILE_CACHE,
which defaults to 'true'. When true, the GCS connector will be configured
to use its new "FILESYSTEM_BACKED" DirectoryListCache for immediate
cluster-wide list consistency, allowing multi-stage pipelines in e.g. Pig
and Hive to safely operate with DEFAULT_FS=gs. With this setting, bdutil
will install and configure an NFS export point on the master node, to
be mounted as the shared metadata cache directory for all cluster nodes.
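Since ENABLE_NFS_GCS_FILE_CACHE defaults to 'true', opting out can be sketched as a one-line override in a custom env file; the file name here is hypothetical:

```shell
# Sketch: disable the NFS-backed metadata cache via a custom env file
# (my_env.sh is a hypothetical name), then deploy with it.
echo 'ENABLE_NFS_GCS_FILE_CACHE=false' > my_env.sh
./bdutil -e hadoop2 -e my_env.sh deploy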
8. Fixed a bug where the datastore-to-bigquery sample neglected to set a
'filter' in its query based on its ancestor entities.
9. YARN local directories are now set to spread IO across all directories
under /mnt.
10. YARN container logs will be written to /hadoop/logs/.
11. The Hadoop 2 MR Job History Server will now be started on the master node.
12. Added /etc/init.d entries for Hadoop daemons to restart them after
VM restarts.
13. Moved "hadoop fs -test" of gcs-connector to end of Hadoop setup, after
starting Hadoop daemons.
14. The spark_env.sh extension will now install numpy.
gcs-connector-1.3.0: CHANGES.txt
1.3.0 - 2014-10-17
1. Directory timestamp updating can now be controlled via user-settable
properties "fs.gs.parent.timestamp.update.enable",
   "fs.gs.parent.timestamp.update.substrings.excludes", and
"fs.gs.parent.timestamp.update.substrings.includes" in core-site.xml. By
default, timestamp updating is enabled for the YARN done and intermediate
done directories and excluded for everything else. Strings listed in
includes take precedence over excludes.
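An illustrative core-site.xml fragment using these properties; the property names come from the notes above, while the substring values are example placeholders rather than the connector's actual defaults:

```xml
<!-- Sketch: enable directory timestamp updating only for paths
     containing the listed substrings (values are examples). -->
<property>
  <name>fs.gs.parent.timestamp.update.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.parent.timestamp.update.substrings.includes</name>
  <value>/done/,/done_intermediate/</value>
</property>
<property>
  <name>fs.gs.parent.timestamp.update.substrings.excludes</name>
  <value>/</value>
</property>
```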
2. Directory timestamp updating will now occur on a background thread inside
GoogleCloudStorageFileSystem.
3. Attempting to acquire an OAuth access token will now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
4. Added FileSystemBackedDirectoryListCache, extracting a common interface
for it to share with the (InMemory)DirectoryListCache; instead of using
an in-memory HashMap to enforce only same-process list consistency, the
FileSystemBacked version mirrors GCS objects as empty files on a local
FileSystem, which may itself be an NFS mount for cluster-wide or even
potentially cross-cluster consistency groups. This allows a cluster to
be configured with a "consistent view", making it safe to use GCS as the
DEFAULT_FS for arbitrary multi-stage or even multi-platform workloads.
This is now enabled by default for machine-wide consistency, but it is
strongly recommended to configure clusters with an NFS directory for
cluster-wide strong consistency. Relevant configuration settings:
fs.gs.metadata.cache.enable [default: true]
fs.gs.metadata.cache.type [IN_MEMORY (default) | FILESYSTEM_BACKED]
fs.gs.metadata.cache.directory [default: /tmp/gcs_connector_metadata_cache]
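A core-site.xml sketch of the recommended cluster-wide setup, using the three settings listed above; the NFS-mounted cache directory path is an example, not a prescribed location:

```xml
<!-- Sketch: cluster-wide "consistent view" via a FILESYSTEM_BACKED
     cache on a shared NFS mount (mount path is illustrative). -->
<property>
  <name>fs.gs.metadata.cache.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.metadata.cache.type</name>
  <value>FILESYSTEM_BACKED</value>
</property>
<property>
  <name>fs.gs.metadata.cache.directory</name>
  <value>/mnt/gcs_metadata_cache</value>
</property>
```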
5. Optimized seeks in GoogleHadoopFSDataInputStream which fit within
   the pre-fetched memory buffer by simply repositioning the buffer in-place
   rather than delegating to the underlying channel at all.
6. Fixed a performance-hindering bug in globStatus where "foo/bar/*" would
flat-list "foo/bar" instead of "foo/bar/"; causing the "candidate matches"
to include things like "foo/bar1" and "foo/bar1/baz", even though the
results themselves would be correct due to filtering out the proper glob
client-side in the end.
7. The versions of the Java API clients were updated to 1.19-derived versions.
bigquery-connector-0.4.5: CHANGES.txt
0.4.5 - 2014-10-17
1. Attempting to acquire an OAuth access token will now be retried when
using .p12 or installed application (JWT) credentials if there is a
recoverable error such as an HTTP 5XX response code or an IOException.
datastore-connector-0.14.8: CHANGES.txt
0.14.8 - 2014-10-17