Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest updates to bdutil, which add several performance and reliability features, improve the usability of Pig and Hive via extensions/querytools/querytools_env.sh, add support for using hadoop-streaming with the BigQuery connector, and add support in the GCS connector for “directory modification time”. Additionally, we’ve upgraded the default Spark version to 1.1.0 inside extensions/spark/spark_env.sh; older versions are still available at their previous download paths. Download bdutil-0.35.2.tar.gz or bdutil-0.35.2.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version.
The following patch to bdutil addresses a latent bug in Hadoop which can very rarely manifest as MapReduce jobs hanging for hours before failing, due to improper TaskTracker initialization on a single worker. The bug can only surface during deployment/startup, so long-running clusters are generally not vulnerable. If you commonly deploy large clusters on demand, we highly recommend picking up this update:
4. Added a health check script in Hadoop 1 to check if Jetty failed to load
for the TaskTracker as in [MAPREDUCE-4668].
The BigQuery connector now supports hadoop-streaming through the Hadoop 'mapred' API. To use hadoop-streaming with the connector, you must specify the appropriate -inputformat and -outputformat values, plus a collection of -D options to set the required BigQuery parameters. The bdutil-0.35.2/samples/ directory contains a streaming wordcount example: streaming_word_count.sh launches hadoop-streaming with the word_count_mapper.py and word_count_reducer.py files also provided in that directory. To walk through the sample end-to-end once you’ve configured bdutil with your project and bucket:
# Deploy a cluster with bigquery enabled.
./bdutil -e bigquery_env.sh deploy
# Create an empty output dataset in BigQuery using the ‘bq’ CLI
# (or alternatively do it through the BigQuery GUI)
bq mk tmpdataset_20140918
# Determine the project for your output table; it will generally be the same
# as the project you’ve used to deploy your cluster.
export PROJECT=<your project here>
# Use bdutil to stage the three files onto the master and run the sample.
./bdutil -u samples/streaming_word_count.sh \
-u samples/word_count_mapper.py \
-u samples/word_count_reducer.py \
-t master -v run_command -- ./streaming_word_count.sh \
--stream_output \
--output_project ${PROJECT} \
--output_dataset tmpdataset_20140918 \
--output_table wordcountout
# View the results of your word count using the ‘bq’ CLI
# (or navigate to the BigQuery GUI)
bq head tmpdataset_20140918.wordcountout
You may tune some of the settings at the top of samples/streaming_word_count.sh to try out other modes, such as reading input from BigQuery while writing output to HDFS or GCS:
STREAM_OUTPUT=false
OUTPUT_DIR_NOT_STREAMING='gs://your-bucket/streamingwordcountoutput'
A typical command that runs hadoop-streaming directly with the BigQuery connector looks like this:
hadoop jar hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.bq.input.project.id=publicdata \
-D mapred.bq.input.dataset.id=samples \
-D mapred.bq.input.table.id=shakespeare \
-D mapred.bq.output.project.id=myProject \
-D mapred.bq.output.dataset.id=testdataset \
-D mapred.bq.output.table.id=testtable \
-D mapred.bq.output.table.schema="[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]" \
-D mapred.output.committer.class=com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputCommitter \
-inputformat com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat \
-input requiredButUnused \
-outputformat com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat \
-output requiredButUnused \
-mapper mapper.py \
-reducer reducer.py
Note that you must supply values for -input and -output to satisfy hadoop-streaming’s parameter checks, but the BigQuery connector does not use these values.
The mapper receives as input one line of text for each record from BigQuery, formatted as JSON. It should output each key-value pair as tab-separated values terminated by a newline.
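For illustration, here is a minimal sketch of such a mapper in Python (this is not the shipped word_count_mapper.py; it assumes the publicdata:samples.shakespeare input used in the command above, where each JSON record includes a 'word' field):
#!/usr/bin/env python
# Minimal hadoop-streaming mapper sketch for the BigQuery connector.
# Assumes each stdin line is one JSON record containing a 'word' field.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    # Emit "word<TAB>1", newline-terminated.
    print('%s\t%d' % (record['word'], 1))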
The reducer receives as input one line of text for each record from the mapper, formatted as a tab-separated key-value pair. It should output a tab-separated, newline-terminated key-value pair in which the key is ignored (so it can be anything, such as "0") and the value is a JSON string whose fields match the output table schema specified by mapred.bq.output.table.schema.
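Continuing the sketch above (again, not the shipped word_count_reducer.py), a reducer that matches that mapper and the example schema [{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}] might look like:
#!/usr/bin/env python
# Minimal hadoop-streaming reducer sketch for the BigQuery connector.
# Assumes input lines of the form "word<TAB>count", grouped by word, and the
# example output schema with 'Word' (STRING) and 'Count' (INTEGER) fields.
import json
import sys

def emit(word, count):
    # The key is ignored by the output connector, so "0" is fine; the value
    # must be a JSON object matching mapred.bq.output.table.schema.
    print('0\t%s' % json.dumps({'Word': word, 'Count': count}))

current_word = None
current_count = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, _, count = line.partition('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            emit(current_word, current_count)
        current_word = word
        current_count = int(count)
if current_word is not None:
    emit(current_word, current_count)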
Changes to Pig and Hive (extensions/querytools/querytools_env.sh)
In general, it’s best practice to use DEFAULT_FS=hdfs when running Pig or Hive, since these tools depend more heavily on multi-stage MapReduce pipelines, which are in turn susceptible to GCS’s eventual list consistency semantics. extensions/querytools/querytools_env.sh now explicitly sets DEFAULT_FS=hdfs. Additionally, for consistency and ease of use, we’ve changed the default “user” for Pig and Hive to be “hadoop” instead of “hdpuser”, so whereas you’d previously find pig under:
/home/hdpuser/pig/bin/pig
it will now reside under:
/home/hadoop/pig/bin/pig
We’ve also updated querytools_env.sh to add the bin/ directories of both Pig and Hive to the system PATH for all users, so that if you SSH into your cluster or use “bdutil shell”, you can simply type “pig” or “hive” without needing to run “sudo sudo -i -u hdpuser” or specify the fully-qualified paths to the binaries. In addition, we’ve added extensions/querytools/pig-validate-setup.sh and extensions/querytools/hive-validate-setup.sh to help you get started quickly with Pig and Hive; they can be used the same way as extensions/spark/spark-validate-setup.sh for Spark:
./bdutil -e extensions/querytools/querytools_env.sh \
-e extensions/spark/spark_env.sh deploy
./bdutil shell < extensions/querytools/pig-validate-setup.sh
./bdutil shell < extensions/querytools/hive-validate-setup.sh
./bdutil shell < extensions/spark/spark-validate-setup.sh
As always, please send any questions or comments to gcp-hadoo...@google.com
All the best,
Your Google Team
bdutil-0.35.2: CHANGES.txt
0.35.2 - 2014-09-18
1. When installing Hadoop 1 and 2, snappy will now be installed and symbolic
links will be created from the /usr/lib or /usr/lib64 tree to the Hadoop
native library directory.
2. When installing Hadoop 2, bdutil will attempt to download and install
precompiled native libraries for the installed version of Hadoop.
3. Modified default hadoop-validate-setup.sh to use 10MB of random data
instead of the old 1MB, otherwise it doesn't work for larger clusters.
4. Added a health check script in Hadoop 1 to check if Jetty failed to load
for the TaskTracker as in [MAPREDUCE-4668].
5. Added ServerAliveInterval and ServerAliveCountMax SSH options to SSH
invocations to detect dropped connections.
6. Pig and Hive installation (extensions/querytools/querytools_env.sh) now
sets DEFAULT_FS='hdfs'; reading from GCS using explicit gs:// URIs will
still work normally, but intermediate data for multi-stage pipelines will
now reside on HDFS. This is because Pig and Hive more commonly rely on
immediate "list consistency" across clients, and thus are more susceptible
to GCS "eventual list consistency" semantics even if the majority case
works fine.
7. Changed occurrences of 'hdpuser' to 'hadoop' in querytools_env.sh, such
that Pig and Hive will be installed under /home/hadoop instead of
/home/hdpuser, and the files will be owned by 'hadoop' instead of
'hdpuser'; this is more consistent with how other extensions have been
handled.
8. Modified extensions/querytools/querytools_env.sh to additionally insert
the Pig and Hive 'bin' directories into the PATH environment variable
for all users, such that SSH'ing into the master provides immediate
access to launching 'pig' or 'hive' without requiring
"sudo sudo -i -u hdpuser"; removed 'chmod 600 hive-site.xml' so that any
user can successfully run 'hive' directly.
9. Added extensions/querytools/{hive, pig}-validate-setup.sh which can be
used as a quick test of Pig/Hive functionality:
./bdutil shell < extensions/querytools/pig-validate-setup.sh
10. Updated extensions/spark/spark_env.sh to now use spark-1.1.0 by default.
11. Added new BigQuery connector sample under bdutil-0.35.2/samples as file
streaming_word_count.sh which demonstrates using the new support for
the older "hadoop.mapred.*" interfaces via hadoop-streaming.jar.
bigquery-connector-0.4.4: CHANGES.txt
0.4.4 - 2014-09-18
1. Added new classes implementing the hadoop.mapred.* interfaces by wrapping
the existing hadoop.mapreduce.* implementations and delegating
appropriately. This enables backwards-compatibility for some stacks which
depend on the "old api" interfaces, including now being able to use
the standard "hadoop-streaming.jar" to run binary mappers/reducers with
the BigQuery connector. Note that in the absence of a blocking driver
program to call BigQueryInputFormat.cleanupJob, you must instead explicitly
clean up the temporary exported files after a hadoop-streaming job if
using the input connector. Extra cleanup is not necessary if only using
the output connector in hadoop-streaming. The new top-level classes:
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat
See the javadocs for the associated RecordReader/Writer, InputSplit, and
OutputCommitter classes.
gcs-connector-1.2.9: CHANGES.txt
1.2.9 - 2014-09-18
1. When directory contents are updated (e.g., files or directories are added,
removed, or renamed), the GCS connector will now attempt to update a
metadata property on the parent directory with a modification time. The
modification time recorded will be used as the modification time in
subsequent FileSystem#getFileStatus(...), FileSystem#listStatus(...), and
FileSystem#globStatus(...) calls, and is the time as reported by the
system clock of the system that made the modification.
datastore-connector-0.14.7: CHANGES.txt
0.14.7 - 2014-09-18