Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce the latest updates to bdutil, which add several performance and reliability features, improve the usability of Pig and Hive via extensions/querytools/querytools_env.sh, add support for using hadoop-streaming with the BigQuery connector, and add support in the GCS connector for “directory modification time”. Additionally, we’ve upgraded the default Spark version to 1.1.0 inside extensions/spark/spark_env.sh; older versions are still available at their previous download paths. Download bdutil-0.35.2.tar.gz or bdutil-0.35.2.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version.
The following patch to bdutil addresses a latent bug in Hadoop which can very rarely manifest as MapReduce jobs hanging for hours before failing, due to improper TaskTracker initialization on a single worker. The bug can only surface during deployment/startup, so long-running clusters are generally not vulnerable. If you commonly deploy large clusters on demand, we highly recommend picking up this update:
4. Added a health check script in Hadoop 1 to check if Jetty failed to load
for the TaskTracker as in [MAPREDUCE-4668].
The BigQuery connector now supports hadoop-streaming through the Hadoop 'mapred' API. To use hadoop-streaming with the connector, you must specify the appropriate -inputformat and -outputformat values, plus a collection of -D options to set the required BigQuery parameters. The bdutil-0.35.2/samples/ directory contains a streaming wordcount example: streaming_word_count.sh launches hadoop-streaming with the word_count_mapper.py and word_count_reducer.py files also provided in that directory. To walk through the sample end-to-end once you’ve configured bdutil with your project and bucket:
# Deploy a cluster with bigquery enabled.
./bdutil -e bigquery_env.sh deploy
# Create an empty output dataset in BigQuery using the ‘bq’ CLI
# (or alternatively do it through the BigQuery GUI)
bq mk tmpdataset_20140918
# Determine the project for your output table; it will generally be the same
# as the project you’ve used to deploy your cluster.
export PROJECT=<your project here>
# Use bdutil to stage the three files onto the master and run the sample.
./bdutil -u samples/streaming_word_count.sh \
-u samples/word_count_mapper.py \
-u samples/word_count_reducer.py \
-t master -v run_command -- ./streaming_word_count.sh \
--stream_output \
--output_project ${PROJECT} \
--output_dataset tmpdataset_20140918 \
--output_table wordcountout
# View the results of your word count using the ‘bq’ CLI
# (or navigate to the BigQuery GUI)
bq head tmpdataset_20140918.wordcountout
You may tune some of the settings at the top of samples/streaming_word_count.sh to try out other modes, such as reading input from BigQuery while writing output to HDFS or GCS:
STREAM_OUTPUT=false
OUTPUT_DIR_NOT_STREAMING='gs://your-bucket/streamingwordcountoutput'
A typical command that runs hadoop-streaming directly with the BigQuery connector looks like this:
hadoop jar hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.bq.input.project.id=publicdata \
-D mapred.bq.input.dataset.id=samples \
-D mapred.bq.input.table.id=shakespeare \
-D mapred.bq.output.project.id=myProject \
-D mapred.bq.output.dataset.id=testdataset \
-D mapred.bq.output.table.id=testtable \
-D mapred.bq.output.table.schema="[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]" \
-D mapred.output.committer.class=com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputCommitter \
-inputformat com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat \
-input requiredButUnused \
-outputformat com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat \
-output requiredButUnused \
-mapper mapper.py \
-reducer reducer.py
Note that you must supply values for -input and -output to satisfy hadoop-streaming’s parameter checks, but the BigQuery connector does not use these values.
The mapper receives as input one line of text for each record from BigQuery, formatted as JSON. It should output each key-value pair as tab-separated values terminated by a newline.
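For illustration, here is a minimal sketch of such a mapper in Python (this is not the shipped word_count_mapper.py; it assumes the publicdata:samples.shakespeare input used in the command above, where each JSON record includes a 'word' field):
#!/usr/bin/env python
# Minimal hadoop-streaming mapper sketch for the BigQuery connector.
# Assumes each stdin line is one JSON record containing a 'word' field.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    # Emit "word<TAB>1", newline-terminated.
    print('%s\t%d' % (record['word'], 1))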
The reducer receives as input one line of text for each record from the mapper, formatted as a tab-separated key-value pair. It should output a tab-separated, newline-terminated key-value pair in which the key is ignored (so it can be anything, such as "0") and the value is a JSON string whose fields match the output table schema specified by mapred.bq.output.table.schema.
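Continuing the sketch above (again, not the shipped word_count_reducer.py), a reducer that matches that mapper and the example schema [{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}] might look like:
#!/usr/bin/env python
# Minimal hadoop-streaming reducer sketch for the BigQuery connector.
# Assumes input lines of the form "word<TAB>count", grouped by word, and the
# example output schema with 'Word' (STRING) and 'Count' (INTEGER) fields.
import json
import sys

def emit(word, count):
    # The key is ignored by the output connector, so "0" is fine; the value
    # must be a JSON object matching mapred.bq.output.table.schema.
    print('0\t%s' % json.dumps({'Word': word, 'Count': count}))

current_word = None
current_count = 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, _, count = line.partition('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            emit(current_word, current_count)
        current_word = word
        current_count = int(count)
if current_word is not None:
    emit(current_word, current_count)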
Changes to Pig and Hive (extensions/querytools/querytools_env.sh)
In general, it’s best practice to use DEFAULT_FS=hdfs when running Pig or Hive, since these tools depend more heavily on multi-stage MapReduce pipelines, which are in turn susceptible to GCS’s eventual list consistency semantics. extensions/querytools/querytools_env.sh now explicitly sets DEFAULT_FS=hdfs. Additionally, for consistency and ease of use, we’ve changed the default “user” for Pig and Hive to be “hadoop” instead of “hdpuser”, so whereas you’d previously find pig under:
/home/hdpuser/pig/bin/pig
it will now reside under:
/home/hadoop/pig/bin/pig
We’ve also updated querytools_env.sh to add the bin/ directories of both Pig and Hive to the system PATH for all users, so that if you SSH into your cluster or use “bdutil shell”, you can simply type “pig” or “hive” without needing to run “sudo sudo -i -u hdpuser” or specify the fully-qualified paths to the binaries. In addition, we’ve added extensions/querytools/pig-validate-setup.sh and extensions/querytools/hive-validate-setup.sh to help you get started quickly with Pig and Hive; they can be used the same way as extensions/spark/spark-validate-setup.sh for Spark:
./bdutil -e extensions/querytools/querytools_env.sh \
-e extensions/spark/spark_env.sh deploy
./bdutil shell < extensions/querytools/pig-validate-setup.sh
./bdutil shell < extensions/querytools/hive-validate-setup.sh
./bdutil shell < extensions/spark/spark-validate-setup.sh
As always, please send any questions or comments to gcp-hadoo...@google.com
All the best,
Your Google Team
bdutil-0.35.2: CHANGES.txt
0.35.2 - 2014-09-18
1. When installing Hadoop 1 and 2, snappy will now be installed and symbolic
links will be created from the /usr/lib or /usr/lib64 tree to the Hadoop
native library directory.
2. When installing Hadoop 2, bdutil will attempt to download and install
precompiled native libraries for the installed version of Hadoop.
3. Modified default hadoop-validate-setup.sh to use 10MB of random data
instead of the old 1MB, otherwise it doesn't work for larger clusters.
4. Added a health check script in Hadoop 1 to check if Jetty failed to load
for the TaskTracker as in [MAPREDUCE-4668].
5. Added ServerAliveInterval and ServerAliveCountMax SSH options to SSH
invocations to detect dropped connections.
6. Pig and Hive installation (extensions/querytools/querytools_env.sh) now
sets DEFAULT_FS='hdfs'; reading from GCS using explicit gs:// URIs will
still work normally, but intermediate data for multi-stage pipelines will
now reside on HDFS. This is because Pig and Hive more commonly rely on
immediate "list consistency" across clients, and thus are more susceptible
to GCS "eventual list consistency" semantics even if the majority case
works fine.
7. Changed occurrences of 'hdpuser' to 'hadoop' in querytools_env.sh, such
that Pig and Hive will be installed under /home/hadoop instead of
/home/hdpuser, and the files will be owned by 'hadoop' instead of
'hdpuser'; this is more consistent with how other extensions have been
handled.
8. Modified extensions/querytools/querytools_env.sh to additionally insert
the Pig and Hive 'bin' directories into the PATH environment variable
for all users, such that SSH'ing into the master provides immediate
access to launching 'pig' or 'hive' without requiring
"sudo sudo -i -u hdpuser"; removed 'chmod 600 hive-site.xml' so that any
user can successfully run 'hive' directly.
9. Added extensions/querytools/{hive, pig}-validate-setup.sh which can be
used as a quick test of Pig/Hive functionality:
./bdutil shell < extensions/querytools/pig-validate-setup.sh
10. Updated extensions/spark/spark_env.sh to now use spark-1.1.0 by default.
11. Added new BigQuery connector sample under bdutil-0.35.2/samples as file
streaming_word_count.sh which demonstrates using the new support for
the older "hadoop.mapred.*" interfaces via hadoop-streaming.jar.
bigquery-connector-0.4.4: CHANGES.txt
0.4.4 - 2014-09-18
1. Added new classes implementing the hadoop.mapred.* interfaces by wrapping
the existing hadoop.mapreduce.* implementations and delegating
appropriately. This enables backwards-compatibility for some stacks which
depend on the "old api" interfaces, including now being able to use
the standard "hadoop-streaming.jar" to run binary mappers/reducers with
the BigQuery connector. Note that in the absence of a blocking driver
program to call BigQueryInputFormat.cleanupJob, you must instead explicitly
clean up the temporary exported files after a hadoop-streaming job if
using the input connector. Extra cleanup is not necessary if only using
the output connector in hadoop-streaming. The new top-level classes:
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredOutputFormat
See the javadocs for the associated RecordReader/Writer, InputSplit, and
OutputCommitter classes.
gcs-connector-1.2.9: CHANGES.txt
1.2.9 - 2014-09-18
1. When directory contents are updated (e.g., files or directories are added,
removed, or renamed), the GCS connector will now attempt to update a
metadata property on the parent directory with a modification time. The
modification time recorded will be used as the modification time in
subsequent FileSystem#getFileStatus(...), FileSystem#listStatus(...), and
FileSystem#globStatus(...) calls, and is the time as reported by the
system clock of the system that made the modification.
datastore-connector-0.14.7: CHANGES.txt
0.14.7 - 2014-09-18