Greetings, users of Hadoop on Google Cloud Platform!
We’re excited to announce several enhancements to bdutil that automate installing Apache Spark and Apache Shark alongside your Hadoop cluster and add two new commands. We have also optimized some Hadoop settings for better reliability and performance. Download bdutil-0.34.3.tar.gz or bdutil-0.34.3.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version.
Note: This release doesn’t include any changes to bdconfig or the GCS/BigQuery/Datastore connectors.
To install Spark and Shark 0.9.1 (along with their implicit dependencies, Hive and Scala 2.10.3), simply type:
./bdutil -e extensions/spark/spark_shark_env.sh -b <configbucket> deploy
Or on Hadoop 2:
./bdutil -e hadoop2_env.sh,extensions/spark/spark_shark_env.sh -b <configbucket> deploy
Once deployed, you can visit http://<master-node external IP address>:8080 to view the Spark web UI. If you SSH into the master node, you can simply type shark to start the Shark shell, where you can issue Shark or Hive queries as described in Hive’s official “getting started” walkthrough.
In this default mode, your “Hive metastore” uses a local Derby database and should be treated as transient; the metadata about tables you’ve created in your session will disappear when the cluster is deleted.
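For example, a quick smoke test from the master node might look like the following sketch; the pokes table comes from Hive’s “getting started” walkthrough and is otherwise arbitrary:

shark
shark> CREATE TABLE pokes (foo INT, bar STRING);
shark> SHOW TABLES;
shark> DROP TABLE pokes;
shark> exit;

Because the Derby metastore lives on the cluster itself, any tables you create this way disappear along with the cluster.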
Note that the current Hadoop 2 deployment does not use Spark on YARN; running Spark on YARN will be added in a future release.
We also provide a separate extension file that installs Spark 1.0.0 without Shark; Spark 1.0.0 includes a new alpha-stage component called Spark SQL, which folds the functionality of Shark into Spark itself. To deploy Spark 1.0.0, simply use:
./bdutil -e extensions/spark/spark1_env.sh -b <configbucket> deploy
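As a rough sketch of what Spark SQL makes possible, the following spark-shell session on the master node follows the Spark 1.0 programming guide; the Person class and the people table are hypothetical names invented for this example:

case class Person(name: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // sc is provided by spark-shell
import sqlContext.createSchemaRDD  // implicitly converts RDDs of case classes to SchemaRDDs
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
people.registerAsTable("people")   // hypothetical table name
sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()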
New bdutil features
In a nutshell, bdutil now improves the reliability of larger and longer-lived Hadoop clusters by assigning available memory to the master daemons more sensibly, enabling recovery of MapReduce jobs when master daemons are restarted, and moving the Hadoop logs onto the attached persistent disk when one is available. Additionally, bdutil now accepts a --network flag to specify a GCE network other than “default”.
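For example, to deploy the Spark cluster from earlier into a GCE network named my-network (a hypothetical name; the network must already exist, as described in the CHANGES entry below):

./bdutil -e extensions/spark/spark_shark_env.sh -b <configbucket> --network my-network deploy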
Two additional helpful commands are bdutil socksproxy <optional port; default: 1080>, which runs an SSH session with dynamic port forwarding to act as a SOCKS5 proxy, and bdutil shell, which is simply shorthand for opening an SSH session to your master node.
To use your SOCKS5 proxy on port 12345 with Firefox, for example:
bdutil socksproxy 12345
1. Go to Edit -> Preferences -> Advanced -> Network -> Settings.
2. Enable "Manual proxy configuration" with SOCKS host "localhost" on port 12345.
3. Force DNS resolution to occur on the remote proxy host rather than locally:
   go to "about:config" in the URL bar, search for "socks", and toggle
   "network.proxy.socks_remote_dns" to "true".
Then visit the web UIs exposed by your cluster!
http://hs-ghfs-nn:50030 for your Hadoop 1 JobTracker
http://hs-ghfs-nn:8088 for your YARN/Hadoop 2 Resource Manager
http://hs-ghfs-nn:8080 for Spark
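If you prefer a command-line check over a browser, curl supports SOCKS5 with remote DNS resolution; with the proxy from above still running, the following should fetch the JobTracker page through the tunnel (hs-ghfs-nn stands in for your master node’s hostname, as in the list above):

curl --socks5-hostname localhost:12345 http://hs-ghfs-nn:50030/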
As always, please send any questions or comments to gcp-hadoo...@google.com
All the best,
Your Google Team
bdutil-0.34.3: CHANGES.txt
0.34.3 - 2014-06-13
1. JobTracker / ResourceManager recovery has been enabled by default to
preserve job queues if the daemon dies.
2. Fixed single_node_env.sh to work with hadoop2_env.sh.
3. Two new commands were added to bdutil: socksproxy and shell; socksproxy
will establish a SOCKS proxy to the cluster and shell will start an SSH
session to the namenode.
4. A new variable, GCE_NETWORK, was added to bdutil_env.sh and can be set
from the command line via the --network flag when deploying a cluster or
generating a configuration file. The network specified by GCE_NETWORK
must exist, must allow SSH connections from the host running bdutil, and
must allow intra-cluster communication.
5. Increased configured heap sizes of the master daemons (JobTracker,
NameNode, SecondaryNameNode, and ResourceManager).
6. The HADOOP_LOG_DIR is now /hadoop/logs instead of the default
/home/hadoop/hadoop-install/logs; if using attached PDs for larger disk
storage, this directory resides on that attached PD rather than the
boot volume, so that Hadoop logs will no longer fill up the boot disk.
7. Added new extensions under bdutil-<version>/extensions/spark, including
spark_shark_env.sh and spark1_env.sh, both of which can also be mixed with
Hadoop 2. For now, neither uses Mesos or YARN, but both are suitable for
single-user or Spark-only setups. The spark_shark_env.sh extension installs
Spark + Shark 0.9.1, while spark1_env.sh installs only Spark 1.0.0, in which
case Spark SQL serves as the alternative to Shark.