Announcing new Apache Spark and Apache Shark extensions for bdutil as bdutil-0.34.3


Hadoop on Google Cloud Platform Team

Jun 13, 2014, 8:33:15 PM
to gcp-had...@google.com, gcp-hadoo...@googlegroups.com

Greetings, users of Hadoop on Google Cloud Platform!


We’re excited to announce several enhancements to bdutil that automate installing Apache Spark and Apache Shark alongside your Hadoop cluster, and that add two new commands. We have also optimized some Hadoop settings for better reliability and performance. Download bdutil-0.34.3.tar.gz or bdutil-0.34.3.zip now to try it out, or visit the developer documentation, where the download links now point to the latest version.


Note: This release doesn’t include any changes to bdconfig or the GCS/BigQuery/Datastore connectors.


Spark and Shark


To install Spark and Shark (0.9.1), along with their implicit dependencies (Hive and Scala 2.10.3), simply type:


./bdutil -e extensions/spark/spark_shark_env.sh -b <configbucket> deploy


Or on Hadoop 2:


./bdutil -e hadoop2_env.sh,extensions/spark/spark_shark_env.sh -b <configbucket> deploy


Once deployed, you can visit http://<master-node external IP address>:8080 to view the Spark GUI. If you SSH into the master node, you can simply type shark to start the Shark shell, where you can issue Shark or Hive queries like those described in Hive’s official “getting started” walkthrough.


In this default mode, your “Hive metastore” uses a local DerbyDB and should be thought of as transient; the metadata about tables you’ve created in your session will disappear when you delete the cluster.
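
For example, after typing shark on the master node, a quick session might look like the following sketch (the “pokes” table definition is borrowed from Hive’s “getting started” walkthrough; exact prompts may differ slightly on your cluster):

shark
shark> -- with the default Derby metastore, this table is transient
shark> CREATE TABLE pokes (foo INT, bar STRING);
shark> SHOW TABLES;
shark> SELECT COUNT(*) FROM pokes;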


Note that the current Hadoop 2 deployment does not use Spark on YARN; running Spark on YARN will be added in a future release.


We also provide a separate extension file that installs Spark 1.0.0 without Shark; Spark 1.0.0 includes a new alpha-stage component called Spark SQL, which combines the functionality of Shark into Spark itself. To deploy Spark 1.0.0, simply use:


./bdutil -e extensions/spark/spark1_env.sh -b <configbucket> deploy
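
Since Spark SQL is still alpha in 1.0.0, here is only a rough sketch of what a minimal spark-shell session on the master node might look like (the Record class and the “records” table are hypothetical, and this reflects the 1.0.0 alpha API, which may change):

spark-shell
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.createSchemaRDD
scala> case class Record(num: Int)   // hypothetical sample schema
scala> sc.parallelize(1 to 100).map(Record(_)).registerAsTable("records")
scala> sqlContext.sql("SELECT COUNT(*) FROM records").collect()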



New bdutil features


In a nutshell, bdutil now improves the reliability of longer-lived and larger Hadoop clusters by assigning available memory to the master daemons more effectively, enabling recovery of MapReduce jobs when the master daemons are restarted, and moving the Hadoop logs to the attached persistent disk if one is available. Additionally, bdutil now accepts the flag --network to specify a GCE network other than “default”.
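
For example, to deploy into an existing network named my-network (the name is a placeholder; per the release notes, the network must already exist and allow SSH plus intra-cluster traffic):

./bdutil --network my-network -b <configbucket> deploy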


Additional helpful commands are bdutil socksproxy <optional port; default: 1080>, which runs an SSH session with dynamic port forwarding as a SOCKS5 proxy, and bdutil shell, which is simply shorthand for opening an SSH session to your master node.
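
For example (run from your bdutil directory; depending on your configuration, you may need to pass the same flags you used at deploy time):

./bdutil socksproxy          # SOCKS5 proxy on the default port 1080
./bdutil socksproxy 12345    # SOCKS5 proxy on port 12345
./bdutil shell               # shorthand for an SSH session to the master node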


To use your SOCKS5 proxy with port 12345 in Firefox, for example:


  • bdutil socksproxy 12345

  • Go to Edit -> Preferences -> Advanced -> Network -> Settings

  • Enable "Manual proxy configuration" with a SOCKS host "localhost" on port 12345

  • Force the DNS resolution to occur on the remote proxy host rather than locally.

    • Go to "about:config" in the URL bar

    • Search for "socks" to toggle "network.proxy.socks_remote_dns" to "true".

  • Visit the web UIs exported by your cluster!
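
You can also sanity-check a running proxy from another terminal with curl, which can resolve hostnames through the SOCKS5 proxy via --socks5-hostname (the master hostname below is a placeholder for your cluster’s actual master):

curl --socks5-hostname localhost:12345 http://<master-node hostname>:8080/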

As always, please send any questions or comments to gcp-hadoo...@google.com.


All the best,

Your Google Team



bdutil-0.34.3: CHANGES.txt

0.34.3 - 2014-06-13


 1. JobTracker / ResourceManager recovery has been enabled by default to
    preserve job queues if the daemon dies.
 2. Fixed single_node_env.sh to work with hadoop2_env.sh.
 3. Two new commands were added to bdutil: socksproxy and shell; socksproxy
    will establish a SOCKS proxy to the cluster and shell will start an SSH
    session to the namenode.
 4. A new variable, GCE_NETWORK, was added to bdutil_env.sh and can be set
    from the command line via the --network flag when deploying a cluster or
    generating a configuration file. The network specified by GCE_NETWORK
    must exist, must allow SSH connections from the host running bdutil,
    and must allow intra-cluster communication.
 5. Increased configured heap sizes of the master daemons (JobTracker,
    NameNode, SecondaryNameNode, and ResourceManager).
 6. The HADOOP_LOG_DIR is now /hadoop/logs instead of the default
    /home/hadoop/hadoop-install/logs; if using attached PDs for larger disk
    storage, this directory resides on that attached PD rather than the
    boot volume, so that Hadoop logs will no longer fill up the boot disk.
 7. Added new extensions under bdutil-<version>/extensions/spark, including
    spark_shark_env.sh and spark1_env.sh, both compatible for mixing with
    Hadoop 2 as well. For now, neither uses Mesos or YARN, but both are
    suitable for single-user or Spark-only setups. The spark_shark_env.sh
    extension installs Spark + Shark 0.9.1, while spark1_env.sh installs
    only Spark 1.0.0, in which case Spark SQL serves as the alternative to
    Shark.

