Here, we call flatMap to transform a Dataset of lines to a Dataset of words, and then combine groupByKey and count to compute the per-word counts in the file as a Dataset of (String, Long) pairs. To collect the word counts in our shell, we can call collect.
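The pipeline described above can be sketched as a small standalone program. This is a minimal sketch, not a definitive implementation: the object name WordCountSketch, the local[*] master, and the README.md input path are assumptions made for illustration. In spark-shell the spark session and its implicits are already provided, so only the three lines in the middle are needed there.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; a real job would take the master from spark-submit.
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // encoders for String and (String, Long)

    // Assumption: README.md exists in the working directory.
    val textFile = spark.read.textFile("README.md") // Dataset[String], one element per line

    val wordCounts = textFile
      .flatMap(line => line.split(" ")) // Dataset of lines -> Dataset of words
      .groupByKey(identity)             // group identical words together
      .count()                          // Dataset[(String, Long)]

    // collect brings the pairs back to the driver as an Array[(String, Long)].
    wordCounts.collect().foreach { case (word, n) => println(s"$word: $n") }

    spark.stop()
  }
}
```

Because collect returns every pair to the driver, it is only suitable when the result is small enough to fit in the driver's memory.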
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell (or bin/pyspark for Python) to a cluster, as described in the RDD programming guide.
This option also applies to the standalone mode you've been using, but if you have been using the ec2 scripts, we set "spark.executor.memory" in conf/spark-defaults.conf to do this automatically so you don't have to specify it each time on the command line. You can also do the same in YARN.
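One way to confirm which value a running shell actually picked up is to read it back from the context's configuration. The helper below is a hypothetical sketch (the name executorMemorySetting is not part of Spark); it relies only on SparkContext.getConf and SparkConf.getOption.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper: return the explicitly configured executor memory, if any.
// None means the property was not set on the command line or in
// conf/spark-defaults.conf, so Spark's built-in default applies.
def executorMemorySetting(sc: SparkContext): Option[String] = {
  sc.getConf.getOption("spark.executor.memory")
}

// In spark-shell, pass the predefined context:
// executorMemorySetting(sc).getOrElse("not set explicitly")
```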
NB: It is important to note that the JARs are copied to the working directory of the executors, but are not copied to the working directory of the driver. Usually, the spark.driver.extraClassPath will be the same path you passed to --jars whereas spark.executor.extraClassPath must be a relative path.
If you run the Spark shell as it is, you will only have the built-in Spark commands available. If you want to use it with the Couchbase Connector, the easiest way is to provide a specific argument that locates the dependency and pulls it in.
The Spark shell is based on the Scala REPL (Read-Eval-Print-Loop). It allows you to create Spark programs interactively and submit work to the framework. You can access the Spark shell by connecting to the primary node with SSH and invoking spark-shell. For more information about connecting to the primary node, see Connect to the primary node using SSH in the Amazon EMR Management Guide. The following examples use Apache HTTP Server access logs stored in Amazon S3.
By default, the Spark shell creates its own SparkContext object called sc. You can use this context if it is required within the REPL. sqlContext is also available in the shell and it is a HiveContext.
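As an illustration of using the predefined context, the following sketch counts the lines of a text file that contain a given substring, which is a typical first query against access logs. The helper name countMatchingLines and the S3 path in the usage comment are placeholders, not values taken from the EMR guide.

```scala
import org.apache.spark.SparkContext

// Hypothetical helper: count the lines under `path` that contain `needle`.
def countMatchingLines(sc: SparkContext, path: String, needle: String): Long = {
  val lines = sc.textFile(path)            // RDD[String], one element per line
  lines.filter(_.contains(needle)).count() // count is an action, so the job runs here
}

// In spark-shell, pass the predefined context. The bucket and key are placeholders:
// countMatchingLines(sc, "s3://your-bucket/path/to/access.log", "GET")
```

Nothing is read from storage until count is called, because textFile and filter only describe the computation.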
Spark also includes a Python-based shell, pyspark, that you can use to prototype Spark programs written in Python. Just as with spark-shell, invoke pyspark on the primary node; it also has the same SparkContext object.