Livy is an open source REST interface for using Spark from anywhere.
It supports executing snippets of code or programs in a Spark Context that runs locally or in YARN. This makes it ideal for building applications or Notebooks that can interact with Spark in real time. For example, it is currently used for powering the Spark snippets of the Hadoop Notebook in Hue.
In this post we will see how to execute some Spark 1.5 snippets in Python.
Livy sits between the remote users and the Spark cluster
Based on the README, we check out Livy’s code. It currently lives in the Hue repository for simplicity, but will hopefully eventually graduate into its own top-level project.
```
git clone git@github.com:cloudera/hue.git
```
Then we compile Livy with
```
cd hue/apps/spark/java
mvn -DskipTests clean package
```
And start it
```
./bin/livy-server
```
Note: Livy defaults to Spark local mode. To use the YARN mode, copy the configuration template file apps/spark/java/conf/livy-defaults.conf.tmpl to livy-defaults.conf and set the property:
```
livy.server.session.factory = yarn
```
Now that the REST server is running, we can communicate with it. We are on the same machine, so we will use ‘localhost’ as the address of Livy.
Let’s list our open sessions
```
curl localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}
```
There are no sessions yet. Let’s create an interactive PySpark session:
```
curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
{"id":0,"state":"starting","kind":"pyspark","log":[]}
```
Session ids are incrementing numbers starting from 0, and we can reference a session later by its id.
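For building an application on top of Livy, the same calls can be made programmatically. Here is a minimal sketch using only the Python standard library; the helper names (`create_session`, `session_url`) and the `LIVY_URL` constant are our own, not part of any official Livy client:

```python
import json
from urllib import request

LIVY_URL = "http://localhost:8998"  # assumed Livy address, as in the curl calls above

def create_session(kind="pyspark", base_url=LIVY_URL):
    """POST /sessions; returns the parsed JSON, e.g. {"id": 0, "state": "starting", ...}."""
    req = request.Request(base_url + "/sessions",
                          data=json.dumps({"kind": kind}).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

def session_url(session_id, base_url=LIVY_URL):
    """Sessions are referenced later by their incrementing id."""
    return "%s/sessions/%d" % (base_url, session_id)
```

The id returned by `create_session` is all we need to keep around; every later call is built from it with `session_url`.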
Livy supports the three languages of Spark:
Kinds   | Languages
--------|----------
spark   | Scala
pyspark | Python
sparkr  | R
We check the status of the session until its state becomes idle: it means it is ready to execute snippets of PySpark:
```
curl localhost:8998/sessions/0 | python -m json.tool
{
    "id": 5,
    "kind": "pyspark",
    "log": [
        "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.",
        "15/09/03 17:44:14 INFO spark.SparkContext: Added JAR file:/home/romain/projects/hue/apps/spark/java-lib/livy-assembly.jar at http://172.21.2.198:33590/jars/livy-assembly.jar with timestamp 1441327454666",
        "15/09/03 17:44:14 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.",
        "15/09/03 17:44:14 INFO executor.Executor: Starting executor ID driver on host localhost",
        "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54584.",
        "15/09/03 17:44:14 INFO netty.NettyBlockTransferService: Server created on 54584",
        "15/09/03 17:44:14 INFO storage.BlockManagerMaster: Trying to register BlockManager",
        "15/09/03 17:44:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:54584 with 530.3 MB RAM, BlockManagerId(driver, localhost, 54584)",
        "15/09/03 17:44:15 INFO storage.BlockManagerMaster: Registered BlockManager"
    ],
    "state": "idle"
}
```
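This poll-until-idle step is easy to wrap in a helper. A sketch, not part of Livy itself: the `fetch` parameter is injectable so the loop can be exercised without a live server, and treating "error" and "dead" as terminal failure states is our assumption here.

```python
import json
import time
from urllib import request

def wait_until_idle(session_url, timeout=120, poll=1.0, fetch=None):
    """Poll GET <session_url> until the session state becomes 'idle'."""
    if fetch is None:
        # default: a real HTTP GET; pass a stub in tests
        fetch = lambda url: json.load(request.urlopen(url))
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = fetch(session_url).get("state")
        if state == "idle":
            return True
        if state in ("error", "dead"):
            raise RuntimeError("session failed with state %r" % state)
        time.sleep(poll)
    raise TimeoutError("session not idle after %s seconds" % timeout)
```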
In YARN mode, Livy creates a remote Spark Shell in the cluster that can be accessed easily with REST
When the session state is idle, it means it is ready to accept statements! Let’s compute 1 + 1:
```
curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'
{"id":0,"state":"running","output":null}
```
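The same POST in Python, again as a stdlib-only sketch (the helper names and the `LIVY_URL` constant are our own):

```python
import json
from urllib import request

LIVY_URL = "http://localhost:8998"  # assumed address, matching the curl calls

def statements_endpoint(session_id, base_url=LIVY_URL):
    """URL that new statements are POSTed to."""
    return "%s/sessions/%d/statements" % (base_url, session_id)

def post_statement(session_id, code, base_url=LIVY_URL):
    """Submit a snippet; Livy replies immediately with state 'running'."""
    req = request.Request(statements_endpoint(session_id, base_url),
                          data=json.dumps({"code": code}).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```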
We check the result of statement 0 when its state becomes available:
```
curl localhost:8998/sessions/0/statements/0
{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"2"}}}
```
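Pulling the actual value out of that response is a matter of walking the JSON shown above. A small helper sketch (the function name is ours):

```python
def statement_result(statement_json):
    """Return the text/plain payload of a finished statement, or None if still running."""
    if statement_json.get("state") != "available":
        return None  # keep polling
    output = statement_json["output"]
    if output.get("status") != "ok":
        raise RuntimeError("statement failed: %r" % output)
    return output["data"]["text/plain"]
```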
Statement ids are incrementing and all statements share the same context, so we can have sequences
```
curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a = 10"}'
{"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":""}}}
```
Spanning multiple statements
```
curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a + 1"}'
{"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"11"}}}
```
Let’s close the session to free up the cluster. Note that Livy will automatically expire idle sessions after one hour (configurable).
```
curl localhost:8998/sessions/0 -X DELETE
{"msg":"deleted"}
```
Let’s say we want to create a shell running as the user bob; this is particularly useful when multiple users are sharing a Notebook server:
```
curl -X POST --data '{"kind": "pyspark", "proxyUser": "bob"}' -H "Content-Type: application/json" localhost:8998/sessions
{"id":0,"state":"starting","kind":"pyspark","proxyUser":"bob","log":[]}
```
All the properties supported by Spark shells, like the number of executors or the amount of memory, can be set at session creation. Their format is the same as when typing spark-shell -h.
```
curl -X POST --data '{"kind": "pyspark", "numExecutors": "3", "executorMemory": "2G"}' -H "Content-Type: application/json" localhost:8998/sessions
{"id":0,"state":"starting","kind":"pyspark","numExecutors":"3","executorMemory":"2G","log":[]}
```
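Building those session-creation bodies by hand gets error-prone; a small payload builder keeps them consistent. This is a sketch of our own (not a Livy API), which assumes extra keyword arguments map one-to-one to the properties shown above:

```python
import json

def session_payload(kind="pyspark", proxy_user=None, **spark_opts):
    """Build the JSON body for POST /sessions.

    Extra keyword arguments pass through as Spark shell properties,
    e.g. numExecutors="3", executorMemory="2G".
    """
    body = {"kind": kind}
    if proxy_user is not None:
        body["proxyUser"] = proxy_user
    body.update(spark_opts)
    return json.dumps(body)
```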
And that’s it! Next time we will explore some more advanced features like the magic keywords for introspecting data or printing images. Then, we will detail how to do batch submissions in compiled Scala, Java or Python (i.e. jar or py files).
The architecture of Livy was presented for the first time at Big Data Scala by the Bay last August and next updates will be at the Spark meetup before Strata NYC and Spark Summit in Amsterdam.
Feel free to ask any questions about the architecture or usage of the server in the comments, @gethue or the hue-user list. And pull requests are always welcome!
Hue Team
To unsubscribe from this group and stop receiving emails from it, send an email to hue-user+u...@cloudera.org.
I'd be happy to contribute things back upstream to the Jupyter project, but we'll have to save that for another time.

Hello Alejandro,
1) Our first step is that we want Livy to be able to talk to the various Jupyter kernels. This would enable Livy to work with things we won't have time to implement. You can see all 49 kernels here that we'd be able to support:
https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
2) Yes, the introspection is done by magic commands that are executed on the livy-repl inside the interpreter. The livy-server already returns JSON, and so does the livy-repl.
3) Right now a session has a proxy user, which is used by YARN in order to grant privileges and limit resources for a particular user. We're still working on a real security story.
And WOW! https://github.com/jupyter-incubator/sparkmagic is awesome! Please let us know how we can help out with it! I only *briefly* thought about how Livy could be used as a proxy for Jupyter. It's awesome you've already started going down that route!
'{"total_statements":1,"statements":[{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"Pi is roughly 3.14336"}}}]}'
```
curl -X POST --data '{"file": "/user/romain/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches
```
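The same batch submission can be sketched in Python. Stdlib-only, and the helper names are ours; we assume the file path must be visible to the cluster (e.g. an HDFS path) and that extra configuration keys pass through the JSON body unchanged:

```python
import json
from urllib import request

def batch_payload(file_path, **conf):
    """JSON body for POST /batches; file_path should be cluster-visible."""
    body = dict(conf)
    body["file"] = file_path
    return json.dumps(body)

def submit_batch(file_path, base_url="http://localhost:8998", **conf):
    """Submit a jar or .py file as a non-interactive batch job."""
    req = request.Request(base_url + "/batches",
                          data=batch_payload(file_path, **conf).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```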