Blog post: How to use the Livy Spark REST Job Server API for doing some interactive Spark with curl


Romain Rigaux

Sep 24, 2015, 1:54:37 AM
to Hue-Users
Originally posted on http://gethue.com/how-to-use-the-livy-spark-rest-job-server-for-interactive-spark/

Livy is an open source REST interface for using Spark from anywhere.

It supports executing snippets of code or programs in a Spark Context that runs locally or in YARN. This makes it ideal for building applications or Notebooks that can interact with Spark in real time. For example, it is currently used for powering the Spark snippets of the Hadoop Notebook in Hue.

In this post we see how we can execute some Spark 1.5 snippets in Python.

 

[Figure: Livy sits between the remote users and the Spark cluster]

 

Starting the REST server

Based on the README, we check out Livy’s code. It currently lives in the Hue repository for simplicity, but will hopefully eventually graduate into its own top-level project.

git clone git@github.com:cloudera/hue.git

Then we compile Livy with

cd hue/apps/spark/java
mvn -DskipTests clean package

And start it

./bin/livy-server

Note: Livy defaults to Spark local mode. To use YARN mode, copy the configuration template file apps/spark/java/conf/livy-defaults.conf.tmpl to livy-defaults.conf and set the property:

livy.server.session.factory = yarn

 

Executing some Spark

Now that the REST server is running, we can communicate with it. We are on the same machine, so we will use ‘localhost’ as the address of Livy.

Let’s list our open sessions

curl localhost:8998/sessions

{"from":0,"total":0,"sessions":[]}

 

There are zero sessions. We create an interactive PySpark session

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","log":[]}

 

Session ids are incrementing numbers starting from 0. We can then reference the session later by its id.

Livy supports the three languages of Spark:

Kind       Language
spark      Scala
pyspark    Python
sparkr     R
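For example, creating a Scala shell instead is just a matter of passing the corresponding kind:

curl -X POST --data '{"kind": "spark"}' -H "Content-Type: application/json" localhost:8998/sessions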

 

We check the status of the session until its state becomes idle: it means it is ready to execute snippets of PySpark:

curl localhost:8998/sessions/0 | python -m json.tool

{
    "id": 0,
    "kind": "pyspark",
    "log": [
       "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.",
       "15/09/03 17:44:14 INFO ui.SparkUI: Started SparkUI at http://172.21.2.198:4040",
       "15/09/03 17:44:14 INFO spark.SparkContext: Added JAR file:/home/romain/projects/hue/apps/spark/java-lib/livy-assembly.jar at http://172.21.2.198:33590/jars/livy-assembly.jar with timestamp 1441327454666",
       "15/09/03 17:44:14 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.",
       "15/09/03 17:44:14 INFO executor.Executor: Starting executor ID driver on host localhost",
       "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54584.",
       "15/09/03 17:44:14 INFO netty.NettyBlockTransferService: Server created on 54584",
       "15/09/03 17:44:14 INFO storage.BlockManagerMaster: Trying to register BlockManager",
       "15/09/03 17:44:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:54584 with 530.3 MB RAM, BlockManagerId(driver, localhost, 54584)",
       "15/09/03 17:44:15 INFO storage.BlockManagerMaster: Registered BlockManager"
    ],
    "state": "idle"
}
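If you are scripting this, a minimal polling sketch (assuming session id 0, and using python to pull out the state field):

while [ "$(curl -s localhost:8998/sessions/0 | python -c 'import sys, json; print(json.load(sys.stdin)["state"])')" != "idle" ]; do
    sleep 1
done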

 

[Figure: In YARN mode, Livy creates a remote Spark Shell in the cluster that can be accessed easily with REST]

 

When the session state is idle, it means it is ready to accept statements! Let’s compute 1 + 1

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'

{"id":0,"state":"running","output":null}

We check the result of statement 0 when its state is available

curl localhost:8998/sessions/0/statements/0

{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"2"}}}

Statement ids are incrementing too, and all statements share the same context, so we can have a sequence

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a = 10"}'

{"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":""}}}

Spanning multiple statements

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a + 1"}'

{"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"11"}}}

 

Let’s close the session to free up the cluster. Note that Livy will automatically expire idle sessions after 1 hour (configurable).

curl localhost:8998/sessions/0 -X DELETE

{"msg":"deleted"}

 

Impersonation

Let’s say we want to create a shell running as the user bob. This is particularly useful when multiple users are sharing a Notebook server

curl -X POST --data '{"kind": "pyspark", "proxyUser": "bob"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","proxyUser":"bob","log":[]}

Additional properties

All the properties supported by the Spark shells, like the number of executors, the memory, etc., can be set at session creation. Their format is the same as when typing spark-shell -h

curl -X POST --data '{"kind": "pyspark", "numExecutors": "3", "executorMemory": "2G"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","numExecutors":"3","executorMemory":"2G","log":[]}

 

And that’s it! Next time we will explore some more advanced features like the magic keywords for introspecting data or printing images. Then, we will detail how to do batch submissions in compiled Scala, Java or Python (i.e. jar or py files).
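As a preview, batch submissions go through the /batches endpoint with a similar payload; a sketch, assuming /user/romain/pi.py is a path the cluster can read (e.g. uploaded to HDFS):

curl -X POST --data '{"file": "/user/romain/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches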

The architecture of Livy was presented for the first time at Big Data Scala by the Bay last August, and the next updates will be at the Spark meetup before Strata NYC and at Spark Summit in Amsterdam.

 

Feel free to ask any questions about the architecture or usage of the server in the comments, @gethue, or on the hue-user list. And pull requests are always welcome!

 Hue Team


agg....@gmail.com

Sep 25, 2015, 2:05:36 AM
to Hue-Users
Romain, this is an awesome post! =) Thanks for sharing.

In the github repo you mention that after executing a statement you might get text or json. I started reading the code and could not find where it is that you decide to return text or json. When do you return json? For Spark dataframes? So far, every single statement execution I've done returns text/plain.

Also, I was reading the magics code and saw that you return a table made from json. You do mention that you'll cover the magic keywords for introspecting data in a future post, but I'm currently playing with Livy and its output and would love to continue coding. Any pointers are appreciated.

Thanks!

Romain Rigaux

Sep 25, 2015, 2:37:19 PM
to Alejandro Guerrero, Hue-Users
Glad to hear!

When executing Scala or Python you get text by default like in the shells.

Similarly to Jupyter (we are using the same concepts and it will be pluggable at some point), if you start a line with

%table

it will try to introspect and return json data like when using a SQL snippet, e.g.

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
%table a

or look in this post:
http://gethue.com/bay-area-bike-share-data-analysis-with-spark-notebook-part-2/

Romain


agg....@gmail.com

Sep 25, 2015, 4:32:13 PM
to Hue-Users, agg....@gmail.com
Thanks for the answer Romain!

That makes sense about getting text for Scala and Python. Some things I'd like to clarify:

1. What does it mean that you are using the same concepts as Jupyter and that it'll be pluggable in the future? Are you planning to do any work to integrate Jupyter and Livy? I understood your comment as pluggable in Hue Spark notebook, but just wanted to make sure I got it right.
2. The introspection done by magics is not available from livy-server right? It's only available from livy-repl from the code I'm reading. Are you planning to make the server return json at some point?
3. How are sessions/proxy-users semantically different?

To give you some context: I am currently working to make Jupyter work remotely with Spark through Livy, and I want to render automatic rich visualizations of dataframes. Here's the proposal we've submitted for incubation, with some code: https://github.com/jupyter-incubator/sparkmagic

Best,
Alejandro
...

Erick Tryzelaar

Sep 25, 2015, 4:44:42 PM
to Alejandro Guerrero, Hue-Users
Hello Alejandro,

1) Our first step is that we want Livy to be able to talk to the various Jupyter kernels. This would enable Livy to work with things we won't have time to implement. You can see all 49 kernels here that we'd be able to support:

https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages

I'd be happy to contribute things back upstream to the Jupyter project, but we'll have to save that for another time.

2) Yes, the introspection is done by magic commands that are executed on the livy-repl inside the interpreter. The livy server already returns JSON, and the livy-repl does too.

3) Right now a session has a proxy user, which is used by YARN in order to grant privileges and limit resources for a particular user. We're still working on a real security story.

And WOW! https://github.com/jupyter-incubator/sparkmagic is awesome! Please let us know how we can help out with it! I only *briefly* thought about how Livy could be used as a proxy for Jupyter. It's awesome you've already started going down that route!

-Erick




agg....@gmail.com

Sep 25, 2015, 6:01:15 PM
to Hue-Users, agg....@gmail.com
Erick, this is thrilling news! I've added comments in-line.

Thanks!


On Friday, September 25, 2015 at 1:44:42 PM UTC-7, Erick Tryzelaar wrote:
Hello Alejandro,

1) Our first step is that we want Livy to be able to talk to the various Jupyter kernels. This would enable Livy to work with things we won't have time to implement. You can see all 49 kernels here that we'd be able to support:

https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages

I'd be happy to contribute things back upstream to the Jupyter project, but we'll have to save that for another time.
Maybe the Jupyter community would be willing to help. Have you thought about running this as an incubator project? Just floating the idea out there. Incubation could happen in an external repo, but I'm sure the community would love to hear about the idea.


2) Yes, the introspection is done by magic commands that are executed on the livy-repl inside the interpreter. The livy server already returns JSON. The livy-repl also does too.
What curl command should I execute to get JSON back?  I haven't been able to do it so far :(


3) Right now a session has a proxy user, which is used by YARN in order to grant privileges and limit resources for a particular user. We're still working on a real security story.
I need more time to understand this statement, but I'll do that in the next couple of days. Thanks! 

And WOW! https://github.com/jupyter-incubator/sparkmagic is awesome! Please let us know how we can help out with it! I only *briefly* thought about how Livy could be used as a proxy for Jupyter. It's awesome you've already started going down that route!
Thank you so much for the offer. I'll come back with questions in this forum :P
 
...

Erick Tryzelaar

Sep 25, 2015, 6:07:46 PM
to Alejandro Guerrero, Hue-Users
We haven't considered incubation, but I'll talk with the rest of the team. On the curl commands, you should be able to do something like:

% curl -i -XPOST -H 'Content-Type: application/json' localhost:8998/sessions -d '{"kind": "spark"}'

I also have some python-by-way-of-requests here that uses the JSON api. It also describes the API and the structure of the request and response objects:

https://github.com/cloudera/hue/tree/master/apps/spark/java




agg....@gmail.com

Sep 25, 2015, 7:12:44 PM
to Hue-Users, agg....@gmail.com
Thanks Erick!

If I execute that curl command, or any statement command, I get a response like the following. My point is that I've never been able to get application/json instead of text/plain in the statement output object, as described by https://github.com/cloudera/hue/tree/master/apps/spark/java#statement-output

'{"total_statements":1,"statements":[{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"Pi is roughly 3.14336"}}}]}'

Is there a way to get application/json as the mime type of the data part of statement output?

Best,
Alejandro
...

Erick Tryzelaar

Sep 25, 2015, 7:32:53 PM
to Alejandro Guerrero, Hue-Users
Hello Alejandro,

Ah, you're talking about the content type of the actual data. Right now only the spark backend has support for jsonifying data, but it'd be pretty simple to add to the python backend. You need to do this:

```
➜  ~  curl -i -XPOST -H 'Content-Type: application/json' localhost:8998/sessions -d '{"kind": "spark"}'
HTTP/1.1 201 Created
Date: Fri, 25 Sep 2015 23:27:50 GMT
Content-Type: application/json; charset=UTF-8
Location: /0
Transfer-Encoding: chunked
Server: Jetty(9.2.z-SNAPSHOT)

{"id":0,"state":"starting","kind":"spark","log":[]}%                                                                                                                  ➜  ~  curl -XPOST -H 'Content-Type: application/json' localhost:8998/sessions/0/statements -d '{"code": "val x = 1\n%json x"}' | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    70    0    40  100    30    147    110 --:--:-- --:--:-- --:--:--   147
{
    "id": 0,
    "output": null,
    "state": "running"
}
➜  ~  curl -XGET -H 'Content-Type: application/json' localhost:8998/sessions/0/statements | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   141    0   141    0     0   8965      0 --:--:-- --:--:-- --:--:--  9400
{
    "statements": [
        {
            "id": 0,
            "output": {
                "data": {
                    "application/json": 1
                },
                "execution_count": 0,
                "status": "ok"
            },
            "state": "available"
        }
    ],
    "total_statements": 1
}
```


Both the pyspark and spark backends do support what we call the `%table` magic, which will convert types into a table, with a header that contains the column type and field name:


```
➜  ~  curl -i -XPOST -H 'Content-Type: application/json' localhost:8998/sessions -d '{"kind": "pyspark"}'
HTTP/1.1 201 Created
Date: Fri, 25 Sep 2015 23:31:47 GMT
Content-Type: application/json; charset=UTF-8
Location: /0
Transfer-Encoding: chunked
Server: Jetty(9.2.z-SNAPSHOT)

{"id":0,"state":"starting","kind":"pyspark","log":[]}%                                                                                                                ➜  ~  curl -XPOST -H 'Content-Type: application/json' localhost:8998/sessions/0/statements -d '{"code": "x = [[1,2],[3,4]]\n%table x"}' | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    79    0    40  100    39    150    146 --:--:-- --:--:-- --:--:--   150
{
    "id": 0,
    "output": null,
    "state": "running"
}
➜  ~  curl -XGET -H 'Content-Type: application/json' localhost:8998/sessions/0/statements | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   254    0   254    0     0  14108      0 --:--:-- --:--:-- --:--:-- 14941
{
    "statements": [
        {
            "id": 0,
            "output": {
                "data": {
                    "application/vnd.livy.table.v1+json": {
                        "data": [
                            [
                                1,
                                2
                            ],
                            [
                                3,
                                4
                            ]
                        ],
                        "headers": [
                            {
                                "name": "0",
                                "type": "INT_TYPE"
                            },
                            {
                                "name": "1",
                                "type": "INT_TYPE"
                            }
                        ]
                    }
                },
                "execution_count": 0,
                "status": "ok"
            },
            "state": "available"
        }
    ],
    "total_statements": 1
}
```



Erick Tryzelaar

Sep 25, 2015, 7:43:39 PM
to Alejandro Guerrero, Hue-Users
Just pushed up a patch for review that adds `%json variable` to the python backend: https://review.cloudera.org/r/5998/
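
A hypothetical usage sketch, mirroring the Scala `%json` example above (assuming a pyspark session with id 0 and that the magic behaves the same way):

```
curl -XPOST -H 'Content-Type: application/json' localhost:8998/sessions/0/statements -d '{"code": "x = [1, 2, 3]\n%json x"}'
```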

Erick Tryzelaar

Sep 25, 2015, 7:50:16 PM
to Alejandro Guerrero, Hue-Users
And landed in master :)

agg....@gmail.com

Sep 25, 2015, 7:53:55 PM
to Hue-Users, agg....@gmail.com
Hahah oh Erick, you're killing me here :) That was fast! I'll go play with it now.

One more question: what advantages do you see in enabling the kernels from Livy? Were there alternatives you were thinking of?

Thanks!
Alejandro

...

MAKE ME

Oct 23, 2017, 11:11:04 PM
to Hue-Users, agg....@gmail.com
I'm getting the error "requirement failed: local path cannot be added to user sessions" when using curl:

curl -X POST --data '{"file": "/user/romain/pi.py"}' -H "Content-Type: application/json" localhost:8998/batches