REST client always streaming?

Josh Adell

May 6, 2012, 11:55:45 AM
to Neo4j
Question for other Neo4j REST client library authors and people
familiar with the REST server internals:

I'm really liking the performance increase of streaming Cypher. I'd
like to make it the default (and only method) for Cypher results in
Neo4jPHP.

Unfortunately, I made some design choices in Neo4jPHP that make it
difficult to tune my request headers on a per-request basis. I did
some initial tests against 1.5, 1.6, 1.7 and 1.8 Neo4j instances, and
none of them seem to mind passing the "Accept:
application/json;stream=true" header to all the endpoints; it looks
like they will just ignore the "stream=true" part.
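
For illustration, a minimal Python sketch of sending the same Accept
header on every request, whatever the endpoint. The build_request
helper, the host and the paths are assumptions made up for the example;
they are not taken from Neo4jPHP or any other client library.

```python
import json
import urllib.request

# Header value quoted in the post; the servers tested were observed to
# ignore the stream parameter on endpoints that do not support it.
STREAMING_ACCEPT = "application/json;stream=true"


def build_request(url, payload=None):
    """Build a request that always asks for a streamed JSON response.

    Hypothetical helper: it only centralizes the headers so that no
    per-request tuning is needed.
    """
    data = None
    headers = {"Accept": STREAMING_ACCEPT}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers)


# The same helper serves a Cypher call and a plain node lookup.
# Host and paths are placeholders.
cypher = build_request(
    "http://localhost:7474/db/data/cypher",
    {"query": "START n=node(*) RETURN n", "params": {}},
)
node = build_request("http://localhost:7474/db/data/node/1")
```

Nothing is sent until the request object is passed to an opener, so the
sketch can be inspected without a running server.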

So the questions are:

1) Despite being a bit semantically incorrect, should I just pass
"stream=true" in the Accept header to every endpoint, regardless of
server version?

2) Are there any compatibility, functionality or performance concerns
I'm not seeing?

3) For library authors: do you give your users a choice to use
streaming or not? Or is that an implementation detail that the library
hides from the user?

Obviously, the question is open to anybody, but I'm particularly
interested in how other library authors handled this.

-- Josh

Aseem Kishore

May 6, 2012, 2:18:56 PM
to ne...@googlegroups.com
I'm the maintainer of the Node.js client library.

I haven't had a chance to check out streaming Cypher yet, but it's great to hear that it's nothing but positive. =)

I'd love answers to (1) and (2) also.

I can't answer (3) myself without understanding streaming Cypher a bit more, so here are some additional questions from my side:

(4) Is it accurate to say that streaming Cypher just means that Neo4j returns JSON as a stream instead of all at once?

(5) Is it also accurate to say that a client library that waits for the entire HTTP response to finish before parsing the JSON will continue to work just fine?

If (5) is indeed the case, I don't see any reason the user should have to be concerned with streaming Cypher; my answer to (3) then would be no, that's an implementation detail (a perf optimization).
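
A small Python sketch of the situation in (5), assuming the streamed
body is still one JSON document: a client that buffers every chunk and
parses only at the end never notices how the server produced the bytes.
The chunk boundaries and the read_whole_response name are invented for
the example.

```python
import json


def read_whole_response(chunks):
    """Join all body chunks, then parse once the response is complete."""
    body = b"".join(chunks)
    return json.loads(body.decode("utf-8"))


# Simulated chunk boundaries; a real HTTP library would supply these.
chunks = [b'{"columns": ["name"], "da', b'ta": [["Alice"], ', b'["Bob"]]}']
result = read_whole_response(chunks)
```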

And just to understand this further:

(6) Can client libraries theoretically take advantage of streaming Cypher even more by also parsing JSON results (and potentially returning them to clients) as they stream in rather than all at the end?

If (6) is also the case, then it seems client libraries should have this as an enhancement option -- let the developer handle result rows as they come in rather than at the end.

Thanks guys,

Aseem

Nigel Small

May 6, 2012, 6:14:33 PM
to ne...@googlegroups.com
Hi Josh

I believe the streaming method will eventually dominate and render the previous method redundant. So, except for reasons of backward compatibility, I can't see any reason why there would be a need for a slower, less flexible method in the long-term. To that end, I don't believe a choice should be given to library users. This allows us to maintain a consistent interface over the longer term while avoiding the need for extra complexity that the user probably wouldn't care about, possibly wouldn't understand and would almost certainly hardly ever need. There also seems little harm in passing "stream=true" to every request - it's certainly my plan to do so.

I've not found any compatibility issues either; so far, everything has been remarkably smooth. I am, however, still looking at some performance stats, although nothing so far has indicated that streaming is anything but a vast improvement :-)

Nige

Josh Adell

May 6, 2012, 6:35:32 PM
to Neo4j
Hey Aseem,

As far as #5 goes, that's the way Neo4jPHP currently works in my
experiments (i.e., wait for the entire stream to finish and parse the
results.) I'm not even aware of a PHP library that does true streaming
HTTP, so Neo4jPHP will continue to do that for the foreseeable
future. The performance gain is entirely on the server side, and it's
a vast improvement (returning ~10000 rows in ~1 second with streaming
and ~3.5 seconds without streaming.)

I would love to see #6 be a reality, but you really have to trust that
the server will send well-formed JSON if you want to start parsing it
before the full JSON document is received. For me, that would also
mean writing my own JSON parser, as PHP's built-in parser expects a
fully-formed document to begin with.

Anyway, great additional questions! Thanks.

-- Josh

Josh Adell

May 6, 2012, 6:39:29 PM
to Neo4j
Nigel,
That's what I was leaning towards; I'm just curious how others handled
it. Does py2neo wait to receive the entire JSON document before
parsing, or is it parsing a partial JSON document as it streams in? If
the latter, did you write your own parser for that, or is there
already a Python library that parses partial JSON?

-- Josh

Aseem Kishore

May 6, 2012, 6:49:35 PM
to ne...@googlegroups.com
Great, thanks Josh!

You might be able to find a PHP library that can parse JSON streams. I haven't used any myself, but there certainly exist several out there across many platforms.

Besides searching Google, this SO question has lots of input: http://stackoverflow.com/questions/444380/is-there-a-streaming-api-for-json

Cheers,
Aseem

Nigel Small

May 6, 2012, 7:26:20 PM
to ne...@googlegroups.com
Finding client libs that supported streaming was my biggest challenge. For HTTP, I settled on tornado, which allows a streaming callback that is called each time a new chunk is received. JSON was more of a problem, though. I've had to put together some code which decodes one line at a time, incrementally building up the complete document. It's a bit messy and relies on the content being pretty-printed, but it seems to do the job. The code is at:

https://github.com/nigelsmall/py2neo/blob/7195b2463460b8980e02dd41b65529883623eb9f/src/py2neo/cypher.py#L122

I'm considering rebuilding a streaming JSON parser outside of the main code but haven't had the time so far. I would certainly prefer not to be limited to decoding only after the whole thing has been received; otherwise I'm missing a potential performance benefit.

On top of this, the entire interface to Cypher execution has changed in py2neo 1.2. There are now callbacks in place, the main one of which is called each time a new row has been received from a query. This allows the application to begin to use the response before it has completely arrived. There's another callback for the metadata (currently only columns) which unfortunately always seems to kick off *after* the rows have been received since the column data follows the row data in the response. It would be nice to have the columns arrive first so that tabulated output could be produced in order (for example).

Nige

Peter Neubauer

May 7, 2012, 1:27:34 AM
to ne...@googlegroups.com

Nigel,
The order thing sounds like a good improvement issue. Great discussion!

Michael Hunger

May 7, 2012, 4:56:54 AM
to ne...@googlegroups.com
Great discussion,

thanks for all the input.

There is only one breaking change when it comes to streaming (and passing stream=true): exceptions and other errors _might_ occur only after the fact, i.e. after the first data has already been streamed, so they won't be reflected in the header.

This is especially true for cypher, the batch-rest-api and traversals, not so much for other calls.

For the batch-rest-API, a command that fails aborts the operation, and its status code and error message are included in the result payload.

Regarding a better streaming-friendly format:

I would like to change the streaming cypher format into a stream of fully formed json objects:
first: a header (contains the columns and perhaps query and parameters)
then: n rows (each with the data, or an error object that aborts the query)
last: a footer (total rows, time taken, other metadata)

It is what you'd get in the batch-rest API by leaving off the first and last "[" "]".
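
A rough Python sketch of how a client could consume such a stream of
back-to-back JSON values as chunks arrive, using the standard library's
json.JSONDecoder.raw_decode. The JsonValueStream class is hypothetical,
and it assumes every top-level value is an object or an array (a bare
number split across two chunks could be misread).

```python
import json


class JsonValueStream:
    """Incrementally decode a stream of concatenated JSON values."""

    def __init__(self):
        self._decoder = json.JSONDecoder()
        self._buffer = ""

    def feed(self, chunk):
        """Add a text chunk and return every value completed so far."""
        self._buffer += chunk
        values = []
        while True:
            # raw_decode does not skip leading whitespace itself.
            pending = self._buffer.lstrip()
            if not pending:
                self._buffer = ""
                break
            try:
                value, end = self._decoder.raw_decode(pending)
            except json.JSONDecodeError:
                # Treated as "incomplete, wait for more data". A real
                # client would also need a way to detect malformed input.
                self._buffer = pending
                break
            values.append(value)
            self._buffer = pending[end:]
        return values


stream = JsonValueStream()
first = stream.feed('{"columns": ["name"')
second = stream.feed(', "age"]}\n["Alice", 33]\n["Bo')
third = stream.feed('b", 44]\n')
```

Each call to feed returns only the values that became complete with
that chunk, so rows can be handed to the application as they arrive.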

This should be much easier to consume in a streaming way; see also my test-client impl in the streaming-cypher experiment server-extension (https://github.com/neo4j-contrib/streaming-cypher/blob/master/src/mai...), esp. the callback interface.

For the changes in the batch-rest-API (which will be merged in this week) the performance gain is:
4 sec for creating 30k nodes with streaming (and almost no memory usage)
14 sec for creating 30k nodes w/o streaming (and lots of memory used)

None of these changes includes the compact format yet, which would add another performance gain, but we're not sure yet how to request that compact format:
- either with a different URI or query parameter (different representation)
- an additional or extended header field
- .... ?
- the "application/json;stream=true" value is probably also preliminary, as it is not the correct way to indicate streaming-capable clients

Michael 

Nigel Small

May 7, 2012, 11:44:56 AM
to ne...@googlegroups.com
I can see the issue with errors and agree that the only way to dynamically produce an error part way through the output would be to ensure that a series of objects were passed instead of parts of a bigger object as it is today. A couple of questions though:

1. Does each row (of data) need to be a JSON object? Would a JSON array not make more sense?
2. The header row isn't an issue but how do you delimit the footer row? There clearly cannot be a row count up-front due to the nature of the query results but how do we know that we have the footer and not just another row of data?

On the other hand, we could use an object for the header and footer and an array for each data row. That could give us something like:

{"columns": ["name", "age"]}
["Alice", 33]
["Bob", 44]
["Carol", 55]
["Dave", 66]
{"row_count": 4, "time": 1.23}

That would mean if a line starts with a "{" then it's a metadata row and if it starts with a "[" then it's a normal data row.
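
As a sketch of that rule in Python (assuming one complete JSON value
per line; the read_stream function and its callback names are invented
for the example):

```python
import json


def read_stream(lines, on_meta, on_row):
    """Dispatch each non-empty line on its first character."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("{"):
            on_meta(json.loads(line))   # metadata: header or footer
        elif line.startswith("["):
            on_row(json.loads(line))    # ordinary data row
        else:
            raise ValueError("unexpected line: %r" % line)


meta, rows = [], []
sample = [
    '{"columns": ["name", "age"]}',
    '["Alice", 33]',
    '["Bob", 44]',
    '{"row_count": 2, "time": 1.23}',
]
read_stream(sample, meta.append, rows.append)
```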

As metadata, an error could also then be rendered as:

{"columns": ["name", "age"]}
["Alice", 33]
["Bob", 44]
{"error": {"code": 999, "message": "Bad stuff happened"}}

Nige

Michael Hunger

May 7, 2012, 12:22:54 PM
to ne...@googlegroups.com
Right, when I said "object" I meant objects, arrays or any other json construct (like strings, numbers, booleans).

we could know that this is the footer by:
- it being an object instead of an array
- having a dedicated key in there that specifies the type: (e.g. type: footer , similarly type:header)
- or having an "EOF" string denoting the end of the stream after the footer
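
A short Python sketch combining the second and third options, purely as
an illustration: the "type" key, its values and the "EOF" marker are
the ideas listed above, but the exact spelling and the classify helper
are assumptions.

```python
import json


def classify(line):
    """Return (kind, value) for one line of the proposed stream."""
    value = json.loads(line)
    if value == "EOF":
        return "eof", None
    if isinstance(value, dict) and value.get("type") in ("header", "footer"):
        return value["type"], value
    # Anything else is treated as a data row.
    return "row", value


kinds = [
    classify(line)[0]
    for line in (
        '{"type": "header", "columns": ["name", "age"]}',
        '["Alice", 33]',
        '{"type": "footer", "row_count": 1}',
        '"EOF"',
    )
]
```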

Michael

Aseem Kishore

May 8, 2012, 10:04:03 AM
to ne...@googlegroups.com
Sorry, haven't caught up on the full thread, just the last part, but why won't the current JSON format work just fine? It delimits "header" info (e.g. columns) from the rows (called "data"), so you could just add further keys for "footer" info (e.g. errors, time, etc.).

{
    columns: [ ... ],
    data: [ ... ],
    error: ...
}

Where the data is still a streaming array of arrays.

One important thing IMHO would be for the entire response to still be valid JSON. What do you guys think -- do you agree with that goal?

If commas separating the rows are a concern, you can easily address it w/ Isaac Schlueter's comma-first style when streaming the JSON back. E.g. here's what the rows would look like:

data:
// first row processed...
[ [...]
// then second row...
, [...]
// then third row...
, [...]
// no more rows left
]
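
To make the comma-first idea concrete, here is a hedged Python sketch
of a writer that emits rows one at a time yet always produces a single
valid JSON document, with a trailing "error" key if the row source
fails part-way. The stream_result function is hypothetical and says
nothing about how the Neo4j server actually serializes results.

```python
import json


def stream_result(columns, rows):
    """Yield text chunks that concatenate to one valid JSON document."""
    yield '{"columns": ' + json.dumps(columns) + ',\n "data":\n'
    opener = "[ "
    error = None
    try:
        for row in rows:
            # Comma-first: the separator precedes every row but the first,
            # so no look-ahead is needed to know whether more rows follow.
            yield opener + json.dumps(row) + "\n"
            opener = ", "
    except Exception as exc:
        error = str(exc)
    if opener == "[ ":
        yield "[\n"          # no rows were written: still open the array
    yield "]"
    if error is not None:
        yield ',\n "error": ' + json.dumps(error)
    yield "\n}\n"


text = "".join(stream_result(["name", "age"], [["Alice", 33], ["Bob", 44]]))
document = json.loads(text)
```

Because each chunk is yielded as soon as its row is available, the
writer never holds more than one row in memory.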

Cheers,
Aseem

Nigel Small

May 8, 2012, 10:26:24 AM
to ne...@googlegroups.com
Why do we want to constrain ourselves to a pure JSON response? Because that's the way we've done it until now? We haven't had streaming results until now.

Is a document format appropriate for streaming results? I honestly don't believe so: one is static by nature, the other is dynamic.

If we were to design the response format without any knowledge of the current implementation, how would we go about it?

Aseem Kishore

May 8, 2012, 10:36:25 AM
to ne...@googlegroups.com
Nah, I was just saying it because that way it works with existing tools. Ideally, the content-type doesn't even need changing -- as far as HTTP is concerned, it doesn't matter whether the response is streamed or sent all at once.

I also say this because the bottleneck wasn't client libraries parsing one chunk of JSON, it was Neo4j building up all the results in memory before serializing them to JSON. That's fixed; it doesn't harm Neo4j to send back valid JSON still.

Not a big deal, just something I think could be worth maintaining since it doesn't need to have a high cost.

Aseem

Josh Adell

May 8, 2012, 10:52:52 AM
to Neo4j
I don't think a document format is appropriate for streaming results;
at least, the entire response should not be one document. For streaming,
I would want to see one result per line, with a literal \r\n between
each. Put the header with columns at the beginning and the footer at
the end with a double \r\n separating them from the data, the same way
an HTTP response separates header and body content:

{"columns": ["col1", "col2"]}\r\n
\r\n
["valA","valX"]\r\n
["valB","valY"]\r\n
["valC","valZ"]\r\n
\r\n
{"error":null,"row_count":3,"time":1.23}


An error mid-stream is the same format, but with an error indicated in
the footer, and the row count being the number of rows returned before
the error was encountered:

{"columns": ["col1", "col2"]}\r\n
\r\n
["valA","valX"]\r\n
["valB","valY"]\r\n
\r\n
{"error":{"code":123,"message":"Something bad happened"},"row_count":2,"time":1.23}


If we're streaming over HTTP, this format takes advantage of a user's
existing knowledge of how HTTP responses are formatted. It also
explicitly demarcates the header, data and footer sections of the
response; no checking if a row is an object or an array. It does not
rely on a user's language/framework of choice having a JSON parser
which can handle incomplete JSON documents, because each line is a
fully-formed JSON document. Client code is simpler because the success
case and the mid-stream error case are in the same format.
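
A minimal Python sketch of a reader for this layout, assuming CRLF line
endings and exactly one blank line between sections. The iter_events
name and the event labels are made up for the example.

```python
import json

SECTIONS = ("header", "row", "footer")


def iter_events(lines):
    """Yield (section, value) pairs; a blank line moves to the next section."""
    section = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            section += 1
            continue
        if section >= len(SECTIONS):
            raise ValueError("content after the footer section")
        # Every non-blank line is a complete JSON document on its own.
        yield SECTIONS[section], json.loads(line)


body = (
    '{"columns": ["col1", "col2"]}\r\n'
    '\r\n'
    '["valA","valX"]\r\n'
    '["valB","valY"]\r\n'
    '\r\n'
    '{"error":null,"row_count":2,"time":1.23}'
)
events = list(iter_events(body.split("\r\n")))
```

The same loop handles the success case and the mid-stream error case;
the caller only has to look at the "error" key of the footer.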

Just my thoughts.

-- Josh

Nigel Small

May 8, 2012, 11:16:33 AM
to ne...@googlegroups.com
The blank lines are an excellent idea - consistent with HTTP and no requirement to sniff the type of line being read. We would probably be best assigning this a content-type which explicitly needed "Accept"ing ... "application/vnd.neo.cypher-results" or something like that. No reason that the existing JSON format couldn't remain the default.
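
A sketch of what that negotiation might look like from a Python client,
assuming the media type name floated above (it is only a suggestion,
not a registered or implemented type). The helper names and the URL are
placeholders.

```python
import urllib.request

# Hypothetical media type from the discussion; not a registered type.
CYPHER_STREAM_TYPE = "application/vnd.neo.cypher-results"


def build_request(url):
    """Prefer the line-oriented format, fall back to plain JSON."""
    accept = CYPHER_STREAM_TYPE + ", application/json;q=0.5"
    return urllib.request.Request(url, headers={"Accept": accept})


def is_line_oriented(content_type):
    """True if the server answered with the line-oriented format."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == CYPHER_STREAM_TYPE


request = build_request("http://localhost:7474/db/data/cypher")
```

A client would check the response's Content-Type with is_line_oriented
and pick the matching parser, so servers that only know the existing
JSON format keep working.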

At the risk of bikeshedding: "\r\n", "\r" or "\n"?

Josh Adell

May 8, 2012, 11:25:53 AM
to Neo4j
Custom Accept and Content-Type headers are probably a good idea. As
for newline: the HTTP 1.1 spec
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2) mandates
CRLF for every element except the entity body. My vote is for CRLF
within and between the response sections, for consistency, but either
CRLF or just LF is acceptable.

-- Josh

Nigel Small

May 8, 2012, 11:34:35 AM
to ne...@googlegroups.com
Who am I to argue with the W3C? :-)

+1 for CRLF

