Stream tweaks proposal

Stream tweaks proposal Dominic 7/15/12 7:25 PM
I've been writing and using a lot of streams lately,
and have some proposals for some small changes to node's streams.

https://gist.github.com/3117184

in some parts it's a tightening up of expected behaviour,
in others it's a few small changes to Stream#pipe.

cheers, Dominic
Re: [node-dev] Stream tweaks proposal Isaac Schlueter 7/26/12 9:57 PM
I think the real problem is with readable streams.  Writable streams
are pretty easy to implement, but the readable side is unnecessarily
difficult and too easy to get wrong.

Here's a proposal in code: https://github.com/isaacs/readable-stream

The readme explains the position, and the implementation makes it easy
to play around with.  A base class and a fs readable stream class are
provided.  Note that Readable.pipe() is very short and simple :)
Re: Stream tweaks proposal Bruno Jouhier 7/26/12 10:50 PM
A more radical proposal would be to switch to a callback-style API.

See http://bjouhier.wordpress.com/2012/07/04/node-js-stream-api-events-or-callbacks/

Isaac's proposal is a move in the same direction (a read method). The main difference is that Isaac uses a completion "event" while I use a completion "callback".

A point that I did not mention in my post: the callback API makes it trivial to implement an "unread" method. Handy when you need to look ahead when parsing input.

Bruno
Re: [node-dev] Re: Stream tweaks proposal Dominic 7/27/12 3:58 AM
@isaacs, your ReadableStream#pipe does not have any error handling yet,
so it won't be quite so simple once it has that.
Re: [node-dev] Stream tweaks proposal Joran Greef 7/27/12 5:38 AM
Looks good.

How often in practice are multiple on("end") and on("readable") events going to be attached?

Would prefer to just have plain vanilla stream.onEnd = function() {} and stream.onReadable = function() {} callbacks.
Re: [node-dev] Stream tweaks proposal Dominic 7/27/12 6:03 AM
EventEmitter is optimised under the hood, so that if you only ever add
one listener, that is pretty much what happens.

adding a listener:

https://github.com/joyent/node/blob/master/lib/events.js#L139-141

calling the only event listener:

https://github.com/joyent/node/blob/master/lib/events.js#L74-99

there are many times you need extra event listeners. when debugging,
for example, or maybe you want to register another end listener
because you want to be notified when the stream has completed, so that
you can do something else.
Re: [node-dev] Stream tweaks proposal Dominic 7/27/12 6:08 AM
note, there is a pull request here

https://github.com/isaacs/readable-stream/pull/1

to re-add error handling, & clean up.
Re: [node-dev] Stream tweaks proposal oliver 7/27/12 6:22 AM
On Fri, Jul 27, 2012 at 2:38 PM, Joran Greef <jo...@ronomon.com> wrote:
> How often in practice are multiple on("end") and on("readable") events going
> to be attached?
>
> Would prefer to just have plain vanilla stream.onEnd = function() {} and
> stream.onReadable = function() {} callbacks.

'in practice' is a bit problematic here, as (IMO) we still have to
discover all the possibilities the nodejs stream concept gives us, and
what (at the end of that discovery phase) turns out to be
practical/useful/powerful/sane.

For example, I'm writing a module that chains controllers. Every
controller has three stream objects. Some of them must react to the
end of the streams of other controllers (eg to flush buffered data). It's
hard for me to explain the module as a whole (proof of concept: the C
in MVC with streams?), but this would definitely not work if I could
not bind multiple callbacks to one 'end' event.
Re: [node-dev] Stream tweaks proposal Mikeal Rogers 7/27/12 11:23 AM
readable will get called a lot.

multiple end listeners is rather common.
Re: [node-dev] Stream tweaks proposal Mikeal Rogers 7/27/12 12:03 PM
After several conversations about this I have identified one thing we lose with this change.

Because the current implementation makes streams start emitting data right after creation (and they could start doing so at any time after nextTick()), it's required that all streams be connected within a single tick.

In request, and other streams, we make the assumption that all stream behavior was set up in a single tick, which is how this works:

```
request.get(url, function (e, resp, body) {})
fs.createReadStream(file).pipe(request.put(url))
request.get(url).pipe(fs.createWriteStream(file))
```

I need to know before I start the underlying http client if this has a source and/or a destination stream. If there is no source stream I must call end() on the underlying http client. Now, we do get out of needing to know the destination stream ahead of time **if** I have a way to know that I shouldn't be buffering the data for a callback. I think I could do this by simply removing a feature I used to have, where that callback became a callback on "response" and did not include buffered data.

The new stream style rids us of the pause/resume dance but it also means that a stream can be created and just hang out for many cycles before getting piped to something else. That means we lose an assumption we used to have and that we have spent some time building on.

Now, it's very uncommon that you would create a stream and then pipe **to** it some number of cycles later. But, over time I think that people will be used to the "magic" of these new streams that can be created and piped around whenever and will step on this.

This is not meant to derail or stop any of the changes proposed, it's simply an assumption we used to make and have come to rely on that cannot be polyfilled and will break whenever this gets released.

-Mikeal
Re: [node-dev] Stream tweaks proposal Dominic 7/27/12 12:32 PM
isaacs mentioned to me in irc that he wanted to implement a
ReadableStream, but leave Stream exactly as it was.

so, have two types of stream.

there is another thing this would break, it would be impossible to
stream a .read() based ReadableStream to multiple destinations...
Although, if regular Streams still exist, then you'll still have that
if you need it.
Re: [node-dev] Stream tweaks proposal Bruno Jouhier 7/27/12 12:41 PM
 

> Now, it's very uncommon that you would create a stream and then pipe **to** it some number of cycles later. But, over time I think that people will be used to the "magic" of these new streams that can be created and piped around whenever and will step on this.


It is not so uncommon. A number of people have hit this problem with the following scenario:

1) http server dispatches POST request
2) request is authenticated
3) request is routed
4) POSTed data is piped to destination.

The problem is that step 2 (authentication) usually needs to access some kind of database. So, the stream (request) is created at step 1 and piped at step 4, with I/O ticks at step 2.

You cannot assume that you'll always have all the elements to decide what to do with a stream in the tick that creates the stream. Sometimes you need to read some header data from the stream, do some I/O to gather related data that will tell you what to do with the stream, and only then process the stream (pipe it).
Re: [node-dev] Stream tweaks proposal Tim Caswell 7/28/12 12:05 PM
FWIW, I actually like Bruno's proposal.  It doesn't cover all the use
cases, but it makes backpressure-enabled pumps really easy.

One missing use case that's easy to add: when consuming a binary
protocol, I often want only part of the input.  For example, I might
want to get the first 4 bytes, decode that as a uint32 length header,
and then read n more bytes for the body.  Without being able to
request how many bytes I want, I have to handle putting data I don't
need back in the stream.  That's very error-prone and tedious.
So on the read function, add an optional "maxBytes" or "bytes"
parameter.  The difference is that in the maxBytes case, I want the
data as soon as there is anything, even if it's less than the number
of bytes I want.  In the "bytes" case I want to wait till that many
bytes are available.  Both are valid for different use cases.

Also streams (both readable and writable) need a configurable
low-water mark.  I don't want to wait till the pipe is empty before I
start piping data again.  This mark would control how soon writable
streams called my write callback and how much readable streams would
readahead from their data source before waiting for me to call read.
I want to keep it always full.  It would be great if this was handled
internally in the stream and consumers of the stream simply configured
what the mark should be.
Re: [node-dev] Stream tweaks proposal Mikeal Rogers 7/28/12 12:14 PM

On Jul 28, 2012, at 12:05 PM, Tim Caswell <t...@creationix.com> wrote:

> FWIW, I actually like Bruno's proposal.  It doesn't cover all the use
> cases, but it makes backpressure enabled pumps really easy.
>
> One use case missing that's easy to add is when consuming a binary
> protocol, I often only want part of the input.  For example, I might
> want to get the first 4 bytes, decode that as a uint32 length header
> and then read n more bytes for the body.  Without being able to
> request how many bytes I want, I have to handle putting data back in
> the stream that I don't need.  That's very error prone and tedious.
> So on the read function, add an optional "maxBytes" or "bytes"
> parameter.  The difference is in the maxBytes case, I want the data as
> soon as there is anything, even if it's less than the number of bytes
> I want.   In the "bytes" case I want to wait till that many bytes are
> available.  Both are valid for different use cases.

The early stuff I saw included a "length" option.

>
> Also streams (both readable and writable) need a configurable
> low-water mark.  I don't want to wait till the pipe is empty before I
> start piping data again.  This mark would control how soon writable
> streams called my write callback and how much readable streams would
> readahead from their data source before waiting for me to call read.
> I want to keep it always full.  It would be great if this was handled
> internally in the stream and consumers of the stream simply configured
> what the mark should be.

I think you're missing how this works. Nobody automatically asks for data so watermarks aren't strictly necessary. You ask for data if it's available and you read as much as you can handle.

There is no "readahead". If someone stops calling read() then the buffer fills and, if it's a TCP stream, it's asked to stop sending data.

Remember that when the "readable" event goes off it's expected that the pending data is read in the same event loop cycle.




Re: [node-dev] Stream tweaks proposal Tim Caswell 7/28/12 12:22 PM
In any backpressure case where the data provider is faster than the
consumer, there will be a series of pauses and resumes to slow down
the provider.  That is the point of backpressure.

So in these discussions we need to assume that the pauses and resumes
will be happening.  What triggers a pause, and what triggers a resume?
Will there ever be a case where the pipe is empty and stalled?  That
is what I'm trying to avoid.  It kills throughput.

>  If someone stops calling read() then the buffer fills and, if it's a TCP stream, it's asked to stop sending data.

When does it ask the tcp stream to stop sending data?  And when do we
tell the tcp stream to start sending data again?  How long does it take
for the tcp stream to get this message and start sending the data?  If
we wait till the buffer is 100% empty and new data isn't loaded
immediately, then it's a stall.  This is where a low-water mark keeps
things running smoothly.

Re: [node-dev] Stream tweaks proposal Mikeal Rogers 7/28/12 12:38 PM
pause and resume at the stream layer are gone.

the TCP socket will get paused if "readable" is emitted and read() is not called. it will be resumed when read() is called again.

the amount of data that is available on a "readable" event is variable. the amount of data buffered is of no importance to pause/resume logic.

-Mikeal
Re: [node-dev] Stream tweaks proposal Bruno Jouhier 7/29/12 10:51 AM
@tim

The API that I used in this blog post is a simplified version of the API I implemented in streamline. I simplified it in the blog post because I just wanted to demo the equivalence between the two styles of API.

The streams module that I am using (https://github.com/Sage/streamlinejs/blob/master/lib/streams/server/streams.md) has most of the features that you saw missing:

* an optional "len" parameter in the read call.
* low and high water mark options in the ReadableStream constructor.

The "len" parameter has your "bytes" semantics and I use it exactly the way you describe (typically to read 4 bytes to get a frame length and then read N bytes for a frame). I did not implement "maxBytes" semantics because I did not need it (which does not mean it would not be useful). The thing is that all the additional bells and whistles can be implemented around the basic read(cb) call (called readChunk in my module).

I introduced low and high water mark options because I wanted to avoid a pause/resume dance around every data event when the data arrives faster than it is consumed. My assumption was that a little queue with high and low marks would reduce the number of pause/resume calls and improve performance. Basically trading a bit of space for speed. But I have to admit that I did not bench it. So, if the pause/resume dance costs very little this may be overkill.

@isaac and mikeal,

This callback proposal may sound very "anti-eventish" and it may give the impression that I'm sorta trying to eradicate events from node's APIs (nobody said it but I can see how it could be perceived this way). This is not the case. I like node's event API and I find it very elegant. But node gives us two API styles (callbacks and events) and it is not always easy to choose between the two. Here is the rationale that I use to decide between them:

My main criterion is CORRELATION. Basically, I start with the assumption that the API is event-oriented and then I analyze the degree of correlation between the various events. If the events are highly correlated, I choose the callback style. If they are loosely correlated, I keep the event style. Some examples:

* User events (browser side) are very loosely correlated => event style
* Incoming HTTP requests (server side) are also very loosely correlated => event style
* Data streams vary. If each data chunk is a complete message which is more or less independent from other messages, the event style is best. If, on the other hand, the chunks are correlated (because the whole stream has a strong internal structure, or because it has been chunked on arbitrary boundaries that don't match its internal structure), then the callback style is best.
* Confirmation events (like the "connect"/"error" events that follow a connection attempt, or a "drain" event that follows a write returning false) are fully correlated => callback style.

Also, the event style API is more powerful than the callback style API as it supports multiple listeners.
BUT:

* It is very easy to wrap a callback API with an event listener.
* Very often, in the correlated case, there is a "main" consumer which needs to correlate the events, and auxiliary consumers that don't care that much about the correlations (logging, statistics, etc). A dual API with callbacks for the main consumer and events for the auxiliary ones works great.
* Wrapping an event style API with a callback style API is a lot more difficult.
* Callback style APIs are easier to use when the events are correlated because you don't need to setup state machines to re-correlate the events.

Given this, I probably favor the callback style a lot more than most node developers. But this is not a systematic "anti-event" attitude, there is a rationale behind it and I wanted to share it with you.

Bruno
Re: [node-dev] Stream tweaks proposal Dominic 7/30/12 3:53 AM
I think there is another problem here: if creationix writes a stream
that likes to read a byte, or a line, or whatever, it means that I
can't just pipe anything into it, because he basically needs a custom
`flow` function (as per
https://github.com/isaacs/readable-stream/blob/master/readable.js#L84-95
), but that will be on the readable side.

if data is gonna be pulled off the readable stream, then the pulling
duties really belong to the puller, not the pullee.

but, I can also sense complexity rearing its ugly head. but maybe...
pipe could check for a dest.pull method, and use that instead of flow?
Re: [node-dev] Stream tweaks proposal Isaac Schlueter 7/30/12 10:02 AM
> there is another thing this would break, it would be impossible to
> stream a .read() based ReadableStream to multiple destinations...

Yeah, that use case is going to be hard.  If you want a tee-stream,
you can implement it pretty easily on top of this.  And of course, if
you use the "data" event facade, then it'll keep working just how it
does, so I don't expect existing programs to be affected too badly
until they specifically try to upgrade.

But, yes, this is a thing that's going to break in 0.9, most likely,
and someone probably will be affected by it.  The best approach is to
do it as soon as possible and document the changes.

On Sun, Jul 29, 2012 at 10:51 AM, Bruno Jouhier <bjou...@gmail.com> wrote:
> Given this, I probably favor the callback style a lot more than most node
> developers.

For several implementations, a read(n, cb) method is easier to do.
For example, on Windows, that's how we read from sockets, and on all
platforms, that's how fs.read works.  So, we *must* implement
something like this for fs read streams anyway, and if we want elegant
cross-platform support, then we'll have to do that for TCP and Pipes
as well.

The ReadableStream base class will have a read([n]) method that
returns either up-to-n bytes (or some arbitrary number of bytes if n
is unspecified), or null.  If it returns null, then you wait for a
'readable' event and try again.

It will also have a _read(n, function(er, bytes)) method that you can
override with your own implementation details.  The assumptions here
are that n is always specified, and that the cb is called at some point
in the future with an error or up-to-n bytes.  The ReadableStream class
will take care of managing the buffer and calling this._read when
appropriate.  If there are no more bytes to read, then _read(n, cb)
should call the cb with null.

This makes it easy to interface with underlying systems that can only
do asynchronous reads, while still getting the benefits of the simpler
read(n)/emit('readable') API.

Of course, if your system CAN safely do reads synchronously (ie, if
it's some kind of abstract stream that reads from memory and doesn't
have to do any slow IO) then you can just override the read(n) method
instead of the _read(n, cb) method.
Re: [node-dev] Stream tweaks proposal Bruno Jouhier 7/30/12 2:42 PM
@isaac

Thanks for the detailed reply.

Your proposed API is optimal for the sync case (memory stream). I handle this case with a trampoline (streamline) or a nextTick call (pure JS), which introduces a bit of overhead.

I still prefer the callback style over the event style because a read() that returns null is always followed by either a "readable" or an "error" event so the follow up event is fully correlated with the read() call. But anyway, this API design is your call and the new API is clearly heading in the right direction. I'll keep my "correlation" ramblings for a future blog post.

Bruno
Re: [node-dev] Stream tweaks proposal Bert Belder 7/31/12 3:57 PM
On Monday, July 30, 2012 11:42:03 PM UTC+2, Bruno Jouhier wrote:
> I still prefer the callback style over the event style because a read() that returns null is always followed by either a "readable" or an "error" event so the follow up event is fully correlated with the read() call. But anyway, this API design is your call and the new API is clearly heading in the right direction. I'll keep my "correlation" ramblings for a future blog post.

If it's any comfort, I agree with Bruno on this one.

- Bert