Stream tweaks proposal | Dominic | 7/15/12 7:25 PM | I've been writing and using a lot of streams lately, and have some proposals for a few small changes to node's streams: https://gist.github.com/3117184 In some parts it's a tightening-up of expected behaviour; in others, a few small changes to Stream#pipe. cheers, Dominic |
Re: [node-dev] Stream tweaks proposal | Isaac Schlueter | 7/26/12 9:57 PM | I think the real problem is with readable streams. Writable streams
are pretty easy to implement, but the readable side is unnecessarily difficult and too easy to get wrong. Here's a proposal in code: https://github.com/isaacs/readable-stream The readme explains the position, and the implementation makes it easy to play around with. A base class and a fs readable stream class are provided. Note that Readable.pipe() is very short and simple :) |
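For readers following along, consuming a stream under Isaac's proposal looks roughly like this. This is a sketch based on the API as described in this thread (a read() that returns data or null, plus a 'readable' event); `getSource` and `handleChunk` are hypothetical stand-ins.

```
// Pull-style consumption per the readable-stream proposal:
// read() returns buffered data, or null when nothing is available yet.
var source = getSource(); // hypothetical: any stream built on the Readable base class

source.on('readable', function () {
  var chunk;
  while ((chunk = source.read()) !== null) {
    handleChunk(chunk); // hypothetical consumer
  }
});

source.on('end', function () {
  console.log('done');
});
```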
Re: Stream tweaks proposal | Bruno Jouhier | 7/26/12 10:50 PM | A more radical proposal would be to switch to a callback-style API. See http://bjouhier.wordpress.com/2012/07/04/node-js-stream-api-events-or-callbacks/ Isaac's proposal is a move in the same direction (a read method). The main difference is that Isaac uses a completion "event" while I use a completion "callback". A point that I did not mention in my post: the callback API makes it trivial to implement an "unread" method. Handy when you need to look ahead when parsing input. Bruno |
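A sketch of the callback style Bruno's post argues for, assuming a read(cb) that calls back with (err, data), where data === null signals the end of the stream; `parse` is a hypothetical consumer:

```
function consume(stream, done) {
  stream.read(function (err, data) {
    if (err) return done(err);
    if (data === null) return done();   // end of stream
    parse(data);                        // hypothetical per-chunk work
    consume(stream, done);              // pull the next chunk
  });
}

// The "unread" he mentions is then trivial: push a chunk back so the
// next read(cb) returns it first, e.g. stream.unread(leftover) after
// peeking at a header. (Method name assumed from the post.)
```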
Re: [node-dev] Re: Stream tweaks proposal | Dominic | 7/27/12 3:58 AM | @isaacs, your ReadableStream#pipe does not have any error handling yet, so it won't be quite so simple once it does. |
Re: [node-dev] Stream tweaks proposal | Joran Greef | 7/27/12 5:38 AM | Looks good. How often in practice are multiple on("end") and on("readable") listeners going to be attached? I would prefer to just have plain vanilla stream.onEnd = function() {} and stream.onReadable = function() {} callbacks. |
Re: [node-dev] Stream tweaks proposal | Dominic | 7/27/12 6:03 AM | EventEmitter is optimised under the hood, so that if you only ever add one listener, that is pretty much what happens. adding a listener: https://github.com/joyent/node/blob/master/lib/events.js#L139-141 calling the only event listener: https://github.com/joyent/node/blob/master/lib/events.js#L74-99 there are many times you need extra event listeners: when debugging, for example, or when you want to register another end listener because you want to be notified when the stream has completed, so that you can do something else. |
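A quick illustration of the multiple-listener case Dominic describes, using the stock 0.8 API; `startNextJob` is a hypothetical follow-up task:

```
var fs = require('fs');
var stream = fs.createReadStream('input.txt');

stream.resume(); // let the data flow; we only care about completion here

// Two independent 'end' listeners: one drives the program,
// one is purely for debugging. Both fire, in registration order.
stream.on('end', function () { startNextJob(); });
stream.on('end', function () { console.log('debug: stream ended'); });
```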
Re: [node-dev] Stream tweaks proposal | Dominic | 7/27/12 6:08 AM | note, there is a pull request here https://github.com/isaacs/readable-stream/pull/1 to re-add error handling & clean up. |
Re: [node-dev] Stream tweaks proposal | oliver | 7/27/12 6:22 AM | On Fri, Jul 27, 2012 at 2:38 PM, Joran Greef <jo...@ronomon.com> wrote:

'in practice' is a bit problematic here, as (IMO) we still have to discover all the possibilities the nodejs stream concept gives us, and what (at the end of the discovery phase) turns out to be practical/useful/powerful/sane. For example, I'm writing a module that chains controllers. Every controller has three stream objects. Some of them must react to the end of the streams of other controllers (e.g. to flush buffered data). It's hard for me to explain the module as a whole (proof of concept: the C in MVC with streams?), but this would definitely not work if I could not bind multiple callbacks to one 'end' event. |
Re: [node-dev] Stream tweaks proposal | Mikeal Rogers | 7/27/12 11:23 AM | readable will get called a lot. multiple end listeners are rather common. |
Re: [node-dev] Stream tweaks proposal | Mikeal Rogers | 7/27/12 12:03 PM | After several conversations about this I have identified one thing we lose with this change.

Because the current implementation makes streams start emitting data right after creation, and could start doing so at any time after nextTick(), it's required that all streams be connected within a single tick. In request, and other streams, we make the assumption that all stream behavior was set up in a single tick, which is how this works:

```
request.get(url, function (e, resp, body) {})
fs.createReadStream(file).pipe(request.put(url))
request.get(url).pipe(fs.createWriteStream(file))
```

I need to know before I start the underlying http client whether this has a source and/or a destination stream. If there is no source stream I must call end() on the underlying http client. Now, we do get out of needing to know the destination stream ahead of time **if** I have a way to know that I shouldn't be buffering the data for a callback. I think I could do this by simply removing a feature I used to have where that callback became a callback on "response" and did not include buffered data.

The new stream style rids us of the pause/resume dance, but it also means that a stream can be created and just hang out for many cycles before getting piped to something else. That means we lose an assumption we used to have and that we have spent some time building on. Now, it's very uncommon that you would create a stream and then pipe **to** it some number of cycles later. But over time I think that people will get used to the "magic" of these new streams that can be created and piped around whenever, and will step on this.

This is not meant to derail or stop any of the changes proposed; it's simply an assumption we used to make and have come to rely on that cannot be polyfilled and will break whenever this gets released. -Mikeal |
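A minimal sketch of the assumption Mikeal describes breaking. Under the current (0.8) streams, data starts flowing shortly after creation, so a deferred pipe() silently drops the early chunks; under the proposed pull-style API the same code would be safe, because nothing is read until it is asked for:

```
var fs = require('fs');
var src = fs.createReadStream('big.log');

// With 0.8 streams, 'data' events have already been firing (and being
// dropped) by the time this runs; you would have had to call
// src.pause() in the creating tick to make this safe.
setTimeout(function () {
  src.pipe(fs.createWriteStream('copy.log'));
}, 1000);
```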
Re: [node-dev] Stream tweaks proposal | Dominic | 7/27/12 12:32 PM | isaacs mentioned to me in irc that he wanted to implement a ReadableStream, but leave Stream exactly as it was. so, have two types of stream. there is another thing this would break: it would be impossible to pipe a .read()-based ReadableStream to multiple destinations... although, if regular Streams still exist, then you'll still have that if you need it. |
Re: [node-dev] Stream tweaks proposal | Bruno Jouhier | 7/27/12 12:41 PM |
It is not so uncommon. A number of people hit the problem with the following scenario:

1) http server dispatches POST request
2) request is authenticated
3) request is routed
4) POSTed data is piped to destination

The problem is that step 2 (authentication) usually needs to access some kind of database. So the stream (request) is created at step 1 and piped at step 4, with I/O ticks at step 2. You cannot assume that you'll always have all the elements to decide what to do with a stream in the tick that creates the stream. Sometimes you need to read some header data from the stream, do some I/O to gather related data that will tell you what to do with the stream, and only then process the stream (pipe it). |
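A sketch of Bruno's four-step scenario; `authenticate` and `destinationFor` are hypothetical helpers doing the async database work and the routing:

```
var http = require('http');

http.createServer(function (req, res) {            // 1) dispatch
  // under 0.8 streams you must req.pause() here, or data arriving
  // during authentication is lost
  authenticate(req, function (err, user) {          // 2) async auth, I/O ticks pass
    if (err) { res.writeHead(401); return res.end(); }
    var dest = destinationFor(req.url, user);       // 3) routing
    req.pipe(dest);                                 // 4) pipe, several ticks later
    dest.on('close', function () { res.end('ok'); });
  });
}).listen(8080);
```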
Re: [node-dev] Stream tweaks proposal | Tim Caswell | 7/28/12 12:05 PM | FWIW, I actually like Bruno's proposal. It doesn't cover all the use cases, but it makes backpressure-enabled pumps really easy.

One use case missing that's easy to add: when consuming a binary protocol, I often only want part of the input. For example, I might want to get the first 4 bytes, decode that as a uint32 length header, and then read n more bytes for the body. Without being able to request how many bytes I want, I have to handle putting back data I don't need into the stream. That's very error-prone and tedious. So on the read function, add an optional "maxBytes" or "bytes" parameter. The difference: in the maxBytes case, I want the data as soon as there is anything, even if it's less than the number of bytes I asked for; in the "bytes" case, I want to wait until that many bytes are available. Both are valid for different use cases.

Also, streams (both readable and writable) need a configurable low-water mark. I don't want to wait until the pipe is empty before I start piping data again. This mark would control how soon writable streams called my write callback, and how much readable streams would read ahead from their data source before waiting for me to call read. I want to keep the pipe always full. It would be great if this was handled internally in the stream and consumers simply configured what the mark should be. |
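A sketch of the "bytes" semantics Tim asks for, assuming a hypothetical callback-style read(len, cb) that waits until exactly len bytes are buffered:

```
// Read one length-prefixed frame: a 4-byte big-endian uint32 header
// giving the body length, then the body itself.
function readFrame(stream, cb) {
  stream.read(4, function (err, header) {           // wait for exactly 4 bytes
    if (err) return cb(err);
    var length = header.readUInt32BE(0);            // decode the length prefix
    stream.read(length, function (err, body) {      // wait for exactly `length` bytes
      if (err) return cb(err);
      cb(null, body);
    });
  });
}
```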
Re: [node-dev] Stream tweaks proposal | Mikeal Rogers | 7/28/12 12:14 PM | The early stuff I saw included a "length" option. I think you're missing how this works. Nobody automatically asks for data so watermarks aren't strictly necessary. You ask for data if it's available and you read as much as you can handle. There is no "readahead". If someone stops calling read() then the buffer fills and, if it's a TCP stream, it's asked to stop sending data. Remember that when the "readable" event goes off it's expected that the pending data is read in the same event loop cycle. |
Re: [node-dev] Stream tweaks proposal | Tim Caswell | 7/28/12 12:22 PM | In any backpressure case where the data provider is faster than the consumer, there will be a series of pauses and resumes to slow down the provider. That is the point of backpressure. So in these discussions we need to assume that the pauses and resumes will be happening.

What triggers a pause, and what triggers a resume? Will there ever be a case where the pipe is empty and stalled? That is what I'm trying to avoid; it kills throughput. When do we ask the tcp stream to stop sending data, and when do we tell it to start again? How long does it take for the tcp stream to get this message and start sending the data? If we wait till the buffer is 100% empty and new data isn't loaded immediately, then it's a stall. This is where a low-water mark keeps things running smoothly. |
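A sketch of the low-water-mark behaviour Tim wants; every name and number here is illustrative, not an existing node API:

```
var LOW = 16 * 1024;    // resume the producer below this many buffered bytes
var HIGH = 64 * 1024;   // pause the producer above this

function FlowControl(source) {
  this.source = source;   // anything with pause()/resume()
  this.buffered = 0;
  this.paused = false;
}

FlowControl.prototype.onData = function (chunk) {
  this.buffered += chunk.length;
  if (!this.paused && this.buffered >= HIGH) {
    this.paused = true;
    this.source.pause();  // high-water mark: stop the producer
  }
};

FlowControl.prototype.onConsumed = function (n) {
  this.buffered -= n;
  if (this.paused && this.buffered <= LOW) {
    this.paused = false;
    this.source.resume(); // low-water mark: refill before we run dry
  }
};
```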
Re: [node-dev] Stream tweaks proposal | Mikeal Rogers | 7/28/12 12:38 PM | pause and resume at the stream layer are gone.
the TCP socket will get paused if "readable" is emitted and read() is not called. it will be resumed when read() is called again. the amount of data that is available on a "readable" event is variable. the amount of data buffered is of no importance to pause/resume logic. -Mikeal |
Re: [node-dev] Stream tweaks proposal | Bruno Jouhier | 7/29/12 10:51 AM | @tim The API that I used in the blog post is a simplified version of the API I implemented in streamline. I simplified it in the blog post because I just wanted to demo the equivalence between the two styles of API. The streams module that I am using (https://github.com/Sage/streamlinejs/blob/master/lib/streams/server/streams.md) has most of the features that you saw missing:

* an optional "len" parameter in the read call.
* low and high water mark options in the ReadableStream constructor.

The "len" parameter has your "bytes" semantics and I use it exactly the way you describe (typically reading 4 bytes to get a frame length and then N bytes for the frame). I did not implement the "maxBytes" semantics because I did not need it (which does not mean it would not be useful). The thing is that all the additional bells and whistles can be implemented around the basic read(cb) call (called readChunk in my module).

I introduced low and high water mark options because I wanted to avoid a pause/resume dance around every data event when data arrives faster than it is consumed. My assumption was that a little queue with high and low marks would reduce the number of pause/resume calls and improve performance, basically trading a bit of space for speed. But I have to admit that I did not bench it. So, if the pause/resume dance costs very little, this may be overkill.

@isaac and mikeal, This callback proposal may sound very "anti-eventish" and it may give the impression that I'm sort of trying to eradicate events from node's APIs (nobody said it but I can see how it could be perceived this way). This is not the case. I like node's event API and I find it very elegant. But node gives us two API styles (callbacks and events) and it is not always easy to choose between the two. Here is the rationale that I use to decide between them.

My main criterion is CORRELATION. Basically, I start with the assumption that the API is event-oriented and then I analyze the degree of correlation between the various events. If the events are highly correlated, I choose the callback style. If they are loosely correlated, I keep the event style. Some examples:

* User events (browser side) are very loosely correlated => event style
* Incoming HTTP requests (server side) are also very loosely correlated => event style
* Data streams vary. If each data chunk is a complete message which is more or less independent from other messages, the event style is best. If, on the other hand, the chunks are correlated (because the whole stream has a strong internal structure, or because it has been chunked on arbitrary boundaries that don't match its internal structure), then the callback style is best.
* Confirmation events (like the "connect"/"error" events that follow a connection attempt, or the "drain" event that follows a write returning false) are fully correlated => callback style

Also, the event style API is more powerful than the callback style API as it supports multiple listeners. BUT:

* It is very easy to wrap a callback API with an event listener (a sketch follows this post).
* Very often, in the correlated case, there is a "main" consumer which needs to correlate the events, and auxiliary consumers that don't care that much about the correlations (they log them, feed statistics, etc.). A dual API with callbacks for the main consumer and events for the auxiliary ones works great.
* Wrapping an event style API with a callback style API is a lot more difficult.
* Callback style APIs are easier to use when the events are correlated because you don't need to set up state machines to re-correlate the events.

Given this, I probably favor the callback style a lot more than most node developers. But this is not a systematic "anti-event" attitude; there is a rationale behind it and I wanted to share it with you. Bruno |
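The "easy direction" Bruno mentions, wrapping a callback-style read in an event facade, can be sketched in a few lines. `stream.read(cb)` is assumed to call back with (err, data), where data === null means end of stream:

```
var EventEmitter = require('events').EventEmitter;

function toEmitter(stream) {
  var ee = new EventEmitter();
  (function next() {
    stream.read(function (err, data) {
      if (err) return ee.emit('error', err);
      if (data === null) return ee.emit('end');
      ee.emit('data', data);
      next();               // pull the next chunk after emitting
    });
  })();
  return ee;
}
```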
Re: [node-dev] Stream tweaks proposal | Dominic | 7/30/12 3:53 AM | I think there is another problem here: if creationix writes a stream that likes to read a byte, or a line, or whatever, it means that I can't just pipe anything into it, because he basically needs a custom `flow` function (as per https://github.com/isaacs/readable-stream/blob/master/readable.js#L84-95), but that will be on the readable side. if data is gonna be pulled off the readable stream, then the pulling duties really belong to the puller, not the pullee. I can also sense complexity rearing its ugly head, but maybe... pipe could check for a dest.pull method, and use that instead of flow? |
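Dominic's dest.pull idea, sketched; neither dest.pull nor this pipe exists anywhere, and `flow` stands in for the default loop in readable-stream:

```
function pipe(src, dest) {
  if (typeof dest.pull === 'function') {
    dest.pull(src);     // destination pulls exactly the bytes it wants
  } else {
    flow(src, dest);    // default shovelling loop (readable.js flow)
  }
  return dest;
}
```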
Re: [node-dev] Stream tweaks proposal | Isaac Schlueter | 7/30/12 10:02 AM |
> there is another thing this would break, it would be impossible to [...]

Yeah, that use case is going to be hard. If you want a tee-stream, you can implement it pretty easily on top of this. And of course, if you use the "data" event facade, then it'll keep working just how it does, so I don't expect existing programs to be affected too badly until they specifically try to upgrade. But, yes, this is a thing that's going to break in 0.9, most likely, and someone probably will be affected by it. The best approach is to do it as soon as possible and document the changes.

For several implementations, a read(n, cb) method is easier to do. For example, on Windows, that's how we read from sockets, and on all platforms, that's how fs.read works. So we *must* implement something like this for fs read streams anyway, and if we want elegant cross-platform support, then we'll have to do that for TCP and Pipes as well.

The ReadableStream base class will have a read([n]) method that returns either up-to-n bytes (or some arbitrary number of bytes if n is unspecified), or null. If it returns null, then you wait for a 'readable' event and try again.

It will also have a _read(n, function (er, bytes)) method that you can override with your own implementation details. The assumption here is that n is always specified, and the cb is called at some point in the future with an error or up-to-n bytes. The ReadableStream class will take care of managing the buffer and calling this._read when appropriate. If there are no more bytes to read, then _read(n, cb) should call the cb with null.

This makes it easy to interface with underlying systems that can only do asynchronous reads, while still getting the benefits of the simpler read(n)/emit('readable') API. Of course, if your system CAN safely do reads synchronously (ie, if it's some kind of abstract stream that reads from memory and doesn't have to do any slow IO) then you can just override the read(n) method instead of the _read(n, cb) method. |
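From an implementor's point of view, the API Isaac describes would look roughly like this; the names follow his description, but the final API may well differ:

```
var Readable = require('readable-stream'); // isaacs/readable-stream
var util = require('util');
var fs = require('fs');

function FileStream(path) {
  Readable.call(this);
  this.fd = fs.openSync(path, 'r');
}
util.inherits(FileStream, Readable);

// The base class buffers, emits 'readable', and decides when to call
// this._read(n, cb); we only supply the raw asynchronous read.
FileStream.prototype._read = function (n, cb) {
  var buf = new Buffer(n);
  fs.read(this.fd, buf, 0, n, null, function (err, bytesRead) {
    if (err) return cb(err);
    if (bytesRead === 0) return cb(null, null);  // EOF: call back with null
    cb(null, buf.slice(0, bytesRead));           // up-to-n bytes
  });
};
```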
Re: [node-dev] Stream tweaks proposal | Bruno Jouhier | 7/30/12 2:42 PM | @isaac Thanks for the detailed reply. Your proposed API is optimal for the sync case (memory stream). I handle this case with a trampoline (streamline) or a nextTick call (pure JS), which introduces a bit of overhead. I still prefer the callback style over the event style because a read() that returns null is always followed by either a "readable" or an "error" event so the follow up event is fully correlated with the read() call. But anyway, this API design is your call and the new API is clearly heading in the right direction. I'll keep my "correlation" ramblings for a future blog post. Bruno |
Re: [node-dev] Stream tweaks proposal | Bert Belder | 7/31/12 3:57 PM | On Monday, July 30, 2012 11:42:03 PM UTC+2, Bruno Jouhier wrote:

> I still prefer the callback style over the event style because a read() that returns null is always followed by either a "readable" or an "error" event so the follow up event is fully correlated with the read() call. But anyway, this API design is your call and the new API is clearly heading in the right direction. I'll keep my "correlation" ramblings for a future blog post.

If it's any comfort, I agree with Bruno on this one. - Bert |