Why is my streaming word counter so much slower than a non-streaming version?


Dan Schmidt

Apr 6, 2014, 1:01:57 AM
to nod...@googlegroups.com
So, I'm trying to use streams more and to better understand how to use them properly.

There was this little contest over at Treehouse to write an app that counts the words in a text file, after filtering out a set of stop words listed in a second file.

One other person and I chose to write our apps in node.

I wrote mine using streams and they wrote theirs without. I used the time command to compare the two, and mine is consistently about 5 times slower.


I could understand if some of my transform and filtering logic were less performant than theirs, but 5 times slower seems to suggest I'm doing something wrong with the streams.
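
Roughly, the two versions have these shapes (a simplified sketch rather than my actual contest code; among other things, this naive splitter ignores words that straddle chunk boundaries):

var fs = require('fs');
var Transform = require('stream').Transform;

var stopWords = { the: true, a: true, and: true }; // stand-in for the filter file

// Streaming version: one Transform per step, chained with pipe().
var splitter = new Transform({ objectMode: true });
splitter._transform = function (chunk, enc, done) {
  String(chunk).split(/\s+/).forEach(function (word) {
    if (word) this.push(word); // emit one word at a time downstream
  }, this);
  done();
};

var filter = new Transform({ objectMode: true });
filter._transform = function (word, enc, done) {
  if (!stopWords[word]) this.push(word); // drop filtered words
  done();
};

var counts = {};
var counter = new Transform({ objectMode: true });
counter._transform = function (word, enc, done) {
  counts[word] = (counts[word] || 0) + 1;
  done();
};

fs.createReadStream('input.txt', { encoding: 'utf8' })
  .pipe(splitter)
  .pipe(filter)
  .pipe(counter)
  .on('finish', function () { console.log(counts); });

// Non-streaming version: slurp the whole file and count in one pass.
var all = {};
fs.readFileSync('input.txt', 'utf8').split(/\s+/).forEach(function (word) {
  if (word && !stopWords[word]) all[word] = (all[word] || 0) + 1;
});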

Any help would be appreciated, thanks.

Forrest Norvell

Apr 6, 2014, 2:50:47 AM
to nod...@googlegroups.com
I don't think there's anything wrong with how you're using streams. Your version trades performance for a more consistent memory profile, and theirs does the opposite. The nice thing about your implementation is that it can handle a much larger source file, and can start working before the whole file has been pulled down.

If you wanted to improve performance while still streaming, I'd look at figuring out how many of the pipeline steps you could combine, as each stage in the pipeline necessarily adds at least a little overhead. Combining them may also let you eliminate some work (maybe at the cost of using more memory), e.g. some of the conversions between Buffers and strings.
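
Something like this, for instance (a sketch of the general idea, since I'm guessing at the details of your pipeline): splitting, filtering, and counting all happen in a single _transform, so each chunk is decoded and walked exactly once and no per-word objects flow through extra pipe() hops.

var fs = require('fs');
var Transform = require('stream').Transform;

var stopWords = { the: true, a: true, and: true }; // stand-in for the filter file
var counts = {};

// decodeStrings: false keeps incoming string chunks as strings
// instead of round-tripping them through Buffers.
var combined = new Transform({ decodeStrings: false });
combined._transform = function (chunk, enc, done) {
  String(chunk).split(/\s+/).forEach(function (word) {
    if (word && !stopWords[word]) counts[word] = (counts[word] || 0) + 1;
  });
  done();
};

fs.createReadStream('input.txt', { encoding: 'utf8' })
  .pipe(combined)
  .on('finish', function () { console.log(counts); });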

F



Ryan Graham

Apr 6, 2014, 2:57:17 AM
to nod...@googlegroups.com
On Sat, Apr 5, 2014 at 10:01 PM, Dan Schmidt <daniel.ad...@gmail.com> wrote:
Your version is a lot more friendly to the event loop, to the point that the same process could probably serve a decent number of http requests if you threw an http.createServer in there somewhere.
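
Roughly like this (the file name and port are invented for the sketch):

var fs = require('fs');
var http = require('http');

http.createServer(function (req, res) {
  var counts = {};
  fs.createReadStream('input.txt', { encoding: 'utf8' })
    .on('data', function (chunk) {
      chunk.split(/\s+/).forEach(function (word) {
        if (word) counts[word] = (counts[word] || 0) + 1;
      });
    })
    .on('end', function () {
      res.end(JSON.stringify(counts));
    });
}).listen(3000);

Between chunks the process is free to accept and service other connections.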

The faster one, on the other hand, will completely block the event loop after the second async call to request() finishes.

So I wouldn't say you're doing anything wrong with the streams; it's just an extreme use of them. If the dataset is large enough to outweigh the setup overhead, you could try wrapping some of the pipeline steps in separate processes.

~Ryan

--
http://twitter.com/rmgraham

Dan Schmidt

Apr 6, 2014, 8:35:04 PM
to nod...@googlegroups.com
Thanks for the insights. I had tried using much larger text files to compare the two, but wasn't considering that the synchronous version will always beat it, so long as there is enough memory to read in the file. A better comparison would be to set them both up as web servers and send multiple concurrent requests.

I did try combining the processing steps into a single transform to see the difference, and its performance numbers are roughly the same as the synchronous version's.





--
Dan Schmidt

Floby

Apr 7, 2014, 11:28:02 AM
to nod...@googlegroups.com
As said before, using streams will probably not show an improvement in speed in your case, but rather an improvement in memory usage and concurrency.

Streams chunk your input into smaller pieces of work, which can then be scheduled more evenly over time. The synchronous version takes the task as a whole and has to complete it within a single turn of the event loop.

The difference between the two implementations should appear if you use files larger than 4GB, for example, or if you count words on several different streams at once.

Wrap your implementations inside an http server and hit it hard with ab :)
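
For example, with a server like Ryan's sketch listening on port 3000:

ab -n 2000 -c 100 http://127.0.0.1:3000/

-n is the total number of requests and -c the concurrency. The streaming version should keep latencies much flatter as -c grows, because no single request holds the event loop for the length of a whole file.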

Chad Engler

Apr 7, 2014, 6:19:44 PM
to nod...@googlegroups.com

The benefit of using async streams is not so much raw speed as the ability to do multiple things “at once”. If you were doing this task on 10,000 files, your method would process many of those files in parallel, while the non-streaming version would do them serially. Similarly, if you are operating on one large file, you will save memory by streaming it instead of loading and parsing it all at once.
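
For instance (file names and the per-chunk counting are placeholders, and this naive version ignores words split across chunk boundaries):

var fs = require('fs');

function countWords(file, callback) {
  var counts = {};
  fs.createReadStream(file, { encoding: 'utf8' })
    .on('data', function (chunk) {
      chunk.split(/\s+/).forEach(function (word) {
        if (word) counts[word] = (counts[word] || 0) + 1;
      });
    })
    .on('end', function () { callback(null, counts); });
}

// All the reads start immediately and interleave on the event loop;
// a readFileSync loop would grind through the files one at a time.
['a.txt', 'b.txt', 'c.txt'].forEach(function (file) {
  countWords(file, function (err, counts) {
    console.log(file + ':', Object.keys(counts).length, 'distinct words');
  });
});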

 

-Chad


Mikeal Rogers

Apr 8, 2014, 10:44:21 AM
to nod...@googlegroups.com
There's a pretty simple rule to follow here:

- If your working set fits in memory, pull it into memory.
- If your working set does not fit into memory, or you require concurrent working sets which will cumulatively not fit in memory, use streams.

It is *always* faster to do transforms and manipulation in memory when the set fits into memory, but it is common for the data not to fit, especially if you're doing anything concurrently.
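
In code, that decision might look like this (the 64MB cutoff is made up for illustration):

var fs = require('fs');

var LIMIT = 64 * 1024 * 1024; // arbitrary example threshold

fs.stat('input.txt', function (err, st) {
  if (err) throw err;
  if (st.size < LIMIT) {
    // working set fits: pull it into memory and transform it there
    fs.readFile('input.txt', 'utf8', function (err, text) {
      if (err) throw err;
      // count words over `text` in one pass
    });
  } else {
    // too big, or too many running concurrently: stream it
    var stream = fs.createReadStream('input.txt', { encoding: 'utf8' });
    // count words chunk by chunk as they arrive
  }
});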

-Mikeal