pushing io performance


Paul Querna

Jul 13, 2010, 8:48:01 PM
to nod...@googlegroups.com
Hi,

I started to play with node's default buffer sizes, but I noticed that
the performance with large files in general was, to say the least,
terrible.

In order to get a better picture, I wrote a little javascript file,
which creates a 1gb file, writes to it in various buffer sizes, and
then repeats the various sizes for a read stream:
<http://github.com/pquerna/node/blob/buffer_sizes/benchmark/io.js>

In order to get a baseline comparison of the 'best case' behavior, I
wrote a C program that does the same operations, using blocking IO and
no threads -- node.js can never beat this, just because of the thread
& message passing overhead, but ideally we could get within 20%, and
not be orders of magnitude out of place:
<http://github.com/pquerna/node/blob/buffer_sizes/benchmark/io.c>
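
For reference, the write half of the C baseline looks roughly like the
following (a trimmed sketch of the same operations, not the actual io.c;
it shows a single buffer size and drops the read pass and most error
handling):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void) {
        const size_t total   = 1000u * 1024 * 1024;  /* ~1 GB */
        const size_t bufsize = 8192;                 /* one of the tested sizes */
        char *buf = malloc(bufsize);
        memset(buf, 'a', bufsize);

        int fd = open("io.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        struct timeval start, end;
        gettimeofday(&start, NULL);
        for (size_t written = 0; written < total; written += bufsize) {
            if (write(fd, buf, bufsize) < 0) {       /* plain blocking write() */
                perror("write");
                return 1;
            }
        }
        gettimeofday(&end, NULL);
        close(fd);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_usec - start.tv_usec) / 1e6;
        printf("Wrote %zu bytes in %fs using %zu byte buffers: %fmB/s\n",
               total, secs, bufsize, total / secs / (1024.0 * 1024.0));
        free(buf);
        return 0;
    }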

I'm happy to accept patches to either of the io benchmarks to make
them more comparable.

On my 2.8GHz MBP w/ an SSD, the C version finishes in about 1 minute
and 15 seconds, using about 320KB of resident memory.

I gave up on the node version after 10 minutes; it wasn't even halfway
done, so we are looking at easily an order of magnitude difference in
runtime. In addition, the node.js process was pushing 700 megs
resident.

Now, ideally, on a program like this, you are IO bound: you are
waiting for IO to complete, and in the C version this is true. The
Node.js version, however, is burning 98% user time, with almost no
time spent in IO. What I saw in tests was that under node.js it was
accomplishing writes at less than 1 megabyte/second with large files.

The main culprit seems to be related to libeio and how it is using
condition variables to distribute work. I've only done some cursory
debugging with dtrace on osx and strace on linux -- both seem to have
similar problems.

For every write, libeio pushes a request onto a queue from the main
thread, and then signals a condition variable.

The worker thread is waiting on a timed condition variable (which
means it's calling gettimeofday() inside every loop iteration, suck).
When it comes out, it tries to pop a single item off the work queue.
It then processes the item, calling our write call, and pushes the
result onto the mutex-protected result queue.

Our main thread then gets woken up, we emit a drain event to
javascript land, and the loop repeats.
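
To make the hand-off concrete, schematically it looks something like
this (not libeio's actual source -- just the pattern described above,
with names and the main-thread wakeup simplified):

    #include <pthread.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>

    typedef struct req {
        struct req *next;
        int fd;
        const void *buf;
        size_t len;
        ssize_t result;
    } req_t;

    static pthread_mutex_t req_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  req_cond = PTHREAD_COND_INITIALIZER;
    static req_t *req_queue;

    static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;
    static req_t *res_queue;

    /* main thread: one queue push + one condvar signal per write */
    void submit(req_t *r) {
        pthread_mutex_lock(&req_lock);
        r->next = req_queue;
        req_queue = r;
        pthread_mutex_unlock(&req_lock);
        pthread_cond_signal(&req_cond);
    }

    /* worker thread: the timed wait means computing an absolute deadline
     * (gettimeofday) on every idle iteration */
    void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&req_lock);
            while (req_queue == NULL) {
                struct timeval tv;
                struct timespec ts;
                gettimeofday(&tv, NULL);
                ts.tv_sec  = tv.tv_sec + 10;      /* arbitrary idle timeout */
                ts.tv_nsec = tv.tv_usec * 1000;
                pthread_cond_timedwait(&req_cond, &req_lock, &ts);
            }
            req_t *r = req_queue;                 /* pop a single request */
            req_queue = r->next;
            pthread_mutex_unlock(&req_lock);

            r->result = write(r->fd, r->buf, r->len);  /* the blocking call */

            pthread_mutex_lock(&res_lock);        /* hand the result back */
            r->next = res_queue;
            res_queue = r;
            pthread_mutex_unlock(&res_lock);
            /* ...then wake the main thread (pipe/eventfd) so it can emit
             * the drain event to javascript and queue the next write */
        }
        return NULL;
    }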

What I found from the dtrace was that we were only spending 11% of our
CPU time on the blocking write; the other 89% was spent in locking,
message passing, and condition variables (88%), and of course
executing javascript (<1%).

I think this problem will plague node.js on semi-single-threaded IO
bound tasks in general, but I have several ideas on how to improve it:

1) Opportunistic async calls: For file IO (not for some methods like
stat), it's possible to *try* an async write, and if it succeeds, we
just avoided a thread swap. If it comes back with EAGAIN, then fall
back to the threaded option. I've not seen this implemented before,
so I'm not sure if it would work, but it's definitely something I'd
like to experiment with (see the first sketch after this list).

2) Pipelining of requests in EIO: I think this could be a pretty big
win. If we could create a class of EIO events that should be
'moderately fast' and put them in their own queue, we could then take
10 of them off the request queue, run them all inline, and then bulk
push their results. I believe this would get you pretty big wins on
things like static HTTP file serving (also sketched after this list).

3) Use pthread_workqueue API when available:
<http://people.freebsd.org/~sson/thrworkq/pthread_workqueue.3.txt>
This Work Queue API is what powers Apple's Grand Central -- the
support for it has been added to FreeBSD -- this is likely a change to
make inside libeio itself, but on platforms that support it, this
should drastically reduce the cost of pushing tasks to worker threads.
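
A control-flow sketch of idea 1 (purely illustrative;
queue_threaded_write() is a made-up stand-in for the existing threaded
path, and short writes and real error handling are elided):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* made-up stand-in for handing the write to the existing thread pool */
    void queue_threaded_write(int fd, const void *buf, size_t len);

    void opportunistic_write(int fd, const void *buf, size_t len) {
        /* fd would have been opened with O_NONBLOCK */
        ssize_t n = write(fd, buf, len);
        if (n >= 0) {
            /* completed inline on the event loop: no thread swap, no
             * condvar, no wakeup round-trip */
            return;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* would have blocked: fall back to the threaded path */
            queue_threaded_write(fd, buf, len);
            return;
        }
        /* real error: report it to the caller (elided) */
    }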
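
And a sketch of idea 2's batching, reusing the toy queue types from the
hand-off sketch earlier in this mail (process() is hypothetical; the
point is one lock round-trip and one wakeup per batch instead of per
request):

    #define BATCH 10

    void process(req_t *r);   /* hypothetical: run one 'moderately fast' request */

    void drain_batch(void) {
        req_t *batch[BATCH];
        int i, n = 0;

        pthread_mutex_lock(&req_lock);
        while (n < BATCH && req_queue != NULL) {   /* pop up to 10 at once */
            batch[n++] = req_queue;
            req_queue = req_queue->next;
        }
        pthread_mutex_unlock(&req_lock);

        for (i = 0; i < n; i++)
            process(batch[i]);                     /* run them all inline */

        pthread_mutex_lock(&res_lock);
        for (i = 0; i < n; i++) {                  /* one bulk result push */
            batch[i]->next = res_queue;
            res_queue = batch[i];
        }
        pthread_mutex_unlock(&res_lock);
        /* one wakeup of the main thread covers the whole batch */
    }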

I've just started pushing on this, so I'm sure there are other ideas
out there for how we could improve it.

Thanks,

Paul

r...@tinyclouds.org

Jul 13, 2010, 9:37:19 PM
to nod...@googlegroups.com
Awesome write-up. More than anything, this reaffirms my desire for the
build bots (or something else) to display graphs that are not just
noise: http://buildbot.nodejs.org:8010/perf/dashboard/overview.html
Ideally we can have a benchmark like this in core, displayed on some
website for each commit, while we work on correcting it.

> The worker thread is waiting on a timed condition variable (which
> means it's calling gettimeofday() inside every loop iteration, suck).
> When it comes out, it tries to pop a single item off the work queue.
> It then processes the item, calling our write call, and pushes the
> result onto the mutex-protected result queue.

Are you seeing that in dtrace? It shouldn't be calling that. It's only
enabled if you call eio_set_max_poll_time(), which Node doesn't.

> What I found from the dtrace was that we were only spending 11% of our
> CPU time on the blocking write; the other 89% was spent in locking,
> message passing, and condition variables (88%), and of course
> executing javascript (<1%).

I've long (http://four.livejournal.com/918336.html) wanted to
implement a new backend to libeio with some sort of lock-free queue. I
had the idea of serializing the addresses of the eio_req pointers...
There are probably better ways. It's a very fun activity, making the
thread pool faster.
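
For example, a minimal lock-free (Treiber) stack over the request
pointers might look like the following. This is just an illustration of
the idea using GCC atomic builtins; it ignores the ABA problem and
memory reclamation entirely:

    typedef struct lf_node {
        struct lf_node *next;
        void *req;                     /* e.g. an eio_req pointer */
    } lf_node_t;

    static lf_node_t *lf_top;

    static void lf_push(lf_node_t *n) {
        lf_node_t *old;
        do {
            old = lf_top;
            n->next = old;
        } while (!__sync_bool_compare_and_swap(&lf_top, old, n));
    }

    static lf_node_t *lf_pop(void) {
        lf_node_t *old;
        do {
            old = lf_top;
            if (old == NULL)
                return NULL;
        } while (!__sync_bool_compare_and_swap(&lf_top, old, old->next));
        return old;
    }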

> 1) Opportunistic async calls: For file IO (not for some methods like
> stat), it's possible to *try* an async write, and if it succeeds, we
> just avoided a thread swap. If it comes back with EAGAIN, then fall
> back to the threaded option. I've not seen this implemented before,
> so I'm not sure if it would work, but it's definitely something I'd
> like to experiment with.

I would be surprised if OSes actually allow you to do this. Worth
checking out, of course.

> 2) Pipelining of requests in EIO: I think this could be a pretty big
> win. If we could create a class of EIO events that should be
> 'moderately fast' and put them in their own queue, we could then take
> 10 of them off the request queue, run them all inline, and then bulk
> push their results. I believe this would get you pretty big wins on
> things like static HTTP file serving.

The unexposed group API can do this, I believe.

> 3) Use pthread_workqueue API when available:
> <http://people.freebsd.org/~sson/thrworkq/pthread_workqueue.3.txt>
> This Work Queue API is what powers Apple's Grand Central -- the
> support for it has been added to FreeBSD -- this is likely a change to
> make inside libeio itself, but on platforms that support it, this
> should drastically reduce the cost of pushing tasks to worker threads.

Sure. That's going to be awesome.


ry

Paul Querna

Jul 13, 2010, 9:44:00 PM
to nod...@googlegroups.com
An update after some discussions with the libeio author.

On Tue, Jul 13, 2010 at 5:48 PM, Paul Querna <pqu...@apache.org> wrote:
> The main culprit seems to be related to libeio and how it is using
> condition variables to distribute work.  I've only done some cursory
> debugging with dtrace on osx and strace on linux -- both seem to have
> similar problems.

To be clear, the main culprit in terms of *time* spent is libeio, but
the issues are much more likely to be related to node.js pushing
events one at a time, so we end up ping-ponging single events around
instead of getting consistent write performance.

> For every write, libeio pushes a request onto a queue from the main
> thread, and then signals a condition variable.
>
> The worker thread is waiting on a timed condition variable (which
> means it's calling gettimeofday() inside every loop iteration, suck).
> When it comes out, it tries to pop a single item off the work queue.
> It then processes the item, calling our write call, and pushes the
> result onto the mutex-protected result queue.

This was wrong or unclear: EIO does pull requests off the queue if
there are any available.

It only waits on the condition variable if the request queue is
empty. In the tests I ran, this did happen, which means libeio's
outstanding request queue was empty. That points directly to node.js
not pushing events into libeio quickly enough.

> Our main thread then gets woken up, we emit a drain event to
> javascript land, and the loop repeats.
>
> What I found from the dtrace was that we were only spending 11% of our
> CPU time on the blocking write; the other 89% was spent in locking,
> message passing, and condition variables (88%), and of course
> executing javascript (<1%).
>
> I think this problem will plague node.js on semi-single-threaded IO
> bound tasks in general, but I have several ideas on how to improve it:
>
> 1) Opportunistic async calls: For file IO (not for some methods like
> stat), it's possible to *try* an async write, and if it succeeds, we
> just avoided a thread swap. If it comes back with EAGAIN, then fall
> back to the threaded option. I've not seen this implemented before,
> so I'm not sure if it would work, but it's definitely something I'd
> like to experiment with.

Apparently this is not possible.

r...@tinyclouds.org

Jul 13, 2010, 9:53:41 PM
to nod...@googlegroups.com
On Tue, Jul 13, 2010 at 6:44 PM, Paul Querna <pqu...@apache.org> wrote:
> An update after some discussions with the libeio author.
>
> On Tue, Jul 13, 2010 at 5:48 PM, Paul Querna <pqu...@apache.org> wrote:
>> The main culprit seems to be related to libeio and how it is using
>> condition variables to distribute work.  I've only done some cursory
>> debugging with dtrace on osx and strace on linux -- both seem to have
>> similar problems.
>
> To be clear, the main culprit in terms of *time* spent is libeio, but
> the issues are much more likely to be related to node.js pushing
> events one at a time, so we end up ping-ponging single events around
> instead of getting consistent write performance.

It's very likely we're not pulling responses off of eio in an
optimal way. Have you seen this nonsense that I'm doing to avoid
race conditions (test/simple/test-eio-*)?

http://github.com/ry/node/blob/6bf294d71a172ccf6b9066c92b1aa8e772a31ffd/src/node.cc#L1984-2000
http://github.com/ry/node/blob/6bf294d71a172ccf6b9066c92b1aa8e772a31ffd/src/node.cc#L224-276

It could definitely use a going-over with an eye towards performance,
especially given those race condition tests.

r...@tinyclouds.org

Jul 13, 2010, 10:13:03 PM
to nod...@googlegroups.com

For example, changing that ev_idle to an ev_prepare. Maybe
auto-starting the eio_poller when dispatching requests, since it's
likely they'll come back by the next iteration.
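
Something like this, roughly (untested sketch against libev/libeio's
public calls; node's real integration has more to it):

    #include <ev.h>
    #include <eio.h>

    static ev_prepare eio_prepare_watcher;

    /* runs once per loop iteration, right before the loop blocks, so any
     * completed eio requests get drained without an extra idle wakeup */
    static void eio_prepare_cb(struct ev_loop *loop, ev_prepare *w, int revents) {
        eio_poll();
    }

    void setup(struct ev_loop *loop) {
        ev_prepare_init(&eio_prepare_watcher, eio_prepare_cb);
        ev_prepare_start(loop, &eio_prepare_watcher);
    }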

mape

Jul 14, 2010, 8:32:43 AM
to nodejs
Results from actually running Paul's benchmark in node (Debian 4, 4GB
RAM, on a WD 10k RPM Raptor drive).

Wrote 2097156000 bytes in 2016919ms using 1024 byte buffers:
0.991611462830188 mB/s
Wrote 2097156000 bytes in 2708309ms using 4096 byte buffers:
0.7384681733140495 mB/s
Wrote 2097156000 bytes in 1956594ms using 8192 byte buffers:
1.0221844695424804 mB/s
Wrote 2097156000 bytes in 2089429ms using 16384 byte buffers:
0.9571993113908153 mB/s
Wrote 2097156000 bytes in 2254114ms using 32768 byte buffers:
0.8872665712559347 mB/s
Wrote 2097156000 bytes in 2785221ms using 65536 byte buffers:
0.7180758726147763 mB/s

/mape

On Jul 14, 2:48 am, Paul Querna <pque...@apache.org> wrote:
> I gave up on the node version after 10 minutes; it wasn't even halfway
> done, so we are looking at easily an order of magnitude difference in
> runtime. In addition, the node.js process was pushing 700 megs
> resident.

> Paul

Peter Sunde

Jul 14, 2010, 6:20:34 PM
to nodejs
More benchmarks:

peters@de-1558:/software/node# hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 10976 MB in 2.00 seconds = 5499.51 MB/sec
Timing buffered disk reads: 324 MB in 3.01 seconds = 107.81 MB/sec

peters@de-1558:/software/node# ./iotest
Wrote 1048576000 bytes in 3.380000s using 1024 byte buffers:
295.857988mB/s
Wrote 1048576000 bytes in 0.950000s using 4096 byte buffers:
1052.631579mB/s
Wrote 1048576000 bytes in 0.820000s using 8192 byte buffers:
1219.512195mB/s
Wrote 1048576000 bytes in 0.960000s using 16384 byte buffers:
1041.666667mB/s
Wrote 1048576000 bytes in 0.830000s using 32768 byte buffers:
1204.819277mB/s
Wrote 1048576000 bytes in 0.860000s using 65536 byte buffers:
1162.790698mB/s
Read 1048576000 bytes in 0.620000s using 1024 byte buffers:
1612.903226mB/s
Read 1048576000 bytes in 0.500000s using 4096 byte buffers:
2000.000000mB/s
Read 1048576000 bytes in 0.450000s using 8192 byte buffers:
2222.222222mB/s
Read 1048576000 bytes in 0.440000s using 16384 byte buffers:
2272.727273mB/s
Read 1048576000 bytes in 0.430000s using 32768 byte buffers:
2325.581395mB/s
Read 1048576000 bytes in 0.430000s using 65536 byte buffers:
2325.581395mB/s

Peter Sunde

Jul 14, 2010, 6:27:28 PM
to nodejs
Sorry, that was the C benchmark. The JS version does not work on node
HEAD.

mscdex

Jul 15, 2010, 12:41:39 AM
to nodejs
Here's what I get on my Ubuntu server:

With io.js:
Wrote 1048576000 bytes in 93.934s using 1024 byte buffers:
10.64577256371495 mB/s
Wrote 1048576000 bytes in 332.441s using 4096 byte buffers:
3.008052556694271 mB/s
Wrote 1048576000 bytes in 515.732s using 8192 byte buffers:
1.9389915692646569 mB/s
Wrote 1048576000 bytes in 397.385s using 16384 byte buffers:
2.5164513003762097 mB/s
Wrote 1048576000 bytes in 501.725s using 32768 byte buffers:
1.9931237231551147 mB/s
Wrote 1048576000 bytes in 498.542s using 65536 byte buffers:
2.0058490558468494 mB/s
Read 1048576000 bytes in 17.745s using 1024 byte buffers:
56.35390250774866 mB/s
Read 1048576000 bytes in 18.099s using 4096 byte buffers:
55.25167136305873 mB/s
Read 1048576000 bytes in 12.895s using 8192 byte buffers:
77.54943776657619 mB/s
Read 1048576000 bytes in 12.833s using 16384 byte buffers:
77.92410192472532 mB/s
Read 1048576000 bytes in 18.346s using 32768 byte buffers:
54.507794614629894 mB/s
Read 1048576000 bytes in 12.846s using 65536 byte buffers:
77.84524365561265 mB/s
All done!

With io.c:
Wrote 1048576000 bytes in 4.610000s using 1024 byte buffers:
216.919740mB/s
Wrote 1048576000 bytes in 0.770000s using 4096 byte buffers:
1298.701299mB/s
Wrote 1048576000 bytes in 0.760000s using 8192 byte buffers:
1315.789474mB/s
Wrote 1048576000 bytes in 0.730000s using 16384 byte buffers:
1369.863014mB/s
Wrote 1048576000 bytes in 0.680000s using 32768 byte buffers:
1470.588235mB/s
Wrote 1048576000 bytes in 0.630000s using 65536 byte buffers:
1587.301587mB/s
Read 1048576000 bytes in 0.240000s using 1024 byte buffers:
4166.666667mB/s
Read 1048576000 bytes in 0.150000s using 4096 byte buffers:
6666.666667mB/s
Read 1048576000 bytes in 0.140000s using 8192 byte buffers:
7142.857143mB/s
Read 1048576000 bytes in 0.140000s using 16384 byte buffers:
7142.857143mB/s
Read 1048576000 bytes in 0.130000s using 32768 byte buffers:
7692.307692mB/s
Read 1048576000 bytes in 0.130000s using 65536 byte buffers:
7692.307692mB/s

mscdex

Jul 15, 2010, 12:44:40 AM
to nodejs
Sorry, forgot to mention that the benchmark for io.js was using node
v0.1.100-b1901cd.

r...@tinyclouds.org

Aug 17, 2010, 11:48:20 AM
to nod...@googlegroups.com
On Tue, Jul 13, 2010 at 5:48 PM, Paul Querna <pqu...@apache.org> wrote:
> Hi,
>
> I started to play with node's default buffer sizes, but I noticed that
> the performance with large files in general was, to say the least,
> terrible.

So I played around with it a bit. As I said, there are some easy wins
here. The first thing is that it shouldn't be encoding to "binary";
it should write directly from the buffer to the FD. The second was
that the wt() function wasn't really set up very well; it doesn't
need nextTick. Actually, we should test this without WriteStream, for
a better comparison.

Wrote 1048576000 bytes in 50.452s using 1024 byte buffers:
19.82081978910648 mB/s
Wrote 1048576000 bytes in 52.675s using 4096 byte buffers:
18.984337921214998 mB/s
Wrote 1048576000 bytes in 47.91s using 8192 byte buffers:
20.872469213107912 mB/s
Wrote 1048576000 bytes in 52.049s using 16384 byte buffers:
19.21266498876059 mB/s
Wrote 1048576000 bytes in 46.941s using 32768 byte buffers:
21.303338233101126 mB/s
Wrote 1048576000 bytes in 51.355s using 65536 byte buffers:
19.472300652322073 mB/s

Trussing the process shows too many kevent calls (on Macintosh), which
I suspect come from the bad pullback from the thread pool.

I pulled Paul's io tests into the source tree in 41ad339. Feel free to
clean it up further to match the C version. It should call fs.write()
directly.
