ssize_t n;
while ((n = read(fd, buf, 32768)) > 0) i += buf[0];
(the i += buf[0] is there to make sure the read loop isn't optimized out;
it changes nothing otherwise).
This program can saturate the disk I/O bandwidth with ~80% of one core; that
measurement (and the Bonnie run too, I assume) was made when no other process
was competing for the CPU or memory bandwidth.
I suspected I/O bandwidth would be lower under higher CPU load (when running
wf2 against O.all, I often observed disk usage below 100%, as measured by
iostat, and low CPU load, *simultaneously*), and decided to test that
conjecture.
I wrote a small program to launch 31 busywork processes (leaving one thread of
execution free for the I/O benchmark), and the results differ greatly when
there are other processes using CPU time:
Read 43178.0 MBs in 371.72s at 116.2 MB/s.
real 6m11.736s
user 0m3.853s
sys 5m20.318s
This means we might be closer to the I/O lower bound than previously thought.
wf2_multicore2.ml, which processes the 42GB O.all file in 7min02s, approaches
the above times already.
--
Mauricio Fernandez - http://eigenclass.org
> Would it be hard for you to repeat the experiment with 27 busywork
> threads and 1 reader? The reason I ask is that the T2 really only
> has 8 cores, although each one has 4 threads. So I'm wondering if
> you intentionally try to allocate one core to IO and not distract it
> with other tasks then it will improve IO performance.
Also, please consider running 32 reader threads, each reading 1/32 of
the file. That's the approach I've been using so far - reading pieces
of the file in many worker threads - and I'm wondering how it compares
with the approach of having 1 reader thread.
Thanks,
-- Alex
You're very right. You get the full I/O bandwidth when a whole core is left
for the reader.
> Also, please consider running 32 reader threads, each reading 1/32 of
> the file. That's the approach I've been using so far - reading pieces
> of the file in many worker threads - and I'm wondering how it compares
> with the approach of having 1 reader thread.
That's precisely what I'm doing in the top two entries: inefficient
(line-oriented) I/O in multiple sections of the file at once.
It was only when I tried to use a separate process to avoid disk seeks that I
found the problem, whose nature is now clear.
> Are you reading in parallel from multiple threads or are you reading
> sequentially from multiple threads?
>
> I'm reading sequentially from multiple threads, in other words, my
> input channel is passed around like a baton.
I'm reading in parallel from multiple threads: the file is opened in
30 separate places and 30 threads independently seek and read data.
I'm counting on the OS/hardware to read multiple streams efficiently,
and on the language to do buffered I/O. I'm pretty sure that the
second assumption is good; not so sure about the first though.
Passing the channel around among the threads, or something similar,
may improve performance; I'll try that approach too. But I'm curious
what a raw read benchmark in C would do for my current approach,
without any messaging or locking or language overhead.
Cheers,
-- Alex
> In fact, it has 8 cores and it turns out each has two live integer
> streams, so it's kinda like having 8 dual-core cores. So in fact you
> can get 16* the nominal throughput of one CPU. Well, unless you're
> doing FP. -T
Is FP the only thing where that's true? My understanding is that the
T1 is interleaved multi-threading, so I'm assuming there are things
beyond the ALU that only exist once in each core. (Fetching, perhaps?)
Or maybe it's more interleaving four threads onto two pipelines per core.
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
ja...@tartarus.org uncertaintydivision.org
> Is FP the only thing where that's true? My understanding is that the
> T1 is interleaved multi-threading, so I'm assuming there are things
> beyond the ALU that only exist once in each core. (Fetching, perhaps?)
> Or maybe it's more interleaving four threads onto two pipelines per core.
My memory seems to be telling me that the T1 has 3 (!?) memory
controllers. There must be a data sheet somewhere.... -T
> My memory seems to be telling me that the T1 has 3 (!?) memory
> controllers. There must be a data sheet somewhere.... -T
The closest I've found is
<http://www.sun.com/processors/UltraSPARC-T1/specs.xml>, which says 4
memory interfaces/controllers. There's a 3MB L2 cache, which may have
been what you were thinking?
Much more on
<http://www.opensparc.net/opensparc-t1/index.html>. There's some
interesting stuff (if you're interested at this level :-) in the
architecture document:
<http://opensparc-t1.sunsource.net/specs/OpenSPARCT1_Micro_Arch.pdf>.