ssize_t n;
while ((n = read(fd, buf, 32768)) > 0) i += buf[0];
(the i += buf[0] is there to make sure the read loop isn't optimized out;
it changes nothing otherwise).
This program can saturate the disk I/O bandwidth with ~80% of one core; that
measurement (and the Bonnie run too, I assume) was made when no other process
was competing for the CPU or memory bandwidth.
I suspected I/O bandwidth would be lower under higher CPU load (when running
wf2 against O.all, I often observed disk usage below 100%, as measured by
iostat, and low CPU load, *simultaneously*), and decided to test that
conjecture.
I wrote a small program to launch 31 busywork processes (leaving one thread of
execution free for the I/O benchmark), and the results differ greatly when
there are other processes using CPU time:
Read 43178.0 MBs in 371.72s at 116.2 MB/s.
real 6m11.736s
user 0m3.853s
sys 5m20.318s
This means we might be closer to the I/O lower bound than previously thought.
wf2_multicore2.ml, which processes the 42GB O.all file in 7min02s, approaches
the above times already.
--
Mauricio Fernandez - http://eigenclass.org
> Would it be hard for you to repeat the experiment with 27 busywork
> threads and 1 reader? The reason I ask is that the T2 really only
> has 8 cores, although each one has 4 threads. So I'm wondering if
> you intentionally try to allocate one core to IO and not distract it
> with other tasks then it will improve IO performance.
Also, please consider running 32 reader threads, each reading 1/32 of
the file. That's the approach I've been using so far - reading pieces
of the file in many worker threads - and I'm wondering how it compares
with the approach of having 1 reader thread.
Thanks,
-- Alex
You're very right. You get the full I/O bandwidth when a whole core is left
for the reader.
> Also, please consider running 32 reader threads, each reading 1/32 of
> the file. That's the approach I've been using so far - reading pieces
> of the file in many worker threads - and I'm wondering how it compares
> with the approach of having 1 reader thread.
That's precisely what I'm doing in the top two entries: inefficient
(line-oriented) I/O in multiple sections of the file at once.
It was only when I tried to use a separate process to avoid disk seeks that I
found the problem, whose nature is now clear.
> Are you reading in parallel from multiple threads or are you reading
> sequentially from multiple threads?
>
> I'm reading sequentially from multiple threads, in other words, my
> input channel is passed around like a baton.
I'm reading in parallel from multiple threads: the file is opened in
30 separate places and 30 threads independently seek and read data.
I'm counting on the OS/hardware to read multiple streams efficiently,
and on the language to do buffered I/O. I'm pretty sure that the
second assumption is good; not so sure about the first though.
Passing the channel around among the threads, or something similar,
may improve performance; I'll try that approach too. But I'm curious
what a raw read benchmark in C would do for my current approach,
without any messaging or locking or language overhead.
Cheers,
-- Alex
> In fact, it has 8 cores and it turns out each has two live integer
> streams, so it's kinda like having 8 dual-core cores. So in fact you
> can get 16* the nominal throughput of one CPU. Well, unless you're
> doing FP. -T
Is FP the only thing where that's true? My understanding is that the
T1 is interleaved multi-threading, so I'm assuming there are things
beyond the ALU that only exist once in each core. (Fetching, perhaps?)
Or maybe it's more interleaving four threads onto two pipelines per core.
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
ja...@tartarus.org uncertaintydivision.org
> Is FP the only thing where that's true? My understanding is that the
> T1 is interleaved multi-threading, so I'm assuming there are things
> beyond the ALU that only exist once in each core. (Fetching, perhaps?)
> Or maybe it's more interleaving four threads onto two pipelines per core.
My memory seems to be telling me that the T1 has 3 (!?) memory
controllers. There must be a data sheet somewhere.... -T
> My memory seems to be telling me that the T1 has 3 (!?) memory
> controllers. There must be a data sheet somewhere.... -T
The closest I've found is
<http://www.sun.com/processors/UltraSPARC-T1/specs.xml>, which says 4
memory interfaces/controllers. There's a 3MB L2 cache, which may have
been what you were thinking?
Much more on
<http://www.opensparc.net/opensparc-t1/index.html>. There's some
interesting stuff (if you're interested at this level :-) in the
architecture document:
<http://opensparc-t1.sunsource.net/specs/OpenSPARCT1_Micro_Arch.pdf>.