Hello,
I imagine the following problem has been efficiently solved
over a million times in the past.
I need to rotate a picture clockwise 90 degrees.
Conceptually, my picture is represented by a matrix
(W columns, H rows) of pixels, with each pixel
represented by 3 octets.
pixel original_picture[W][H];
pixel rotated_picture[H][W];
At the implementation level, I'm dealing with 1D integer arrays.
int src[W*H*3];
int dest[W*H*3];
Conceptually, the rotation operation comes down to
rotated_picture[j][H-1-i] = original_picture[i][j]
My actual code is
for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
Consider as an example,
W = 1644
H = 1164
size = 5.74 MB
On my target platform, the rotation takes 1.65 seconds.
I'm trying to get this down under half a second.
I'm using the GNU tool chain. For some weird reason, we
compile everything -O0. The first thing I'll try is crank
gcc's optimization level.
I'm hoping gcc can perform some strength reduction, as the
index calculation seems to be taking a non-negligible fraction
of the total run-time.
Changing the loop to
for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
memcpy(dest+(H*j+H-i-1)*3, src, 3);
src += 3;
}
brings a 5% improvement.
I thought changing the memcpy call to 3 explicit assignments
might help, but it actually slowed things down.
Perhaps I need to perform some tiling magic... I don't think
gcc performs automatic tiling?
Comments and insight would be greatly appreciated.
Regards.
I haven't looked at it in detail but that immediately caught my
eye. Surely:
memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int));
--
Andrew Smallshaw
and...@sdf.lonestar.org
This is indeed a well-known problem, the key to make it fast is to
realize that it is very similar to a transpose operation!
I.e. first copy everything to the target array (in a single block move),
then transpose it, finally you reverse each of the rows.
The difference between clock and counter-clock-wise rotation is in doing
the row reverse either before or after the transpose. Try both since I
don't remember which is which!
The key here is that the transpose operation is quite cache-friendly,
much better than your current naive code.
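A minimal out-of-place sketch of the two-pass idea, for one-byte elements
(the function name is hypothetical, and a real version would cache-block
the transpose pass rather than use this naive double loop):

void rotate90cw(const unsigned char *src, unsigned char *dst, int W, int H)
{
    int i, j;
    /* Pass 1: transpose src (H rows x W cols) into dst (W rows x H cols). */
    for (i = 0; i < H; i++)
        for (j = 0; j < W; j++)
            dst[j*H + i] = src[i*W + j];
    /* Pass 2: reverse each of the W rows of dst, H elements per row. */
    for (j = 0; j < W; j++) {
        unsigned char *lo = dst + j*H, *hi = lo + H - 1;
        while (lo < hi) {
            unsigned char t = *lo;
            *lo++ = *hi;
            *hi-- = t;
        }
    }
}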
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
For extra marks, produce a much more cache-friendly version, using
blocks rather than rows ....
You can even do that in place, and make it efficient. But I suspect
that you (Terje) know all of this!
Regards,
Nick Maclaren.
Robert.
.------------------->X
|
|
|
|
|
|
\|/
Y
rotated 90 degrees clockwise
y<--------------------.
|
|
|
|
|
|
\|/
x
new_x <---------------.
|
|
\|/
new_y
new_x = (height-1)-y;
new_y = x;
Source := 0;
Dest := Height - 1;   { index in DestBuf of source pixel (0,0) }
Count := 0;
for Index := 0 to PixelCount-1 do
begin
  DestBuf[Dest] := SrcBuf[Source];
  Source := Source + 1;
  Dest := Dest + Height;
  Count := Count + 1;
  if Count >= Width then
  begin
    { end of a source row: step back one destination column }
    Count := 0;
    Dest := Dest - PixelCount - 1;
  end;
end;
Untested, but that's what I would conceptually try.
This should get rid of all the (slower?) multiplications, in return for one
single branch, and some fast adds and subs.
Plus a memory copy per pixel.
(What target you on ? ;))
Bye,
Skybuck.
The access time would go up, and some other operations would run a
lot more slowly. That technique works, but isn't any use as a
general solution. It can help for some applications.
TANSTAAFL.
Regards,
Nick Maclaren.
I usually defer to you wrt optimization, Terje, but unless you have an especially fast block copy instruction that the
user cannot roll himself (not unlike my P6-era fast strings), I doubt that it will be faster to copy and then transpose
in place.
It is probably better to subdivide the original picture into blocks, and perform the rotation reading a block of the
original_picture, and writing the block of the rotated_picture that it maps to.
The only real question is the size of the blocks: cache sized? Maybe not: if you have N write combining buffers, there
may be a benefit in writing blocks whose vertical height is equal to the number of WC buffers. However, when I have
tuned transposes in the past I have usually found it better to write out rows of the destination, reading columns of the
origin. This way you don't really depend on the number of WC buffers: you completely fill one WC buffer, and then move
to the next.
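In scalar terms that access pattern is just (a sketch, one byte per element,
using the OP's W x H layout):

for (r = 0; r < W; r++)        /* each dest row r is source column r        */
    for (c = 0; c < H; c++)    /* sequential dest writes fill one WC buffer */
        dst[r*H + c] = src[(H-1-c)*W + r];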
There is the secondary issue of what width you read the columns in - e.g. if transposing byte sized elements, you might read
32 or 64 at a time, and then do a register transpose to assemble the blocks to write out. Nowadays, you have to look at
128 bit SSE and 256 bit AVX (or 512 bit LRB registers) as options. (Hmm, on a machine with scatter/gather, you might be
tempted to read in a full cache line of the source, and scatter write to the destination. But then you *really* want to
make sure that you fit into the WC buffers. Scatter read and single write is probably better. But I suspect that an in
register transpose is better overall. How efficiently can LRB transpose a set of vector registers? E.g. with 32 bits
per element, a full cache line 64B register is 8 elements wide - so how efficiently can you transpose a matrix stored in
8 vector registers?)
Interestingly, most GPUs allow indirect addressing of their register files. I think that I have seen index registers -
i.e. access register[register[i]] - as well as strided access TO THE GPU RF. In such a system, transposing or rotating
is just (1) read the data in, (2) write the data out. No shuffling about. (Although there is cross lane movement - but
then the ALU lanes are not necessarily the same as the memory lanes.)
In any case, you want to avoid the unnecessary read-for-ownerships that you might naively get when writing into memory
one element at a time. Hence SSE streaming stores.
> Noob wrote:
>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>
> This is indeed a well-known problem, the key to make it fast is to
> realize that it is very similar to a transpose operation!
>
> I.e. first copy everything to the target array (in a single block move),
> then transpose it, finally you reverse each of the rows.
>
> The difference between clock and counter-clock-wise rotation is in doing
> the row reverse either before or after the transpose. Try both since I
> don't remember which is which!
>
> The key here is that the transpose operation is quite cache-friendly,
> much better than your current naive code.
Hello Terje,
Your sig is the reason I thought it would be a good idea to include
comp.arch in the list :-)
Let me take a 5x3 image as an example.
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
Transpose
0 5 10
1 6 11
2 7 12
3 8 13
4 9 14
Reverse the rows
10 5 0
11 6 1
12 7 2
13 8 3
14 9 4
This is, indeed, equivalent to a 90° clockwise rotation.
I don't see why the transpose operation would be more
cache-friendly than directly doing the rotation.
TRANSPOSE = copy ROW(i) to COL(i)
90°CW ROT = copy ROW(i) to COL(H-1-i)
Is it because the column access is incremented rather than
decremented? I must be missing something...
Could you provide some insight, or some links?
Regards.
The basic (fast) transpose operation works on squares that have power of
two sides, this is what allows it to work recursively, and thereby to
fit into lower cache levels as soon as possible.
The innermost loop of the transpose works with N registers each capable
of holding N elements, so you can transpose a 4x4 block of 32-bit values
using 4 SSE registers, or an 8x8 block of 16-bit values (as long as
you're in 64-bit mode so you have at least 16 regs available).
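For reference, the classic register transpose of a 4x4 block of 32-bit
lanes looks like this with SSE2 intrinsics (illustration only - the OP's
SH-4 has nothing like it):

#include <emmintrin.h>

/* Transpose rows r0..r3 of a 4x4 block of 32-bit elements in place. */
static void transpose4x4(__m128i *r0, __m128i *r1, __m128i *r2, __m128i *r3)
{
    __m128i t0 = _mm_unpacklo_epi32(*r0, *r1); /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpacklo_epi32(*r2, *r3); /* c0 d0 c1 d1 */
    __m128i t2 = _mm_unpackhi_epi32(*r0, *r1); /* a2 b2 a3 b3 */
    __m128i t3 = _mm_unpackhi_epi32(*r2, *r3); /* c2 d2 c3 d3 */
    *r0 = _mm_unpacklo_epi64(t0, t1);          /* a0 b0 c0 d0 */
    *r1 = _mm_unpackhi_epi64(t0, t1);          /* a1 b1 c1 d1 */
    *r2 = _mm_unpacklo_epi64(t2, t3);          /* a2 b2 c2 d2 */
    *r3 = _mm_unpackhi_epi64(t2, t3);          /* a3 b3 c3 d3 */
}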
If you actually work with very small blocks, like your 5x3 example, then
it really doesn't matter, you can just generate unrolled code to do it
directly while copying, in which case the fastest code will be like
Andy's suggestion where you gather in data to be able to write full
cache lines to the target buffer.
> Noob wrote:
>
>> I don't see why the transpose operation would be more
>> cache-friendly than directly doing the rotation.
>>
>> TRANSPOSE = copy ROW(i) to COL(i)
>> 90°CW ROT = copy ROW(i) to COL(H-1-i)
>>
>> Is it because the column access is incremented rather than
>> decremented? I must be missing something...
>>
>> Could you provide some insight, or some links?
>
> The basic (fast) transpose operation works on squares that have power of
> two sides, this is what allows it to work recursively, and thereby to
> fit into lower cache levels as soon as possible.
>
> The innermost loop of the transpose works with N registers each capable
> of holding N elements, so you can transpose a 4x4 block of 32-bit values
> using 4 SSE registers, or an 8x8 block of 16-bit values (as long as
> you're in 64-bit mode so you have at least 16 regs available).
Hold on. SSE? 64-bit mode? :-)
My target is a 266-MHz SH-4.
Assuming the vendor's implementation of memcpy is properly optimized,
I think it is safe to assume that it provides a lower bound for the
run-time of the rotation operation. Correct?
As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms
(i.e. 18 MB/s)
Right now, my naive (strength-reduced, no tiling) C implementation
takes 1100 ms. I think there is still room for improvement.
One thing you may have overlooked in the advice you've given me is
that the pixel values are packed in 24-bits (RGB, no alpha).
Thus I can't (blindly) deal with 32-bit words. There's some
bit-twiddling to do to merge different parts of pixels, and
then I have to worry about endian-ness. Good times.
> If you actually work with very small blocks, like your 5x3 example, then
> it really doesn't matter, you can just generate unrolled code to do it
> directly while copying, in which case the fastest code will be like
> Andy's suggestion where you gather in data to be able to write full
> cache lines to the target buffer.
I was surprised that I got slightly better performance when I iterated
over consecutive bytes of the SOURCE buffer, than when I iterated over
consecutive bytes of the DEST buffer.
My current code handles slightly over 5 MB/s. It's fast enough for pictures
smaller than, say, 2 MB. But some pictures weigh 6-7 MB. Having the user
wait 1.5 seconds might be acceptable, but barely.
Regards.
> >What would happen if you changed the default storage order for arrays
> >of dimension two and greater, so that they were always stored in
> >blocks with some cache or node-friendly order? You could make it
> >nearly transparent in a language like c++ by defining appropriate
> >objects.
>
> The access time would go up, and some other operations would run a
> lot more slowly. That technique works, but isn't any use as a
> general solution. It can help for some applications.
>
> TANSTAAFL.
Someone I worked with referred to this as conservation of difficulty.
I didn't actually mean to imply that c++ would be the best way to do
it, only that it was a way of doing it that would make the question
concrete.
Let me ask a more vague question.
Suppose that there were a programmable hardware memory remapper that
worked either at the node or cluster level.
Robert.
I read it the way you meant.
>Let me ask a more vague question.
>
>Suppose that there were a programmable hardware memory remapper that
>worked either at the node or cluster level.
Same answer, with variations. Sorry.
While there are some applications where an 'odd' memory layout is
much better than the obvious one, most of those can be rewritten
to use a standard layout efficiently. What is far more common is
that one layout is best in one part of a program, and another in
another. Remapping CAN be worthwhile, but it isn't a cheap
operation.
I once got a factor of 4 improvement on a linear solver by turning
the arrays the other way round and keeping both the ordinary and
transposed code of one matrix - using whichever was best. And the
dreaded 3-D FFT algorithms are remapping ones. But it's rarely as
simple.
IBM did memory mapping across CPUs between the POWER3 and POWER4,
and got it BADLY wrong. They backed off for the POWER5. And IBM
aren't exactly tyros at that game. It's very easy to get wrong.
The history of large computers is littered with worthy attempts at
using memory banking in clever ways, most of which have failed to
a greater or lesser degree.
Randomising memory layout would have the advantage that it would
reduce the chances of an application suffering catastrophic loss
of efficiency, at the expense of making 'perfect' tuning infeasible.
Apparently, it's been done and it worked. But it didn't have the
expense of remapping.
Regards,
Nick Maclaren.
I was actually thinking more in terms of transparency and portability
than I was in terms of efficiency. Take the burden off the programmer
and the cleverness out of the source code. Or, if there is burden on
a programmer, let it be a systems programmer who doesn't have to mess
with the source code to make it even less readable.
Robert.
And on many other systems. The experience is that simple banking
is easy to get right, and works fairly well, though with some very
nasty gotchas. It's the attempts to be clever with it that I was
referring to.
>I was actually thinking more in terms of transparency and portability
>than I was in terms of efficiency. Take the burden off the programmer
>and the cleverness out of the source code. Or, if there is burden on
>a programmer, let it be a systems programmer who doesn't have to mess
>with the source code to make it even less readable.
That's where randomisation should score. It would be trivial to
transform every address, using a cheap and insecure 'cryptographic'
method - a fixed network of bit shuffles and XORs isn't exactly
hard to put into hardware, after all - it's in hardware 101 in
most courses :-)
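A toy software model of such a network, purely illustrative (real hardware
would do this in wires, and would pick the bit positions carefully):

static unsigned remap(unsigned addr)
{
    /* XOR a few untouched high bits into the bank-select bits;
       any fixed, invertible mixing of this sort would do. */
    return addr ^ ((addr >> 13) & 0x3f0);
}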
That would eliminate 99% of the gotchas, at the expense of stopping
the extreme tuning of algorithm cores. The impact on benchmarketing
may well be why it isn't done.
Regards,
Nick Maclaren.
Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks is
a natural starting point.
>
> Right now, my naive (strength-reduced, no tiling) C implementation
> takes 1100 ms. I think there is still room for improvement.
>
> One thing you may have overlooked in the advice you've given me is
> that the pixel values are packed in 24-bits (RGB, no alpha).
Aha!
This is crucial, it probably means that you should work with 4x4 pixel
blocks, each using 3x4x4 = 48 bytes.
48 bytes would fit in 12 32-bit registers, I don't know the details of
the SH-4, but I believe it has at least 16 integer registers, right?
Starting from this point I would first try a simple setup where I read
in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a
rotated/transposed version, which I then store to 4 small temporary buffers.
Each time these 4 buffers reach 64 bytes each (or 128, whatever the
cache line size is), I'd copy them to the target array: This has the big
benefit of always writing complete cache lines, while keeping the
working buffers small enough that they will stay put in L1 cache.
During the load part you'll read 12 bytes from each of 4 different
addresses, this probably isn't quite optimal, but you'll at least get
some reasonable benefit from burst dram transfers.
>
> Thus I can't (blindly) deal with 32-bit words. There's some
> bit-twiddling to do to merge different parts of pixels, and
> then I have to worry about endian-ness. Good times.
>
>> If you actually work with very small blocks, like your 5x3 example, then
>> it really doesn't matter, you can just generate unrolled code to do it
>> directly while copying, in which case the fastest code will be like
>> Andy's suggestion where you gather in data to be able to write full
>> cache lines to the target buffer.
>
> I was surprised that I got slightly better performance when I iterated
> over consecutive bytes of the SOURCE buffer, than when I iterated over
> consecutive bytes of the DEST buffer.
>
> My current code handles slightly over 5 MB/s. It's fast enough for pictures
> smaller than, say, 2 MB. But some pictures weigh 6-7 MB. Having the user
> wait 1.5 seconds might be acceptable, but barely.
Back in the 486-33MHz days I used to say that 10 MB/s was my initial
speed target, your SH-4 is probably quite a bit faster than this?
OTOH your library block copy speed of 18 MB/s seems quite low. :-(
So, tell us what they did?
The standard thing here, which usually turns out to be a bad idea, is to interleave cache lines across CPUs. (Did the
POWER3 have a CPU-attached memory controller?) Not usually a good idea, unless local and remote memory are very close in
latency and bandwidth.
Not quite so bad, but still often surprisingly painful, is to interleave 4K pages across CPUs/MCs. The OS can work around
this by playing with virtual address mappings, but it catches them by surprise.
The best is usually to have several large contiguous chunks per CPU, where you can specify what physical address they
appear at, and let the OS do NUMA placement. Several, because there are occasionally assumptions about physical address
ranges. (E.g. the OS assumes CPU0 has exclusive access to physical address [64K,128K), CPU1 to [128K,+64K), etc.) The
remappability is just to cope with such assumptions. Which sometimes crop up wrt I/O, especially when you transition
across a boundary like 32 to 64 bit.
But I still like having the option of being able to interleave at finer granularity - cache line, or page - across a subset
of memory controllers, for a subset of memory. Because for certain apps it's a win. If the bandwidth of a single
channel cannot saturate a CPU.
It turns out that interleaving should be performed at the level of the
DRAM page for maximal effect with memory controllers that are smart
enough. This is far larger than the CPU page, but if the DRAMs do not
all have the same page size, the largest page that fits in all DRAMs
is generally an excellent granule.
A DRAM controller is sufficiently smart when it can hold dozens to
hundreds of cache lines waiting to be written while still servicing
demand reads with some protocol as to when writes actually happen and
some means to access the write buffer on behalf of demand reads.
This strategy commits the fewest number of DIMMs to the typical large
scale strip mine active access patterns. {Leaving the rest of the DRAM
banks to do more random activities.}
Mitch
How? Until and unless IBM tell all, we peasants will merely have to
guess. All we know is that they did what you describe below, but
that wasn't the whole story and wasn't the reason for the major
problems.
>The standard thing here, which usually turns out to be a bad idea, is
>to interleave cache lines across CPUs. (Did the POWER3 have a CPU-
>attached memory controller?) Not usually a good idea, unless local
>and remote memory are very close in latency and bandwidth.
That is, I believe, what the POWER4 did. And the details of the
memory controller have never been published, nor has why the change
was so catastrophic.
Incidentally, it worked very well on the vector systems, and was very
common on them. My guess is that it would work very well with modern
Fortran codes (i.e. ones using arrays as first-class objects). Heaven
help C/C++, but what else is new?
>Not quite so bad, but still often surprisingly painful, is to interleave
>4K pages across CPUs/MCs. The OS can work around this by playing with
>virtual address mappings, but it catches them by surprise.
I would regard that as a bug in the OS design. The contiguous approach
can be almost as painful for many shared-memory languages/programs, as
almost everything gets allocated into a single bank. And I regard
that as a bug in the OS design, too ....
Regards,
Nick Maclaren.
>> I need to rotate a picture clockwise 90 degrees.
>
> The data sheet states
>
> SH-4 32-bit super-scalar RISC CPU
> o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU
> o 5-stage pipeline, delayed branch support
> o floating point unit, matrix operation support
> o debug port, interrupt controller
The most important number is cache line size, which you missed.
If your image is 1,024 lines tall, that will completely thrash
the cache, resulting in 3 bytes copied per cache line load/spill.
If you copy 16x16 tiles you can get a 10x speedup.
CopyTile16(source, dest, x, y, width, height)...
You can also try 8x8 and 4x4, smaller loops can be faster due
to all the args fitting in registers, and the loop getting unrolled.
but the difference will be ~25% which is hardly worth the time to code.
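A hedged sketch of the tile loop - the helper name follows the CopyTile16
mention above, but the body is an assumption, reusing the OP's index math
(edge tiles, when W or H is not a multiple of 16, are omitted):

#include <string.h>

static void CopyTile16(const unsigned char *src, unsigned char *dest,
                       int x, int y, int W, int H)
{
    int i, j;
    for (i = 0; i < 16; i++)        /* source rows y..y+15    */
        for (j = 0; j < 16; j++)    /* source columns x..x+15 */
            memcpy(dest + ((size_t)H*(x+j) + H-1-(y+i))*3,
                   src  + ((size_t)W*(y+i) + (x+j))*3, 3);
}

/* The caller walks the image tile by tile: */
void rotate_tiled(const unsigned char *src, unsigned char *dest, int W, int H)
{
    int x, y;
    for (y = 0; y + 16 <= H; y += 16)
        for (x = 0; x + 16 <= W; x += 16)
            CopyTile16(src, dest, x, y, W, H);
}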
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> unsigned char *C = B+(H*j+H-i-1)*3;
> C[0] = A[0];
> C[1] = A[1];
> C[2] = A[2];
> A += 3;
> }
If you do the following you can get a 2x speedup, it looks like
more code, but will generate less, and the results will be
pipelined correctly.
Extra bonus points to those that understand why. Half the posters here?
{
unsigned char *C = B+(H*j+H-i-1)*3;
temp0 = A[0];
temp1 = A[1];
temp2 = A[2];
C[0] = temp0;
C[1] = temp1;
C[2] = temp2;
A += 3;
}
Do not use *C++ = *A++;
Brett
You have explicitly gotten rid of the aliasing issues so the compiler
can avoid having to assume aliasing conflicts between C[0] and
A[1],...
Mitch
Programming hotshots have done so much damage.
And they brag about it.
I watched some doing one-upsmanship while the earth was still being
created, and I decided I wanted nothing to do with it. I think I
showed good judgment (rare for me).
Robert.
Terje Mathisen wrote:
> Noob wrote:
>
>> My target is a 266-MHz SH-4.
>>
>> Assuming the vendor's implementation of memcpy is properly optimized,
>> I think it is safe to assume that it provides a lower bound for the
>> run-time of the rotation operation. Correct?
There seems to be something fishy with the vendor's memcpy: it handles (only)
~18 MB/s while the straight-forward foo() below handles ~21 MB/s
void foo(unsigned int *dest, unsigned int *src, int wordcount)
{
int i;
for (i = 0; i < wordcount; ++i) *dest++ = *src++;
}
The loop kernel translates to
.L137:
mov.l @r4+,r1
dt r2
mov.l r1,@r5
bf/s .L137
add #4,r5
Unrolling the loop twice makes the code 50% (huh?!?) slower
.L137:
mov.l @r4,r1
dt r3
mov.l @(4,r4),r2
add #8,r4
mov.l r1,@r5
mov.l r2,@(4,r5)
bf/s .L137
add #8,r5
I'm starting to think that the array is allocated in a non-cached region.
>> As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms
>> (i.e. 18 MB/s)
>
> Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks is
> a natural starting point.
The dimensions are, indeed, divisible by 4, because I do the following ;-)
cinfo.output_width &= ~3;
cinfo.output_height &= ~3;
>> Right now, my naive (strength-reduced, no tiling) C implementation
>> takes 1100 ms. I think there is still room for improvement.
>>
>> One thing you may have overlooked in the advice you've given me is
>> that the pixel values are packed in 24-bits (RGB, no alpha).
>
> Aha!
>
> This is crucial, it probably means that you should work with 4x4 pixel
> blocks, each using 3x4x4 = 48 bytes.
Someone showed me how to get libjpeg to output 4-byte pixels, and
this alone gave me a 3x speedup.
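For reference, one way to request that, assuming a libjpeg-turbo style
build with its extended color spaces (classic libjpeg needs RGB_PIXELSIZE
changed at build time instead):

cinfo.out_color_space = JCS_EXT_RGBX;  /* 4 bytes per pixel, pad byte ignored */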
> 48 bytes would fit in 12 32-bit registers, I don't know the details of
> the SH-4, but I believe it has at least 16 integer registers, right?
Right.
General register file:
o Sixteen 32-bit general registers (and eight 32-bit shadow registers)
o Seven 32-bit control registers
o Four 32-bit system registers
> Starting from this point I would first try a simple setup where I read
> in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a
> rotated/transposed version, which I then store to 4 small temporary
> buffers.
>
> Each time these 4 buffers reach 64 bytes each (or 128, whatever the
> cache line size is)
The cache line size is 32 bytes.
I can play every cache trick in the book, but it won't help if I'm
working with non-cached memory.
> Back in the 486-33MHz days I used to say that 10 MB/s was my initial
> speed target, your SH-4 is probably quite a bit faster than this?
AFAICT, my 266-MHz SH-4 should be 5-10x faster than a 33-MHz 486 in
pure integer code.
I found a better description of the CPU:
o 32-bit internal data bus
o General register file:
– sixteen 32-bit general registers (and eight 32-bit shadow registers)
– seven 32-bit control registers
– four 32-bit system registers
o RISC-type instruction set
– fixed 16-bit instruction length for improved code efficiency
– load-store architecture
– delayed branch instructions
– conditional execution
o Superscalar architecture: parallel execution of two instructions
o Instruction execution time: maximum 2 instructions/cycle
o On-chip multiplier
o Five-stage pipeline (100, 200, 400 and 500 series)
o Seven-stage pipeline (300 series)
o Branch target cache for bubble-free branching along predicted path (300 series)
and the cache sub-system:
o Instruction cache (IC) features:
– 16 Kbyte, 2-way set associative
– 512 entries, 32-bytes block length
– compatibility mode (8 Kbyte direct mapped)
– index mode
o Operand cache (OC) features:
– 32 Kbyte, 2-way set associative
– 1024 entries, 32 bytes block length
– compatibility mode (16 Kbyte direct mapped)
– index mode
– RAM mode (16 Kbyte cache + 16 Kbyte RAM)
o Single-stage copy-back buffer, single-stage write-through buffer
o Cache memory contents can be accessed directly by address mapping (usable as on-chip memory)
o Store queue (32 bytes x 2 entries).
cf. also
http://documentation.renesas.com/eng/products/mpumcu/rej09b0318_sh_4sm.pdf
Regards.
Ouch!
That sounds like a frame buffer...
Can you direct these temporary image buffers to be in cacheable ram instead?
On many memory systems the DRAM page is approximately the same size as the virtual memory page - 4KB or thereabouts. So
in this situation our advice coincides.
When the DRAM page is smaller than the virtual memory page, it is probably a bad idea to interleave between CPUs at the
DRAM page granularity; system software cannot control it.
By the way, interleaving across channels between DRAM pages is often a win - if you are limited by channel bandwidth.
But this, unfortunately, makes the effective DRAM page even larger, exacerbating the following effect. It can be an
effective way of increasing a 2KB DRAM page to a 4KB page size that the OS can deal with.
When the DRAM page is larger than the e.g. 4KB virtual memory page, if you are only using that page size, on a long
running system all the address bits beyond bit 12 are effectively randomized. You might hope for page clustering or
large pages in the OS to help. Which they do. However, in my experience, large pages have usually only been available
for benchmark specials, and not for truly general purpose computing. Still, if you have an HPC system, use them if you
got them.
> Noob wrote:
> >
> > I'm starting to think that the array is allocated in a non-cached region.
>
> Ouch!
That is worse than the case where you only use 3 bytes for every cache line fill/spill.
> That sounds like a frame buffer...
>
> Can you direct these temporary image buffers to be in cacheable ram instead?
>
> Terje
If not, you want to load four pixels at a time using three long reads, and shift
out the bytes to write. Three slow reads instead of 12.
Brett
Damage?
That is clean code that is easy to read and understand.
> And they brag about it.
Only one in a hundred programmers know an optimization like that, for
half of comp.arch to be that good says good things about comp.arch.
> I watched some doing one-upsmanship while the earth was still being
> created, and I decided I wanted nothing to do with it. I think I
> showed good judgment (rare for me).
I thought I was generous giving away top secrets that most everyone
else hoards. I do have quite a bit more that will get another 50%,
but he lacks the background to understand such, and I lack the skills
to teach the complex stuff over a tty.
> Robert.
If you remember the story about the programmer told to add a cheat
to the blackjack program so the customer would always win...
I am not him, my code is clear and readable.
Brett ;)
Well, yes and no. It is much cleaner to write the code in the form
the algorithm uses, and have the compiler optimise it, but you can't
do that if you insist on writing in C. Fortran rules in that respect,
though even Fortran is nowhere near as optimisable as it could be.
35 years ago, it was standard practice to unroll loops by hand,
whether in Algol, Fortran or assembler, but it never was more than
a necessity because the compilers' optimisation was generally poor
(though there were exceptions). The reason that it is needed in C
is because C is almost unoptimisable, by design.
Regards,
Nick Maclaren.
That was in my suggested algorithm, reading 4x4 blocks of pixels from
the source buffer, rewrapping them (24->32->24 bits) inside registers
and writing back to the target using full cache line stores if possible.
However, if he can move stuff out of non-cacheable framebuffer space,
that will help a _lot_ more. :-)
Brett Davis wrote:
> I thought I was generous giving away top secrets that most everyone
> else hoards. I do have quite a bit more that will get another 50%,
> but he lacks the background to understand such [...]
I do appreciate the advice you've given.
For the record, I've worked several years on LNO-related topics,
thus I find your characterization rather misdirected ;-)
Regards.
I agree. C was intended as and is fairly effective as portable
assembler, but it almost assumes hand optimization, and, as many have
noted, the resulting source code can be very obscure (and the
cleverness may not be portable).
It is the triumph of the mentality that I found so unappealing on
first contact to which I am objecting, and I will stand by my
objection. The purpose of computing is not to demonstrate cleverness,
but to produce results that are useful to humans. Suppose the
maintainer of the code that Brett proposes isn't in the small
percentage of those who can successfully reverse-engineer his
cleverness. Is the likelihood of that happening something to be proud
of? Does the lost time downstream as someone wonders WTF justify the
wall clock time saved?
For applications where behavior in wall-clock time is critical,
perhaps there is no other way, but I think the choice between
transparency and cleverness tips too often or at least too
automatically in the direction of cleverness, and in no small part
because of the culture that has grown up around C.
Robert.
Ok, well, I'm going to take the risk of seeming a total doofus
on a global scale, but I don't see what you are getting at.
A[x] and C[y] are referenced only once so there is no reason
for the compiler to enregister their values, and all other
variables are locals or pass-by-value, and therefore
no reason for aliasing to occur.
The SH-4 has byte memory access instructions, so this is just the
difference between LD ST LD ST LD ST and LD LD LD ST ST ST.
The latter requires 2 extra temp regs, which on x86 causes
a var spill into the stack, but probably is ok on sh-4.
So I don't see the 2x speedup here.
Eric
> Does the lost time downstream as someone wonders WTF justify the
> wall clock time saved?
That should have read, "Is the time lost downstream as someone wonders
WTF justified by the wall clock time saved?"
Sorry.
Robert.
>A[x] and C[y] are referenced only once so there is no reason
>for the compiler to enregister their values, and all other
>variables are locals or pass-by-value, and therefore
>no reason for aliasing to occur.
A proper compiler will generate optimal code for:
for (i = 0; i < len; i++)
C[i] = A[i];
or similar code and will generate about the same code for:
pC = C; pA = A;
while (pC < &C[len])
*pC++ = *pA++;
If you can generate faster code by unrolling loops in C then you should get
a better compiler.
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
Including when A, C and len are arguments? Oh, come now! Few
compilers start with a test for whether A and C overlap, which is
what is needed.
>If you can generate faster code by unrolling loops in C then you should get
>a better compiler.
You should know better than to post that nonsense. There is a LOT
of C code where the programmer knows that two arrays can't overlap,
but the compiler can't tell.
C99 restrict brings programs that use it religiously and correctly
up to about the optimisability of Fortran 77, but that's all. And,
without it, the compiler has one hand tied behind its back.
Regards,
Nick Maclaren.
Ok, but that isn't what he said or what I was commenting on.
The MS compiler does a surprisingly good job on the
original cache-hostile code:
void Rotate90ToRight
(int height, int width, const char *srcBuf, char *dstBuf)
{
int i, j;
for (i = 0; i < height; i++)
for (j = 0; j < width; j++)
memcpy (dstBuf+(height*j+height-1-i)*3, srcBuf+(width*i+j)*3, 3);
}
tv410 = -4 ; size = 4
tv434 = 8 ; size = 4
_height$ = 8 ; size = 4
_width$ = 12 ; size = 4
_srcBuf$ = 16 ; size = 4
_dstBuf$ = 20 ; size = 4
_Rotate90ToRight@16 PROC ; COMDAT
; Line 3
push ecx
; Line 7
mov eax, DWORD PTR _height$[esp]
test eax, eax
jle SHORT $LN4@Rotate90To
mov edx, DWORD PTR _dstBuf$[esp]
push ebx
push ebp
mov ebp, DWORD PTR _srcBuf$[esp+8]
push esi
mov esi, DWORD PTR _width$[esp+12]
push edi
lea ecx, DWORD PTR [esi+esi*2]
lea edi, DWORD PTR [eax+eax*2]
mov DWORD PTR tv410[esp+20], ecx
lea edx, DWORD PTR [edi+edx-3]
mov DWORD PTR tv434[esp+16], eax
npad 5
$LL13@Rotate90To:
; Line 8
test esi, esi
jle SHORT $LN5@Rotate90To
mov eax, ebp
mov ecx, edx
npad 8
$LL3@Rotate90To:
; Line 9
mov bx, WORD PTR [eax]
mov WORD PTR [ecx], bx
mov bl, BYTE PTR [eax+2]
mov BYTE PTR [ecx+2], bl
add eax, 3
add ecx, edi
sub esi, 1
jne SHORT $LL3@Rotate90To
; Line 8
mov esi, DWORD PTR _width$[esp+16]
$LN5@Rotate90To:
; Line 7
add ebp, DWORD PTR tv410[esp+20]
sub edx, 3
sub DWORD PTR tv434[esp+16], 1
jne SHORT $LL13@Rotate90To
pop edi
pop esi
pop ebp
pop ebx
$LN4@Rotate90To:
; Line 10
pop ecx
ret 16 ; 00000010H
Eric
The key is that without alias knowledge, it is perfectly possible for
the two arrays to intersect, in which case it is illegal for the
compiler to unroll the loads and stores.
However, the huge advantage here is to load 12 bytes = 4 pixels into 3
registers, then do the same for the 3 following lines so that you have
loaded a full 4x4 pixel block with 12 load instructions instead of 36.
Next you do shifts & shuffles to rotate the 4x4 block, still ending up
with 12 registers, before storing them back with 12 writes.
This saves 2/3 of all the load/store operations.
Ok, so how would it do better without considering buffer overlap
and unrolling the loop? It still seems to me that there would
be the same number of loads and stores. To really benefit it would
have to change byte loads and stores to dword, as you say below.
Note that in the example MS code I posted, the compiler
generated the following code for the 3 byte memcpy:
; Line 9
mov bx, WORD PTR [eax]
mov WORD PTR [ecx], bx
mov bl, BYTE PTR [eax+2]
mov BYTE PTR [ecx+2], bl
which would give different results if the source and dest overlapped
but is legal because memcpy is defined as not considering overlap.
> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
> registers, then do the same for the 3 following lines so that you have
> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
>
> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
> with 12 registers, before storing them back with 12 writes.
>
> This saves 2/3 of all the load/store operations.
>
> Terje
I wonder about this. Yes, it saves lots of loads and stores,
but unless the sh-4 has byte extract and insert ops (and it doesn't look like it)
you are going to spend ops doing moves, shifts, ands, ors.
Coupled with the fact that the transpose approach touches
each source cache line once, and each dest line two times
(once for the source to dest copy-transpose, then the row reverse)
I think this would lose to just a straight copy-while-rotating.
I was thinking just doing the rotate during the copy from source to dest,
but being cache aware so that each line is loaded just once.
It should start in the lower left corner of srcBuf, copy a
vertical stripe, column by column, from bottom to top so the
output to dstBuf is serial and contiguous.
Having stores be contiguous might help if there was store queue merging.
The width of the stripe is 32 elements (96 bytes).
It should move 32 (or less) of the 3 byte elements,
then go back and do the next column in the stripe.
dst[0,0] = src[h-1,0]
dst[0,1] = src[h-2,0]
...
dst[0,31] = src[h-32,0]
dst[1,0] = src[h-1,1]
dst[1,1] = src[h-2,1]
...
dst[31,31] = src[h-32,31]
This gives a data cache working set of 33 lines,
the writes of dst are contiguous,
and each line is loaded from main memory just once.
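In rough C, that is (a sketch; 3-byte pixels, dst is W rows by H columns,
and H is assumed to be a multiple of 32 for brevity):

for (s = 0; s < H; s += 32)             /* one stripe = 32 dst columns */
    for (r = 0; r < W; r++)             /* dst row r = src column r    */
        for (c = s; c < s + 32; c++)    /* 32 contiguous dst writes    */
            memcpy(dst + ((size_t)r*H + c)*3,
                   src + ((size_t)(H-1-c)*W + r)*3, 3);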
Eric
These two could be combined so it would rotate a 4*4 block of 3 byte
elements in a vertical stripe and write 4 sequential rows at once.
I was thinking it would require too many registers to rotate
the block but I think it could be done with just 6 if you
spill each pending row value as soon as it is ready:
4 to hold the pending write value for each row,
one for the just loaded value, one for shifting, masking, merging.
Other regs for holding dstPtr, dstTmpPtr, srcPtr, srcTmpPtr, height, etc,
and it looks like it would pretty much fit in 16 registers,
and only load each cache line once.
That's probably as good as you're going to get.
Eric
That is an excellent idea, but you will note that I only suggested long
accesses for the reads, and writing bytes. Because I have made that
optimization before and ended up with longer slower code. Much to my
dismay I might add. (~5% slower.)
The reason is that almost all CPUs have a write combiner to reduce
memory traffic. Four bytes written in sequence end up as one long write
to the cache or memory. So all those extra shifts and adds just waste
cycles. Plus many of the SH chips have slow shift instructions, making
things even worse.
Brett
Benchmark it, try and stay in cache as a 66 MHz embedded CPU does
not have memory 600 cycles away.
Only an Intel class chip has enough OoO stuff to have a chance
of coming close to the same speed as my code.
Brett
Excellent start, there are at least two or three other reasons, all
of which affect performance, unlike the aliasing issue that will
only rear its ugly head when an actual alias violation occurs.
Brett
This is effectively what I'm suggesting, i.e. do the rotate while
copying, trying to minimize the number of load/store operations.
Working in 4x4 blocks is a good compromise.
Oh, you are avoiding read after write data
dependency pipeline stalls on in-order cpus.
The buffer overlap aliasing considerations in C prevents the
compiler from automatically rearranging the original LD ST
order to be more pipeline friendly for a particular cpu,
but Fortran would allow it.
Yeah ok that could be about 2x speed for that kind of cpu.
Eric
| I need to rotate a picture clockwise 90 degrees.
|
| Conceptually, my picture is represented by a matrix
| (W columns, H rows) of pixels, with each pixel
| represented by 3 octets.
|
| pixel original_picture[W][H];
| pixel rotated_picture[H][W];
|
| At the implementation level, I'm dealing with 1D integer arrays.
|
| int src[W*H*3];
| int dest[W*H*3];
|
| Conceptually, the rotation operation comes down to
|
| rotated_picture[j][H-1-i] = original_picture[i][j]
|
| My actual code is
|
| for (i = 0; i < H; ++i)
| for (j = 0; j < W; ++j)
| memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
This is not going to be fast because:-
1. you are calling a subroutine 1,913,616 times to move
9 or 48 bits each time.
2. You are multiplying by 3 twice for each move.
3. You are using integer arrays.
1. Best to use ordinary assignment.
2. Eliminate *3 by changing to adds.
3. Use byte arrays (but I'm not clear about what you say.
You say that you have octets. Now 3 octets will be 9 bits,
not 3 integers equal to 48 bits.)
This is how it would look in PL/I:-
declare (src(m,n), dest(n,m)) char (3);
do i = 1 to m;
dest(*, m-i+1) = src(i,*);
end;
If the color information really is octets,
then the declaration is changed to bit (9) aligned.
> | Conceptually, my picture is represented by a matrix
> | (W columns, H rows) of pixels, with each pixel
> | represented by 3 octets.
> | pixel original_picture[W][H];
> | pixel rotated_picture[H][W];
> | At the implementation level, I'm dealing with 1D integer arrays.
> | int src[W*H*3];
> | int dest[W*H*3];
> | Conceptually, the rotation operation comes down to
> | rotated_picture[j][H-1-i] = original_picture[i][j]
> | My actual code is
> | for (i = 0; i < H; ++i)
> | for (j = 0; j < W; ++j)
> | memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
This is basically a matrix transpose (with the difference that
one goes down instead of up). There are cache friendly matrix
transpose algorithms that should also speed this one up.
> This is not going to be fast because:-
> 1. you are calling a subroutine 1,913,616 times to move
> 9 or 48 bits each time.
Many C compilers implement memcpy() inline. I suppose you
believe that PL/I does a subroutine call for the MOD function?
> 2. You are multiplying by 3 twice for each move.
Most can also figure that one out.
> 3. You are using integer arrays.
> 1. Best to use ordinary assignment.
> 2. Eliminate *3 by changing to adds.
> 3. Use byte arrays (but I'm not clear about what you say.
> You say that you have octets. Now 3 octets will be 9 bits,
> not 3 integers equal to 48 bits.)
-- glen
> Brett Davis wrote:
> >>>>> If you do the following you can get a 2x speedup, it looks like
> >>>>> more code, but will generate less, and the results will be
> >>>>> pipelined correctly.
> >>>>> Extra bonus points to those that understand why. Half the posters here?
> >>>>>
> >>>>> {
> >>>>> unsigned char *C = B+(H*j+H-i-1)*3;
> >>>>> temp0 = A[0];
> >>>>> temp1 = A[1];
> >>>>> temp2 = A[2];
> >>>>> C[0] = temp0;
> >>>>> C[1] = temp1;
> >>>>> C[2] = temp2;
> >>>>> A += 3;
> >>>>> }
> >>>>>
> >>>>> Do not use *C++ = *A++;
> >>> Only one in a hundred programmers know an optimization like that, for
> >>> half of comp.arch to be that good says good things about comp.arch.
> >
> > Benchmark it, try and stay in cache as a 66 MHz embedded CPU does
> > not have memory 600 cycles away.
> > Only an Intel class chip has enough OoO stuff to have a chance
> > of coming close to the same speed as my code.
> >
> > Brett
>
> Oh, you are avoiding read after write data
> dependency pipeline stalls on in-order cpus.
No, nice guess.
I don't know of any CPU that stalls on a read after write, instead
they try and forward the data, and in the rare case when a violation
occurs the CPU will throw an interrupt and restart the instructions.
This is an important comp.arch point, so someone will correct me
if I am wrong.
> > The buffer overlap aliasing considerations in C prevent the
> compiler from automatically rearranging the original LD ST
> order to be more pipeline friendly for a particular cpu,
> but Fortran would allow it.
This is close, the answer is a mundane but important point.
CPUs today have a load to use delay of ~2 cycles from level one
cache. (less for slower chips, more in faster chips.)
An OoO chip can try and find other instructions to execute, but
even they are subject to this delay. (I believe.)
This copy loop is so small I don't think there is much even an
Intel/AMD chip can do. I was hoping you would benchmark it. ;)
A lot of the dual issue CPUs are partial OoO and support hit
under miss for reads, but these will epic fail at running code
like this faster.
The future of computing (this decade) is lots of simple in-order CPUs.
Rules of die size, heat and efficiency kick in. Like ATI chips.
This remains to be seen. I am tempted to say "That is so last decade 200x".
Wrt GPUs, perhaps.
However, in the last few months I have been seeing evidence of a trend the other way:
The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads.
Apple's PA-Semi team's last processor was a low power PowerPC.
I suspect that we will soon see out-of-order processors in the Qualcomm SnapDragon family and the Intel Atom family.
Intel has delayed Larrabee, in-order vector SIMD (as opposed to GPU style threaded SIMD). I would not be surprised to
see an out-of-order flavor of such an x86 vector SIMD. De-facto, AVX is that, although not in the same space as Larrabee.
I suspect that we will end up in a bifurcated market: out-of-order for the high performance general purpose computation
in cell phones and other important portable computers, in-order in the SIMD/SIMT/CoherentThreading GPU style
microarchitectures.
The annoying thing about such bifurcation is that it leads to hybrid heterogenous architectures - and you never know how
much to invest in either half. Whatever resource allocation you make to in-order SIMD vs. ooo scalar will be wrong for
some workloads.
I think that the most interesting thing going forward will be microarchitectures that are hybrids, but which are
homogenous: where ooo code can run reasonably efficiently on a microarchitecture that can run GPU-style threaded SIMD /
Coherent threading as well. Or vice versa. Minimizing the amount of hardware that can only be used for one class of
computation.
Robin, will you /please/ stop blithering on about things you don't
understand?! Buy a Data Processing dictionary, for God's sake!
The rest of this is addressed to the original poster: I don't
understand why you're using int variables for octets; they should be
char. For the rest, I'd do the following:
typedef struct __Pixel {
unsigned char red, green, blue;
} Pixel;
Pixel src[W][H];
Pixel dest[H][W];
for (int i = 0; i < W; ++i)
for (int j = 0; j < H; ++j) {
dest[j][i].red = src[i][j].red;
dest[j][i].green = src[i][j].green;
dest[j][i].blue = src[i][j].blue;
}
I'd also try:
for (int i = 0; i < W; ++i)
for (int j = 0; j < H; ++j)
memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
and see which one is faster; it will depend on the individual compiler.
(Make sure you test at the same optimization level you plan to use.)
--
John W Kennedy
"There are those who argue that everything breaks even in this old dump
of a world of ours. I suppose these ginks who argue that way hold that
because the rich man gets ice in the summer and the poor man gets it in
the winter things are breaking even for both. Maybe so, but I'll swear
I can't see it that way."
-- The last words of Bat Masterson
There used to be a fair number, and I suspect still are. However,
I doubt that stalling when the read and write are on a single CPU
will return. However, I would expect that at least some multi-core
CPUs stall at least sometimes, because there will not be cache to
cache links at all levels, and they will have to wait until the
write reaches the next cache (or memory) with a link.
But that's guessing on the basis of history and general principles,
not actual knowledge.
Regards,
Nick Maclaren.
Well, yes, but that's no different from any other choice. As I have
posted before, I favour a heterogeneous design on-chip:
Essentially uninteruptible, user-mode only, out-of-order CPUs
for applications etc.
Interuptible, system-mode capable, in-order CPUs for the kernel
and its daemons.
Most programs could be run on either, whichever there was more of,
but affinity could be used to select which. CPUs designed for HPC
would be many:one; ones designed for file serving etc would be
one:many. But all systems would run on all CPUs.
Regards,
Nick Maclaren.
And perhaps dest[j][i] = src[i][j];
But in practice W,H might only be known at runtime, making the code rather
different. Depending on the exact format of the image data, there might also
be padding bytes (nothing to do with C's struct padding), for example at the
end of each row.
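A hedged sketch of the run-time-dimensions, stride-aware variant (the
function name is illustrative; the strides are bytes per row, covering any
such padding):

#include <string.h>

void rotate90cw_strided(const unsigned char *src, size_t src_stride,
                        unsigned char *dst, size_t dst_stride,
                        int W, int H)
{
    int i, j;
    for (i = 0; i < H; i++)
        for (j = 0; j < W; j++)
            memcpy(dst + (size_t)j*dst_stride + (size_t)(H-1-i)*3,
                   src + (size_t)i*src_stride + (size_t)j*3, 3);
}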
--
Bartc
> The ARM Cortex A9 CPU is out-of-order, and is becoming more and more
> widely used in things like cell phones and iPads.
Cortex A9 is not shipping in any product yet (I believe). Lots of
preannouncements though. The Apple A4 CPU is currently believed to be a
tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
for Samsung before being acquired by Apple.
Someone with a jailbroken iPad (or having paid the 99$ fee) could run
benchmarks to probe the properties of the CPU.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
> I suspect that we will end up in a bifurcated market: out-of-order for the high performance general purpose computation
> in cell phones and other important portable computers, in-order in the SIMD/SIMT/CoherentThreading GPU style
> microarchitectures.
>
> The annoying thing about such bifurcation is that it leads to hybrid heterogenous architectures - and you never know how
> much to invest in either half. Whatever resource allocation you make to in-order SIMD vs. ooo scalar will be wrong for
> some workloads.
>
A significant part of the resources of the Cray 1 were nearly useless
to almost any customer who bought and/or used such machines.
Machines were bought by customers who had no use for vector registers
and by customers for whom there was a whole class of scalar registers
that were nearly beside the point.
However difficult those choices may have been (and I'm not sure they
weren't less important than the cost and cooling requirements of the
memory), the machines were built and people bought and used them.
I don't think the choices are nearly as hard now. Transistors are
nearly free, but active transistors consume watts, which aren't free.
There are design costs to absorb, but you'd rather spread those costs
over as many chips as possible, even if it means that most customers
have chips with capabilities they never use. So long as the useless
capabilities are idle and consume no watts, everyone is happy.
> I think that the most interesting thing going forward will be microarchitectures that are hybrids, but which are
> homogenous: where ooo code can run reasonably efficiently on a microarchitecture that can run GPU-style threaded SIMD /
> Coherent threading as well. Or vice versa. Minimizng the amount of hardware that can only be used for one class of
> computation.
I thought that was one of the goals of pushing scheduling out to the
compiler. I still don't know whether the goal was never possible or
Itanium was just a hopelessly clumsy design.
Robert.
> I thought that was one of the goals of pushing scheduling out to the
> compiler. I still don't know whether the goal was never possible or
> Itanium was just a hopelessly clumsy design.
It seems to have been possible for a limited class of problems: ones
where you could use profile-guided optimisation on a relatively small
amount of critical code that consumed almost all the CPU time, with
example data that truly represented (almost) all of the data that was
likely to be put through that critical code, and which used a fairly
small selection of the possible code paths through the critical code.
This comes down, in practice to "code that's much like the benchmarks
that were studied before designing the architecture". Which is rather
different from all the code that people want to run on high-performance
computers.
So while pushing scheduling out to the compiler was *possible*, it doesn't
seem to have been *practical*. I still bear many scars from the Itanium,
and it's had the effect of making many of the ideas used in it unattractive
for years to come.
--
John Dallman, j...@cix.co.uk, HTML mail is treated as probable spam.
Precisely. Exactly as every expert expected.
What seems to have happened is that a few commercial compscis[*]
demonstrated it working on some carefully selected programs, and
persuaded the decision makers that they could deliver it on most of
the important, performance critical, codes. The fact that it was
known to be infeasible, and had been for 25 years, was ignored.
I have no idea which people were responsible for that, though I
have heard that anyone who queried the party line was howled down
and moved to other work. But that's hearsay.
I said that the project would fail, and why, in detail, in 1995/6.
One of the two aspects they partially fixed up (the interrupt one),
at great difficulty and by dropping one of the most important
performance features. The other was a spectacular failure, for
precisely the reasons I gave. And I have never claimed to be a
world expert - all I was using was common knowledge to people who
had worked in or with those areas.
[*] NOT a flattering term.
Regards,
Nick Maclaren.
That all that cruft would lead to all kinds of problems seems hardly
surprising, but it also seems hardly intrinsic to VLIW and/or putting
more of the burden of scheduling on the compiler.
My assumption, backed by no evidence, is that HP/Intel kept adding
"features" to get the architecture to perform as they had hoped until
the architecture was sunk by its own features.
You think the problem is fundamental. I think the problem is
fundamental only because of the way that code is written, in a
language that leaves the compiler to do too much guessing for the idea
to have even a hope of working at all.
The early work from IBM *didn't* just look at computation-heavy, very
repetitive HPC-like codes. It examined implausible things like word
processors and found a tremendous amount of predictability in behavior
such as computation paths. Maybe most of that predictability has now
been successfully absorbed by run-time branch predictors, making the
possible gains in trying to exploit it at the compile stage moot.
Since the world *does* write in languages that defy optimization, and
most of the work on languages does not seem interested in how
optimizable a language is, the net conclusion is the same: the idea
will never work, but not for the almost-mathematical reasons you
claim.
Robert.
>
> No, nice guess.
> I don't know of any CPU that stalls on a read after write, instead
> they try and forward the data, and in the rare case when a violation
> occurs the CPU will throw an interrupt and restart the instructions.
> This is an important comp.arch point, so someone will correct me
> if I am wrong.
>
> > The buffer overlap aliasing considerations in C prevents the
> > compiler from automatically rearranging the original LD ST
> > order to be more pipeline friendly for a particular cpu,
> > but Fortran would allow it.
>
> This is close, the answer is a mundane but important point.
> CPUs today have a load to use delay of ~2 cycles from level one
> cache. (less for slower chips, more in faster chips.)
> An OoO chip can try and find other instructions to execute, but
> even they are subject to this delay. (I believe.)
> This copy loop is so small I don't think there is much even an
> Intel/AMD chip can do. I was hoping you would benchmark it. ;)
>
> A lot of the dual issue CPUs are partial OoO and support hit
> under miss for reads, but these will epic fail at running code
> like this faster.
>
> The future of computing (this decade) is lots of simple in-order CPUs.
> Rules of die size, heat and efficiency kick in. Like ATI chips.
>
Regardless of how natural and even gratifying it may be to you, the
outlook you clearly espouse is job security for you but what I would
deem an unacceptable burden for all ordinary mortals.
I believe that even L1 cache delays have varied between at least one
and two cycles, that L2 remains very important even for OoO chips and
that L2 delays vary even more, that L3 delay tradeoffs are something
that someone like, say, Andy would understand, but that most others
wouldn't, and that the circumstances that cause a stall are not always
clear, as evidenced by the discussion here.
If you can write code for the exact CPU and memory setup and test it
and have the time to do lots of tinkering, then super-slick hand-coded
optimizations might be worth talking about in something other than a
programming forum, and there not because the ideas have general
applicability, but because that's the kind of detail that so many
programmers seem keen on.
As it is, the computer architecture tradeoffs, like the tradeoffs in
cache delays, are probably obsessed over by computer architects, but I
can't see the relevance of a particular slick trick in C to any such
decision-making.
Robert.
That's not actually the aspect I was referring to.
>That all that cruft would lead to all kinds of problems seems hardly
>surprising, but it also seems hardly intrinsic to VLIW and/or putting
>more of the burden of scheduling on the compiler.
That is correct. It impacted badly on the interrupt issue, but much
less on the compiled code one.
>My assumption, backed by no evidence, is that HP/Intel kept adding
>"features" to get the architecture to perform as they had hoped until
>the architecture was sunk by its own features.
That might have happened in the very early days, but the feature set
was more-or-less fixed by 1994 - i.e. before the practical work
really got going. The main change to the architecture thereafter
was to DROP a feature - the asynchronous register loading/unloading.
>You think the problem is fundamental. I think the problem is
>fundamental only because of the way that code is written, in a
>language that leaves the compiler to do too much guessing for the idea
>to have even a hope of working at all.
Not at all. It also applies to other, much simpler, architectures.
The point is that the problem about 'fancy' optimisation is NOT in
the code generation, but in the program analysis. And profile-
based optimisation has precisely the restrictions that John Dallman
described, and had been known to have them for 25 years and more.
The reason that Fortran outperforms C and C++ on most extreme HPC
architectures isn't because more effort is put into the compiler
(actually, it's almost always less), but because decent Fortran
code is just SO much more analysable.
>The early work from IBM *didn't* just look at computation-heavy, very
>repetitive HPC-like codes. It examined implausible things like word
>processors and found a tremendous amount of predictability in behavior
>such as computation paths. Maybe most of that predictability has now
>been successfully absorbed by run-time branch predictors, making the
> possible gains from trying to exploit it at the compile stage moot.
That is correct, but there is one extra niggle. Most of the papers
I have seen from compscis in this area have had a fatal flaw - they
look at the post-hoc ILP and predictability. That is a classic piece
of stupidity (and, yes, it's stupid). The massive potential gains
never were real, because they required a perfect oracle - and such
oracles do not exist.
As a result, the predictability isn't good enough (in most programs)
to be worth guessing more than a few branches ahead - and designs
like the Itanic needed successful prediction of dozens.
>Since the world *does* write in languages that defy optimization, and
>most of the work on languages does not seem interested in how
>optimizable a language is, the net conclusion is the same: the idea
>will never work, but not for the almost-mathematical reasons you
>claim.
Eh? I was talking about the inherent optimisability of those very
languages, and always have been doing so. It's a semi-mathematical
constraint based on using those languages, in their current paradigms.
For heaven's sake, I started saying that we needed higher level
languages to tackle this problem back around 1971 - and I was by
no means the first person to do that!
Regards,
Nick Maclaren.
The immovable obstacle is not the number of instructions you can get
per clock or the degree of useful predictability in actual codes, but
an apparent complete lack of interest on the part of those who create
programming languages in the kind of information that needs to be
passed to the compiler to allow it to schedule successfully.
The advantage of human programmers is *not* that they know about L1
delays. Compilers can know that kind of stuff, too. The advantage of
human programmers is that they have all the information that they
throw away when writing in a language like C.
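To give one small, concrete example of recoverable information: C99's
restrict qualifier hands back exactly one of those thrown-away facts,
namely that two buffers never alias. A minimal sketch (illustrative,
not anyone's production code):

    /* 'restrict' promises the compiler that dst and src never
       overlap - a fact the programmer knew all along and the
       language would otherwise discard. The compiler is then
       free to reorder, unroll and vectorise the loop. */
    void copy_pixels(unsigned char *restrict dst,
                     const unsigned char *restrict src, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i];
    }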
Robert.
> "John W Kennedy" <jwk...@attglobal.net> wrote in message
> news:4bca7f2a$0$22520$607e...@cv.net...
>> On 2010-04-17 04:44:49 -0400, robin said:
>>> Now 3 octets will be 9 bits,
>>
>> Robin, will you /please/ stop blithering on about things you don't
>> understand?! Buy a Data Processing dictionary, for God's sake!
>>
>> The rest of this is addressed to the original poster: I don't
>> understand why you're using int variables for octets; they should be
>> char. For the rest, I'd do the following:
>>
>> typedef struct __Pixel {
>> unsigned char red, green, blue;
>> } Pixel;
>>
>> Pixel src[W][H];
>> Pixel dest[H][W];
>>
>> for (int i = 0; i < W; ++i)
>> for (int j = 0; j < H; ++j) {
>> dest[j][i].red = src[i][j].red;
>> dest[j][i].green = src[i][j].green;
>> dest[j][i].blue = src[i][j].blue;
>> }
> ...
>> memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
>
> And perhaps dest[j][i] = src[i][j];
Right. After years of mostly using Java, I keep forgetting that C can
do that. (I know well that I keep forgetting it, because I'm teaching
myself Cocoa.)
> But in practice W,H might only be known at runtime, making the code
> rather different.
I considered that, but I could only go with what was there, and W and H
were capital letters, so....
--
John W Kennedy
"Never try to take over the international economy based on a radical
feminist agenda if you're not sure your leader isn't a transvestite."
-- David Misch: "She-Spies", "While You Were Out"
> My assumption, backed by no evidence, is that HP/Intel kept adding
> "features" to get the architecture to perform as they had hoped until
> the architecture was sunk by its own features.
My own view is backed by some anecdotal evidence: I was quite early in
the queue of people learning to do Itanium porting, coming into it in
1998. A couple of things that I was told about the initial design groups
seemed telling: that they were absolutely sure that it was going to be a
stunning success, "because they were Intel and HP", and that it
contained a large number of amazingly bright people, or at least ones
who were considered amazingly bright by their employers; "can't
communicate well with ordinary people" was a quote from someone who
claimed to have worked with them.
This leads me to theorise that too many people's pet ideas got
incorporated, without being properly integrated with each other. The way
that the Register Stack Engine works on integer registers, but not
floating-point ones, the way that the predicate registers are set up -
and some of the instructions that combine them, which I was never able
to understand, and which I never saw compilers use - and such like seem
to support this theory.
There were also a couple of assumptions that turned out to be wrong. The
first was that code size didn't matter. Indeed, memory to store code in
is very cheap, but cache space is not, and the bulk of the instruction
set meant that the caches were less effective, especially if you are not
spending all your time running tight loops that fit in L1, but are doing
a fair amount of branching. That places lots of pressure on bandwidth
and latency between main memory and cache, which didn't seem to be
top-priority areas of the design.
The second was that the instruction set was the key idea. That's really
an idea from the eighties, at latest. By the mid-nineties it should have
become clear, especially to people inside Intel with access to the
work being done on the Pentium Pro/II/III code design, that the limits
on processor power had a lot more to do with caches and the speed of
communications between processor and memory: instruction decode no
longer took up enough transistors to be worth pre-optimising at the
expense of code size.
This is almost opposite what I would expect.
Out-of-order tends to benefit OS code more than many user codes. In-order coherent threading benefits mainly fairly
stupid codes that run in user space, like multimedia.
I would guess that you are motivated by something like the following:
System code tends to have unpredictable branches, which hurt many OOO machines.
With system code you may want to be able to respond to interrupts easily. I am guessing that you believe that OOO has worse
interrupt latency. That is a misconception: OOO tends to have better interrupt latency, since they usually redirect to
the interrupt handler at retirement. However, they lose more work.
(Anecdote: in P6 I asked the Novell NetWare guys if they wanted better interrupt latency or minimal work lost. They
preferred the latter, even at the cost of longer interrupt latency. However, we gave them the former, because it was
easier.)
Also, OOO CPUs tend to have more state like TLBs and caches that is not directly related to interrupts, but which
affects interrupt latency.
Finally, it is true that tricks like alternate register sets for interrupt handlers tend to be more prevalent on in-order.
--
I think workloads may be divided into several categories:
a) Trivial throughput oriented - the sort of workload that benefits most from in-order coherent threaded GPU-style
microarchitectures. Lots of parallelism. Simple instruction and memory coherency.
b) Classical out-of-order workloads: irregular parallelism, pointer chases but also sizeable work at each pointer miss.
Predictable branches.
c) Intermediate: unpredictable branches, but pointers with fan-out/MLP. Classic system code. For that matter, let's
throw interrupt latency into the mix.
OOO dataflow can help speed up system code in class c), but you may lose the benefit due to branch mispredictions.
Better to switch threads than to predict a flakey branch.
Now, code in class c) can also be executed on in-order thread-switching systems. OOO dataflow just improves the latency
of such code, which amounts to reducing the number of threads needed for a given performance level. Since, in my
experience, there are far fewer threads in class c) than in either of the other classes, reducing the number of system
threads required seems like a good tradeoff.
The taxonomy is not complete. These are just the combinations that I see as most important.
> What seems to have happened is that a few commercial compscis[*]
> demonstrated it working on some carefully selected programs, and
> persuaded the decision makers that they could deliver it on most of
> the important, performance critical, codes.
> [*] NOT a flattering term.
Hmm... From my point of view, the Itanium was the first computer architecture driven mainly by academics with PhDs.
Its immediate predecessor, the P6, has only one PhD amongst its 5 primary architects. (Bob Colwell.)
Itanium had a lot more people who had piled it higher and deeper.
Ref?
Most of the articles that I have seen say that the iPad A4 is a Cortex A9.
One conspiracy-theorist type seems to think that it might actually be the PA Semi OOO PowerPC, running an ARM emulator.
I think the problem was/is fundamentally a political issue with the
leadership of the design teams, especially in the ability of the
leadership to say "No, let us not dedicate or expend resources
investigating that corner of the design space."
Mitch
> System code tends to have unpredictable branches, which hurt many OOO machines.
I think it is easier to think that system codes have so many inherent
serializations that the effort applied in doing OoO is "for naught"
and that these great big OoO machines degrade down to just about the
same performance as their absolutely in-order cousins.
It's a far bigger issue than simple branch mispredictability. Pointer
chasing into poorly cached data structures is rampant; "dangerous"
instructions that are inherently serialized; and poor TLB translation
success rates. Overall, there just is not that much ILP left in many
of the paths through system codes.
Mitch
> The second was that the instruction set was the key idea. That's really
> an idea from the eighties, at latest. By the mid-nineties it should have
> become clear, especially to people inside Intel with access to the
> work being done on the Pentium Pro/II/III code design, that the limits
> on processor power had a lot more to do with caches and the speed of
> communications between processor and memory: instruction decode no
> longer took up enough transistors to be worth pre-optimising at the
> expense of code size.
Itanium was designed by people who thought that P6-style out-of-order was going to fail.
In many ways they were the P6 competitors. P6 was Oregon. Itanium was California. Many Itanium folk were from P5.
This forces the OS to effectively become a message-passing system, since
every single OS call would otherwise require a pair of migrations
between the two types of cpus.
I'm not saying this would be bad though, since actual data could still
be passed as pointers...
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
> On 4/18/2010 4:57 AM, Niels Jørgen Kruse wrote:
> > Andy "Krazy" Glew<ag-...@patten-glew.net> wrote:
> >
> >> The ARM Cortex A9 CPU is out-of-order, and is becoming more and more
> >> widely used in things like cell phones and iPads.
> >
> > Cortex A9 is not shipping in any product yet (I believe). Lots of
> > preannouncements though. The Apple A4 CPU is currently believed to be a
> > tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
> > for Samsung before being acquired by Apple.
>
> Ref?
>
> Most of the articles that I have seen say that the iPad A4 is a Cortex A9.
That would have been the early speculation. A more current view is
<http://www.anandtech.com/show/3640/apples-ipad-the-anandtech-review/16>
and the next page with benchmarks. The A4 is about half the speed of an
Atom N450 (512K Cache, 1.66 GHz) on a web page load benchmark. The hype
around the Cortex A9 would lead us to expect better.
>On 4/18/2010 4:57 AM, Niels Jørgen Kruse wrote:
>> Andy "Krazy" Glew<ag-...@patten-glew.net> wrote:
>>
>>> The ARM Cortex A9 CPU is out-of-order, and is becoming more and more
>>> widely used in things like cell phones and iPads.
>>
>> Cortex A9 is not shipping in any product yet (I believe). Lots of
>> preannouncements though. The Apple A4 CPU is currently believed to be a
>> tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
>> for Samsung before being acquired by Apple.
>
>Ref?
>
>Most of the articles that I have seen say that the iPad A4 is a Cortex A9.
Based on my count of google results, A8 is slightly ahead of A9 in
terms of what core is actually in A4.
--
Muzaffer Kal
DSPIA INC.
ASIC/FPGA Design Services
> Itanium was designed by people who thought that P6-style out-of-order
> was going to fail.
Ah, that makes sense. Thanks. In some ways the Itanium method of running
several instructions at once seems more "obvious". I was convinced by it
at first, and only gradually realised that in spite of its intuitive
appeal, it did not work well in this example.
> In many ways they were the P6 competitors. P6 was Oregon. Itanium
> was California. Many Itanium folk were from P5.
If they had left P5 in the relatively early days, when clock speeds were
still under 100MHz, their implicit assumptions about speeds and
bandwidths make more sense. Thanks again.
>
> Hmm... From my point of view, the Itanium was the first computer
> architecture driven mainly by academics with PhDs.
>
> Its immediate predecessor, the P6, has only one PhD amongst its 5
> primary architects. (Bob Colwell.)
>
> Itanium had a lot more people who had piled it higher and deeper.
I am very grateful to the PhD academics who tolerated my eccentricities
as I pursued my PhD and for the education they gave me.
I am very grateful to the PhD academics who helped me to develop as an
independent thinker with some competence and self-confidence as I worked
outside academia.
I am very grateful to those who actually *do* things, with or without a
PhD, who continue to be generous with their knowledge and experience.
Scientists and engineers at every level of education are not the most
adept at negotiating political and social landscapes. I'm glad to have
people to talk to.
As to what is the foolproof way to manage the diverse range of talent
that is a multi-billion dollar R&D effort, I don't think there is such a
thing.
Robert.
The Story of Mel, a Real Programmer
http://www.cs.utah.edu/~elb/folklore/mel.html
> I am not him, my code is clear and readable.
>
> Brett ;)
No, not at all. You are thinking performance - I am thinking RAS.
Trying to get asynchronous and parallel code with a lot of subtle
interactions (which is the case with many kernels) to work at all
is hard; doing it with highly out-of-order CPUs is murder. Most
shared-memory parallel codes (kernel and other) have lots of race
conditions that don't show up because the synchronisation time is
short compared with the time between the critical events.
However, when one application hammers the CPU hard, that can cause
large delays for OTHER threads (including kernel ones). As I said
earlier, I have seen 5 seconds delay in memory consistency. The
result is that you get very low probability, load-dependent,
non-repeatable failures. Ugh.
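A minimal sketch of the kind of latent race I mean, in C with
pthreads (names invented; it "works" until load stretches the window
between the two writes becoming visible):

    #include <pthread.h>

    static int data;
    static int ready;   /* no atomics, no barriers - that is the bug */

    static void *writer(void *arg)
    {
        data = 42;
        ready = 1;      /* may become visible before 'data' does */
        return NULL;
    }

    static void *reader(void *arg)
    {
        while (!ready)  /* compiler or CPU may reorder these loads */
            ;
        return (void *)(long)data;  /* can observe stale 'data' */
    }

    int main(void)
    {
        pthread_t w, r;
        void *out;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, &out);
        return out == (void *)42 ? 0 : 1;
    }

Under light load the writer finishes long before the reader looks, so
the race never fires; hammer the machine and it can.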
Regards,
Nick Maclaren.
Yes and no. That was definitely the cause, but the missing ability
was to ask "Hang on. Is what we are assuming really true?"
Regards,
Nick Maclaren.
That was the experience in the days of the System/370. User code
got a factor of two better ILP than system code.
Regards,
Nick Maclaren.
Yup. In my view, interrupts are doubleplus ungood - message passing
is good.
And, of course, the memory could be shared at some fairly close
cache level.
Regards,
Nick Maclaren.
>
> The Story of Mel, a Real Programmer
> http://www.cs.utah.edu/~elb/folklore/mel.html
>
>> I am not him, my code is clear and readable.
>>
I've hacked machine code.
You probably know of the proof of the Poincaré Conjecture and the
attempt to claim credit for it by filling in "missing" steps.
Grigory Perelman had a right to be annoyed. If what was obvious to him
was not obvious to others, that was not his failing.
The same rules, I claim, do not apply to programming. What you have
done, and why, should be obvious to anyone with enough competence to
read the syntax.
If it's not obvious through the code itself, then it should be obvious
from the comments.
Robert.
> On 4/18/2010 4:57 AM, Niels Jørgen Kruse wrote:
> > Andy "Krazy" Glew<ag-...@patten-glew.net> wrote:
> > Cortex A9 is not shipping in any product yet (I believe). Lots of
> > preannouncements though. The Apple A4 CPU is currently believed to be a
> > tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
> > for Samsung before being acquired by Apple.
>
> One conspiracy-theorist type seems to think that it might actually be the
> PA Semi OOO PowerPC, running an ARM emulator.
A PowerPC running an emulator was within the realm of possibility.
I bet Apple looked at it.
The problem is that PowerPC offers little over what ARM provides.
(Besides 64 bit address mode, and a nice vector processor,
both of which would cost battery power.)
Thumb mode offers smaller code, a benefit for a handheld.
Apple knows that Thumb is a kludge, so I hope Apple is looking at designing
their own CPU instruction set - engineering a competitive advantage.
Hopefully they design something nice like my CLIW.
If the ARM chip has a nice MMU then Apple can stay 32 bits for a decade,
otherwise the roadmap will be looking dire for ARM in ~2 years.
Fixing the MMU would be the easy solution, it buys time.
Brett
> In article <4BCB60FF...@patten-glew.net>, ag-...@patten-glew.net (
> Glew) wrote:
>
> > Itanium was designed by people who thought that P6-style out-of-order
> > was going to fail.
>
> Ah, that makes sense. Thanks. In some ways the Itanium method of running
> several instructions at once seems more "obvious". I was convinced by it
> at first, and only gradually realised that in spite of its intuitive
> appeal, it did not work well in this example.
Intriguing - could you elaborate? (Bear in mind I would like to know the
good points in Itanium, despite the mocking of Itanic.)
Here is the Itanic Software dev manual, start at page 1:14, section 2.5:
http://download.intel.com/design/Itanium/manuals/24531705.pdf
Also the instruction set manual:
http://download.intel.com/design/Itanium/manuals/24531905.pdf
I look at Section 2.6.1/2.6.2 and I see something similar to the PowerPC
pre-load hint. (A no-harm data preload hint, could be great, is useless.)
The "acheck" and "use" stuff makes me go: What!?! Are you serious!?!
2.6.3 Predication, ARM has this. Seems to hurt clock rate?
2.7: Register Windows, SPARC has this, cackle. ;)
2.8: Branch hints could be nice. Will add a form to CLIW.
2.9: Register rotation, someone needs to be locked in a rubber room. ;)
> > In many ways they were the P6 competitors. P6 was Oregon. Itanium
> > was California. Many Itanium folk were from P5.
>
> If they had left P5 in the relatively early days, when clock speeds were
> still under 100MHz, their implicit assumptions about speeds and
> bandwidths make more sense. Thanks again.
It is my opinion that Itanic is a disaster at any speed. ;)
Brett
Doubtless there were some. The early marketing was excellent, but
that wasn't IN Itanium.
>2.7: Register Windows, SPARC has this, cackle. ;)
>
>2.9: Register rotation, someone needs to be locked in a rubber room. ;)
Yes and no. They can be VERY effective, as on the Hitachi SR2201.
However, what Hitachi understood and HP/Intel didn't is that you
MUST keep them clean, and they really don't mix with any break in
the sequential control flow, including fixup-by-interrupt.
>It is my opinion that Itanic is a disaster at any speed. ;)
Technically, that does seem to be the case.
Regards,
Nick Maclaren.
I guess the Berkeley RISC and Stanford MIPS was driven by the
students, then, not the advisors (who had PhDs).
- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
> 2.6.3 Predication, ARM has this. Seems to hurt clock rate?
I don't know that that's the limit on clock rate, but what do I know. In
any case, ARM's predicates are condition-code flavoured. More Itanium-
like are the predicates used by TI in the C6000 series of DSPs: simple
boolean values in the first five registers (why five? no idea). TI has
versions of the C6000 that run to about 1GHz, which isn't high compared
to mainstream processors, but there aren't many DSP cores faster.
Besides the TI, the Trimedia is still going, and although decamped for
the afterlife, the Transmeta Crusoe didn't really do too badly,
considering (I used one in a Fujitsu laptop). VLIW is probably
unhelpful, now that there are enough transistors and branch predictors to
make OOO behave reasonably, but it's not an unreasonable approach for a
dedicated/embedded application.
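For anyone following along, here is the effect of predication
sketched at the C level (the if-conversion idea only - this is not TI
or Itanium syntax):

    /* Branchy form: one data-dependent branch per element,
       painful when it is unpredictable. */
    for (i = 0; i < n; i++) {
        if (a[i] < 0)
            a[i] = -a[i];
    }

    /* If-converted form: the condition becomes a plain value and
       both outcomes are computed unconditionally, so there is no
       branch to mispredict. A predicated ISA expresses the final
       select as a single guarded instruction. */
    for (i = 0; i < n; i++) {
        int p = (a[i] < 0);
        a[i] = p ? -a[i] : a[i];
    }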
Cheers,
--
Andrew
How did they work?
Stefan
Floating-point only, software-controlled, bypassing the cache.
It was called pseudovectorisation, which describes it very well.
Good, vectorisable, Fortran code ran like the clappers, as the
technique almost completely eliminated memory latency. To achieve
that, it dropped (most) support of IEEE denorms etc., and you had
a heavyweight switch between vector and scalar modes.
A slightly different form was used on the SR8000, but I have no
personal experience of that. While it could have been extended
to other forms of use, that wouldn't have been very far. Similarly,
older systems that used register rotation for procedure calls had
to put quite a lot of restrictions on that would fit badly with
some modern languages.
I think that it could be used on a general-purpose CPU, but ONLY
if you designed the architecture round it - not, as on the Itanic,
bolting it on together with every knob, frob, bell and whistle
that you could find in the lumber room.
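As a loose C analogy - my reconstruction, not Hitachi's actual
interface - the combination of register rotation and early loads
amounts to a software pipeline like this:

    /* Each element is loaded AHEAD iterations before it is used,
       so the multiply never waits on memory. Hardware register
       rotation renames the buffer slots each iteration; here the
       modulo index fakes it. AHEAD is an assumed latency. */
    #define AHEAD 8

    void scale(const double *x, double *y, double a, int n)
    {
        double buf[AHEAD];
        int i;
        for (i = 0; i < AHEAD && i < n; i++)
            buf[i] = x[i];                 /* prime the pipeline */
        for (i = 0; i < n; i++) {
            double xi = buf[i % AHEAD];    /* issued long ago */
            if (i + AHEAD < n)
                buf[(i + AHEAD) % AHEAD] = x[i + AHEAD];
            y[i] = a * xi;
        }
    }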
Regards,
Nick Maclaren.
> It is my opinion that Itanic is a disaster at any speed. ;)
Andy seemed to express his opinion that there were too many opinions
to begin with.
If you could design-by-opinion, I'm sure that development would be
much less expensive.
Although Andy has not mentioned it, I suspect that the compiler-that-
never-really-arrived played a significant role in keeping the design
process a battle of opinions for a very long time.
Intel managed to keep Itanium at or near the lead in a number of SPEC
benchmarks, and I assume that it was compiler tuning that allowed them
to do that.
Intel's success at that enterprise, I'm sure, left end users puzzled
as to why the chip never looked as good in their applications as it did
in the benchmarks.
Internally, you could say, "See. The compiler *can* do it, if only in
a limited number of cases."
Burning watts at runtime to schedule and virtually mandating clever
and sometimes obscure hand-coding in order to achieve acceptable
performance are both inferior to having a compiler that can schedule
naive code successfully enough to compete with run-time scheduling.
Beating up on the feature set of Itanium seems pretty pointless. The
question that still begs to be answered is: how much can you push out
to a compiler (and not tricky hand coding) and how do you do it? The
features that were added to Itanium, whatever their merits or obvious
disadvantages, don't seem to have helped enough. The question remains
whether one could do better.
Robert.
Sounds like IA-64, in particular Itanium II.
>I think that it could be used on a general-purpose CPU, but ONLY
>if you designed the architecture round it - not, as on the Itanic,
>bolting it on together with every knob, frob, bell and whistle
>that you could find in the lumber room.
It was used on IA-64, an architecture intended to be general-purpose.
And it did run vectorizable loops fast. The problem is that the
performance for most other stuff is mediocre, mostly because the clock
rate does not keep up with the competition.
CDC was the only company to get this one right. The OS ran mostly* in
the peripheral processors, leaving the great big number cruncher to
(ahem) crunch numbers.
(*) The interrupt processing and I/O were run in the PPs and most of the
OS scheduling was run in the PPs.
I remember a time back in 1979, I was logged into a CDC 7600 in
California doing text editing. There were a dozen other members of the
SW team doing similarly. There was a long silent pause where no
character echoing was taking place. A few moments later (about 30
seconds) the processing returned to normal. However, we found out that
we were now logged into a CDC 7600 in Chicago. The California machine had
crashed, and the OS had picked up all the nonfaulting tasks, shipped
them up to another machine half way across the country and restarted
the processes.
Why can't we do this today? We could 30 years ago!
Mitch
>I remember a time back in 1979, I was logged into a CDC 7600 in
>California doing text editing. There were a dozen other members of the
>SW team doing similarly. There was a long silent pause where no
>character echoing was taking place. A few moments later (about 30
>seconds) the processing returned to normal. However, we found out that
>we were now logged into a CDC 7600 in Chicago. The Ca machine had
>crashed, and the OS had picked up all the nonfaulting tasks, shipped
>them up to another machine half way across the country and restarted
>the processes.
>
>Why can't we do this today? We could 30 years ago!
I think, in a setup with the (nowadays clearly unaffordably high)
level of computing staff that would be attached to an organisation
with two CDC 7600s in 1979, that is entirely possible today: you log
into a front-end machine which connects to a back-end machine which is
kept as a migratable VM.
Nobody bothers doing it for text editing because it's crazily
uneconomical.
Tom
By "work" I meant: what were the instructions provided to setup/control
the register rotation feature and what were their semantics?
Stefan
It was 12+ years ago now, and I didn't program in assembler on that
system, anyway. I might have a specification somewhere, but I would
have to search.
Regards,
Nick Maclaren.
I surprised a friend who is working on speculative multithreading when he asked what benchmark I used for my SpMT work.
I said "gcc". In my experience, gcc is the user mode benchmark tha is most challenging, and which most resembles system
code.
I reject "inherently serialized" instructions. Very little need be inherently serialized. Such serializations tend to
happen because you have not wanted to rename or predict the result. Only true MSR/creg accesses need be inherently
serialized.
Pointer chasing: I'm the MLP guy. I can show you a dozen ways to make pointer chasing run faster. Mainly: very
seldom do you just access the pointer. Usually you access p=p->nxt or p->p->link, plus several fields p->field1,
p->field2. You always need to consider the ratio of non-pointer chases to pointer chases. Of late, the ratio has been
INCREASING, i.e. system code has been becoming more amenable.
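To make that concrete, an illustrative sketch (names invented):

    /* Only p = p->next is serialized on the previous miss; the
       field loads are independent and can issue in its shadow,
       which is where the memory-level parallelism comes from. */
    struct node {
        struct node *next;
        int field1, field2, field3;
    };

    int sum_fields(const struct node *p)
    {
        int s = 0;
        while (p) {
            s += p->field1 + p->field2 + p->field3; /* overlapped */
            p = p->next;                            /* the chase */
        }
        return s;
    }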
TLB miss rates: again, I can show/have shown many ways to improve these. One of my favorites is to cache a predicted
TLB translation inside a data memory cache line, possibly using space freed up by compression.
Mitch: you're a brilliant guy, but you have only seen a small fraction of my ideas. Too bad we never got to work
together at AMD or Motorola.
At Intel, on P6, we made some deliberate decisions that prevented this. We "deliberately" decided not to provide fault
containment within shared memory - most notably, we had incomplete cache tag snooping. When an error was detected, we
could not guarantee how far it had propagated - it might have propagated anywhere in cache coherent shared memory.
I quote "deliberately" because I was aware of this decision - I flagged it, and its consequences - I don't know how far
up the chain of command it propagated. Actually, I don't think it mattered - we would probably have made the same
decision no matter what. The real problem was that, when demand arose to have better error containment, the knowledge
was lost, and had to be reconstructed. Usually without involving the original designer (me).
Nehalem has added error poison propagation, so this sort of thing can now be done.
When will you see OSes taking advantage? Don't hold your breath.
By the way, OSes have nearly always been able to do this using message passing. But apparently there was not enough demand.
Isn't there a GUI benchmark? Most of that code is diabolical. But I
agree that gcc is an excellent bellwether for a lot of kernel, daemon
and utility code.
>I reject "inherently serialized" instructions. Very little need be
>inherently serialized. Such serializations tend to happen because you
>have not wanted to rename or predict the result. Only true MSR/creg
>accesses need be inherently serialized.
There are some, but they tend to be used sparsely in run-time systems
and language libraries, rather than open code. But I don't know
what you are counting as MSR/creg accesses.
Regards,
Nick Maclaren.
There are good points in Itanium.
Their system architecture, e.g. SMM, is very good.
Even in VLIW they are not so bad. I think they just pushed everything too far.
All that I was trying to say was that I found it more obvious, twelve
years ago, that the Itanium explicit-parallelism approach could run
several instructions at once than it was that a complex system of
microinstructions with automatic dependency tracking, à la Pentium Pro,
could do the same.
At the point I was forming this opinion, it was thoroughly demonstrated
that the Pentium Pro and its descendants worked well. It seemed to me
that the EPIC approach could also work, and save transistors to allow
more functional units.
Time has shown that I was wrong about that, and I no longer trust my
intuition on these things, but resort to measurements. It seems that I
was looking at it from a vaguely similar PoV to the Itanium architects,
not having properly appreciated the wealth of transistors becoming
available, and the way that latency and bandwidth in and out of the
cache are vital.
HITACHI SR2201 Massively Parallel Processor
http://www.hitachi.co.jp/Prod/comp/hpc/eng/sr1.html
Has a short Do Loop example. Copied Cray after that boat sailed, and sank.
But ultimately is not register windowing just a horrid complex slow
way to get more register bits, in a fixed-width instruction set?
Are you not profoundly better off just adding an opcode extension word
with more register bits, like CLIW?
The overhead to keep track of all the windows is hardware CISC,
which cuts your clock rate to half of what real RISC and x86 provide.
Brett
> But ultimately is not register windowing just a horrid complex slow
> way to get more register bits, in a fixed-width instruction set?
Not in the case of Itanium, which has tons of registers.
The purpose, as I understand it, is to permit more seamless operation
across procedure calls.
Robert.
> robin said:
>
>> Now 3 octets will be 9 bits, [...]
http://en.wikipedia.org/wiki/Octet_(computing)
> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char.
You're right. It was a typo.
> For the rest, I'd do the following:
>
> typedef struct __Pixel {
> unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
> for (int j = 0; j < H; ++j) {
> dest[j][i].red = src[i][j].red;
> dest[j][i].green = src[i][j].green;
> dest[j][i].blue = src[i][j].blue;
> }
Are you sure the above is a description of a rotation? :-)
Clockwise or counter-clockwise?
> I'd also try:
>
> for (int i = 0; i < W; ++i)
> for (int j = 0; j < H; ++j)
> memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
>
> and see which one is faster; it will depend on the individual compiler.
> (Make sure you test at the same optimization level you plan to use.)
The target system is
CPU: 266 MHz, dual issue, 5-stage integer pipeline, SH-4
RAM: Two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)
After much testing, it dawned on me that the system's memory
allocator returns non-cached memory. (I found no way to request
large contiguous buffers in cached memory.) All cache-specific
optimizations thus became irrelevant.
On this system, a load from non-cached memory has a latency of
~45 cycles, thus the only optimization that made sense was to
load 32-bit words instead of octets. I configured libjpeg to
output 32-bit pixels instead of 24-bit pixels.
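For reference, one way to get 32-bit pixels out of classic IJG
libjpeg is the compile-time knob in jmorecfg.h (check your tree,
since builds differ; libjpeg-turbo instead offers run-time JCS_EXT_*
colorspaces):

    /* jmorecfg.h excerpt: pad each output pixel to 4 bytes */
    #define RGB_RED       0
    #define RGB_GREEN     1
    #define RGB_BLUE      2
    #define RGB_PIXELSIZE 4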
Then I got away with trivial code:
void rotate_right(uint32 *A, uint32 *B, int W, int H)
{
int i, j;
for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
B[H*j + H-1-i] = A[W*i + j]; /* B[j][H-1-i] = A[i][j] */
}
void rotate_left(uint32 *A, uint32 *B, int W, int H)
{
int i, j;
for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */
}
gcc-4.2.4 -O2 was smart enough to strength-reduce the index
computation for both arrays.
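In C terms the transformation is roughly the following (a sketch of
what the compiler did, not its literal output; the actual SH-4 code
is below):

    /* Strength-reduced rotate_right: the H*j and W*i multiplies
       are gone, replaced by a sequential source pointer and a
       stride-H destination pointer. */
    void rotate_right_sr(const uint32 *A, uint32 *B, int W, int H)
    {
        int i, j;
        for (i = 0; i < H; ++i) {
            uint32 *dst = B + (H - 1 - i); /* top of dest column */
            for (j = 0; j < W; ++j) {
                *dst = *A++;               /* sequential reads */
                dst += H;                  /* stride-H writes */
            }
        }
    }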
00000000 <_rotate_right>:
0: 86 2f mov.l r8,@-r15
2: 15 47 cmp/pl r7
4: 96 2f mov.l r9,@-r15
6: 63 68 mov r6,r8
8: 15 8f bf.s 36 <_rotate_right+0x36>
a: 73 61 mov r7,r1
c: 08 47 shll2 r7
e: 83 69 mov r8,r9
10: fc 77 add #-4,r7
12: 13 66 mov r1,r6
14: 7c 35 add r7,r5
16: 08 49 shll2 r9
18: 04 77 add #4,r7
1a: 15 48 cmp/pl r8
1c: 07 8b bf 2e <_rotate_right+0x2e>
1e: 43 60 mov r4,r0
20: 53 63 mov r5,r3
22: 83 62 mov r8,r2
24: 06 61 mov.l @r0+,r1
26: 10 42 dt r2
28: 12 23 mov.l r1,@r3
2a: fb 8f bf.s 24 <_rotate_right+0x24>
2c: 7c 33 add r7,r3
2e: 10 46 dt r6
30: fc 75 add #-4,r5
32: f2 8f bf.s 1a <_rotate_right+0x1a>
34: 9c 34 add r9,r4
36: f6 69 mov.l @r15+,r9
38: 0b 00 rts
3a: f6 68 mov.l @r15+,r8
3c: 09 00 nop
3e: 09 00 nop
The loop kernel is
24: 06 61 mov.l @r0+,r1
26: 10 42 dt r2
28: 12 23 mov.l r1,@r3
2a: fb 8f bf.s 24 <_rotate_right+0x24>
2c: 7c 33 add r7,r3
Thanks to all for your suggestions (especially Terje).
Regards.