On Thursday, October 29, 2015 at 12:09:58 AM UTC+11, Bruce Mardle wrote:
> Thanks, Mux and Tom. I think I'll go with...
> On Monday, 26 October 2015 22:37:06 UTC, Tom Evans wrote:
> ... especially when I need to do it twice. (As in a 'memcpy'.
> I have to treat even source/even destination differently from
> odd src/odd dst and from 1 odd/1 even.)
On a 68010 you've got "loop mode". The following code, from a project I worked on in 1991, should (see later) take advantage of this:
| copy bytes, using movb,movw, or movl as appropriate.
| NB: a len of <= 0 is treated as = 0, ie: do nothing.
_bcopy: movl sp@(4),d0
1$: movb a0@+,a1@+
2$: btst #1,d0
3$: movw a0@+,a1@+
4$: asrl #2,d0
5$: movl a0@+,a1@+
Note: I say "should" because the Motorola 68000 User Manual is confusing and most likely dead wrong.
"APPENDIX A MC68010 LOOP MODE OPERATION" gives as an example of Loop Mode:
LOOP LEA SOURCE, A0 Load A Pointer To Source Data
LEA DEST, A1 Load A Pointer To Destination
MOVE.W #LENGTH, D0 Load The Counter Register
MOVE.W (A0);pl, (A1)+ Loop To Move The Block Of Data
DBEQ D0, LOOP Stop If Data Word Is Zero
Figure A-1. DBcc Loop Mode Program Example
I'm pretty sure ";pl" is meant to be a "+" in the above. So it is the classic block-move operation with the magic 68k auto-increment on the address registers.
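For anyone who doesn't read 68k, here is a rough C rendering of what the MOVE.W/DBEQ pair in that example does. It is only a sketch: the function name and the return value are mine, not the manual's. It relies on the documented DBcc behaviour: fall through if the condition is true, otherwise decrement the 16-bit counter and fall through when it goes from 0 to -1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the manual: models
 *     MOVE.W (A0)+,(A1)+
 *     DBEQ   D0,<the MOVE.W above>
 * Returns the number of words moved. */
size_t copy_words_until_zero(uint16_t *dst, const uint16_t *src, uint16_t count)
{
    size_t moved = 0;

    for (;;) {
        uint16_t w = *src++;    /* MOVE.W (A0)+,(A1)+ sets Z if the word is 0 */
        *dst++ = w;
        moved++;

        if (w == 0)             /* EQ true: DBEQ falls through */
            break;
        if (count-- == 0)       /* counter went 0 -> -1: DBEQ falls through */
            break;
    }
    return moved;
}
```

Note that the terminating zero word is itself copied, and that a counter of N allows up to N + 1 moves.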
Fine, except the next table in the book, "Table A-1. MC68010 Loop Mode Instructions" lists all the acceptable addressing mode combinations, and "(Ay)+ to (Ax)+" is NOT THERE. The table says the most used addressing mode isn't supported.
That has to be wrong because "Table 9-2. Move Byte and Word Instruction Execution Times" documents the timing for this most useful case.
The loop-mode figures are 14 clocks per iteration for "MOVE.W (A0)+, (A1)+" and 22 clocks for "MOVE.L (A0)+, (A1)+".
But if you ignore loop mode and simply unroll the copy loop by eight, then eight long moves take (8 * 20 + 10) = 170 clocks, while loop mode takes (8 * 22) = 176. For words it is 106 unrolled against 112 in loop mode. Loop mode is better if your memory has wait states, though.
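The sums behind those numbers, written out as a small C sketch. The function names are made up for illustration. The 20, 22, 14 and 10 clock figures are the ones quoted above; the 12 clocks for an unrolled word move is my inference from the 106 total, not a figure looked up in the manual.

```c
/* Hypothetical helpers for the clock arithmetic only. */

/* Unrolled: N straight-line moves plus one loop-closing DBcc. */
static int unrolled_clocks(int moves, int clocks_per_move, int dbcc_clocks)
{
    return moves * clocks_per_move + dbcc_clocks;
}

/* Loop mode: each iteration (move + DBcc) has a single combined cost. */
static int loop_mode_clocks(int moves, int clocks_per_iteration)
{
    return moves * clocks_per_iteration;
}
```

So unrolled_clocks(8, 20, 10) against loop_mode_clocks(8, 22) is the long case, and unrolled_clocks(8, 12, 10) against loop_mode_clocks(8, 14) is the word case. Either way the unrolled version wins by 6 clocks per eight moves with zero-wait-state memory.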
The big win is changing simple and dumb "move bytes" code to moving words and longs when it can, as you're doing.
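A minimal C sketch of that idea, assuming a 68000-style rule that word and long accesses must be at even addresses. This is not the _bcopy above: the name and the order of the steps are mine, and the fixed-size memcpy calls just stand in for a single long or word move so the C stays portable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. A len of <= 0 does nothing. */
void copy_by_width(void *dst, const void *src, long len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len <= 0)
        return;

    /* One odd, one even: they can never both be even, so bytes only. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 1) != 0) {
        while (len-- > 0)
            *d++ = *s++;
        return;
    }

    /* Both odd: a single byte move makes both even. */
    if (((uintptr_t)s & 1) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Bulk of the copy as longs. */
    while (len >= 4) {
        memcpy(d, s, 4);
        d += 4;
        s += 4;
        len -= 4;
    }

    /* At most one trailing word, then at most one trailing byte. */
    if (len >= 2) {
        memcpy(d, s, 2);
        d += 2;
        s += 2;
        len -= 2;
    }
    if (len > 0)
        *d = *s;
}
```

The three cases are the same ones mentioned in the quoted text: even/even goes straight to longs, odd/odd needs one leading byte first, and mixed parity is stuck with bytes.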
But the fastest way to copy memory is to design your system so you don't have to copy at all, but just copy/read it ONCE and then pass pointers around.
When you get into the RISC CPUs it gets really complicated. The fastest way to copy (external DDR) memory on even a middle-of-the-range ColdFire chip is to copy 64 32-bit words from the external DDR to the internal SRAM, and then copy from there back to DDR. That keeps the caches happy and the memory controller "on page". And since it is RISC, all copies have to go through registers! So DDR3 --> Register --> SRAM --> Register --> DDR3.
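The shape of that staged copy, as a hedged C sketch. The names are hypothetical, and the staging buffer here is an ordinary static array; on real hardware you would have to place it in the internal SRAM yourself (linker section or fixed address), which this sketch does not attempt. It says nothing about actual timing either.

```c
#include <stddef.h>
#include <stdint.h>

#define STAGE_WORDS 64                  /* 64 32-bit words per burst */

/* Stand-in for a buffer located in on-chip SRAM (assumption). */
static uint32_t stage[STAGE_WORDS];

/* Hypothetical helper: copy nwords 32-bit words in two phases per chunk. */
void staged_copy(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords > 0) {
        size_t chunk = nwords < STAGE_WORDS ? nwords : STAGE_WORDS;
        size_t i;

        /* Phase 1: a run of reads from the source region into the buffer. */
        for (i = 0; i < chunk; i++)
            stage[i] = src[i];

        /* Phase 2: a run of writes from the buffer to the destination. */
        for (i = 0; i < chunk; i++)
            dst[i] = stage[i];

        src += chunk;
        dst += chunk;
        nwords -= chunk;
    }
}
```

The point of the two separate phases is that the external memory sees a run of reads followed by a run of writes, rather than alternating read/write accesses to two different areas.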