Testing whether address register odd


Bruce Mardle

Oct 20, 2014, 4:55:36 AM
Hi, all.
What's the quickest way of testing whether the contents of an address register is odd (on a 68010)?
Is there anything faster than:
move a0, d0
btst #0, d0

Mux

Oct 26, 2015, 1:48:35 AM
Hi!

A simple 'and' would do the trick as well (i.e. and #1,d0). Don't know if bit-testing is faster than and'ing. Alternatively you can shift the value right and check the carry flag.

-Y

Tom Evans

Oct 26, 2015, 6:37:06 PM
Hello from over a year ago (and to Mux, yesterday). I hope you weren't waiting for these answers, and looked in the Reference Manual instead.

Which shows "btst" to be the WORST choice. "btst #bit" takes 10 clocks. "asr" takes 6+2n, which with "n" being "1" takes 8 clocks. Which is the same time as "andi" as it has to fetch two words, taking 8 clocks.

If you can keep a data register spare to hold a "1" then you can use "and" in 4 clocks or "btst" in 6. The above assumes zero wait-state memory, which changes the numbers.

Tom

Tom Evans

Oct 27, 2015, 9:04:36 AM
There's also the tricky code that some versions of gcc generate in place of a bit test, which is detailed (together with some of its problems) here:

https://community.freescale.com/message/501384#501384

To summarise the above, it generates code like:

} else if (cc->status & cd_BUS_RWARN) {
4010903a: 0800 000c btst #12,%d0
4010903e: 6604 bnes 40109044 <comm_check_status+0x62>
} else if (cc->status & cd_OVERRUN) {
40109040: 44c0 movew %d0,%ccr
40109042: 6a02 bpls 40109046 <comm_check_status+0x64>

Note the weird word-saving trick in the last compare? It copies the data to the CCR and then tests the "N" bit.

You can do that to test the lowermost four bits in a register, which land in the CCR C, V, Z and N bits (bits 0 to 3 respectively). It saves one 16-bit fetch. That should save four clocks, except that this instruction takes TWELVE clocks to execute on the 68000 and 68010.
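As a rough illustration of which branch tests which bit after a move to the CCR, here is a small C model. The function names are made up for this sketch and only mirror the 68k branch mnemonics; the bit positions (C = bit 0, V = bit 1, Z = bit 2, N = bit 3) are the standard 68000 CCR layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* CCR flag positions on the 68000 family (bit 4 is X, not used here). */
enum { CCR_C = 1u << 0, CCR_V = 1u << 1, CCR_Z = 1u << 2, CCR_N = 1u << 3 };

/* Hypothetical model of "move.w dN,%ccr": only the low five bits matter. */
static inline uint8_t move_to_ccr(uint16_t value)
{
    return (uint8_t)(value & 0x1Fu);
}

/* Each helper mirrors one branch condition that looks at a single flag. */
static inline bool bcs(uint8_t ccr) { return (ccr & CCR_C) != 0; } /* bit 0 set */
static inline bool bvs(uint8_t ccr) { return (ccr & CCR_V) != 0; } /* bit 1 set */
static inline bool beq(uint8_t ccr) { return (ccr & CCR_Z) != 0; } /* bit 2 set */
static inline bool bmi(uint8_t ccr) { return (ccr & CCR_N) != 0; } /* bit 3 set */
```

In this model an odd value sets C, so the original "is it odd?" question would become a move to the CCR followed by bcs, though at twelve clocks that is no improvement on a 68010.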

On the ColdFire (the subject of the referenced post) there's good and bad news. The good news is that it only takes one clock. The bad news is that on the MCF53 series it corrupts the branch predict bit in the status register, and there's no known way to stop gcc from generating that code.

Tom


Mux

Oct 27, 2015, 4:08:35 PM
Complete cop-out but if you're looking for the msb AND it's in a register you can do a simple 'add' and check the carry flag. Ahh, assembly language... gotta love it..

On a completely different note, I'm reading 'I am error' about the NES hardware which is pretty awesome in that the author gets into a lot of detail and goes as far as digging into the (disassembled) code for SMB. Due to the little amount of memory the NES had they actually generated most of the levels in SMB with some really nifty bit packing. Good read in case anyone's interested..

-Mux

Bruce Mardle

Oct 28, 2015, 9:09:58 AM
Thanks, Mux and Tom. I think I'll go with...

On Monday, 26 October 2015 22:37:06 UTC, Tom Evans wrote:
> If you can keep a data register spare to hold a "1" then you can use "and" in 4 clocks or "btst" in 6. The above assumes zero wait-state memory, which changes the numbers.

... especially when I need to do it twice. (As in a 'memcpy'. I have to treat even source/even destination differently from odd src/odd dst and from 1 odd/1 even.)
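A minimal C sketch of that three-way split, in case it helps to see it written down. The enum and function names are invented for illustration; the only idea used is that XOR-ing the two addresses shows whether their low bits differ.

```c
#include <stdint.h>

/* Hypothetical names for the three cases described above. */
typedef enum { ALIGN_BOTH_EVEN, ALIGN_BOTH_ODD, ALIGN_MIXED } align_class;

static align_class classify_alignment(const void *dst, const void *src)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if ((d ^ s) & 1u)               /* low bits differ: one odd, one even */
        return ALIGN_MIXED;
    return (s & 1u) ? ALIGN_BOTH_ODD : ALIGN_BOTH_EVEN;
}
```

In the both-odd case a single leading byte copy makes both pointers even, after which the even/even path applies; in the mixed case no amount of leading bytes lines both pointers up at once.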

<ramble> I must get back into 68k programming. Haven't done any for months. Been writing a Z280 assembler. 68k is much saner! </ramble>

Charles Richmond

Oct 28, 2015, 6:57:00 PM
"Bruce Mardle" <marb...@yahoo.co.uk> wrote in message
news:dbd4c1f7-1b92-403c...@googlegroups.com...
68K assembly language is a *sweetheart*!!! The only assembly language I
might like better is the 6809 assembly.

--

numerist at aquaporin4 dot com

Tom Evans

Oct 28, 2015, 7:15:35 PM
On Thursday, October 29, 2015 at 12:09:58 AM UTC+11, Bruce Mardle wrote:
> Thanks, Mux and Tom. I think I'll go with...
>
> On Monday, 26 October 2015 22:37:06 UTC, Tom Evans wrote:
> ... especially when I need to do it twice. (As in a 'memcpy'.
> I have to treat even source/even destination differently from
> odd src/odd dst and from 1 odd/1 even.)

You're on a 68010, where you've got "loop mode". The following code from a project I worked on in 1991 should (see later) take advantage of this:

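|
| copy bytes, using movb,movw, or movl as appropriate.
| NB: a len of <= 0 is treated as = 0, ie: do nothing.
|
        .globl  _bcopy
_bcopy: movl    sp@(4),d0       | d0 = source address
        movl    d0,a0           | a0 = source pointer
        movl    d0,d1           | d1 collects the low bits to test
        movl    sp@(8),d0       | d0 = destination address
        movl    d0,a1           | a1 = destination pointer
        orl     d0,d1           | d1 = src | dst
        movl    sp@(12),d0      | d0 = length
        bles    6$              | length <= 0: nothing to do
        orl     d0,d1           | d1 = src | dst | len
        btst    #0,d1           | is any of the three odd?
        beqs    2$              | no: words or longs will do
        subql   #1,d0           | dbra counts down to -1
1$:     movb    a0@+,a1@+       | byte copy
        dbra    d0,1$
        rts

2$:     btst    #1,d0           | is the length a multiple of 4?
        beqs    4$              | yes: copy longs
        asrl    #1,d0           | no: convert length to words
        subql   #1,d0
3$:     movw    a0@+,a1@+       | word copy
        dbra    d0,3$
        rts

4$:     asrl    #2,d0           | convert length to longs
        subql   #1,d0
5$:     movl    a0@+,a1@+       | long copy
        dbra    d0,5$
6$:     rts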

Note: I say "should" because the Motorola 68000 User Manual is confusing and most likely dead wrong.

"APPENDIX A MC68010 LOOP MODE OPERATION" gives as an example of Loop Mode:

LOOP LEA SOURCE, A0 Load A Pointer To Source Data
LEA DEST, A1 Load A Pointer To Destination
MOVE.W #LENGTH, D0 Load The Counter Register
MOVE.W (A0);pl, (A1)+ Loop To Move The Block Of Data
DBEQ D0, LOOP Stop If Data Word Is Zero

Figure A-1. DBcc Loop Mode Program Example

I'm pretty sure ";pl" is meant to be a "+" in the above. So it is the classic block-move operation with the magic 68k auto-increment on the address registers.

Fine, except the next table in the book, "Table A-1. MC68010 Loop Mode Instructions" lists all the acceptable addressing mode combinations, and "(Ay)+ to (Ax)+" is NOT THERE. The table says the most used addressing mode isn't supported.

That has to be wrong because "Table 9-2. Move Byte and Word Instruction Execution Times" documents the timing for this most useful case.

Which is 14 clocks for looping "MOVE.W (A0)+, (A1)+" and 22 clocks for "MOVE.L (A0)+, (A1)+"

But if you ignore loop mode and simply unroll the copy loop by eight, then it takes (8 * 20 + 10) = 170 clocks while the loop-mode takes 176. Word mode is 106 for unrolled and 112 for loop mode. Loop mode is better if your memory has wait states though.

The big win is changing simple and dumb "move bytes" code to moving words and longs when it can, as you're doing.
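For anyone who would rather read that decision in C, here is a sketch of just the unit-selection step from the bcopy routine in this post. The function name copy_unit is made up, and it only reports the width (0, 1, 2 or 4 bytes) rather than doing the copy.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the move width the assembly routine would pick, or 0 for
   "do nothing". Mirrors the btst #0 / btst #1 tests in the listing. */
static unsigned copy_unit(const void *src, const void *dst, long len)
{
    uintptr_t bits;

    if (len <= 0)
        return 0;                   /* bles 6$ */
    bits = (uintptr_t)src | (uintptr_t)dst | (uintptr_t)len;
    if (bits & 1u)
        return 1;                   /* something is odd: bytes */
    if (len & 2)
        return 2;                   /* even, but not a multiple of 4: words */
    return 4;                       /* length is a multiple of 4: longs */
}
```

Note that the long case only checks that the addresses are even, not that they are multiples of four. That is fine on a 68010 but would be an assumption to revisit on a CPU that wants longwords on 4-byte boundaries.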

But the fastest way to copy memory is to design your system so you don't have to copy at all, but just copy/read it ONCE and then pass pointers around.

When you get into the RISC CPUs it gets really complicated. The fastest way to copy (external DDR) memory on even a middle of the range ColdFire chip is to copy 64 words (32 bits) from the external DDR to the internal SRAM, and then copy from there back to DDR. That keeps the caches happy and the memory controller "on page". And since it is RISC, all copies have to go through registers! So DDR3 --> Register --> SRAM --> Register --> DDR3.
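The shape of that staged copy can be sketched in C as below. This is only an outline under assumptions: staging is an ordinary static array standing in for on-chip SRAM (on real hardware it would have to be placed there, for example by the linker script), and the names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_WORDS 64  /* 64 x 32-bit words per burst, as described above */

/* Hypothetical stand-in for a buffer located in internal SRAM. */
static uint32_t staging[CHUNK_WORDS];

static void staged_copy(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words > 0) {
        size_t n = words < CHUNK_WORDS ? words : CHUNK_WORDS;
        size_t i;

        for (i = 0; i < n; i++)     /* external memory -> staging buffer */
            staging[i] = src[i];
        for (i = 0; i < n; i++)     /* staging buffer -> external memory */
            dst[i] = staging[i];

        src += n;
        dst += n;
        words -= n;
    }
}
```

Splitting each chunk into a read burst and a write burst is the point: the external memory sees a run of reads followed by a run of writes instead of alternating between the two.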

Tom

Bruce Mardle

Oct 30, 2015, 12:32:06 AM
On Wednesday, 28 October 2015 23:15:35 UTC, Tom Evans wrote:
> Which is 14 clocks for looping "MOVE.W (A0)+, (A1)+" and 22 clocks for
> "MOVE.L (A0)+, (A1)+"

I can vouch for that. It was one of the first things I tried on my 68010 :-)

> But if you ignore loop mode and simply unroll the copy loop by eight, then it
> takes (8 * 20 + 10) = 170 clocks while the loop-mode takes 176.

Good point! But then I'd need more-complicated code to deal with 'stragglers': the bytes left to copy after copying 32-byte chunks. Lots of design decisions to be made! At least the 68010 is happy to read/write longwords at addresses ending in 10b; that simplifies things a little.

Tom Evans

Oct 30, 2015, 11:18:13 PM
On Monday, October 20, 2014 at 7:55:36 PM UTC+11, Bruce Mardle wrote:
> Hi, all.
> What's the quickest way of testing whether the contents of an
> address register is odd (on a 68010)?

The absolutely fastest way is to ignore the problem completely and just go ahead with the memory copy. Then handle the unaligned cases in the exception routine.

That only works well if the overwhelming majority of them are aligned.

Otherwise, why worry about saving a couple of clock cycles in a function that spends 95+% of its time in a 20-clock loop moving memory around?

Tom

Tom Evans

Nov 1, 2015, 5:34:03 AM
> > But if you ignore loop mode and simply unroll the
> > copy loop by eight, then it takes (8 * 20 + 10) = 170 clocks
> > while the loop-mode takes 176.

Loop mode takes 22(2/2) * N clocks. Unrolling by "M" takes N * 20(3/2) + 10(2/0) * N / M clocks. The point is that unrolling by eight saves ONLY 6 clocks (about 3.5%), and no more than about 10% even for an "infinite unroll", so why bother? It makes a heap of difference (to unroll the loop) on the 68000 or CPU32, but not the 68010.
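The arithmetic is easy to check with a few lines of C. The clock counts are the ones quoted in this thread for long moves with zero wait states; the function names are made up, and unrolled_clocks assumes M divides N exactly.

```c
/* Clock counts quoted in the thread for MOVE.L (A0)+,(A1)+ on a 68010. */
enum {
    LOOP_MODE_CLOCKS = 22,  /* one pass of the move/DBRA pair in loop mode */
    MOVE_L_CLOCKS    = 20,  /* one straight-line MOVE.L (A0)+,(A1)+ */
    DBRA_CLOCKS      = 10   /* one DBRA at the end of an unrolled group */
};

/* Total clocks to move n longwords using loop mode. */
static unsigned long loop_mode_clocks(unsigned long n)
{
    return LOOP_MODE_CLOCKS * n;
}

/* Total clocks to move n longwords unrolled m at a time (m divides n). */
static unsigned long unrolled_clocks(unsigned long n, unsigned long m)
{
    return MOVE_L_CLOCKS * n + DBRA_CLOCKS * (n / m);
}
```

With n = 8 and m = 8 this gives 176 for loop mode and 170 unrolled, matching the figures quoted. As m grows the DBRA term shrinks towards nothing, so the saving per move approaches 2 clocks in 22.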

> Good point! But then I'd need more-complicated code to
> deal with 'stragglers':

Which would take a lot of clocks.

Have you ever heard of "Duff's Device"? It is a magic fix to the "straggler" problem. It is horribly ugly C code, even worse than "ternary abuse", but is completely legal.

https://en.wikipedia.org/wiki/Duff's_device#Original_version
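For reference, a memory-to-memory variant looks roughly like this. It is a sketch, not the original: Duff's version wrote every item to one fixed output register, whereas here the destination pointer advances too, and the name duff_copy and the zero-length guard are additions for this example.

```c
#include <stddef.h>

/* Byte-copy variant of Duff's device, unrolled eight times. The switch
   jumps into the middle of the do/while to handle the count % 8
   "stragglers" on the first pass. */
static void duff_copy(unsigned char *to, const unsigned char *from, size_t count)
{
    size_t passes;

    if (count == 0)
        return;                     /* the loop below assumes count > 0 */

    passes = (count + 7) / 8;       /* number of trips round the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

The case labels falling through into each other is deliberate, so a compiler may warn about implicit fallthrough here.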

Tom

Bruce Mardle

Nov 1, 2015, 7:16:45 AM
On Sunday, 1 November 2015 10:34:03 UTC, Tom Evans wrote:
> Have you ever heard of "Duff's Device"? It is a magic fix to the "straggler" problem. It is horribly ugly C code, even worse than "ternary abuse", but is completely legal.
>
> https://en.wikipedia.org/wiki/Duff's_device#Original_version

I hadn't, but the idea of doing a calculated jump into the loop for the first 0-7 (or 1-8) copies occurred to me about a day after my previous post. In my defence, I'm a big fan of structured programming so my brain revolts at such ideas :-) (When writing C, I try to avoid `continue` and `break`ing from loops, never mind `goto`.)

Anyway, thanks, everyone, for all the suggestions.

Mux

Nov 6, 2015, 5:23:09 PM
That looks... weird. I've always used 'reverse' jumptables for doing polygon fillers and what not. Basically got you the maximum amount of performance especially because you knew that you'd never be filling more than the screen width.

-Mux