Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

603 cache vs. 601 cache for 68K emulation. (Was: Apple Abandons the PowerPC 603)

87 views

Skip to first unread message

Bruce Hoult

unread,

Jun 5, 1994, 10:59:04 AM6/5/94

zst...@adobe.com (Zalman Stern) writes:
> > The 68k emulator code is written such that it likes to stick a lot of
> > things
> > in the cache at the same location.
>
> Once again, do you have any specifics? One hypothesis is that Apple's
> emulator uses a 512K "executable lookup table." Which provides an
> interesting case in terms of I cache locality in that it probably
> doesn't look anything at all like most programs. Of course there is
> pressure on the data cache both to hold the 68K code and the
> application's data.

I suspect that it carries a little more weight than mere "hypothesis",
although that's about all one can say it is based only on the official
documentation I have -- the only substantive information I can find in
what I've got (and I expect you have more) is on page 1-7 of Inside
Macintosh: PowerPC System Software where it says:

"The 68LC040 Emulator consists of two main parts, a main dispatch table
and a block of additional code called by entries in the main dispatch
table. The main dispatch table contains two native PowerPC instructions
for each recognised 680x0 operation code (or opcode). In cases where a
680x0 opcode can be handled in a single PowerPC instruction, the first
native instruction in the dispatch table is enough to complete the
requested operation. In most cases, however, the handling of the 680x0
opcode requires more than one PowerPC instruction. In that case the
first native instruction in the main dispatch table simply begins the
emulation process."

"The second native instruction in the emulator's main dispatch table is
usually a PC-relative branch into the block of additional code. The
additional code continues the emulation of the 680x0 opcode begun by
the first instruction."

OK, now that doesn't necessarily mean that the main dispatch table is
half a meg, with eight bytes each for all 65536 possible 16-bit opcodes,
but that's certainly one strong possibility.

Other possibilities include some pre-processing being done on each
opcode, with one or more bitfields pulled out of the opcode and
used to synthesise an index into the table -- you might be able to
have just one table entry for all the "move.l" instructions, for
instance.

However, let's work on the hypothesis I've bandied around before on
the net -- that it's a half-meg dispatch table.

Lets go look for this table. (I'm using a PowerMac 6100/60 with
24 MB RAM, virtual memory turned off).

A few seconds with MacsBug tells us that RomBase points to 0x40800000.
We think from the net that the ROM is probably 4MB in size. Quick
tests:

db 40800000 OK
db 407fffff Unable to access that address
db 40c00000 Unable to access that address
db 40bfffff OK

Looks like 4MB, all right.

Let's search for a likely looking dispatch table -- we could walk
through the ROM manually in MacsBug, looking for branch instructions,
but Apple has given us a perfectly servicable PPC disassembler in
Disassembler.h and Disassembler.o. so let's use it to look for groups
of opcodes starting with a "b", spaced eight bytes apart in memory.

Quick&dirty hack program:

----------
#include <Disassembler.h>
#include <stdio.h>

#define minmatch 100

void examine(FILE *f, unsigned long start, unsigned long end){
unsigned long *start1, len1=0, *start2, len2=0;
char mne[10];

unsigned long *p;
for (p=(unsigned long*)start; p<(unsigned long*)end; p+=2){
disassembler(p,0,Disassemble_PowerPC601,mne,0,0,0,0);
if (mne[0] == 'b'){
if (len1 == 0){
start1 = p;
len1 = 1;
} else ++len1;
} else {
if (len1 >= minmatch) fprintf(f,"At %x for %d\n",start1,len1);
len1 = 0;
}

disassembler(p+1,0,Disassemble_PowerPC601,mne,0,0,0,0);
if (mne[0] == 'b'){
if (len2 == 0){
start2 = p+1;
len2 = 1;
} else ++len2;
} else {
if (len2 >= minmatch) fprintf(f,"At %x for %d\n",start2,len2);
len2 = 0;
}
} // for

if (len1 >= minmatch) fprintf(f,"At %x for %d\n",start1,len1);
if (len2 >= minmatch) fprintf(f,"At %x for %d\n",start2,len2);
}

main(){
FILE *f = fopen("lotsOfBranches","w");
examine(f, 0x40800000, 0x40C00000);
fclose(f);
return 0;
}
----------

I tried this in Symantec C++ 7.02 (good for hacking, not that great
for production work :-( ) -- copying the Disassembler files from the
MPW distribution. Ok, it's emulated (slow), and we could do things
like make sure the first six bits of the PPC instruction are 010000
or 010010 or 010011 before disassembling (slow), and we could AND the
return value from "disassembler" against Disassembler_Branch (0x0080U)
instrad of seeing of the first letter is a 'b', but hey it only takes
two minutes to run, so who's worrying :-)

Anyway, the results:

At 408035d0 for 111
At 408035d4 for 111
At 40804014 for 119
At 40804010 for 121
At 40804494 for 126
At 40804490 for 127
At 40b20004 for 256
At 40b2b674 for 122
At 40b80004 for 64528
At 40bfe08c for 1007

Bingo!!!

Starting at 0x40b80000 we have 65536 groups of eight bytes, with the
last four bytes of each group being a branch, except for one single
solitary exception at 0x40bfe084.

That sure *looks* like a half-meg executable dispatch table so far.

If anyone wants to examine it, they could try running the following
(once again extremely crude) program:

----------
#include <DisAsmLookup.h>
#include <Disassembler.h>
#include <stdio.h>

void disasmjumptable(FILE *f, unsigned long start){
unsigned long *p, i;
p = (unsigned long*) start;

char mne1[10], oper1[20], mne2[10], oper2[20], mne3[10], oper3[20], comm3[20];
short bytesUsed, ins68K[10] = {0};

for(i=0; i<=65535; ++i){
disassembler(&p[i+i ],0,Disassemble_PowerPC601,mne1,oper1,0,0,0);
disassembler(&p[i+i+1],0,Disassemble_PowerPC601,mne2,oper2,0,0,0);

ins68K[0] = (short)i;
Disassembler(0,&bytesUsed,(char*)ins68K,mne3,oper3,comm3,0);

fprintf(f,
"%x: %s %s; %s %s # %s %s %s\n",
i, mne1, oper1, mne2, oper2, mne3+1, oper3+1, comm3+1
);
}
}

main(){
FILE *f = fopen("jumptable","w");
disasmjumptable(f, 0x40b80000);
fclose(f);
return 0;
}
----------

This produces nearly 4 MB of text file (taking three or four minutes
in emulation on my 6100), so I'm not going to post it here, but I'll
list some small parts of it as examples. For those who didn't read
the C code above, the format (assuming this really is a half meg
executable jump table) is:

68k opcode: PPC instruction #1; PPC instruction #2 # 68k mnemonic

7c00: addco. r14,r0,r0; b $-0x97F80 # MOVEQ #$00,D6
7c01: addic. r14,r0,0x0001; b $-0x97F4C # MOVEQ #$01,D6
7c02: addic. r14,r0,0x0002; b $-0x97F54 # MOVEQ #$02,D6
7c03: addic. r14,r0,0x0003; b $-0x97F5C # MOVEQ #$03,D6
7c04: addic. r14,r0,0x0004; b $-0x97F64 # MOVEQ #$04,D6
7c05: addic. r14,r0,0x0005; b $-0x97F6C # MOVEQ #$05,D6
7c06: addic. r14,r0,0x0006; b $-0x97F74 # MOVEQ #$06,D6
7c07: addic. r14,r0,0x0007; b $-0x97F7C # MOVEQ #$07,D6
7c08: addic. r14,r0,0x0008; b $-0x97F84 # MOVEQ #$08,D6

Even this sample pretty much proves the hypothesis. This is *exactly*
what we would expect to see, assuming that r14 is being used to hold
the value of the 68K register D6.

In this case the first PPC instruction is enough to do all the work,
and the second always branches back to the same address (each one is
eight bytes higher in memory than the last, remember, and so needs a
bigger offset).

49c0: extsb. r8,r8; b $-0x7ED44 # EXTB.L D0
49c1: extsb. r9,r9; b $-0x7ED4C # EXTB.L D1
49c2: extsb. r10,r10; b $-0x7ED54 # EXTB.L D2
49c3: extsb. r11,r11; b $-0x7ED5C # EXTB.L D3
49c4: extsb. r12,r12; b $-0x7ED64 # EXTB.L D4
49c5: extsb. r13,r13; b $-0x7ED6C # EXTB.L D5
49c6: extsb. r14,r14; b $-0x7ED74 # EXTB.L D6
49c7: extsb. r15,r15; b $-0x7ED7C # EXTB.L D7

This is also *exactly* what the hypothesis would predict. The
PPC extsb is an exact match for the 68K EXTB.L. This example
confirms the use of r14 for D6, and shows that the rest of the
Dn register mappings are what you would guess, given where D6
is.

49e8: add r20,r16,r27; b $-0x7EEC4 # LEA $0000(A0),A4
49e9: add r20,r17,r27; b $-0x7EECC # LEA $0000(A1),A4
49ea: add r20,r18,r27; b $-0x7EED4 # LEA $0000(A2),A4
49eb: add r20,r19,r27; b $-0x7EEDC # LEA $0000(A3),A4
49ec: add r20,r20,r27; b $-0x7EEE4 # LEA $0000(A4),A4
49ed: add r20,r21,r27; b $-0x7EEEC # LEA $0000(A5),A4
49ee: add r20,r22,r27; b $-0x7EEF4 # LEA $0000(A6),A4
49ef: add r20,SP,r27; b $-0x7EEFC # LEA $0000(A7),A4

Here we see that the 68K An registers are stored right after the
Dn, in the PPC's r16 through r22, with the 68K stack pointer being
stored elsewhere -- in the PPC stack pointer register (r1), to be
exact.

What's in the PPC's r27? My guess is that the dispatch routine
loads the next word (two bytes) after the opcode into r27 before
jumping into the table -- not every instruction will use it, so it's
sometimes wasted time, but probably enough will then become single-
instruction PPC routines to make it a win.

Of course, those 68k instructions that do use the next word will have
to branch back to a different entry point in the dispatcher, but
that's trivial to arrange.

BTW, all these instructions are listed as LEA $0000(An) because the
next word was always a zero when my program disassembled the 68K
instructions -- that $0000 could in fact be any 16-bit value in a
real program.

2000: addco. r8,r8,r0; b $-0x69F80 # MOVE.L D0,D0
2001: addco. r8,r9,r0; b $-0x69F88 # MOVE.L D1,D0
2002: addco. r8,r10,r0; b $-0x69F90 # MOVE.L D2,D0
2003: addco. r8,r11,r0; b $-0x69F98 # MOVE.L D3,D0
2004: addco. r8,r12,r0; b $-0x69FA0 # MOVE.L D4,D0
2005: addco. r8,r13,r0; b $-0x69FA8 # MOVE.L D5,D0
2006: addco. r8,r14,r0; b $-0x69FB0 # MOVE.L D6,D0
2007: addco. r8,r15,r0; b $-0x69FB8 # MOVE.L D7,D0
2008: addco. r8,r16,r0; b $-0x69FC0 # MOVE.L A0,D0
2009: addco. r8,r17,r0; b $-0x69FC8 # MOVE.L A1,D0
200a: addco. r8,r18,r0; b $-0x69FD0 # MOVE.L A2,D0
200b: addco. r8,r19,r0; b $-0x69FD8 # MOVE.L A3,D0
200c: addco. r8,r20,r0; b $-0x69FE0 # MOVE.L A4,D0
200d: addco. r8,r21,r0; b $-0x69FE8 # MOVE.L A5,D0
200e: addco. r8,r22,r0; b $-0x69FF0 # MOVE.L A6,D0
200f: addco. r8,SP,r0; b $-0x69FF8 # MOVE.L A7,D0

Notice the "." on the addco. Moves to a data register set the condition
codes, so we use an add that sets the condition codes.

2040: or r16,r8,r8; b $-0x6A180 # MOVEA.L D0,A0
2041: or r16,r9,r9; b $-0x6A188 # MOVEA.L D1,A0
2042: or r16,r10,r10; b $-0x6A190 # MOVEA.L D2,A0
2043: or r16,r11,r11; b $-0x6A198 # MOVEA.L D3,A0
2044: or r16,r12,r12; b $-0x6A1A0 # MOVEA.L D4,A0
2045: or r16,r13,r13; b $-0x6A1A8 # MOVEA.L D5,A0
2046: or r16,r14,r14; b $-0x6A1B0 # MOVEA.L D6,A0
2047: or r16,r15,r15; b $-0x6A1B8 # MOVEA.L D7,A0
2048: rlwimi r29,r27,0x03,0x0D,0xb; b $-0x6A1BC # MOVEA.L A0,A0
2049: or r16,r17,r17; b $-0x6A1C8 # MOVEA.L A1,A0
204a: or r16,r18,r18; b $-0x6A1D0 # MOVEA.L A2,A0
204b: or r16,r19,r19; b $-0x6A1D8 # MOVEA.L A3,A0
204c: or r16,r20,r20; b $-0x6A1E0 # MOVEA.L A4,A0
204d: or r16,r21,r21; b $-0x6A1E8 # MOVEA.L A5,A0
204e: or r16,r22,r22; b $-0x6A1F0 # MOVEA.L A6,A0
204f: or r16,SP,SP; b $-0x6A1F8 # MOVEA.L A7,A0

... but moves to an address register don't set the condition codes, so
we'll use an OR the source with itself to do these register moves (which
is very common in native PPC code).

I'm afraid I've got *no* idea what that rlwimi is try to
achieve, though. :-(

Well, I think that's just about enough to conclusively prove that we
are in fact dealing with a PowerMac 68LC040 emulator that works by
dispatching each one of the 65536 possible opcodes into a table of
65536 two-instruction PPC sequences, taking a total of 512K of ROM
space.

And so all the speculation about thrashing of instruction caches is
very probably true.

I'll just leave you with one last example from this table...

4e71: eieio ; b $-0x81308 # NOP

Hey!! Old MacDonald has a use, after all!

-- Bruce

DangardAce

unread,

Jun 5, 1994, 11:47:03 PM6/5/94

In article <28537...@hoult.actrix.gen.nz>,
Br...@hoult.actrix.gen.nz (Bruce Hoult) writes:
>>A whole lotta technical stuff!<<

But the gist of this is, if you want 68K code to run at the fastest
possible speed, you must buy
a Level 2 cache card that's at least 768K (512K for the jump table,
256K for assorted code.)
Am I correct in postulating this? And just how much of a speed up
would I experience with the
gobs of cache memory?

Danga...@aol.com

Steven O. Miller

unread,

Jun 6, 1994, 2:09:34 AM6/6/94

In article <28537...@hoult.actrix.gen.nz>
Br...@hoult.actrix.gen.nz (Bruce Hoult) writes:

> I'll just leave you with one last example from this table...
>
>
> 4e71: eieio ; b $-0x81308 # NOP
>
> Hey!! Old MacDonald has a use, after all!

Good work Bruce! Thanks for shedding some light on the subject. Using
a 512K lookup makes the PPC emulator relatively (I'm saying relatively,
I mean REALATIVELY boy!) easy to write. Remember that three years ago,
Apple demonstated a Mac running on a Moto 88000 RISC chip to
demonstrate that Apple can move to RISC. They wouldn't have been able
to do this without using a generally simple technique. I believe that
Insignia is using a scheme that translates and caches portions of the
source instruction stream. An article in BYTE discusses a similar IBM
technique of translating loops and caching them as translated native
code. The result (if memory serves me correctly) is that you can get
up to double the performance compared to simple emulation. Simple
emulation is about a 10:1 slowdown and the IBM technique is around 5:1
(from memory) and can be better for many loops.

I'm curious what the main loop of Apple's emulator looks like, we can
see about how many PPC instructions it takes to emulate some common 68K
instructions and see if the approx. 10:1 rule holds.

One thought that I had is if Apple provided a way to use a special
illegal 68K instruction to cause the emulator to quickly switch to
directly execute (ie. jump to) the next 68K instruction which would
actuallybe PPC instructions instead of 68K instructions. What this is
useful for is for a utility that does a partial translation of the CODE
segments of an already existing 68K application. This utility would
translate loops in CODE segments that it recognizes into PPC
instructions that either directly perform what the 68K code did or are
simply unrolled versions of the PPC emulator instructions for each 68K
opcode. To get back into the emulator (at the bottom of the PPC
translated loop), a simple jump would be made directly back into the
68K emulator. The direct jumping into and out of the emulator
bypassing the overhead of the general switching into and out of the
emulator (we're not really every violating the emulator's context and
register usage).

This utility wouldn't be too difficult to write to recognize some
common 68K looping constructs, yet would produce a modified version of
the application that runs at much closer to native speeds (except for
floating point of course).

Steven Miller

Bruce Hoult

unread,

Jun 7, 1994, 12:00:22 AM6/7/94

danga...@aol.com (DangardAce) writes:
> Br...@hoult.actrix.gen.nz (Bruce Hoult) writes:
> >>A whole lotta technical stuff!<<
>
> But the gist of this is, if you want 68K code to run at the fastest
> possible speed, you must buy
> a Level 2 cache card that's at least 768K (512K for the jump table,
> 256K for assorted code.)
> Am I correct in postulating this? And just how much of a speed up
> would I experience with the
> gobs of cache memory?

I doubt that more than 256K would make a huge difference compared to
256K. No program is going to use all or even most 68k instructions,
so 256K would probably get you 95%+ hit rates on the 2nd level cache,
even on the huge emulator code.

-- Bruce

Philip Machanick

unread,

Jun 7, 1994, 7:17:10 AM6/7/94

In article <2su67n$o...@search01.news.aol.com>, danga...@aol.com
(DangardAce) wrote:

I doubt you'd need 768K. There's something called locality which says you
tend to use nearby pieces of memory often, or reuse the same memory soon.
The emulator to some extent breaks these assumptions but it' probably also
true that some instructions are executed rarely if ever. If anyone has
figures to back this up it would be interesting, but I seem to recall that
a 256K cache card gives a much bigger gain over none than a 512K card gives
over 256K.
--
Philip Machanick phi...@cs.wits.ac.za
Department of Computer Science, University of the Witwatersrand
2050 Wits, South Africa
phone 27(11)716-3309 fax 27(11)339-7965

Tim Olson

unread,

Jun 15, 1994, 10:18:44 PM6/15/94

DangardAce writes

| In article <28537...@hoult.actrix.gen.nz>,
| Br...@hoult.actrix.gen.nz (Bruce Hoult) writes:
| >>A whole lotta technical stuff!<<
|
| But the gist of this is, if you want 68K code to run at the fastest
| possible speed, you must buy
| a Level 2 cache card that's at least 768K (512K for the jump table,
| 256K for assorted code.)
| Am I correct in postulating this?

Not necessarily. Unlike most code, emulators of this type are sensitive to the
number of unique (68k) instructions, not the total number of instructions.
--
-- Tim Olson
Apple Computer Inc. / Somerset

0 new messages