Emulating a 6502 with an ARM?

Lyndon Fletcher

Jul 30, 2001, 12:50:50 PM
Does anyone have any experience of how quickly an emulator written in
assembly can execute 6502 machine code on an ARM? I am interested only
in processor emulation, not in machine emulators like the ones for the
BBC. I would imagine that it would be quite fast?

Lyndon

Ian Bannister

Jul 30, 2001, 4:25:03 PM
In article <3b658fa5...@news.ericsson.se>, uab...@uab.ericsson.se
If you're doing the bare minimum then yes.
I did a raw emulator which had 4 instructions in the decode loop including
a counter for the number of instructions to do. The execution code was
variable in length as you would expect eg:-

LDA #0       -  7 instructions
LDA $0       -  8      "
LDA $0000    - 11      "
LDA $0000,X  - 12      "
LDA ($00),Y  - 14      "
INX          -  6      "

The code is roughly half memory accessing / branching and the rest group 1
instructions. It was originally for ARM2 where this mattered more than for
cached processors. I'm not sure of the timings on later processors such as
SA but this might give you a guide.


--
|-*- Ian Bannister


Matthias Seifert

Jul 30, 2001, 4:43:05 PM

Well, my ARC64 (a C64 emulator) reaches 3441% of the speed of a real C64
on my RiscPC with an SA@287MHz (and 485% with an ARM610@40MHz). As a pure
6502 emulator (e.g. without interrupts) it surely could be about twice as
fast.

--
_ _ | Acorn RiscPC, StrongARM @ 287 MHz,
| | | _, _|__|_ |) ' _, , | 258 MB RAM, >75 GB HD, RISC OS 4.02
| | | / | | | |/\ | / | / \ | -----------------------------------
| | |_/\/|_/|_/|_/| |/|/\/|_/ \/ | http://www.deutschlandwetter.de

druck

Jul 30, 2001, 5:18:33 PM

An emulator can manage ratios of between 14:2 and 24:3 ARM cycles to 6502
cycles, i.e. roughly 7 to 8 ARM cycles per 6502 cycle. This means a 233MHz
StrongARM could emulate a 6502 running at around 30MHz.

At those ratios an 8MHz ARM could not manage to emulate a 2MHz 6502, so I
wrote a cross assembler whose output achieves between 4:2 and 12:3 cycles,
and was able to run code at nearly twice the speed of a BBC B on the A310.
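
As a rough check of those ratios, here is a minimal Python sketch. It assumes each ratio reads as "ARM cycles : 6502 cycles" for one instruction and that throughput scales linearly with the ARM clock. The function name `emulated_mhz` is made up for illustration.

```python
def emulated_mhz(arm_mhz, arm_cycles, cycles_6502):
    """Effective 6502 clock when `arm_cycles` ARM cycles emulate
    `cycles_6502` cycles of 6502 work (assumed reading of the ratio)."""
    return arm_mhz * cycles_6502 / arm_cycles


# Interpreting emulator on a 233MHz StrongARM, ratios 14:2 and 24:3
print(emulated_mhz(233, 14, 2))
print(emulated_mhz(233, 24, 3))

# Same emulator on an 8MHz ARM: below the 2MHz of a BBC B
print(emulated_mhz(8, 14, 2))
print(emulated_mhz(8, 24, 3))

# Translated ("cross assembled") code on an 8MHz ARM, ratios 4:2 and 12:3
print(emulated_mhz(8, 4, 2))
print(emulated_mhz(8, 12, 3))
```

Under these assumptions the interpreter lands at roughly 29 to 33MHz on the StrongARM and just over 1MHz on the 8MHz ARM, while the translated code lands between 2 and 4MHz, which fits the figures quoted above.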

This enabled me to port Superior Software's Speech synthesizer, which required
the 6502 sample generator to be run in real time, with simulation of writing to
the sound hardware with cycle-for-cycle accuracy. It did this very well, but
unfortunately also reproduced the terrible quality 100% accurately, as the
Beeb's sound system was never meant to play samples. It only had tone
generators, but the CC system reprogrammed the 4-bit volume level on the tone
generator at a rate of a few kHz to give a poor quality sample. The real ARM
version used 8-bit sample generation at a much higher rate and was noticeably
better, but just didn't have the authentic feel!

---druck

--
_ _ _ _ _ _ _ _
|_)|(_ / ` / \(_ ' ) / \ / \ /| The Prestige RISC OS Show, 20-21 October
| \| _)\_, \_/ _) /_ \_/ \_/ _|_ Blue Mountain, taste the difference

Lyndon Fletcher

Jul 31, 2001, 4:30:43 AM
Some interesting insights however I am trying to relate comments
like:-

>Well, my ARC64 (a C64 emulator) reaches 3441% of the speed of a real C64
>on my RiscPC with an SA@287MHz (and 485% with an ARM610@40MHz).

with

>An emulator can manage between 14:2 and 24:3 ARM cycles for 6502 cycles.
>This means a 233MHz StrongARM could emulate a 6502 running at around 30MHz.

>An 8MHz ARM could not manage to emulate a 2MHz 6502,

Are we talking about the same conditions here? I think it may be
better to give some more application specific data.

I have a single-board computer design using a 2MHz 6502. It needs to be
brought up to date and have a bit more power added, and at the moment
the only supplier of higher performance 6502 compatible CPUs is WDC. I
can get a 20MHz 65816 8/16 bit 6502 clone from them, but buying small
numbers of units is hard and there is no second source.

I want to redesign the board with a more modern CPU, but that CPU would
have to be able to run 6502 legacy code at reasonable speeds, at least
initially. Unlike emulation of, say, a Beeb on a RiscPC, the support
hardware would remain more or less the same and it will be the new
processor's ability to emulate the 6502 that will be important.

With that qualification does anyone have a better idea of the kind of
performance that could be expected??

Lyndon

Dennis Ranke

Jul 31, 2001, 7:08:35 AM
In message <3b666997...@news.ericsson.se>
uab...@uab.ericsson.se (Lyndon Fletcher) wrote:

A basic 6502 emulator that I have written (simple flat memory, no
interrupts) reports 40-50MHz on my StrongARM, which isn't far off
druck's figures. Please note, though, that I'm not entirely sure whether
this output of my emu is correct, as I haven't touched it in a long
time.
It is also possible that the emulated code gets stuck in a short loop
and therefore only a few instructions are measured...

--
exoticorn/icebird

Matthias Seifert

Jul 31, 2001, 5:28:06 AM

> with

Well, as long as your emulator has to handle interrupts, the figures I
gave above still apply (as I was talking of pure processor performance
anyway) with my code. If interrupts don't have to be handled, I can remove
several ARM instructions from every emulated 6502 instruction, thus giving
even higher speed of the emulation.

And given that the 6510 of the C64 was clocked at ~1 MHz, both comments
above give almost the identical result: A SA@233MHz will be able to
emulate a 6502 (or 6510) at ~30 MHz.
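
The comparison can be spelled out with a little arithmetic. This is a minimal Python sketch that assumes emulation speed scales linearly with the ARM clock (it ignores cache and memory effects) and takes the C64 clock as exactly 1 MHz rather than "~1 MHz". The names `C64_CLOCK_MHZ` and `scaled_speed_mhz` are hypothetical.

```python
C64_CLOCK_MHZ = 1.0  # assumption: the post only says "~1 MHz"


def scaled_speed_mhz(percent_of_c64, measured_arm_mhz, target_arm_mhz):
    """Scale a measured '% of real C64 speed' figure to another ARM clock,
    assuming speed is proportional to clock frequency."""
    emulated_mhz = percent_of_c64 / 100.0 * C64_CLOCK_MHZ
    return emulated_mhz * target_arm_mhz / measured_arm_mhz


# 3441% measured on an SA@287MHz, rescaled to a 233MHz part
print(round(scaled_speed_mhz(3441, 287, 233), 1))
```

On these assumptions the ARC64 figure rescales to a little under 28 MHz at 233 MHz, close to the "around 30MHz" derived from the cycle ratios.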

So I guess (I don't know it as I don't know the timings of a 65816) that
you will need a StrongARM (or any other ARM variant with ~200 MHz) to get
about the same 6502 performance as with that 65816.

OTOH it may well be possible to replace part of the code by "native" ARM
code and speed things up dramatically. :-)

Or you may just use a program which translates the 6502 code directly to
ARM code (as druck described with his "cross assembler" solution) which
would definitely give much better performance than any emulator.

Alex Macfarlane Smith

Jul 31, 2001, 5:43:21 AM
In message <4aa2a1210...@t-online.de>
Matthias Seifert <M.Se...@t-online.de> wrote:

[snip]


> Well, as long as your emulator has to handle interrupts, the figures I
> gave above still apply (as I was talking of pure processor performance
> anyway) with my code. If interrupts don't have to be handled, I can remove
> several ARM instructions from every emulated 6502 instruction, thus giving
> even higher speed of the emulation.
>
> And given that the 6510 of the C64 was clocked at ~1 MHz, both comments
> above give almost the identical result: A SA@233MHz will be able to
> emulate a 6502 (or 6510) at ~30 MHz.
>

Wasn't Acorn's 6502 Emulator basically able to emulate a 6502 at full speed
on an 8MHz machine?

Alex.
--
E-mail: archi...@gmx.co.uk
WWW: http://www.toth.org.uk/~aardvark/
ICQ: 29035638

Matthias Seifert

Jul 31, 2001, 8:47:54 AM
On 31 Jul, Alex Macfarlane Smith <archi...@gmx.co.uk> wrote:
> In message <4aa2a1210...@t-online.de>
> Matthias Seifert <M.Se...@t-online.de> wrote:

> [snip]
> > Well, as long as your emulator has to handle interrupts, the figures I
> > gave above still apply (as I was talking of pure processor performance
> > anyway) with my code. If interrupts don't have to be handled, I can
> > remove several ARM instructions from every emulated 6502 instruction,
> > thus giving even higher speed of the emulation.
> >
> > And given that the 6510 of the C64 was clocked at ~1 MHz, both comments
> > above give almost the identical result: A SA@233MHz will be able to
> > emulate a 6502 (or 6510) at ~30 MHz.
> >
> Wasn't Acorn's 6502 Emulator basically able to emulate a 6502 at full
> speed on an 8Mhz machine?

I don't know.

Mine reaches 76% of the C64 with an A3000 (i.e. ARM2@8MHz). Without
support for interrupts it should be able to catch up with the 6510 of the
C64.

druck

Jul 31, 2001, 5:57:32 PM
On 31 Jul 2001 uab...@uab.ericsson.se (Lyndon Fletcher) wrote:

> Some interesting insights however I am trying to relate comments
> like:-
>
> >Well, my ARC64 (a C64 emulator) reaches 3441% of the speed of a real C64
> >on my RiscPC with an SA@287MHz (and 485% with an ARM610@40MHz).
>
> with
>
> >An emulator can manage between 14:2 and 24:3 ARM cycles for 6502 cycles.
> >This means a 233MHz StrongARM could emulate a 6502 running at around 30MHz.
>
> >An 8MHz ARM could not manage to emulate a 2MHz 6502,
>
> Are we talking about the same conditions here? I think it may be
> better to give some more application specific data.

It depends on the 6502 instruction mix: 6502 instructions vary in length and
cycle count, and so do the ARM instruction sequences needed to emulate them.
For example the 6502 in BCD mode takes far more ARM cycles to emulate than in
normal binary mode.



> I have a singleboard computer design using a 2Mhz 6502. It needs to be
> brought up to date and have a bit more power added and at the moment
> the only supplier of higher performance 6502 compatable CPs is WDC. I
> can get a 20Mhz 65816 8/16 bit 6502 clone from them but buying small
> numbers of units are hard and there is no second source.

Any ARM 7 series or later will be capable of emulating the 6502 far faster
than real time. To achieve 65816 speed, a StrongARM running at around 150MHz
will be necessary.

> I want to redesign the board with a more modern CPU but that CP would
> have to be able to run 6502 legacy code at reasonable speeds at least
> initially. Unlike emulation of say a Beeb on a RiscPC the support
> hardware would remain more or less the same and it will be the new
> processors ability to emulate the 6502 that will be important.

With this sort of usage it is often more important to emulate the 6502 no
faster and no slower, as such slow processors often used software delay loops
rather than external counter timers. In this case you need a cycle accurate
emulator.

My port of the Speech code did this. In addition to generating the ARM
instructions to emulate the 6502 ones, a count of the number of 6502 clock
ticks these would have taken on the real processor was kept. When the time
critical piece of hardware was written to (in this case the sound chip) a
comparison was made between the emulated 6502 clock ticks and a real 2MHz
timer. A delay could then be added to ensure the access occurred at precisely
the same time as the 6502 would have done, regardless of the speed of
emulation.
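
The idea can be sketched in a few lines of Python. This is only an illustration of the technique described above, not the Speech port's actual code: the names `delay_needed` and `Throttle` are invented here, and the host clock and sleep function are injected so the logic can be exercised without real hardware.

```python
import time

CLOCK_HZ = 2_000_000  # the real 2MHz reference mentioned above


def delay_needed(emulated_cycles, elapsed_seconds, clock_hz=CLOCK_HZ):
    """Seconds to wait so that `emulated_cycles` 6502 ticks line up with
    real time; zero if the emulation is already running late."""
    target_seconds = emulated_cycles / clock_hz
    return max(0.0, target_seconds - elapsed_seconds)


class Throttle:
    """Counts emulated 6502 ticks and stalls before timed hardware writes."""

    def __init__(self, clock_hz=CLOCK_HZ, now=time.perf_counter, sleep=time.sleep):
        self.clock_hz = clock_hz
        self.now = now
        self.sleep = sleep
        self.start = now()
        self.cycles = 0

    def tick(self, cycles):
        # Called after each emulated instruction with its 6502 cycle count.
        self.cycles += cycles

    def sync(self):
        # Called just before a time-critical access (e.g. a sound chip write).
        wait = delay_needed(self.cycles, self.now() - self.start, self.clock_hz)
        if wait > 0:
            self.sleep(wait)
        return wait
```

Only the accesses that matter call `sync()`, so the emulator runs flat out in between and pays for the delay just where timing is observable.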

Torben AEgidius Mogensen

Aug 1, 2001, 5:39:42 AM
uab...@uab.ericsson.se (Lyndon Fletcher) writes:

You can get arbitrarily close to recompilation speed in an emulator
depending on how much work you are willing to put into it (JIT
compilation etc.) and to what extent you will have to emulate extra
hardware (e.g. interrupts).

A simple way of getting reasonable speed is to use a branch table: You
use the instruction as index into a table that branches to code for
handling each instruction. This gives a fairly low decode overhead,
something like:

    LDRB instr,[6502PC],#1
    LDR  pc,[table,instr,LSL #2]

and then ending each emulation code sequence with a branch to the
emulation main loop. If interrupts are to be handled, the emulation
main loop has to check for this.
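
For readers who prefer a high-level view, the same fetch-and-dispatch structure can be sketched in Python. This is a toy under stated assumptions: only three opcodes are handled, flags and interrupts are ignored, BRK is simply treated as "stop", and the helper names (`make_cpu`, `run`, `TABLE`) are made up for illustration.

```python
def make_cpu(program):
    # Flat memory holding only the program; pc is an index into it.
    return {"pc": 0, "a": 0, "x": 0, "mem": bytearray(program), "halted": False}


def op_lda_imm(cpu):          # 0xA9: LDA #nn (flags not modelled here)
    cpu["a"] = cpu["mem"][cpu["pc"]]
    cpu["pc"] += 1


def op_inx(cpu):              # 0xE8: INX
    cpu["x"] = (cpu["x"] + 1) & 0xFF


def op_brk(cpu):              # 0x00: BRK, simplified to "halt"
    cpu["halted"] = True


def op_unimplemented(cpu):
    raise NotImplementedError("opcode not handled in this sketch")


# One entry per opcode byte, like the table indexed by `instr` above.
TABLE = [op_unimplemented] * 256
TABLE[0xA9] = op_lda_imm
TABLE[0xE8] = op_inx
TABLE[0x00] = op_brk


def run(cpu):
    while not cpu["halted"]:
        opcode = cpu["mem"][cpu["pc"]]   # LDRB instr,[6502PC],#1
        cpu["pc"] += 1
        TABLE[opcode](cpu)               # LDR pc,[table,instr,LSL #2]
    return cpu


cpu = run(make_cpu([0xA9, 0x2A, 0xE8, 0xE8, 0x00]))
print(hex(cpu["a"]), cpu["x"])
```

The `while` loop plays the role of the emulation main loop that each handler branches back to. An interrupt check, if needed, would sit at the top of that loop.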

You can save a bit of time (a branch per instruction) by ending each
emulation sequence with a copy of the emulation main loop instead of
branching to a single copy. If the main loop is more than a few
instructions, this may increase I-cache misses, though.

You can also replace the LDR with an ADD (avoiding the table load) if you
put the emulation code for instruction N at address base+N*2^k:

LDRB instr,[6502PC],#1
ADD pc,basereg,instr,LSL#k

This is somewhat wasteful of space, though.

If you need to handle BCD arithmetic, you can do this in several ways:

1) Check the flag every time you emulate addition instructions.

2) make instructions that change the decimal flag modify the jump
table.

The latter, clearly, is much faster (unless you constantly switch back
and forth between modes). It gets a bit more complicated if you use
the trick where you replace the indexed jump by addition: You have to
have two copies of _all_ emulation codes, so modifying the decimal
flag modifies the base register instead of a few table entries.
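
A small Python sketch of the second approach, in its "two complete tables" form: SED and CLD swap which table is active, so ADC itself never tests the decimal flag. Everything here is illustrative. The class and function names are invented, only ADC #nn, SED and CLD are present, other flags are ignored, and the decimal add assumes both operands are valid BCD.

```python
def adc_binary(a, m, carry):
    total = a + m + carry
    return total & 0xFF, total >> 8


def adc_decimal(a, m, carry):
    # Assumes a and m are valid BCD (each nibble 0..9).
    lo = (a & 0x0F) + (m & 0x0F) + carry
    hi = (a >> 4) + (m >> 4)
    if lo > 9:
        lo -= 10
        hi += 1
    carry_out = 0
    if hi > 9:
        hi -= 10
        carry_out = 1
    return (hi << 4) | lo, carry_out


def _adc_bin(cpu, operand):
    cpu.a, cpu.carry = adc_binary(cpu.a, operand, cpu.carry)


def _adc_dec(cpu, operand):
    cpu.a, cpu.carry = adc_decimal(cpu.a, operand, cpu.carry)


def _sed(cpu, _operand):
    cpu.table = DECIMAL_TABLE   # like changing the base register


def _cld(cpu, _operand):
    cpu.table = BINARY_TABLE


# 0x69 = ADC #nn, 0xF8 = SED, 0xD8 = CLD
BINARY_TABLE = {0x69: _adc_bin, 0xF8: _sed, 0xD8: _cld}
DECIMAL_TABLE = {0x69: _adc_dec, 0xF8: _sed, 0xD8: _cld}


class Cpu:
    def __init__(self):
        self.a = 0
        self.carry = 0
        self.table = BINARY_TABLE

    def step(self, opcode, operand=0):
        self.table[opcode](self, operand)
```

For example, with A = $19 and carry clear, `step(0x69, 0x28)` leaves $41 in binary mode, but $47 after a `step(0xF8)` has switched tables.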

Arithmetic flags are, IMO, best emulated by arithmetic flags. To do
this, all bytes that emulate 6502-values are shifted to occupy the
most significant byte of a register before addition or subtraction.
This way, resulting N, Z, C and V flags will (as far as I can judge
without consulting manuals) be set correctly. However, incoming C
flags and C flags in shift-instructions will have to be treated
differently.
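
The claim about N, Z, C and V can be checked exhaustively without consulting manuals. The Python sketch below models a 32-bit ARM ADDS (the model `arm_adds` and the helper names are written for this illustration) and compares it against a plain 8-bit add for every pair of operands, with no incoming carry.

```python
MASK32 = 0xFFFFFFFF


def arm_adds(x, y, carry_in=0):
    """Model of a 32-bit ARM ADDS/ADCS: returns (result, N, Z, C, V)."""
    total = x + y + carry_in
    result = total & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(total > MASK32)
    v = ((x ^ result) & (y ^ result)) >> 31
    return result, n, z, c, v


def add8_reference(a, m):
    """Plain 8-bit add with 6502-style N, Z, C, V (binary mode, carry clear)."""
    total = a + m
    r = total & 0xFF
    v = int(((a ^ r) & (m ^ r) & 0x80) != 0)
    return r, r >> 7, int(r == 0), int(total > 0xFF), v


def add8_via_top_byte(a, m):
    """Same add done on values shifted into the most significant byte."""
    result, n, z, c, v = arm_adds(a << 24, m << 24)
    return result >> 24, n, z, c, v


mismatches = [
    (a, m)
    for a in range(256)
    for m in range(256)
    if add8_reference(a, m) != add8_via_top_byte(a, m)
]
print(len(mismatches))
```

Because the low 24 bits of both operands are zero, the 32-bit result is zero exactly when the 8-bit result is, and bit 31 and the carry out of bit 31 coincide with bit 7 and the carry out of bit 7.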

Torben Mogensen (tor...@diku.dk)

Rich Talbot-Watkins

Aug 1, 2001, 8:01:18 AM
"Torben AEgidius Mogensen" <tor...@diku.dk> wrote:

> Arithmetic flags are, IMO, best emulated by arithmetic flags. To do
> this, all bytes that emulate 6502-values are shifted to occupy the
> most significant byte of a register before addition or subtraction.
> This way, resulting N, Z, C and V flags will (as far as I can judge
> without consulting manuals) be set correctly. However, incoming C
> flags and C flags in shift-instructions will have to be treated
> differently.

Yes, this is true; I also relied on ARM flags as a basis for the Z80 flag
results on my Sega Master System emulator.

I personally think that the author(s) of !65Host did a superb job of the
6502 emulation - if special memory access is ignored (e.g. screen memory,
memory-mapped I/O), the actual implementation of each 6502 opcode in ARM
is incredibly quick and clever. I wouldn't be surprised if an 8MHz ARM
could emulate an 'average' 2MHz 6502 program at nearly 100% speed using
this code, provided there's no screen or hardware access.

e.g. From memory, using its best ideas, ADC mmmm,X becomes:

.adc_mmmm_x
    LDRB   temp1,[pc6502],#1           \\ Get 6502 lo address
                                       \\ Note: pc6502 is the actual address
                                       \\ in ARM address space, not an offset
                                       \\ from the base of 6502 memory
    LDRB   temp2,[pc6502],#1           \\ Get 6502 hi address
    ORR    temp1,temp1,temp2,LSL #8    \\ Combine them
    ADD    temp1,temp1,x6502,LSR #24   \\ Add X register to base address
                                       \\ Note: x6502 is X shifted up 24.
    LDRB   temp1,[base,temp1]          \\ Load the value
    ORRCS  temp1,temp1,propc           \\ 'propc' is a constant register
                                       \\ of value &FFFFFF00 whose function
                                       \\ is to propagate the carry flag
                                       \\ up to bit 24 from bit 0
    ADCS   a6502,a6502,temp1,ROR #8    \\ Do the operation and set flags.
                                       \\ Note: a6502 is A shifted up 24.
                                       \\ The ZNCV 6502 flags are held
                                       \\ 'in place' by ARM flags
    LDRB   temp1,[pc6502],#1           \\ Get next opcode
    ADD    PC,routinebase,temp1,LSL #2 \\ Jump to it
AND #nn becomes:

.and_imm
    LDRB   temp1,[pc6502],#1           \\ Get immediate value
    MOV    temp1,temp1,LSL #24         \\ Shift up
    ANDS   a6502,a6502,temp1           \\ Do operation and set flags
                                       \\ Note: we do the shift separately
                                       \\ as we don't want to affect C
    LDRB   temp1,[pc6502],#1           \\ Get next opcode
    ADD    PC,routinebase,temp1,LSL #2 \\ Jump to it

etc etc.

The actual add is done by two simple instructions, which also take care of
setting tricky stuff like the V flag. Most of the preamble is just
getting hold of the relevant value to be added to A.
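
The ORRCS/ADCS pair in the ADC routine can be modelled in Python to confirm that the carry really does arrive at bit 24. This is a sketch for binary mode only. The helpers `ror32`, `adcs32`, `adc_propc` and `adc_reference` are written for this check and are not taken from !65Host.

```python
MASK32 = 0xFFFFFFFF
PROPC = 0xFFFFFF00  # the constant 'propc' register from the listing


def ror32(value, bits):
    """32-bit rotate right."""
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def adcs32(x, y, carry_in):
    """Model of ARM ADCS: returns (result, N, Z, C, V)."""
    total = x + y + carry_in
    result = total & MASK32
    v = ((x ^ result) & (y ^ result)) >> 31
    return result, result >> 31, int(result == 0), int(total > MASK32), v


def adc_propc(a, m, carry):
    """6502 ADC (binary mode) done the way the listing does it."""
    temp1 = m
    if carry:
        temp1 |= PROPC                         # ORRCS temp1,temp1,propc
    result, n, z, c, v = adcs32(a << 24, ror32(temp1, 8), carry)
    return result >> 24, n, z, c, v            # ADCS a6502,a6502,temp1,ROR #8


def adc_reference(a, m, carry):
    """Straightforward 8-bit ADC with 6502-style N, Z, C, V."""
    total = a + m + carry
    r = total & 0xFF
    v = int(((a ^ r) & (m ^ r) & 0x80) != 0)
    return r, r >> 7, int(r == 0), int(total > 0xFF), v


mismatches = [
    (a, m, carry)
    for carry in (0, 1)
    for a in range(256)
    for m in range(256)
    if adc_propc(a, m, carry) != adc_reference(a, m, carry)
]
print(len(mismatches))
```

With carry set, the rotated operand has &00FFFFFF in its low bits, so the ARM carry-in ripples all the way up and adds exactly one at bit 24 while leaving the low 24 bits of the result zero. With carry clear, the operand is simply the value shifted up by 24.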

Anyway, looks like it can be 'quite fast' to me! Hope that helps.

Rich


druck

Aug 1, 2001, 4:03:48 PM
On 1 Aug 2001 "Rich Talbot-Watkins" <ric...@hotmail.com> wrote:
> I personally think that the author(s) of !65Host did a superb job of the
> 6502 emulation - if special memory access is ignored (e.g. screen memory,
> memory-mapped I/O), the actual implementation of each 6502 opcode in ARM
> is incredibly quick and clever.

I fully agree. I initially did a clean room implementation for my "cross
assembler" but then allowed myself to become tainted by some of the more
optimal techniques :-)

> I wouldn't be surprised if an 8MHz ARM could emulate an 'average' 2MHz
> 6502 program at nearly 100% speed using this code, provided there's no
> screen or hardware access.

!65Tube, which was just the pure code emulator with OS calls being passed on
to RISC OS's native calls, was very nearly as fast as a 2MHz 6502 on raw
processing, and quite a bit faster overall when performing I/O via the
native OS routines. !65Host was somewhat slower, due to it having to emulate
much of the hardware level I/O, especially translating between the vastly
differing layouts of screen memory.
