NB. either initialisation or ENTER stores the opcode for JMP (addr) at
JV in the zero page.
NEXT:
INC JV+1 ; bump IP low byte
BEQ NEXT1 ; wrapped to 0: need second bump plus carry
INC JV+1 ; second bump (IP = IP + 2)
BEQ NEXT2 ; wrapped to 0: need carry into high byte
JMP JV ; jump to the JMP (IP) at JV
NEXT1:
INC JV+1 ; second bump of the low byte
NEXT2:
INC JV+2 ; carry into IP high byte
JMP JV ; jump to the JMP (IP) at JV
Essentially the zero page locations JV+1 and JV+2 are used as a soft
register (the interpreter pointer), with the opcode at JV keeping the
one piece of self-modifying code outside of the ROM address space.
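For concreteness, a minimal cold-start sketch under the scheme above,
where COLD is a hypothetical label for the first cell of the boot
thread:
INIT:
LDA #$6C ; opcode of JMP (abs)
STA JV
LDA #<COLD ; point the soft IP at the first cell
STA JV+1
LDA #>COLD
STA JV+2
JMP JV ; i.e. JMP (IP): run the word that cell points at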
I say "legacy 6502", because the 65C02 allows the JMP (addr,X) op code
to be put in JV, giving:
NEXT:
INX ; bump IP low byte (held in X)
BEQ NEXT1 ; wrapped to 0: need second bump plus carry
INX ; second bump (IP = IP + 2)
BEQ NEXT2 ; wrapped to 0: need carry into high byte
JMP JV ; jump to the JMP (base,X) at JV
NEXT1:
INX ; second bump of the low byte
NEXT2:
INC JV+2 ; carry into IP high byte
JMP JV ; jump to the JMP (base,X) at JV
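And a matching setup sketch for this 65C02 variant, again with COLD as
a hypothetical label for the first cell of the boot thread; the base
at JV+1/JV+2 stays page-aligned and X carries the low byte of the IP:
INIT:
LDA #$7C ; opcode of JMP (abs,X)
STA JV
LDA #0
STA JV+1 ; base low byte stays zero
LDA #>COLD
STA JV+2 ; base high byte = current IP page
LDX #<COLD ; X = IP low byte
JMP JV ; i.e. JMP (base,X): run the word that cell points at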
This then basically pushes us to choose between the hardware stack and
a Y-indexed stack for the data and return stacks.
The first NEXT above works with a variety of ENTER/EXIT choices,
including the hardware stack as return stack, reserving X for the data
stack index and leaving Y as the scratch index register:
ENTER: ; HL def begins with JMP ENTER
LDA JV+2 ; push the old IP on the hardware (return) stack
PHA
LDA JV+1
PHA
LDA #$6C ; (re)store the JMP (abs) opcode at JV
STA JV
LDY #0
LDA (JV+1),Y ; fetch low byte of this word's CFA from the cell IP points at
CLC
ADC #3 ; skip the 3-byte JMP ENTER to reach the body
PHA ; hold the new IP low byte while the high byte is formed
INY
LDA (JV+1),Y
ADC #0
STA JV+2 ; new IP high byte
PLA
STA JV+1 ; new IP low byte
JMP JV ; jump to the JMP (IP) at JV: run the first word of the body
EXIT:
PLA ; restore the caller's IP from the hardware stack
STA JV+1
PLA
STA JV+2 ; then fall into NEXT to resume the caller
NEXT:
INC JV+1
BEQ NEXT1
INC JV+1
BEQ NEXT2
JMP JV
NEXT1:
INC JV+1
NEXT2:
INC JV+2
JMP JV
On the other end of the spectrum, if you want multiple stacks for
multiple threads, you can provide 32-deep data stacks and 16-deep
return stacks with separate high-byte and low-byte stacks, giving 8
independent threads with three dedicated pages of RAM:
DL -- data stack low bytes, aligned on a page boundary
DH = DL+256 -- data stack high bytes
RL = DH+256 -- return stack low bytes
RH = RL+128 -- return stack high bytes
If you have 8K of RAM, this comes to 1 1/4K together with the zero
page and hardware stack. Add half a K per thread for user variables,
PAD, and pictured numeric output, and you have allocated 5 1/4K,
leaving 2 3/4K for I/O and other uses (block buffers, input buffer,
serial port buffer).
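To make the split byte-lane layout concrete, a minimal sketch of
pushing a 16-bit literal onto one thread's data stack, assuming X is
that thread's data stack index (each thread owning a 32-byte slice of
DL and DH), the stack grows downward in X, and VALUE is a hypothetical
constant:
DEX ; make room for one cell
LDA #<VALUE
STA DL,X ; low-byte lane
LDA #>VALUE
STA DH,X ; high-byte lane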
ENTER: ; HL def begins with JSR ENTER
DEY ; make room on the Y-indexed return stack
LDA JV+2 ; save the old IP
STA RH,Y
LDA JV+1
STA RL,Y
LDA #$6C ; (re)store the JMP (abs) opcode at JV
STA JV
PLA ; JSR pushed CFA+2, low byte first off the stack
CLC
ADC #1 ; +1 gives the body address, CFA+3
STA JV+1 ; new IP low byte
PLA
ADC #0
STA JV+2 ; new IP high byte
JMP JV ; jump to the JMP (IP) at JV: run the first word of the body
ENTER1:
INC JV+2
JMP JV
EXIT:
LDA RH,Y ; restore the caller's IP from the Y-indexed return stack
STA JV+2
LDA RL,Y
STA JV+1
INY ; pop it, then fall into NEXT
NEXT:
INC JV+1
BEQ NEXT1
INC JV+1
BEQ NEXT2
JMP JV
NEXT1:
INC JV+1
NEXT2:
INC JV+2
JMP JV
No. I would submit that RTS is the fastest NEXT: one byte, 6 cycles.
Of course that means that the hardware stack would be the return stack
in the implementation rather than being used for the parameter stack.
The tradeoff would be that most of the parameter stack manipulation
words would be significantly slower, but it would be the fastest NEXT.
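For reference, a sketch of what a colon definition compiles to under
that model (subroutine threading with the hardware stack as the return
stack); SQUARED, DUP and STAR are hypothetical labels:
SQUARED: ; : SQUARED DUP * ;
JSR DUP ; each reference is a 3-byte JSR
JSR STAR
RTS ; the whole NEXT/EXIT: one byte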
Best Wishes
'Cor, you would say that, guvner. 'Tis 11 cycles, though, as the next
word in the list requires a JSR to complete the NEXT.
I know what you mean, but in weighing the code-space versus
execution-speed trade-off, you have to put a value on the
list-of-execution-tokens side to weigh against the 11 cycles of the
RTS followed by the next JSR. That's what I am looking for here. Of
course, the missing instruction is an increment of a register with
overflow into the zero page, as in the hypothetical "65C04" below.
NEXT: ; NMOS 6502 25.04 cycles
; JMP NEXT at end of primitive, 3 cycles
INC JV+1 ; 5
BEQ NEXT1 ; 2 / +4 app. 1/256
INC JV+1 ; 5
BEQ NEXT2 ; 2 / +6 app. 1/256
JMP JV ; 8 (jump to JMP ())
NEXT1:
INC JV+1 ; 5
NEXT2:
INC JV+2
JMP JV
NEXT: ; W65C02 21.04 cycles
; JMP NEXT at end of primitive, 3 cycles
INX ; 2
BEQ NEXT1 ; 2 / +4 app. 1/256
INX ; 2
BEQ NEXT2 ; 2 / +6 app. 1/256
JMP JV ; 8 (jump to JMP ())
NEXT1:
INX
NEXT2:
INC JV+2
JMP JV
NEXT: ; 65C04 12.023 cycles
; 7 bytes embedded at end of primitives
INX JV+2 ; 2/5
INX JV+2 ; 2/5
JMP JV ; 8 (jump to JMP ())
>I know what you mean, but in weighing the code-space versus
>execution-speed trade-off, you have to put a value on the
>list-of-execution-tokens side to weigh against the 11 cycles of the
>RTS followed by the next JSR.
Yes, each call is three bytes. BUT RTS is one byte and there is no
code pointer, saving three bytes per colon word for ITC and four bytes
for DTC. If you organise register and zero page usage carefully you
will also find that some primitives, e.g. 1+, can be replaced by
1/2/3 byte instruction sequences.
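As a hedged illustration of that last point, assuming a fig-Forth
style zero-page data stack indexed by X (cell low byte at 0,X, high
byte at 1,X), some primitives do inline very cheaply:
INX ; DROP inlines as two one-byte instructions
INX
ASL 0,X ; 2* inlines as four bytes
ROL 1,X
As the follow-up below points out, a 16-bit 1+ is not in that class.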
Stephen
--
Stephen Pelc, steph...@INVALID.mpeltd.demon.co.uk
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeltd.demon.co.uk - free VFX Forth downloads
> Yes, each call is three bytes. BUT RTS is one byte and there is no
> code pointer, saving three bytes per colon word for ITC and four bytes
> for DTC. If you organise register and zero page usage carefully you
> will also find that some primitives, e.g. 1+, can be replaced by
> 1/2/3 byte instruction sequences.
It doesn't matter how carefully you organise register and zero page
usage, there's no way to get 1+ on a 16-bit value as a 1/2/3 byte
instruction sequence with the 6502 or 65C02. That's the 65016
(INC A).
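Indeed, with the X-indexed zero-page stack assumed above, the shortest
obvious 16-bit 1+ is about six bytes (a sketch):
INC 0,X ; bump the low byte of TOS
BNE DONE ; no wrap: done
INC 1,X ; else propagate the carry into the high byte
DONE: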
Anyway, sorry, Jeff, it turned out that it was Stephen who said it.
It is established that, without modifications to the instruction
set itself, the fastest threading model is subroutine threading.
Byte-threading with 8-bit tokens for all primitives seems to offer
performance close to DTC, and it would seem to be more compact
than either DTC or STC.
If the head of this thread really is the fastest ROMable software
NEXT, it doesn't seem to me that DTC is a likely candidate unless
one is using the W65C02.
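For comparison, a minimal token-threaded NEXT sketch for the 6502,
assuming TIP is a zero-page token pointer, W a zero-page scratch pair,
and TLO/THI page-aligned tables holding the low and high bytes of each
primitive's address (all hypothetical names):
TNEXT:
LDY #0
LDA (TIP),Y ; fetch the 8-bit token
TAX
INC TIP ; advance the token pointer by one byte
BNE TNEXT1
INC TIP+1
TNEXT1:
LDA TLO,X ; look the token up in the split address table
STA W
LDA THI,X
STA W+1
JMP (W) ; run the primitive
With one byte per compiled reference the table tops out at 256
primitives, which is where the compactness comes from.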
The old S=IP direct threading technique? Unfortunately, this
technique has serious disadvantages on the 6502:
- S is only an 8-bit register, so the program would be limited to 256
bytes (<=128 words).
- Interrupts write to the stack (there is no alternate stack or
somesuch), so this technique would only work with interrupts disabled,
and the NMI line disconnected (i.e., not on the C64); or one would
have to restore the program in the interrupt, but I don't see how to
do that and do a proper return from interrupt.
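For readers who have not met it: the threaded program sits in the
stack page as a list of (code address - 1) cells, low byte at the
lower address; S walks through it, and each primitive ends in RTS,
which is the entire NEXT. A rough sketch, with THREAD as a
hypothetical offset of the byte just below the first cell:
LDX #THREAD ; S must sit one byte below the first cell's low byte
TXS ; S now serves as the IP
RTS ; the first "NEXT": pops a cell and runs that word
; ... and every primitive then simply ends with
RTS ; 1 byte, 6 cycles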
>Of course that means that the hardware stack would be the return stack
>in the implementation rather than using it for the parameter stack.
So you are talking about a different technique anyway; I guess
subroutine threading; but there the NEXT consists of the RTS followed
by a JSR (12 cycles), and you lose one byte per compiled word (for the
JSRs).
Fig-Forth for the 6502 used S as return stack pointer anyway, and X as
data stack pointer; this results in faster code (try implementing +
with S as data stack pointer to see why).
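For example, with X as the data stack pointer into the zero page (cell
low byte at 0,X, high byte at 1,X), + is short and fast; with S as the
data stack pointer every operand access would need TSX or PLA/PHA
juggling, since the 6502 has no stack-relative addressing. A sketch of
the X-indexed version:
PLUS: ; +
CLC
LDA 0,X ; low byte of TOS
ADC 2,X ; plus low byte of NOS
STA 2,X
LDA 1,X ; high bytes, with the carry
ADC 3,X
STA 3,X
INX ; drop TOS
INX
JMP NEXT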
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
>It doesn't matter how carefully you organise register and zero page
>usage, there's no way to get 1+ on a 16-bit value as a 1/2/3 byte
>instruction sequence with the 6502 or 65C02. That's the 65016
>(INC A).
Apologies about 1+, but you know what I mean. Unless one is in
modem or retro territory, why use a 65xxx these days? There are
plenty of modern CPUs which are fun to program in assembler.
Good question. This thread looks like something from 20 years ago.
OTOH, Cypress' PSOC chips use a 6800 variant.
--Brad
Normally because you have well-tested legacy 65C02 assembly code that
already does the job, and reusing it saves time compared to bringing
up another core that fits into an FPGA of the same size, alongside the
rest of the gear you want in that FPGA.
IOW, the same reason you programmed in 65C02 assembly language
twenty years ago, except shifted from discrete systems to a single
FPGA with a ROM hanging off of it.