6502 trashing memory cycles...

bie...@terra.es

Mar 4, 2007, 2:12:10 PM

Hello,

I've been having a look at the 6502 instruction details in the
databooks and in uta2,p20..23. I've never realized that the 6502
trashes so many memory cycles here and there. For example 3 out of the
6 cycles of an RTS are useless. 2 out of 4 of a PLA are useless, 1 out
of 2 of a DEX are useless.

The table in uta2,p22, shows the instructions classified in 29
groups. 22 out of the 29 opcode groups trash between 1 and 3 memory
cycles..!

A clever circuit could make good use of so many memory cycles for some
other purpose... any ideas ?

Jorge.

Bryan Parkoff

Mar 4, 2007, 3:11:55 PM

<bie...@terra.es> wrote in message
news:1173035529.9...@n33g2000cwc.googlegroups.com...

Jorge,

Why do you think this way? Let me give you the information below. I
don't think 3 out of the 6 cycles of an RTS are useless.

Cycle  Address Bus  Data Bus        External Operation    Internal Operation
1      0300         Opcode          Fetch Opcode          Finish Previous Operation, 0301 + PC
2      0301         Discarded Data  Fetch Discarded Data  Decode RTS
3      01FD         Discarded Data  Fetch Discarded Data  Increment Stack Pointer to 01FE
4      01FE         02              Fetch PCL             Increment Stack Pointer to 01FF
5      01FF         01              Fetch PCH
6      0102         Discarded Data  Put out PC            Increment PC by 1 to 0103
7      0103         Next Opcode     Fetch Next Opcode

You can look at the Synertek programming manual from August 1976 (originally
from MOS Technology, Inc., 1975). It is a useful source for the details you
need to know.

Bryan Parkoff

bie...@terra.es

Mar 4, 2007, 4:53:43 PM

On Mar 4, 9:11 pm, "Bryan Parkoff" <nos...@nospam.com> wrote:
> Why do you think this way? Let me give you the information below. I
> don't think 3 out of the 6 cycles of an RTS are useless.
>
> Cycle  Address Bus  Data Bus        External Operation    Internal Operation
> 1      0300         Opcode          Fetch Opcode          Finish Previous Operation, 0301 + PC
> 2      0301         Discarded Data  Fetch Discarded Data  Decode RTS
> 3      01FD         Discarded Data  Fetch Discarded Data  Increment Stack Pointer to 01FE
> 4      01FE         02              Fetch PCL             Increment Stack Pointer to 01FF
> 5      01FF         01              Fetch PCH
> 6      0102         Discarded Data  Put out PC            Increment PC by 1 to 0103
> 7      0103         Next Opcode     Fetch Next Opcode
>

Yes, memory cycles 2,3 and 6 are useless, wasted.
An external circuit that knew that could put them to better use : to
generate sound, or to read from the video frame buffer, or for DMA,
or... ?
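
As a rough check of that count, the six RTS cycles from the table can be written down and tallied. This is only a sketch: `RTS_TRACE` and `discarded_cycles` are names made up for illustration, and the `used` flags simply restate the "Discarded Data" entries of the table (cycle 7 is left out because it is the next instruction's opcode fetch).

```python
# Each entry: (cycle number, address on the bus, whether the byte read is used).
# The flags restate the "Discarded Data" rows of the RTS table quoted above.
RTS_TRACE = [
    (1, 0x0300, True),   # opcode fetch
    (2, 0x0301, False),  # byte after the opcode, discarded
    (3, 0x01FD, False),  # stack read before SP is incremented, discarded
    (4, 0x01FE, True),   # PCL
    (5, 0x01FF, True),   # PCH
    (6, 0x0102, False),  # read at the pulled PC, discarded
]

def discarded_cycles(trace):
    """Return the cycle numbers whose read data the CPU throws away."""
    return [cycle for cycle, _address, used in trace if not used]

wasted = discarded_cycles(RTS_TRACE)
print(f"{len(wasted)} of {len(RTS_TRACE)} cycles discarded: {wasted}")
```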

Jorge.

Tristan Mumford

Mar 4, 2007, 5:26:32 PM

bie...@terra.es wrote:

The extra reads may not serve an amazing amount of use but catching them
would be nontrivial. I may be wrong but it would be easier to reimplement a
6502 on an FPGA with the extra circuitry built in to tristate the outputs
during certain cycles.

That kind of design would interfere with sound generation with the
softswitch and may cause RAM refresh problems.

I'm now thinking of the C=64. The clock model on that is enough to give me
nightmares. It wastes very little and uses both phases of the clock too.
Even the VIC-II counts cycles to know the soonest time it can start
stealing cycles.

I don't remember. Does the Apple ][ use one half of the clock or both?

Tristan.
--
-----> http://members.dodo.com.au/~izabellion1/tristan/index.html <-----
===== It's not pretty, it's not great, but it is mine. =====

Bryan Parkoff

Mar 4, 2007, 6:48:14 PM

"Tristan Mumford" <xtristan...@xgmail.xcom> wrote in message
news:45eb47a0$0$60327$c30e...@lon-reader.news.telstra.net...

Tristan,

Yes, the 6502 processor needs to sequence cycle by cycle whether or not the
data bus is used. Yes, the 14 MHz crystal oscillator is divided down to 1 MHz
for both the 6502 processor and the video processor. The two can't access
memory at the same time at 1 MHz. The video processor goes first, during one
half of the clock cycle, and is then suspended while the 6502 processor uses
the other half of the cycle. The two keep alternating in turn.

Bryan Parkoff

wv9...@yahoo.com

Mar 4, 2007, 10:50:37 PM

The RTS is a fairly complex instruction. First it has to put SP on the
memory bus to get half of the saved pc, then sp+1 on the memory bus to
get the other half of the saved pc, then it puts the saved pc on the
memory bus to get the next opcode. That's 3 memory cycles. The NOP
alone takes 2 cycles. So I think it's fair that RTS takes 6.

Mark McDougall

Mar 4, 2007, 11:18:26 PM

bie...@terra.es wrote:

>> Cycle  Address Bus  Data Bus        External Operation    Internal Operation
>> 1      0300         Opcode          Fetch Opcode          Finish Previous Operation, 0301 + PC
>> 2      0301         Discarded Data  Fetch Discarded Data  Decode RTS
>> 3      01FD         Discarded Data  Fetch Discarded Data  Increment Stack Pointer to 01FE
>> 4      01FE         02              Fetch PCL             Increment Stack Pointer to 01FF
>> 5      01FF         01              Fetch PCH
>> 6      0102         Discarded Data  Put out PC            Increment PC by 1 to 0103
>> 7      0103         Next Opcode     Fetch Next Opcode
>>
>
> Yes, memory cycles 2,3 and 6 are useless, wasted. An external circuit
> that knew that could put them to better use : to generate sound, or
> to read from the video frame buffer, or for DMA, or... ?

It's called 'pipelining'. For example memory cycle 2 is not "wasted" - at
this point the opcode has not been decoded and the cpu _has no way_ of
knowing whether or not the next byte in memory is required as an
operand, for example. So the system pipelines the operation by fetching
the next byte in case it is required. If you 'stole' that memory cycle
then you'd slow down every instruction that required more than a single
byte.

How do you propose that an external circuit can 'know' which cycles it
can steal??? Even if you could use the cycles, how could it possibly
'know' that the CPU was fetching, say, RTS rather than data that
corresponded to an RTS opcode? Your logic would require a state machine
similar to the instruction decoding logic within the CPU itself. And
even with all that, it would have to disconnect the CPU from the bus
before the next cycle started.

Regards,

--
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266

Mark McDougall

Mar 4, 2007, 11:25:07 PM

Tristan Mumford wrote:

> I may be wrong but it would be easier to
> reimplement a 6502 on an FPGA with the extra circuitry built in to
> tristate the outputs during certain cycles.

A 6502 implemented within an FPGA wouldn't necessarily have the same
behaviour cycle-for-cycle - it need not even be cycle-accurate and could
execute some instructions in fewer cycles than the 6502.

If you're really interested in cycle-stealing, then you'd probably
clock the FPGA implementation at a multiple of the bus frequency and
enhance the core to signal to external logic that it can have the next
external bus cycle as soon as possible.

bie...@terra.es

Mar 5, 2007, 3:10:34 AM

On 5 mar, 05:18, Mark McDougall <m...@vl.com.au> wrote:
>
> It's called 'pipelining'. For example memory cycle 2 is not "wasted" - at
> this point the opcode has not been decoded and the cpu _has no way_ of
> knowing whether or not the next byte in memory is required as an
> operand, for example. So the system pipelines the operation by fetching
> the next byte in case it is required. If you 'stole' that memory cycle
> then you'd slow down every instruction that required more than a single
> byte.

I'm not saying that the 6502 doesn't internally do something useful
during those cycles.
I'm saying, however, that whatever it is, accessing memory during
those cycles is not necessary.
During those cycles it does unneeded memory accesses.

> How do you propose that an external circuit can 'know' which cycles it
> can steal??? Even if you could use the cycles, how could it possibly
> 'know' that the CPU was fetching, say, RTS rather than data that
> corresponded to an RTS opcode? Your logic would require a state machine
> similar to the instruction decoding logic within the CPU itself. And
> even with all that, it would have to disconnect the CPU from the bus
> before the next cycle started.
>
> Regards,

The circuit would have to monitor SYNC-marked cycles in order to check
the fetched opcode, then start an n-cycle state machine (a function
of the fetched opcode) that flags the free/available cycles that can
be stolen. The CPU would have to be isolated from the address/data
bus during those cycles, DMA-style but without stopping the CPU's
phase 0 clock.

Given that most of the instructions have memory cycles to "offer", I'm
wondering whether a faster-clocked CPU would have provided
enough memory cycles to do both the video memory scan
and the refresh "for free", i.e., without sacrificing half of the memory
bandwidth as was done in the Apple II.

Another clever idea would have been to steal cycles to fetch data for
sound generation, similar to what was done years later in the
original Mac.

The problem is that certain sequences of instructions may not provide
enough free cycles for those purposes, but I think that the probability
of such a sequence occurring is very small...

I find this an interesting AppleII-logy matter to think about...
isn't it ?
Would you, dear AppleII-logist colleagues, mind commenting on
this ?

Regards,
Jorge.

David Wilson

Mar 5, 2007, 3:55:53 AM

On Mar 5, 7:10 pm, "biel...@terra.es" <biel...@terra.es> wrote:
> I'm not saying that the 6502 doesn't internally do something useful
> during those cycles.
> I'm saying, however, that whatever it is, accessing memory during
> those cycles is not necessary.
> During those cycles it does unneeded memory accesses.

This is true. Have a look at the 6809E for a chip that can release the
bus during dead cycles (a pin called AVMA (Advanced VMA) indicates
that the processor will use the bus in the following cycle). My
Stellation Mill card allows the motherboard 65(c)02 to run during
these periods.

David Empson

Mar 5, 2007, 6:41:21 AM

bie...@terra.es <bie...@terra.es> wrote:

> Given that most of the instructions have memory cycles to "offer", I'm
> wondering whether a faster-clocked CPU would have provided
> enough memory cycles to do both the video memory scan
> and the refresh "for free", i.e., without sacrificing half of the memory
> bandwidth as was done in the Apple II.

I'd dispute that "most" instructions have spare memory cycles. It may be
that somewhat more than half have a wasted memory access cycle, but the
more commonly used ones don't waste cycles.

For example, all immediate, zero page and absolute instructions don't
waste any memory cycles, but indexing does, as do all single byte
instructions, which have a minimum of two cycle execution time.

Typical code patterns would result in "spare" cycles only appearing
intermittently. You certainly can't guarantee a minimum number of spare
cycles within any particular period of time for arbitrary code
execution, which limits the usefulness of stealing these cycles for
something else.

I'm not sure why the 6502 needs to waste a cycle for things like NOP or
INX. The 6801 derivative I used at work (Hitachi 6303) managed to
execute these in one cycle.

> Another clever idea would have been to steal cycles to fetch data for
> sound generation, similar to what was done years later in the
> original Mac.
>
> The problem is that certain sequences of instructions may not provide
> enough free cycles for those purposes, but I think that the probability
> of such a sequence occurring is very small...

Lots of code does things like LDA immediate, STA somewhere, repeating
for further locations. This could easily go on for ten or twenty
instructions with no spare cycles.
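
To illustrate how long such a dry spell can get, here is a small Python sketch. The `TIMING` table and the `longest_dry_run` helper are hypothetical names; the cycle counts are the usual ones for these instructions, and the "spare" counts follow the claims made in this thread rather than any measurement.

```python
# Hypothetical table: instruction form -> (total cycles, spare memory cycles).
# Spare counts follow the figures quoted in this thread, not a measurement.
TIMING = {
    "LDA #imm": (2, 0),
    "STA abs": (4, 0),
    "DEX": (2, 1),
    "RTS": (6, 3),
}

def longest_dry_run(program):
    """Longest stretch of cycles spent in instructions that offer no spare cycle."""
    longest = current = 0
    for name in program:
        cycles, spare = TIMING[name]
        if spare:
            current = 0          # an instruction with a spare cycle ends the run
        else:
            current += cycles
            longest = max(longest, current)
    return longest

# Ten LDA/STA pairs (a typical initialisation sequence), then DEX and RTS.
init_code = ["LDA #imm", "STA abs"] * 10 + ["DEX", "RTS"]
print(longest_dry_run(init_code), "cycles in a row with nothing to steal")
```

With ten load/store pairs the run is 10 * (2 + 4) cycles long, which is the kind of gap any buffering scheme would have to ride out.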

--
David Empson
dem...@actrix.gen.nz

Mark McDougall

Mar 5, 2007, 6:54:35 AM

bie...@terra.es wrote:

> I'm not saying that the 6502 doesn't internally do something useful
> during those cycles.
> I'm saying, however, that whatever it is, accessing memory during
> those cycles is not necessary.
> During those cycles it does unneeded memory accesses.

But my point is that it *is* necessary in _some_ of these cases, even if the
result is not used, for the reasons I gave earlier.

Regards,

--
| Mark McDougall | "Electrical Engineers do it
| <http://members.iinet.net.au/~msmcdoug> | with less resistance!"

bie...@terra.es

Mar 5, 2007, 9:15:35 AM

On 5 mar, 12:54, Mark McDougall <msmcd...@no.spam.iinet> wrote:

> biel...@terra.es wrote:
> > I'm not saying that the 6502 doesn't internally do something useful
> > during those cycles.
> > I'm saying, however, that whatever it is, accessing memory during
> > those cycles is not necessary.
> > During those cycles it does unneeded memory accesses.
>
> But my point is that it *is* necessary in _some_ of these cases, even if the
> result is not used, for the reasons I gave earlier.
>
> Regards,

Yes, one example of what you say is a BEQ/BNE. Both spend a cycle
reading the next address after the opcode, even though the branch may
not be taken (and the offset read at that address not needed). I
imagine that during this extra cycle the decision is taken inside the
6502. And there's no easy way to know from the outside whether that 2nd
cycle can be stolen or not, unless you knew the state of the Z
flag...

But if the 6502 were running at some extravagant 14 MHz, for example
(*), the probability of being able to steal 40 cycles in a 40
microsecond (***) time lapse would be an infinitesimal away from 1, I
think. Enough (**) to re-create a line of video without wasting any
memory bandwidth... !

(*) Figure 6.2 on page 24 of WDC's 65c02 datasheet.pdf shows fMax in a
graph as being 20 MHz. Table 6.3 on page 25, OTOH, doesn't show
a.c. values for anything higher than 14 MHz. What's fMax then, 14 MHz or
20 MHz ?
(**) There are 14*40=560 (!) phase 0 cycles in 40 microseconds if
running @14MHz. 40 out of 560 is a mere 7.1%.
(***) And in fact, the time lapse between successive video scan lines
is more than 40 microseconds.
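
The arithmetic in footnote (**) can be reproduced in a few lines. This is just a sketch of the calculation; the constant names are made up for illustration and the 14 MHz clock is the hypothetical figure from the post, not a tested configuration.

```python
CLOCK_MHZ = 14             # hypothetical CPU clock from the post
WINDOW_US = 40             # time window considered, in microseconds
STOLEN_CYCLES_NEEDED = 40  # cycles to steal within that window

# MHz times microseconds gives a plain cycle count.
cycles_in_window = CLOCK_MHZ * WINDOW_US
needed_fraction = STOLEN_CYCLES_NEEDED / cycles_in_window

print(cycles_in_window, "cycles in the window")
print(f"{needed_fraction:.1%} of them would have to be spare")
```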

Jorge.

bie...@terra.es

Mar 5, 2007, 9:35:53 AM

On 5 mar, 12:41, demp...@actrix.gen.nz (David Empson) wrote:
>
> I'd dispute that "most" instructions have spare memory cycles. It may be
> that somewhat more than half have a wasted memory access cycle, but the
> more commonly used ones don't waste cycles.
>
> For example, all immediate, zero page and absolute instructions don't
> waste any memory cycles, but indexing does, as do all single byte
> instructions, which have a minimum of two cycle execution time.
>
> Typical code patterns would result in "spare" cycles only appearing
> intermittently. You certainly can't guarantee a minimum number of spare
> cycles within any particular period of time for arbitrary code
> execution, which limits the usefulness of stealing these cycles for
> something else.
>
> I'm not sure why the 6502 needs to waste a cycle for things like NOP or
> INX. The 6801 derivative I used at work (Hitachi 6303) managed to
> execute these in one cycle.
>
>

> Lots of code does things like LDA immediate, STA somewhere, repeating
> for further locations. This could easily go on for ten or twenty
> instructions with no spare cycles.
>
> --
> David Empson

David,

I wouldn't be so sure a priori.
An analysis of a trace of several seconds would probably reveal the
truth.
Anyway, in my spare time I sometimes like to think (and write and
draw) about a design for an Apple II accelerator, Zip Chip-like, based
on a WDC 65c02 @ 20 MHz (if at all possible). It makes me feel sick that
whenever the address seen on the address bus must go to the real
(slow) Apple II, the whole thing has to slow down from 20 to 1 MHz,
and waste up to almost 2000 ns in the process (when it happens shortly
after a just-started phase 0 cycle that must be left to pass by until
the next slow cycle). For me, this is reason enough to try to discover
and avoid such memory cycles...
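
The "almost 2000 ns" worst case can be sketched as follows. This is a simplified model, not a description of any real accelerator: `slow_access_ns` is a hypothetical helper, the slow cycle is rounded to 1000 ns as in the post, and a request that arrives exactly on a cycle boundary is assumed to use that cycle.

```python
SLOW_CYCLE_NS = 1000  # one 1 MHz cycle, rounded as in the post

def slow_access_ns(offset_ns):
    """Time to finish one slow access requested offset_ns into a slow cycle.

    Assumption: an access must start on a slow-cycle boundary, so a request
    that just missed one waits for the next boundary and then takes a full cycle.
    """
    wait = (SLOW_CYCLE_NS - offset_ns) % SLOW_CYCLE_NS
    return wait + SLOW_CYCLE_NS

for offset in (0, 1, 500, 999):
    print(f"request {offset:3d} ns into the cycle -> {slow_access_ns(offset)} ns")
```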

Jorge.

mdj

Mar 6, 2007, 9:26:17 PM

Your accelerator could end up being more complex than the Apple II :-)

In any reasonable accelerator design, when running at high speed it hits
high-speed memory, so there's no problem, and in that mode you can
break any cycle timing rule you like.

Of course, insisting on using a 'real' 65c02 in such a design imposes
a number of constraints on the optimisation you speak of, so you'd be
better with programmable logic, where you can optimise it to the point
where the worst case cycle time is 3. Although why you'd do this, and
not just up the clock speed of the accelerator until you reached the
glass ceiling speed I am not sure.

Matt

bie...@terra.es

Mar 7, 2007, 3:11:21 AM

On 7 mar, 03:26, "mdj" <mdj....@gmail.com> wrote:
>
> Your accelerator could end up being more complex than the Apple II :-)

I think that a circuit to flag these cycles may be quite simple:

-A combinational logic block in which the input is the opcode and the
output is six bits; the state of these output bits (one for each of
the six forthcoming memory cycles) flags the cycle as "borrowable" or
not. These 6 bits feed the input of:
-A parallel-in serial-out shift register that is loaded during
SYNC-marked cycles (during opcode fetches), and clocked by phase 0.

Unless I'm missing something, more or less, that's it.
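
The two blocks described above can be modelled in software to see how they would behave. This is only a sketch under stated assumptions: `BORROW_FLAGS` and `flag_borrowable` are hypothetical names, only three opcodes are filled in, and their flags follow the cycle counts quoted earlier in this thread (RTS, PLA, DEX) rather than a verified table.

```python
from typing import Iterable, List, Tuple

# Stand-in for the combinational block: opcode -> flags for the cycles that
# FOLLOW the opcode fetch (1 = borrowable). Only three opcodes are filled in.
BORROW_FLAGS = {
    0x60: [1, 1, 0, 0, 1],  # RTS: cycles 2, 3 and 6 discarded
    0x68: [1, 1, 0],        # PLA: cycles 2 and 3 discarded
    0xCA: [1],              # DEX: cycle 2 discarded
}

def flag_borrowable(bus: Iterable[Tuple[bool, int]]) -> List[bool]:
    """bus yields (SYNC, data bus byte) once per phase 0 cycle."""
    shift_reg: List[int] = []
    flags = []
    for sync, data in bus:
        if sync:
            flags.append(False)                           # opcode fetch is always needed
            shift_reg = list(BORROW_FLAGS.get(data, []))  # parallel load on SYNC
        else:
            # serial shift out, one flag per clock; empty register means "not borrowable"
            flags.append(bool(shift_reg.pop(0)) if shift_reg else False)
    return flags

# DEX followed by RTS, as (SYNC, data) pairs.
trace = [(True, 0xCA), (False, 0x60),
         (True, 0x60), (False, 0x00), (False, 0x00),
         (False, 0x02), (False, 0x01), (False, 0x00)]
print(flag_borrowable(trace))
```

Opcodes missing from the table simply never flag a cycle, so an incomplete table errs on the safe side.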

> In any reasonable accelerator design, when running high speed it hits
> high-speed memory, so there's no problem, and in that mode you can
> break any cycle timing rule you like.
>
> Of course, insisting on using a 'real' 65c02 in such a design imposes
> a number of constraints on the optimisation you speak of, so you'd be
> better with programmable logic, where you can optimise it to the point
> where the worst case cycle time is 3.

Yes, but honestly, I don't feel like I could design a 6502 compatible
processor. And, if such a thing were done, it ought to have a 100%
6502 compatible mode, and this would complicate the design even more.

> Although why you'd do this, and
> not just up the clock speed of the accelerator until you reached the
> glass ceiling speed I am not sure.
>
> Matt

I'm trying to draw a thick line between the "processor module" of the
accelerator and the "interface to the Apple II" module, so that the
processor can be replaced by a faster one in the future.

The nice thing about just-thinking-about is that you don't need to
pick up the solder... :-)
Of course, once the design is finished, there's nothing like doing it
and seeing it work (*) !

Jorge.

(*) never the first time.

Alex Freed

Mar 7, 2007, 3:53:43 AM

<bie...@terra.es> wrote in message
news:1173255081.0...@p10g2000cwp.googlegroups.com...

>
> I think that a circuit to flag these cycles may be quite simple:
>
> -A combinational logic block in which the input is the opcode and the
> output is six bits; the state of these output bits (one for each of
> the six forthcoming memory cycles) flags the cycle as "borrowable" or
> not. These 6 bits feed the input of:
> -A parallel-in serial-out shift register that is loaded during
> SYNC-marked cycles (during opcode fetches), and clocked by phase 0.
>
> Unless I'm missing something, more or less, that's it.

I think one thing is missing here. On some instructions some cycles
may be needed sometimes and not needed at other times. It should
still work but not tag ALL of the "borrowable" cycles.

-Alex.

bie...@terra.es

Mar 7, 2007, 4:23:59 AM

On 7 mar, 09:53, "Alex Freed" <a...@mirrow.com> wrote:
>
> I think one thing is missing here. On some instructions some cycles
> may be needed sometimes and not needed at other times. It should
> still work but not tag ALL of the "borrowable" cycles.
>
> -Alex.

In the case of conditional branches, I don't see how to predict if
it's going to be taken as the processor flags are unknown from the
outside. In the case of page crossings, it's also difficult if at all
possible to predict.
What other cases do you see ?

Every two-cycle opcode gives you one free cycle, every RTS gives
you 3, and pushes and pulls give you 2. There are a lot of instances
of these in any program... lots of memory cycles that could be used
for... what ? hehehe

Regards,
Jorge.

mdj

Mar 7, 2007, 8:40:32 AM

On Mar 7, 6:11 pm, "biel...@terra.es" <biel...@terra.es> wrote:

> I'm trying to draw a thick line between the "processor module" of the
> accelerator and the "interface to the Apple II" module, so that the
> processor can be replaced by a faster one in the future.

Fair enough. Just keep in mind that freely available 6502 cores
already exist, and faster processors are likely to only be available
via faster programmable logic platforms; there is little reason to
build faster 'hard' 6502s when a 100 MHz part is doable on
hobbyist-accessible FPGA gear.

> The nice thing about just-thinking-about is that you don't need to
> pick up the solder... :-)
> Of course, once the design is finished, there's nothing like doing it
> and seeing it work (*) !

Of course :-) And if you make such a thing I'll be buying one.

Matt

Michael J. Mahon

Mar 7, 2007, 8:49:57 PM

You might want to do some frequency analysis on traces to
see just what percentage of memory cycles you are reclaiming.

Of course, the result will always be statistical in nature and
particular lengthy instruction sequences may deviate wildly from
the norm (which makes using them somewhat antithetical to a
machine with otherwise deterministic timing).

-michael

NadaNet networking for Apple II computers!
Home page: http://members.aol.com/MJMahon/

"The wastebasket is our most important design
tool--and it's seriously underused."

bie...@terra.es

Mar 8, 2007, 3:11:47 AM

On Mar 8, 2:49 am, "Michael J. Mahon" <mjma...@aol.com> wrote:
>
> You might want to do some frequency analysis on traces to
> see just what percentage of memory cycles you are reclaiming.
>

> -michael

Let's say for example that the percentage was just 1%.
At 1 MHz, this accounts for 10 KILOBYTES/second... !

What would you like to use this (free) bandwidth for ?
Any ideas, please ?

Jorge.

bie...@terra.es

Mar 31, 2007, 2:42:16 PM

> On 7 mar, 09:53, "Alex Freed" <a...@mirrow.com> wrote:
>
> I think one thing is missing here. On some instructions some cycles
> may be needed sometimes and not needed at other times. It should
> still work but not tag ALL of the "borrowable" cycles.
>
> -Alex.

> biel...@terra.es wrote:
> An analysis of a trace of several seconds would probably reveal the
> truth.

On Mar 8, 3:49 am, "Michael J. Mahon" <mjma...@aol.com> wrote:
> You might want to do some frequency analysis on traces to
> see just what percentage of memory cycles you are reclaiming.
>

> -michael

OK. This is how the throughput figure comes out while the Apple II is
(doing nothing) waiting for a keypress, for example at the BASIC
prompt or during an input :

The code is:

FD1B INC $4E (5 CYCLES, 1 BORROWABLE)
FD1D BNE $FD21 (3 CYCLES NOT TAKEN, 4 CYCLES TAKEN, 0 BORROWABLE)
FD1F INC $4F (5 CYCLES, 1 BORROWABLE)
FD21 BIT $C000 (4 CYCLES, 0 BORROWABLE)
FD24 BPL $FD1B (TAKEN, 4 CYCLES, 0 BORROWABLE)

The loop takes
(5+4+4+4)=17 cycles, (1 borrowable)
255 times, then
(5+3+5+4+4)=21 cycles, (2 borrowable)
the 256th time.

Every (255*17)+21=4356 cycles, there are (255*1)+2=257 borrowable
cycles.

That is (257/43.56)=5.9 % of the time, which @1MHz translates to 59 KB/s.

Jorge.

Michael J. Mahon

Mar 31, 2007, 3:17:11 PM

A very respectable number.

You chose an interesting case to examine, since it also admits a
software-only approach to "background" data transfer.

Since the bandwidth, whether obtained by hardware or software means,
is only available "on the average", any use of it would demand a FIFO
to queue the bytes (incoming or outgoing). If a FIFO (either hardware
or software) is present, then the keyboard poll loop is a natural place
to transfer data using programmed I/O rather than DMA--for example, to
a printer buffer, or whatever.

If the computer is being used interactively--meaning it is usually
waiting for user input--then transferring data in the keyboard wait
loop is a very effective approach. Of course, the potential latency
between keyboard loops is determined only by the processing going on
in the system, and could be long--but can usually be arranged to be
a fraction of a second.

NadaNet's server loop effectively inserts itself into the keyboard
loop by sensing any keypress and returning to the caller--which will
usually result in the keypress being processed (after which the
server loop is again called). Since the default timeout period for
a NadaNet request is about 3 seconds, it doesn't take much care to
arrange to meet the latency constraint.

bie...@terra.es

Mar 31, 2007, 7:54:47 PM

> biel...@terra.es wrote:
> ...

> > That is (257/43.56)=5.9 % of the time, @1Mhz translates to 59 KB/s.

On Mar 31, 9:17 pm, "Michael J. Mahon" <mjma...@aol.com> wrote:
>
> A very respectable number.
>

I'm starting to think that it's not too difficult to trap unused
cycles for the conditional branches that are taken. When the branch
opcode is fetched during SYNC cycles, the PC is revealed, and the
offset is the byte read at the forthcoming, next, 2nd cycle. Using
this data the destination for the branch is (easily) calculated, and
if the address seen in the address bus at the 3rd cycle equals the
calculated destination address, would mean that the branch is being
taken. Even page crossings can be known, therefore trapping the 4th
cycle during page crossings would also be possible. Then the numbers
would turn out to be:

The loop takes
(5+4+4+4)=17 cycles, (3 borrowable)
255 times, then
(5+3+5+4+4)=21 cycles, (3 borrowable)
the 256th time.

Every (255*17)+21=4356 cycles, there are (255*3)+3=768 borrowable
cycles.

That is (768/43.56)=17.63 % of the time, which @1MHz translates to
176.3 KB/s... !

> You chose an interesting case to examine, since it also admits a
> software-only approach to "background" data transfer.

I've been lazy and have chosen the easiest spot.. 8-)
A short loop that makes it easy to calculate the figures and also one
that is run very often.

> Since the bandwidth, whether obtained by hardware or software means,
> is only available "on the average", any use of it would demand a FIFO
> to queue the bytes (incoming or outgoing). If a FIFO (either hardware
> or software) is present, then the keyboard poll loop is a natural place
> to transfer data using programmed I/O rather than DMA--for example, to
> a printer buffer, or whatever.

Yes, you're right.
But the way I look at this is : "There is a quite respectable
percentage of bandwidth that is being wasted but needn't be so."
Whatever you can do using "programmed I/O" doesn't help to recover the
lost bandwidth.. ?

--
Jorge.

bie...@terra.es

Mar 31, 2007, 8:51:54 PM

On Apr 1, 1:54 am, "biel...@terra.es" <biel...@terra.es> wrote:
>
> I'm starting to think that it's not too difficult to trap unused
> cycles for the conditional branches that are taken. When the branch
> opcode is fetched during SYNC cycles, the PC is revealed, and the
> offset is the byte read at the forthcoming, next, 2nd cycle. Using
> this data the destination for the branch is (easily) calculated, and
> if the address seen in the address bus at the 3rd cycle equals the
> calculated destination address, would mean that the branch is being
> taken. Even page crossings can be known, therefore trapping the 4th
> cycle during page crossings would also be possible.

In fact, if the 3rd cycle after a conditional branch fetch isn't
SYNC-marked, it means that the branch is being taken... and this is an
even simpler circuit... :-)
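
That SYNC rule is easy to model in software. The sketch below is an illustration only: `taken_branch_dummy_cycles` is a hypothetical helper, the bus is represented as (SYNC, data) pairs per cycle, and it assumes that every non-SYNC cycle after the offset fetch of a branch (including the extra page-crossing cycle) is a discarded read.

```python
# Opcodes of the eight conditional branches: BPL BMI BVC BVS BCC BCS BNE BEQ.
BRANCH_OPCODES = {0x10, 0x30, 0x50, 0x70, 0x90, 0xB0, 0xD0, 0xF0}

def taken_branch_dummy_cycles(bus):
    """Return the indices of cycles flagged as borrowable after a branch fetch.

    bus is a list of (sync, data) pairs, one per cycle. After a branch opcode
    is seen on a SYNC cycle, the next cycle fetches the offset; any further
    cycle before the next SYNC is assumed to be a discarded read.
    """
    borrowable = []
    state = "idle"
    for i, (sync, data) in enumerate(bus):
        if sync:
            state = "operand" if data in BRANCH_OPCODES else "idle"
        elif state == "operand":
            state = "after_operand"   # this cycle reads the branch offset
        elif state == "after_operand":
            borrowable.append(i)      # no SYNC yet: the branch is being taken
    return borrowable

taken = [(True, 0xD0), (False, 0x02), (False, 0x00), (True, 0x2C)]
not_taken = [(True, 0xD0), (False, 0x02), (True, 0xE6)]
print(taken_branch_dummy_cycles(taken), taken_branch_dummy_cycles(not_taken))
```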

Jorge.

bie...@terra.es

Apr 1, 2007, 4:00:27 AM

On Mar 31, 8:42 pm, "biel...@terra.es" <biel...@terra.es> wrote:
>
> The code is:
>
> FD1B INC $4E (5 CYCLES, 1 BORROWABLE)
> FD1D BNE $FD21 (3 CYCLES NOT TAKEN, 4 CYCLES TAKEN, 0 BORROWABLE)
> FD1F INC $4F (5 CYCLES, 1 BORROWABLE)
> FD21 BIT $C000 (4 CYCLES, 0 BORROWABLE)
> FD24 BPL $FD1B (TAKEN, 4 CYCLES, 0 BORROWABLE)
>
> The loop takes
> (5+4+4+4)=17 cycles, (1 borrowable)
> 255 times, then
> (5+3+5+4+4)=21 cycles, (2 borrowable)
> the 256th time.
>
> Every (255*17)+21=4356 cycles, there are (255*1)+2=257 borrowable
> cycles.
>
> That is (257/43.56)=5.9 % of the time, @1Mhz translates to 59 KB/s.
>
> Jorge.

Aaargh !
These branches take 2 or 3 cycles (not taken, taken), so the figures are
even better :

(5+3+4+3)*255+(5+2+5+4+3)=3844 cycles, 255*1+2=257 borrowable

@ 1MHz --> 10*(257/38.44)= **** 66.86 KB/s ****

If lost cycles in taken branches are trapped too, then 255*3+3=768
borrowable

@ 1MHz --> 10*(768/38.44)= **** 199.79 KB/s *****
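
The whole calculation for the keyboard loop can be redone with the branch timings as parameters. This is a sketch of the arithmetic only: `loop_stats` and `kb_per_second` are made-up helper names, one borrowable cycle is assumed per INC and per trapped taken branch (as tallied in the posts), and 1 KB is taken as 1000 bytes.

```python
def loop_stats(taken, not_taken, trap_taken_branches):
    """Return (cycles, borrowable cycles) for 256 passes of the keyboard loop."""
    inc, bit = 5, 4  # INC zero page and BIT absolute cycle counts
    per_taken_branch = 1 if trap_taken_branches else 0
    # 255 passes: INC $4E, BNE (taken), BIT $C000, BPL (taken)
    common_cycles = inc + taken + bit + taken
    common_borrow = 1 + 2 * per_taken_branch
    # 256th pass: INC $4E, BNE (not taken), INC $4F, BIT $C000, BPL (taken)
    rollover_cycles = inc + not_taken + inc + bit + taken
    rollover_borrow = 2 + per_taken_branch
    return (255 * common_cycles + rollover_cycles,
            255 * common_borrow + rollover_borrow)

def kb_per_second(cycles, borrowable, clock_hz=1_000_000):
    """One byte per borrowed cycle, expressed in KB/s (1 KB = 1000 bytes)."""
    return borrowable / cycles * clock_hz / 1000

for trap in (False, True):
    cycles, borrowable = loop_stats(taken=3, not_taken=2, trap_taken_branches=trap)
    print(cycles, borrowable, f"{kb_per_second(cycles, borrowable):.2f} KB/s")
```

Calling `loop_stats(4, 3, False)` instead gives the totals obtained with the branch timings used in the first version of the calculation.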

Jorge.

bie...@terra.es

Apr 1, 2007, 5:51:30 AM

On Apr 1, 1:54 am, "biel...@terra.es" <biel...@terra.es> wrote:
>

> I'm starting to think that it's not too difficult to trap unused
> cycles for the conditional branches that are taken. When the branch
> opcode is fetched during SYNC cycles, the PC is revealed, and the
> offset is the byte read at the forthcoming, next, 2nd cycle. Using
> this data the destination for the branch is (easily) calculated, and
> if the address seen in the address bus at the 3rd cycle equals the
> calculated destination address, would mean that the branch is being
> taken. Even page crossings can be known, therefore trapping the
> 4th cycle during page crossings would also be possible.

Aaargh, forget this : "if the address seen in the address bus at the
3rd cycle equals the calculated destination address"...

Jorge.

Michael J. Mahon

Apr 1, 2007, 6:20:22 AM

Careful, gilding the lily is an engineering curse. Only the
statistical analysis on more general cases described below
could be used to justify such a complication (relative to its
benefit).

>>You chose an interesting case to examine, since it also admits a
>>software-only approach to "background" data transfer.
>
>
> I've been lazy and have chosen the easiest spot.. 8-)
> A short loop that makes it easy to calculate the figures and also one
> that is run very often.

I realize that, but it's still an interesting case, since the
computer spends so much of its time in that loop.

BTW, you could capture real traces to a file in AppleWin and
then write a short program to analyze the traces to see not only
the average bandwidth, but also its variance and distribution.

This would be important for any application that would require
any significant fraction of the average bandwidth, since the
latency and FIFO size would depend on the statistics.
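
A trace analysis along those lines could look like the sketch below. It assumes the trace has already been reduced to one boolean per cycle (True when the cycle is borrowable); how to get that out of an emulator's trace file is not covered here, and `window_counts` and `longest_gap` are hypothetical helper names. The synthetic trace stands in for real data.

```python
import statistics

def window_counts(trace, window):
    """Borrowable cycles in each consecutive, non-overlapping window."""
    return [sum(trace[i:i + window])
            for i in range(0, len(trace) - window + 1, window)]

def longest_gap(trace):
    """Longest run of consecutive cycles with nothing to borrow."""
    longest = gap = 0
    for borrowable in trace:
        gap = 0 if borrowable else gap + 1
        longest = max(longest, gap)
    return longest

# Synthetic stand-in for a real trace: 1 borrowable cycle in every 17.
trace = ([True] + [False] * 16) * 4
counts = window_counts(trace, 17)

print("mean per window:", statistics.mean(counts))
print("variance:", statistics.pvariance(counts))
print("longest gap:", longest_gap(trace))
```

The longest gap, together with the rate at which the consumer drains data, gives a first estimate of how deep the FIFO has to be.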

>>Since the bandwidth, whether obtained by hardware or software means,
>>is only available "on the average", any use of it would demand a FIFO
>>to queue the bytes (incoming or outgoing). If a FIFO (either hardware
>>or software) is present, then the keyboard poll loop is a natural place
>>to transfer data using programmed I/O rather than DMA--for example, to
>>a printer buffer, or whatever.
>
>
> Yes, you're right.
> But the way I look at this is : "There is a quite respectable
> percentage of bandwidth that is being wasted but needn't be so."
> Whatever you can do using "programmed I/O" doesn't help to recover the
> lost bandwidth.. ?

But time spent in a polling loop waiting is *all* "lost time"
that can be reclaimed with just a little software work.

And if you don't have a good use for that statistical bandwidth,
then it remains "wasted".

In fact, the huge majority of computer execution is "wasted", so
it's hard to get excited about a little more--unless you have a
"killer app" for the bandwidth...

Put another way, unless you have a concrete benefit that is
achieved, everything is still "waste".