How hard would it be to pipeline a Z80?


Russell Wallace

Jan 1, 2023, 4:42:05 PM
From the 90s onward, all high-performance CPUs have been heavily pipelined. Before that, this was not necessarily the case; early microprocessors tended not to be pipelined, presumably due to not having the transistor budget. The earliest pipelined microprocessors were typically RISC, presumably because this kind of design is easier with a simpler architecture.

I'm wondering about the even earlier, 8-bit microprocessors. Take for example the Z80. Is this simpler or more complex than an early RISC like MIPS? In some ways it is simpler (smaller transistor count); in others, it is more complex. Probably overall harder to pipeline.

Just what would it take to make a pipelined implementation of the Z80 architecture? (i.e. you can lay out the chip any way you like, but it has to run existing Z80 code unchanged except for being faster, and it has to stay within process technology available in the heyday of the Z80, say not later than mid-80s.) How many transistors would such an implementation need, compared to the ~8000 of the original Z80?

MitchAlsup

Jan 1, 2023, 4:58:13 PM
On Sunday, January 1, 2023 at 3:42:05 PM UTC-6, Russell Wallace wrote:
> From the 90s onward, all high-performance CPUs have been heavily pipelined. Before that, this was not necessarily the case; early microprocessors tended not to be pipelined, presumably due to not having the transistor budget. The earliest pipelined microprocessors were typically RISC, presumably because this kind of design is easier with a simpler architecture.
<
Pipelining is expensive (flip flops, clock loading and fanout). Pipelining goes back to at least Stretch
and possibly earlier. Even in the world of microcoded mainframes, various pieces of instruction
execution were pipelined (Fetch and Memory Access) because these paid off, while a complete
pipeline did not.
>
> I'm wondering about the even earlier, 8-bit microprocessors. Take for example the Z80. Is this simpler or more complex than an early RISC like MIPS? In some ways it is simpler (smaller transistor count); in others, it is more complex. Probably overall harder to pipeline.
<
Most of the 8-bit CPUs were entirely bus limited, thus pipelining of the CPU would not have "been
worth it". RISC came along in the time period where one had 2 choices, build a microcoded CPU
(like we had been doing for decades) and include the microcode in the chip with the execution
hardware, or build a pipelined CPU with a very tiny control sequencer.
>
> Just what would it take to make a pipelined implementation of the Z80 architecture? (i.e. you can lay out the chip any way you like, but it has to run existing Z80 code unchanged except for being faster, and it has to stay within process technology available in the heyday of the Z80, say not later than mid-80s.) How many transistors would such an implementation need, compared to the ~8000 of the original Z80?
<
In Z80 time frame Flip-Flops were 6 gates big (10-12 today) and if you pipelined the entire Z80
CPU attempting to get 0.7 instructions per cycle, the CPU would be around 2.5×-3.0× the size
of the Z80 CPU. While doable, until you have a memory port (that is, enough pins) big enough
and cache-like enough to support 1 memory reference per clock, you would gain only 10%-20%
over the Z80 as built, at the cost of 2.5×-3.0× the size, making the effort "not worth it".
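
A rough way to see the trade-off described above is to divide the estimated speedup by the estimated size increase. The sketch below uses only the ranges quoted in this post (10%-20% gain, 2.5×-3.0× size); the function name and the "performance per unit of die area" framing are illustrative assumptions, not a model of any real design.

```python
# Back-of-envelope check using only the ranges quoted in the post above.
def perf_per_area(speedup, area_ratio):
    """Relative performance per unit of die area (original Z80 = 1.0)."""
    return speedup / area_ratio

# Best case for pipelining: largest quoted gain, smallest quoted size growth.
best = perf_per_area(1.20, 2.5)
# Worst case: smallest quoted gain, largest quoted size growth.
worst = perf_per_area(1.10, 3.0)

print(f"performance per area vs. original Z80: {worst:.2f} to {best:.2f}")
```

Under these figures the pipelined part delivers less than half the performance per unit of area of the original, which is the sense in which the effort would be "not worth it" without a faster memory port.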

John Dallman

Jan 1, 2023, 5:07:12 PM
In article <84db53f8-0a5e-4e7c...@googlegroups.com>,
russell...@gmail.com (Russell Wallace) wrote:

> Just what would it take to make a pipelined implementation of the
> Z80 architecture? (i.e. you can lay out the chip any way you like,
> but it has to run existing Z80 code unchanged except for being
> faster, and it has to stay within process technology available in
> the heyday of the Z80, say not later than mid-80s.) How many
> transistors would such an implementation need, compared to the
> ~8000 of the original Z80?

It has been done: the eZ80 was introduced in 2001, with a three-stage
pipeline and about three times the performance of the Z80 at the same
clock speed. <https://en.wikipedia.org/wiki/Zilog_eZ80> It comes as part
of microcontrollers these days, with substantial amounts of RAM and
peripherals, so the transistor count isn't trivially available.

John

Russell Wallace

Jan 1, 2023, 5:19:43 PM
On Sunday, January 1, 2023 at 9:58:13 PM UTC, MitchAlsup wrote:
> Most of the 8-bit CPUs were entirely bus limited

I have seen that idea floating around a few times, but it's not the case.

Take the 6502 as a simple example. Typically 1 MHz. But even the cheapest contemporary dynamic RAM chips were 2 MHz. So the 6502 actually left half the available memory bandwidth unused. (Or, okay, in some machines, shared with the video chip, accessing on alternate cycles.)

The Z80 is a little more complex because it means a different thing by clock speed, but the typical "4 MHz" Z80 basically also ran the bus at 1 MHz (okay, with the occasional bus cycle being 3 T-states instead of 4), so again, it left half the available memory bandwidth unused. And because that caveat made the timing unpredictable, it typically did *not* alternate with the video chip, so the other half of the available memory bandwidth just went to waste.

And that's just talking about the 4k DRAM chips. 16k and 64k supported fast page mode, which could dramatically improve bandwidth in typical usage. Prior to ARM (1985), no CPU that I'm aware of used fast page mode. I saw an interview with one of the ARM designers at one point expressing puzzlement that CPUs didn't use this. They were just letting most of the available memory bandwidth go to waste.

So CPUs of that era were definitely not bus limited.
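
The arithmetic behind that claim can be sketched as follows. The rates are the ones given above (2 MHz DRAM, 1 MHz 6502, "4 MHz" Z80 with 3 or 4 T-states per bus cycle); the function and variable names are illustrative, and the "all 3-T-state" case is only an upper bound, since real code mixes both cycle lengths.

```python
def bus_utilisation(cpu_accesses_per_sec, dram_accesses_per_sec):
    """Fraction of the available DRAM access slots the CPU can actually use."""
    return cpu_accesses_per_sec / dram_accesses_per_sec

DRAM_RATE = 2_000_000  # cheapest contemporary DRAM, per the post: 2 MHz

# 6502 at 1 MHz: at most one memory access per clock cycle.
u_6502 = bus_utilisation(1_000_000, DRAM_RATE)

# "4 MHz" Z80: each bus cycle takes 3 or 4 T-states.
Z80_CLOCK = 4_000_000
u_z80_slow = bus_utilisation(Z80_CLOCK / 4, DRAM_RATE)  # every cycle 4 T-states
u_z80_fast = bus_utilisation(Z80_CLOCK / 3, DRAM_RATE)  # every cycle 3 T-states (upper bound)

print(f"6502 @ 1 MHz: {u_6502:.0%} of DRAM bandwidth")
print(f"Z80 @ 4 MHz:  {u_z80_slow:.0%} to {u_z80_fast:.0%} of DRAM bandwidth")
```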

> In Z80 time frame Flip-Flops were 6 gates big (10-12 today) and if you pipelined the entire Z80
> CPU attempting to get 0.7 instructions per cycle, the CPU would be around 2.5×-3.0× the size
> of the Z80 CPU.

Okay, that's more expensive than I would have guessed, but does account for the lack of people doing this. But that is still smaller than e.g. 8086, and would surely provide a substantial performance boost to existing software.

> While doable, until you have a memory port (that is, enough pins) big enough
> and cache-like enough to support 1 memory reference per clock, you would gain only 10%-20%
> over the Z80 as built, at the cost of 2.5×-3.0× the size, making the effort "not worth it".

See above: there was plenty of memory bandwidth to spare. By the early eighties, you could even get up to 1 memory reference per clock at 4 MHz. (The BBC Micro did this, interleaving a 2 MHz 6502 with 2 MHz 80-column video.)

Craig Ruff

Jan 2, 2023, 8:38:14 AM
For an actual Verilog example of a pipelined Z80 implementation you can
take a look at the book "Microprocessor Design Using Verilog HDL" by
Monte Dalrymple.
https://www.elektor.com/microprocessor-design-using-verilog-hdl-e-book

George Neuner

Jan 5, 2023, 12:56:00 AM
On Sun, 1 Jan 2023 14:19:41 -0800 (PST), Russell Wallace
<russell...@gmail.com> wrote:

>On Sunday, January 1, 2023 at 9:58:13 PM UTC, MitchAlsup wrote:
>> Most of the 8-bit CPUs were entirely bus limited
>
>I have seen that idea floating around a few times, but it's not the case.
>
>Take the 6502 as a simple example. Typically 1 MHz.

The 6502 was double pumped internally, so ran at 2x the (external)
clock speed. But memory access took 2..6 cycles depending on address
mode.


> But even the
>cheapest contemporary dynamic RAM chips were 2 MHz. So the 6502
>actually left half the available memory bandwidth unused. (Or, okay,
>in some machines, shared with the video chip, accessing on alternate
>cycles.)

My Apple //e came with 128KB of 400ns RAM (2.5MHz). But hires video
took 1/2 the memory bandwidth, and double-hires took 1/2 the bandwidth
from each memory bank.

I replaced the original RAM with 250ns parts as soon as I could afford
it. It made a noticeable difference - probably because refresh no
longer interfered with access.

Over time my //e morphed, first into a //gs, then was expanded to 4MB
of 100ns RAM, then finally accelerated to 10MHz.
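
The nanosecond-to-MHz conversion used above (400ns = 2.5MHz) is just a reciprocal. The sketch below applies it to the three RAM speeds mentioned in this post. It assumes, as the post does, that the rated figure can be treated as the full cycle time; on real DRAM the cycle time is generally longer than the rated access time, so these are upper bounds. The function name is illustrative.

```python
def max_access_rate_mhz(cycle_time_ns):
    """Upper bound on accesses per microsecond (MHz) for a given cycle time in ns."""
    return 1000.0 / cycle_time_ns

# RAM speed grades mentioned in the post.
for ns in (400, 250, 100):
    print(f"{ns} ns RAM -> at most {max_access_rate_mhz(ns):.1f} MHz")
```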

George

Anton Ertl

Jan 5, 2023, 3:56:12 AM
George Neuner <gneu...@comcast.net> writes:
>The 6502 was double pumped internally, so ran at 2x the (external)
>clock speed.

Please elaborate. What in particular is double-pumped?

>But memory access took 2..6 cycles depending on address
>mode.

No. Each memory access takes 1 cycle. Some cycles are not used for
memory accesses, but most are.

E.g., "jmp (addr)" is three bytes long, reads two bytes from addr, for
a total of 5 memory accesses. It takes an overall 5 cycles.

As an example in the other extreme, "inc addr,x" is three bytes long,
reads one byte from the effective address, writes one byte to the
effective address (for a total of five memory accesses), but takes 7
cycles.
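
The two examples can be tallied as below. The byte, read, write and cycle counts are exactly the ones given above; the class and field names are illustrative assumptions, and "other cycles" simply means cycles beyond the accesses counted here.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    fetch_bytes: int   # instruction bytes fetched
    data_reads: int    # operand bytes read
    data_writes: int   # operand bytes written
    cycles: int        # documented cycle count

    @property
    def accesses(self):
        """Memory accesses counted as in the post: fetches + reads + writes."""
        return self.fetch_bytes + self.data_reads + self.data_writes

    @property
    def other_cycles(self):
        """Cycles not accounted for by the accesses counted above."""
        return self.cycles - self.accesses

jmp_indirect = Instr("jmp (addr)", fetch_bytes=3, data_reads=2, data_writes=0, cycles=5)
inc_abs_x = Instr("inc addr,x", fetch_bytes=3, data_reads=1, data_writes=1, cycles=7)

for i in (jmp_indirect, inc_abs_x):
    print(f"{i.name}: {i.accesses} accesses in {i.cycles} cycles, "
          f"{i.other_cycles} other cycles ({i.accesses / i.cycles:.0%} used)")
```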

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

George Neuner

Jan 6, 2023, 11:34:20 PM
On Thu, 05 Jan 2023 08:40:48 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>George Neuner <gneu...@comcast.net> writes:
>>The 6502 was double pumped internally, so ran at 2x the (external)
>>clock speed.
>
>Please elaborate. What in particular is double-pumped?

The 6502 internally was clocked on both edges of the external clock
pulse, effectively doubling the external clock rate.

Which also means the instructions really took more cycles than
reported, since timing was given WRT the external clock.


>>But memory access took 2..6 cycles depending on address
>>mode.
>
>No. Each memory access takes 1 cycle. Some cycles are not used for
>memory accesses, but most are.

Yes. The [badly expressed] point I was attempting to make was - to
Russell's point - that the 6502 left a whole LOT of memory cycles
unused [not just 1/2]. But given its architecture, it could not have
used those cycles anyway.

George

robf...@gmail.com

Jan 7, 2023, 2:17:03 AM

IIRC the Z80 could be “improved” using a wider ALU, resulting in
fewer clocks per instruction even without additional pipelining.
The 68000 is similar, using a ½ width ALU.
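
To illustrate why ALU width affects clocks per instruction: with a 4-bit ALU (as the Z80 has, per the Ken Shirriff article linked elsewhere in this thread), an 8-bit add needs two passes, low nibble then high nibble, with the carry passed between them. The sketch below is a functional model only; the function names are illustrative and it does not represent the actual circuit or its timing.

```python
def add4(a, b, carry_in):
    """One pass through a 4-bit adder: returns (4-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xF, total >> 4

def add8_via_4bit_alu(a, b):
    """8-bit add done as two 4-bit passes; returns (result, carry, half_carry)."""
    lo, half_carry = add4(a & 0xF, b & 0xF, 0)      # pass 1: low nibbles
    hi, carry = add4(a >> 4, b >> 4, half_carry)    # pass 2: high nibbles + carry
    return (hi << 4) | lo, carry, half_carry

result, carry, half_carry = add8_via_4bit_alu(0x3A, 0xC7)
print(f"0x3A + 0xC7 = 0x{result:02X}, carry={carry}, half_carry={half_carry}")
```

An 8-bit-wide ALU would do the same add in a single pass, which is where the saving in clocks would come from.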

EricP

Jan 7, 2023, 11:22:43 AM
The 6502 and Z80 could fetch the next instruction while finishing
the current one. But there was no separate fetch unit - these only
had one state machine for the control sequencer and doing such a
prefetch was hard coded into the execution of the current instruction
state sequence. (This is described in detail in the 6502 manual.)

They only had 2 layers of interconnect, and only one 16-bit bus
which had isolation barriers at critical points allowing it to be
split to perform multiple concurrent bus operations.

Ken Shirriff's blog of reverse engineered old microprocessors
covers a lot of this, including the Z80 4-bit ALU.

Reverse engineering ARM1 instruction sequencing,
compared with the Z-80 and 6502
http://www.righto.com/2016/02/reverse-engineering-arm1-instruction.html

Down to the silicon: how the Z80's registers are implemented
http://www.righto.com/2014/10/how-z80s-registers-are-implemented-down.html

Why the Z-80's data pins are scrambled
http://www.righto.com/2014/09/why-z-80s-data-pins-are-scrambled.html

The Z-80's 16-bit increment/decrement circuit reverse engineered
http://www.righto.com/2013/11/the-z-80s-16-bit-incrementdecrement.html

The Z-80 has a 4-bit ALU. Here's how it works
http://www.righto.com/2013/09/the-z-80-has-4-bit-alu-heres-how-it.html


