Building a z80 multiplier module for rc2014

Phillip Stevens

Jan 22, 2020, 8:49:06 PM
to RC2014-Z80
I'm thinking about a way to build a fast (essentially single cycle) hardware multiply module for the rc2014.
There's no real reason to do this, except that it would be a challenge.

There are three ways I can see to do it, so here's the investigation so far.

1. use a 74S508 8x8 co-processor device.

This is the most flexible and powerful option, as it provides both multiply and divide for signed and unsigned numbers.
And it is the simplest as it includes device registers, buffering and interfacing. It simply needs to be addressed via a number of I/O addresses.
To build an rc2014 module would need just a 74138 for address selection.

The multiplication time is about 500ns, or 5 clocks, plus the time to load and unload the registers.
This means that the multiply result is available in just over 1 (short) z80 machine cycle.

But I can't find a source where these can be purchased. They seem to be made of unobtainium.
So probably impossible to use as a basis.

2. use a 74S558 8x8 Cray multiplier.

This Cray multiplier device is the next easiest option, with very high speed results.
To build an rc2014 module would require 2x 8-bit input buffers, and 2x (8-bit) of output buffers, with addressing using a 74138.

A multiply result (combinational logic) is available in just 60ns, but the operands and results would need to be loaded and unloaded.
Operationally, the module would require just two I/O addresses, as R/W could use the same locations.
Repeated multiplies are easy, as just one of the operands could be reloaded, and the new result could be read out immediately.

But (again) I can't find a source where these can be purchased.
So (again) probably impossible to use as a basis.

3. use 74284/285 4x4 multipliers with a Wallace tree from 74181/182 adders and 74183 add-carry devices.

This solution replicates the full Cray multiplication process above, using 2x 4x4 multipliers, together with a Wallace tree, consisting of 8x 74183, 3x 74181, and 1x 74182, to add the partial results. To build an rc2014 module would also require 2x input buffers, and 2x (16 bits) of output buffers, with addressing using a 74138.

All in all, this would never fit onto a standard height rc2014 module, but 18x DIP16 devices might be squeezed onto a double height module.
I'm not sure of my ability to design this kind of thing though. But that would be part of the challenge.

And, again, sourcing devices seems to be quite hard. But, perhaps it will be less difficult than the other options.


Are there other thoughts or suggestions to do this using retrotech?
Obviously, using an AVR or z180 would be cheating.

Any sourcing suggestions?
I've searched UTsource, and all the usual options. But, nothing.

Cheers, Phillip

Mark T

Jan 22, 2020, 10:16:37 PM
to RC2014-Z80
Maybe a shift and add multiplier could work. If it could run in 8 clocks then 8x8 could be completed between an output and input instruction without delays. The trick would probably be to shift and add on the same clock edge, by shifting the position of the bits between the result register and the adder.

74x165 for multiplier, 74x273 for multiplicand, 2 x 74x273 for 16 bit result register, 4 of 74x283 for sixteen bit adder.

Maybe copy the state machine for clocking from the shift register spi micro sd card module. 74x163, 74x138 and a few gates.
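
The shift-and-add idea can be modelled in a few lines of Python. This is only a sketch of the arithmetic: it assumes the multiplier is shifted out MSB first and that the shift and the add happen in the same step, and the name `shift_add_multiply` is made up for illustration. The real clocking and wiring may well differ.

```python
def shift_add_multiply(multiplicand, multiplier):
    """Model an unsigned 8x8 shift-and-add multiply, one step per clock."""
    assert 0 <= multiplicand <= 0xFF and 0 <= multiplier <= 0xFF
    result = 0  # stands in for the 16-bit result register
    for _ in range(8):  # eight clocks, one per multiplier bit
        result = (result << 1) & 0xFFFF  # shift the partial result left
        if multiplier & 0x80:  # bit leaving the multiplier shift register
            result = (result + multiplicand) & 0xFFFF  # 16-bit add
        multiplier = (multiplier << 1) & 0xFF  # present the next bit
    return result
```

Eight steps give the full 16-bit product, which is why eight clocks between the OUT and the IN would be enough.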

Mark

Mark T

Jan 22, 2020, 10:20:48 PM
to RC2014-Z80
Forgot to include buffers to read the result register. 2 off 74x244 or 74x245 or similar. Might just push it past what could fit on a standard module.

Mark

Dave White

Jan 23, 2020, 12:39:55 AM
to RC2014-Z80
Would a 49C402 bit slice unit be cheating? It has the "retro" covered.

Phillip Stevens

Jan 23, 2020, 1:00:18 AM
to RC2014-Z80
Dave White wrote:
Would a 49C402 bit slice unit be cheating? It has the "retro" covered.

Good suggestion.

Mark's idea of the shift-add would be fast, and possible in a double height board, I guess.

Currently also looking at a Russian clone of the SN74S508, the KR1802VR2.
But there's not much info on it (at least in English). I wonder how close it is?

There seems to be another reference worth following for the MPY-8 from TRW, which might also be similar to the SN74S508.
The advertising prose is effusive...

"Twenty to 30 percent reductions in maximum power dissipation have been announced for the fast TRW 8, 12, and 16-bit bipolar multipliers. All forced air cooling has been eliminated! Output delay time have been improved for all of the multipliers. The 8-bit, 130 nanosecond MPY-8 has reduced average power from 1.75 Watts to an average of 1.2 Watts." TRW LSI Products Calif. USA.

Also, I've not forgotten that an Am9511A would work too, but I was aiming for something a little simpler, and not needing a 12V power supply.


Eric Matecki

Jan 23, 2020, 1:15:36 AM
to RC2014-Z80
Two 64K (UV-,EE-,Flash-) ROMS ?
Operands on address bus, LSB of result on data bus of one of them, MSB on the other.

You choose your retro-level by choosing the ROM size/type, could be done with 128x 2708 :)
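
As a rough sketch of what would go into the two ROMs, the tables could be generated on a PC like this. The address layout (one operand in the upper address byte, the other in the lower) and the function name `build_multiply_roms` are assumptions for illustration only.

```python
def build_multiply_roms():
    """Return (lsb_rom, msb_rom): two 64 kB images of the 8x8 product table."""
    lsb = bytearray(0x10000)
    msb = bytearray(0x10000)
    for a in range(256):
        for b in range(256):
            product = a * b
            address = (a << 8) | b  # assumed layout: a on A15..A8, b on A7..A0
            lsb[address] = product & 0xFF  # low byte goes in the first ROM
            msb[address] = product >> 8    # high byte goes in the second ROM
    return bytes(lsb), bytes(msb)


lsb_rom, msb_rom = build_multiply_roms()
```

Each image could then be written out with `open(name, "wb")` and handed to whatever programmer is used for the chosen ROM type.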

Marten Feldtmann

Jan 23, 2020, 2:29:52 AM
to RC2014-Z80
Ha,

what about adding a MC68882 (PLCC-68) :-)))))))))))))))) for around $4 from utsource ...

Marten

Phillip Stevens

Jan 23, 2020, 4:12:26 AM
to RC2014-Z80
Eric Matecki wrote:
Two 64K (UV-,EE-,Flash-) ROMS ?

Yes, you're right. ;-)

As much as I'm in love with the idea of using an old SN74S508 device, a solution using ROMs would really be the most practical solution.
Something made with ROMs would be reproducible, and not too hard to fit into a standard module layout.

The other practical thing would be that the "software" could be completely configurable.
It would be a useful module to use for LUTs for sine waves or other functions for sound generation. So, it wouldn't just be a single application board.
 
Operands on address bus, LSB of result on data bus of one of them, MSB on the other.

I can't see how it would be done like that?
Either you're going to clash with RAM, or you're going to clash with I/O on some of the address ranges.

I think that it will need to be off the address bus, and using an address selector 74x138, two input flip-flops 74x374, and two 64kB 8-bit ROMs W27C512.
These devices would be pretty familiar to rc2014 friends, so that's also a useful design feature.

You choose your retro-level by choosing the ROM size/type, could be done with 128x 2708 :)

I think the easy way is the best way at the start. ;-)

Steve Cousins

Jan 23, 2020, 9:24:18 AM
to RC2014-Z80
Some great ideas guys.

I like Eric's ROM suggestion and Phillip's observation that such a design could have many uses.

How about making the chips RAM instead of ROM. That way you don't need to change ROMs to alter the application. Multiplication tables could be calculated at the start of the app and written to the RAM. Similarly, any other table-based data could be written at the start of the app.

You could use one 128k x 8 RAM chip instead of two ROMs. The design would then come down to:
Decoder to give a chip select when a specific two I/O address range is being accessed (74138 or similar)
RAM chip (128k x 8)
Two address latches for A0-A7 and A8-A15 (74273 or 74374 or similar)
CPU's A0 to connect to RAM's A16 to select hi/lo byte of table data
RAM data lines connected to CPU data bus

You could perhaps increase performance by outputting the high order address latch data on A8-A15 using OUT (C),A where C is the module's I/O address, B = table address A8-A15, and A = table address A0-A7.

Other application ideas:
RAM disk
LOG tables
SIN/COS/TAN trigonometry tables

Steve

Dave White

Jan 23, 2020, 9:48:26 AM
to RC2014-Z80
Yeah, I was only half joking about the 49c402 (only half - it's given me an idea). I just so happen to have a drawer with 20 of these on my desk, and with them going for between $60 and $160 each on FindChips, it would be nice to find a use for at least one.

Mark T

Jan 23, 2020, 9:53:00 AM
to RC2014-Z80
One way to simplify the interface for an 8x8 bit multiply would be one operand from A15..A8, and the second from D7..D0, then you can multiply any register by B, using
OUT (C),r
IN H,highbyte
IN L,lowbyte

Only removes one out instruction, but every cycle counts.
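
To make the bus trick concrete: during the I/O write cycle of OUT (C),r the Z80 puts B on A15..A8, C on A7..A0 and r on the data bus, so one instruction can deliver both operands. The Python model below is a hypothetical sketch of a module that uses this. The class `BusMultiplier`, the port number 0x40, and the choice of A0 to select the high or low result byte are all assumptions.

```python
def out_c_r(b, c, r):
    """Bus state during the I/O write cycle of OUT (C),r."""
    return {"address": ((b & 0xFF) << 8) | (c & 0xFF), "data": r & 0xFF}


class BusMultiplier:
    """Hypothetical module that latches A15..A8 and D7..D0 on an I/O write."""

    def __init__(self, port=0x40):
        self.port = port
        self.product = 0

    def io_write(self, bus):
        if (bus["address"] & 0xFE) == self.port:  # decode the low byte only
            operand_a = bus["address"] >> 8        # came from register B
            operand_b = bus["data"]                # came from register r
            self.product = operand_a * operand_b

    def io_read(self, port):
        # Assumption: even port returns the low byte, odd port the high byte
        return (self.product >> 8) if (port & 1) else (self.product & 0xFF)


module = BusMultiplier()
module.io_write(out_c_r(b=12, c=0x40, r=34))  # OUT (C),r with B=12, r=34
high, low = module.io_read(0x41), module.io_read(0x40)
```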

Only a personal preference but 128K rom or ram seems to be cheating if the idea is to have a retro solution. I think 4x4 using 256bytes would be more reasonable, but then just easier to allocate 256 bytes of memory.
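
For comparison, here is a sketch of how an 8x8 multiply can be put together from a 256-byte 4x4 table. The names `NIBBLE_TABLE` and `multiply_8x8` are illustrative, and this models the arithmetic only, not any particular Z80 routine.

```python
# 256-byte table: entry (a << 4) | b holds a * b for 4-bit a and b (max 225)
NIBBLE_TABLE = bytes(a * b for a in range(16) for b in range(16))


def nibble_product(a, b):
    """Look up the product of two 4-bit values."""
    return NIBBLE_TABLE[(a << 4) | b]


def multiply_8x8(x, y):
    """Combine four 4x4 lookups into a 16-bit product."""
    x_hi, x_lo = x >> 4, x & 0x0F
    y_hi, y_lo = y >> 4, y & 0x0F
    high = nibble_product(x_hi, y_hi) << 8
    middle = (nibble_product(x_hi, y_lo) + nibble_product(x_lo, y_hi)) << 4
    low = nibble_product(x_lo, y_lo)
    return high + middle + low
```

Four lookups plus the shifts and adds per multiply is the price of the small table.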

Mark

Alan Cox

Jan 23, 2020, 10:33:48 AM
to rc201...@googlegroups.com
> How about making the chips RAM instead of ROM. That way you don't need to change ROMs to alter the application. Multiplication tables could be calculated at the start of the app and written to the RAM. Similarly, any other table-based data could be written at the start of the app.

How about just not building any hardware at all and putting the table
in a separate memory bank 8)

Another crazy option would be to recreate the Atari 'math box' used
with the early vector console games which was built out of four Am2901s

Alan

Dave White

Jan 23, 2020, 11:39:51 AM
to RC2014-Z80
Intriguing - according to the data sheet, the 49C402 is "Functionally equivalent to four 2901s and one 2902". So maybe this isn't such a crazy idea after all.

Randy Mongenel

Jan 23, 2020, 12:09:46 PM
to rc201...@googlegroups.com
I wonder why I haven't seen anyone mention the AM95xx chips yet. I see 9511's are available on eBay for around $5, and there is plenty of am95xx-to-z80 interface documentation. From what I see, it can run asynchronously, so clock speed isn't a huge problem to deal with. Then you get 32-bit add/sub/mul/div with decent speed.


Phillip Stevens

Jan 23, 2020, 3:15:52 PM
to RC2014-Z80
Steve Cousins wrote:
Some great ideas guys.

+1  
 
How about making the chips RAM instead of ROM.

I prefer the idea of it being "firmware" rather than needing to be reloaded each time.
But that's not to say that RAM isn't a good idea. 
 
You could use one 128k x 8 RAM chip instead of two ROMs.

I looked at 1Mbit (128kB x 8-bit) ROM and it is extortionately expensive.
And the RC2014 platform already uses the W27C512 ROM, so I have a stack in a tube.

You could perhaps increase performance by outputting the high order address latch data on A8-A15 using OUT (C),A where C is the module's I/O address, B = table address A8-A15, and A = table address A0-A7.

Great idea (Mark & Steve).
The code to replicate a Spectrum Next MUL DE would then be.

mul_de:
  ld b,d        ; 4  operand MSB in B
  ld c,0x42     ; 7  operand latch address
  out (c),e     ; 12 operand LSB from E
  dec c         ; 4  result MSB address
  in d,(c)      ; 12 result MSB to D
  dec c         ; 4  result LSB address
  in e,(c)      ; 12 result LSB to E
  ret           ; 10

A total of 65 cycles, excluding the call. Pretty sharp.
My current best case is stolen from CPC Wiki at around 154 cycles, so this is 2.3x faster.
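
The arithmetic behind those figures, as a quick check in Python. The T-state values are the ones annotated on each instruction of the routine, and 154 is the software figure quoted above. The variable names are only for illustration.

```python
# (instruction, T-states) for each line of the mul_de routine
MUL_DE_T_STATES = [
    ("ld b,d", 4),
    ("ld c,0x42", 7),
    ("out (c),e", 12),
    ("dec c", 4),
    ("in d,(c)", 12),
    ("dec c", 4),
    ("in e,(c)", 12),
    ("ret", 10),
]

total_cycles = sum(t_states for _, t_states in MUL_DE_T_STATES)
software_cycles = 154  # best software case quoted in the post
speedup = software_cycles / total_cycles
print(total_cycles, round(speedup, 2))
```

The total comes to 65 cycles, and 154 / 65 is a little under 2.4.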
 
Other application ideas:
RAM disk
LOG tables
SIN/COS/TAN trigonometry tables

Here's a pic of where I'm up to currently with the "LUT (Multiply) Module" made for RC2014.

RC2014_LUT.png

Phillip
 

Phillip Stevens

Jan 24, 2020, 12:46:42 AM
to RC2014-Z80
Phillip Stevens wrote:
Steve Cousins wrote:
Some great ideas guys.
How about making the chips RAM instead of ROM.

I prefer the idea of it being "firmware" rather than needing to be reloaded each time.
But that's not to say that RAM isn't a good idea. 
 
You could use one 128k x 8 RAM chip instead of two ROMs.

I looked at 1Mbit (128kB x 8-bit) ROM and it is extortionately expensive (and only PROM single use).
And the RC2014 platform already uses the W27C512 ROM, so I have a stack in a tube.

You could perhaps increase performance by outputting the high order address latch data on A8-A15 using OUT (C),A where C is the module's I/O address, B = table address A8-A15, and A = table address A0-A7.

Great idea (Mark & Steve).
The code to replicate a Spectrum Next MUL DE would then be.

mul_de:
  ld b,d        ; 4  operand MSB in B
  ld c,0x42     ; 7  operand latch address
  out (c),e     ; 12 operand LSB from E
  dec c         ; 4  result MSB address
  in d,(c)      ; 12 result MSB to D
  dec c         ; 4  result LSB address
  in e,(c)      ; 12 result LSB to E
  ret           ; 10

A total of 65 cycles, excluding the call. Pretty sharp.
My current best case is stolen from CPC Wiki at around 154 cycles, so this is 2.3x faster.
 
Other application ideas: 
RAM disk
LOG tables
SIN/COS/TAN trigonometry tables

OK. Done. Time to step back and consider a little.

RC2014_LUT_SCH.png


Only 4 vias. Pretty happy with that.


RC2014_LUT_BRD.png



karlab

Jan 24, 2020, 8:05:59 AM
to RC2014-Z80
The layout is a piece of art!

Phillip Stevens

Jan 24, 2020, 8:38:41 AM
to RC2014-Z80
karlab wrote:
The layout is a piece of art!

Thanks. 

Found an issue. Somehow some pins got swapped.

I ordered this off OSH Park tonight (given it's CNY and nothing is going on in Asia for two weeks except fireworks).

Let's see whether purple boards work...  I've never used them before.

RC2014_LUT_Module_Schematic.png RC2014_LUT_Module_Board.png

More when there's a real board.

Phillip
RC2014_LUT_BRD.pdf
RC2014_LUT_SCH.pdf

Phillip Stevens

Jan 25, 2020, 7:33:10 AM
to RC2014-Z80
Phillip Stevens wrote:
I ordered this off OSH Park tonight (given it's CNY and nothing is going on in Asia for two weeks except fireworks).
 
Forgot to provide the sharing link at OSH Park >>> LUT (Multiply) Module @ OSH Park.



Phillip

Mark T

Jan 27, 2020, 6:42:23 AM
to RC2014-Z80
I think I saw that jlcpcb were not closing for spring festival this year.

Did you consider the SST39SF040 as used on the 512K ROM/RAM board? Could also select a couple of additional functions and is electrically alterable.


Mark
 


 

 

Phillip Stevens

Jan 27, 2020, 6:01:38 PM
to RC2014-Z80
Mark T wrote:
I think I saw that jlcpcb were not closing for spring festival this year.

I didn't know this at the time, but Coronavirus has shut down most of China. Shanghai and Beijing are completely quarantined. All internal transport to these cities is shutdown. Wuhan is blockaded. USA, Canada, UK are sending charter flights to get their citizens home.

Hong Kong (and JLCPCB) has been pretty lax, but Hong Kong just decided to ban people originating from, or travelling through, Wuhan from today.

It is all a bit late really, as 50 million Wuhan people have left since this started in December (reported by Premier of Wuhan).
Who knows where they went?

Did you consider the SST39SF040 as used on the 512K ROM/RAM board? Could also select a couple of additional functions and is electrically alterable.

Would be a good option to use an SST39SF010 or SST39SF020.
The pin assignment is very similar to the W27C512, so it would be relatively easy to lay out the board, now that it has already been done once.

Perhaps another iteration, when I've proven this version works (i.e. effectively improves the floating point and integer performance by a worthwhile amount).

P.

Phillip Stevens

Feb 1, 2020, 7:47:54 AM
to RC2014-Z80
Phillip Stevens wrote:
I ordered this off OSH Park tonight (given it's CNY and nothing is going on in Asia for two weeks except fireworks).

Let's see whether purple boards work...  I've never used them before.

RC2014_LUT_Module_Schematic.png RC2014_LUT_Module_Board.png

More when there's a real board.

Actually had a small win tonight. Small steps forward.

I've worked out how the Microsoft Basic floating point mantissa multiplier works (and it is a beautiful piece of code), and have cut it out and patched in my own ugly version.
The small win is actually getting correct results.

What that means is when I get the LUT board working, I'll be able to produce a kind of Frankenstein MS Basic that should (with the added LUT hardware) be really very fast at math.
Microsoft Basic uses floating point for all its math, so improving floating point will speed it up at everything.

And, it still fits in 8kB with a HEX loader too.

Phillip

Phillip Stevens

Feb 1, 2020, 9:31:41 PM
to RC2014-Z80
Phillip Stevens wrote:
I've worked out how the Microsoft Basic floating point mantissa multiplier works (and it is a beautiful piece of code), and have cut it out and patched in my own ugly version.
The small win is actually getting correct results.

For the purposes of bench-marking against the really very excellent Microsoft Basic code, I used the RC2014 Repository Mandelbrot program.

MS Basic Reference - RC2014 Distribution: 11' 44"

The Microsoft Basic floating point implementation has been extracted and made available in the classic library of the z88dk.
It is implemented with a 24x8x3 add-shift multiplier, and it has 0 detection at byte level. Very efficient.
Microsoft even used push-ret pairs in their implementation, rather than call-ret, to save a few extra cycles.

Some of the benchmarks put its performance as the best software implementation, alongside the BBC micro Basic.

Now I've cut in a version of the 32h_24x24 mantissa multiplier used in Z88DK math32 library, together with a mulu_de 16_8x8 multiplier.
This is really just to test that I understand how the Microsoft basic mantissa multiplier works.

MS Basic - math32 with software mulu_de: 18' 31" or 157% of the reference

That's about what I expect. Using the 16_8x8 elemental multiplication model is much less efficient than a single large multiply, if you don't have hardware suited (like z180 or z80n).

What that means is when I get the LUT board working, I'll be able to produce a kind of Frankenstein MS Basic that should (with the added LUT hardware) be really very fast at math.
Microsoft Basic uses floating point for all its math, so improving floating point will speed it up at everything.

I'm hoping the LUT multiplier result will be about 3x the math32 software mantissa version, or about 2x the reference MS Basic speed.
So, I'm betting on about 6' 20". Excited to see what happens.

Phillip 

Mark T

Feb 5, 2020, 6:00:47 AM
to RC2014-Z80
I’m guessing you are using 8x8 bit unsigned multiply and combining the results as in long multiplication method, but is there any advantage in implementing signed multiply?

Any plans to try and support division? What methods could be used for this?

Mark

Phillip Stevens

Feb 5, 2020, 6:53:25 AM
to RC2014-Z80
Mark T wrote:
I’m guessing you are using 8x8 bit unsigned multiply and combining the results as in long multiplication method, but is there any advantage in implementing signed multiply?

Yes, you're right. I'm using a 16_8x8 unsigned multiply. I "chose" this as the atomic instruction for what I've been doing because it is what Zilog already chose for us in the Z180, and the team building the Spectrum Next implemented the same atomic mul de instruction for their Z80N, though not using the same opcode(s). So that I could keep the same algorithms and largely the same code, I chose to implement a software copy of the Spectrum Next instruction, using the de registers.

For floating point there is no advantage in having a signed multiply. The actual multiply of the mantissa is never signed. Exponents are added, and they have a bias so they are never negative, and the result sign is xor'd from input signs. I've got some more description here in the math32 library, if you're interested to read further.

The mantissa for the IEEE-754 is 24-bits, and the resulting long multiplication would produce 48-bits. Since only the 24 most significant bits are used, it is not necessary to calculate all the terms of long multiplication. To save time I don't do the lowest term.
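
A small Python sketch of that shortcut, assuming the usual long-multiplication layout with each 24-bit mantissa split into three bytes. The function name `mantissa_high24` is illustrative and this is not the math32 code itself. The skipped term (low byte times low byte) is at most 255 x 255, so it only touches bits 0..15 of the 48-bit product directly. Through carries it can change the top 24 bits by at most one unit in the last place.

```python
def mantissa_high24(a, b, skip_lowest=True):
    """Top 24 bits of a 24x24 product built from 8x8 partial products."""
    a_bytes = [(a >> (8 * i)) & 0xFF for i in range(3)]  # a0, a1, a2
    b_bytes = [(b >> (8 * j)) & 0xFF for j in range(3)]  # b0, b1, b2
    total = 0
    for i, a_byte in enumerate(a_bytes):
        for j, b_byte in enumerate(b_bytes):
            if skip_lowest and i + j == 0:
                continue  # a0 * b0 is the lowest term: 8 multiplies, not 9
            total += (a_byte * b_byte) << (8 * (i + j))
    return total >> 24


exact = mantissa_high24(0xC90FDB, 0xADF854, skip_lowest=False)
approx = mantissa_high24(0xC90FDB, 0xADF854)
```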

Similarly, I do a 32_32x32 multiply as part of a "fused multiply-add" used when calculating polynomial series for the higher functions (to preserve accuracy by providing 8 carry bits to the calculation), and I don't bother calculating the lower 16-bits of the 64-bit result. This saves 3 small multiplies, and a whole bunch of stack juggling.
 
Any plans to try and support division? What methods could be used for this?


There are many methods of doing division, but the one I could get my head around was Newton-Raphson approximation. Yes, the Newton under the apple tree. This stuff has been around for a while.
Newton-Raphson algorithm calculates the inverse of the divisor in this case, so when that is provided by the algorithm it is then multiplied so you get the division.

X / Y = X * (1/Y)

The Newton-Raphson method converges pretty quickly. Three iterations take you to greater accuracy than IEEE-754 can represent.
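
As a sketch of why three iterations are enough, here is the reciprocal iteration x' = x * (2 - y * x) in Python. It assumes the divisor has been scaled to 0.5 <= y < 1 and uses the common textbook linear seed 48/17 - (32/17) * y. The seed actually used in math32 may differ, and the function names are made up for illustration.

```python
def reciprocal(y, iterations=3):
    """Approximate 1/y for 0.5 <= y < 1 by Newton-Raphson."""
    x = 48.0 / 17.0 - (32.0 / 17.0) * y  # assumed seed, error at most 1/17
    for _ in range(iterations):
        x = x * (2.0 - y * x)  # each step squares the relative error
    return x


def divide(numerator, divisor):
    """X / Y = X * (1/Y), for a divisor already scaled to [0.5, 1)."""
    return numerator * reciprocal(divisor)
```

With a starting error of 1/17, three squarings give about 1.4e-10, well below the 2^-24 (about 6e-8) resolution of a 24-bit mantissa.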

The entire math32 IEEE-754 library is essentially built on just two functions multiply and add. Everything else is calculated based on variations of these two functions.
Since the Z80 has add done well, it is just the multiply that was missing, and if it is improved then everything else benefits proportionally.
Zilog recognised this when they added the mlt gg functions to the Z180.

Depending on the benchmark, having the small hardware 16_8x8 multiply produces between 2x and 5x speedup. Not a bad outcome I think.

I think the square root calculation (also using Newton-Raphson) was also pretty interesting to write. My "best" function.
The "Quake" seed method was invented by someone passing by from another planet, I think.
Respect.
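
For anyone who hasn't seen it, this is roughly how the "Quake" seed works with Newton-Raphson for a square root, sketched in Python. The magic constant 0x5F3759DF is the widely published one. Whether math32 uses that exact constant, or this number of iterations, is not stated here, so treat both as assumptions. The function names are illustrative.

```python
import struct


def quake_seed(x):
    """First guess at 1/sqrt(x) from the bit pattern of a positive float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = 0x5F3759DF - (bits >> 1)  # widely published constant (assumption)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sqrt_newton(x, iterations=3):
    """sqrt(x) = x * (1/sqrt(x)), refined by Newton-Raphson."""
    y = quake_seed(x)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)  # refine the estimate of 1/sqrt(x)
    return x * y
```

Only multiplies and subtractions are needed, which is why it suits a library built on a fast multiply.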

Phillip

Mark T

Feb 9, 2020, 12:05:18 PM
to RC2014-Z80
Hi Phillip,
I was looking at your address decoding and was thinking if you qualified the /WR when writing the operands, use 0x40 to write the operands and read LSB result from 0x40, then you could remove one DEC C instruction. Only 4 cycles but this is likely used in a low level where every cycle helps.

Also the output controls of the ‘374s could be tied to ground, would avoid slight loading of the /RD control line.

I was trying to layout a shift and add multiplier as interesting challenge to see if it would fit on a standard module. Thought I’d try and make it compatible with your code, which just means not qualifying A1, then 0x40 to 0x43 to write operands, 0x40 to read LSB, 0x41 to read MSB. (Also finding layout a bit tight so removing A1 makes it a bit easier). Down to four airwires, I’ll post in a separate thread when finished.

Mark

Phillip Stevens

Feb 9, 2020, 6:34:21 PM
to RC2014-Z80
 Mark T wrote:
I was looking at your address decoding and was thinking if you qualified the /WR when writing the operands, use 0x40 to write the operands and read LSB result from 0x40, then you could remove one DEC C instruction. Only 4 cycles but this is likely used in a low level where every cycle helps.

I've been thinking about Steve's comment about making it writeable, and I'm now torn as to what the next version should do.


Also the output controls of the ‘374s could be tied to ground, would avoid slight loading of the /RD control line.

Yes, just a bit of OCD which needs me to tie off everything. But routing the /RD lines was one of the things I should probably not do next version.
 
I was trying to layout a shift and add multiplier as interesting challenge to see if it would fit on a standard module. Thought I’d try and make it compatible with your code, which just means not qualifying A1, then 0x40 to 0x43 to write operands, 0x40 to read LSB, 0x41 to read MSB. (Also finding layout a bit tight so removing A1 makes it a bit easier).

Following up on the next design, I'm now pretty sure I'll build it for the sst39sf040, but not sure yet whether DIP or PLCC config.
PLCC is more robust for multiple insertions (if it needs to be written externally, more below), but DIP is more retro.
To be decided.

To make it writeable there are 9 addresses needed: one for the address latch (16 bits), and 8 to write (and read) each of the bytes, as the EEPROM will be 8x64kB.
If it is not write enabled, then the write address can overlap (and be any of) the read addresses.
I can't see how to generate 9 distinct addresses without adding another '138 to my circuit. But, perhaps there's a way?

The other question is whether the memory should be 8 pages of 64kB, or whether it should be 64kB of 8 Byte values?
The difference is essentially whether the low address lines A0, A1, A2 are used for the low eeprom addresses or are used for the high address lines A16, A17, A18.
On the side of 8 contiguous bytes, it becomes easier to prepare the tables, as each entry is simply 8 Bytes long.
On the side of 8 pages of 64kB, other smaller EEPROMS can be used when they're available, just only some of the pages will be useable. I have a stack of sst39sf020 for example that would just provide 4 pages.

Anyway, long discussion but I'm leaning towards this as lut api standardisation...

1. Making the solution read only (also respecting Mark's shift-add solution).
2. The address register write latch is write to any of 0x40 through 0x47 (so any page can be read back easily).
3. Configuring to use (up to) 8 pages from the sst39sf040, which would be read from 0x40, through 0x47.
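
A sketch of how that proposed API would behave, modelled in Python. It assumes the upper operand arrives from register B on A15..A8 during OUT (C),r (as discussed earlier in the thread), that the low three port bits drive the EEPROM's A18..A16, and that page 0 holds the product LSB and page 1 the MSB. The class name `PagedLut` is made up for illustration.

```python
class PagedLut:
    """Hypothetical model of the read-only, paged LUT module."""

    BASE = 0x40  # eight ports, 0x40 through 0x47

    def __init__(self, rom):
        self.rom = rom   # 64 kB per page, up to 8 pages
        self.latch = 0   # 16-bit table address

    def selected(self, address):
        return (address & 0xF8) == self.BASE

    def io_write(self, address, data):
        # OUT (C),r: B arrives on A15..A8, r on D7..D0; any port in the block
        if self.selected(address):
            self.latch = (address & 0xFF00) | (data & 0xFF)

    def io_read(self, address):
        if not self.selected(address):
            return 0xFF  # nothing drives the bus in this model
        page = address & 0x07  # A2..A0 pick the 64 kB page
        return self.rom[(page << 16) | self.latch]


# Two pages, as a 128 kB part would give: page 0 = LSB, page 1 = MSB
rom = bytearray(2 * 0x10000)
for a in range(256):
    for b in range(256):
        rom[(a << 8) | b] = (a * b) & 0xFF
        rom[0x10000 | (a << 8) | b] = (a * b) >> 8

lut = PagedLut(rom)
lut.io_write((200 << 8) | 0x40, 123)  # out (c),e with B=200, E=123
product = (lut.io_read(0x41) << 8) | lut.io_read(0x40)
```

Because the latch is written from any port in the block, C can already point at the page to be read back first.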

Anyway, I'll build the next version this way, unless there are further thoughts?
But first to build and test the first version.

Cheers, Phillip

Mark T

Feb 9, 2020, 11:52:42 PM
to RC2014-Z80
Hi Phillip
I don’t think you need to decode the 9 different addresses.

For a read only eprom, A0 to A2 connect direct to A16 to A18, then just decode a block of 8 addresses to enable read byte from eprom as addressed by the two bytes stored to the registers. The two bytes could be written to any address within the block of 8.

I think it makes sense to have 8 blocks of 64K, then each 64K performs a function and it would be easier to combine different functions into an eprom. If it was arranged as 64K by 8 bytes it might be tricky to combine different function blocks.

For a writeable method, could be decoded as a block of 16, with A3 to select writing data or writing address. This starts to use a large chunk of IO addresses but would be quite simple decoding.

Slightly more complicated to use only 4 io addresses. Only A0 direct to the eprom A16, for easy selection of high or low byte result of a 16 bit function without slowing down the interface too much. Then write to a separate register at 0x41 to select A18 and A17. Write to 0x40 to set the two operand bytes, Read LSB from 0x40 and MSB from 0x41, or write new LSB to 0x42 and write new MSB to 0x43.

Mark

Phillip Stevens

Feb 10, 2020, 12:11:50 AM
to RC2014-Z80
 Mark T wrote:
I don’t think you need to decode the 9 different addresses.
For a read only eprom, A0 to A2 connect direct to A16 to A18, then just decode a block of 8 addresses to enable read byte from eprom as addressed by the two bytes stored to the registers. The two bytes could be written to any address within the block of 8.

Yes, we're thinking alike here.
 
I think it makes sense to have 8 blocks of 64K, then each 64K performs a function and it would be easier to combine different functions into an eprom. If it was arranged as 64K by 8 bytes it might be tricky to combine different function blocks.

Yes, also the other advantage is that the smaller EEPROMs are pin compatible, so any of sst39sf040 (8x64kB), sst39sf020 (4x64kB), and sst39sf010 (2x64kB) would work, covering the full 64kB address range with a reduced number of pages.
 
For a writeable method, could be decoded as a block of 16, with A3 to select writing data or writing address. This starts to use a large chunk of IO addresses but would be quite simple decoding.

I might be missing the "killer application" but I don't think there's a big need for in-circuit write, so to prevent being an address hog and to keep aligned with your shift-add module, I'm prepared to forego that option.

My v1 board, now non-standard compliant :-) is in transit over the Pacific, and I've components and hot iron waiting for it to land.

Phillip

Phillip Stevens

Feb 14, 2020, 8:56:08 AM
to RC2014-Z80
Well the boards arrived, and they are very nicely purple. And most surprisingly they are the right physical shape and size.
So rushing to solder one together gets this outcome...


IMG_0344.JPG


I'm still testing what the performance looks like, but the initial indications are not that positive.

I've timed the Mandelbrot example at just over 11' 05" using the LUT (Multiply) Basic.

The adapted floating point mantissa multiply is here, for interest.


Yes this is faster than MS Basic Reference, but barely so.

Certainly not worth the trouble at this stage for this particular example.


I'll have a look at other floating point benchmarks over the next few days, and see what that brings.


P.

Mark T

Feb 14, 2020, 10:34:44 AM
to RC2014-Z80
I guess there are two possible reasons for not getting much of an improvement. Either splitting the multiply into 8x8 and recombining the results is too much overhead on the multiply, or Mandelbrot has too much overhead on top of the multiply.

Mark

Phillip Stevens

Feb 15, 2020, 1:14:43 AM
to RC2014-Z80
Phillip Stevens wrote:
For the purposes of bench-marking against the really very excellent Microsoft Basic code, I used the RC2014 Repository Mandelbrot program.

MS Basic Reference - RC2014 Distribution: 11' 44"

The Microsoft Basic floating point implementation has been extracted and made available in the classic library of the z88dk.
It is implemented with a 24x8x3 add-shift multiplier, and it has 0 detection at byte level. Very efficient.
Microsoft even used push-ret pairs in their implementation, rather than call-ret, to save a few extra cycles.

Some of the benchmarks put its performance as the best software implementation, alongside the BBC micro Basic.

Now I've cut in a version of the 32h_24x24 mantissa multiplier used in Z88DK math32 library, together with a mulu_de 16_8x8 software multiplier.

MS Basic - math32 with software mulu_de: 18' 31" or 157% of the reference

That's about what I expect. Using the 16_8x8 elemental multiplication model is much less efficient than a single large multiply, if you don't have hardware suited (like z180 or z80n). 
I'm hoping the LUT multiplier result will be about 3x the math32 software mantissa version, or about 2x the reference MS Basic speed.
So, I'm betting on about 6' 20". Excited to see what happens.

Well the boards arrived, and they are very nicely purple. And most surprisingly they are the right physical shape and size.

I'm still testing what the performance looks like, but the initial indications are not that positive.


MS Basic - math32 with hardware LUT multiply: 11' 05" or 95% of the reference (or about 60% of equivalent software).

The adapted floating point mantissa multiply is here, for interest.


This is a good demonstration of just how well written the original MS Basic code is!
 

I'll have a look at other floating point benchmarks over the next few days, and see what that brings.


OK, we're back to the original Mandelbrot code, in assembly running in CP/M. And we get a much more promising outcome.
I've attached a version of the code for Mark to try his Shift-Add multiplier, and the results I'm seeing are pretty good.

; The theoretical minimum time is 27.52 seconds.
;
; Results          FCPU         Original    Table mlt    LUT mlt
; RC2014 CP/M    7.432MHz         4'58"       3'48"       2'55"
;
; Results          FCPU         Original    Optimised   z180 mlt
; YAZ180 CP/M   18.432MHz         1'40"       1'24"       1'00"
; YAZ180 yabios 18.432MHz                     1'14"         54"
; YAZ180 CP/M   36.864MHz         1'06"         58"         46"
; YAZ180 yabios 36.864MHz                       56"         45"

Using a very optimised 512 Byte table multiplication algorithm, the very best result I could get was around 3' 48".
The LUT Multiply Module cuts nearly a minute (53 seconds) off that time.

I've cleaned up the design to match the discussed API.
The board uses 040h as the operand latch, 040h as the LSB, and 041h as the MSB of the result.

And ordered some revised boards from OSH Park.

RC2014_LUT_Schematic.png



RC2014_LUT_Board.png OSH_LUT_Top.png OSH_LUT_Bottom.png



Phillip.
mandel-feilipu-rc2014.asm
mandel-feilipu-rc2014-lut.asm

Phillip Stevens

Feb 18, 2020, 7:37:05 AM2/18/20
to RC2014-Z80
Phillip Stevens wrote:
For the purposes of bench-marking against the really very excellent Microsoft Basic code, I used the RC2014 Repository Mandelbrot Basic program.

MS Basic Reference - RC2014 Distribution: 11' 44"
MS Basic - math32 with software mulu_de: 18' 31" or 157% of the reference
MS Basic - math32 with hardware LUT multiply: 11' 05" or 95% of the reference (or about 60% of equivalent software).

The adapted floating point mantissa multiply is here, for interest.

This is a good demonstration of just how well written the original MS Basic code is!

OK, we're back to the original Mandelbrot code, in assembly running in CP/M. And we get a much more promising outcome.
I've attached a version of the code for Mark to try his Shift-Add multiplier, and the results I'm seeing are pretty good.

; The theoretical minimum time is 27.52 seconds.
;
; Results          FCPU         Original    Table mlt    LUT mlt
; RC2014 CP/M    7.432MHz         4'58"       3'48"       2'55"
;
; Results          FCPU         Original    Optimised   z180 mlt
; YAZ180 CP/M   36.864MHz         1'06"         58"         46"
Using a very optimised 512 Byte table multiplication algorithm, the very best result I could get was around 3' 48".
The LUT Multiply Module cuts nearly a minute off the total time it posted.


So one more benchmark to try. This time in C, using the original Whetstone program as found in the z88dk benchmarks directory.
Using this example with the compilation line...

 > zcc +rc2014 -subtype=cpm -clib=sdcc_iy -SO3 --max-allocs-per-node800000 -DPRINTOUT whetstone.c -o whetstone --math32 -m -create-app

Gets us the following results (hand timed, but to the second repeatable).

math32 LUT 1'11" kWIPS = 14.1
math32     1'56" kWIPS = 8.6
math48     2'06" kWIPS = 7.9 (-lm)

Using the LUT Multiply Module lifts the RC2014 floating point performance by about 64%, from 0.0086 MWIPS to 0.0141 MWIPS !!!
Such compute!
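
The kWIPS figures follow directly from the hand timings. A minimal C sketch of the conversion is below. Note the assumption: the workload of 1000 thousand Whetstone instructions per run is inferred from the numbers above (1000 / 71 s is about 14.1), not read out of whetstone.c, and the function names are made up.

```c
#include <stdint.h>

/* ASSUMPTION: total work per run, in thousands of Whetstone
   instructions, inferred from the timings quoted in this post. */
#define KILO_WHETSTONES_PER_RUN 1000.0

/* Convert a hand-timed minutes/seconds reading to seconds. */
static unsigned run_seconds(unsigned minutes, unsigned seconds)
{
    return minutes * 60u + seconds;
}

/* Thousands of Whetstone instructions per second for one run. */
static double kwips(unsigned elapsed_seconds)
{
    return KILO_WHETSTONES_PER_RUN / (double)elapsed_seconds;
}

/* Speed ratio of a faster run against a slower one. */
static double speedup(unsigned slow_seconds, unsigned fast_seconds)
{
    return (double)slow_seconds / (double)fast_seconds;
}
```

Under that assumption kwips(run_seconds(1, 11)) is about 14.1 and kwips(run_seconds(1, 56)) about 8.6, and speedup(116, 71) is a bit over 1.6.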

I think that's about all I have to add, except looking forward to Mark T's version...

Phillip

Phillip Stevens

Feb 27, 2020, 8:23:35 PM2/27/20
to RC2014-Z80
Perhaps I should have added “Does Blockchain” to the title of this thread.

Just made aware of a crypto library for Z80, that would definitely benefit from a “LUT Blockchain Module“.
https://github.com/nagydani/z80_crypto

lol.
P.

Phillip Stevens

Mar 8, 2020, 9:06:09 AM3/8/20
to RC2014-Z80

Felt like doing some more benchmarks tonight, now that I've finally got the finished modules back from OSH Park, built up, and running.

IMG_0380.JPG



There are two floating point benchmarks in the z88dk examples to look at.

The n-body and the spectral norm.


Both were compiled for and run on CP/M.

The performance of the serial I/O or disk interface has little effect on the benchmark, as only one number is output.


For the n-body benchmark, the math32 library is already pretty good, but using the LUT Module we get about a 2x performance improvement.

The n-body benchmark is very reliant on multiplication and squaring, and so benefits greatly from hardware assistance.


zcc +rc2014 -subtype=cpm -SO3 --max-allocs-per-node400000 -DSTATIC -DPRINTF --math32 n-body.c -o nbody -create-app

math32 LUT    1'45"
math32        3'18"


With spectral norm the math32 library is not as fast as even the Microsoft Basic maths library.

But still, with the LUT Module we get a result in under 60% of the time required without hardware assistance.

And, if you were waiting over half an hour for each calculation, I think the time saving would be considered worthwhile.


zcc +rc2014 -subtype=cpm -SO3 --max-allocs-per-node400000 -DSTATIC -DPRINTF --math32 spectral-norm.c -o spectral -create-app

math32 LUT   19'20"
math32       32'38"

In z88dk, the LUT Module is enabled with a configuration switch, which is set "off" by default.

P.

Phillip Stevens

Mar 21, 2020, 12:31:49 AM3/21/20
to RC2014-Z80
Phillip Stevens wrote:
Felt like doing some more benchmarks tonight, now that I've finally got the finished modules back from OSH Park, built up, and running.

IMG_0380.JPG


In z88dk, the LUT Module is enabled with a configuration switch, which is set "off" by default.

Over the past couple of weeks I've been adding support for int and long multiplies for the LUT Module to the z88dk standard libraries.
So at this point all of the library multiply routines issued by the C compilers sccz80 and zsdcc will be done by the LUT Module (when it is enabled).
This makes accelerating games or programs using C integer maths completely hands free.

There are a couple of nice features about using a two part LUT multiplier, which I've been able to use in writing the code.
  1. Recurring multiplies, with one of the multipliers constant, are easy because the B register does not need to be reloaded.
  2. The LSB and MSB of the result don't need to be retrieved concurrently, because the result is latched, so the LUT Module acts like another register.
  3. If you don't need the result MSB, it can be ignored.
  4. Any two registers can be used for the multiplier and multiplicand, and for the results.
Using 1. and 2., the mulu_72_64x8 multiply ripples along the input long long with a minimum of register shuffling.
Using 3., the mulu_16_16x16 doesn't have to handle carry bytes where they're not needed.
Using 4., the mulu_64_32x32 can simply multiply HE = H*E and DL = D*L, then add DE+HL.
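
To make point 3 concrete, here is a minimal C sketch of a truncating 16x16 multiply built from 8x8 products, where two of the three products only ever have their LSB read. It is a host-side model, not the z88dk library code. lut_mul8 is a hypothetical stand-in for loading the module and reading back its two latched result bytes.

```c
#include <stdint.h>

/* Models the latched 16-bit result of one 8x8 multiply. */
typedef struct {
    uint8_t lsb;
    uint8_t msb;
} lut_result;

/* Hypothetical stand-in for the LUT Module: on hardware this would
   be an operand write followed by one or two result reads. */
static lut_result lut_mul8(uint8_t a, uint8_t b)
{
    uint16_t p = (uint16_t)((uint16_t)a * (uint16_t)b);
    lut_result r = { (uint8_t)(p & 0xFFu), (uint8_t)(p >> 8) };
    return r;
}

/* 16x16 -> 16-bit multiply (low word only), as a C int multiply
   needs. H*D is never computed, and the MSBs of H*E and L*D are
   never read, because they fall outside the 16-bit result. */
static uint16_t mulu_16_16x16(uint16_t hl, uint16_t de)
{
    uint8_t h = (uint8_t)(hl >> 8), l = (uint8_t)(hl & 0xFFu);
    uint8_t d = (uint8_t)(de >> 8), e = (uint8_t)(de & 0xFFu);

    lut_result le = lut_mul8(l, e);            /* both bytes needed */
    uint8_t high = le.msb;
    high = (uint8_t)(high + lut_mul8(h, e).lsb); /* LSB only */
    high = (uint8_t)(high + lut_mul8(l, d).lsb); /* LSB only */

    return (uint16_t)(((uint16_t)high << 8) | le.lsb);
}
```

For example, mulu_16_16x16(300, 200) gives 60000, and mulu_16_16x16(0xFFFF, 0xFFFF) wraps to 1, matching ordinary 16-bit unsigned arithmetic.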

So what is the result of this distraction?

Well the "calculate pi to 800 decimal places" benchmark (using long multiplies mulu_32_32x32) was tested.
Using the normal z88dk assembly language (optimised) library functions, my RC2014 can do the pi to 800 decimals in 12'00". Using the LUT Module code it was done in 8'20", or about 69% of the time.
This is quite a substantial outcome, as there are some long divisions in the benchmark that have not been improved which contribute to the total time.

Enjoy, Phillip

Phillip Stevens

Mar 31, 2020, 3:56:57 AM3/31/20
to RC2014-Z80
As a further test of integration, I've built a special version of CP/M-IDE which uses the LUT Module multiply instructions as issued by the zsdcc compiler.
CP/M-IDE requires an RC2014 Plus together with Spencer's new IDE Module. Quite a useful option, if you can't stretch to a 512kB/512kB ROM/RAM.
So, if you have an interest, then give it a try.
P.