This is the most flexible and powerful option, as it provides both multiply and divide for signed and unsigned numbers. It is also the simplest, as it includes device registers, buffering and interfacing. It simply needs to be addressed via a number of I/O addresses. Building an RC2014 module would need just a 74138 for address selection. The multiplication time is about 500ns, or 5 clocks, plus the time to load and unload the registers. This means that the multiply result is available in just over 1 (short) Z80 machine cycle.
But I can't find a source where these can be purchased. They seem to be made of unobtainium. So it is probably impossible to use as a basis.
This Cray multiplier device is the next easiest option, with very high speed results. Building an RC2014 module would require 2x 8-bit input buffers and 2x 8-bit output buffers, with addressing using a 74138.
A multiply result (combinational logic) is available in just 60ns, but the operands and results would need to be loaded and unloaded.
Operationally, the module would require just two I/O addresses, as reads and writes could use the same locations. Repeated multiplies are easy, as just one of the operands could be reloaded and the new result read out immediately. But (again) I can't find a source where these can be purchased. So (again) it is probably impossible to use as a basis.
This solution replicates the full Cray multiplication process above, using 2x 4x4 multipliers together with a Wallace tree, consisting of 8x 74183, 3x 74181, and 1x 74182, to add the partial results. Building an RC2014 module would also require 2x input buffers and 2x output buffers (16 bits), with addressing using a 74138. All in all, this would never fit onto a standard height RC2014 module, but 18x DIP16 devices might be squeezed onto a double height module. I'm not sure of my ability to design this kind of thing though, but that would be part of the challenge. And, again, sourcing devices seems to be quite hard, though perhaps less difficult than for the other options.
74x165 for multiplier, 74x273 for multiplicand, 2 x 74x273 for 16 bit result register, 4 of 74x283 for sixteen bit adder.
Maybe copy the state machine for clocking from the shift register spi micro sd card module. 74x163, 74x138 and a few gates.
Mark
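The shift-and-add parts list above can be modelled in a few lines. This is only a sketch of one way those parts could cooperate: it assumes the multiplier bits leave the shift register MSB first and that the result register is fed back through the adder shifted left by one bit in the wiring. The function name `shift_add_multiply` is made up for illustration.

```python
def shift_add_multiply(multiplier, multiplicand):
    """Model an 8x8 shift-and-add multiply taking eight clock steps.

    Assumed wiring: multiplier bits arrive serially, MSB first; the 16-bit
    result register is doubled on each step by the way it feeds the adder.
    """
    result = 0                                    # 16-bit result register
    for step in range(8):
        bit = (multiplier >> (7 - step)) & 1      # serial bit from the shift register
        addend = multiplicand if bit else 0       # multiplicand gated by that bit
        result = ((result << 1) + addend) & 0xFFFF  # 16-bit adder output, re-latched
    return result


print(shift_add_multiply(200, 100))
```

Eight clocks per multiply is far slower than a combinational multiplier, but it needs only common 74-series parts.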
Would a 49C402 bit slice unit be cheating? It has the "retro" covered.
Two 64K (UV-, EE-, or Flash-) ROMs?
Operands on address bus, LSB of result on data bus of one of them, MSB on the other.
You choose your retro-level by choosing the ROM size/type, could be done with 128x 2708 :)
Only removes one out instruction, but every cycle counts.
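A minimal sketch of how the two 64K table images could be generated. It assumes one operand is wired to A15..A8 and the other to A7..A0; that assignment, and the name `build_tables`, are illustrative assumptions and not part of the proposal.

```python
def build_tables():
    """Return (lsb_rom, msb_rom), each a 65536-byte multiplication table.

    Assumed addressing: address = (a << 8) | b for operands a and b.
    """
    lsb = bytearray(65536)
    msb = bytearray(65536)
    for a in range(256):
        for b in range(256):
            product = a * b                  # 0..65025, always fits in 16 bits
            addr = (a << 8) | b
            lsb[addr] = product & 0xFF       # low byte goes in the first ROM
            msb[addr] = product >> 8         # high byte goes in the second ROM
    return bytes(lsb), bytes(msb)


lsb_rom, msb_rom = build_tables()
addr = (12 << 8) | 34
result = (msb_rom[addr] << 8) | lsb_rom[addr]
print(result)  # 408
```

Each image could be written out with `open(name, "wb").write(...)` and burned to whichever ROM type suits the chosen retro level.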
Only a personal preference, but 128K of ROM or RAM seems to be cheating if the idea is to have a retro solution. I think 4x4 using 256 bytes would be more reasonable, but then it is just easier to allocate 256 bytes of memory.
Mark
Some great ideas guys.
How about making the chips RAM instead of ROM.
You could use one 128k x 8 RAM chip instead of two ROMs.
You could perhaps increase performance by outputting the high order address latch data on A8-A15 using OUT (C),A where C is the module's I/O address, B = table address A8-A15, and A = table address A0-A7.
Other application ideas: RAM disk, LOG tables, SIN/COS/TAN trigonometry tables.
Steve Cousins wrote:
> Some great ideas guys.
> How about making the chips RAM instead of ROM.

I prefer the idea of it being "firmware" rather than needing to be reloaded each time. But that's not to say that RAM isn't a good idea.

> You could use one 128k x 8 RAM chip instead of two ROMs.

I looked at 1Mbit (128kB x 8-bit) ROM and it is extortionately expensive (and only PROM single use).
And the RC2014 platform already uses the W27C512 ROM, so I have a stack in a tube.

> You could perhaps increase performance by outputting the high order address latch data on A8-A15 using OUT (C),A where C is the module's I/O address, B = table address A8-A15, and A = table address A0-A7.

Great idea (Mark & Steve). The code to replicate a Spectrum Next MUL DE would then be:

mul_de:
    ld b,d          ; 4   operand MSB in B
    ld c,0x42       ; 7   operand latch address
    out (c),e       ; 12  operand LSB from E
    dec c           ; 4   result MSB address
    in d,(c)        ; 12  result MSB to D
    dec c           ; 4   result LSB address
    in e,(c)        ; 12  result LSB to E
    ret             ; 10

A total of 65 cycles, excluding the call. Pretty sharp. My current best case is stolen from CPC Wiki at around 154 cycles, so this is 2.3x faster.

> Other application ideas:
> RAM disk
> LOG tables
> SIN/COS/TAN trigonometry tables
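To check the mul_de routine and its cycle count, here is a small Python model of the I/O sequence. It is a sketch under assumptions: the class `LutModule` and its methods are invented for illustration, the product is computed directly as a stand-in for the table lookup, and the port numbers and T-state counts are taken from the listing.

```python
class LutModule:
    """Toy model of the operand latch plus lookup table (ports from the listing)."""
    LATCH, MSB, LSB = 0x42, 0x41, 0x40

    def __init__(self):
        self.x = 0
        self.y = 0

    def out(self, port, high_addr, data):
        # OUT (C),r places B on A8-A15 and r on the data bus.
        if port == self.LATCH:
            self.x = high_addr & 0xFF
            self.y = data & 0xFF

    def inp(self, port):
        product = self.x * self.y          # stand-in for the table lookup
        if port == self.MSB:
            return product >> 8
        if port == self.LSB:
            return product & 0xFF
        return 0xFF                        # assumption: unmapped ports read 0xFF


def mul_de(module, d, e):
    """Mirror the assembly step by step; return (DE, T-states used)."""
    steps = []
    b = d
    steps.append(4)                        # ld b,d
    c = 0x42
    steps.append(7)                        # ld c,0x42
    module.out(c, b, e)
    steps.append(12)                       # out (c),e
    c -= 1
    steps.append(4)                        # dec c
    d = module.inp(c)
    steps.append(12)                       # in d,(c)
    c -= 1
    steps.append(4)                        # dec c
    e = module.inp(c)
    steps.append(12)                       # in e,(c)
    steps.append(10)                       # ret
    return (d << 8) | e, sum(steps)


print(mul_de(LutModule(), 200, 100))  # (20000, 65)
```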
Only 4 vias. Pretty happy with that.
The layout is a piece of art!
I ordered this off OSH Park tonight (given it's CNY and nothing is going on in Asia for two weeks except fireworks).
I think I saw that JLCPCB were not closing for spring festival this year.
Did you consider the ST39SF040 as used on the 512K ROM/RAM board? Could also select a couple of additional functions and is electrically alterable.
I've worked out how the Microsoft Basic floating point mantissa multiplier works (and it is a beautiful piece of code), and have cut it out and patched in my own ugly version. The small win is actually getting correct results.
What that means is that when I get the LUT board working, I'll be able to produce a kind of Frankenstein MS Basic that should (with the added LUT hardware) be really very fast at math. Microsoft Basic uses floating point for all its math, so improving floating point will speed it up at everything.
I’m guessing you are using an 8x8 bit unsigned multiply and combining the results as in the long multiplication method, but is there any advantage in implementing signed multiply?
Any plans to try and support division? What methods could be used for this?
X / Y = X * (1/Y)
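One way to read the identity above is as a reciprocal table followed by a multiply. The sketch below is an assumption about how that might look for 8-bit operands, not something proposed in the thread: the table `RECIP` holds a 16-bit fixed-point reciprocal rounded up, and the names are illustrative. Note that the entry for 1 is 65537, which needs 17 bits, so real hardware would have to special-case division by 1.

```python
# Scaled reciprocals: RECIP[y] = floor(65536 / y) + 1 for y = 1..255.
# Index 0 is a placeholder; division by zero is rejected below.
RECIP = [0] + [65536 // y + 1 for y in range(1, 256)]


def div8(x, y):
    """Unsigned 8-bit x // y computed as x * (1/y) in 16.16 fixed point."""
    if y == 0:
        raise ZeroDivisionError("division by zero")
    return (x * RECIP[y]) >> 16


print(div8(200, 7))  # 28
```

The rounding-up by one is what keeps the result exact over the whole 8-bit range; wider operands would need a wider reciprocal.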
I was looking at your address decoding and was thinking if you qualified the /WR when writing the operands, use 0x40 to write the operands and read LSB result from 0x40, then you could remove one DEC C instruction. Only 4 cycles but this is likely used in a low level where every cycle helps.
Also the output controls of the ‘374s could be tied to ground, would avoid slight loading of the /RD control line.
I was trying to lay out a shift and add multiplier as an interesting challenge, to see if it would fit on a standard module. Thought I’d try and make it compatible with your code, which just means not qualifying A1, then 0x40 to 0x43 to write operands, 0x40 to read LSB, 0x41 to read MSB. (Also finding the layout a bit tight, so removing A1 makes it a bit easier.)
I don’t think you need to decode the 9 different addresses.
For a read only eprom, A0 to A2 connect direct to A16 to A18, then just decode a block of 8 addresses to enable read byte from eprom as addressed by the two bytes stored to the registers. The two bytes could be written to any address within the block of 8.
I think it makes sense to have 8 blocks of 64K, then each 64K performs a function and it would be easier to combine different functions into an eprom. If it was arranged as 64K by 8 bytes it might be tricky to combine different function blocks.
For a writeable method, could be decoded as a block of 16, with A3 to select writing data or writing address. This starts to use a large chunk of IO addresses but would be quite simple decoding.
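The read-only decoding described above (A0 to A2 wired straight to A16 to A18, eight blocks of 64K) amounts to the following address calculation. This is a sketch only; the function name `rom_address` and the example port values are assumptions for illustration.

```python
def rom_address(port, operand_hi, operand_lo):
    """EPROM address seen for a read from `port` within the block of 8.

    The low three I/O address bits select one of eight 64K function blocks;
    the two latched operand bytes form the 16-bit offset inside that block.
    """
    block = port & 0x07                       # A0..A2 -> A16..A18
    return (block << 16) | ((operand_hi & 0xFF) << 8) | (operand_lo & 0xFF)


print(hex(rom_address(0x42, 0x12, 0x34)))  # 0x21234
```

With this layout each function (multiply LSB, multiply MSB, and so on) occupies its own contiguous 64K image, which is what makes combining separately generated tables into one EPROM straightforward.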
I'm still testing what the performance looks like, but the initial indications are not that positive.
I've timed the Mandelbrot example at just over 11' 05" using the LUT (Multiply) Basic.
The adapted floating point mantissa multiply is here, for interest.
Yes this is faster than MS Basic Reference, but barely so.
Certainly not worth the trouble at this stage for this particular example.
I'll have a look at other floating point benchmarks over the next few days, and see what that brings.
Mark
For the purposes of benchmarking against the really very excellent Microsoft Basic code, I used the RC2014 Repository Mandelbrot program.
MS Basic Reference - RC2014 Distribution: 11' 44"
The Microsoft Basic floating point implementation has been extracted and made available in the classic library of the z88dk. It is implemented with a 24x8x3 add-shift multiplier, and it has 0 detection at byte level. Very efficient. Microsoft even used push-ret pairs in their implementation, rather than call-ret, to save a few extra cycles. Some of the benchmarks put its performance as the best software implementation, alongside the BBC Micro Basic.
Now I've cut in a version of the 32h_24x24 mantissa multiplier used in Z88DK math32 library, together with a mulu_de 16_8x8 software multiplier.
MS Basic - math32 with software mulu_de: 18' 31" or 157% of the reference
That's about what I expect. Using the 16_8x8 elemental multiplication model is much less efficient than a single large multiply, if you don't have suitable hardware (like the z180 or z80n).
I'm hoping the LUT multiplier result will be about 3x faster than the math32 software mantissa version, or about 2x the reference MS Basic speed.
So, I'm betting on about 6' 20". Excited to see what happens.
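The elemental 8x8 model mentioned above can be sketched as schoolbook long multiplication on bytes. This is an illustration of the general method, not the math32 source: `mul8x8` stands in for either the software mulu_de routine or the LUT hardware, and `long_multiply` is a made-up name. A 24x24 mantissa multiply done this way needs nine elemental multiplies plus the shifted additions, which is where the overhead comes from.

```python
def mul8x8(a, b):
    """Elemental 8x8 -> 16 bit unsigned multiply (software or LUT hardware)."""
    return (a & 0xFF) * (b & 0xFF)


def long_multiply(x, y, nbytes=3):
    """Multiply two unsigned nbytes-wide integers using only mul8x8."""
    product = 0
    for i in range(nbytes):
        xi = (x >> (8 * i)) & 0xFF
        for j in range(nbytes):
            yj = (y >> (8 * j)) & 0xFF
            # Each partial product is weighted by the byte positions it came from.
            product += mul8x8(xi, yj) << (8 * (i + j))
    return product


print(hex(long_multiply(0x800000, 0xC00000)))  # 0x600000000000
```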
Well the boards arrived, and they are very nicely purple. And, most surprisingly, they are the right physical shape and size.
; The theoretical minimum time is 27.52 seconds.
;
; Results FCPU Original Table mlt LUT mlt
; RC2014 CP/M 7.432MHz 4'58" 3'48" 2'55"
;
; Results FCPU Original Optimised z180 mlt
; YAZ180 CP/M 18.432MHz 1'40" 1'24" 1'00"
; YAZ180 yabios 18.432MHz 1'14" 54"
; YAZ180 CP/M 36.864MHz 1'06" 58" 46"
; YAZ180 yabios 36.864MHz 56" 45"
Phillip Stevens wrote:
> For the purposes of benchmarking against the really very excellent Microsoft Basic code, I used the RC2014 Repository Mandelbrot Basic program.
> MS Basic Reference - RC2014 Distribution: 11' 44"
> MS Basic - math32 with software mulu_de: 18' 31" or 157% of the reference
> MS Basic - math32 with hardware LUT multiply: 11' 05" or 95% of the reference (or about 60% of equivalent software).
> The adapted floating point mantissa multiply is here, for interest.
This is a good demonstration of just how well written the original MS Basic code is!
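For anyone wanting to check the percentages, the three Mandelbrot timings quoted above can be converted to seconds and compared. This is just arithmetic on the posted figures; the helper name `seconds` is made up, and the ratios agree with the quoted percentages to within rounding.

```python
def seconds(stamp):
    """Convert a timing written as M' SS" (for example 11' 44") to seconds."""
    minutes, rest = stamp.split("'")
    return int(minutes) * 60 + int(rest.strip().rstrip('"'))


reference = seconds("11' 44\"")   # MS Basic reference
software = seconds("18' 31\"")    # math32 with software mulu_de
lut = seconds("11' 05\"")         # math32 with hardware LUT multiply

print(f"software / reference = {software / reference:.1%}")
print(f"LUT / reference      = {lut / reference:.1%}")
print(f"LUT / software       = {lut / software:.1%}")
```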
OK, we're back to the original Mandelbrot code, in assembly running on CP/M. And we get a much more promising outcome.
I've attached a version of the code for Mark to try his Shift-Add multiplier, and the results I'm seeing are pretty good.
Using a very optimised 512 byte table multiplication algorithm, the very best result I could get was around 3' 48". The LUT Multiply Module cuts nearly a minute off that time.
Of course, z88dk supports the RC2014 LUT Module, too.
So one more benchmark to try. This time in C, using the original Whetstone program as found in the z88dk benchmarks directory.
Using this example with the compilation line...
> zcc +rc2014 -subtype=cpm -clib=sdcc_iy -SO3 --max-allocs-per-node800000 -DPRINTOUT whetstone.c -o whetstone --math32 -m -create-app
Gets us the following results (hand timed, but repeatable to the second).
math32 LUT 1'11" kWIPS = 14.1
math32 1'56" kWIPS = 8.6
math48 2'06" kWIPS = 7.9 (-lm)
Using the LUT Multiply Module lifts the RC2014 floating point performance by about 1.6x, from 0.0086 MWIPS to 0.0141 MWIPS!!! Such compute!
There are two floating point benchmarks in the z88dk examples to look at.
The n-body and the spectral norm.
Both were compiled for and run on CP/M.
The performance of the serial I/O or disk interface has little effect on the benchmark, as only one number is output.
For the n-body benchmark, the math32 library is already pretty good, but using the LUT Module we get about 2x performance improvement.
The n-body benchmark is very reliant on multiplication and squaring, and so benefits greatly from hardware assistance.
zcc +rc2014 -subtype=cpm -SO3 --max-allocs-per-node400000 -DSTATIC -DPRINTF --math32 n-body.c -o nbody -create-app
math32 LUT 1'45"
math32 3'18"
With spectral norm the math32 library is not as fast as even the Microsoft Basic maths library.
But even so, with the LUT Module we get a result in under 60% of the time required without hardware assistance.
And, if you were waiting over half an hour for each calculation, I think the time saving would be considered worthwhile.
zcc +rc2014 -subtype=cpm -SO3 --max-allocs-per-node400000 -DSTATIC -DPRINTF --math32 spectral-norm.c -o spectral -create-app
math32 LUT 19'20"
math32 32'38"
Felt like doing some more benchmarks tonight, now that I've finally got the finished modules back from OSH Park, built up, and running.
In z88dk, the LUT Module is enabled with a configuration switch, which is set "off" by default.
Over the past couple of weeks I've been adding support for int and long multiplies for the LUT Module to the z88dk standard libraries. So at this point all of the library multiply routines issued by the C compilers sccz80 and zsdcc will be done by the LUT Module (when it is enabled). This makes accelerating games or programs using C integer maths completely hands free.