Smokin! :) Z80 at 10MHz takes 3:50.
I've also added a Z180 specific multiplication routine to the gist, which I'll add to the comparison soon too.
The Z180 at 36MHz is impressive. The Z280 had a slow rollout and was plagued by bugs early on, so by the time it was a real product it had been eclipsed by other processors, including the Z180. The Z280 was never very popular, which is a shame because it has some genuinely modern features. I'm certainly having fun playing with the Z280.
I didn't find Z280 on the targets list of Z88DK. Is anyone working on porting Z88DK to Z280?
On Thursday, May 17, 2018 at 7:44:00 PM UTC-6, J.B. Langston wrote:
Smokin! :) Z80 at 10MHz takes 3:50.
I took a fork of your gist code, and converted it to assemble to the Z88DK CP/M target.
Running on a Z180, I get about 1 minute 40 seconds.
Z280 does have a 16x16 signed multiply instruction, MULTW, so the mul_16 routine is reduced to:
mul_16:
; operand sources are BC and DE
push de
pop hl
db 0edh,0c2h ; op code for sign multiply of HL and BC
; multw hl,bc
ret
With the 16x16 signed multiply instruction, the run time for the 24MHz Z280 with 57.6K baud console is reduced from 2 min 40 sec to 1 min 44 sec.
HOWEVER, as I stared at the screen, I noticed that the bulk of the time is taken up by the program painting the mandelbrot, not calculating the next pixel. In other words, the real bottleneck is the speed of the serial port, not CPU performance. So I dropped in a 14.7456MHz oscillator and raised the serial clock to 115200 (on the Z280RC the serial clock is derived from the CPU clock, so I need a 14.7456MHz CPU clock in order to get 115200 baud). Running the original test again, the time is 3 min 4 seconds (it would be 4 min 20 seconds if the bottleneck were the CPU), and with the mul_16 routine based on the Z280 16x16 signed multiply instruction the run time is now 1 min 39 sec, which is faster than with the 24MHz CPU clock and 57.6K serial clock! The existing serial port setup is 8-bit data, odd parity, 1 stop bit; I can get another 10% improvement just by using 8-bit data, no parity and 1 stop bit. So indeed we've "jumped the shark"!
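As a rough check of the "4 min 20 seconds" figure above, here is a small Python sketch. It assumes that a purely CPU-bound run scales inversely with the clock frequency; the function names are made up for illustration and are not part of the benchmark code.

```python
def cpu_bound_estimate(seconds_at_ref, ref_mhz, new_mhz):
    """Run time expected at new_mhz if the job were purely CPU bound
    (assumption: time is inversely proportional to clock frequency)."""
    return seconds_at_ref * ref_mhz / new_mhz


def mmss(seconds):
    """Format a duration in seconds as 'M min S sec'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} sec"


baseline = 2 * 60 + 40   # measured: 24 MHz CPU, 57.6K baud, original multiply
measured = 3 * 60 + 4    # measured: 14.7456 MHz CPU, 115200 baud
estimate = cpu_bound_estimate(baseline, 24.0, 14.7456)

print("CPU-bound estimate:", mmss(estimate))
print("Measured:          ", mmss(measured))
```

The measured time comes in well under the CPU-bound estimate, which is consistent with the serial port being the limiting factor.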
Looking over J.B.'s code, I agree the picture has 240 x 150 characters, but the code transmits the ANSI color sequence plus the color pixel for each character, so that's a string of 10 characters per character displayed. The total number of characters transmitted to paint the mandelbrot is roughly 10*240*150 = 360000. At 115200 8o1 it takes 31 seconds, at 57600 8o1 it takes 62.5 seconds.
This is why my Z280 at 24MHz CPU clock and 57600 serial baud ran slower than 14.7MHz CPU clock and 115200 serial baud.
Bill
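The transmission times quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the 31 s and 62.5 s figures match 10 bits per character on the wire (start + 8 data + stop), while counting a parity bit as well gives 11 bits per character and slightly longer times. The frame sizes and the function name are my assumptions, not something stated in the thread.

```python
def transmit_seconds(chars, baud, bits_per_char):
    """Time to push `chars` characters through a serial link at `baud` bits/s."""
    return chars * bits_per_char / baud


CHARS = 10 * 240 * 150       # roughly 10 characters sent per displayed pixel
BITS_NO_PARITY = 10          # start + 8 data + stop
BITS_WITH_PARITY = 11        # start + 8 data + parity + stop

for baud in (115200, 57600):
    t10 = transmit_seconds(CHARS, baud, BITS_NO_PARITY)
    t11 = transmit_seconds(CHARS, baud, BITS_WITH_PARITY)
    print(f"{baud} baud: {t10:.2f} s at 10 bits/char, {t11:.2f} s at 11 bits/char")
```

The difference between the two frame sizes is about 9%, in line with the roughly 10% gain expected from dropping the parity bit.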
Jumping on the shark, and riding it, I've unrolled the multiplication and now have the following numbers.
Results   FCPU        Original   Optimised   z180 mlt
RC2014    7.432MHz    4'51"      4'10"
YAZ180    36.864MHz   1'40"      1'24"       1'00"
Next stop... using an APU for the calculation.
Phillip,
Thank you, that'll be most helpful.
Bill
Bill,
I’ll open a new thread on interrupt driven serial.
I’ve got working code for ACIA, SIO/2, and z180 ASCI ports in z88dk and for two CP/M implementations.
Essentially it is the same code each time, just with variation depending on interrupt mechanics.
It might also be useful for z280 too.
Cheers, Phillip
Retrobrewcomputers forum member 'lowen' pointed me to a source of Z280 on UTSource. They are used parts from China and seemed a gamble at the time, but I've purchased 3 lots of them and they all worked fine.
Not sure if it is getting to 'jumping the shark' stage with this mandelbrot stuff, but I've added a signed 16 bit multiplication routine using the z180 8x8 hardware multiplier to the routine, and the run time (with nothing else changed) is now 1 minute and 1 second.
The scaling is not quite right, but at least I've sorted the terminal issues and now the output looks very nice.
I read that the Z280 has signed 16 bit hardware multiply functions. I guess the final step would be to convert the `mult16` function to a single Z280 op code, and see what happens then...
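For readers wondering how a signed 16 bit multiply can be built on an unsigned 8x8 multiplier such as the Z180 MLT instruction, here is a Python model of one common approach: multiply the magnitudes as four 8x8 partial products, then fix up the sign. It is a sketch of the general technique and not a transcription of the routine in the gist; all function names are hypothetical.

```python
def mlt8(a, b):
    """Model of an unsigned 8x8 -> 16 bit hardware multiply."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b


def mulu_32_16x16(x, y):
    """Unsigned 16x16 -> 32 bit multiply from four 8x8 partial products."""
    xl, xh = x & 0xFF, x >> 8
    yl, yh = y & 0xFF, y >> 8
    return (mlt8(xl, yl)
            + ((mlt8(xl, yh) + mlt8(xh, yl)) << 8)
            + (mlt8(xh, yh) << 16))


def muls_32_16x16(x, y):
    """Signed multiply for x, y in -32768..32767: multiply the magnitudes,
    then negate the product if exactly one operand was negative."""
    negative = (x < 0) != (y < 0)
    product = mulu_32_16x16(abs(x), abs(y))
    return -product if negative else product


print(muls_32_16x16(-1234, 567))
```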
; Fast mulu_16_8x8 using a 512 byte table
; x*y = ((x+y)/2)^2 - ((x-y)/2)^2 <- if x+y is even
; = ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y <- if x+y is odd and x>=y
> zcc +rc2014 -subtype=cpm -v --list -m mandel-feilipu.asm -o mandel
> appmake +glue --ihex --clean -b mandel -c mandel
; Upload using XMODEM; mandel__.bin -> mandel.com
; Or upload using hexload or SCM; mandel__.ihx -> mandel.hex
; then use LOAD or MLOAD: mandel.hex -> mandel.com
I'd like to incorporate your table-based multiplication into my TMS9918 mandelbrot generator. It appears that l_z80_mulu_de plus the associated tables is the code that you added; however, I'm not sure how to integrate this with my original code given the significant changes introduced for z88dk. I can probably figure it out, but if you have any pointers to save me time, I'd appreciate it.
On Sunday, September 1, 2019 at 7:49:35 AM UTC-5, Phillip Stevens wrote:
Following up on this from another thread, for the RC2014 we've been doing some further mandelbrot shark jumping...
Bill Shen has produced a "standard" mandel.hex, from the original by JB, which we're using to test on the CP/M versions of the RC2014, and other systems.
I've done an optimised version, which improves on the Z80 multiply function only. Doing this takes more than a minute off the benchmark, for me, from 4'58" down to 3'48", whilst still producing all the same 396,300 characters as the original.
The original multiply is a shift and add version which is "small" and loops.
The new multiply is stripped from the fast mulu_de table look-up 16_8x8 routine out of the IEEE floating point library I've been writing, and matched up to the Spectrum Next z80n_mulu_32_16x16 multiply. This multiply requires a 512 byte look-up table, containing the high and low bytes of a square table for a 16_8x8 multiply. The 32_16x16 result is calculated as 4 partial 16_8x8 multiplies.
; Fast mulu_16_8x8 using a 512 byte table
; x*y = ((x+y)/2)^2 - ((x-y)/2)^2 <- if x+y is even
; = ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y <- if x+y is odd and x>=y
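The identity in the comment above can be checked with a short Python model. This is a sketch under the assumption that the 512 byte table holds the squares of 0..255 as separate low-byte and high-byte halves (256 bytes each); the table and function names are invented for illustration and the actual layout in the z88dk routine may differ.

```python
# Squares of 0..255, split into low and high bytes: 2 x 256 = 512 bytes.
SQ_LO = bytes((n * n) & 0xFF for n in range(256))
SQ_HI = bytes((n * n) >> 8 for n in range(256))


def square(n):
    """Look up n*n (0 <= n <= 255) from the two byte tables."""
    return SQ_LO[n] | (SQ_HI[n] << 8)


def mulu_16_8x8(x, y):
    """Unsigned 8x8 -> 16 bit multiply using only table look-ups,
    additions and subtractions (quarter-square method)."""
    if x < y:
        x, y = y, x                      # the odd case assumes x >= y
    s, d = x + y, x - y                  # s and d always share parity
    if s & 1:                            # x+y odd
        return square((s - 1) // 2) - square((d - 1) // 2) + y
    return square(s // 2) - square(d // 2)  # x+y even


print(mulu_16_8x8(200, 123))
```

Because (x+y)/2 never exceeds 255 for 8 bit operands, a 256-entry table is enough, and every entry fits in 16 bits.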
lastpixel: defb 0
colorpixel:
ld a,b ; iter count in B -> C
and $1F ; lower five bits only
ld hl,lastpixel ; is the pixel value the same?
cp (hl)
jr z,colorpixel_1 ; yes, skip the colour change
ld (hl),a
ld c,a
ld b,0
ld hl, hsv ; get ANSI color code
add hl, bc
ld a,(hl)
call setcolor
colorpixel_1:
ld e, pixel ; show pixel
ret
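To get a feel for how much serial traffic the colorpixel change above can save, here is a small Python model. It counts characters sent for a row of colour indices, with and without skipping the colour sequence when the colour is unchanged. The 9 character colour sequence length is an assumption chosen to match the "10 characters per pixel" estimate, and the sample row is made up.

```python
def chars_sent(row, skip_repeats, colour_seq_len=9):
    """Count characters transmitted for a row of colour indices.

    colour_seq_len is an assumed length for the ANSI colour sequence;
    each pixel itself costs one more character.
    """
    total = 0
    last = 0                        # mirrors `lastpixel: defb 0`
    for colour in row:
        if not skip_repeats or colour != last:
            total += colour_seq_len # colour sequence is sent
            last = colour
        total += 1                  # the pixel character
    return total


row = [3, 3, 3, 3, 7, 7, 3, 3]      # hypothetical run of colour indices
print(chars_sent(row, skip_repeats=False))
print(chars_sent(row, skip_repeats=True))
```

The saving depends entirely on how long the runs of identical colour are, so the real benefit has to be measured on the actual image.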
This isn't an attempt to jump the shark, just skip around the side of it...
This comment from Bill got me thinking:
Phillip,
Looking over J.B.'s code, I agree the picture has 240 x 150 characters, but the code transmits the ANSI color sequence plus the color pixel for each character, so that's a string of 10 characters per character displayed. The total number of characters transmitted to paint the mandelbrot is roughly 10*240*150 = 360000. At 115200 8o1 it takes 31 seconds, at 57600 8o1 it takes 62.5 seconds. This is why my Z280 at 24MHz CPU clock and 57600 serial baud ran slower than 14.7MHz CPU clock and 115200 serial baud.
Bill
Why even bother sending out the ANSI colour sequence for every pixel when often the current pixel is the same colour as its predecessor?
I'd be interested to know if other folk can confirm comparable results on their systems?
; Results FCPU Original Optimised
; RC2014 CP/M 7.432MHz 4'58" 3'48"
I didn't get involved with this as a benchmark when the thread originally started but thought I'd try it out on a couple of different systems. All of my systems run with ZERO wait states since just one wait state on a Z180 will drop the CPU performance by about 25%. These systems all use a UART-USB bridge which has 320 byte buffers in addition to the ASCI/UART buffer and I wasn't concerned about cleaning up the graphics. I used the MANDEL-ORIGINAL.HEX and MANDEL-FELIPU.HEX files after a CP/M LOAD on the following systems:
MinZ-C - Z180 at 36.864 MHz and 115,200 baud
MinZ-U - Z180 at 36.864 MHz and 230,400 baud
Min-eZ - eZ80 at 50 MHz and 115,200 baud
MANDEL-ORIGINAL: MinZ-C=67 Sec, MinZ-U=53 Sec, Min-eZ=43 Sec
MANDEL-FELIPU: MinZ-C=62 Sec, MinZ-U=47 Sec, Min-eZ=42 Sec
Conclusions:
This is an interesting benchmark since it is both I/O and CPU intensive although the I/O tends to dominate with the above systems. Doubling the baud rate resulted in about a 20-24% reduction in elapsed time while increasing the processor speed approximately 500% only reduced the elapsed time by about 32-36%.
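The percentages in the conclusions can be recomputed from the timings listed above. In this Python sketch the pairing of systems is my reading of the post: MinZ-C versus MinZ-U for the baud rate change, and MinZ-C versus Min-eZ for the processor change.

```python
# Elapsed times in seconds, as reported in the post.
times = {
    "MANDEL-ORIGINAL": {"MinZ-C": 67, "MinZ-U": 53, "Min-eZ": 43},
    "MANDEL-FELIPU":   {"MinZ-C": 62, "MinZ-U": 47, "Min-eZ": 42},
}


def reduction_pct(before, after):
    """Percentage reduction in elapsed time going from `before` to `after`."""
    return 100.0 * (before - after) / before


for name, t in times.items():
    baud = reduction_pct(t["MinZ-C"], t["MinZ-U"])   # 115,200 -> 230,400 baud
    cpu = reduction_pct(t["MinZ-C"], t["Min-eZ"])    # Z180 -> eZ80, same baud
    print(f"{name}: baud doubling {baud:.1f}%, faster CPU {cpu:.1f}%")
```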
Having been involved many years ago with mainframe performance analysis for upgrades, this is a classic case of understanding the overall system requirements. While it's easy to concentrate simply on processor speed or I/O speed, one really needs to look at the entire system and its applications as a whole. I'm sure if I added a very large output buffer and interrupt driven output to the eZ80 then it would come close to the theoretical baud limitation. However, most of my system usage is file oriented and I prefer to use that memory for a large RAMdisk and disk buffers rather than the occasional baud-rate limited application. If these programs were the primary purpose of the system then a console based on a FT232H in FIFO mode could change the throughput to be primarily CPU performance.
On Monday Phillip wrote:
It has been a while and this thread is positively ancient, but...
And then wandering past my mandelbrot code today, I wondered if the runer112 32_16x16 multiply was significantly better than the 512 byte table look-up multiply that I'd previously used to get the best result.