Assembly Mandelbrot


J.B. Langston

Feb 19, 2018, 9:04:00 PM
to RC2014-Z80
I've adapted an assembly-language fixed point Mandelbrot renderer to the RC2014. The combination of fixed point math and assembly language means it's much faster than the BASIC version. It renders a 256x128 image in just over 2.5 minutes with the CPU running at 7.3MHz.  I found the original code on Rosetta Code, converted it to run on CP/M and colorized it.  The attached screenshot was taken using the 6 point windows terminal font with window dimensions of 258x135.  If you don't have a font that has the IBM-compatible block drawing characters, you may need to change the pixel constant to another character such as #.
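The fixed point idea can be sketched in a few lines. This is a minimal Python illustration, not the RC2014 code: the scale factor of 256 (8 fractional bits), the iteration limit and the two sample points are assumptions chosen for the sketch.

```python
SCALE = 256          # assumed: 8 fractional bits, so 1.0 is stored as 256
MAX_ITER = 30        # assumed iteration limit for this sketch

def to_fixed(value):
    """Convert a float to the scaled-integer representation."""
    return int(round(value * SCALE))

def fixed_mul(a, b):
    """Multiply two scaled integers and rescale the product."""
    return (a * b) // SCALE

def iterations(cx, cy):
    """Count iterations of z = z*z + c before |z|^2 exceeds 4, integers only."""
    zx = zy = 0
    limit = 4 * SCALE
    for i in range(MAX_ITER):
        zx2 = fixed_mul(zx, zx)
        zy2 = fixed_mul(zy, zy)
        if zx2 + zy2 > limit:
            return i
        zx, zy = zx2 - zy2 + cx, 2 * fixed_mul(zx, zy) + cy
    return MAX_ITER

inside = iterations(to_fixed(0.0), to_fixed(0.0))    # the origin never escapes
outside = iterations(to_fixed(2.0), to_fixed(2.0))   # far outside the set
```

Every operation in the loop is an integer add, subtract, or multiply, which is why the multiply routine dominates the run time on a CPU with no hardware multiplier.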

Source is at https://gist.github.com/jblang/3b17598ccfa0f7e5cca79ad826a399a9. The attached mandel.out is a CP/M executable that can be uploaded via xmodem to mandel.com and run.

Enjoy!
mandel.png
mandel.out

J.B. Langston

Feb 19, 2018, 9:48:38 PM
to RC2014-Z80
Tweaked parameters for better zoom and aspect ratio. This one takes just over 4 minutes to render a 258x161 image. Source has been updated on github.
mandel.out
mandel.png

Bill Shen

May 17, 2018, 8:12:27 PM
to RC2014-Z80
Took me some fumbling around in the Tera Term menus to figure out the proper setup.  I used a terminal size of 300 x 200 and a terminal font size of 5 to get this display.  It took 2 min and 40 seconds to draw that on the 12MHz Z280.
mandelbrot_pixel.jpg

J.B. Langston

May 17, 2018, 9:44:00 PM
to RC2014-Z80
Smokin! :) Z80 at 10MHz takes 3:50.

Scott Lawrence

May 18, 2018, 9:29:36 AM
to rc201...@googlegroups.com
awesome! :D


Bill Shen

May 18, 2018, 3:02:45 PM
to RC2014-Z80
Thanks for the run time of the 10MHz Z80.  The 12MHz Z280 was touted as equivalent to a 16-20MHz Z80: 16MHz equivalent when running Z80 code without modification, as I did here.  In this particular example it is only 14.3MHz equivalent because DRAM refresh takes away some performance and because I don't have the burst mode memory access turned on.  To reach 20MHz Z80 equivalent, I think the code needs to be rewritten to take advantage of Z280-specific instructions.
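The "equivalent clock" comparison can be reproduced from the two run times quoted in the thread. This is a rough Python sketch that assumes run time scales inversely with clock speed and ignores any serial-port limit, which is a simplification.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds run time to seconds."""
    return minutes * 60 + seconds

z80_clock_mhz = 10.0
z80_time = to_seconds(3, 50)    # Z80 at 10MHz, as reported above
z280_time = to_seconds(2, 40)   # Z280 run, as reported above

# If run time scales inversely with clock, the Z280 behaves like a Z80 at:
equivalent_mhz = z80_clock_mhz * z80_time / z280_time
print(f"{equivalent_mhz:.1f} MHz Z80 equivalent")
```

With these rounded run times the result lands a little above 14MHz, in line with the figure quoted.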

phillip.stevens

May 21, 2018, 9:13:55 AM
to RC2014-Z80
I took a fork of your gist code, and converted it to assemble to the Z88DK CP/M target.
Running on a Z180, I get about 1 minute 40 seconds.

Unfortunately, I don't have a nice terminal output, because I didn't want to change the actual code for comparison, and I'm not sure how to set it up correctly on Ubuntu. I'm using `minicom -c on` inside gnome-terminal, to get colours. But something is not quite right.


I've also added a Z180 specific multiplication routine to the gist, which I'll add to the comparison soon too.


Bill Shen

May 21, 2018, 9:14:58 PM
to RC2014-Z80
The Z180 at 36MHz is impressive.  Z280 has had a slow roll out and was plagued by bugs early on, so by the time it was a real product, it was eclipsed by other processors including the Z180.  Z280 never was very popular which is a shame because it has some real modern features.  I'm certainly having fun playing with Z280.

I didn't find Z280 on the targets list of Z88DK.  Is anyone working on porting Z88DK to Z280?

  Bill

phillip.stevens

May 21, 2018, 9:30:35 PM
to RC2014-Z80
The Z180 at 36MHz is impressive.  Z280 has had a slow roll out and was plagued by bugs early on, so by the time it was a real product, it was eclipsed by other processors including the Z180.  Z280 never was very popular which is a shame because it has some real modern features.  I'm certainly having fun playing with Z280.

The Z280 is very impressive. I was also keen to build something using it.
But some ebay searches produced very little in terms of NOS, making it a bit of a gamble at the time.
The Z180 just seemed like an easier build.
 
I didn't find Z280 on the targets list of Z88DK.  Is anyone working on porting Z88DK to Z280?

I don't think so, based on the limited number of available platforms that use the CPU.
Having said that, Z180 support was also pretty marginal until someone started agitating to get it incorporated.
The z88dk guys are very supportive of anything Zilog.
So if you've a platform (which you have), then just raise an issue for "Z280 - support".

The ZXNext is the current platform for development focus.
Lots of work getting it 100% easy for new Sinclair ZXNext game developers.

cheers, Phillip

Bill Shen

May 21, 2018, 11:16:39 PM
to RC2014-Z80
Retrobrewcomputers forum member 'lowen' pointed me to a source of Z280 on UTSource.  They are used parts from China and seemed a gamble at the time, but I've purchased 3 lots of them and they all worked fine.

OK, I'll stir the pot and see if anyone is interested in porting Z88DK to Z280.  I have a standalone version of Z280 (TinyZ280) that accepts 16meg SIMM memory.  I could never find a good use for it other than as a big RAMdrive.  Perhaps the Z88DK folks can think of a way of using it.  I think there is interest in the Z280 even though it was never popular in its day.  I built up 9 TinyZ280 boards, gave 2 of them away, kept 3 for my own use, and decided to auction off the remaining 4 on eBay to recover some of the development cost.  I was hoping for $50, but the last 2 sold for close to $250 each.  I was astonished!

  Bill

phillip.stevens

May 26, 2018, 8:43:35 AM
to RC2014-Z80
Not sure if it is getting to 'jumping the shark' stage with this mandelbrot stuff, but I've added a signed 16 bit multiplication routine using the z180 8x8 hardware multiplier to the routine, and the run time (with nothing else changed) is now 1 minute and 1 second.

The scaling is not quite right, but at least I've sorted the terminal issues and now the output looks very nice.


I read that the Z280 has signed 16 bit hardware multiply functions. I guess the final step would be to convert the `mult16` function to a single Z280 op code, and see what happens then...

Bill Shen

May 26, 2018, 5:16:15 PM
to RC2014-Z80
Z280 does have a 16x16 signed multiply instruction: MULTW, so the mul_16 routine is reduced to:

mul_16:
; operand sources are BC and DE
    push de
    pop hl
    db 0edh,0c2h        ; op code for signed multiply of HL and BC
;    multw hl,bc
    ret


With the 16x16 signed multiply instruction, the run time for the 24MHz Z280 with 57.6K baud console is reduced from 2 min 40 sec to 1 min 44 sec. 

HOWEVER, as I stared at the screen, I noticed the bulk of the time is taken up by the program painting the mandelbrot, not calculating the next pixel.  In other words, the real bottleneck is the speed of the serial port, not CPU performance.  So I dropped in a 14.7456MHz oscillator and raised the serial clock to 115200 (on the Z280RC the serial clock is derived from the CPU clock, so I need a 14.7456MHz CPU clock in order to get 115200 serial baud).  I ran the original test again and the time is 3 min 4 seconds (it should be 4 min 20 seconds if the bottleneck were the CPU), and with the mul_16 routine based on the Z280 16x16 signed multiply instruction, the run time is now 1 min 39 sec, which is faster than the 24MHz CPU clock/57.6K serial clock!  The existing serial port setup is 8-bit data, odd parity, 1 stop; I can get another 10% improvement just by having 8-bit data, no parity and 1 stop.  So indeed we've "jumped the shark"!
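The two estimates in this paragraph can be checked with a short Python sketch. It assumes that purely CPU-bound work scales with the clock ratio, and that each serial character costs one start bit, eight data bits, an optional parity bit and one stop bit.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds run time to seconds."""
    return minutes * 60 + seconds

fast_clock_mhz = 24.0
slow_clock_mhz = 14.7456
time_at_fast_clock = to_seconds(2, 40)   # original code, 24MHz, 57.6K baud

# Purely CPU-bound work would slow down in proportion to the clock ratio.
predicted = time_at_fast_clock * fast_clock_mhz / slow_clock_mhz
measured = to_seconds(3, 4)              # original code, 14.7456MHz, 115200 baud

# Bits on the wire per character: start + 8 data + parity + stop.
bits_8o1 = 1 + 8 + 1 + 1
bits_8n1 = 1 + 8 + 0 + 1
serial_saving = 1 - bits_8n1 / bits_8o1  # fraction of serial time saved
```

The prediction comes out at roughly 260 seconds (4 min 20 sec), well above the measured 184 seconds, and dropping the parity bit removes one bit in eleven, or about 9% of the time spent on the wire.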
  Bill

phillip.stevens

May 27, 2018, 6:50:52 AM
to RC2014-Z80
Not sure that I follow your logic there...

I would do the following to calculate the theoretical minimum time to display the Mandelbrot character matrix at 115200 baud.
  • A line as configured by J.B. is 3 x 80 + CR + LF = 242 characters
  • And there are 10 x 60 / 4 lines = 150 lines
  • Therefore a total of 36,300 characters need to be transmitted.
  • At the standard serial rate of 115200 baud 8N1, each character takes 10 bits on the wire (start + 8 data + stop), or 11,520 characters per second
  • Therefore the theoretical minimum time to transmit the character matrix is 3.15 seconds, plus or minus.
If that calculation is correct, it is well short of the 4 minutes required for the RC2014.
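The estimate in the list above can be written out as a small Python sketch. It counts start and stop bits, so it assumes 10 bits per character for 8N1 framing; the function name is just for illustration.

```python
def transmit_seconds(characters, baud, bits_per_character=10):
    """Time on the wire; 10 bits per character is 8N1 (start + 8 data + stop)."""
    return characters * bits_per_character / baud

chars_per_line = 3 * 80 + 2            # 240 pixels plus CR and LF
lines = 10 * 60 // 4                   # 150 lines
total_chars = chars_per_line * lines   # 36,300 characters
floor_seconds = transmit_seconds(total_chars, 115200)
```

This counts only one character per displayed pixel, so it is a lower bound on the serial time for the bare character matrix.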

Jumping on the shark, and riding it, I've unrolled the multiplication and now have the following numbers.

Results        FCPU     Original    Optimised     z180 mlt
RC2014        7.432MHz    4'51"       4'10"
YAZ180       36.864MHz    1'40"       1'24"        1'00"


Next stop... using an APU for the calculation.

Phillip

Bill Shen

May 27, 2018, 12:31:04 PM
to RC2014-Z80
Phillip,
Looking over J.B.'s code, I agree the picture has 240 x 150 characters, but the code transmits the ANSI color sequence plus the color pixel for each character so that's a string of 10 characters per character displayed.  The total number of characters transmitted to paint the mandelbrot is roughly 10*240*150 = 360000.  At 115200 8o1 it takes 31 seconds, at 57600 8o1 it takes 62.5 seconds.  This is why my Z280 at 24MHz CPU clock and 57600 serial baud ran slower than 14.7MHz CPU clock and 115200 serial baud.
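The serial floor for the full colour stream can be sketched in Python as follows. The constant names are hypothetical. Note that the 31 and 62.5 second figures correspond to 10 bits per character; if the parity bit of 8O1 is counted as well (11 bits per character), the floor is somewhat higher.

```python
BITS_8O1 = 11   # start + 8 data + odd parity + stop
BITS_8N1 = 10   # start + 8 data + stop

def serial_floor(characters, baud, bits_per_character):
    """Minimum seconds needed just to clock the characters out."""
    return characters * bits_per_character / baud

total_chars = 10 * 240 * 150   # about 10 characters sent per displayed pixel

for baud in (115200, 57600):
    t10 = serial_floor(total_chars, baud, BITS_8N1)
    t11 = serial_floor(total_chars, baud, BITS_8O1)
    print(f"{baud} baud: {t10:.1f} s at 10 bits/char, {t11:.1f} s at 11 bits/char")
```

Either way, halving the baud rate doubles the floor, which is consistent with the slower result at 57600 baud despite the faster CPU clock.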
  Bill

phillip.stevens

May 27, 2018, 6:46:41 PM
to RC2014-Z80

Looking over J.B.'s code, I agree the picture has 240 x 150 characters, but the code transmits the ANSI color sequence plus the color pixel for each character so that's a string of 10 characters per character displayed.  The total number of characters transmitted to paint the mandelbrot is roughly 10*240*150 = 360000.  At 115200 8o1 it takes 31 seconds, at 57600 8o1 it takes 62.5 seconds.

Bill,
of course. You're 100% right. Thank you.
I was completely forgetting that the colour settings had to be transmitted for each pixel.
 
This is why my Z280 at 24MHz CPU clock and 57600 serial baud ran slower than 14.7MHz CPU clock and 115200 serial baud.
  Bill
 
Jumping on the shark, and riding it, I've unrolled the multiplication and now have the following numbers.

Results        FCPU     Original    Optimised     z180 mlt
RC2014        7.432MHz    4'51"       4'10"
YAZ180       36.864MHz    1'40"       1'24"        1'00"


Next stop... using an APU for the calculation.

That also explains why my Z180 implementation is so fast.
It is nothing to do with the CPU at all.

But rather the relative speed (200% of theoretical) is to do with using interrupt driven buffered transmit routines.
That means that the colour control sequence can be hammered into the Tx buffer, and then the CPU can get on with calculations without waiting for the relatively slow serial to be finished.
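The buffered-transmit idea can be illustrated with a small Python model of a ring buffer. This is only a sketch: the class, its power-of-two size and the example escape sequence are assumptions for illustration, and the real routines are Z80 interrupt handlers rather than Python methods.

```python
class RingBuffer:
    """Fixed-size FIFO; the size is an assumed power of two so indices wrap with a mask."""

    def __init__(self, size=256):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.data = bytearray(size)
        self.mask = size - 1
        self.head = 0      # next slot the producer writes
        self.tail = 0      # next slot the consumer reads
        self.count = 0

    def put(self, byte):
        """Called by the main program; returns False when the buffer is full."""
        if self.count > self.mask:
            return False
        self.data[self.head] = byte
        self.head = (self.head + 1) & self.mask
        self.count += 1
        return True

    def get(self):
        """Called from the (simulated) transmit interrupt; None when empty."""
        if self.count == 0:
            return None
        byte = self.data[self.tail]
        self.tail = (self.tail + 1) & self.mask
        self.count -= 1
        return byte

tx = RingBuffer(size=16)
for b in b"\x1b[31m#":            # queue a colour sequence and a pixel at once
    tx.put(b)
sent = bytes(iter(tx.get, None))  # the "interrupt handler" drains it later
```

The producer returns as soon as the bytes are queued, so calculation of the next pixel can overlap with transmission as long as the buffer does not fill.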

I was mistakenly thinking that the calculation was not baud rate bound, and therefore there had to be much more to be gained from the calculation side.
So my final step is to balance the work between transmit and calculation operations, and move the calculation off to the APU, leaving the CPU just to coordinate between the two.

Cheers, Phillip


Bill Shen

May 27, 2018, 8:48:04 PM
to RC2014-Z80
Phillip,
My console is still running in polling mode.  Interrupt in CP/M is something I haven't figured out how to do.  It is really helpful as you've demonstrated in your benchmark.
  Bill

phillip.stevens

May 27, 2018, 10:53:28 PM
to RC2014-Z80
Bill,
I’ll open a new thread on interrupt driven serial.
I’ve got working code for ACIA, SIO/2, and z180 ASCI ports in z88dk and for two CP/M implementations.
Essentially it is the same code each time, just with variation depending on interrupt mechanics.
It might also be useful for z280 too.
Cheers, Phillip

Bill Shen

May 28, 2018, 12:04:43 AM
to RC2014-Z80
Phillip,
Thank you, that'll be most helpful.
  Bill

phillip.stevens

May 28, 2018, 7:36:14 AM
to RC2014-Z80
Bill,

I've written a blog post entitled Three Rings for the Z80, which provides code references and the "why?" on decisions made. Perhaps it will be useful to have a look at.

I'm sure you'll see that there is really only one tool, re-worked to suit the specific hardware requirement each time. But, over the past few years I think the tool has become fairly sharp (missing nothing it needs, and containing nothing it doesn't).

Cheers, Phillip

Bill Shen

May 28, 2018, 2:42:07 PM
to RC2014-Z80
Phillip,
Thanks for the link.  I didn't know how to handle the receiver getting data at a different rate than CP/M's console input calls.  So implementing a FIFO in software is the way to handle the difference in data rate.  The same idea applies to the transmitter as well.

I'll finish the prototype for LCD display and then tackle the BIOS upgrade for CP/M.
  Bill

Tony Nicholson

May 29, 2018, 5:28:14 AM
to RC2014-Z80

On Tuesday, May 22, 2018 at 1:16:39 PM UTC+10, Bill Shen wrote:
Retrobrewcomputers forum member 'lowen' pointed me to a source of Z280 on UTSource.  They are used parts from China and seemed a gamble at the time, but I've purchased 3 lots of them and they all worked fine.

I've also ordered some Z8028012VSC chips (12MHz Z280 in PLCC-68 package) from UTSource to breadboard and play with while I wait for Phillip's YAZ180 boards to become available.  Once I get my old Z80 stuff running, I'll definitely be contributing to porting efforts for Z88DK etc.

Tony

Bill Shen

May 29, 2018, 7:53:35 AM
to RC2014-Z80
Welcome, fellow enthusiast of the black sheep of the Zx80 family! 

Be sure to take a close look at the UART bootstrap feature of the Z280.  It will enable you to boot up and explore quickly without the hassle of ROM programming.  Using that capability I've gone through the whole Z280 development cycle without burning a single ROM.  There are a couple of long discussions on Z280 on retrobrewcomputers:
https://www.retrobrewcomputers.org/forum/index.php?t=msg&th=93&start=0&
https://www.retrobrewcomputers.org/forum/index.php?t=msg&th=255&start=0&
As well as the associated design files in the retrobrew wiki pages.
https://www.retrobrewcomputers.org/doku.php?id=boards:sbc:cpu280:start
https://www.retrobrewcomputers.org/doku.php?id=builderpages:plasmo:tinyz280:final_step

Steve Cousins' SC Monitor runs well on the Z280.  I ported it to my version of the Z280 SBC, which is fairly generic.  CP/M 2.2 and CP/M 3 have been ported successfully to Z280.  I understand UZI280 is running on Tilmann Reh's Z280, but I haven't got around to porting that to my version.  There is quite a bit of interest in this obscure processor.
  Bill

Spencer Owen

May 29, 2018, 8:04:57 AM
to rc201...@googlegroups.com
On 26 May 2018 at 13:43, phillip.stevens <phillip...@gmail.com> wrote
Not sure if it is getting to 'jumping the shark' stage with this mandelbrot stuff...

Last weekend I was lucky enough to see Jim Austin's computer collection, and his newest machine, which was Stephen Hawking's last personal supercomputer.  Although it's over 3 years old now, it's still probably the most powerful computer in private hands.  In addition to calculating Pi to 500,000,000 decimal places (96 seconds!) we also got it to draw a Mandelbrot.  I couldn't tell you what the resolution was, but it was very very high.  And we could zoom in at pretty much real time. A loooooong way!

So I think that the shark has well and truly been jumped!

Spencer

phillip.stevens

Jun 16, 2018, 9:58:34 AM
to RC2014-Z80
So after revising my APU implementation, I'm hitching a saddle onto the shark, and submitting the new gist of mandelbrot code. Note, I'm not providing timing for the Am9511A APU, because it doesn't use the same optimisations (eg fixed SCALE), and does the full multiply and divide calculation in either 32 bit fixed or 32 bit floating point. So, the comparison isn't fair.

Anyway, the Am9511A is working as hoped, using two interrupt driven ring buffers. One for the commands, and another for 24 bit operand pointers. This means that the APU can access all of the Z180 1MByte memory space to load and unload operands, and can do 255 APU commands autonomously. The CPU execution thread only needs to get involved if there is a decision path to be followed.

If anyone is interested in reading up on the Am9511A, there is a cache of documents scraped up from all over the Internet now in the z88dk techdocs repository, including an application note from the UK Atomic Energy Authority at HARWELL.

Giddy up, fishy...

Phillip

J.B. Langston

Jun 16, 2018, 3:27:14 PM
to RC2014-Z80
Well, if you're jumping the shark, then I'm on a competing shark and we're doing synchronized shark jumping. Not sure if you were following my TMS9918 thread but I've modified my mandelbrot renderer to use it.  Now if you wanted to take it to really silly extremes, you could combine the two...
IMG_1069.jpg

Bill Shen

Jun 16, 2018, 10:49:51 PM
to RC2014-Z80
I want to party with you shark jumpers.  I've placed an order for a few TMS9918; I will host it on the protoRC board using a CPLD instead of the TTL logic and draw the mandelbrot using a Z280 SBC with interrupts on...  The shark may win someday.

Watched the youtube replay of the shark jump episode.  I'm old enough to have watched the episode as it first came out, but I don't remember it at all--goes to show that shark jumping may not be memorable and soon forgotten.
  Bill

Mark T

Jun 16, 2018, 11:02:51 PM
to RC2014-Z80
Make sure to get the TMS9918A version or you'll be missing the graphics mode 2. Also TMS9918 without the A might have CAS timing issues, although this might not affect the sram version.

Bill Shen

Jun 16, 2018, 11:20:41 PM
to RC2014-Z80
OK, TMS9918ANL is what I've ordered.

Phillip Stevens

Sep 1, 2019, 8:49:35 AM
to RC2014-Z80
Following up on this from another thread, for the RC2014 we've been doing some further mandelbrot shark jumping...

Bill Shen has produced a "standard" mandel.hex, from the original by JB, which we're using to test on the CP/M versions of the RC2014, and other systems.

I've done an optimised version, which improves on the Z80 multiply function only. Doing this takes more than a minute off the benchmark, for me, from 4'58" down to 3'48", whilst still producing all the same 396,300 characters as the original.

The original multiply is a shift and add version which is "small" and loops.

The new multiply is stripped from the fast mulu_de table look-up 16_8x8 routine out of the IEEE floating point library I've been writing, and matched up to the Spectrum Next z80n_mulu_32_16x16 multiply. This multiply requires a 512 Byte look-up table, containing the high and low bytes of a square table for a 16_8x8 multiply. The 32_16x16 result is calculated as 4 partial 16_8x8 multiplies.
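How a 32_16x16 result is assembled from four 16_8x8 partial products can be sketched in Python. The function names echo the routine names mentioned above, but the bodies are illustrative stand-ins (the 8x8 primitive here just uses Python's multiply), not the z88dk implementation.

```python
def mulu_16_8x8(x, y):
    """Stand-in for the 8x8 -> 16 bit primitive (table look-up or hardware)."""
    assert 0 <= x <= 0xFF and 0 <= y <= 0xFF
    return x * y

def mulu_32_16x16(a, b):
    """Unsigned 16x16 -> 32 bit multiply from four 8x8 partial products."""
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    low = mulu_16_8x8(a_lo, b_lo)
    mid = mulu_16_8x8(a_hi, b_lo) + mulu_16_8x8(a_lo, b_hi)
    high = mulu_16_8x8(a_hi, b_hi)
    # Each partial product is shifted to the byte position it represents.
    return (high << 16) + (mid << 8) + low
```

On an 8-bit CPU the shifts are free, since they only decide which result bytes each partial product is added into.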

; Fast mulu_16_8x8 using a 512 byte table
; x*y = ((x+y)/2)^2 - ((x-y)/2)^2           <- if x+y is even
;     = ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y   <- if x+y is odd and x>=y

; Results          FCPU         Original    Optimised   z180 mlt
; RC2014 CP/M    7.432MHz         4'58"       3'48"
; YAZ180 CP/M   36.864MHz         1'06"         58"         46"

I've attached the code and HEX file, from both the original (Bill's) and the optimised version, in case anyone wants to try it, and see how much difference it makes on their RC2014.

Phillip
mandel-feilipu.asm
mandel-feilipu.hex
mandel-original.asm
mandel-original.hex

J.B. Langston

Sep 14, 2019, 3:24:18 PM
to RC2014-Z80
I'd like to incorporate your table-based multiplication into my TMS9918 mandelbrot generator.  It appears that l_z80_mulu_de plus the associated tables is the code that you added; however, I'm not sure how to integrate this with my original code given the significant changes introduced for z88dk. I can probably figure it out, but if you have any pointers to save me time, I'd appreciate it.

Phillip Stevens

Sep 14, 2019, 9:57:54 PM
to RC2014-Z80
Sure. It should be pretty straight forward to include.

Firstly, credit where it is due. I didn't write the fast mulu_de from scratch. I cribbed it from the CPC Wiki, with a few modifications, including zero detection and early exit, and using the DE registers to emulate the Spectrum Next mul de instruction.

The CPC Wiki has quite a few faster options for multiply, but the size of the table required starts to get unreasonable. I think 512 Bytes is an acceptable cost, given the outcome.

The math32 IEEE floating point library is written based on 16_8x8 multiplies, the same as the z180 and z80n integer libraries, using the natural size of the z180 mlt nn and the z80n mul de instructions. The 16_8x8 unit seems to work well on the z80 too, and it helps to keep the algorithms consistent across the hardware versions.

The 16_8x8 table multiply is called with the bytes to be multiplied in D and E, and the result is returned in DE. In my version HL is preserved, but (unlike the unrolled multiply version) AF and BC are not preserved.

Initially, I also preserved AF and BC for consistency with the unrolled multiply, but because this cost so much time I decided that the calling multiply function 32_16x16 should do the preservation, so that it is done only when needed. HL is always preserved by 16_8x8 because the ex de,hl instruction is used extensively by calling functions.

This means that the standard (z180, z80n) mulu_32_16x16 must be slightly modified to do preservation of AF and BC when required.

; Fast mulu_16_8x8 using a 512 byte table
; x*y = ((x+y)/2)^2 - ((x-y)/2)^2           <- if x+y is even
;     = ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y   <- if x+y is odd and x>=y

Overall, the routine uses the equation above, and has two tails depending on whether x+y is even or odd. x+y is shifted and then used to look up the table of squared numbers (two 256 Byte tables). The two tables (comprising the lower and upper bytes) must align on a page boundary, and be contiguous.
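The table look-up can be modelled in Python to check the identity over all 8-bit inputs. This is a sketch under assumptions: the table names SQ_LO and SQ_HI are hypothetical, operands are swapped so that x >= y, and the zero detection and early exit mentioned above are not modelled.

```python
# Two 256-byte tables holding the low and high bytes of n*n for n = 0..255.
SQ_LO = bytes((n * n) & 0xFF for n in range(256))
SQ_HI = bytes((n * n) >> 8 for n in range(256))

def square(n):
    """Rebuild the 16-bit square of n from the two byte tables."""
    return SQ_LO[n] | (SQ_HI[n] << 8)

def mulu_16_8x8(x, y):
    """Unsigned 8x8 -> 16 bit multiply using the difference of two squares."""
    if x < y:
        x, y = y, x                 # ensure x >= y so x - y is not negative
    total = x + y
    diff = x - y
    # Shifting right by one gives (n)/2 for even n and (n-1)/2 for odd n.
    product = square(total >> 1) - square(diff >> 1)
    if total & 1:
        product += y                # odd tail: add y back
    return product
```

Because x + y and x - y always have the same parity, one parity test selects the tail for both look-ups.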

Otherwise it is a straightforward substitution for the "small" looped multiply.

I've cut the z180, and Am9511A baggage out of my code, and reattached it here.
Probably it is much more recognisable for you as your program in this format.

The z88dk incantation for the RC2014 is:

> zcc +rc2014 -subtype=cpm -v --list -m mandel-feilipu.asm -o mandel
> appmake +glue --ihex --clean -b mandel -c mandel

This gets all the listings and is verbose. But verbosity can be omitted by just doing this.

> zcc +rc2014 -subtype=cpm -m mandel-feilipu.asm -o mandel
> appmake +glue --ihex --clean -b mandel -c mandel

I've attached the source, bin and hex files again.

; Upload using XMODEM; mandel__.bin -> mandel.com
; Or upload using hexload or SCM; mandel__.ihx -> mandel.hex
; then use LOAD or MLOAD: mandel.hex -> mandel.com

Cheers, Phillip

mandel-feilipu.asm
mandel__.ihx
mandel__.bin

David Gilbert

Nov 30, 2019, 3:09:17 AM
to RC2014-Z80
This isn't an attempt to jump the shark, just skip around the side of it...

This comment from Bill got me thinking:


Phillip,
Looking over J.B.'s code, I agree the picture has 240 x 150 characters, but the code transmits the ANSI color sequence plus the color pixel for each character so that's a string of 10 characters per character displayed.  The total number of characters transmitted to paint the mandelbrot is roughly 10*240*150 = 360000.  At 115200 8o1 it takes 31 seconds, at 57600 8o1 it takes 62.5 seconds.  This is why my Z280 at 24MHz CPU clock and 57600 serial baud ran slower than 14.7MHz CPU clock and 115200 serial baud.
  Bill

Why even bother sending out the ANSI colour sequence for every pixel when often the current pixel is the same colour as its predecessor?

So, taking Phillip's gist as a starting point, I made a couple of changes - first, I've added an additional variable to the data_user section:

lastpixel:      defb    0

Then I modified the colorpixel subroutine to check the value of the new pixel against the last pixel:

colorpixel:
        ld      a,b                     ; iter count in B -> C
        and     $1F                     ; lower five bits only
        ld      hl,lastpixel            ; is the pixel value the same?
        cp      (hl)
        jr      z,colorpixel_1          ; yes, skip the colour change
        ld      (hl),a                  ; no, remember the new value
        ld      c,a
        ld      b,0
        ld      hl, hsv                 ; get ANSI color code
        add     hl, bc
        ld      a,(hl)
        call    setcolor
colorpixel_1:
        ld      e, pixel                ; show pixel
        ret
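The effect of the change on the number of characters sent can be sketched in Python. The escape-sequence format used here is an assumption for illustration (a 256-colour SGR sequence); the sequence the assembly program sends via setcolor may differ.

```python
def render(colours, pixel="#"):
    """Emit pixels, sending a colour sequence only when the colour changes."""
    out = []
    last = None                      # plays the role of lastpixel
    for colour in colours:
        if colour != last:
            out.append(f"\x1b[38;5;{colour}m")   # assumed escape format
            last = colour
        out.append(pixel)
    return "".join(out)

row = [4, 4, 4, 9, 9, 4]
optimised = render(row)
# Cost if a colour sequence were sent before every pixel instead.
naive_length = sum(len(f"\x1b[38;5;{c}m") + 1 for c in row)
```

For this short row the optimised stream carries three colour sequences instead of six; the saving on a real image depends on how long the runs of equal colour are.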


With the code recompiled with Z180 optimisation enabled, I've (roughly) timed the result at 51-52 seconds on a stock SC126 running RomWBW (so 38400 baud rate). This is (IMHO) a big improvement on the 2'15" that I measured on the same system when the ANSI colour sequences are spat out for every pixel.

I'd be interested to know if other folk can confirm comparable results on their systems?

David


 

Phillip Stevens

Nov 30, 2019, 6:39:00 AM
to RC2014-Z80
David Gilbert wrote:
This isn't an attempt to jump the shark, just skip around the side of it...

Shark jumping is the whole point of life...
 

This comment from Bill got me thinking:
Phillip,
Looking over J.B.'s code, I agree the picture has 240 x 150 characters, but the code transmits the ANSI color sequence plus the color pixel for each character so that's a string of 10 characters per character displayed.  The total number of characters transmitted to paint the mandelbrot is roughly 10*240*150 = 360000.  At 115200 8o1 it takes 31 seconds, at 57600 8o1 it takes 62.5 seconds.  This is why my Z280 at 24MHz CPU clock and 57600 serial baud ran slower than 14.7MHz CPU clock and 115200 serial baud.
  Bill

Why even bother sending out the ANSI colour sequence for every pixel when often the current pixel is the same colour as its predecessor?

You're absolutely right, there IS no point to send out all of these colour sequences, unless you're trying to provide exactly comparable results across the serial interfaces of multiple machine types. So the original intent of the exercise was to transmit exactly the same number of characters, and see whether interrupt driven I/O was faster than polling (or not) and where the crossover between CPU speed and I/O hardware (ACIA, SIO, etc) lies. But, as that shark has been well and truly jumped...
 
I'd be interested to know if other folk can confirm comparable results on their systems?

but, perhaps there's an "optimised Mandelbrot" shark waiting off the foreshore?
Might have a look at this soon.

p.

Dave

Nov 30, 2019, 1:57:26 PM11/30/19
to RC2014-Z80
Looks like I've found a little challenge :-D I was looking for a program to write for the Z80 anyway!

J.B. Langston

May 27, 2020, 9:15:03 PM5/27/20
to RC2014-Z80
Now that I have my own Z180 to play with, I finally got motivated to merge your optimizations into my TMS9918 mandelbrot, and all I've got to say is... holy moly!

RC2014 Z80 @ 7.3726MHz, RomWBW: 2 minutes 31.9 seconds (using multiply table)
SC126 Z180 @ 36.864MHz, RomWBW: 18.3 seconds (using mlt instruction)


Resolution is 256x192. The image is identical to the one I previously shared in this thread.

I will probably eventually do further optimization to use solid guessing, and while I'm at it use a smarter coloring algorithm so the color clash isn't quite so bad.

Phillip Stevens

Jun 14, 2021, 7:10:55 AM6/14/21
to RC2014-Z80
On Sunday, 1 September 2019 Phillip  wrote:
Following up on this from another thread, for the RC2014 we've been doing some further mandelbrot shark jumping...

Bill Shen has produced a "standard" mandel.hex, from the original by JB, which we're using to test on the CP/M versions of the RC2014, and other systems.

I've done an optimised version, which improves on the Z80 multiply function only. Doing this takes more than a minute off the benchmark, for me, from 4'58" down to 3'48", whilst still producing all the same 396,300 characters as the original.

The original multiply is a shift and add version which is "small" and loops.

The new multiply is stripped from the fast mulu_de table look-up 16_8x8 routine out of the IEEE floating point library I've been writing, and matched up to the Spectrum Next z80n_mulu_32_16x16 multiply. This multiply requires a 512 byte look-up table, containing the high and low bytes of a square table for a 16_8x8 multiply. The 32_16x16 result is calculated as 4 partial 16_8x8 multiplies.
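As an illustration of the "4 partial multiplies" step only, here is a minimal Python sketch of building an unsigned 32-bit product from four 8x8 products. The names mul8x8 and mul16x16 are placeholders, not the routine names from the library, and the sign handling needed by the fixed point Mandelbrot code is left out.

```python
def mul8x8(a, b):
    """Stand-in for any 16_8x8 primitive (table look-up, mlt, ...)."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def mul16x16(x, y):
    """Unsigned 16x16 -> 32 bit product from four 8x8 partial products."""
    xl, xh = x & 0xFF, x >> 8
    yl, yh = y & 0xFF, y >> 8
    low = mul8x8(xl, yl)                      # weight 2^0
    mid = mul8x8(xl, yh) + mul8x8(xh, yl)     # weight 2^8
    high = mul8x8(xh, yh)                     # weight 2^16
    return low + (mid << 8) + (high << 16)


print(hex(mul16x16(0x1234, 0x5678)))
```

On the Z80 the same sum is carried out with 8 and 16 bit adds and carries rather than big integers, which is where most of the hand optimisation effort goes.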

; Fast mulu_16_8x8 using a 512 byte table
; x*y = ((x+y)/2)^2 - ((x-y)/2)^2           <- if x+y is even 
;     = ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y   <- if x+y is odd and x>=y
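The identity in the comment above can be checked with a short Python sketch. This is only a model under my own assumptions: the table is taken to hold n^2 for n = 0..255, stored as 256 low bytes and 256 high bytes (512 bytes in total), and the names SQR_LO, SQR_HI and mulu_8x8 are illustrative rather than the labels used in the library.

```python
# 512 byte square table: low and high bytes of n*n for n = 0..255
SQR_LO = bytes((n * n) & 0xFF for n in range(256))
SQR_HI = bytes((n * n) >> 8 for n in range(256))


def sqr(n):
    """Look up n*n (16 bits) from the two byte tables."""
    return SQR_LO[n] | (SQR_HI[n] << 8)


def mulu_8x8(x, y):
    """Unsigned 8x8 -> 16 bit multiply using the quarter-square identity."""
    if x < y:
        x, y = y, x                   # ensure x >= y so indices stay >= 0
    s, d = x + y, x - y
    if s & 1 == 0:                    # x+y even
        return sqr(s >> 1) - sqr(d >> 1)
    return sqr((s - 1) >> 1) - sqr((d - 1) >> 1) + y   # x+y odd


# Exhaustive check against the built-in multiply
assert all(mulu_8x8(x, y) == x * y for x in range(256) for y in range(256))
```

Since x and y are at most 255, the index (x+y)/2 never exceeds 255 and the squares fit in 16 bits, which is why a table of 256 two-byte entries is enough.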

; Results          FCPU         Original    Optimised
; RC2014 CP/M    7.432MHz         4'58"       3'48"

It has been a while and this thread is positively ancient, but...

Recently I was reworking some of the Z80 specific math32 floating point library code in z88dk, considering whether abandoning the 16_8x8 multiply model (suited for the Z180 and Z80N) would help Z80 performance.
I found a couple of unrolled multiply routines that have now been used to get a 37% improvement in multiply performance for the Z80.

And then wandering past my mandelbrot code today, I wondered if the runer112 32_16x16 multiply was significantly better than the 512 byte table look-up multiply that I'd previously used to get the best result.
Well, it turns out that it smokes the previous best case taking only 2'53", down from 3'48".

; To calculate the theoretical minimum time at 115200 baud.
; Normally 10 colour codes, and 1 character per point.
; A line is (3 x 80) x 11 + CR + LF = 2642 characters
; There are 10 x 60 / 4 lines = 150 lines
; Therefore 396,300 characters need to be transmitted.
; Serial rate is 115200 baud or 14,400 8 bit characters per second
; Therefore the theoretical minimum time is 27.52 seconds.
;
; Results          FCPU         Original    Table mlt   Runer mlt
; RC2014 CP/M    7.432MHz         4'58"       3'48"       2'53"
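The arithmetic in the comment block above can be reproduced with a few lines of Python. This is a sketch with names of my own choosing; the 8 bits per character figure is the one used in the comment, and the 10 bit figure (start bit, 8 data bits, stop bit) is shown only as an alternative framing assumption.

```python
CHARS_PER_POINT = 10 + 1           # 10 colour code characters + 1 pixel
POINTS_PER_LINE = 3 * 80
LINES = 10 * 60 // 4


def total_characters():
    """Characters per frame, including CR + LF at the end of each line."""
    return (POINTS_PER_LINE * CHARS_PER_POINT + 2) * LINES


def transmit_seconds(chars, baud, bits_per_char):
    """Time to push chars down a serial link at the given framing."""
    return chars * bits_per_char / baud


chars = total_characters()
print(chars)                                            # 396300
print(round(transmit_seconds(chars, 115200, 8), 2))     # 27.52
print(round(transmit_seconds(chars, 115200, 10), 2))    # with start/stop bits
```

If the link really carries a start and stop bit per character, the floor is a few seconds higher than 27.52, so "excess over the serial minimum" figures shift by the same amount for every system at that baud rate.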

Pretty happy with that useful outcome.

Cheers, Phillip
 

Bill McMullen

Jun 14, 2021, 1:43:32 PM6/14/21
to RC2014-Z80
I didn't get involved with this as a benchmark when the thread originally started but thought I'd try it out on a couple of different systems.  All of my systems run with ZERO wait states since just one wait state on a Z180 will drop the CPU performance by about 25%.  These systems all use a UART-USB bridge which has 320 byte buffers in addition to the ASCI/UART buffer and I wasn't concerned about cleaning up the graphics.  I used the MANDEL-ORIGINAL.HEX and MANDEL-FELIPU.HEX files after a CP/M LOAD on the following systems:

MinZ-C - Z180 at 36.864 MHz and 115,200 baud
MinZ-U - Z180 at 36.864 MHz and 230,400 baud
Min-eZ - eZ80 at 50 MHz and 115,200 baud

MANDEL-ORIGINAL: MinZ-C=67 Sec, MinZ-U=53 Sec, Min-eZ=43 Sec

MANDEL-FELIPU: MinZ-C=62 Sec, MinZ-U=47 Sec, Min-eZ=42 Sec

Conclusions:
This is an interesting benchmark since it is both I/O and CPU intensive although the I/O tends to dominate with the above systems.  Doubling the baud rate resulted in about a 20-24% reduction in elapsed time while increasing the processor speed approximately 500% only reduced the elapsed time by about 32-36%.

Having been involved many years ago with mainframe performance analysis for upgrades, this is a classic case of understanding the overall system requirements.  While it's easy to concentrate simply on processor speed or I/O speed, one really needs to look at the entire system and its applications as a whole.  I'm sure if I added a very large output buffer and interrupt driven output to the eZ80 then it would come close to the theoretical baud limitation.  However most of my system usage is file oriented and I prefer to use that memory for a large RAMdisk and disk buffers rather than the occasional baud-rate limited application.  If these programs were the primary purpose of the system then a console based on a FT232H in FIFO mode could change the throughput to be primarily CPU performance.

Phillip Stevens

Jun 14, 2021, 8:52:49 PM6/14/21
to RC2014-Z80
Bill wrote:
I didn't get involved with this as a benchmark when the thread originally started but thought I'd try it out on a couple of different systems.  All of my systems run with ZERO wait states since just one wait state on a Z180 will drop the CPU performance by about 25%.  These systems all use a UART-USB bridge which has 320 byte buffers in addition to the ASCI/UART buffer and I wasn't concerned about cleaning up the graphics.  I used the MANDEL-ORIGINAL.HEX and MANDEL-FELIPU.HEX files after a CP/M LOAD on the following systems:

Hi Bill
Thanks for playing along. ;-)
If you're interested in trying a Z180 specific version (using the mlt nn instructions), I've attached a HEX file of the Z180 version on the Gist.
 
MinZ-C - Z180 at 36.864 MHz and 115,200 baud
MinZ-U - Z180 at 36.864 MHz and 230,400 baud
Min-eZ - eZ80 at 50 MHz and 115,200 baud

MANDEL-ORIGINAL: MinZ-C=67 Sec, MinZ-U=53 Sec, Min-eZ=43 Sec
MANDEL-FELIPU: MinZ-C=62 Sec, MinZ-U=47 Sec, Min-eZ=42 Sec

I see a hand-timed 46 seconds for the Z180 multiply version (hardware most like your MinZ-C, but with 1 memory wait state).
I'd guess your number using the Z180 multiply will be even faster.

; Results          FCPU         Original    Optimised   z180 mlt
; YAZ180 CP/M   18.432MHz         1'40"       1'24"       1'00"
; YAZ180 CP/M   36.864MHz         1'06"         58"         46"
Conclusions:
This is an interesting benchmark since it is both I/O and CPU intensive although the I/O tends to dominate with the above systems.  Doubling the baud rate resulted in about a 20-24% reduction in elapsed time while increasing the processor speed approximately 500% only reduced the elapsed time by about 32-36%.

I think this benchmark needs to consider the "excess" time above the serial minimum time. For 115,200 baud it is 27 seconds, but for the 230,400 system it would be only 13 seconds.
So your benchmark shows that the results are fairly linear against I/O (MinZ-U is 15 seconds faster than MinZ-C). But, it is probably not that simple.
Quite interesting to see the eZ80 is also fairly I/O bound too.
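A quick Python sketch of the "excess time" idea, using Bill's MANDEL-FELIPU timings quoted above. It assumes the 396,300 character count and 8 bits per character from my earlier calculation; the function names are just for illustration.

```python
def serial_minimum(baud, chars=396_300, bits_per_char=8):
    """Theoretical minimum transmit time in seconds."""
    return chars * bits_per_char / baud


def excess(elapsed, baud):
    """Elapsed time left over once the serial minimum is subtracted."""
    return elapsed - serial_minimum(baud)


# (elapsed seconds, baud) for MANDEL-FELIPU as reported in this thread
results = {"MinZ-C": (62, 115_200), "MinZ-U": (47, 230_400)}
for name, (elapsed, baud) in results.items():
    print(name, round(serial_minimum(baud), 2), round(excess(elapsed, baud), 2))
```

The two Z180 systems end up with an excess within a couple of seconds of each other, which fits the observation that the difference between them is mostly the serial time.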

P.
 
Having been involved many years ago with mainframe performance analysis for upgrades, this is a classic case of understanding the overall system requirements.  While it's easy to concentrate simply on processor speed or I/O speed, one really needs to look at the entire system and its applications as a whole.  I'm sure if I added a very large output buffer and interrupt driven output to the eZ80 then it would come close to the theoretical baud limitation.  However most of my system usage is file oriented and I prefer to use that memory for a large RAMdisk and disk buffers rather than the occasional baud-rate limited application.  If these programs were the primary purpose of the system then a console based on a FT232H in FIFO mode could change the throughput to be primarily CPU performance.

On Monday Phillip wrote:
It has been a while and this thread is positively ancient, but... 
And then wandering past my mandelbrot code today, I wondered if the runer112 32_16x16 multiply was significantly better than the 512 byte table look-up multiply that I'd previously used to get the best result.
mand180.hex.txt

Bill McMullen

Jun 14, 2021, 10:31:11 PM6/14/21
to RC2014-Z80
Using MAND180.hex:

MinZ-C = 48.5 Sec, MinZ-U = 32 Sec, Min-eZ = 40 Sec

Absolutely no question that my systems are I/O bound with this application.  Given the instruction optimization, doubling the baud rate now gives a 34% boost in Z180 throughput.  The eZ80 is VERY I/O bound!  Playing around with USB latency might reduce the times a bit.

I think the Z180 wait state gets obscured by the I/O waits but would show up on a more CPU intensive benchmark.  ASCIIART under MBASIC 5.21 gives the following times:

MinZ-C = 27 Sec, MinZ-U = 27 Sec, Min-eZ = under 7 Sec 

Quite a while ago I tested a 33.333 MHz Z180 on ASCIIART with various memory wait states:

0 waits = 30 Sec, 1 wait = 40 Sec, 2 waits = 49 Sec and 3 waits = 59 Sec.
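For anyone wanting to turn those timings into a throughput figure, here is a small Python sketch. It simply treats throughput as inversely proportional to elapsed time; the dictionary and function names are my own.

```python
# ASCIIART elapsed seconds on the 33.333 MHz Z180, keyed by wait states
ASCIIART_SECONDS = {0: 30, 1: 40, 2: 49, 3: 59}


def throughput_drop_percent(waits, timings=ASCIIART_SECONDS):
    """Percentage loss in throughput relative to zero wait states."""
    return (1 - timings[0] / timings[waits]) * 100


for waits in sorted(ASCIIART_SECONDS):
    print(waits, round(throughput_drop_percent(waits), 1))
```

One wait state works out to a 25% drop (30 s against 40 s), which lines up with the rule of thumb quoted earlier in the thread.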

Shawn Reed

Apr 9, 2024, 5:44:01 PM4/9/24
to RC2014-Z80
I know this is a very old thread and perhaps nobody wanted to have it dug back up, but I have had a blast using it to refresh my Z80 assembly skills.

Running J.B. Langston's original mandel.com downloaded at the top of the thread on my 18MHz Z180 SC722 I get 2:20:15. (115,200 baud)

With my current version I am getting 44:96.

I'm using the Z180 specific instructions (mlt), not calculating color if the iteration count is the same as previous, watching for ESC to be pressed to cancel, using RomWBW BIOS calls rather than CPM and compiling on target with Hi-Tech C ZAS. I am thinking to add command line parameters, read RTC to display run time and perhaps even some additional buffering to separate the calculations from IO.  I know this shark has been jumped a few times now, but it is as good a target as any to practice on.

-Shawn
mandel.png

Alan Cox

Apr 9, 2024, 5:52:40 PM4/9/24
to rc201...@googlegroups.com
On Tue, 9 Apr 2024 at 22:44, Shawn Reed <shawns...@gmail.com> wrote:
>
> I know this is a very old thread and perhaps nobody wanted to have it dug back up, but I have had a blast using it to refresh my Z80 assembly skills.
>
> Running J.B. Langston's original mandel.com downloaded at the top of the thread on my 18MHz Z180 SC722 I get 2:20:15. (115,200 baud)
>
> With my current version I am getting 44:96.
>
> I'm using the Z180 specific instructions (mlt), not calculating color if the iteration count is the same as previous, watching for ESC to be pressed to cancel, using RomWBW BIOS calls rather than CPM and compiling on target with Hi-Tech C ZAS. I am thinking to add command line parameters, read RTC to display run time and perhaps even some additional buffering to separate the calculations from IO. I know this shark has been jumped a few times now, but it is as good a target as any to practice on.

Nice - and if you use interrupts for the serial you can check the
escape code basically for free. Just save an SP somewhere and if you
get a serial interrupt with escape set SP to the saved one and ret
clean out of everything.

Shawn Reed

Apr 9, 2024, 7:39:20 PM4/9/24
to RC2014-Z80
Great point. Right now, I am just polling the HBIOS Function 0x02 – Character Input Status (CIOIST). I'll check on what that is costing me and look into utilizing the interrupt instead.

; Monitor for an ESC key to abort calculation and end the application
charIn:
        ; Save the registers
        push    bc
        push    de
        push    hl

        ; Check the status of the input buffer since the read is blocking
        ld      b, hbios_cioist         ; Input status command
        ld      c, hbios_device         ; Console device
        rst     hbios                   ; Call the HBIOS routine
        or      a                       ; Update the zero flag (same test as cp 0, one byte shorter)
        jp      z, charInEnd            ; nothing in the input buffer so return to the iteration loop

        ld      b, hbios_cioin          ; The input read HBIOS routine
        ld      c, hbios_device         ; Console Device
        rst     hbios                   ; Call the HBIOS Routine
        ld      a, 27                   ; Check for an ESC
        cp      e                       ; Char read in is in E so compare to the ESC in A
        jp      nz, charInEnd           ; If the zero flag is not set we can throw away the char and resume the iteration loop
        jp      mandel_end              ; We had gotten an ESC so we are done
       
charInEnd:
        ; Restore the registers
        pop     hl
        pop     de
        pop     bc
        jp      inner_loop2             ; back to work