A new IEEE 754 floating point library in z88dk

Phillip Stevens

May 29, 2019, 8:23:52 PM
to RC2014-Z80
I've been quiet on the RC2014 front for a few months. But, finally it is a reasonable time to write a little about what has been occupying me.

When I started with the z80, one of the first things that interested me was hooking up the Am9511A Arithmetic Processing Unit (APU) to the z180 CPU for its floating point capabilities. When I did that, I realised that none of the existing floating point libraries use the hardware multiply capability of the z180, or incidentally of the z80-zxn CPU in the Spectrum Next (coming soon). And, to be fair, I needed a real z180 floating point library for comparison.

So after procrastinating on the z180 and z80-zxn floating point library situation for nearly two years, I decided to write an IEEE 754 32-bit floating point library for those two platforms, which could also be used on the normal z80.

I started out with the Digi International floating point library for the Rabbit R2000 and R3000 CPUs. The Rabbit CPU has a signed 32_16x16 hardware multiply, and this made a good starting point. But, after months of working on the problem, the process turned out to be much more difficult than I expected:
  1. Rewrite Rabbit code to z80 code -> realise that this doesn't quite work so easily.
  2. Translate 32_16x16 Rabbit multiply algorithms to 16_8x8 z80 multiply algorithms -> accuracy is ugly.
  3. Learn about Newton-Raphson (a month passes).
  4. Learn about Horner's Method.
  5. Rewrite derived assembly functions to use compact IEEE intrinsic functions -> accuracy is middling but getting better.
  6. Rewrite intrinsic functions to use an expanded 32-bit mantissa -> accuracy is as good as IEEE 754 gets.
Getting accuracy in the intrinsic functions required rewriting the code three times. If I had known what I was doing, it would have been easier. But, as a learning experience, rewrites are the only way to make progress. Now, after the rewrites, I'm pretty happy with the resulting code.

Finally, with a solid intrinsic function library, I extracted the derived functions (trigonometric, hyperbolic, and power) from the Hi-Tech C library. These C functions are known as the fastest floating point implementations in z88dk benchmarking, but they are not particularly accurate. It is still a work in progress to rewrite these to reach an acceptable compromise between performance and accuracy.

Anyway, there's still some work to do to hook this library up to sdcc, but it is already connected to the sccz80 compiler in z88dk. The z88dk sccz80 has been extended to handle IEEE 754 32-bit floating point, and can now optionally use this new math32 library.

Performance in benchmarking is still being examined, but for the two targets of z180 and the Spectrum Next z80-zxn it is about four times (4x) faster on arithmetic-intensive benchmarks (n-body). A 4x improvement is in line with the kind of improvement revealed by writing the z180 and z80-zxn integer math library previously. There is less benefit for the z80, as there is no hardware multiply capability to exploit. Any z80 performance improvement would come directly from reducing the number of bytes shuffled compared with the existing 48-bit math libraries, and that is still to be quantified.

There's a readme in z88dk with more information on how the library is built, along with the early benchmarking results.
Hopefully, this library will become quite useful.

For me, it has been a very educational few months. Glad that it is (nearly) done - done.

Cheers, Phillip

Phillip Stevens

Jun 18, 2019, 1:25:16 AM
to RC2014-Z80
Phillip Stevens wrote:
I've been quiet on the RC2014 front for a few months. But, finally it is a reasonable time to write a little about what has been occupying me.

So after procrastinating on the z180 and z80-zxn floating point library situation for nearly two years, I decided to write an IEEE 754 32-bit floating point library for those two platforms, which could also be used on the normal z80.

I started out with the Digi International floating point library for the Rabbit R2000 and R3000 CPUs. The Rabbit CPU has a signed 32_16x16 hardware multiply, and this made a good starting point. But, after months of working on the problem, the process turned out to be much more difficult than I expected:
  1. Rewrite Rabbit code to z80 code -> realise that this doesn't quite work so easily.
  2. Translate 32_16x16 Rabbit multiply algorithms to 16_8x8 z80 multiply algorithms -> accuracy is ugly.
  3. Learn about Newton-Raphson (a month passes).
  4. Learn about Horner's Method.
  5. Rewrite derived assembly functions to use compact IEEE intrinsic functions -> accuracy is middling but getting better.
  6. Rewrite intrinsic functions to use an expanded 32-bit mantissa -> accuracy is as good as IEEE 754 gets.
Getting accuracy in the intrinsic functions required rewriting the code three times. If I had known what I was doing, it would have been easier. But, as a learning experience, rewrites are the only way to make progress. Now, after the rewrites, I'm pretty happy with the resulting code.

As it stands, the math32 code base is more readable and well commented than optimised down to the last cycle. For example, for mantissa shifts I've left bytes in their respective registers, rather than moving one into the a register, where the shift is a few cycles faster. I hope this kind of thing will keep the code maintainable for a longer period.
 
Finally, with a solid intrinsic function library, I extracted the derived functions (trigonometric, hyperbolic, and power) from the Hi-Tech C library. These C functions are known as the fastest floating point implementations in z88dk benchmarking, but they are not particularly accurate. It is still a work in progress to rewrite these to reach an acceptable compromise between performance and accuracy.

7. Convert the Hi-Tech C library to assembly using zsdcc, and include that in sccz80 builds. @suborb set this up, and it has been a good way to get going quickly for the sccz80 classic build.
 
Anyway, there's still some work to do to hook this library up to sdcc, but it is already connected to the sccz80 compiler in z88dk. The z88dk sccz80 has been extended to handle IEEE 754 32-bit floating point, and can now optionally use this new math32 library.

8. Spent too much time looking at accuracy of the Hi-Tech C functions. The immediate result was woeful. So we put the Cephes Math library expf() and logf() functions into the library. The result was still pretty crappy. Things were getting desperate.

After a while it became apparent that there was an issue with our sdcc compiler of choice: it was using the hl register to set up its stack, trashing the least significant bytes of every dehl fastcall C function argument in the process, effectively resulting in 16-bit floating point. Yeah. Not exactly good. So we fixed that, and unsurprisingly accuracy for C-generated functions got substantially better.

Performance in benchmarking is still being examined, but for the two targets of z180 and the Spectrum Next z80-zxn it is about four times (4x) faster on arithmetic-intensive benchmarks (n-body). A 4x improvement is in line with the kind of improvement revealed by writing the z180 and z80-zxn integer math library previously. There is less benefit for the z80, as there is no hardware multiply capability to exploit. Any z80 performance improvement would come directly from reducing the number of bytes shuffled compared with the existing 48-bit math libraries, and that is still to be quantified.

Because sccz80 is the z88dk "house compiler" @suborb was able to integrate a number of great optimisations into its instruction issuing machine already.

The ldexp() function can be used to provide very fast 2^n multiplies (and divides). So now, wherever C code requires a 2^n multiply or divide, the sccz80 compiler will issue an ldexp() call rather than a full multiply or divide. This is so useful that @suborb extended it to all z88dk math libraries. In the startrek program, for example, there are 7 instances where this is issued, and each time a full floating point divide taking many hundreds of ticks is avoided.

Also, as the math32 library calculates divides by first finding the inverse and then multiplying, the invf() function is actually the intrinsic function. So where sccz80 recognises that a number is simply being inverted, as in 1/n, it will issue an invf() directly, rather than computing the inverse of n and then multiplying it by 1.0. This is a special issuance for the math32 library, and avoids a full floating point multiply.

9. Add a faster z80 multiply option. One performance tool that I did build was a fast table-driven 16_8x8 multiply for the z80. Since the framework of the library is based on the 16_8x8 unsigned multiply provided by the z180 CPU and the z80-zxn ZX Spectrum Next soft-CPU, the z80 multiply core needed to follow suit. Fortunately, 16_8x8 is also pretty comfortable on a z80, and a heavily optimised unrolled shift+add routine is fairly simple to write.

But because the accuracy work was eating into the performance we were trying to achieve, I found a table-driven approach to multiplication that is substantially faster than a linear shift+add routine, and built that as an option. So for the z80 there is now a math32_fast option. There is a penalty of a 512-byte table added to the build, but the benchmark result at C level is about a 10% to 15% performance increase. Worthwhile as an alternative.
 
There's a readme in z88dk with more information on how the library is built, along with the early benchmarking results.
Hopefully, this library will become quite useful.

For me, it has been a very educational few months. Glad that it is (nearly) done - done.

10. There's still a few things to be done, including the biggie of building the linkages so that the z88dk newlib can be used. The rc2014 app (bare-metal embedded) and basic (to be loaded by hexload) subtypes are included in newlib, as is the rc2014 cpm subtype, so there is still some work to do.

The only way the library can currently be used from C is through the z88dk classic library with sccz80, but fortunately the classic library includes the cpm target with the default subtype. To use this with the RC2014, rather than using -lm to link the default math library, use: -fp-mode=ieee -lmath32 -pragma-define:CLIB_32BIT_FLOAT=1
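Put together, a build for the cpm target might look something like this (a hypothetical command line assembled from the flags above; myprog.c and the output name are placeholders, so adjust to suit your setup):

```shell
# default math library:
#   zcc +cpm myprog.c -o myprog.com -lm
# math32 instead (sccz80 / classic library):
zcc +cpm myprog.c -o myprog.com \
    -fp-mode=ieee -lmath32 -pragma-define:CLIB_32BIT_FLOAT=1
```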

Another update, when it is all connected.

Cheers, Phillip

Phillip Stevens

Apr 28, 2021, 8:14:27 AM
to RC2014-Z80
On Tuesday, 18 June 2019 at 15:25:16 UTC+10 Phillip Stevens wrote:
I've been quiet on the RC2014 front for a few months. But, finally it is a reasonable time to write a little about what has been occupying me.

So after procrastinating on the z180 and z80-zxn floating point library situation for nearly two years, I decided to write an IEEE 754 32-bit floating point library for those two platforms, which could also be used on the normal z80.

I started out with the Digi International floating point library for the Rabbit R2000 and R3000 CPUs. The Rabbit CPU has a signed 32_16x16 hardware multiply, and this made a good starting point. But, after months of working on the problem, the process turned out to be much more difficult than I expected:
  1. Rewrite Rabbit code to z80 code -> realise that this doesn't quite work so easily.
  2. Translate 32_16x16 Rabbit multiply algorithms to 16_8x8 z80 multiply algorithms -> accuracy is ugly.
  3. Learn about Newton-Raphson (a month passes).
  4. Learn about Horner's Method.
  5. Rewrite derived assembly functions to use compact IEEE intrinsic functions -> accuracy is middling but getting better.
  6. Rewrite intrinsic functions to use an expanded 32-bit mantissa -> accuracy is as good as IEEE 754 gets.
Getting accuracy in the intrinsic functions required rewriting the code three times. If I had known what I was doing, it would have been easier. But, as a learning experience, rewrites are the only way to make progress. Now, after the rewrites, I'm pretty happy with the resulting code.

As it stands, the math32 code base is more readable and well commented than optimised down to the last cycle. For example, for mantissa shifts I've left bytes in their respective registers, rather than moving one into the a register, where the shift is a few cycles faster. I hope this kind of thing will keep the code maintainable for a longer period.

Another update, when it is all connected.

Well, the promised update on z88dk math32 has taken a few years; it has been working as planned, so I had nothing new and exciting to speak of.

In context, the main reason I wrote it in 2019 was to take advantage of the z180 hardware multiply instruction, and also to support the SpectrumNext z80n hardware multiply instruction too.
Z80 support was a bit of an afterthought, focused more on compatibility (ease of maintenance) than on what was best for performance.

Anyway, over the last few days I've revised the z80 mantissa multiplier in math32 for performance, achieving a 37% reduction in cycles.
This has had a substantial positive effect on some benchmarks.
Otherwise no change. Everything is as normal.

The update should be available in tomorrow's z88dk nightly.

Cheers, Phillip