MicroSoft F80/BASCOM

Fred Weigel

Jul 22, 2021, 11:57:28 AM

to retro-comp

Because none of the AM9511 FORTRAN-80 libraries can be located,

I am writing APU.REL that can be linked with F80 compiled code.

So far, I have $D9 (INTEGER/INTEGER), $DY (INTEGER*4/INTEGER), $D1 (INTEGER*4/INTEGER*4), $M9, $MY and $M1 (MULTIPLY) done.

Beginning on the REAL + - * / **, which will be added soon. Published to my

github (just in case my hardware crashes).

https://github.com/ratboy666/apu

And I was wondering if these were applicable to MicroSoft BASCOM.

OBSLIB.REL does have $D9, and $M9. Of course BASCOM/MBASIC for the

8080 did not support V& (32 bit integer), so $DY, $D1, $MY and $M1

being missing is kind of expected.

What I was wondering... if anyone knows... (and I have completely forgotten...

it has been 40 years) - would a BASCOM compiled program using

/O (OBSLIB.REL) use the FORTRAN-80 routines for arithmetic? If

it does, then APU.REL will apply to both. Does anyone know? Note that

even BASLIB.REL contains $M9 and $D9, so maybe the BRUN.COM

stuff can be accelerated (this is a leap -- $M9 will be in BRUN, and if I

patch it exactly, I could possibly get an accelerated BRUN.COM).

If it DOES work that way, then MBASIC (slow) BASCOM (faster) and

BASCOM+APU (fastest).

Thanks in advance

Fred Weigel

Jul 23, 2021, 12:53:27 PM

to retro-comp

Quick update

I now have INTEGER INTEGER*4 multiply divide and REAL REAL

(both operands REAL) add subtract multiply divide power

done.

Now it's likely to be usable as an AM9511A support library.

Working on REAL INTEGER mixed operations

Note that there is a bug in apu.mac -- 1.0 / 2.0 gives -1.0,

not 0.5 as it should. But, in answer to my original question:

BASCOM 5.30a generates

CALL $DVDA / DW X! / DW Y! / CALL $FASO / DW Z!

for Z = X / Y where both X and Y are REAL. This, in turn,

generates a call to $DB, which does the divide. Linking

APU.REL does work with BASCOM.

Fred Weigel

Jul 23, 2021, 1:12:39 PM

to retro-comp

And, fixed

Phillip Stevens

Jul 24, 2021, 8:19:55 AM

to retro-comp

Fred wrote:

Because none of the AM9511 FORTRAN-80 libraries can be located,
I am writing APU.REL that can be linked with F80 compiled code.

So far, I have $D9 (INTEGER/INTEGER), $DY (INTEGER*4/INTEGER), $D1 (INTEGER*4/INTEGER*4), $M9, $MY and $M1 (MULTIPLY) done.
Beginning on the REAL +- * / **, which will be added soon.

Interested to see this, and how to write .REL files.

If it DOES work that way, then MBASIC (slow) BASCOM (faster) and
BASCOM+APU (fastest).

The experience from integrating into z88dk and C libraries was that it is actually slower to use the 16 bit operations on the APU, than doing it with a 7.3MHz (RC2014) Z80.

So after putting them all in, I had to back them out. They're all still there, just not enabled. Interested to hear what you find.

It would be great to have a BASCOM solution for the APU Module in the RC2014.

The MBASIC APU version is faster than with software floating point, but not by very much. When looking at the calls, there is too much overhead from interpretation to get much improvement.

So even if you triple the speed of the math for example, the proportion of math time to instruction decoding time is just too low to see more than about 15% overall improvement.

Having BASCOM would change the proposition entirely and provide an opportunity for the APU to benefit performance from BASIC (as it does in C) up to about 3x performance.
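
The reasoning above is essentially Amdahl's law: speeding up only the math helps in proportion to how much of the run time is math. A minimal Python sketch follows; the 20% and 95% math fractions are assumptions picked for illustration, not measurements from this thread.

```python
def overall_speedup(math_fraction, math_speedup):
    """Amdahl's law: overall gain when only the math portion gets faster."""
    return 1.0 / ((1.0 - math_fraction) + math_fraction / math_speedup)

# Assumed split (illustrative): math is ~20% of interpreted run time,
# the rest is instruction decoding by the interpreter.
interpreted = overall_speedup(0.20, 3.0)

# Assumed split (illustrative): math is ~95% of compiled run time.
compiled = overall_speedup(0.95, 3.0)

print(f"interpreted: {interpreted:.2f}x overall, compiled: {compiled:.2f}x overall")
```

With those assumed fractions, tripling the math speed gives roughly 15% overall for the interpreter but close to 3x for compiled code, which is the shape of the argument made here.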

So good luck!

Phillip

Fred Weigel

Jul 24, 2021, 10:44:11 AM

to retro-comp

Phillip

BASCOM appears to work

BASCOM =PROG/O/Z

L80 PROG,APU,PROG/N/E

would do to compile and link PROG.BAS, with APU. I haven't extensively tested this solution, and

have never run APU library on a real device (I don't have one). This does work on the emulator.

Fred Weigel

Jul 24, 2021, 3:46:47 PM

to retro-comp

Phillip

If you are willing to do some benchmark testing...

I have added a "bm" (benchmark) directory to my apu repository.

This contains a benchmark document from 1982, and 9 benchmark programs, in BASIC,

FORTRAN and PASCAL. I entered and ran them from the paper. They are very

outdated... but are all I can find from that era.

There are results in the paper for Z80 with FORTRAN and two libraries (Redding and Memtech).

It would be interesting to know how APU.REL compares. Note that BM9.FOR is accelerated

with the Redding library, which implies that INTEGER acceleration is useful.

At least at 2, 2.5, 3 and 4 MHz -- not sure about 7+ MHz.

The Pascal programs are included and have been run with Turbo Pascal -- apparently there is

a problem with BM9.PAS in Pascal MT+ (but I haven't bothered with that yet)

FORTRAN programs all compiled and run with F80

BASIC run with MBASIC and BASCOM

BM9.BAS

Using BASCOM with APU on zxcc

0m0.558s

Same linked WITHOUT APU

0m0.770s

which means ~30% improvement! On an INTEGER dominant benchmark -- and the ONLY

thing we accelerate is 160 LET M=N/K

With F80, we should see an even better improvement -- should compare with the Redding

entries in the benchmark result table.
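
For reference, the two zxcc timings above can be turned into a percentage and a speedup factor. This is a small Python sketch of the arithmetic only; the variable names are mine.

```python
without_apu = 0.770  # seconds, BM9.BAS linked without APU.REL (zxcc)
with_apu = 0.558     # seconds, BM9.BAS linked with APU.REL (zxcc)

# Fraction of run time removed, and the equivalent speedup factor.
time_saved = (without_apu - with_apu) / without_apu
speedup = without_apu / with_apu

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

Note that these are emulator timings, so they say more about instruction counts than about a real Am9511.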

Let me know if you want/need the APU.REL, or BM9.COM

Fred Weigel

Phillip Stevens

Jul 25, 2021, 3:44:27 AM

to retro-comp

Fred wrote:

Phillip

I have added a "bm" (benchmark) directory to my apu repository.

This contains a benchmark document from 1982, and 9 benchmark programs, in BASIC,
FORTRAN and PASCAL. I entered and ran them from the paper. They are very
outdated... but are all I can find from that era.

There are results in the paper for Z80 with FORTRAN and two libraries (Redding and Memtech).
It would be interesting to know how APU.REL compares. Note that BM9.FOR is accelerated
with the Redding library, which implies that INTEGER acceleration is useful.

At least at 2, 2.5, 3 and 4 MHz -- not sure about 7+ MHz.
The Pascal programs are included and have been run with Turbo Pascal -- apparently there is
a problem with BM9.PAS in Pascal MT+ (but I haven't bothered with that yet)

FORTRAN programs all compiled and run with F80
BASIC run with MBASIC and BASCOM

BM9.BAS

Using BASCOM with APU on zxcc

0m0.558s

Same linked WITHOUT APU

0m0.770s

which means ~30% improvement! On an INTEGER dominant benchmark -- and the ONLY
thing we accelerate is 160 LET M=N/K
With F80, we should see an even better improvement -- should compare with the Redding
entries in the benchmark result table.

Let me know if you want/need the APU.REL, or BM9.COM

I've been having a bit of a go with this.

I think I've gotten BM8 to work without the APU, but it seems to hang when using the APU.

See if you can follow along with my thread below.

There are a bunch of errors when assembling the APU.MAC file, which I guess are unresolved symbols.

There's also some messages from L80 that might be an issue. I don't know it well enough.

Anyway, at the end the APU enabled BM8 just hangs.

Am I doing something wrong?

B>a:ddir

-- Directory of volume #1 --

BASCOM2.HLP ........29312

BASCOM.COM ........32768

BASCOM.HLP ........14976

BASLIB.REL ........24960

BRUN.COM ........15488

CREF80.COM .........3968

CREF.COM .........3968

D.COM .........1792

L80.COM ........10752

LIB80.COM .........4736

M80.COM ........19200

MBASIC.COM ........24320

OBSLIB.REL ........48384

SAMPLE.BAS ..........128

COLOUR.BAS ..........896

COLOUR.PRN .........5376

BCLOAD ..........128

COLOUR.REL .........1024

COLOUR.COM .........1152

APU.MAC ........20864

AM9511.MAC ..........256

BM8.BAS ..........256

Total bytes: 264704.

B>m80 =apu

M AM.SIN EQU 02H ; SINE

M AM.CHSF EQU 15H ; FLOATING CHANGE SIGN

M AM.FLTS EQU 1DH ; 16 BIT TO FLOAT

M AM.FIXD EQU 1EH ; FLOAT TO 32 BIT

M AM.FIXS EQU 1FH ; FLOAT TO 16 BIT

M 0060 AM.SINGL EQU 60H ; 16 BIT INTEGER

M 0020 AM.FIXED EQU 20H ; FIXED POINT

M 0002 AM.SIN EQU 02H ; SINE

M 0014 AM.CHS EQU 14H ; CHANGE SIGN

M 0015 AM.CHSF EQU 15H ; FLOATING CHANGE SIGN

M 001C AM.FLTD EQU 1CH ; 32 BIT TO FLOAT

M 001D AM.FLTS EQU 1DH ; 16 BIT TO FLOAT

M 001E AM.FIXD EQU 1EH ; FLOAT TO 32 BIT

M 001F AM.FIXS EQU 1FH ; FLOAT TO 16 BIT

E 0020' D3 00 OUT DA9511

E 0024' D3 00 OUT DA9511

E 002A' D3 00 OUT DA9511

E 002D' D3 00 OUT DA9511

E 0037' D3 00 OUT DA9511

E 0039' D3 00 OUT DA9511

E 003B' D3 00 OUT DA9511

E 003D' D3 00 OUT DA9511

E 0048' D3 00 OUT DA9511

E 004A' D3 00 OUT DA9511

E 004C' D3 00 OUT DA9511

E 0051' D3 00 OUT DA9511

E 0056' DB 00 IN DA9511 ; 9511 EXPONENT

E 0059' DB 00 IN DA9511 ; 9511 HIGH MANTISSA

E 005D' DB 00 IN DA9511 ; 9511 MIDDLE MANTISSA

E 0061' DB 00 IN DA9511 ; 9511 LOW MANTISSA

E 009C' D3 00 OUT DA9511

E 00A0' D3 00 OUT DA9511

E 00A4' D3 00 OUT DA9511

E 00A8' D3 00 OUT DA9511

E 00AC' D3 00 OUT ST9511

E 00AE' DB 00 + ..0000: IN ST9511

E 00D1' D3 00 OUT DA9511

E 00D4' D3 00 OUT DA9511

E 00D8' D3 00 OUT ST9511

E 00DA' DB 00 + ..0001: IN ST9511

E 0101' D3 00 OUT ST9511

E 0103' DB 00 + ..0002: IN ST9511

E 010F' DB 00 IN ST9511

E 0173' D3 00 OUT DA9511

E 0176' D3 00 OUT DA9511

E 0179' D3 00 OUT DA9511

E 017C' D3 00 OUT DA9511

E 0180' D3 00 OUT ST9511

E 0182' DB 00 + ..0003: IN ST9511

E 0198' DB 00 $D9.2: IN DA9511

E 019B' DB 00 IN DA9511

E 01AB' D3 00 OUT DA9511

E 01AE' D3 00 OUT DA9511

E 01B4' D3 00 OUT DA9511

E 01B7' D3 00 OUT DA9511

E 01BB' D3 00 OUT DA9511

E 01BE' D3 00 OUT DA9511

E 01C2' D3 00 OUT DA9511

E 01C4' D3 00 OUT DA9511

E 01C8' D3 00 OUT ST9511

E 01CA' DB 00 + ..0004: IN ST9511

E 01E6' DB 00 $DY.2: IN DA9511

E 01E9' DB 00 IN DA9511

E 01EF' DB 00 IN DA9511

E 01F2' DB 00 IN DA9511

E 0205' D3 00 OUT DA9511

E 0208' D3 00 OUT DA9511

E 020E' D3 00 OUT DA9511

E 0211' D3 00 OUT DA9511

E 0215' D3 00 OUT DA9511

E 0219' D3 00 OUT DA9511

E 021D' D3 00 OUT DA9511

E 0221' D3 00 OUT DA9511

E 0225' D3 00 OUT ST9511

E 0227' DB 00 + ..0005: IN ST9511

E 0243' DB 00 $D1.2: IN DA9511

E 0246' DB 00 IN DA9511

E 024C' DB 00 IN DA9511

E 024F' DB 00 IN DA9511

E 025E' D3 00 OUT DA9511

E 0261' D3 00 OUT DA9511

E 0264' D3 00 OUT DA9511

E 0267' D3 00 OUT DA9511

E 026B' D3 00 OUT ST9511

E 026D' DB 00 + ..0006: IN ST9511

E 027C' DB 00 $M9.2: IN DA9511

E 027F' DB 00 IN DA9511

E 0287' D3 00 OUT DA9511

E 028A' D3 00 OUT DA9511

E 028E' D3 00 OUT DA9511

E 0290' D3 00 OUT DA9511

E 0296' D3 00 OUT DA9511

E 0299' D3 00 OUT DA9511

E 029F' D3 00 OUT DA9511

E 02A2' D3 00 OUT DA9511

E 02A6' D3 00 OUT ST9511

E 02A8' DB 00 + ..0007: IN ST9511

E 02BD' DB 00 $MY.2: IN DA9511

E 02C0' DB 00 IN DA9511

E 02C6' DB 00 IN DA9511

E 02C9' DB 00 IN DA9511

E 02D4' D3 00 OUT DA9511

E 02D8' D3 00 OUT DA9511

E 02DC' D3 00 OUT DA9511

E 02E0' D3 00 OUT DA9511

E 02E6' D3 00 OUT DA9511

E 02E9' D3 00 OUT DA9511

E 02EF' D3 00 OUT DA9511

E 02F2' D3 00 OUT DA9511

E 02F6' D3 00 OUT ST9511

E 02F8' DB 00 + ..0008: IN ST9511

E 030D' DB 00 $M1.2: IN DA9511

E 0310' DB 00 IN DA9511

E 0316' DB 00 IN DA9511

E 0319' DB 00 IN DA9511

115 Fatal error(s)

B>m80 =am9511

No Fatal error(s)

B>lib80 apu=am9511,apu/e

B>bascom =bm8/o

00000 Fatal Error(s)

24502 Bytes Free

B>l80 bm8,bm8/n/e

Data 0103 27CB < 9928>

32721 Bytes Free

[013C 27CB 39]

B>bm8

BM8

E

B>era bm8.com

B>l80 bm8,apu,bm8/n/e

%Mult. Def. Global $AA

%Mult. Def. Global $AB

%Mult. Def. Global $SA

%Mult. Def. Global $SB

Data 0103 2793 < 9872>

32819 Bytes Free

[013C 2793 39]

B>bm8

BM8

Fred Weigel

Jul 25, 2021, 8:29:14 AM

to retro-comp

Phillip

Ok -- the error is that the ports (DA9511 and ST9511) are being assembled as 00, not external.

The M errors are also curious. Wondering about the version of tools you are using. I am going

to add M80.COM, L80.COM, LIB.COM to the repository (actually, going to include F80.COM, BASCOM.COM

and FORLIB.REL, OBSLIB.REL and a compiled version of APU.REL). Note that my tools are a bit

smaller -- used popcom.com on them. The compiled version of apu.rel uses 43h/42h for status/data.

FredW

Fred Weigel

Jul 25, 2021, 8:33:59 AM

to retro-comp

The linker warnings are normal -- see apu.txt for the explanation.

FredW

Fred Weigel

Jul 25, 2021, 2:57:08 PM

to retro-comp

Phillip

It is the assembler itself -- pretty sure...

Try creating a one-line file X.MAC containing

END ; THAT IS <SPACE>END<CR><LF>

and assemble:

m80 =x/l

X.PRN should now be something like:

MACRO-80 3.44 30-Aug-82 PAGE 1

END

MACRO-80 3.44 30-Aug-82 PAGE S

Macros:

Symbols:

No Fatal error(s)

The important thing is the version: as you can see, I use MACRO-80 3.44

I also tried the ALDS assembler: MSX.M-80 1.00 and that worked as well.

The only "unorthodox" thing in APU.MAC is that the ports are imported

(with EXTRN). Those become 8 bit numbers in the IN and OUT.

Note that there are some differences with the different MACRO-80 assemblers -

3.44 and ALDS support EXT EXTRN EXTERNAL (all the same) and BYTE EXT,

BYTE EXTRN and BYTE EXTERNAL (also all the same).

*IF* the EXTRN is messed up, M80 *may* do weird things -- like lots of U, and M errors.

FredW

Phillip Stevens

Jul 26, 2021, 1:01:00 AM

to retro-comp

Fred wrote:

It is the assembler itself -- pretty sure...
Try creating a one-line file X.MAC containing

END ; THAT IS <SPACE>END<CR><LF>

and assemble:

m80 =x/l

X.PRN should now be something like:

MACRO-80 3.44 30-Aug-82 PAGE 1

END

MACRO-80 3.44 30-Aug-82 PAGE S

Yes. That fixed it. I didn't find the assembler you have (date), but at least the same version from a few months earlier.

Odd that that minor increment in version changed things so greatly.

I can play with benchmarking now. ;-)

But do note that the documented benchmarks are with 2MHz, 3MHz, and 4MHz Z80 and 4MHz Am9511.

Since I couldn't find the 4MHz versions easily, I built the APU Module with a 3:1 clock, which means it is proportionally less effective.

So when I backed out the integer routines in z88dk it was because they were (only slightly) slower on the APU than the Z80 host running 3x faster clock.

The benchmarks would show the results from 1:1 clocks up to 4MHz, so there will be a difference.

B>dir

B: BASCOM2 HLP : BASCOM COM : BASCOM HLP : BASLIB REL

B: BRUN COM : CREF80 COM : CREF COM : D COM

B: L80 COM : LIB80 COM : M80 COM : MBASIC COM

B: OBSLIB REL : BM8 REL : AM9511 REL : APU REL

B: BM8 COM : SAMPLE BAS : COLOUR BAS : COLOUR PRN

B: BCLOAD : COLOUR REL : COLOUR COM : APU MAC

B: AM9511 MAC : BM8 BAS : X MAC : X PRN

B: X REL

B>era m80.com

B>a:xmodem m80.com /r

File created

Receiving via CON with CRCsCCCC

B>m80 =x/l

No Fatal error(s)

B>type x.prn

MACRO-80 3.44 09-Dec-81 PAGE 1

end ;

MACRO-80 3.44 09-Dec-81 PAGE S

Macros:

Symbols:

No Fatal error(s)

B>m80 =apu

No Fatal error(s)

B>m80 =am9511

No Fatal error(s)

B>lib80 apu=am9511,apu/e

B>l80 bm8,bm8/n/e

Data 0103 27CB < 9928>

32721 Bytes Free

[013C 27CB 39]

B>era bm8.com

B>l80 bm8,apu,bm8/n/e

%Mult. Def. Global $AA

%Mult. Def. Global $AB

%Mult. Def. Global $SA

%Mult. Def. Global $SB

Data 0103 2793 < 9872>

31809 Bytes Free

[013C 2793 39]

B>bm8

BM8

E

B>

Phillip Stevens

Jul 26, 2021, 1:09:05 AM

to retro-comp

I also found I have another version of l80, which seems to be more efficient at using system memory.

But, doesn't change the outcome as far as I can see.

B>l80 bm8,apu,bm8/n/e

%Mult. Def. Global $AA

%Mult. Def. Global $AB

%Mult. Def. Global $SA

%Mult. Def. Global $SB

Data 0103 2793 < 9872>

35419 Bytes Free

[013C 2793 39]

B>bm8

BM8

E

B>

Fred Weigel

Jul 26, 2021, 1:47:29 AM

to retro-comp

Phillip

Yea!

with bascom, also try the /z switch for z80 code generation. f80 doesn't do z80, just 8080

(as far as I remember).

That "disk" l80 came with one of Microsoft's languages (COBOL?). I think it was called ld80.

As I remember, it didn't do much except slow down linking. Maybe to do with more symbols

or something? That one I never used.

I am happy that APU is working for you and that you are doing some benchmarking!

FredW

Phillip Stevens

Jul 26, 2021, 2:22:52 AM

to retro-comp

Sample testing with BM8

100 REM BM8

300 PRINT "BM8"

400 K=0

430 DIM M(5)

500 K=K+1

550 A=K^2

560 B=LOG(K)

570 C=SIN(K)

580 IF K<1000 THEN 500

700 PRINT "E"

800 END

RC2014 Z80 7.3728 MHz and APU 2.4576MHz.

1000x iterations

MBASIC 5.29 (interpreted)

Software - 28.0 sec (equivalent to 51 seconds at 4MHz)

MBASIC 4.7C (interpreted) - Z80 instruction optimised.

Software - 26.5 sec (equivalent to 49 seconds at 4MHz)

APU Module - 12.5 sec

More than 2x faster with the APU.

BASCOM 5.03 (compiled)

Software - 27.8 sec (equivalent to 51 seconds at 4MHz)

APU Module - 9.1 sec

This aligns well with the compiled C results of about 3x faster with the APU.
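
The "equivalent at 4MHz" figures and the speedups above come from simple scaling. A Python sketch of that arithmetic, assuming software run time scales inversely with the Z80 clock (the function and constant names are mine):

```python
RC2014_MHZ = 7.3728     # Z80 clock used for the measurements above
REFERENCE_MHZ = 4.0     # clock used in the 1982 benchmark tables

def scale_to_reference(seconds, cpu_mhz=RC2014_MHZ, ref_mhz=REFERENCE_MHZ):
    """Estimate run time at ref_mhz, assuming time is inversely proportional to clock."""
    return seconds * cpu_mhz / ref_mhz

software = {"MBASIC 5.29": 28.0, "MBASIC 4.7C": 26.5, "BASCOM 5.03": 27.8}
for name, secs in software.items():
    print(f"{name}: {secs} s -> {scale_to_reference(secs):.1f} s at 4 MHz")

# Speedup from the APU Module on the same hardware.
print(f"MBASIC 4.7C with APU: {26.5 / 12.5:.2f}x")
print(f"BASCOM 5.03 with APU: {27.8 / 9.1:.2f}x")
```

The scaling only makes sense for the software-only rows; the APU rows depend on the APU clock as well, so they are left unscaled.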

Related scores from the August 1982 Benchmark Document.

NOTE. The table scores are multiplied by 10x, as the tables are calculated with 100 iterations.

TABLE 2 - 4MHz Z80

MBASIC 5.2 (interpreted)

Software - 62 sec

MBASIC 4.51 (interpreted)

Software - 66 sec

BASCOM 5.0 (compiled)

Software - 61 sec

BASIC-E with 4MHz APU

APU - 12 sec

TABLE 8

To compare with this table, we're dividing by 10 for the RC2014/APU Module with 1000 iterations.

So with 0.91 sec for BM8, in 1982 terms we're the second fastest thing on the planet (or at least in the document),

behind the Cyber 171 with 0.36 sec and in front of the Wang 2200VP with 1.0 sec.

Well done Fred.

Cheers, Phillip

Fred Weigel

Jul 30, 2021, 8:15:57 PM

to retro-comp

Phillip

So, I have been talking with Marcus R Wigan, who authored that paper -- he requested proper attribution in the github,

which I did. He then sent me a scan of the REDDING library documentation. He still has this material. Very interesting.

I have asked Wigan if he minds if I put that on-line.

Anyway, the main thing is that the REDDING APU support library also supports DOUBLE PRECISION. Accelerated.

This is possible -- I was thinking... double*double can be float*float, and fill the bottom mantissa. Exponent is the

same -- we just need wide multiply. Which AM9511 has - 32x32 -> 64 multiply. So, I have started to design

DOUBLE PRECISION MULTIPLY. Then, we will need division. Have to do some serious benchmarking for the

DOUBLE PRECISION ADD/SUB cases.
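
To make the "wide multiply" idea concrete, here is a rough Python sketch of building a full product of two 56-bit mantissas out of smaller hardware-sized multiplies. The 28-bit limb size and the helper names are my assumptions for illustration; this is not the design in the apu repository.

```python
LIMB_BITS = 28                       # assumed limb size: two limbs cover 56 bits
LIMB_MASK = (1 << LIMB_BITS) - 1

def mul32(a, b):
    """Stand-in for one hardware multiply of non-negative values below 2**31."""
    assert 0 <= a < 2**31 and 0 <= b < 2**31
    return a * b                     # result fits in 64 bits

def wide_mul56(a, b):
    """Full product of two 56-bit mantissas from four limb products."""
    a_hi, a_lo = a >> LIMB_BITS, a & LIMB_MASK
    b_hi, b_lo = b >> LIMB_BITS, b & LIMB_MASK
    high = mul32(a_hi, b_hi)
    cross = mul32(a_hi, b_lo) + mul32(a_lo, b_hi)
    low = mul32(a_lo, b_lo)
    return (high << (2 * LIMB_BITS)) + (cross << LIMB_BITS) + low

# Sample normalised 56-bit mantissas (top bit set).
a = (1 << 55) | 0x2AAAAAAAAAAAAA
b = (1 << 55) | 0x123456789ABCDE
print(hex(wide_mul56(a, b)))
```

A real implementation would keep only the top 56 bits of the product and then fix up the exponent; the sketch stops at the exact product so it can be checked against ordinary integer multiplication.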

So, for convenience in my coding - I tied F80 together with HI-TECH C. Result is

https://github.com/ratboy666/mixed

That lets me mess around with FORTRAN DOUBLE PRECISION, doing bit/byte banging coding in C.

Put mixed on my github yesterday - working on DOUBLE PRECISION MULTIPLY now...

Fred Weigel

Jul 31, 2021, 2:54:12 PM

to retro-comp

https://github.com/ratboy666/apu/blob/main/bm/1982%20apulib%20redding%20group%20v1.06%20copy.pdf

Is the REDDING GROUP apulib manual.

Note the acceleration of DOUBLE PRECISION. Implementation notes:

A double precision (64 bit) operation can be a 32 bit operation, followed by mantissa recalculation.

The exponent can be adjusted. If it is too large coming out of the operation, the am9511 allows add or subtract

of 128 to bring it into range.

If too large coming in.. scale, do the operation and scale again, I would imagine. Working through these

things now.

Fred Weigel

Bill McMullen

Jul 31, 2021, 8:12:48 PM

to retro-comp

Out of curiosity I tried the BM8 benchmark using interpreted MBASIC 5.21 on a 50 MHz eZ80. In order to provide a reasonable time frame for using a stopwatch, the iterations were bumped up to 10,000 and the result was about 11.9 seconds or 0.119 when adjusted to the referenced table.

Compiled BASIC was only about 3% faster for this benchmark since there's very little work for the interpreter. On the ASCIIART benchmark, BASCOM is much faster at 3.0 vs 6.8 seconds.
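
Since the runs in this thread use different iteration counts, a tiny Python helper makes the "adjusted to the referenced table" step explicit. It assumes, as stated earlier in the thread, that the 1982 tables are based on 100 iterations; the function name is mine.

```python
def per_100_iterations(seconds, iterations):
    """Normalise a measured BM8 time to the 100-iteration basis of the 1982 tables."""
    return seconds * 100.0 / iterations

rc2014_apu = per_100_iterations(9.1, 1000)    # BASCOM + APU Module, 1000 iterations
ez80 = per_100_iterations(11.9, 10000)        # MBASIC 5.21 on 50 MHz eZ80, 10,000 iterations

print(f"RC2014 + APU: {rc2014_apu:.3f} s, eZ80: {ez80:.3f} s")
```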

Phillip Stevens

Jul 31, 2021, 11:06:09 PM

to retro-comp

I find it interesting that this Redding document from 1982 carefully explains the same APU limitations with loading overhead, that I found independently nearly 40 years later. LoL.

Are we doomed to learn nothing from history?

P.

Fred Weigel

Aug 10, 2021, 5:54:03 PM

to retro-comp

Phillip

So, REDDING does DOUBLE PRECISION, how? Probably by doing it the "hard way" -- implement using 9511

primitive operations. I will get to it, but kind of got into the TJ Dekker 71 paper. So, started fooling around.

Here is MBASIC code - it makes a double precision number 1/3, .3333333333333333 as printed by MBASIC.

We then break it into mh, mm and ml (mantissa high, medium, low) of 23 bits, 23 bits and 10 bits.

Notice the loss of dynamic range... if we ignore the last 10 bits, we still lose 2^23 -- very sucky.

But.. notice that we can put the number back together -- and it is still .3333333333333333 .

If we ignore the last 10 bits (531 GOTO 550)... we get .3333333333333286 -- which is just fine.

So, converting to 8080 (Z80) assembler and testing... this is fun! Note that the x * 2^B stuff

is just a bit of bit-banging in assembler... MBASIC doesn't have a good way to do that..

And now off to implement a set of routines for add, subtract, multiply, divide using this RRD

(real-real-double) technique. Per Dekker! After this, will try coding up "proper" double

precision. But this shows that it is actually easy to make the RRD from a double (and back again).

I think that I will call this module RRD.REL. The linkage will be L80 prog,AM9511,RRD,APU,prog/N/E

and then RRD is optional -- it can (almost) be a user library, except for the special names needed

to override FORLIB.REL.

This should be the fastest (almost) DOUBLE PRECISION possible with a 9511. And, I finally published

dekker1971.pdf to my apu github.

Anyway... I think that this is what Dekker was after...

Enjoy!

Fred

240 ' DOUBLE PRECISION (RRD)

250 ' GENERATE A DOUBLE PRECISION NUMBER, AND BREAK IT INTO SINGLE

260 ' PRECISION PARTS: 23 23 10

270 ' THE SUM OF THE NUMBERS IS THE DOUBLE PRECISION NUMBER. BUT...

280 ' MS REAL IS 24 BITS, AND DOUBLE PRECISION IS 56 BITS - SO WE

290 ' HAVE TO SPLIT INTO THREE PARTS. THE SUM OF THE FIRST TWO

300 ' PARTS WILL BE 46 BIT, AND WE WILL WORK WITH THAT. THIS WILL

310 ' IGNORE THE LOW 10 BITS. WE ALSO LOSE DYNAMIC RANGE: INSTEAD

320 ' OF 2^-64..2^63, WE MUST LOP OFF 23 BITS FOR 2^-41..2^40

330 A# = 1

340 B# = 3

350 X# = A# / B#

360 PRINT X#

370 ' B = BITS PER MANTISSA, BL = REMAINING

380 B = 23 : BL = 56 - B - B

390 ' E IS EXPONENT OF RRD NUMBER

400 E = INT(LOG(X#) / LOG(2#)) + 1

410 M# = X# * 2^(-E)

420 MH = INT(M# * 2^B)

430 M# = (M# * 2^B) - MH

440 MM = INT(M# * 2^B)

450 M# = (M# * 2^B) - MM

460 ML = INT(M# * 2^BL)

470 M# = (M# * 2^BL) - ML

480 ' M# SHOULD BE 0... WE HAVE 3 MANTISSA PARTS MH, MM, ML

490 EH = E - B

500 EM = EH - B

510 EL = EM - BL

520 P# = MH : F# = P# * 2#^EH

530 P# = MM : P# = P# * 2#^EM : F# = F# + P#

540 P# = ML : P# = P# * 2#^EL : F# = F# + P#

550 PRINT F#
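
For anyone without MBASIC to hand, the same split can be sketched in Python. One assumption to flag loudly: Python floats are IEEE doubles with a 53-bit mantissa, not the 56-bit Microsoft format, so the pieces here are 23/23/7 bits rather than 23/23/10. The function names are mine and the sketch handles positive values only.

```python
import math

def split_rrd(x, bits=23, total=53):
    """Split positive x into three (integer chunk, exponent) parts: bits/bits/rest."""
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= m < 1
    parts = []
    for width in (bits, bits, total - 2 * bits):
        m = math.ldexp(m, width)         # shift the next chunk above the binary point
        chunk = int(m)                   # same role as INT() in the BASIC listing
        m -= chunk                       # keep the remaining fraction
        e -= width
        parts.append((chunk, e))
    return parts, m                      # m is the residue, expected to be 0

def join_rrd(parts):
    """Rebuild the value as the sum of chunk * 2**exponent."""
    return sum(math.ldexp(float(chunk), exp) for chunk, exp in parts)

x = 1.0 / 3.0
parts, residue = split_rrd(x)
print(parts, residue)
print(join_rrd(parts))       # all three parts
print(join_rrd(parts[:2]))   # high and middle parts only (46 bits)
```

Joining all three parts gives back the original value exactly; dropping the low part leaves an error below 2^-46 for this input, which mirrors the result reported above for the MBASIC version.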

Phillip Stevens

Aug 10, 2021, 9:08:41 PM

to retro-comp

Fred wrote:

So, REDDING does DOUBLE PRECISION, how? Probably by doing it the "hard way" -- implement using 9511
primitive operations. I will get to it, but kind of got into the TJ Dekker 71 paper. So, started fooling around.
Here is MBASIC code - it makes a double precision number 1/3, .3333333333333333 as printed by MBASIC.
We then break it into mh, mm and ml (mantissa high, medium, low) of 23 bits, 23 bits and 10 bits.

But.. notice that we can put the number back together -- and it is still .3333333333333333 .
If we ignore the last 10 bits (531 GOTO 550)... we get .3333333333333286 -- which is just fine.
So, converting to 8080 (Z80) assembler and testing... this is fun! Note that the x * 2^B stuff
is just a bit of bit-banging in assembler... MBASIC doesn't have a good way to do that.

I know it is off track, but you could do the bit twiddling function x*2^B as a USR() function and make it available to MBASIC.

Though it is far better to do it as you propose below.

And now off to implement a set of routines for add, subtract, multiply, divide using this RRD
(real-real-double) technique. Per Dekker! After this, will try coding up "proper" double
precision. But this shows that it is actually easy to make the RRD from a double (and back again).

I think that I will call this module RRD.REL. The linkage will be L80 prog,AM9511,RRD,APU,prog/N/E

This should be the fastest (almost) DOUBLE PRECISION possible with a 9511. And, I finally published
dekker1971.pdf to my apu github.

Do you know whether Dekker method was the "go to" method used for double precision in the 70's?

I guess it must have been, given that the REDDING Am9511A library used it. That's at least one example.

You would think there'd be code examples in any scientific field where very high precision was required.

I wonder if there are other examples?

Anyway, there are quite a few people with Spencer's APU Module already, who will benefit from this accuracy. ;-)

Cheers, Phillip

Fred Weigel

Aug 10, 2021, 10:32:10 PM

to Phillip Stevens, retro-comp

Phillip

Actually, I *don't* think Redding used it. They did it the old-fashioned way. However, early nVidia GPUs had only single precision,

and THEY used Dekker. In 2005 or so. This is because those algorithms are actually very well suited to "no-branching" with fused multiply-add!

The reason I don't think Redding did it that way is that they do not lose on dynamic range! But, this RRD approach will be faster!

It should be a hair off 50% of single precision speed. Or roughly 8 to 10 times software double precision (on my first analysis).
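
For readers who have not seen the Dekker-style algorithms, the basic building block is an addition that also returns its own rounding error, so a value can be carried as a pair of shorter floats. A minimal Python sketch in IEEE doubles follows (not code from this thread or the apu repository; the function name is mine):

```python
from fractions import Fraction

def fast_two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err equal to a + b exactly.

    The error formula requires |a| >= |b|, so the operands are swapped if needed.
    """
    if abs(a) < abs(b):
        a, b = b, a
    s = a + b
    err = b - (s - a)    # the part of b that did not make it into s
    return s, err

s, err = fast_two_sum(0.1, 0.2)
print(s, err)

# Check exactness with rational arithmetic.
print(Fraction(s) + Fraction(err) == Fraction(0.1) + Fraction(0.2))
```

The swap is the one branch in this version; the branch-free variants trade it for a few extra additions, which is the property that suits hardware like GPUs.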

The MBASIC code was a bit of a lark (mostly to convince myself that it *is* possible). I like to work with something like MBASIC,

then go to C. It's just fun!

Of course, the "implied bit 24" throws the thing a bit -- that is why I went with 23/23/10. The 10 trailing bits will be discarded,

giving us 23/23 or 46 bits of mantissa. Should be good enough for almost anything. Then, I'll do the "proper" way. Gives

a choice...

But mostly, I am having a blast with this stuff...

Fred

Phillip Stevens

Aug 11, 2021, 12:33:08 AM

to retro-comp

Fred wrote:

Actually, I *don't* think Redding used it. They did it the old-fashioned way. However, early nVidia GPUs had only single precision,
and THEY used Dekker. In 2005 or so. This is because those algorithms are actually very well suited to "no-branching" with fused multiply-add!

It would be good to add a fma() like function into your BASCOM libraries. It would save two expensive APU stack push/pop cycles, and be quite helpful for some applications.

I was going to point you to my Am9511 poly() function for a similar code example, but I didn't do it. Ouch. Glaring omission!

Cheers, Phillip

Fred Weigel

Aug 11, 2021, 12:59:15 PM

to retro-comp

Phillip

Ok! I can put in a function fmaf(x,y,z), and make sure BASCOM can use it.

Also, how about fmaf1(z) and fmaf2(x, y), fmaf1() will just load z, fmaf2(x,y) computes x*y + z, and leaves sum on apu stack.

fmafr() returns top of stack as float, and pops it.

That would allow fast combined fmaf() operations. Indeed fmaf(x,y,z) itself would be fmaf1(z); fmaf2(x, y); r = fmafr();

Then we get:

t.y = x.y * y.y;

t.x = fmaf (x.y, y.y, -t.y);

t.x = fmaf (x.x, y.x, t.x);

t.x = fmaf (x.y, y.x, t.x);

t.x = fmaf (x.x, y.y, t.x);

becomes

t.y = x.y * y.y;

t.x = fmaf(x.y, y.y, -t.y);

fmaf1(t.x);

fmaf2(x.y, y.y);

fmaf2(x.x, y.x);

fmaf2(x.y, y.x);

fmaf2(x.x, y.y);

t.x = fmafr();

...something along those lines. this should allow leaving intermediate results on-chip, and speed things up even more. Naive 16 load ops,

proposed api brings it down to 10 load ops (and, I think really good time savings).

So, the 3 primitives would be fmaf1(z) fmaf2(x,y) r=fmafr() -- ok with you? we could include fmaf(x,y,z) as well, as a convenience

function.
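
To show how the three proposed primitives fit together, here is a small Python model of the APU stack. Everything in it is hypothetical: the class, the load counter and the use of Python floats are my assumptions, and it rounds after the multiply, so it models the call pattern rather than a true fused operation.

```python
class ApuSketch:
    """Toy model of the APU operand stack (the real chip holds four 32-bit entries)."""

    def __init__(self):
        self.stack = []
        self.loads = 0            # operands pushed from the host to the APU

    def _push(self, value):
        self.stack.append(float(value))
        self.loads += 1

    def fmaf1(self, z):
        """Load the accumulator z onto the APU stack."""
        self._push(z)

    def fmaf2(self, x, y):
        """Compute x*y + (top of stack) and leave the sum on the APU stack."""
        self._push(x)
        self._push(y)
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a * b)                    # multiply step
        product, acc = self.stack.pop(), self.stack.pop()
        self.stack.append(acc + product)            # add step

    def fmafr(self):
        """Pop the result back to the host."""
        return self.stack.pop()

    def fmaf(self, x, y, z):
        """Convenience wrapper: x*y + z in one call."""
        self.fmaf1(z)
        self.fmaf2(x, y)
        return self.fmafr()

apu = ApuSketch()
apu.fmaf1(1.0)
apu.fmaf2(2.0, 3.0)
apu.fmaf2(4.0, 5.0)
print(apu.fmafr(), "loads:", apu.loads)
```

In this model the chained form needs one accumulator load plus two loads per product, where separate fmaf(x,y,z) calls would reload the running sum each time.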

Fred

Phillip Stevens

Aug 11, 2021, 9:04:08 PM

to retro-comp

Ok! I can put in a function fmaf(x,y,z), and make sure BASCOM can use it.

Also, how about fmaf1(z) and fmaf2(x, y), fmaf1() will just load z, fmaf2(x,y) computes x*y + z, and leaves sum on apu stack. fmafr() returns top of stack as float, and pops it.

That would allow fast combined fmaf() operations. Indeed fmaf(x,y,z) itself would be fmaf1(z); fmaf2(x, y); r = fmafr();

...something along those lines. this should allow leaving intermediate results on-chip, and speed things up even more. Naive 16 load ops, proposed api brings it down to 10 load ops (and, I think really good time savings).

So, the 3 primitives would be fmaf1(z) fmaf2(x,y) r=fmafr() -- ok with you? we could include fmaf(x,y,z) as well, as a convenience function.

Yes that looks like a good outcome. I did something similar, but very janky and specific, for the planetary motion problem with multiple APU Modules. So I know intermediate results on-chip would be very useful.

It would be good to add a fma() like function into your BASCOM libraries. It would save two expensive APU stack push/pop cycles, and be quite helpful for some applications.

The application I have in mind is Horner’s method for calculation of polynomials. So it would be good to test the functions with looped and unrolled polynomials. With a good implementation of Horner’s method, any transcendental function can be approximated efficiently.
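
As a reminder of why a multiply-add primitive fits so well, Horner's method is just one multiply-add per coefficient. A minimal Python sketch (the function name and the sample polynomial are mine):

```python
def horner(coeffs, x):
    """Evaluate a polynomial; coeffs run from the highest power down to the constant."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c      # one multiply-add per coefficient
    return acc

# 2x^3 - 6x^2 + 2x - 1 evaluated at x = 3
print(horner([2.0, -6.0, 2.0, -1.0], 3.0))
```

On the APU the accumulator could stay on the chip's stack for the whole loop, so only x and the next coefficient need to be loaded at each step.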

FWIW, I’ve used the LOLREMEZ tool to prepare coefficients for many of the things I’ve been doing. You can, for example, use it to get coefficients in double precision for any functions you want to build later.

Cheers, Phillip
