MicroSoft F80/BASCOM


Fred Weigel

Jul 22, 2021, 11:57:28 AM
to retro-comp
Because none of the AM9511 FORTRAN-80 libraries can be located,
I am writing APU.REL that can be linked with F80 compiled code.

So far, I have $D9 (INTEGER/INTEGER), $DY (INTEGER*4/INTEGER), $D1 (INTEGER*4/INTEGER*4), $M9, $MY and $M1 (MULTIPLY) done.
Beginning on the REAL + - * / **, which will be added soon. Published to my
github (just in case my hardware crashes).


And I was wondering if these were applicable to MicroSoft BASCOM.
OBSLIB.REL does have $D9, and $M9. Of course BASCOM/MBASIC for the
8080 did not support V& (32 bit integer), so $DY, $D1, $MY and $M1
being missing is kind of expected.

What I was wondering... if anyone knows... (I have completely forgotten;
it has been 40 years) - would a BASCOM-compiled program using
/O (OBSLIB.REL) use the FORTRAN-80 routines for arithmetic? If
it does, then APU.REL will apply to both. Does anyone know? Note that
even BASLIB.REL contains $M9 and $D9, so maybe the BRUN.COM
stuff can be accelerated (this is a leap -- $M9 will be in BRUN, and if I
patch it exactly, I could possibly get an accelerated BRUN.COM).

If it DOES work that way, then we get MBASIC (slow), BASCOM (faster) and
BASCOM+APU (fastest).

Thanks in advance
Fred Weigel

Fred Weigel

Jul 23, 2021, 12:53:27 PM
to retro-comp
Quick update

I now have INTEGER and INTEGER*4 multiply and divide, and REAL REAL
(both operands REAL) add, subtract, multiply, divide and power
done.

Now it's likely to be usable as an AM9511A support library.
Working on REAL INTEGER mixed operations.

Note that there is a bug in apu.mac -- 1.0 / 2.0 gives -1.0,
not 0.5 as it should. But, in answer to my original question:

BASCOM 5.30a generates

CALL $DVDA / DW X! / DW Y! / CALL $FASO / DW Z!

for Z = X / Y where both X and Y are REAL. This, in turn,
generates a call to $DB, which does the divide. Linking
APU.REL does work with BASCOM.

Fred Weigel

Jul 23, 2021, 1:12:39 PM
to retro-comp
And, fixed

Phillip Stevens

Jul 24, 2021, 8:19:55 AM
to retro-comp
Fred wrote:
Because none of the AM9511 FORTRAN-80 libraries can be located,
I am writing APU.REL that can be linked with F80 compiled code.

So far, I have $D9 (INTEGER/INTEGER), $DY (INTEGER*4/INTEGER), $D1 (INTEGER*4/INTEGER*4), $M9, $MY and $M1 (MULTIPLY) done.
Beginning on the REAL +- * / **, which will be added soon.

Interested to see this, and how to write .REL files.

If it DOES work that way, then we get MBASIC (slow), BASCOM (faster) and
BASCOM+APU (fastest).

The experience from integrating into z88dk and C libraries was that it is actually slower to use the 16 bit operations on the APU, than doing it with a 7.3MHz (RC2014) Z80.
So after putting them all in, I had to back them out. They're all still there, just not enabled. Interested to hear what you find.

It would be great to have a BASCOM solution for the APU Module in the RC2014.

The MBASIC APU version is faster than with software floating point, but not by very much. When looking at the calls, there is too much overhead from interpretation to get much improvement.
So even if you triple the speed of the math for example, the proportion of math time to instruction decoding time is just too low to see more than about 15% overall improvement.
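
The 15% figure is roughly what Amdahl's law gives when only a small share of the run time is arithmetic. A minimal sketch in Python, assuming (purely for illustration) that math is about 20% of an interpreted program's run time and about 95% of a compiled one; both fractions are hypothetical, not measurements:

```python
def overall_speedup(math_fraction, math_speedup):
    """Amdahl's law: whole-program speedup when only the math part gets faster."""
    return 1.0 / ((1.0 - math_fraction) + math_fraction / math_speedup)

# Hypothetical split: 20% arithmetic, 80% interpreter overhead.
interpreted = overall_speedup(0.20, 3.0)
# Hypothetical compiled program where arithmetic dominates the run time.
compiled = overall_speedup(0.95, 3.0)

print(f"interpreted: {interpreted:.2f}x, compiled: {compiled:.2f}x")
```

With these assumed fractions, tripling the math speed gives about 1.15x overall for the interpreted case and about 2.7x for the compiled one.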

Having BASCOM would change the proposition entirely and provide an opportunity for the APU to benefit performance from BASIC (as it does in C) up to about 3x performance.
So good luck!

Phillip

Fred Weigel

Jul 24, 2021, 10:44:11 AM
to retro-comp
Phillip

BASCOM appears to work

BASCOM =PROG/O/Z

L80 PROG,APU,PROG/N/E

would do to compile and link PROG.BAS with APU. I haven't extensively tested this solution, and
have never run the APU library on a real device (I don't have one). This does work on the emulator.

Fred Weigel

Fred Weigel

Jul 24, 2021, 3:46:47 PM
to retro-comp
Phillip

If you are willing to do some benchmark testing...

I have added a "bm" (benchmark) directory to my apu repository.

This contains a benchmark document from 1982, and 9 benchmark programs, in BASIC,
FORTRAN and PASCAL. I entered and ran them from the paper. They are very
outdated... but are all I can find from that era.

There are results in the paper for Z80 with FORTRAN and two libraries (Redding and Memtech).
It would be interesting to know how APU.REL compares. Note that BM9.FOR is accelerated
with the Redding library, which implies that INTEGER acceleration is useful.

At least at 2, 2.5, 3 and 4 MHz -- not sure about 7+ MHz.

The Pascal programs are included and have been run with Turbo Pascal -- apparently there is
a problem with BM9.PAS in Pascal MT+ (but I haven't bothered with that yet)

FORTRAN programs all compiled and run with F80
BASIC run with MBASIC  and BASCOM

BM9.BAS

Using BASCOM with APU on zxcc

0m0.558s

Same linked WITHOUT APU

0m0.770s

which means ~30% improvement! On an INTEGER dominant benchmark -- and the ONLY
thing we accelerate is 160 LET M=N/K
With F80, we should see an even better improvement -- should compare with the Redding
entries in the benchmark result table.
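
For reference, the two zxcc timings above can be turned into a speedup factor and a fraction of run time saved. A small sketch in Python; the function name is just illustrative:

```python
def improvement(baseline_s, accelerated_s):
    """Return (speedup factor, fraction of run time saved)."""
    speedup = baseline_s / accelerated_s
    saved = (baseline_s - accelerated_s) / baseline_s
    return speedup, saved

# BM9.BAS timings from the post: without APU, then with APU.
speedup, saved = improvement(0.770, 0.558)
print(f"{speedup:.2f}x faster, {saved:.1%} less run time")
```

That is about 1.38x, or a little under 28% less run time, which is in line with the rough 30% quoted.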

Let me know if you want/need the APU.REL, or BM9.COM
 
Fred Weigel

Phillip Stevens

Jul 25, 2021, 3:44:27 AM
to retro-comp
I've been having a bit of a go with this.
I think I've gotten BM8 to work without the APU, but it seems to hang when using the APU.

See if you can follow along with my thread below.
There are a bunch of errors when assembling the APU.MAC file, which I guess are unresolved symbols.
There are also some messages from L80 that might be an issue. I don't know it well enough.
Anyway, at the end the APU enabled BM8 just hangs.

Am I doing something wrong?

B>a:ddir

-- Directory of volume #1 --
BASCOM2.HLP    ........29312
BASCOM.COM     ........32768
BASCOM.HLP     ........14976
BASLIB.REL     ........24960
BRUN.COM       ........15488
CREF80.COM     .........3968
CREF.COM       .........3968
D.COM          .........1792
L80.COM        ........10752
LIB80.COM      .........4736
M80.COM        ........19200
MBASIC.COM     ........24320
OBSLIB.REL     ........48384
SAMPLE.BAS     ..........128
COLOUR.BAS     ..........896
COLOUR.PRN     .........5376
BCLOAD         ..........128
COLOUR.REL     .........1024
COLOUR.COM     .........1152
APU.MAC        ........20864
AM9511.MAC     ..........256
BM8.BAS        ..........256
Total bytes: 264704.

B>m80 =apu

M                               AM.SIN   EQU      02H                   ; SINE
M                               AM.CHSF  EQU      15H                   ; FLOATING CHANGE SIGN
M                               AM.FLTS  EQU      1DH                   ; 16 BIT TO FLOAT
M                               AM.FIXD  EQU      1EH                   ; FLOAT TO 32 BIT
M                               AM.FIXS  EQU      1FH                   ; FLOAT TO 16 BIT
M 0060                          AM.SINGL EQU      60H                   ; 16 BIT INTEGER
M 0020                          AM.FIXED EQU      20H                   ; FIXED POINT
M 0002                          AM.SIN   EQU      02H                   ; SINE
M 0014                          AM.CHS   EQU      14H                   ; CHANGE SIGN
M 0015                          AM.CHSF  EQU      15H                   ; FLOATING CHANGE SIGN
M 001C                          AM.FLTD  EQU      1CH                   ; 32 BIT TO FLOAT
M 001D                          AM.FLTS  EQU      1DH                   ; 16 BIT TO FLOAT
M 001E                          AM.FIXD  EQU      1EH                   ; FLOAT TO 32 BIT
M 001F                          AM.FIXS  EQU      1FH                   ; FLOAT TO 16 BIT
E 0020'   D3 00                          OUT      DA9511
E 0024'   D3 00                          OUT      DA9511
E 002A'   D3 00                          OUT      DA9511
E 002D'   D3 00                          OUT      DA9511
E 0037'   D3 00                          OUT      DA9511
E 0039'   D3 00                          OUT      DA9511
E 003B'   D3 00                          OUT      DA9511
E 003D'   D3 00                          OUT      DA9511
E 0048'   D3 00                          OUT      DA9511
E 004A'   D3 00                          OUT      DA9511
E 004C'   D3 00                          OUT      DA9511
E 0051'   D3 00                          OUT      DA9511
E 0056'   DB 00                          IN       DA9511                ; 9511 EXPONENT
E 0059'   DB 00                          IN       DA9511                ; 9511 HIGH MANTISSA
E 005D'   DB 00                          IN       DA9511                ; 9511 MIDDLE MANTISSA
E 0061'   DB 00                          IN       DA9511                ; 9511 LOW MANTISSA
E 009C'   D3 00                          OUT      DA9511
E 00A0'   D3 00                          OUT      DA9511
E 00A4'   D3 00                          OUT      DA9511
E 00A8'   D3 00                          OUT      DA9511
E 00AC'   D3 00                          OUT      ST9511
E 00AE'   DB 00           +     ..0000:       IN       ST9511
E 00D1'   D3 00                          OUT      DA9511
E 00D4'   D3 00                          OUT      DA9511
E 00D8'   D3 00                          OUT      ST9511
E 00DA'   DB 00           +     ..0001:       IN       ST9511
E 0101'   D3 00                          OUT      ST9511
E 0103'   DB 00           +     ..0002:       IN       ST9511
E 010F'   DB 00                          IN       ST9511
E 0173'   D3 00                          OUT      DA9511
E 0176'   D3 00                          OUT      DA9511
E 0179'   D3 00                          OUT      DA9511
E 017C'   D3 00                          OUT      DA9511
E 0180'   D3 00                          OUT      ST9511
E 0182'   DB 00           +     ..0003:       IN       ST9511
E 0198'   DB 00                 $D9.2:   IN       DA9511
E 019B'   DB 00                          IN       DA9511
E 01AB'   D3 00                          OUT      DA9511
E 01AE'   D3 00                          OUT      DA9511
E 01B4'   D3 00                          OUT      DA9511
E 01B7'   D3 00                          OUT      DA9511
E 01BB'   D3 00                          OUT      DA9511
E 01BE'   D3 00                          OUT      DA9511
E 01C2'   D3 00                          OUT      DA9511
E 01C4'   D3 00                          OUT      DA9511
E 01C8'   D3 00                          OUT      ST9511
E 01CA'   DB 00           +     ..0004:       IN       ST9511
E 01E6'   DB 00                 $DY.2:   IN       DA9511
E 01E9'   DB 00                          IN       DA9511
E 01EF'   DB 00                          IN       DA9511
E 01F2'   DB 00                          IN       DA9511
E 0205'   D3 00                          OUT      DA9511
E 0208'   D3 00                          OUT      DA9511
E 020E'   D3 00                          OUT      DA9511
E 0211'   D3 00                          OUT      DA9511
E 0215'   D3 00                          OUT      DA9511
E 0219'   D3 00                          OUT      DA9511
E 021D'   D3 00                          OUT      DA9511
E 0221'   D3 00                          OUT      DA9511
E 0225'   D3 00                          OUT      ST9511
E 0227'   DB 00           +     ..0005:       IN       ST9511
E 0243'   DB 00                 $D1.2:   IN       DA9511
E 0246'   DB 00                          IN       DA9511
E 024C'   DB 00                          IN       DA9511
E 024F'   DB 00                          IN       DA9511
E 025E'   D3 00                          OUT      DA9511
E 0261'   D3 00                          OUT      DA9511
E 0264'   D3 00                          OUT      DA9511
E 0267'   D3 00                          OUT      DA9511
E 026B'   D3 00                          OUT      ST9511
E 026D'   DB 00           +     ..0006:       IN       ST9511
E 027C'   DB 00                 $M9.2:   IN       DA9511
E 027F'   DB 00                          IN       DA9511
E 0287'   D3 00                          OUT      DA9511
E 028A'   D3 00                          OUT      DA9511
E 028E'   D3 00                          OUT      DA9511
E 0290'   D3 00                          OUT      DA9511
E 0296'   D3 00                          OUT      DA9511
E 0299'   D3 00                          OUT      DA9511
E 029F'   D3 00                          OUT      DA9511
E 02A2'   D3 00                          OUT      DA9511
E 02A6'   D3 00                          OUT      ST9511
E 02A8'   DB 00           +     ..0007:       IN       ST9511
E 02BD'   DB 00                 $MY.2:   IN       DA9511
E 02C0'   DB 00                          IN       DA9511
E 02C6'   DB 00                          IN       DA9511
E 02C9'   DB 00                          IN       DA9511
E 02D4'   D3 00                          OUT      DA9511
E 02D8'   D3 00                          OUT      DA9511
E 02DC'   D3 00                          OUT      DA9511
E 02E0'   D3 00                          OUT      DA9511
E 02E6'   D3 00                          OUT      DA9511
E 02E9'   D3 00                          OUT      DA9511
E 02EF'   D3 00                          OUT      DA9511
E 02F2'   D3 00                          OUT      DA9511
E 02F6'   D3 00                          OUT      ST9511
E 02F8'   DB 00           +     ..0008:       IN       ST9511
E 030D'   DB 00                 $M1.2:   IN       DA9511
E 0310'   DB 00                          IN       DA9511
E 0316'   DB 00                          IN       DA9511
E 0319'   DB 00                          IN       DA9511

115 Fatal error(s)

B>m80 =am9511


No  Fatal error(s)

B>lib80 apu=am9511,apu/e


B>bascom =bm8/o


00000 Fatal Error(s)
24502 Bytes Free

B>l80 bm8,bm8/n/e


Link-80  3.43  14-Apr-81  Copyright (c) 1981 Microsoft

Data    0103    27CB    < 9928>

32721 Bytes Free
[013C   27CB       39]

B>bm8

BM8
E


B>era bm8.com

B>l80 bm8,apu,bm8/n/e


Link-80  3.43  14-Apr-81  Copyright (c) 1981 Microsoft
%Mult. Def. Global $AA
%Mult. Def. Global $AB
%Mult. Def. Global $SA
%Mult. Def. Global $SB

Data    0103    2793    < 9872>

32819 Bytes Free
[013C   2793       39]

B>bm8

BM8

 

Fred Weigel

Jul 25, 2021, 8:29:14 AM
to retro-comp
Phillip

Ok -- the error is that the ports (DA9511 and ST9511) are being assembled as 00, not external.

The M errors are also curious. Wondering about the version of tools you are using. I am going
to add M80.COM, L80.COM, LIB.COM to the repository (actually, going to include F80.COM, BASCOM.COM
and FORLIB.REL, OBSLIB.REL and a compiled version of APU.REL). Note that my tools are a bit
smaller -- used popcom.com on them. The compiled version of apu.rel uses 43h/42h for status/data.

FredW

Fred Weigel

Jul 25, 2021, 8:33:59 AM
to retro-comp
The linker warnings are normal -- see apu.txt for the explanation.

FredW

Fred Weigel

Jul 25, 2021, 2:57:08 PM
to retro-comp
Phillip

It is the assembler itself -- pretty sure...

Try creating a  one-line file X.MAC containing

 END ; THAT IS <SPACE>END<CR><LF>

and assemble:

m80 =x/l

X.PRN should now be something like:

MACRO-80 3.44 30-Aug-82 PAGE 1


                                 END

MACRO-80 3.44 30-Aug-82 PAGE S


Macros:

Symbols:



No Fatal error(s)

The important thing is the version: as you can see, I use MACRO-80 3.44
I also tried the ALDS assembler: MSX.M-80 1.00 and that worked as well.

The only "unorthodox" thing in APU.MAC is that the ports are imported
(with EXTRN). Those become 8 bit numbers in the IN and OUT.

Note that there are some differences between the different MACRO-80 assemblers -
3.44 and ALDS support EXT EXTRN EXTERNAL (all the same) and BYTE EXT,
BYTE EXTRN and BYTE EXTERNAL (also all the same).

*IF* the EXTRN is messed up, M80 *may* do weird things -- like lots of U and M errors.

FredW

Phillip Stevens

Jul 26, 2021, 1:01:00 AM
to retro-comp
Fred wrote:
It is the assembler itself -- pretty sure...
Try creating a  one-line file X.MAC containing

 END ; THAT IS <SPACE>END<CR><LF>

and assemble:

m80 =x/l

X.PRN should now be something like:

MACRO-80 3.44 30-Aug-82 PAGE 1

                                 END

MACRO-80 3.44 30-Aug-82 PAGE

Yes. That fixed it. I didn't find the assembler you have (date), but at least the same version from a few months earlier.
Odd that that minor increment in version changed things so greatly.

I can play with benchmarking now. ;-)

But do note that the documented benchmarks are with 2MHz, 3MHz, and 4MHz Z80 and 4MHz Am9511.
Since I couldn't find the 4MHz versions easily, I built the APU Module with a 3:1 clock, which means it is proportionally less effective.
So when I backed out the integer routines in z88dk it was because they were (only slightly) slower on the APU than the Z80 host running 3x faster clock.
The benchmarks would show the results from 1:1 clocks up to 4MHz, so there will be a difference.

B>dir

B: BASCOM2  HLP : BASCOM   COM : BASCOM   HLP : BASLIB   REL
B: BRUN     COM : CREF80   COM : CREF     COM : D        COM
B: L80      COM : LIB80    COM : M80      COM : MBASIC   COM
B: OBSLIB   REL : BM8      REL : AM9511   REL : APU      REL
B: BM8      COM : SAMPLE   BAS : COLOUR   BAS : COLOUR   PRN
B: BCLOAD       : COLOUR   REL : COLOUR   COM : APU      MAC
B: AM9511   MAC : BM8      BAS : X        MAC : X        PRN
B: X        REL
B>era m80.com

B>a:xmodem m80.com /r

File created
Receiving via CON with CRCsCCCC
B>m80 =x/l

No Fatal error(s)

B>type x.prn

        MACRO-80 3.44   09-Dec-81       PAGE    1


                                 end ;
        MACRO-80 3.44   09-Dec-81       PAGE    S


Macros:

Symbols:

No Fatal error(s)

B>m80 =apu

No Fatal error(s)

B>m80 =am9511

No Fatal error(s)

B>lib80 apu=am9511,apu/e

B>l80 bm8,bm8/n/e


Link-80  3.43  14-Apr-81  Copyright (c) 1981 Microsoft

Data    0103    27CB    < 9928>

32721 Bytes Free
[013C   27CB       39]

B>era bm8.com

B>l80 bm8,apu,bm8/n/e


Link-80  3.43  14-Apr-81  Copyright (c) 1981 Microsoft
%Mult. Def. Global $AA
%Mult. Def. Global $AB
%Mult. Def. Global $SA
%Mult. Def. Global $SB

Data    0103    2793    < 9872>

31809 Bytes Free
[013C   2793       39]

B>bm8

BM8
E

B> 

Phillip Stevens

Jul 26, 2021, 1:09:05 AM
to retro-comp
I also found I have another version of l80, which seems to be more efficient at using system memory.
But it doesn't change the outcome as far as I can see.

B>l80 bm8,apu,bm8/n/e

Link-80  Disk Vers. 3.55  10-Sep-82  Copyright (c) 1981 Microsoft
%Mult. Def. Global $AA
%Mult. Def. Global $AB
%Mult. Def. Global $SA
%Mult. Def. Global $SB

Data    0103    2793    < 9872>

35419 Bytes Free
[013C   2793       39]

B>bm8

BM8
E

B>

Fred Weigel

Jul 26, 2021, 1:47:29 AM
to retro-comp

Phillip

Yea!

with bascom, also try the /z switch for z80 code generation. f80 doesn't do z80, just 8080
(as far as I remember).

That "disk" l80 came with one of Microsoft's languages (COBOL?). I think it was called ld80.

As I remember, it didn't do much except slow down linking. Maybe to do with more symbols
or something? That one I never used.

I am happy that APU is working for you and that you are doing some benchmarking!

FredW

Phillip Stevens

Jul 26, 2021, 2:22:52 AM
to retro-comp
Sample testing with BM8

100 REM    BM8
300 PRINT "BM8"
400 K=0
430 DIM M(5)
500 K=K+1
550 A=K^2
560 B=LOG(K)
570 C=SIN(K)
580 IF K<1000 THEN 500
700 PRINT "E"
800 END


RC2014 Z80 7.3728 MHz and APU 2.4576MHz.
1000x iterations

MBASIC 5.29 (interpreted)
Software - 28.0 sec (equivalent to 51 seconds at 4MHz)

MBASIC 4.7C (interpreted) - Z80 instruction optimised.
Software - 26.5 sec (equivalent to 49 seconds at 4MHz)
APU Module - 12.5 sec
More than 2x faster with the APU.

BASCOM 5.03 (compiled)
Software - 27.8 sec  (equivalent to 51 seconds at 4MHz)
APU Module - 9.1 sec
This aligns well with the compiled C results of about 3x faster with the APU.
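
The "equivalent at 4MHz" figures above follow from scaling the software times by the clock ratio. A minimal sketch in Python, assuming run time is purely CPU-bound and scales inversely with the Z80 clock; only the software times are scaled, since the APU runs from its own clock:

```python
Z80_MHZ = 7.3728   # RC2014 Z80 clock from the post
REF_MHZ = 4.0      # Z80 clock used in the 1982 tables

def at_ref_clock(seconds, actual_mhz=Z80_MHZ, ref_mhz=REF_MHZ):
    """Scale a measured time to what a slower reference clock would take."""
    return seconds * actual_mhz / ref_mhz

results = {  # name: (software seconds, APU seconds or None)
    "MBASIC 5.29": (28.0, None),
    "MBASIC 4.7C": (26.5, 12.5),
    "BASCOM 5.03": (27.8, 9.1),
}
for name, (soft, apu) in results.items():
    line = f"{name}: {at_ref_clock(soft):.1f} s at 4 MHz"
    if apu is not None:
        line += f", APU speedup {soft / apu:.2f}x"
    print(line)
```

The scaled values come out within a second of the rounded figures quoted above, and the APU speedups are about 2.1x for MBASIC and about 3.05x for BASCOM.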


Related scores from the August 1982 Benchmark Document.

NOTE. The table scores are multiplied by 10x, as the tables are calculated with 100 iterations.

TABLE 2 - 4MHz Z80

MBASIC 5.2 (interpreted)
Software - 62 sec

MBASIC 4.51 (interpreted)
Software - 66 sec

BASCOM 5.0 (compiled)
Software - 61 sec

BASIC-E with 4MHz APU
APU - 12 sec

TABLE 8

To compare with this table, we're dividing by 10 for the RC2014/APU Module with 1000 iterations.

So with 0.91 sec for BM8, in 1982 terms we're the second fastest thing on the planet (or at least in the document),
behind the Cyber 171 with 0.36 sec and in front of the Wang 2200VP with 1.0 sec.

Well done Fred.

Cheers, Phillip

Fred Weigel

Jul 30, 2021, 8:15:57 PM
to retro-comp
Phillip

So, I have been talking with Marcus R Wigan, who authored that paper -- he requested proper attribution in the github,
which I did. He then sent me a scan of the REDDING library documentation. He still has this material. Very interesting.
I have asked Wigan if he minds if I put that on-line.

Anyway, the main thing is that the REDDING APU support library also supports DOUBLE PRECISION. Accelerated.

This is possible -- I was thinking... double*double can be float*float, and fill the bottom mantissa. Exponent is the
same -- we just need wide multiply. Which AM9511 has - 32x32 -> 64 multiply. So, I have started to design
DOUBLE PRECISION MULTIPLY.  Then, we will need division. Have to do some serious benchmarking for the
DOUBLE PRECISION ADD/SUB cases.

So, for convenience in my coding - I tied F80 together with HI-TECH C. Result is


That lets me mess around with FORTRAN DOUBLE PRECISION, doing bit/byte banging coding in C.
Put mixed on my github yesterday - working on DOUBLE PRECISION MULTIPLY now...

Fred Weigel 

Fred Weigel

Jul 31, 2021, 2:54:12 PM
to retro-comp
https://github.com/ratboy666/apu/blob/main/bm/1982%20apulib%20redding%20group%20v1.06%20copy.pdf

Is the REDDING GROUP apulib manual.

Note the acceleration of DOUBLE PRECISION. Implementation notes:

A double precision (64 bit) operation can be a 32 bit operation, followed by mantissa recalculation.
The exponent can be adjusted. If it is too large coming out of the operation, the am9511 allows adding or subtracting
128 to bring it into range.

If too large coming in.. scale, do the operation and scale again, I would imagine. Working through these
things now.

Fred Weigel

Bill McMullen

Jul 31, 2021, 8:12:48 PM
to retro-comp
Out of curiosity I tried the BM8 benchmark using interpreted MBASIC 5.21 on a 50 MHz eZ80.  In order to provide a reasonable time frame for using a stopwatch, the iterations were bumped up to 10,000 and the result was about 11.9 seconds or 0.119 when adjusted to the referenced table.

Compiled BASIC was only about 3% faster for this benchmark since there's very little work for the interpreter.  On the ASCIIART benchmark, BASCOM is much faster at 3.0 vs 6.8 seconds.

Phillip Stevens

Jul 31, 2021, 11:06:09 PM
to retro-comp
I find it interesting that this Redding document from 1982 carefully explains the same APU limitations with loading overhead, that I found independently nearly 40 years later. LoL. 

Are we doomed to learn nothing from history?

P. 

Fred Weigel

Aug 10, 2021, 5:54:03 PM
to retro-comp
Phillip

So, REDDING does DOUBLE PRECISION, how? Probably by doing it the "hard way" -- implement using 9511
primitive operations. I will get to it,  but kind of got into the TJ Dekker 71 paper. So, started fooling around.
Here is MBASIC code - it makes a double precision number 1/3,   .3333333333333333 as printed by MBASIC.
We then break it into mh, mm and ml (mantissa high, medium, low) of 23 bits, 23 bits and 10 bits.

Notice the loss of dynamic range... even if we ignore the last 10 bits, we still lose 2^23 -- very sucky.

But.. notice that we can put the number back together -- and it is still .3333333333333333.
If we ignore the last 10 bits (531 GOTO 550)... we get .3333333333333286 -- which is just fine.
So, converting to 8080 (Z80) assembler and testing... this is fun! Note that the x * 2^B stuff
is just a bit of bit-banging in assembler... MBASIC doesn't have a good way to do that.

And now off to implement a set of routines for add, subtract, multiply, divide using this RRD
(real-real-double) technique. Per Dekker!  After this, will try coding up "proper" double
precision. But this shows that it is actually easy to make the RRD from a double (and back again).

I think that I will call this module RRD.REL. The linkage will be L80 prog,AM9511,RRD,APU,prog/N/E

and then RRD is optional -- it can (almost) be a user library, except for the special names needed
to override FORLIB.REL.

This should be the fastest (almost) DOUBLE PRECISION possible with a 9511. And, I finally published
dekker1971.pdf to my apu github. 

Anyway... I think that this is what Dekker was after...

Enjoy!
Fred


  240 ' DOUBLE PRECISION (RRD)
  250 ' GENERATE A DOUBLE PRECISION NUMBER, AND BREAK IT INTO SINGLE
  260 ' PRECISION PARTS:  23 23 10
  270 ' THE SUM OF THE NUMBERS IS THE DOUBLE PRECISION NUMBER. BUT...
  280 ' MS REAL IS 24 BITS, AND DOUBLE PRECISION IS 56 BITS - SO WE
  290 ' HAVE TO SPLIT INTO THREE PARTS. THE SUM OF THE FIRST TWO
  300 ' PARTS WILL BE 46 BIT, AND WE WILL WORK WITH THAT. THIS WILL
  310 ' IGNORE THE LOW 10 BITS. WE ALSO LOSE DYNAMIC RANGE: INSTEAD
  320 ' OF 2^-64..2^63, WE MUST LOP OFF 23 BITS FOR 2^41..2^40
  330 A# = 1
  340 B# = 3
  350 X# = A# / B#
  360 PRINT X#
  370 ' B = BITS PER MANTISSA, BL = REMAINING
  380 B = 23 : BL = 56 - B - B
  390 ' E IS EXPONENT OF RRD NUMBER
  400 E = INT(LOG(X#) / LOG(2#)) + 1
  410 M# = X# * 2^(-E)
  420 MH = INT(M# * 2^B)
  430 M# = (M# * 2^B) - MH
  440 MM = INT(M# * 2^B)
  450 M# = (M# * 2^B) - MM
  460 ML = INT(M# * 2^BL)
  470 M# = (M# * 2^BL) - ML
  480 ' M# SHOULD BE 0... WE HAVE 3 MANTISSA PARTS MH, MM, ML
  490 EH = E - B
  500 EM = EH - B
  510 EL = EM - BL
  520 P# = MH : F# = P# * 2#^EH
  530 P# = MM : P# = P# * 2#^EM : F# = F# + P#
  540 P# = ML : P# = P# * 2#^EL : F# = F# + P#
  550 PRINT F#
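
The same 23/23/10 split can be checked with exact arithmetic. The following is a sketch in Python, not a port of the listing: it assumes a 56-bit mantissa as stated in the REM lines, truncates rather than rounds, and uses rationals so that no host floating-point rounding is involved. The function names are made up for the illustration.

```python
from fractions import Fraction

B = 23               # bits in each of the two high parts
TOTAL = 56           # mantissa bits assumed for a Microsoft double
BL = TOTAL - 2 * B   # 10 bits left over

def normalize(x):
    """Return (m, e) with x == m * 2**e and 0.5 <= m < 1, for x > 0."""
    m, e = Fraction(x), 0
    while m >= 1:
        m /= 2
        e += 1
    while m < Fraction(1, 2):
        m *= 2
        e -= 1
    return m, e

def split(x):
    """Split x into integer parts (mh, mm, ml) plus exponent e."""
    m, e = normalize(x)
    mant = int(m * 2**TOTAL)          # truncate to a 56-bit integer mantissa
    mh = mant >> (B + BL)             # top 23 bits
    mm = (mant >> BL) & (2**B - 1)    # next 23 bits
    ml = mant & (2**BL - 1)           # low 10 bits
    return mh, mm, ml, e

def join(mh, mm, ml, e):
    """Rebuild the value, using the same EH/EM/EL exponents as the listing."""
    eh = e - B
    em = eh - B
    el = em - BL
    two = Fraction(2)
    return mh * two**eh + mm * two**em + ml * two**el

x = Fraction(1, 3)
mh, mm, ml, e = split(x)
print(mh, mm, ml, e)
print(float(x - join(mh, mm, 0, e)))  # error when the low 10 bits are dropped
```

For 1/3 the dropped low part is 682 * 2^-57, about 4.7e-15, which is consistent with the .3333333333333286 result mentioned above.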

Phillip Stevens

Aug 10, 2021, 9:08:41 PM
to retro-comp
Fred wrote:
So, REDDING does DOUBLE PRECISION, how? Probably by doing it the "hard way" -- implement using 9511
primitive operations. I will get to it,  but kind of got into the TJ Dekker 71 paper. So, started fooling around.
Here is MBASIC code - it makes a double precision number 1/3,   .3333333333333333 as printed by MBASIC.
We then break it into mh, mm and ml (mantissa high, medium, low) of 23 bits, 23 bits and 10 bits.

But.. notice that we can put the number back together -- and it is still .3333333333333333.
If we ignore the last 10 bits (531 GOTO 550)... we get .3333333333333286 -- which is just fine.
So, converting to 8080 (Z80) assembler and testing... this is fun! Note that the x * 2^B stuff
is just a bit of bit-banging in assembler... MBASIC doesn't have a good way to do that.

I know it is off track, but you could do the bit twiddling function x*2^B as a USR() function and make it available to MBASIC.
Though it is far better to do it as you propose below.
 
And now off to implement a set of routines for add, subtract, multiply, divide using this RRD
(real-real-double) technique. Per Dekker!  After this, will try coding up "proper" double
precision. But this shows that it is actually easy to make the RRD from a double (and back again).

I think that I will call this module RRD.REL The linkage will be L80 prog,AM9511,RRD,APU,prog/N/E

This should be the fastest (almost) DOUBLE PRECISION possible with a 9511. And, I finally published
dekker1971.pdf to my apu github.

Do you know whether Dekker method was the "go to" method used for double precision in the 70's?
I guess it must have been, given that the REDDING Am9511A library used it. That's at least one example.
You would think there'd be code examples in any scientific field where very high precision was required.
I wonder if there are other examples?

Anyway, there are quite a few people with Spencer's APU Module already, who will benefit from this accuracy. ;-)
Cheers, Phillip

Fred Weigel

Aug 10, 2021, 10:32:10 PM
to Phillip Stevens, retro-comp
Phillip

Actually, I *don't* think Redding used it. They did it the old-fashioned way. However, early nVidia GPUs had only single precision,
and THEY used Dekker. In 2005 or so. This is because those algorithms are actually very well suited to "no-branching" with fused multiply-add!

The reason I don't think Redding did it that way is that they do not lose on dynamic range! But, this RRD approach will be faster!
It should be a hair off 50% of single precision speed. Or roughly 8 to 10 times software double precision (on my first analysis).

The MBASIC code was a bit of a lark (mostly to convince myself that it *is* possible). I like to work with something like MBASIC,
then go to C. It's just fun!

Of course, the "implied bit 24" throws the thing a bit -- that is why I went with 23/23/10. The 10 trailing bits will be discarded,
giving us 23/23 or 46 bits of mantissa. Should be good enough for almost anything. Then, I'll do the "proper" way. Gives
a choice...

But mostly, I am having a blast with this stuff...

Fred



Phillip Stevens

Aug 11, 2021, 12:33:08 AM
to retro-comp
Fred wrote:
Actually, I *don't* think Redding used it. They did it the old-fashioned way. However, early nVidia GPUs had only single precision,
and THEY used Dekker. In 2005 or so. This is because those algorithms are actually very well suited to "no-branching" with fused multiply-add!

It would be good to add a fma() like function into your BASCOM libraries. It would save two expensive APU stack push/pop cycles, and be quite helpful for some applications.
I was going to point you to my Am9511 poly() function for a similar code example, but I didn't do it. Ouch. Glaring omission!
Cheers, Phillip

Fred Weigel

Aug 11, 2021, 12:59:15 PM
to retro-comp
Phillip

Ok! I can put in a function fmaf(x,y,z), and make sure BASCOM can use it.

Also, how about fmaf1(z) and fmaf2(x, y), fmaf1() will just load z, fmaf2(x,y) computes x*y + z, and leaves sum on apu stack.
fmafr() returns top of stack as float, and pops it.

That would allow fast combined fmaf() operations. Indeed fmaf(x,y,z) itself would be fmaf1(z) fmaf2(x, y) z = fmafr();
Then we get:

 t.y = x.y * y.y;
 t.x = fmaf (x.y, y.y, -t.y);
 t.x = fmaf (x.x, y.x, t.x);
 t.x = fmaf (x.y, y.x, t.x); 
 t.x = fmaf (x.x, y.y, t.x);

becomes

 t.y = x.y * y.y
 t.x = fmaf(x.y,y.y,-t.y)
 fmaf1(t.x);
 fmaf2(x.y,y.y)
 fmaf2(x.x,y.x)
 fmaf2(x.y,y.x)
 fmaf2(x.x,y.y)
 t.x=fmafr();

...something along those lines. this should allow leaving intermediate results on-chip, and speed things up even more. Naive 16 load ops,
proposed api brings it down to 10 load ops (and, I think really good time savings).

So, the 3 primitives would be fmaf1(z) fmaf2(x,y) r=fmafr() -- ok with you? we could include fmaf(x,y,z) as well, as a convenience
function.
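
To make the proposed primitives concrete, here is a toy model in Python. It only models the stack discipline and counts host-to-APU loads and APU-to-host reads; it uses ordinary Python floats, not the Am9511 format, and the class and the helpers naive() and chained() are hypothetical names for this illustration. Note that a multiply followed by an add on the APU rounds twice, so this is not a truly fused operation.

```python
class ApuStackModel:
    """Toy operand stack that counts loads (host to APU) and reads (APU to host)."""

    def __init__(self):
        self.stack = []
        self.loads = 0
        self.reads = 0

    def push(self, value):
        self.stack.append(float(value))
        self.loads += 1

    def pop(self):
        self.reads += 1
        return self.stack.pop()

    def fmul(self):
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a * b)

    def fadd(self):
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(a + b)

def fmaf1(apu, z):
    apu.push(z)                 # load the accumulator z

def fmaf2(apu, x, y):
    apu.push(x)
    apu.push(y)
    apu.fmul()                  # x*y
    apu.fadd()                  # x*y + accumulator, left on the stack

def fmafr(apu):
    return apu.pop()            # read the result back and pop it

def fmaf(apu, x, y, z):
    fmaf1(apu, z)
    fmaf2(apu, x, y)
    return fmafr(apu)

def naive(pairs, z):
    """One full fmaf() per product: reload the running sum every time."""
    apu = ApuStackModel()
    for x, y in pairs:
        z = fmaf(apu, x, y, z)
    return z, apu.loads, apu.reads

def chained(pairs, z):
    """Load z once, keep the running sum on the APU stack."""
    apu = ApuStackModel()
    fmaf1(apu, z)
    for x, y in pairs:
        fmaf2(apu, x, y)
    return fmafr(apu), apu.loads, apu.reads

pairs = [(1.5, 2.0), (0.5, 4.0), (3.0, 1.0)]
print(naive(pairs, 0.25))
print(chained(pairs, 0.25))
```

In this model each separate fmaf() costs three loads and one read, while a chain costs one load up front, two per product and a single read at the end. The 16 and 10 load ops mentioned above presumably count the transfers somewhat differently.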

Fred

Phillip Stevens

Aug 11, 2021, 9:04:08 PM
to retro-comp
Ok! I can put in a function fmaf(x,y,z), and make sure BASCOM can use it.

Also, how about fmaf1(z) and fmaf2(x, y), fmaf1() will just load z, fmaf2(x,y) computes x*y + z, and leaves sum on apu stack. fmafr() returns top of stack as float, and pops it.

That would allow fast combined fmaf() operations. Indeed fmaf(x,y,z) itself would be fmaf1(z) fmaf2(x, y) z = fmafr();

...something along those lines. this should allow leaving intermediate results on-chip, and speed things up even more. Naive 16 load ops, proposed api brings it down to 10 load ops (and, I think really good time savings).

So, the 3 primitives would be fmaf1(z) fmaf2(x,y) r=fmafr() -- ok with you? we could include fmaf(x,y,z) as well, as a convenience function.

Yes that looks like a good outcome. I did something similar, but very janky and specific, for the planetary motion problem with multiple APU Modules. So I know intermediate results on-chip would be very useful.
 
It would be good to add a fma() like function into your BASCOM libraries. It would save two expensive APU stack push/pop cycles, and be quite helpful for some applications.

The application I have in mind is Horner’s method for calculation of polynomials. So it would be good to test the functions with looped and unrolled polynomials. With a good implementation of Horner’s, any transcendental function can be calculated optimally.
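
For reference, Horner's method is one multiply and one add per coefficient, which is why a multiply-add primitive suits it. A minimal sketch in Python; the sine coefficients below are plain Taylor terms used as placeholders, not LOLREMEZ output:

```python
import math

def horner(coeffs, x):
    """Evaluate a polynomial; coeffs run from the highest degree down to the constant."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c   # one multiply and one add per coefficient
    return acc

# Placeholder coefficients: x - x^3/6 + x^5/120 (Taylor series for sine).
SIN_TAYLOR = [1 / 120, 0.0, -1 / 6, 0.0, 1.0, 0.0]

x = 0.5
print(horner(SIN_TAYLOR, x), math.sin(x))
```

Each loop step is "multiply the running value by x, then add the next coefficient", so on the APU the running value could stay on the stack between steps.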

FWIW, I’ve used the LOLREMEZ tool to prepare coefficients for much of the things I’ve been doing. You can for example use it to get coefficients in double precision for any functions you want to build later.

Cheers, Phillip