Fast 'Hats Off' in Applesoft Basic

John Brooks

unread,

May 3, 2017, 10:45:30 PM5/3/17

to

5 LOMEM: 24576:PRINT "CALCULATING...":GOSUB 70:HGR:POKE 49234,0:HCOLOR= 3:H = 140:V = 87:F = 5:M = 16245:P = 16384
10 DIM N(57),U(57):FOR J = -E TO E:K = ABS (J):B = K * G + P:IF J > Z THEN U = U(K):N = N(K):GOTO 40
20 U = S(K) * F:N = SQR(M - U):U(K) = U:N(K) = N:X = K * R5:T = T(X):FOR I = Z TO N:D = S(I) + U:IF D >= T THEN FOR T = T TO D:X = X + O:T = T(X):NEXT
30 POKE B + I,G + Q(X):NEXT
40 C = H - J:D = V - J:FOR I = Z TO N:Y = D + PEEK (B + I) - G:IF Y < Z(C - I) THEN HPLOT C - I,Y:Z(C - I) = Y
50 IF Y < Z(C + I) THEN HPLOT C + I,Y:Z(C + I) = Y
60 NEXT:NEXT:PRINT CHR$ (7):END
70 I = 0:N = 0:Y = 0:B = 0:G = 128:C = 0:J = 0:H = 0:V = 0:D = 255:N = 279:Z = 0:O = 1:DIM Z(279):FOR I = Z TO N:Z(I) = D:NEXT
80 R5 = 3 * SQR(5):R3 = O / 3:DIM S(128),T(384):FOR I = O TO G:S(I) = I * I: NEXT:N = 384:FOR I = Z TO N:D = I * R3 + R3:T(I) = D * D:NEXT
90 DIM Q(384):E = 57:F = .4 * E:D = -2 * 3.141592654 / 512:C = G + G:FOR I = Z TO G:Y = I * D:Y = E * SIN(Y) + F * SIN(Y + Y + Y):Q(I)= Y:Q(C - I) = Y:Q(C + I) = - Y:NEXT:RETURN

Above is an optimized version of the 'Hats Off' program, written in 10 lines of Applesoft Basic.

Benchmark results from a physical Apple IIGS:
3m14s @ 2.8 MHz (194 seconds)
8m44s @ 1 MHz (524 seconds)

I noticed that the physical GS @ 2.8 MHz runs the program 2.7 times faster than at 1 MHz. The IIGS emulators I benchmarked on were slower, with their 2.8MHz setting running only 2.5 times faster than 1 MHz.

The likely explanation is that much of this program's execution time is in ROM math or graphics routines which run at a full 2.8MHz on a physical IIGS and have no DRAM refresh penalty.

Emulators are probably not accounting for IIGS DRAM refresh stalls correctly.

Fast Hats Off
List of major changes:
1) Per-column Z-Buffer for hidden-surface removal.
2) Faster sin(Y)+sin(3Y) evaluation time by using a sin table.
3) Reduced sin table calc from 512 to 128 iterations by utilizing sin symmetry.
4) Removed most sqr() calls by doing incremental square operations instead.
5) Optimized square operations (X*X) by using a lookup table.
6) Combined Basic statements to reduce line count and made most vars single-letter.
7) Moved loop code toward the start of the program and allocated inner-loop vars first.
8) Replaced constants with variables.
9) Separated calc and draw code. Calc runs only on the first half of the isolines.
10) Draws to $2000(HGR1) with isoline cache at $4000 and LOMEM at $6000.

'Hats Off' origin story:
http://j-b.livejournal.com/419883.html
https://groups.google.com/forum/?fromgroups#!topic/comp.sys.apple2/buTFvk-VnTo

Lines 40 & 50 should really be replaced with a 6502 fast draw routine.

-JB
@JBrooksBSI

Nick Westgate

unread,

May 3, 2017, 11:44:34 PM5/3/17

to

On Thursday, 4 May 2017 14:45:30 UTC+12, John Brooks wrote:
> Above is an optimized version of the 'Hats Off' program, written in 10 lines of Applesoft Basic.

Amazing.

Next top of the pile looks to be parsing again:
http://jamtronix.com/files/applesoft.html#PTRGET4

00B7- 2218320 = 0.831%
00B8- 2218320 = 0.831%
00B9- 2218320 = 0.831%
00C8- 2218320 = 0.831%
00BC- 2218320 = 0.831%
00BB- 2218320 = 0.831%
00BA- 2218320 = 0.831%
E05C- 2459618 = 0.921%
E05D- 2459618 = 0.921%
E05E- 2459618 = 0.921%
E062- 2459618 = 0.921%
E061- 2459618 = 0.921%
E063- 2459618 = 0.921%
E065- 2459618 = 0.921%
E066- 2459618 = 0.921%
E067- 2459618 = 0.921%
E068- 2459618 = 0.921%
E069- 2459618 = 0.921%
E05F- 2459618 = 0.921%
E05B- 2459618 = 0.921%

Cheers,
Nick.

John Brooks

unread,

May 4, 2017, 12:52:08 AM5/4/17

to

I compiled it with TASC and it ran in 2m36s (156 seconds) at 1 MHz, vs 524 seconds interpreted.

3.3x faster.

-JB
@JBrooksBSI

John Brooks

unread,

May 5, 2017, 1:20:55 AM5/5/17

to

The Applesoft micro-optimizations make it hard to read and modify the code.

Here's a different approach which has shorter lines, comments, visible constants, a cleaner structure, and integer variables typed using '%':

10 LOMEM: 24576: PRINT "CALCULATING...":R3 = 1 / 3:R5 = 3 * SQR (5)
15 DIM S%(128): FOR I = 1 TO 128:S%(I) = I * I: NEXT
20 DIM T%(384): FOR I = 0 TO 384:D = I * R3 + R3:T%(I) = D * D: NEXT
25 DIM Q%(384):E = 57:F = .4 * E:D = - 2 * 3.141592654 / 512
30 FOR I = 0 TO 128:Y = I * D:Y = E * SIN (Y) + F * SIN (Y + Y + Y)
35 Q%(I) = Y:Q%(256 - I) = Y:Q%(256 + I) = - Y: NEXT
40 DIM Z%(279): FOR I = 0 TO 279:Z%(I) = 255: NEXT :P% = 16384
49 REM --- Main loop ---
50 HGR : POKE 49234,0: HCOLOR= 3: DIM N%(57): FOR J = - E TO E
55 K% = ABS (J):B% = P% + K% * 128: IF J > 0 THEN N = N%(K%): GOTO 100
59 REM --- Calc one isoline ---
60 U% = S%(K%) * 5:N = SQR (16245 - U%):N%(K%) = N:X% = K% * R5
65 T = T%(X%): FOR I = 0 TO N:D% = S%(I) + U%
70 IF D% > = T THEN FOR T = T TO D%:X% = X% + 1:T = T%(X%): NEXT
75 POKE B% + I,128 + Q%(X%): NEXT
99 REM --- Draw one isoline ---
100 C% = 140 - J:D% = 87 - 128 - J
110 FOR I = 0 TO N:Y% = D% + PEEK (B%):B% = B% + 1
120 IF Y% < Z%(C% - I) THEN HPLOT C% - I,Y%:Z%(C% - I) = Y%
130 IF Y% < Z%(C% + I) THEN HPLOT C% + I,Y%:Z%(C% + I) = Y%
140 NEXT : NEXT : PRINT CHR$ (7)

This version can be run in either of two modes:
1) Interpreted using Applesoft for development
2) Compiled to a binary file for fast execution

The interpreted version runs about 1 minute slower than the micro-optimized version (9m49s vs 8m44s) on a 1 MHz Apple II.

When compiled using Beagle Compiler v3.2, Fast Hat runs in 1m21s on a 1 MHz Apple II. It's also smaller (518 bytes compiled vs 748 bytes for tokenized Applesoft).

The original Hats Off program took about 3 hours to run on a 1 MHz Atari (10,800 seconds).
The compiled Fast Hat takes 81 seconds on a 1MH Apple II, or about 130x faster.

-JB
@JBrooksBSI

barrym95838

unread,

May 5, 2017, 1:48:16 AM5/5/17

to

On Thursday, May 4, 2017 at 10:20:55 PM UTC-7, John Brooks wrote:
> ...

> The original Hats Off program took about 3 hours to run on a 1 MHz
> Atari (10,800 seconds).
> The compiled Fast Hat takes 81 seconds on a 1MH Apple II, or about
> 130x faster.
>

Excellent work, John. I'm not familiar with any members of the 8-bit
Atari clan that ran at 1 MHz, though ...

Mike B.

Michael J. Mahon

unread,

May 5, 2017, 2:52:19 PM5/5/17

to

The integer variables are a big win for a compiled version, but are a
net loss for interpreted Applesoft, since every integer variable is
converted to float before use and "fixed" to integer before any update.

In short, for interpreted Applesoft, integer variables are only a space
saving, and then only for arrays.

--

-michael

NadaNet 3.1 for Apple II parallel computing!
Home page: http://michaeljmahon.com

"The wastebasket is our most important design
tool--and it's seriously underused."

Michael J. Mahon

unread,

May 5, 2017, 5:44:24 PM5/5/17

to

On 5/4/2017 10:20 PM, John Brooks wrote:
>

While we're exploiting symmetry of the sin function, here's a slight
modification that further exploits the symmetry of the triple-frequency
sin, with a saving of 1/3 of the sin evaluations. I also
strength-reduced the "I * D" to a running sum, since accuracy is not an
issue, and skipped the tripling of ID in the short sin loop:

10 LOMEM: 24576: PRINT "CALCULATING...":R3 = 1 / 3:R5 = 3 * SQR (5)
15 DIM S%(128): FOR I = 1 TO 128:S%(I) = I * I: NEXT
20 DIM T%(384): FOR I = 0 TO 384:D = I * R3 + R3:T%(I) = D * D: NEXT
25 DIM Q%(384):E = 57:F = .4 * E:D = - 2 * 3.141592654 / 512

26 D3 = 3 * D:ID = 0: FOR I = 0 TO 43:Y = F * SIN (ID)
28 Q%(I) = Y:Q%(85 - I) = Y:Q%(85 + I) = - Y:ID = ID + D3: NEXT
30 ID = 0: FOR I = 0 TO 128:Y = E * SIN (ID) + Q%(I)
35 Q%(I) = Y:Q%(256 - I) = Y:Q%(256 + I) = - Y:ID = ID + D: NEXT

40 DIM Z%(279): FOR I = 0 TO 279:Z%(I) = 255: NEXT :P% = 16384
49 REM --- Main loop ---
50 HGR : POKE 49234,0: HCOLOR= 3: DIM N%(57): FOR J = - E TO E
55 K% = ABS (J):B% = P% + K% * 128: IF J > 0 THEN N = N%(K%): GOTO 100
59 REM --- Calc one isoline ---
60 U% = S%(K%) * 5:N = SQR (16245 - U%):N%(K%) = N:X% = K% * R5
65 T = T%(X%): FOR I = 0 TO N:D% = S%(I) + U%
70 IF D% > = T THEN FOR T = T TO D%:X% = X% + 1:T = T%(X%): NEXT
75 POKE B% + I,128 + Q%(X%): NEXT
99 REM --- Draw one isoline ---
100 C% = 140 - J:D% = 87 - 128 - J
110 FOR I = 0 TO N:Y% = D% + PEEK (B%):B% = B% + 1
120 IF Y% < Z%(C% - I) THEN HPLOT C% - I,Y%:Z%(C% - I) = Y%
130 IF Y% < Z%(C% + I) THEN HPLOT C% + I,Y%:Z%(C% + I) = Y%
140 NEXT : NEXT : PRINT CHR$ (7)

James Davis

unread,

May 5, 2017, 10:46:34 PM5/5/17

to

Hi Michael,

Nice! Still takes about 4 seconds to finish in AppleWin at highspeed, though.

James Davis

John Brooks

unread,

May 6, 2017, 12:42:07 AM5/6/17

to

On Friday, May 5, 2017 at 2:44:24 PM UTC-7, Michael J. Mahon wrote:

Nice optimization Michael!

I noticed a few missing pixels at the top rim and bottom brim of the hat, so I kept the first 128 calc as full floating point which fixed the missing pixels:

10 DIM S%(128),T%(384): FOR I = 1 TO 128:I% = I:S%(I) = I% * I%: NEXT
15 FOR I = 1 TO 385:D = I * R3:T%(I - 1) = D * D - 1: NEXT
20 DIM Y(128):Q% = 7807:E = 57:F = .4 * E:D = - 2 * 3.141592654 / 512
25 D3 = 3 * D:ID = 0: FOR I = 0 TO 43:Y = F * SIN (ID)
30 Y(I) = Y:Y(85 - I) = Y:Y(85 + I) = - Y:ID = ID + D3: NEXT
35 ID = 0: FOR I = 0 TO 128:Y% = E * SIN (ID) + Y(I):NY% = 128 - Y%
40 Y% = 128 + Y%: POKE Q% + I,Y%: POKE Q% + 256 - I,Y%
45 POKE Q% + 256 + I,NY%:ID = ID + D: NEXT
50 Z% = 640: FOR I = 0 TO 279: POKE Z% + I,255: NEXT :P% = 16384
55 DIM PB%(57): FOR I = 0 TO 57:PB%(I) = P%:P% = P% + 128: NEXT
60 HGR : POKE 49234,0: HCOLOR= 3: DIM N%(57):N% = 0
64 REM --- Main loop ---
65 FOR J = - E TO E:J% = J:K% = ABS (J%):B% = PB%(K%)
70 IF J% > 0 THEN N% = N%(K%): GOTO 100
74 REM --- Calc one isoline ---
75 U% = S%(K%) * 5:X% = K% * R5:T% = T%(X%)
80 IF U% + S%(N% + 1) < 16246 THEN N% = N% + 1: GOTO 80
85 N%(K%) = N%: FOR I = 0 TO N%:D% = S%(I) + U%
90 IF D% > T% THEN X% = X% + 1:T% = T%(X%): GOTO 90
95 POKE B% + I, PEEK (Q% + X%): NEXT

99 REM --- Draw one isoline ---

100 C% = Z% + 140 - J%:D% = 87 - J% - 128
110 FOR I = 0 TO N%:Y% = D% + PEEK (B% + I)
120 L% = C% - I: IF Y% < PEEK (L%) THEN HPLOT L% - Z%,Y%: POKE L%,Y%
130 R% = C% + I: IF Y% < PEEK (R%) THEN HPLOT R% - Z%,Y%: POKE R%,Y%

140 NEXT : NEXT : PRINT CHR$ (7)

This version improves all the inner loops:
1) Per-isoline sqr replaced with incremental-square calculation
2) Mul 128 in main loop replaced with incremental-add table build
3) Arrays of 16-bit ints replaced with byte-arrays via peek/poke (Zbuffer & point cache)
4) Isoline draw routine does fewer math ops by calculating left & right pos & reusing them

Execution time at 1 MHz with Beagle compiled program is down to 1 min, 1 sec!

Compiled size is 653 bytes vs 950 byte original.

I think Beagle Compiler should have been called Bird's Better Basic.

Using the GSplus emulator at full speed on my Macbook Pro, the compiled Fast Hat finishes in ~500ms (1/2 second).

-JB
@JBrooksBSI

David Schmidt

unread,

May 6, 2017, 12:52:00 AM5/6/17

to

On 5/6/2017 12:42 AM, John Brooks wrote:
> 90 IF D% > T% THEN X% = X% + 1:T% = T%(X%): GOTO 90

Hmmm, I get a bad subscript error in 90. Does it depend on any prior
version being in memory, perhaps?

John Brooks

unread,

May 6, 2017, 1:08:27 AM5/6/17

to

Whoops, the previous post was missing the first line, line #5. Here is the complete listing:

5 LOMEM: 24576: PRINT "CALCULATING...":R3 = 1 / 3:R5 = 3 * SQR (5)

10 DIM S%(128),T%(384): FOR I = 1 TO 128:I% = I:S%(I) = I% * I%: NEXT
15 FOR I = 1 TO 385:D = I * R3:T%(I - 1) = D * D - 1: NEXT
20 DIM Y(128):Q% = 7807:E = 57:F = .4 * E:D = - 2 * 3.141592654 / 512
25 D3 = 3 * D:ID = 0: FOR I = 0 TO 43:Y = F * SIN (ID)
30 Y(I) = Y:Y(85 - I) = Y:Y(85 + I) = - Y:ID = ID + D3: NEXT
35 ID = 0: FOR I = 0 TO 128:Y% = E * SIN (ID) + Y(I):NY% = 128 - Y%
40 Y% = 128 + Y%: POKE Q% + I,Y%: POKE Q% + 256 - I,Y%
45 POKE Q% + 256 + I,NY%:ID = ID + D: NEXT
50 Z% = 640: FOR I = 0 TO 279: POKE Z% + I,255: NEXT :P% = 16384
55 DIM PB%(57): FOR I = 0 TO 57:PB%(I) = P%:P% = P% + 128: NEXT
60 HGR : POKE 49234,0: HCOLOR= 3: DIM N%(57):N% = 0
64 REM --- Main loop ---
65 FOR J = - E TO E:J% = J:K% = ABS (J%):B% = PB%(K%)
70 IF J% > 0 THEN N% = N%(K%): GOTO 100
74 REM --- Calc one isoline ---
75 U% = S%(K%) * 5:X% = K% * R5:T% = T%(X%)
80 IF U% + S%(N% + 1) < 16246 THEN N% = N% + 1: GOTO 80
85 N%(K%) = N%: FOR I = 0 TO N%:D% = S%(I) + U%

90 IF D% > T% THEN X% = X% + 1:T% = T%(X%): GOTO 90

95 POKE B% + I, PEEK (Q% + X%): NEXT
99 REM --- Draw one isoline ---
100 C% = Z% + 140 - J%:D% = 87 - J% - 128
110 FOR I = 0 TO N%:Y% = D% + PEEK (B% + I)
120 L% = C% - I: IF Y% < PEEK (L%) THEN HPLOT L% - Z%,Y%: POKE L%,Y%
130 R% = C% + I: IF Y% < PEEK (R%) THEN HPLOT R% - Z%,Y%: POKE R%,Y%
140 NEXT : NEXT : PRINT CHR$ (7)

-JB
@JBrooksBSI

1time...@gmail.com

unread,

May 6, 2017, 4:03:26 PM5/6/17

to

> When compiled using Beagle Compiler v3.2, Fast Hat runs in 1m21s on a 1 MHz Apple II.

Is this using the Beagle Compiler's FAST.HPLOT patch? (See page 29 of the manual.)

John Brooks

unread,

May 6, 2017, 5:23:42 PM5/6/17

to

On Saturday, May 6, 2017 at 1:03:26 PM UTC-7, 1time...@gmail.com wrote:
> > When compiled using Beagle Compiler v3.2, Fast Hat runs in 1m21s on a 1 MHz Apple II.
>
> Is this using the Beagle Compiler's FAST.HPLOT patch? (See page 29 of the manual.)

Nope. I've been benchmarking the vanilla config of Beagle Compiler v3.2, without any of the extended memory options or the fast.hplot patch.

The fast.hplot patch does make Fast Hat run faster by about 1.3 seconds, but it's also uses quite a bit more memory and requires a patched runtime.

BTW, many of the public BeagleCompiler 3.2 images floating around (on Asimov, archive.org, etc) will crash because they are prodos sector order but have a .dsk suffix (DOS 3.3 sector order). Renaming the image to a .po will fix it. Thanks to Chris Osborn (@FozzTexx) for diagnosing this image problem and the fix.

-JB

James Davis

unread,

May 6, 2017, 8:22:26 PM5/6/17

to

Hi John,

On Saturday, May 6, 2017 at 2:23:42 PM UTC-7, John Brooks wrote:
> BTW, many of the public BeagleCompiler 3.2 images floating around (on Asimov, archive.org, etc) will crash because they are prodos sector order but have a .dsk suffix (DOS 3.3 sector order). Renaming the image to a .po will fix it. Thanks to Chris Osborn (@FozzTexx) for diagnosing this image problem and the fix.
>
> -JB

These are the BC disk images I have:

Beagle Compiler v1.0.dsk.PO
Beagle Compiler v2.2.dsk.DO
Beagle Compiler v3.2.dsk.PO
BEAGLE.COMPILER.PO

Does it matter if they are all renamed by appending PO/DO after DSK (as above)?
Or, should the .DSK be dropped entirely?

Also, I was understanding that DSK was generic for DO or PO, which are specific, and that DSK was non-specific.

James Davis

Steve Nickolas

unread,

May 6, 2017, 8:51:56 PM5/6/17

to

On Sat, 6 May 2017, James Davis wrote:

> Also, I was understanding that DSK was generic for DO or PO, which are
> specific, and that DSK was non-specific.

Well, most emulators assume that DSK is always DO.

-uso.

James Davis

unread,

May 6, 2017, 9:16:53 PM5/6/17

to

OK. But, CiderPress does not. Does AppleWin? . . . I wonder!

Next time a disk image crashes, I'll try to remember this.

James Davis

Steve Nickolas

unread,

May 6, 2017, 9:48:21 PM5/6/17

to

On Sat, 6 May 2017, James Davis wrote:

> On Saturday, May 6, 2017 at 5:51:56 PM UTC-7, Steve Nickolas wrote:
>> On Sat, 6 May 2017, James Davis wrote:
>>
>>> Also, I was understanding that DSK was generic for DO or PO, which are
>>> specific, and that DSK was non-specific.
>>
>> Well, most emulators assume that DSK is always DO.
>>
>> -uso.
>
> OK. But, CiderPress does not. Does AppleWin? . . . I wonder!
>
> Next time a disk image crashes, I'll try to remember this.
>
> James Davis
>

Historically, at least, it didn't. Not sure if that has changed.

-uso.

Michael J. Mahon

unread,

May 7, 2017, 2:12:44 AM5/7/17

to

Steve Nickolas <usot...@buric.co> wrote:
> On Sat, 6 May 2017, James Davis wrote:
>
>> On Saturday, May 6, 2017 at 5:51:56 PM UTC-7, Steve Nickolas wrote:
>>> On Sat, 6 May 2017, James Davis wrote:
>>>
>>>> Also, I was understanding that DSK was generic for DO or PO, which are
>>>> specific, and that DSK was non-specific.
>>>
>>> Well, most emulators assume that DSK is always DO.
>>>
>>> -uso.
>>
>> OK. But, CiderPress does not. Does AppleWin? . . . I wonder!
>>
>> Next time a disk image crashes, I'll try to remember this.
>>
>> James Davis
>>
>
> Historically, at least, it didn't. Not sure if that has changed.
>
> -uso.
>

In most cases involving quasi-normal DOS or ProDOS images, simple
heuristics can reliably determine whether an image is DOS order or ProDOS
order. And for historic reasons, DSK is best treated as generic, as
AppleWin and CiderPress treat them.

Emulators and image utilities that do not treat DSK images as generic are
placing an undue burden on their users.

--
-michael - NadaNet 3.1 and AppleCrate II: http://michaeljmahon.com

fadden

unread,

May 7, 2017, 11:34:21 AM5/7/17

to

On Saturday, May 6, 2017 at 11:12:44 PM UTC-7, Michael J. Mahon wrote:
> In most cases involving quasi-normal DOS or ProDOS images, simple
> heuristics can reliably determine whether an image is DOS order or ProDOS
> order. And for historic reasons, DSK is best treated as generic, as
> AppleWin and CiderPress treat them.

CiderPress actually starts by checking to see if ".dsk" is a DiskCopy 4.2 image. That association is considered "unreliable", so CP then walks through every other format that might work. .po/.do are handled the same way: it starts by trying to interpret it as what it claims to be, then if that doesn't work it tries other options. This means you can mis-label an image and it'll still work, but if you've got an odd image that manages to pass the tests for both .do and .po, CP will prefer the way you have it labeled.

(In contrast, ".shk" is considered "reliable"... if we can't open .shk as a ShrinkIt archive, we assume it's a corrupted archive.)

cf. https://github.com/fadden/ciderpress/blob/v402-a2/diskimg/DiskImg.cpp#L893