Micro-optimizations, indeed.
The big guns are avoiding redundant computation, particularly expensive
operations like trig functions and square roots.
"The fastest computation is the one avoided."
Among the more significant micro-optimizations is replacing frequently
referenced literal constants, particularly multi-digit ones, with short
variables.
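The win comes because the interpreter stores a numeric literal as its ASCII
text and has to re-convert it to floating point every time the line executes,
whereas a variable reference is just a table lookup. A trivial invented
example:

   100 FOR I = 1 TO 1000 : X = X + 3.14159 : NEXT

runs noticeably faster as

   10 P = 3.14159 : REM literal parsed exactly once, here
   100 FOR I = 1 TO 1000 : X = X + P : NEXT

ISTR the variable table is searched linearly in order of creation, so it also
pays to create the hottest variables first.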
ISTR that FOR pushes the loop-top address on a stack so NEXT can jump
straight back, whereas GOTO and GOSUB have to search the line-number chain
for their targets. So for FOR/NEXT loops there's nothing to gain by
positioning or line-number fiddling.
It would be interesting to profile this program to see what fraction of its
execution time is spent in various regions.
This doesn't have to be elaborate--if some region is taking 30% of the time,
then on average three out of every ten samples will land in that region.
Taking just 20
to 30 samples will instantly reveal the approximate proportion of time
spent in the BASIC interpreter, as opposed to the time spent in graphics or
trig ROM routines.
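(A back-of-the-envelope check, mine, nothing rigorous: with n samples, the
observed fraction for a region whose true share is p wobbles by about
sqrt(p*(1-p)/n). At p = 0.3 and n = 25 that's roughly 0.09, so a genuine 30%
hot spot will show up as anywhere from about 20% to 40% of the samples --
sloppy, but plenty good enough to find the big peaks.)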
Such profiling requires only a ROM memory map and an emulator with a
debugger option. For example, in AppleWin F7 enters the debugger, revealing
the PC address. Another F7 returns to normal execution. So pressing F7,
recording the PC, and pressing F7 again yields one sample; wait a few
seconds and repeat for the next.
Repeat this 20 to 30 times, then use the ROM map to see what each sample
was doing, and group the results. Presto--the truth about where the
execution time is going!
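A tally might come out looking something like this (numbers invented purely
for illustration):

   Region (from the ROM map)      Samples
   -----------------------------  -------
   Applesoft interpreter loop          11
   FP math / trig ROM routines          9
   Hi-res graphics routines             4
   Everything else                      1
                                  -------
                                       25

Eleven of 25 samples in the interpreter itself would say the tokenized BASIC
is the main cost; a pile in the trig routines would say go after the
redundant SIN/COS calls first.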
Find the top two or three "peaks" and see if any optimization is possible
to reduce them. Everything else is essentially "noise".
I once sped up a prototype compiler by a factor of three in about ten
minutes by using this approach--stopping the machine every few seconds
during a compile and writing the PC on the back of a punched card. (Now you
know approximately when this was!)
Fully three-quarters of the samples were in the lexical scanner, and a brief code
inspection revealed a glaring (but pretty) inefficiency that could be fixed
with two lines of code!
A slight variation on this technique that is more BASIC-oriented is to
sample the current line number instead of the PC, which can be
semi-automated in the AppleWin debugger by setting a memory "mini-window" on
the appropriate zero page bytes. (Of course, if there is a lot of
computation per line, this may not provide enough detail.)
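For the curious (this is from memory, so take it with salt): ISTR Applesoft
keeps the current line number at $75-$76 (117-118 decimal), low byte first,
so that's where to point the mini-window. A line can even be made to print
its own number as a sanity check:

   500 PRINT PEEK(117) + 256 * PEEK(118) : REM prints 500 if I have it right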
Optimization should be about getting the large improvements first, before
even considering fooling with trivialities. A 1% improvement that adds
complexity or hurts transparency is usually a bad tradeoff. (The exception
is where that 1% makes the difference between something being possible and
impossible. ;-)