ucow - unix cowgol->z80 compiler


aaron wohl

Nov 30, 2025, 11:02:31 PM
to retro-comp
https://github.com/avwohl/ucow a unix compiler for the full cowgol language.

Emphasis on optimization.  Needs testers.

Phase: Foundation (No Optimization)
Lexer & Parser
  • Full AST construction (not single-pass)
  • Type checking and validation
  • Symbol table with scope tracking
Basic Code Generation
  • Naive stack-based code gen
  • All variables in memory
  • All operations through A register
  • Function calls via CALL/RET
Testing Infrastructure

  # Compile
  python ucow.py test.cow -o test.mac
  # Assemble
  um80 test.mac -o test.rel
  # Link
  ul80 test.rel -o test.com
  # Run
  cpmemu test.com

Phase: Essential Optimizations
Constant Folding & Propagation

Evaluate constant expressions at compile time.

  # Before
  var x := 10 + 20;
  var y := x * 2;

  # After
  var x := 30;
  var y := 60;

Multi-pass propagation through the AST.
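A minimal Python sketch of folding plus propagation over straight-line assignments. The `Num`, `Var`, and `BinOp` node classes and the `fold` and `propagate` helpers are hypothetical names for illustration and are not taken from ucow's sources; the sketch also ignores integer width and wraparound.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def fold(expr, env):
    """Substitute known constants from env and evaluate constant subtrees."""
    if isinstance(expr, Var):
        return Num(env[expr.name]) if expr.name in env else expr
    if isinstance(expr, BinOp):
        left, right = fold(expr.left, env), fold(expr.right, env)
        if isinstance(left, Num) and isinstance(right, Num):
            return Num(OPS[expr.op](left.value, right.value))
        return BinOp(expr.op, left, right)
    return expr


def propagate(assignments):
    """assignments: list of (name, expr) in program order (no branches)."""
    env = {}
    out = []
    for name, expr in assignments:
        folded = fold(expr, env)
        if isinstance(folded, Num):
            env[name] = folded.value      # value is now a known constant
        else:
            env.pop(name, None)           # value is no longer known
        out.append((name, folded))
    return out


program = [
    ("x", BinOp("+", Num(10), Num(20))),
    ("y", BinOp("*", Var("x"), Num(2))),
]
result = propagate(program)   # x becomes Num(30), y becomes Num(60)
```

A real pass would also have to respect the declared type of each variable, for example wrapping a uint8 result modulo 256.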

Dead Code Elimination

Build CFG, compute reachability, remove unreachable code.
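The reachability step can be sketched in a few lines of Python. The CFG here is assumed to be a plain dict from block name to successor names; the block names and both function names are made up for illustration.

```python
def reachable_blocks(cfg, entry):
    """Depth-first walk from the entry block; returns the set of reachable blocks."""
    seen = set()
    stack = [entry]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(cfg.get(block, ()))
    return seen


def remove_unreachable(cfg, entry):
    """Drop every block that no path from entry can reach."""
    live = reachable_blocks(cfg, entry)
    return {b: [s for s in succs if s in live]
            for b, succs in cfg.items() if b in live}


# "dead_print" follows an unconditional return, so nothing jumps to it.
cfg = {"entry": ["ret"], "ret": [], "dead_print": ["ret"]}
pruned = remove_unreachable(cfg, "entry")
```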

  # Before
  sub Foo() is
      return;
      print("never");  # dead
  end sub;

  # After
  sub Foo() is
      return;
  end sub;

Common Subexpression Elimination (CSE)

Track available expressions, reuse computed values.

  # Before
  var a := x + y;
  var b := x + y;  # redundant

  # After
  var a := x + y;
  var b := a;

For 8080, this is critical - arithmetic is expensive.
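A sketch of CSE within a single basic block, assuming the code has already been lowered to hypothetical `(dest, op, lhs, rhs)` tuples. It tracks which expressions are still available and forgets any expression whose operand or holder is overwritten. Commutativity (`x + y` versus `y + x`) is deliberately ignored.

```python
def local_cse(block):
    """Replace a repeated expression with a copy from the variable that holds it."""
    available = {}   # (op, lhs, rhs) -> variable currently holding that value
    out = []
    for dest, op, lhs, rhs in block:
        key = (op, lhs, rhs)
        holder = available.get(key) if op != "copy" else None
        if holder is not None:
            out.append((dest, "copy", holder, None))
        else:
            out.append((dest, op, lhs, rhs))
        # Writing dest invalidates expressions that read dest or are held in dest.
        available = {k: v for k, v in available.items()
                     if dest not in (k[1], k[2]) and v != dest}
        if holder is None and op != "copy" and dest not in (lhs, rhs):
            available[key] = dest
    return out


block = [("a", "+", "x", "y"), ("b", "+", "x", "y")]
optimized = local_cse(block)   # second tuple becomes ("b", "copy", "a", None)
```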

Strength Reduction

Replace expensive operations with cheaper ones.


| Original | Replacement | Savings    |
|----------|-------------|------------|
| x * 2    | x + x       | MUL→ADD    |
| x * 4    | x << 2      | MUL→shifts |
| x * 2^n  | x << n      | MUL→shifts |
| x / 2    | x >> 1      | DIV→shift  |
| x % 2    | x & 1       | MOD→AND    |

On 8080 without MUL/DIV instructions, this is huge.
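The table can be expressed as a small rewrite function. This is a sketch only: the `(op, operand, constant)` tuple form and the name `reduce_strength` are assumptions, and the division and modulo rewrites are valid for unsigned operands only (a signed `x / 2` rounds differently from an arithmetic shift for negative `x`).

```python
def reduce_strength(op, operand, constant):
    """Rewrite `operand op constant` when constant is a power of two (>= 2)."""
    is_pow2 = constant >= 2 and constant & (constant - 1) == 0
    if not is_pow2:
        return (op, operand, constant)
    shift = constant.bit_length() - 1      # log2 of the constant
    if op == "*":
        # x * 2 is a single add; larger powers become shifts
        return ("+", operand, operand) if constant == 2 else ("<<", operand, shift)
    if op == "/":
        return (">>", operand, shift)      # unsigned only
    if op == "%":
        return ("&", operand, constant - 1)  # unsigned only
    return (op, operand, constant)


rewrites = [
    reduce_strength("*", "x", 2),
    reduce_strength("*", "x", 4),
    reduce_strength("/", "x", 2),
    reduce_strength("%", "x", 2),
    reduce_strength("*", "x", 3),   # not a power of two: left alone
]
```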


Phase: Control Flow Optimizations
Branch Optimization

  ; Before
        JZ   SKIP
        JMP  NEXT
  SKIP: ...
  NEXT:

  ; After
        JNZ  NEXT
        ...
  NEXT:

  • Eliminate jumps to jumps
  • Invert conditions to remove unconditional jumps
  • Thread jumps through empty blocks
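Eliminating jumps to jumps amounts to following each chain of trivial blocks to its final destination. A sketch, assuming a hypothetical map from each label whose block is nothing but `JMP target` to that target:

```python
def thread_jumps(jump_targets):
    """Map every jump-only label to the label the chain finally lands on."""
    resolved = {}
    for label in jump_targets:
        seen = {label}
        dest = jump_targets[label]
        # Follow the chain, stopping if it loops back on itself.
        while dest in jump_targets and dest not in seen:
            seen.add(dest)
            dest = jump_targets[dest]
        resolved[label] = dest
    return resolved


chain = {"L1": "L2", "L2": "L3"}      # L1: JMP L2   L2: JMP L3
final = thread_jumps(chain)           # every jump to L1 or L2 can go straight to L3
```

Any branch that targets a key of `final` can then be retargeted, after which the jump-only blocks become unreachable and fall to dead code elimination.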
Loop-Invariant Code Motion

Move computations that don't change inside a loop to outside.

  # Before
  while i < n loop
      var offset := base + stride;  # invariant
      arr[i] := arr[i] + offset;
      i := i + 1;
  end loop;

  # After
  var offset := base + stride;
  while i < n loop
      arr[i] := arr[i] + offset;
      i := i + 1;
  end loop;

Loop Unrolling (Small Loops)

For loops with known small iteration counts (2-8), unroll completely.

  # Before
  var i: uint8 := 0;
  while i < 4 loop
      arr[i] := 0;
      i := i + 1;
  end loop;

  # After
  arr[0] := 0;
  arr[1] := 0;
  arr[2] := 0;
  arr[3] := 0;

Eliminates loop overhead (compare, branch, increment).


Phase: Register Allocation
Live Variable Analysis

Backward dataflow to determine which variables are live at each point.
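The standard backward equations are live_out(B) = union of live_in over B's successors, and live_in(B) = use(B) ∪ (live_out(B) − def(B)), iterated until nothing changes. A sketch in Python, with hypothetical block names and a dict-based CFG:

```python
def liveness(blocks, successors):
    """blocks: name -> (use set, def set); successors: name -> list of names.
    Returns (live_in, live_out), iterated to a fixed point."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set()
            for s in successors.get(b, []):
                out |= live_in[s]
            new_in = use | (out - defs)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# i := 0; while i < n loop i := i + 1; end loop;
blocks = {
    "init": (set(), {"i"}),
    "test": ({"i", "n"}, set()),
    "body": ({"i"}, {"i"}),
    "exit": (set(), set()),
}
successors = {"init": ["test"], "test": ["body", "exit"], "body": ["test"]}
live_in, live_out = liveness(blocks, successors)
```

Here `n` is live on entry to `init` (it is read before any write), while `i` is not, because `init` defines it.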

Register Allocation

The biggest single optimization for 8080.

Strategy: Linear Scan (simpler than graph coloring)

  1. Compute live ranges for all variables
  2. Sort by start position
  3. Allocate registers greedily, spill to memory when needed
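The three steps above can be sketched as follows. This is a generic linear scan, not ucow's allocator: the interval format, the register pool, and the "spill whichever interval ends last" choice are assumptions for illustration.

```python
def linear_scan(intervals, registers):
    """intervals: name -> (start, end), inclusive. Returns name -> register or "spill"."""
    assignment = {}
    active = []                 # (end, name) for intervals that hold a register
    free = list(registers)
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release registers whose interval ended before this one starts.
        for old_end, old_name in list(active):
            if old_end < start:
                active.remove((old_end, old_name))
                free.append(assignment[old_name])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
        else:
            # No register left: spill whichever live interval ends last.
            active.sort()
            last_end, last_name = active[-1]
            if last_end > end:
                assignment[name] = assignment[last_name]
                assignment[last_name] = "spill"
                active[-1] = (end, name)
            else:
                assignment[name] = "spill"
    return assignment


intervals = {"p": (0, 9), "i": (1, 4), "t": (2, 3), "u": (5, 8)}
alloc = linear_scan(intervals, ["HL", "BC"])
```

With only two register pairs, the long-lived `p` is the one that gets spilled once `t` needs a register; a real allocator would also weigh the preferences listed below (pointers in HL, counters in BC).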

8080 Register Priorities:

  HL - pointer operations, array access
  DE - secondary pointer, 16-bit values
  BC - loop counters, 16-bit values
  A  - arithmetic (implicit, always available)

Allocation Rules:

  • Pointers → HL preferred
  • Loop counters → BC preferred
  • 16-bit arithmetic → DE or BC
  • 8-bit temps → B, C, D, E
Register Coalescing

Eliminate unnecessary MOVs by giving source and dest the same register.

  ; Before
  MOV A, B
  MOV C, A    ; A is now dead

  ; After (allocate B=C)
  ; eliminated entirely

Phase: Memory & Data Optimizations
Static Variable Overlay

Cowgol already does this conservatively. With full call-graph analysis, we can do it optimally.

  sub A() calls B(), C()
  sub B() uses x, y
  sub C() uses z, w
  # x,y and z,w can share memory (B and C never concurrent)

Workspace Minimization

Nested subroutines share workspace. Minimize total workspace by optimal variable packing.
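One way to sketch the overlay: because recursion is forbidden the call graph is acyclic, so each subroutine's locals can be placed just above the deepest chain of callers that can be active when it runs. The function name, the dict formats, and the byte sizes below are hypothetical.

```python
def workspace_offsets(calls, sizes):
    """calls: sub -> list of callees (acyclic). sizes: sub -> bytes of locals.
    Returns (offsets, total_bytes)."""
    offsets = {}

    def place(sub, base):
        # A sub must sit above every caller's frame, so keep the deepest base seen.
        if offsets.get(sub, -1) >= base:
            return
        offsets[sub] = base
        for callee in calls.get(sub, []):
            place(callee, base + sizes[sub])

    callees = {c for cs in calls.values() for c in cs}
    for root in sizes:
        if root not in callees:
            place(root, 0)
    total = max((offsets[s] + sizes[s] for s in offsets), default=0)
    return offsets, total


calls = {"A": ["B", "C"]}
sizes = {"A": 1, "B": 2, "C": 2}       # made-up sizes
offsets, total = workspace_offsets(calls, sizes)
```

B and C land at the same offset because they are never active at the same time, so the program needs 3 bytes of workspace instead of 5.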

Constant Pooling

Share identical constants in memory.

  print("Error");
  print("Error");  # share the string

Phase: Inlining
Subroutine Inlining

Since Cowgol forbids recursion, we can inline aggressively.

Inline Criteria:

  • Small subroutines (< 20 instructions)
  • Called once
  • Called in hot loops
  • Leaf functions (no calls)

Benefits:

  • Eliminates CALL/RET overhead (27 cycles)
  • Exposes more optimization opportunities (CSE, constant prop)
  • Enables register allocation across call sites
Interface Devirtualization

When an interface variable has only one possible implementation, inline it.

  interface Cmp(a: uint8, b: uint8): (r: int8);
  sub NumCmp implements Cmp is
      ...
  end sub;
  var cmp: Cmp := NumCmp;
  # If NumCmp is the only implementation, inline calls to cmp()

Phase: Peephole Optimization
Pattern Matching on Assembly

After code generation, pattern-match and replace.


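| Pattern            | Replacement                                              |
|--------------------|----------------------------------------------------------|
| PUSH r; POP r      | (delete)                                                 |
| MOV A,r; MOV r,A   | (delete second)                                          |
| LDA addr; STA addr | (delete STA)                                             |
| JMP L; L:          | (delete JMP)                                             |
| MVI A,0            | XRA A (smaller and faster, but also changes the flags)   |
| INR A; DCR A       | (delete both)                                            |
| A * 2              | ADD A (not RLC: RLC rotates bit 7 into bit 0)            |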
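A sketch of how a few of these patterns could be matched, assuming instructions arrive as normalized strings such as "MOV A,B" (one space after the mnemonic, no space after the comma). The function is hypothetical and covers only four of the rows; deleting INR A; DCR A is only safe when the flags they set are not used afterwards.

```python
import re


def peephole(lines):
    """One left-to-right pass; each new instruction is matched against the last kept one."""
    out = []
    for ins in lines:
        prev = out[-1] if out else ""
        push = re.fullmatch(r"PUSH (\w+)", prev)
        mov = re.fullmatch(r"MOV A,(\w)", prev)
        jmp = re.fullmatch(r"JMP (\w+)", prev)
        if push and ins == f"POP {push.group(1)}":
            out.pop()                       # PUSH r; POP r   -> nothing
        elif mov and ins == f"MOV {mov.group(1)},A":
            pass                            # MOV A,r; MOV r,A -> drop the second
        elif prev == "INR A" and ins == "DCR A":
            out.pop()                       # INR A; DCR A    -> nothing (flags assumed dead)
        elif jmp and ins == f"{jmp.group(1)}:":
            out[-1] = ins                   # JMP L; L:       -> keep only the label
        else:
            out.append(ins)
    return out


code = ["PUSH H", "POP H", "MOV A,B", "MOV B,A", "JMP L1", "L1:", "INR A", "DCR A", "RET"]
cleaned = peephole(code)
```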
Instruction Selection

Choose optimal instructions during code gen.

  ; Load 0 into A
  MVI A, 0    ; 7 cycles, 2 bytes
  XRA A       ; 4 cycles, 1 byte ← better

  ; Compare A to 0
  CPI 0       ; 7 cycles, 2 bytes
  ORA A       ; 4 cycles, 1 byte ← better (sets Z flag)

  ; 16-bit increment
  INR L       ; need to handle carry
  ; vs
  INX H       ; 5 cycles, 1 byte ← better

Phase: Advanced Optimizations (Future)
SSA Form

Convert to Static Single Assignment for:

  • Better constant propagation
  • Better dead code elimination
  • Cleaner dataflow analysis
Instruction Scheduling

Reorder instructions to minimize stalls (less critical on 8080 than modern CPUs).

Profile-Guided Optimization

Run program, collect execution counts, optimize hot paths.

ladislau szilagyi

Dec 1, 2025, 4:49:19 AM
to retro-comp
Hi,
from your text, it seems that the target is the 8080, not Z80. Is this correct?
Ladislau

aaron wohl

Dec 1, 2025, 10:44:22 AM
to retro-comp
It is currently set to z80 target only.  I guess it could be made a switch if there is a need for 8080.

ladislau szilagyi

Dec 1, 2025, 11:01:32 AM
to retro-comp
Ok, understood.

But your initial message contains only standard 8080 instruction names, not Z80 (MOV instead of LD, XRA instead of XOR, MVI instead of LD, ...)

Is this because of your custom UM80 assembler?

Ladislau

aaron wohl

Dec 1, 2025, 9:18:44 PM
to retro-comp
It started out emitting 8080 mnemonics.  When I got to the later phases of optimization, I switched to Z80 to use JR and other Z80 features.  The um80 assembler follows the m80.com manual.  For the Z80 it uses the .z80 pseudo-op.  The code produced should assemble byte for byte the same as with M80.com.  The assembler, linker, and library manager follow the Microsoft manuals and are compatible in both input and output data.  They just run on Linux and don't run out of memory.  Sources for um80 and friends: https://github.com/avwohl/um80_and_friends

ladislau szilagyi

Dec 1, 2025, 11:57:45 PM
to retro-comp
Thanks Aaron for the details,

just another question: the original David Given's Cowgol toolchain includes also a Cowgol linker ( named Cowlink ), capable to "link" several cowgol object files ( .coo ) to build an assembler file, which can be then assembled & linked to produce the final .COM executable.
This allows to build a project using several Cowgol source files, using in each of these source files the "extern" Cowgol feature to define/reference subroutines which may be called in another file.

I suppose your approach can work only with a single Cowgol source file, right?

I ask this because I also built a Cowgol development environment, which runs directly on CP/M ( https://github.com/Laci1953/Cowgol_on_CP_M )

However, for compatibility I did not choose the Microsoft tools, but instead the HiTech C compiler's toolchain (and, of course, HiTech's object code format).

This choice allowed me to offer to the user the possibility to mix Cowgol, C and Z80 assembler source files into a single project, on CP/M, directly from a CP/M terminal command line call.

Anyway, I will try to install your project and test it...
Do you think it will run on a Windows machine under Cygwin?

Thanks,
Ladislau

aaron wohl

Dec 2, 2025, 3:45:19 AM
to retro-comp
ucow emits m80 assembler code. I assemble it with um80 (a Unix-based m80 equivalent).  It should be fine with separate compilation.  The ul80 linker can stitch together .rel files.  If you have any problem with multiple modules, let me know.

As for Cygwin, no, that would not be suitable; this project is entirely in Python.  If you install a Windows Python with pip, then pip should be able to install everything and run on Windows.  I didn't try it on Windows though, so let me know for the docs if there are any nits or if you need something to make it work. The project has been published to the PyPI registry, so any working Python system should be able to do 'pip install mbasic' and get a working system.  See https://pypi.org/project/mbasic/

aaron wohl

Dec 2, 2025, 3:56:57 AM
to retro-comp
Oops, sorry, I got the wrong project name. I'm doing multiple projects at once here.  for cowgol the project is ucow.
You install with pip install ucow.
The PyPi registry is:

ladislau szilagyi

Dec 2, 2025, 4:48:34 AM
to retro-comp
Hi Aaron,

I still have problems understanding how ucow can build a project from more than one cowgol source file.

Let's take an example: we have a.cow and b.cow

Source file #1 : a.cow
@decl sub x(): (ret: int16) @extern("x");
sub sub1() is
    var z: int16 := x(); # call external subroutine
    ...
end sub;

Source file #2 : b.cow
# define external subroutine
sub x(): (ret: int16) @extern("x") is
ret := 1;
end sub;

In a "classic" cowgol environment, to build a project using a.cow + b.cow, the execution flow is:

cowfe a.cow a.cos
cowbe a.cos a.coo
cowfe b.cow b.cos
cowbe b.cos b.coo
cowlink cowgol.coo a.coo b.coo -o project.asm
then assemble & link project.asm to produce project.com

You suggest that this is possible, in ucow, by using:

ucow a.cow --> um80 --> a.rel
ucow b.cow --> um80 --> b.rel
then ul80 a.rel b.rel --> project.com

But... how can you set, this way, the correct workspace offsets for the local variables in both a.cow and b.cow?

In the "classic" approach, cowlink has the task to analyse the full call tree, in order to assign the appropriate workspace offsets for all the local variables from all the involved subroutines.

thanks,
Ladislau

ladislau szilagyi

Dec 2, 2025, 2:28:51 PM
to retro-comp
But, Aaron, 

when I asked you about building a project from, let's say, 2 cowgol source files, I was referring to the following situation:

Source 1 a.cow
@decl sub x(): (ret: int16) @extern("x");

sub sub1() is
var z: int16 := x();
...
end sub;

Source 2 b.cow
sub x(): (ret: int16) @extern("x") is
var i: uint8;
var j: uint32;
...
ret := 1;
end sub;

cowfe a.cow a.cos
cowbe a.cos a.coo
cowfe b.cow b.cos
cowbe b.cos b.coo
cowlink cowgol.coo a.coo b.coo -o tmp.asm
... then assemble & link tmp.asm --> tmp.com

Therefore, cowlink must be fed both a.coo and b.coo, so that it can analyse the call tree and then assign the proper ws (workspace) offsets to the local variables of all the subroutines from a.cow & b.cow.

What you say in ucow is to compile separately a.cow --> a.rel , then b.cow --> b.rel , then link them with ul80.
But, this way, how are the ws offsets correctly set for the local variables? 
You are skipping cowlink... I do not understand how this could be possible.

Ladislau


On Tuesday, December 2, 2025 at 10:56:57 UTC+2, aaw...@gmail.com wrote:
Message has been deleted

aaron wohl

Dec 2, 2025, 10:01:58 PM
to retro-comp
CLAUDE.ai wrote the code. This is its answer to your question:
 Here's the answer to Ladislau's question:

  Q: How does ucow handle workspace offsets for local variables without
  cowlink?

  A: ucow uses a fundamentally different approach than the classic Cowgol
  toolchain:

  1. Static per-module allocation instead of global workspace analysis. Each
  module allocates its own local variables in its DSEG (data segment). The
  linker (ul80) combines these segments, giving each module its own space.
  2. No call-tree analysis needed. Since each module's variables are separate,
   there's no need for cowlink-style workspace offset coordination.

  How to build a multi-file project:

  # Compile main module (with runtime and main entry point)
  ucow a.cow -o a.mac

  # Compile library modules with -L flag (no runtime, no main)
  ucow -L b.cow -o b.mac

  # Assemble both
  um80 a.mac
  um80 b.mac

  # Link together
  ul80 a.rel b.rel -o project.com

  Example files:

  a.cow (main module):
  # Declare external subroutine (no body)

  @decl sub x(): (ret: int16) @extern("x");

  var answer: int16 := x();  # Calls x from b.cow

  b.cow (library module, compiled with -L):

  sub x(): (ret: int16) @extern("x") is
      var i: uint8 := 5;
      var j: uint16 := 100;
      ret := (j as int16) + (i as int16);
  end sub;

  The key is the -L (library) flag which tells ucow to:
  - Not emit JP _main or _main: label
  - Not include the runtime (only the main module needs it)
  - Just generate the subroutine code with PUBLIC for exported symbols

  The @extern("x") attribute makes the symbol linkable across modules via
  PUBLIC/EXTRN directives.

ladislau szilagyi

Dec 2, 2025, 11:46:46 PM
to retro-comp
Ok, understood.
But, in this case, I do not see the advantage of your intensive code optimization strategy.
You gain space in the code segment, but lose space in the data segment.
Of course, there is the execution speed gain, but we are talking about execution under Unix, therefore that's insignificant.
Now, let's try to imagine moving to a real Z80 CP/M machine your cowgol project final executable.
You may run into trouble, for certain large projects, because I suspect that the size of the executable will be too large.
Have you tried to build Cowfe, starting from sources, using ucow? What's the size of the final executable (cowfe.com)?
Compare it against the original...
The size here is critical, because the remaining TPA is used by Cowfe to allocate various structures needed to lex & parse...

thanks,
Ladislau

aaron wohl

Dec 3, 2025, 11:01:34 AM
to retro-comp
Further optimizations are possible.  ucow was a one-day project.  I have a background in compiler construction and used CLAUDE code to write it.  If the data duplication is an issue it can be addressed.   I tried to build the z80 cowfe and be but didn't see anything for the current cow.  Is there a recent cowfe/be for z80?

aaron wohl

Dec 3, 2025, 5:35:20 PM
to retro-comp
On compiling multiple modules into one program: I extended ucow to accept any number of .cow files for one build.  It does the proper call graph analysis to see which local variables are active when and which can share the same space.  For the big test program that reduced the data usage by 51%, as you suggested.
  New features:
  - Multi-file compilation with workspace optimization (call graph analysis to share local variable space)
  - Register calling convention for faster procedure calls
  - Two-argument register passing

  Optimizations:
  - Post-assembly optimizer (JP→JR conversion, dead code elimination)
  - Dead store elimination
  - Address folding
  - Print call combining (print_i16_nl, print_a_nl, print_de_nl)
  - Tail merging
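  For readers unfamiliar with the JP→JR conversion: on the Z80, JP is 3 bytes with an absolute address, while JR is 2 bytes with a signed 8-bit displacement measured from the byte after the JR, and JR only exists unconditionally and for the NZ, Z, NC and C conditions. A sketch of the eligibility check follows; the function name and argument layout are hypothetical and not taken from ucow.

```python
JR_CONDITIONS = {None, "NZ", "Z", "NC", "C"}   # None means unconditional


def can_use_jr(jp_address, target_address, condition=None):
    """True if a JP located at jp_address could be replaced by the 2-byte JR."""
    if condition not in JR_CONDITIONS:
        return False                               # no JR form for PO, PE, P, M
    displacement = target_address - (jp_address + 2)   # relative to the byte after JR
    return -128 <= displacement <= 127


forward_ok = can_use_jr(0x0100, 0x0181)        # displacement +127
forward_far = can_use_jr(0x0100, 0x0182)       # displacement +128, too far
backward_ok = can_use_jr(0x0100, 0x0082)       # displacement -128
parity = can_use_jr(0x0100, 0x0110, "PE")      # JR PE does not exist
```

  Each conversion shrinks the code by one byte and so moves later addresses, which means a real pass has to recompute displacements and repeat until nothing changes.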

  Other:
  - Git: Committed and pushed with tag v0.3.0
  - PyPI: Published at https://pypi.org/project/ucow/0.3.0/

  Users can now install with:
  pip install ucow==0.3.0

  - Comprehensive test suite (102 tests)
  - Extended runtime library

ladislau szilagyi

Dec 3, 2025, 10:57:04 PM
to retro-comp
Hi Aaron,

I'm amazed at how quickly you managed to integrate my suggestions!

Well done, Cowgol has now a new and very powerful development environment...

I tried myself to implement code optimization in my CP/M hosted Cowgol development environment, but, because of the inherent limitations, I was able to cover only part of your impressive achievements...

As a small comment here, I was constrained to do post-processing optimization, by reading the assembler code produced by Cowlink and performing a 3-step analysis of the code. What I obtained (dead code elimination, JP-->JP-->JP shortcuts, conditional JP optimizations,...) was all that could be obtained, but still I managed to gain 10-15% space.

Once again, congratulations, Aaron, I will try in the next weeks to test ucow. 
I'm not an expert in Unix, I'm not familiar with Python, therefore I might come up with some questions regarding the install/setup step, but I will do my best...

thanks,
Ladislau

aaron wohl

Dec 6, 2025, 10:18:34 AM
to retro-comp
Ladislau, you mention an interest in the optimizer.  I started an Ada compiler.
I factored the optimizer out to be a separate library.  The PLM and Ada compilers both use it now.
Here are the repository names:
 Project Statistics

  | Project | Lines of Code        | Status         | Repository     |
  |---------|----------------------|----------------|----------------|
  | upeep80 | ~5,000 (optimizers)  | ✅ Complete     | avwohl/upeep80 |
  | uplm80  | ~15,000 (refactored) | ✅ Updated      | avwohl/uplm80  |
  | uada80  | ~3,135 (front-end)   | 🔄 In Progress | avwohl/uada80  |

ladislau szilagyi

Dec 6, 2025, 11:02:52 PM
to retro-comp
Thanks Aaron,

I'll take a look...

Ladislau