Text-to-text compiling

James Harris

unread,

Jul 21, 2017, 2:38:43 AM7/21/17

to

I recently looked at bootstrapping a compiler to get from source to
assembly. And I thought the first step would be to parse the source to a
tree. After all, that's what compilers do, right? But I am beginning to
think that the compilation could go /directly/ from source text to
assembly text, with no intervening stages whatsoever. In other words,
the compiler would read the source phrase by phrase, writing assembly
output to stdout as it goes. If it encounters an error in the source
then it would write a message to stderr and stop compiling. (The stdout
output would exist but be incomplete and could include a spoiler to
ensure that it could not assemble.)

Much of this single-pass, text-only translation would be made possible
because the assembler will resolve references to labels. So the compiler
can do a lot simply by using consistent label names.

If I am right then, with the exception of expressions, the compiler can
do its thing simply by translating source fragments as it sees them. For
example:

If the compiler encounters "while (C)" it gets a unique number for the
loop (which I'll call N) and writes the following assembly

loop_N_top:
{evaluate C into EAX}
cmp eax, 0
je loop_N_end

After translating the loop body it then writes

jmp loop_N_top
loop_N_end:

Aside from evaluating the loop condition, which I'll go into below,
that's really simple. To be able to translate a "break" out of the loop
the compiler would need to know which loop we are in. That could be done
by having a small stack of loop numbers. Then "break" would pick the
number, N, on top of the stack and simply compile to

jmp loop_N_end

Similar translations could be effected for for-loops and if-statements -
simply write out the assembly with the right labels as we see the
source. But what about evaluating expressions? They cannot be translated
in order as most languages apply precedence rules which can mean that
later parts must be evaluated first, e.g. the * in

A + B * C

For such expressions, recursive calls are needed. And one needs a
consistent approach to results. In this case I'll use the convention
that the results of subexpressions will be returned in EAX. Something
more flexible would be needed if we allowed data types wider than
registers but EAX will do to illustrate for now.

Expressions have the basic form

E1 [ op E2 ]

where [ op E2 ] means that op E2 is not mandatory. E1 and E2 are
subexpressions. To translate such an expression into EAX needs

{evaluate E1 into EAX}
push eax
{evaluate E2 into EAX}
pop ebx
{op} ebx, eax
mov eax, ebx

The operation {op} would thereby return its result in EAX, ready for
input to any superexpression which called this one. Of course, that
translation naturally leads to subexpressions needing to be evaluated by
recursion. The recursion will stop when getting to a constant or a
variable so that these two examples

25
V

would translate to

mov eax, 25
mov eax, [V]

A unary operator such as pointer dereference is also easy. *V would end
up as

{evaluate V into EAX}
mov eax, [eax]

Stuff like that should deal with expressions but not function calls or
assignments. I'll post about function calls separately as they are more
substantial - and more interesting - but for assignments I think a
/separate/ set of evaluators is needed; they need to return the location
of the result rather than its value. Some examples,

A (i.e. assign to A)

lea eax, [A]

*A (i.e. assign to what A points to)

{evaluate location of A into EAX}
mov eax, [eax]

A[E3] (i.e. assign to array element E3)

{evaluate E3 into EAX}
add eax, A

As for the assignment itself, to compile E4 = E5 (assign to E4) and
assuming integer sizes,

{evaluate the location of E4 into EAX}
push eax
{evaluate E5 into EAX}
pop ebx
mov [ebx], eax

Am I missing something or can compiling really be this easy...? :-)

Function calls are more of a challenge but I have worked out a way that
I think will work. I'll post about it separately.

--
James Harris

Pascal J. Bourguignon

unread,

Jul 21, 2017, 4:20:51 AM7/21/17

to

James Harris <james.h...@gmail.com> writes:

> Am I missing something or can compiling really be this easy...? :-)

It can be this easy.
But if it is, why do you need a compiler?
You could use just a macro-assembler.

What you're missing is that usually there's a big difference in the
abstractions dealt with the compiled language and the target language
(it's not necessarily assembler, you could target other programming
languages).

Consider for example prolog (a logic, declarative programming
language). The structure of the corresponding assembly program is
completely ortogonal to the source lines of a prolog program!

(At most, you could translate the clauses (not text lines) of a prolog
programs into some data structure, clause by clause).

Take another example, Common Lisp (or perl), where the source code can
be _read_ (tokenized) differently depending on the reader macros that
has been created by previous expressions in the program source! (No
context-free grammar here (even if locally it may be context-free,
globally a Common Lisp or perl program is definitely not context-free)).

And even if you ignore reader macros, then you have normal lisp macros
which generate the code that you should actually compile. Lisp macros
are written in Lisp, a Turing Complete programming language: you cannot
know from the source of the macro expression what will be produced (or
when, if at all!). But CL and perl compilers still exist.

Take for example: SQL, Ocaml, Haskell, any OO programming language
(include C++, where the same source code may generate entirely different
target code, depending on the template or type declarations previously
read)!

Almost any interesting language will have the same disconnect between
the source and the target.

Also, when it is possible to generate code using only local information,
it is often very suboptimal: to generate good optimized code, you will
often have to perform GLOBAL analysis on the whole program sources (not
only of the current compilation unit).

I would assume you're basing your consideration on the only knowledge of
C. C is no more than a portable assembler. This may explain why you
believed that text-to-text compilation may be possible in general. But
no, it's only possible in particular, when you have an uninteresting
programming language in the first place.

--
__Pascal J. Bourguignon
http://www.informatimago.com

bartc

unread,

Jul 21, 2017, 7:32:17 AM7/21/17

to

On 21/07/2017 07:38, James Harris wrote:
>
> I recently looked at bootstrapping a compiler to get from source to
> assembly. And I thought the first step would be to parse the source to a
> tree. After all, that's what compilers do, right? But I am beginning to
> think that the compilation could go /directly/ from source text to
> assembly text, with no intervening stages whatsoever. In other words,
> the compiler would read the source phrase by phrase, writing assembly
> output to stdout as it goes. If it encounters an error in the source
> then it would write a message to stderr and stop compiling. (The stdout
> output would exist but be incomplete and could include a spoiler to
> ensure that it could not assemble.)
>
> Much of this single-pass, text-only translation would be made possible
> because the assembler will resolve references to labels. So the compiler
> can do a lot simply by using consistent label names.

I think it's workable but it depends on the input language. It wouldn't
work on one of my newer languages because that has features that best
require separate passes, otherwise it gets very hairy (requiring so much
backtracking that it wouldn't count as single pass anyway).

I've done a one-pass compiler for a dynamic language compiled to
byte-code (so no type system to worry about). And I believe my earliest
static compilers would have been one-pass, but the languages were crude
affairs.

It might be possible with C: my C compiler is currently two main passes,
the second one being code generation, and that is actually split into
two stages). If the quality of the generated code wasn't important, and
it was only for one target, then probably it could be done in one pass.

So I'm not sure if you can do this and still maintain the quality of the
compiler (that single pass will have so many things crammed in that it
would be unreadable!).

> But what about evaluating expressions? They cannot be translated
> in order as most languages apply precedence rules which can mean that
> later parts must be evaluated first, e.g. the * in
>
> A + B * C

I don't think expressions will be a problem. A C-like language will have
a type system requiring casts and promotions to be applied after both
operands of a binary op has been processed.

My byte-code compiler generates code for a stack machine, and that can
be applied to native code too (although it will be of terrible quality).

Then applying conversions later on just involves manipulating operands
on the stack.

> Am I missing something or can compiling really be this easy...? :-)

I think it can be that easy...

Function calls are not a problem (using the stack model; nothing is).
The function will have been declared (if C-like) and you know all the types.

--
bartc

Andy Walker

unread,

Jul 21, 2017, 8:22:58 AM7/21/17

to

On 21/07/17 07:38, James Harris wrote:
> I recently looked at bootstrapping a compiler to get from source to
> assembly. And I thought the first step would be to parse the source
> to a tree. After all, that's what compilers do, right? But I am
> beginning to think that the compilation could go /directly/ from
> source text to assembly text, with no intervening stages whatsoever.
> In other words, the compiler would read the source phrase by phrase,

> writing assembly output to stdout as it goes. [...]> Am I missing something or can compiling really be this easy...? :-)

You should perhaps bear in mind that that is exactly what
many compilers did in the early days -- before there were all the
modern tools, and long before there was anywhere to store a parse
tree [not enough RAM, and no disc]. Indeed, there may well not
even have been "assembly text" -- an extra pass is no fun if the
assembler is on paper tape or punched cards, and you have to wait
while the computer outputs the text, then load the assembler and
your text. So "load and go" compilers were common, and if not then
you expected the compiler to spit out a paper tape with binary for
your program.

You need, as Pascal points out, your language to be suitable
for this treatment. But I assume you are talking about a home-grown
language, so this is under your control?

--
Andy Walker,
Nottingham.

Nils M Holm

unread,

Jul 21, 2017, 9:02:50 AM7/21/17

to

James Harris <james.h...@gmail.com> wrote:
> I recently looked at bootstrapping a compiler to get from source to
> assembly. And I thought the first step would be to parse the source to a
> tree. After all, that's what compilers do, right? But I am beginning to
> think that the compilation could go /directly/ from source text to

> assembly text, with no intervening stages whatsoever. [...]

I've been doing this for a long time. The first SubC compiler went
from a HLL to assembly via templates, like you suggested:

http://www.t3x.org/subc/index.html

The latest T3X compiler goes from HLL directly to ELF:

http://www.t3x.org/t3x/index.html

The entire compiler is <1600 lines in size and self-compiles.
The binary is below 32K, statically linked, 32-bit ELF.

--
Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org

James Harris

unread,

Jul 22, 2017, 6:57:25 AM7/22/17

to

On 21/07/2017 07:38, James Harris wrote:
>
> I recently looked at bootstrapping a compiler to get from source to
> assembly. And I thought the first step would be to parse the source to a
> tree. After all, that's what compilers do, right? But I am beginning to
> think that the compilation could go /directly/ from source text to
> assembly text, with no intervening stages whatsoever. In other words,
> the compiler would read the source phrase by phrase, writing assembly
> output to stdout as it goes. If it encounters an error in the source
> then it would write a message to stderr and stop compiling. (The stdout
> output would exist but be incomplete and could include a spoiler to
> ensure that it could not assemble.)

Thanks for the replies to this. To confirm, yes, this is about simply
being able to compile, not about being able to compile a specific
language or to compile well. Proving the concept with a reasonable input
language is enough. I would set it in the context of two things:
traditional compiler texts and Let's Build a Compiler by Jack Crenshaw.

1. I first tried to write a compiler way back when I was in college. I'd
written an assembler in school and it seemed the logical next step. But
I was very wrong. I read up loads of info on how compilers did their
magic but it was so complex as to be daunting. Greek letters and
codebreaker symbols. Sentential forms and phrase transformations. Lexing
using FSMs. Parsing using grammars. And many different types of grammar.
At 18 it went well over my head. I failed to produce the compiler. It
was obviously far too complex an undertaking for a normal human.

2. Then, years later, I saw Jack Crenshaw's essay. That was a bit of an
eye opener as to how easy it could be to get a working compiler. But as
I got near the end of it I felt that what he had produced was clever but
limited. And, more to the point, could not easily be made less limited.

One thing I got from my failed compiler project (and, IIRC, from reading
the ZX Spectrum ROM disassembly - remember that?) was a realisation that
it's important not to get bogged down in the mechanism but to focus on
the goal - the compiler's output. Over the years I've thought about
writing a compiler by producing the code generator first, and then
working backwards - with each stage assuming its input and knowing what
it had to produce.

Then, when thinking about bootstrapping a compiler I wondered how
straightforward the process could be made if one were to accept certain
short-term simplifications. To my surprise, it looks as though it could
be made rather easy. Simply focus on what the output assembly code is to
look like for each element of the language. And if the language can be
tweaked then there's flexibility to adjust it to suit the compile
process. In this case: all declarations before use, limited error
checking, making the source suit the assembler, and handling a limited
set of data types.

In fact, it looks as though it might be possible to get away with just
having two data types:

32-bit signed integer
pointer to 32-bit signed integer

"What of strings?" I hear you say. I am thinking that all chars would be
32-bit (and signed but in positive ranges). Strings would be converted
on input or output to/from strings of 32-bit chars. That's maybe not a
bad idea anyway, these days, but the important point here is that it
means that strings can be handled with the 32-bit pointer type already
mentioned.

Even better, the source's use of pointers can be explicit. When pointers
are created or dereferenced then the source can show it. Then we know
the operation and the type from the context and we don't even need to
maintain a symbol table!

Given the simple-data-type approach, consider an example of tweaking the
source to suit the assembler. Global declarations can be simplified to
the following Nasm-format assembly code.

Label: %times N dd V

In that,

Label is the identifier name
N is the number of elements [*]
V is the initial value

[*] N for a scalar would be 1. But having it allows for an array to be
constructed.

Putting the above together, the source does not need to specify a type.
It only needs to say how many machine words are needed and their initial
value. The form would be:

id <number of elements> <name> <initial value>

Some examples:

id 1 LIMIT 200 ;A constant
id 1 ptr 0 ;A variable
id LIMIT buffer 0 ;An array

They would result in the following assembly code.

LIMIT: %times 1 dd 200
ptr: %times 1 dd 0
buffer: %times LIMIT dd 0

Basically, each global data declaration, whether constant, variable, or
array; whether it needs to be initialised or not, goes into the .data
section and is given an initial value. All the same. Very simple.
Adequate for the short-term purpose of bootstrapping a new language.

...

> A[E3] (i.e. assign to array element E3)
>
> {evaluate E3 into EAX}
> add eax, A

I realised that was wrong. It should be

{evaluate A into EAX}
push eax
{evaluate E3 into EAX}
pop ebx
lea eax, [eax + ebx * 4]

> Function calls are more of a challenge but I have worked out a way that
> I think will work. I'll post about it separately.

I started this post meaning to get in to function call translation but
it got a bit historical/philosophical. I'll post about function calls next.

--
James Harris

James Harris

unread,

Jul 22, 2017, 9:55:57 AM7/22/17

to

On 21/07/2017 07:38, James Harris wrote:

...

> Function calls are more of a challenge but I have worked out a way that
> I think will work. I'll post about it separately.

Here are the translations I came up with for function definitions and
function calls. As an anorak, I thought this was fun and will go into
the details but YMMV. It is a long post, and perhaps one to read only if
you are really, really interested!

All values in the following will be of the same type: 32-bit words.

When translating on the fly, function-call code comes up in four
contexts: encountering the start of a function definition, coming to the
end of a function definition, encountering a return statement, and
calling a function. All work with a common calling convention. Central
to that is the layout of the stack frame. For that, I ended up with the
following form (top of stack, low addresses first).

Callee-saved registers
Array of temporaries, if needed
Named locals (at EBP - offset)
Old EBP <-- pointed at by current EBP
Return address
Parameters (at EBP + offset)

In other words, during execution that frame is built from the bottom up:
parameters first, and so on.

The layout is conventional. And I found that the assembler provides
facilities which work with the convention and make such a stack frame
easy to use. As below.

Start of a function definition
==============================

Say we come across source which starts the definition of a function. In
C terms it would look something like

void f1 (int a, int b) {
int i;
int j;

It would be translated to

section .text
;******************************
;
; function f1
;
;******************************
f1:
%push
%stacksize flat
%arg a :dword
%arg b :dword
%assign %$localsize 0
%local i :dword
%local j :dword
push ebp
mov ebp, esp
sub esp, %$localsize + _f1_tempsize
push the usual callee-save regs

In that,

%push pushes a new assembler "context" - maybe akin to a namespace
%stacksize flat tells NASM to address relative to EBP
%arg names an argument
%assign %$localsize, 0 sets/resets a name which %local uses
%local names a local/automatic identifier

None of those generate any code. They merely set up assembler symbols.
The rest, from PUSH EBP, does generate code and is a standard function
prologue. It sets up the stack frame as described above.

Crucially, within the function, references to names now relate to the
right names. As if by magic!

* A local, L, is assembled as [EBP - offset]
* An argument, A, is assembled as [EBP + offset]
* A global, G, is assembled as [G]

Maybe that's cheating. I was going to address the symbols manually using
structs and all kinds of good stuff. But, at least for now, the inbuilt
Nasm facilities seem to work well and make other things easier. So I've
taken off the hair shirt for a moment. Ahh!

End of a function definition
============================

What goes up must come down. And a function started must end. When we
parse the end of the function definition there's obviously the need to
restore registers and return etc. But I found it helpful to have labels
that allowed two types of exit: fall into, with a named return, and
return with a preset EAX which "return val" can jump to. The function
end, then:

_f1_return:
mov eax, [retval]
_f1_return_value: ;Already in EAX
pop callee-save regs
mov esp, ebp
pop ebp
ret
_f1_tempsize equ N * 4 ;See below
%pop

That has underscores in front of its labels as I only use non-underscore
names for names which appear in the source. A function called "fred"
would begin with the label

fred:

and end with the labels

_fred_return:
_fred_return_value:

The convention of copying names to the output is intended to make it
easier to work with assembly - though I know that many compilers mangle
original names in different ways, including prepending an underscore.

The "_f1_tempsize equ N" line in the above sets the size in bytes of the
temporaries needed by this function. There may be none. In fact, I
haven't needed any yet. But they might be needed by expression
evaluators. And the number of them needed by evaluations in this
function, N, won't be known until we get to the end of the function.
Hence, the line appears here.

The %pop removes the context (which appears to be an assembler
namespace) which was pushed at the start.

To return
=========

A return statement within the function would compile simply.

return 53;

would become

mov eax, 53
jmp _f1_return_value

Function calls
==============

Function /calls/ are conventional but they do have an interesting
requirement in that the above %arg directive expects its arguments to be
named in the order they appear on the stack. That means they have to be
pushed in reverse order. And that, in turn, presents a challenge to a
compiler which is supposed to be working through the source in sequence.

Fortunately, it is easy to evaluate the arguments in order and to push
them in reverse order. And I do mean "evaluate", not just copy.
Arbitrary expressions can form arguments just as easily as constants or
identifier names. And they can end up on the stack in the right (i.e.
reverse) order. Pretty cool!

It does take up twice as much stack but, remember, this is a proof of
concept, not an optimising compiler. If it works, then it works. And
that's what's being sought now.

The scheme is as follows. To evaluate

f1(a, b, c)

where a, b and c can be arbitrary expressions and need to be pushed in
reverse order I found it easiest to "evaluate and push" them in order
and then to re-push each one in reverse order. So f1(a,b,c) would be
compiled to:

{evaluate a}
push eax
{evaluate b}
push eax
{evaluate c}
push eax
push dword [esp + 0]
push dword [esp + 8]
push dword [esp + 16]
call f1
add esp, (2 * 3) * 4 ;NParms is 3

That leaves the result in EAX, as was the adopted convention.

The offsets from ESP are easy to generate as they always go up by twice
the word size.

Note that I could avoid the 4 bytes used in duplicating the last
parameter but it's not worth the extra work for this application. And
keeping the translator simple makes it easier to apply the same
translation algorithm to any number of arguments. For example,

f1(a)

becomes

{evaluate a}
push eax
push dword [esp + 0]
call f1
add esp, (2 * 1) * 4 ;NParms is 1

and plain

f1()

becomes

call f1
add esp, (2 * 0) * 4 ;NParms is 0

Sundries
========

I think that's pretty much it on function calling. I should mention that
the assembler puts locals at addresses which are in reverse order to the
order they appear in the source but that shouldn't be an issue unless
you want to step between them. Arguments are in source order. (The
assembler builds its lists of both away from EBP but in opposite
directions.)

Once a compiler can generate function calls then all sorts of useful
facilities can be pre-provided by the compiler writer - perhaps to do
things which would be hard or impossible to express in a proto-language.
As an instance, in the above case, even though the language supports
only 32-bit integers, 16-bit and 8-bit operations could be carried out
by library functions. And, certainly, all IO can be carried out by
library calls.

Because there would be no type checking, I think the above would let a
compile take place even without having a symbol table - making this
basic compilation yet simpler! If so, the assembly step would pick up
absent names but not type mismatches. It would not pick up mismatches in
the number of parameters. But, given that I chose the convention to have
callers tidy the stack, it would automatically allow variadic parameter
lists such as the va_args stuff used in printf.

--
James Harris

Rod Pemberton

unread,

Jul 22, 2017, 12:08:11 PM7/22/17

to

On Fri, 21 Jul 2017 07:38:39 +0100
James Harris <james.h...@gmail.com> wrote:

> I recently looked at bootstrapping a compiler to get from source to
> assembly. And I thought the first step would be to parse the source
> to a tree. After all, that's what compilers do, right? But I am
> beginning to think that the compilation could go /directly/ from
> source text to assembly text, with no intervening stages whatsoever.

Yes, if you keep it simple.

Maybe, use some character directed parsing, e.g., a quote " character
indicates a quote follows. Switching on a character is easy in C, etc.

> To be able to translate a "break" out of the
> loop the compiler would need to know which loop we are in.

Yes, I think ... this is similar to or the same as the nested braces {}
("compound statement") issue in C we discussed on a.o.d. in 2015. You
can use a few counters to construct numbered labels unique to each loop.

msg-id op.x5sw4jnryfako5@localhost
https://groups.google.com/d/msg/alt.os.development/ezj6EUaqePo/Fs7f6hFXBgAJ

Rod Pemberton
--
Liberals love to point out that vehicles contribute to climate change.
Conservatives should point out that living in skyscrapers does so too.

Rod Pemberton

unread,

Jul 22, 2017, 12:08:22 PM7/22/17

to

On Fri, 21 Jul 2017 07:38:39 +0100
James Harris <james.h...@gmail.com> wrote:

> I recently looked at bootstrapping a compiler to get from source to
> assembly. And I thought the first step would be to parse the source
> to a tree. After all, that's what compilers do, right? But I am
> beginning to think that the compilation could go /directly/ from
> source text to assembly text, with no intervening stages whatsoever.

Homoiconic
https://en.wikipedia.org/wiki/Homoiconicity

Concatenative
https://en.wikipedia.org/wiki/Concatenative_programming_language

> Stuff like that should deal with expressions but not function calls
> or assignments. I'll post about function calls separately as they are
> more substantial - and more interesting - but for assignments I think
> a /separate/ set of evaluators is needed; they need to return the
> location of the result rather than its value. Some examples,

0-operand
https://en.wikipedia.org/wiki/Instruction_set_architecture#Number_of_operands

> Similar translations could be effected for for-loops and
> if-statements - simply write out the assembly with the right labels
> as we see the source. But what about evaluating expressions? They
> cannot be translated in order as most languages apply precedence
> rules which can mean that later parts must be evaluated first, e.g.
> the * in

RPN
https://en.wikipedia.org/wiki/Reverse_Polish_notation

bartc

unread,

Jul 24, 2017, 5:46:32 AM7/24/17

to

On 22/07/2017 14:55, James Harris wrote:

> Say we come across source which starts the definition of a function. In
> C terms it would look something like
>
> void f1 (int a, int b) {
> int i;
> int j;
>
> It would be translated to

sub esp, %$localsize + _f1_tempsize

> Crucially, within the function, references to names now relate to the
> right names. As if by magic!
>
> * A local, L, is assembled as [EBP - offset]
> * An argument, A, is assembled as [EBP + offset]
> * A global, G, is assembled as [G]
>
> Maybe that's cheating. I was going to address the symbols manually using
> structs and all kinds of good stuff. But, at least for now, the inbuilt
> Nasm facilities seem to work well and make other things easier. So I've
> taken off the hair shirt for a moment. Ahh!

Well, it is a bit of a cheat if the aim is for you to do this in one
pass, because it is Nasm that then has to do multiple passes!

I found out how many passes when Nasm ended up taking minutes when
working with tens of thousands of lines, and discovered the -O0 option
to have one pass, which halved assembly time.

But -O0 might still be enough to resolve the forward references you use,
although it will necessarily have to do some fix-ups of code generated
earlier.

> Function calls
> ==============
>
> Function /calls/ are conventional but they do have an interesting
> requirement in that the above %arg directive expects its arguments to be
> named in the order they appear on the stack. That means they have to be
> pushed in reverse order.

Why? That's commonly done in C (something to do with making their
variadic functions possible) but arguments can be passed in left to
right order. (Unless you need to call foreign functions which expect
right to left.)

--
bartc

James Harris

unread,

Jul 24, 2017, 7:24:41 AM7/24/17

to

On 24/07/2017 10:46, bartc wrote:
> On 22/07/2017 14:55, James Harris wrote:
>
>> Say we come across source which starts the definition of a function.
>> In C terms it would look something like
>>
>> void f1 (int a, int b) {
>> int i;
>> int j;
>>
>> It would be translated to
>
> sub esp, %$localsize + _f1_tempsize
>
>> Crucially, within the function, references to names now relate to the
>> right names. As if by magic!
>>
>> * A local, L, is assembled as [EBP - offset]
>> * An argument, A, is assembled as [EBP + offset]
>> * A global, G, is assembled as [G]
>>
>> Maybe that's cheating. I was going to address the symbols manually
>> using structs and all kinds of good stuff. But, at least for now, the
>> inbuilt Nasm facilities seem to work well and make other things
>> easier. So I've taken off the hair shirt for a moment. Ahh!
>
> Well, it is a bit of a cheat if the aim is for you to do this in one
> pass, because it is Nasm that then has to do multiple passes!

The goal is to make a compile so simple a child could write it! (As I
said later in the original post, can compiling really be made so simple?)

And it's absolutely fine that the scheme requires the assembler to
resolve symbols. Even if I was writing a highly sophisticated compiler I
would still want it to emit assembly with symbols; so allowing them in
this case is not a problem at all.

What is a bit of a cheat, AISI, is to use Nasm's %arg and %local
facilities - because they are not present in all assemblers, though
others may have similar facilities, or allow macros to be written to do
the same things.

But using %arg and %local, and having all data types evident from
context in the source, means the compile can, I think, go ahead without
even needing a symbol table! In other words, the compiler would appear
to be becoming very simple indeed - nothing more than text translation.
But I haven't written it yet so there's no proof for the time being.

>
> I found out how many passes when Nasm ended up taking minutes when
> working with tens of thousands of lines, and discovered the -O0 option
> to have one pass, which halved assembly time.
>
> But -O0 might still be enough to resolve the forward references you use,
> although it will necessarily have to do some fix-ups of code generated
> earlier.

Different thing, I think, as -O sets the optimisation level rather than
the number of assembler passes over the source. (This maybe relates to
the other thread where I suggested that "passes" is not a helpful term;
it has lost its original meaning and now means different things to
different people.)

>
>> Function calls
>> ==============
>>
>> Function /calls/ are conventional but they do have an interesting
>> requirement in that the above %arg directive expects its arguments to
>> be named in the order they appear on the stack. That means they have
>> to be pushed in reverse order.
>
> Why? That's commonly done in C (something to do with making their
> variadic functions possible) but arguments can be passed in left to
> right order. (Unless you need to call foreign functions which expect
> right to left.)

When you ask "why" I am not sure what you are referring to. Could you
say a bit more?

--
James Harris

bartc

unread,

Jul 24, 2017, 8:10:45 AM7/24/17

to

Why arguments need to pushed right to left instead of left to right.

(Left to right may mean parameter offsets have to be allocated in
reverse order in function entry code, but that's a smaller problem than
evaluating arbitrarily complex expressions in reverse order.)

--
bartc

James Harris

unread,

Jul 24, 2017, 11:36:20 AM7/24/17

to

On 24/07/2017 13:10, bartc wrote:
> On 24/07/2017 12:24, James Harris wrote:
>> On 24/07/2017 10:46, bartc wrote:
>
>>> Why? That's commonly done in C (something to do with making their
>>> variadic functions possible) but arguments can be passed in left to
>>> right order. (Unless you need to call foreign functions which expect
>>> right to left.)
>>
>> When you ask "why" I am not sure what you are referring to. Could you
>> say a bit more?
>
> Why arguments need to pushed right to left instead of left to right.

In this case it's because that's how the Nasm %arg directive expects
arguments to be pushed. What I think it does as allow for the return
address and then alter argument names as follows

%arg A :dword
%arg B :dword
%arg C :dword
%ard D :dword

Then A, B, C, D become

[ebp + 4]
[ebp + 8]
[ebp + 12]
[ebp + 16]

So when the function (strictly, any line in the current assembler
context) makes a reference to one of the former the assembler writes the
corresponding latter. It's a pretty neat system. And it allows the
function to name parameters and globals just as an HLL does. The
assembler will correctly access parameters and globals without us having
to do anything special, even if they have the same names!

The point here being that to fit in with that, the arguments have to be
pushed right-to-left so they appear left-to-right in the subroutine.

>
> (Left to right may mean parameter offsets have to be allocated in
> reverse order in function entry code, but that's a smaller problem than
> evaluating arbitrarily complex expressions in reverse order.)

Right-to-left, when combined with the caller fixing the stack, is /very/
flexible. As I think you mentioned somewhere it allows normal and vararg
passing with no special arrangements.

--
James Harris