Address manipulations

James Harris

Nov 7, 2021, 6:57:40 PM

I'll set out below what to my knowledge is a novel way of looking at
certain aspects of expression parsing. Don't be alarmed, it doesn't
parse Martian. In fact, I think (subject to correction) that it
implements the normal kind of parsing that a programmer would be
familiar with. But AISI it handles some of it in a simpler, more
natural, and more understandable way than I've seen anywhere else.

To explain, since the 1960s it has been traditional to think of some
identifiers as resolving to lvalues and others to rvalues. However, I
suggest below that another way of looking at matters is that when
parsing an expression the presence of an identifier name such as

X

/always/ results not in the value but in the address of the named
identifier X. An address is, of course, how it is interpreted in certain
contexts. But programmers find it natural if in other contexts X is
implicitly and automatically dereferenced to yield a value. Classically,
in the assignment

X = X

even though they look the same the last X is dereferenced while the
first is not.

What matters is semantics but contexts are easiest to discuss in terms
of the syntax so I'll do that. In simple terms one could say that if an
expression (of any sort) is followed by one of

= (assignment)
. (field selection)
( (function invocation)
[ (array lookup)

or is tweaked with increment or decrement operators (as in C's ++ and
--) then the /address/ is used. In all other contexts, however, an
implicit dereference is automatically inserted by a compiler such that the
value at the designated address is used instead. To illustrate, consider

A[2][4]

Note that after both A *and* the first closing square bracket there is
no dereference. In syntax terms one can consider that that's because
each is followed by one of the aforementioned symbols. IOW both A and
the first closing square bracket are followed by an opening square
bracket so there is no dereference. But there /is/ an automatic
dereference after the final square bracket because it is not followed by
one of the listed symbols. So the key as to whether an automatic
dereference is inserted or not is what comes next after an expression.

That's very flexible, allowing expressions to work with an arbitrary
number of addresses. For example,

B = A[2][4][6][8][10]

etc. That expression uses addresses all the way through. Each array
lookup results in yet another address. Only after the final square
bracket would there be a dereference.
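A minimal C sketch of the "addresses all the way through" reading, for a two-index case. The names ROWS, COLS, element_address and fetch are made up for illustration, and the C standard describes array evaluation in different terms, so this models the idea rather than any particular compiler.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 5 };   /* illustrative dimensions, not from the post */

/* Each indexing step only adjusts an address; no element is read here. */
int *element_address(int (*a)[COLS], size_t row, size_t col)
{
    assert(row < (size_t)ROWS && col < (size_t)COLS);
    char *addr = (char *)a;          /* the address produced by the name A */
    addr += row * sizeof a[0];       /* after [row]: still an address      */
    addr += col * sizeof a[0][0];    /* after [col]: still an address      */
    return (int *)addr;
}

/* The single dereference happens once, after the last index. */
int fetch(int (*a)[COLS], size_t row, size_t col)
{
    return *element_address(a, row, col);
}
```

With `int A[ROWS][COLS]`, the call `fetch(A, 2, 4)` reads the same element as `A[2][4]`, and `element_address(A, 2, 4)` compares equal to `&A[2][4]`.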

Of course, it's not just array indexing. Anything which /produces/ an
address can have its output fed into anything which /uses/ an address
and such operators can be combined arbitrarily. For example,

vectors[1](2).data[3] = y

Such an expression may be horrendous but illustrates how a programmer
could combine addresses in any way desired. Only after the y would there
be a dereference.

(Perhaps it's strange that as programmers we accept the inconsistency
that some contexts get implicit dereferences and some don't. But we
would probably not want to write all deref or no-deref points in code.
So we are where we are.)

Importantly, it is always possible to dereference an address to get a
value but there is no way to operate on a value to get its address. For
that reason my precedence table has all the address-consuming operators
first. That's probably true of most other languages as well but I've not
seen that set out as a rationale.

Consider how C uses its 'address of' operator, & as a prefix.

&X gets the address of X
&X[4] gets the address of X[4]
&X.f gets the address of field f

Yet C's & is not a normal operator. It does not transform its argument.
As stated, it is not possible to get from a value to an address. So &E
cannot evaluate E and then take its address. Therefore & is not an
operator in the normal sense that it manipulates a value. Instead, &E
inhibits the automatic dereference that would have been inserted at the
end of E: it prevents emission of the dereference that the compiler
would otherwise have emitted.
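One place where C makes this visible is `&x[n]`: the C standard says that when & is applied to the result of [], neither the & nor the implied * is evaluated, and the result is as if `x + n` had been written. The sketch below leans on that; one_past_end and sum are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Address of the slot just past the last element. x[n] does not exist,
   yet &x[n] is valid C because no load from x[n] is ever performed. */
const int *one_past_end(const int *x, size_t n)
{
    assert(x != NULL);
    return &x[n];                /* address arithmetic only: same as x + n */
}

/* Walks addresses; a load happens only where * is written explicitly. */
int sum(const int *x, size_t n)
{
    int total = 0;
    for (const int *p = x; p != one_past_end(x, n); p++)
        total += *p;
    return total;
}
```

If & first evaluated its operand and then "took the address", `&x[n]` would have to read past the end of the array; it does not, which fits the view that & suppresses a load rather than operating on a value.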

There is, perhaps, an additional oddity that an 'operator' at the
beginning of a subexpression really applies at the end of that
subexpression.

It may be more straightforward for & to be placed at the location where
the dereference would otherwise have been.

Assuming for discussion purposes that trailing & and infix & can be
distinguished (so we don't need to use another symbol) the above
expressions would become

X& the address of X
X[4]& the address of X[4]
X.f& the address of field f

Then the unary trailing & joins the symbols in the list above and
becomes just another of the operators which, when it appears after an
expression, inhibits the automatic dereference that would otherwise have
occurred at that point:

= assign
. field selection
( function call
[ index
& nothing except, like all the others, inhibit dereference

To summarise, there would no longer be the conceptual difference between
lvalues and rvalues. All identifiers would be considered as producing
their addresses, never their values. There would instead be contexts in
which automatic dereference takes place, and the programmer would put &
in any of those places where the automatic dereference was to be inhibited.

AFAIK that's a new way of looking at addresses in expressions but maybe
you know otherwise.

More importantly, as a programmer how easy would you find it to think in
those terms?

I wanted to go on to say more but this post is already overlong. I'll
come back to some of the other points.

Naturally, comments welcome!

--
James Harris

Bart

Nov 7, 2021, 7:25:57 PM

On 07/11/2021 23:57, James Harris wrote:

> To explain, since the 1960s it has been traditional to think of some
> identifiers are resolving to lvalues and others to rvalues. However, I
> suggest below that another way of looking at matters is that when
> parsing an expression the presence of an identifier name such as
>
> X
>
> /always/ results not in the value but in the address of the named
> identifier X. An address is, of course, how it is interpreted in certain
> contexts. But programmers find it natural if in other contexts X is
> implicitly and automatically dereferenced to yield a value. Classically,
> in the assignment
>
> X = X
>
> even though they look the same the last X is dereferenced while the
> first is not.

(Haven't we been here before?)

In X = X, both sides are dereferenced, one for reading, one for writing:

mov D0, [x]
mov [x], D0

If you like, emulate a language that doesn't dereference automatically
by writing &X instead of X. Then to do that assignment, you'd need to write:

*(&X) = *(&X)

Why would you need * on both sides if only one side is dereferenced?

Rod Pemberton

Nov 7, 2021, 8:23:24 PM

On Sun, 7 Nov 2021 23:57:38 +0000
James Harris <james.h...@gmail.com> wrote:

> I'll set out below what to my knowledge is a novel way of looking at
> certain aspects of expression parsing. Don't be alarmed, it doesn't
> parse Martian. In fact, I think (subject to correction) that it
> implements the normal kind of parsing that a programmer would be
> familiar with. But AISI it handles some of it in a simpler, more
> natural, and more understandable way than I've seen anywhere else.
>
>
> To explain, since the 1960s it has been traditional to think of some
> identifiers are resolving to lvalues and others to rvalues. However,
> I suggest below that another way of looking at matters is that when
> parsing an expression the presence of an identifier name such as
>
> X
>
> /always/ results not in the value but in the address of the named
> identifier X. An address is, of course, how it is interpreted in
> certain contexts.

Well, that depends on the language. E.g., in the variety of PL/1 I
programmed, all variables were passed-by-reference. I.e., they were
always treated as addresses. You could specify pass-by-value, if
desired (unneeded). For C, they are pretty much always treated as
addresses, which is mostly unseen or unnoticed by the programmer, or
even rejected as a concept by those pedants on c.l.c.. Of course, they
are passed-by-value in C, but sometimes by reference, if coded that way.

> But programmers find it natural if in other
> contexts X is implicitly and automatically dereferenced to yield a
> value. Classically, in the assignment
>
> X = X
>
> even though they look the same the last X is dereferenced while the
> first is not.

What? ... Of course, it is. X is dereferenced twice here.

You must get the address of X in both instances.
You need the address on the right to read/access X's value.
You need the address on the left to write/store X's value.

> What matters is semantics but contexts are easiest to discuss in
> terms of the syntax so I'll do that. In simple terms one could say
> that if an expression (of any sort) is followed by one of
>
> = (assignment)
> . (field selection)
> ( (function invocation)
> [ (array lookup)
>
> or is tweaked with increment or decrement operators (as in C's ++ and
> --) then the /address/ is used.

Personally, I think you're looking at this all the wrong way around.
Treat everything as an address from the get-go. Then, you should be
able to recognize that everything is an address.

E.g.,
printf("Hello World\n");

"Hello World\n" <-- placeholder for the address of the string constant
Hello World\n\0, which is stored somewhere else

Hello World\n\0 <-- string constant stored at the address of a
placeholder, which you see as "Hello World\n", i.e., which
is essentially an unnamed or hidden variable, or a compiler-created
temporary variable
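A short C sketch of the string-literal point. In most expression contexts the literal evaluates to the address of its first character, so it can be returned as a pointer, indexed, or dereferenced. The function names greeting and letter_at are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The literal itself evaluates to an address, which is what gets returned. */
const char *greeting(void)
{
    return "Hello World\n";
}

/* Indexing that address is the step that finally reads a character. */
char letter_at(size_t i)
{
    const char *s = greeting();
    assert(i < strlen(s));
    return s[i];
}
```

The same holds without a named function: `"Hello World\n"[1]` and `*"Hello World\n"` are both valid C expressions.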

> In all other contexts, however, an implicit [dereferences]

I don't give deference to dereferences.

If everything is an address from the get-go, there are no implicit
dereferences. As I said, the variety of PL/1 I programmed worked this
way, as does much of C, whether recognized as such or not.

> In all other contexts, however, an implicit [dereference] is
> automatically inserted by a compiler such that the value at the
> designated address is used instead. To illustrate, consider
>
> A[2][4]
>

Let's make it proper with an assignment:

B = A[2][4];

Start with A is an address.
Also, B is an address.
Adjust A's address by [2][4], which depends on the type's size.
Read data from the adjusted address of whatever type A was declared as.
Store data read from A at address B.

> Note that after both A *and* the first closing square bracket there
> is no dereference. In syntax terms one can consider that that's
> because each is followed by one of the aforementioned symbols. IOW
> both A and the first closing square bracket are followed by an
> opening square bracket so there is no deference. But there /is/ an
> automatic dereference after the final square bracket because it is
> not followed by one of the listed symbols. So the key as to whether
> an automatic dereference is inserted or not is what comes next after
> an expression.

In my explanation above, the assignment operator = does the dereference
of the adjusted address of A to read, and the dereference of B's address
to store.

> That's very flexible, allowing expressions to work with an arbitrary
> number of addresses. For example,
>
> B = A[2][4][6][8][10]
>
> etc. That expression uses addresses all the way through. Each array
> lookup results in yet another address. Only after the final square
> bracket would there be a dereference.

...

> Of course, it's not just array indexing. Anything which /produces/ an
> address can have its output fed into anything which /uses/ an address
> and such operators can be combined arbitrarily. For example,
>
> vectors[1](2).data[3] = y

Please, trust me and treat everything as an address. It will make your
life much easier.

> Such an expression may be horrendous but illustrates how a programmer
> could combine addresses in any way desired. Only after the y would
> there be a dereference.
>
> (Perhaps it's strange that as programmers we accept the inconsistency
> that some contexts get implicit dereferences and some don't. But we
> would probably not want to write all deref or no-deref points in
> code. So we are where we are.)

IMO, implicit dereferences are just used to explain away the fact that
everything is really an address, because this fact is beyond the
comprehension of novices who are not taught about addresses and data
types, but are taught about variables and strings, etc.

> Importantly, it is always possible to dereference an address to get a
> value but there is no way to operate on a value to get its address.
> For that reason my precedence table has all the address-consuming
> operators first. That's probably true of most other languages as well
> but I've not seen that set out as a rationale.
>
> Consider how C uses its 'address of' operator, & as a prefix.
>
> &X gets the address of X
> &X[4] gets the address of X[4]
> &X.f gets the address of field f
>
> Yet C's & is not a normal operator.

...

> It does not transform its argument.

Correct.

It actually tells C to **NOT** dereference the address of X, or
adjusted address from X, as is normally done prior to an assignment or
pass-by-value to a function, thereby leaving the address instead of the
value.

> As stated, it is not possible to get from a value to an
> address. So &E cannot evaluate E and then take its address. Therefore
> & is not an operator in the normal sense that it manipulates a value.
> Instead, &E inhibits the automatic dereference that would have been
> inserted at the end of E: it prevents emission of the dereference
> that the compiler would otherwise have emitted.

Yes. This is a result of everything in C being an address, a concept
rejected by C pedants on c.l.c. and elsewhere, and even you ...

> There is, perhaps, an additional oddity that an 'operator' at the
> beginning of a subexpression really applies at the end of that
> subexpression.
>
> It may be more straightforward for & to be placed at the location
> where the dereference would otherwise have been.
>
> Assuming for discussion purposes that trailing & and infix & can be
> distinguished (so we don't need to use another symbol) the above
> expressions would become
>
> X& the address of X
> X[4]& the address of X[4]
> X.f& the address of field f
>
> Then the unary trailing & joins the symbols in the list above and
> becomes just another of the operators which, when it appears after an
> expression, inhibits the automatic dereference that would otherwise
> have occurred at that point:
>
> = assign
> . field selection
> ( function call
> [ index
> & nothing except, like all the others, inhibit dereference
>
> To summarise, there would no longer be the conceptual difference
> between lvalues and rvalues.

...

> All identifiers would be considered as
> producing their addresses, never their values.

As stated previously here and numerous other posts, that is, IMO, the
correct approach.

> There would instead be
> contexts in which automatic dereference takes place, and the
> programmer would put & in any of those places where the automatic
> dereference was to be inhibited.
>
> AFAIK that's a new way of looking at addresses in expressions but
> maybe you know otherwise.

New? Perhaps, a new understanding for you, I guess.

> More importantly, as a programmer how easy would you find it to think
> in those terms?

I already do. Have for decades, in regards to C. No one discussing C
ever agrees with me though.

> I wanted to go on to say more but this post is already overlong. I'll
> come back to some of the other points.
>
> Naturally, comments welcome!

--

Is this the year that Oregon ceases to exist?

James Harris

Nov 8, 2021, 2:53:23 AM

On 08/11/2021 00:25, Bart wrote:
> On 07/11/2021 23:57, James Harris wrote:
>
>> To explain, since the 1960s it has been traditional to think of some
>> identifiers are resolving to lvalues and others to rvalues. However, I
>> suggest below that another way of looking at matters is that when
>> parsing an expression the presence of an identifier name such as
>>
>> X
>>
>> /always/ results not in the value but in the address of the named
>> identifier X. An address is, of course, how it is interpreted in
>> certain contexts. But programmers find it natural if in other contexts
>> X is implicitly and automatically dereferenced to yield a value.
>> Classically, in the assignment
>>
>> X = X
>>
>> even though they look the same the last X is dereferenced while the
>> first is not.
>
> (Haven't we been here before?)

Some. Although I think this is the first time I've ever written here
about some aspects such as the trailing & construct.

>
> In X = X, both sides are dereferenced, one for reading, one for writing:
>
> mov D0, [x]
> mov [x], D0

Rather than dereferenced do you mean that both sides are /accessed/?

I should explain what I mean by dereferencing. I mean, effectively,
following a pointer. I don't know x86-64 asm but in x86-32 asm the
dereference operation would be of the form

mov eax, [eax]

IOW EAX contains an address and that instruction replaces it with the
value at that address, aka it follows a pointer, aka it 'dereferences'!

To make it clearer consider assignments with different variables

B = A

that could translate to

lea eax, [A]
mov eax, [eax] ;<=== A is dereferenced
lea ebx, [B]
;<=== B is not dereferenced
mov [ebx], eax ;<=== the assign operation

A and B are both treated the same way - i.e. as addresses. However, B
cannot be dereferenced - i.e. its address cannot be converted to a value
- because its address is what's needed.
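The same split can be sketched in C with two hypothetical helpers, load and store. Whether the write inside store should also be called a dereference is exactly what this thread disagrees about; the sketch only separates the two operations so that the asymmetry in the assembly listing is visible.

```c
#include <assert.h>
#include <stddef.h>

/* The "dereference" in the post's sense: address in, value out. */
int load(const int *address)
{
    assert(address != NULL);
    return *address;
}

/* Assignment consumes an address and a value; the target is never loaded. */
void store(int *address, int value)
{
    assert(address != NULL);
    *address = value;
}

/* B = A under this model: both names begin as addresses, only A is loaded. */
void assign(int *b, const int *a)
{
    store(b, load(a));
}
```

Calling `assign(&B, &A)` performs one load (of A) and one store (to B), matching the five-instruction listing above.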

>
> If you like, emulate a language that doesn't dereference automatically
> by writing &X instead of X. Then to do that assignment, you'd need to
> write:
>
> *(&X) = *(&X)
>
> Why would you need * on both sides if only one side is dereferenced?
>

If I read that right the * on the RHS will be honoured but the one on
the left will not! Part of my thesis is that the LHS's * will be
inhibited by the = assignment. That's easy to see in simple assembly. If
the RHS of your expression is translated to

lea eax, [X] ;the & operator
mov eax, [eax] ;the * operator

then the LHS would correspondingly be translated to

lea eax, [X] ;the & operator

But the * operator on the LHS would be suppressed. If you would
translate it differently I suggest there would still be one fewer
dereferences on the LHS than on the RHS. Do you see now what I am
getting at?

--
James Harris

James Harris

Nov 8, 2021, 3:41:03 AM

On 08/11/2021 01:24, Rod Pemberton wrote:
> On Sun, 7 Nov 2021 23:57:38 +0000
> James Harris <james.h...@gmail.com> wrote:

...

>> Classically, in the assignment
>>
>> X = X
>>
>> even though they look the same the last X is dereferenced while the
>> first is not.
>
> What? ... Of course, it is. X is dereferenced twice here.
>
> You must get the address of X in both instances.
> You need the address on the right to read/access X's value.
> You need the address on the left to write/store X's value.

You seem to be thinking of /accesses/ rather than dereferences. Bart
did, too. By dereference I mean fetching the value at an address. For
example, in C

**p

will have one more dereference than

*p

>
>> What matters is semantics but contexts are easiest to discuss in
>> terms of the syntax so I'll do that. In simple terms one could say
>> that if an expression (of any sort) is followed by one of
>>
>> = (assignment)
>> . (field selection)
>> ( (function invocation)
>> [ (array lookup)
>>
>> or is tweaked with increment or decrement operators (as in C's ++ and
>> --) then the /address/ is used.
>
> Personally, I think you're looking at this all the wrong way around.

:-)

> Treat everything as an address from the get-go.

I do. In my compiler every identifier is initially treated as an
address. The difference is in /where/ dereference operations should be
inserted.

...

> Let's make it proper with an assignment:
>
> B = A[2][4];
>
> Start with A is an address.
> Also, B is an address.
> Adjust A's address by [2][4], which depends on the type's size.
> Read data from the adjusted address of whatever type A was declared as.
> Store data read from A at address B.

...

> In my explanation above, the assignment operator = does the dereference
> of adjusted address for A to read and the dereference of B's address to
> store.

A key question for you: Would the assignment operator still do the
dereference in

A = B + C

?

...

>> = assign
>> . field selection
>> ( function call
>> [ index
>> & nothing except, like all the others, inhibit dereference
>>
>> To summarise, there would no longer be the conceptual difference
>> between lvalues and rvalues.
>
> ...
>
>> All identifiers would be considered as
>> producing their addresses, never their values.
>
> As stated previously here and numerous other posts, that is, IMO, the
> correct approach.

Sounds good but do you also accept my thesis about expressions having
implicit dereference points? I am saying they take place in those places
which are not followed by one of the above symbols.

For example,

print A + 4
print B[0] + 4

Aren't there implicit dereference points in there where I've put @ signs
in the following?

print A@ + 4
print B[0]@ + 4

NB no deref immediately after B even though there is one after A.
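The rule can be tried out mechanically. Below is a rough C sketch, not taken from any real compiler, that copies an expression and writes @ wherever the rule would place an implicit dereference. The names inhibits_deref and mark_derefs are invented. The sketch assumes that only identifiers and a closing ] produce addresses, that numeric literals are already values, and it leaves out the proposed trailing & because telling it apart from infix & is left open above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* True if the text at p starts with a symbol that consumes an address. */
static int inhibits_deref(const char *p)
{
    if (p[0] == '=')
        return p[1] != '=';                     /* assignment, not == */
    if ((p[0] == '+' && p[1] == '+') || (p[0] == '-' && p[1] == '-'))
        return 1;                               /* increment or decrement */
    return p[0] == '.' || p[0] == '(' || p[0] == '[';
}

/* Copies expr to out with '@' at each implicit dereference point.
   Returns the output length, or 0 if out is too small. */
size_t mark_derefs(const char *expr, char *out, size_t out_size)
{
    assert(expr != NULL && out != NULL);
    if (out_size == 0)
        return 0;
    size_t n = 0, i = 0;
    while (expr[i] != '\0') {
        size_t start = i;
        int yields_address = 0;
        unsigned char c = (unsigned char)expr[i];
        if (isalpha(c) || c == '_') {           /* identifier: an address */
            while (isalnum((unsigned char)expr[i]) || expr[i] == '_')
                i++;
            yields_address = 1;
        } else if (isdigit(c)) {                /* literal: already a value */
            while (isdigit((unsigned char)expr[i]))
                i++;
        } else {                                /* any other single character */
            yields_address = (c == ']');
            i++;
        }
        /* room for the token, a possible '@', and the terminator */
        if (n + (i - start) + 2 > out_size) {
            out[0] = '\0';
            return 0;
        }
        memcpy(out + n, expr + start, i - start);
        n += i - start;
        if (yields_address) {
            size_t j = i;
            while (expr[j] == ' ')
                j++;
            if (!inhibits_deref(expr + j))
                out[n++] = '@';
        }
    }
    out[n] = '\0';
    return n;
}
```

Given "B[0] + 4" the function produces "B[0]@ + 4", and given "X = X" it produces "X = X@". Note that an identifier used as an index, as in "A[i]", also gets a mark after i, which the rule implies since the index has to be a value.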

--
James Harris

David Brown

Nov 8, 2021, 3:52:43 AM

On 08/11/2021 00:57, James Harris wrote:

> Consider how C uses its 'address of' operator, & as a prefix.
>
> &X      gets the address of X
> &X[4]   gets the address of X[4]
> &X.f    gets the address of field f
>
> Yet C's & is not a normal operator. It does not transform its argument

I'm not commenting on your main points at the moment - I think it is an
interesting view, and worth thinking about.

However, your comment that "C's & is not a normal operator" is somewhat
bizarre - it implies there is such a thing as a "normal operator". C
has all sorts of operators - function calls are operators, sizeof and
_Alignof are operators (neither of which evaluates their operand, and
the operand can be a type rather than an expression), assignment is an
operator (while in many languages, it is a statement). Casts are
operators that may or may not affect the value of the operand. The
comma operator evaluates and then discards its first operand. Structure
and union member access are operators.

I suppose you mean to say that "&" is somewhat different from addition
or multiplication. Alternatively, you could say that most operators in
C are not normal operators!
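Two items from that list can be checked with a small C sketch. The counter and function names are invented for the example; the operand of sizeof here is an ordinary (non-variable-length) expression, which is the case where C does not evaluate it.

```c
#include <assert.h>
#include <stddef.h>

static int calls = 0;        /* counts how often next_value() really runs */

int next_value(void)
{
    calls++;
    return calls;
}

/* sizeof only needs the operand's type, so the call is never executed. */
size_t size_without_evaluating(void)
{
    size_t n = sizeof(next_value());
    assert(n == sizeof(int));
    return n;
}

/* The comma operator does evaluate its left operand, then discards it. */
int comma_discards_left(void)
{
    return (next_value(), 10);
}

int call_count(void)
{
    return calls;
}
```

After `size_without_evaluating()` the counter is still 0; after `comma_discards_left()` it is 1, even though the value returned is 10.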

Dmitry A. Kazakov

Nov 8, 2021, 3:57:31 AM

On 2021-11-08 02:24, Rod Pemberton wrote:
> On Sun, 7 Nov 2021 23:57:38 +0000
> James Harris <james.h...@gmail.com> wrote:

>> Of course, it's not just array indexing. Anything which /produces/ an
>> address can have its output fed into anything which /uses/ an address
>> and such operators can be combined arbitrarily. For example,
>>
>> vectors[1](2).data[3] = y
>
> Please, trust me and treat everything as an address. It will make your
> like much easier.

Not true even for the assembler James calls a language. Even machine code
needs to have registers and immediates.

Everything (object) is a set of instructions bringing the computations
into the state corresponding to the actual value of the object in the
actual context.

Note that even this set is not fixed, it may vary, as the object can be
stored in a register, it can be packed in a way you cannot address it,
it can be marshaled over a network connection, the subprogram can be
inlined, a closure with the object can be passed indirectly via a display or
whatever, and so on and so forth.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

James Harris

Nov 8, 2021, 5:03:03 AM

On 08/11/2021 08:52, David Brown wrote:
> On 08/11/2021 00:57, James Harris wrote:
>
>> Consider how C uses its 'address of' operator, & as a prefix.
>>
>> &X      gets the address of X
>> &X[4]   gets the address of X[4]
>> &X.f    gets the address of field f
>>
>> Yet C's & is not a normal operator. It does not transform its argument

> I'm not commenting on your main points at the moment - I think it is an
> interesting view, and worth thinking about.

Cool. Considered views are appreciated.

>
> However, your comment that "C's & is not a normal operator" is somewhat
> bizarre - it implies there is such a thing as a "normal operator". C
> has all sorts of operators - function calls are operators, sizeof and
> _Alignof are operators (neither of which evaluates their operand, and
> the operand can be a type rather than an expression), assignment is an
> operator (while in many languages, it is a statement). Casts are
> operators that may or may not affect the value of the operand. The
> comma operator evaluates and then discards its first operand. Structure
> and union member access are operators.
>
> I suppose you mean to say that "&" is somewhat different from addition
> or multiplication. Alternatively, you could say that most operators in
> C are not normal operators!

Maybe it comes down to nomenclature. I think of an operator as something
which 'operates' on one or more 'operands' (ostensibly at run time but,
for example, operations involving constants may be pre-evaluated in the
compiler).

I agree that C has some 'operators' which do not do that - particularly
sizeof which rather than emitting code to calculate the size really
changes subsequent evaluation rules (!) so that what follows is not even
evaluated! The oddity of that is, IMO, reflected in the number of times
the C standards include words such as

"except in the case of sizeof ..."

So I agree with you about sizeof and _Alignof (not that I've ever used
it but I can guess from the name what it's for).

However, function calls, assignment, casts, and comma fit what I would
call operators because they operate on values ostensibly at run time.

Structure and union member accesses are interesting ones. First
impression is that I would call them operators because they add the
field offset to the expression on their left and are thus examples of
what I call 'addressing functions', as is array indexing or even
locating a node in a complex data structure.

That said, compared with C's sizeof, & is more of an operator because
while it says /not/ to do something it at least says not to do something
to the value to which a subexpression evaluates. :-) In my original post
the issue is raised as to whether & is better preceding or following the
subexpression it relates to.

I have wondered in the past whether there's a more logical replacement
for 'operators' which change evaluation rules such as sizeof but I've
not [yet :-)] come up with one.

--
James Harris

Bart

Nov 8, 2021, 5:19:28 AM

On 08/11/2021 08:52, David Brown wrote:

> On 08/11/2021 00:57, James Harris wrote:
>
>> Consider how C uses its 'address of' operator, & as a prefix.
>>
>> &X      gets the address of X
>> &X[4]   gets the address of X[4]
>> &X.f    gets the address of field f
>>
>> Yet C's & is not a normal operator. It does not transform its argument
> I'm not commenting on your main points at the moment - I think it is an
> interesting view, and worth thinking about.
>
> However, your comment that "C's & is not a normal operator" is somewhat
> bizarre - it implies there is such a thing as a "normal operator".

Well, you can't implement it with a function! Such as:

int a;
int* p = addressof(a);

You can do it if you twist the language around, but in general,
if 'a' normally means its value, you can't turn a value into the address
where it's stored.
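A small C sketch of why the function form cannot work: the parameter is a new object holding a copy of the value, so its address is never the caller's. The macro at the end is only one possible reading of "twisting the language around", and the names is_same_object and ADDRESS_OF are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 'a' is a fresh object initialised from the argument's value, so taking
   its address can never yield the caller's variable. */
bool is_same_object(int a, const int *original)
{
    assert(original != NULL);
    return &a == original;
}

/* A macro works on source text, so its operand is still a name, not yet
   a value, at the point where & is applied. */
#define ADDRESS_OF(x) (&(x))
```

With `int a = 5;`, the call `is_same_object(a, &a)` is false, while `ADDRESS_OF(a)` gives the same pointer as `&a`.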

Bart

Nov 8, 2021, 5:36:51 AM

On 08/11/2021 07:53, James Harris wrote:
> On 08/11/2021 00:25, Bart wrote:

>> In X = X, both sides are dereferenced, one for reading, one for writing:
>>
>> mov D0, [x]
>> mov [x], D0
>
> Rather than dereferenced do you mean that both sides are /accessed/?
>
> I should explain what I mean by dereferencing. I mean, effectively,
> following a pointer.

Following a pointer and then doing what? If you have a chain of derefs
like this:

***p = 0;

The first two will be read, the last used for writing.

> IOW EAX contains an address and that instruction replaces it with the
> value at that address, aka it follows a pointer, aka it 'dereferences'!

OK, now I understand. If you have a machine with one register which
contains a pointer, and read the value at the pointer:

mov R, [R]

then R is replaced with the target. But that doesn't happen here:

mov [R], 0 # R is unchanged

It needn't happen here either:

mov R2, [R] # R is unchanged

I see 'dereferencing' as something to do with the type system.

If P is a pointer, it might have type T*. If you dereference it, the
value you get has type T. The '*' reference has disappeared! But that
happens whether reading or writing:

*Q = *P

Both P and Q have type T*. During and after the assigning, they will
still have type T*.

To implement the assignment, * is used to dereference P's value of type
T* to get a value X of type T, and * is used to dereference Q's value of
type T*, to store X of type T.

>
> lea eax, [X] ;the & operator
>
> But the * operator on the LHS would be suppressed.

Only because you haven't shown it. But to write to the address in
eax to complete the assignment, you have to use [eax].

You seem to want to distinguish between an address used for reading
([eax]), and an address used for writing ([eax]).

Charles Lindsey

Nov 8, 2021, 6:38:55 AM

On 07/11/2021 23:57, James Harris wrote:

> I'll set out below what to my knowledge is a novel way of looking at certain
> aspects of expression parsing. Don't be alarmed, it doesn't parse Martian. In
> fact, I think (subject to correction) that it implements the normal kind of
> parsing that a programmer would be familiar with. But AISI it handles some of it
> in a simpler, more natural, and more understandable way than I've seen anywhere
> else.
>
>
> To explain, since the 1960s it has been traditional to think of some identifiers
> are resolving to lvalues and others to rvalues. However, I suggest below that
> another way of looking at matters is that when parsing an expression the
> presence of an identifier name such as
>
> X
>
> /always/ results not in the value but in the address of the named identifier X.
> An address is, of course, how it is interpreted in certain contexts. But
> programmers find it natural if in other contexts X is implicitly and
> automatically dereferenced to yield a value. Classically, in the assignment
>
> X = X
>
> even though they look the same the last X is dereferenced while the first is not.

I think you have just re-invented Algol68.

--
Charles H. Lindsey ---------At my New Home, still doing my own thing------
Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk
Email: c...@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

Andy Walker

Nov 8, 2021, 8:55:12 AM

On 08/11/2021 11:38, Charles Lindsey wrote:
> On 07/11/2021 23:57, James Harris wrote:
>> To explain, since the 1960s it has been traditional to think of
>> some identifiers are resolving to lvalues and others to rvalues.

I think you mean the '70s, or perhaps even the '80s? It
didn't become in any way "traditional" until well after C became
popular. Also, I suspect you meant "expression" rather than
"identifier"?

>> [...] Classically, in the assignment

>> X = X
>> even though they look the same the last X is dereferenced while the
>> first is not.
> I think you have just re-invented Algol68.

Everyone gets there in the end! It's so-o-o much simpler.

--
Andy Walker, Nottingham.
Andy's music pages: www.cuboid.me.uk/andy/Music
Composer of the day: www.cuboid.me.uk/andy/Music/Composers/Mendelssohn

Bart

Nov 8, 2021, 9:08:45 AM

On 07/11/2021 23:57, James Harris wrote:

This is pretty much my approach in my static language. However it is a
little simplistic, and introduces some restrictions.

For example, it makes it harder to apply "." and "[]" to values that are
not in memory, since there is no address. This would apply to arrays and
structs small enough to be located in registers, passed as value
parameters, or returned from a function by value. Example:

f().m
g()[i]

With a dynamic language, it may be different yet again. In mine,
anything you can apply "." or "[]" to is generally manipulated by
reference, but can be used as though it was all by value. These will work:

f().m
g()[i]

As can this:

(a+b).m # (with suitable types and overloads)
(c+d)[i] # c, d can be strings for example

Here, "." and "[]" are being applied to an rvalue, something else that
doesn't have an address.

David Brown

Nov 8, 2021, 10:35:04 AM

It is correct that you can't implement the & address operator as a
function in C. But I can't see how that is relevant - you also cannot
implement most other C operators as functions. My point is that as C
operators go, & does not stand out as being unusual.

Oh, and I'm sure you'll be pleased to hear that in C++, not only can
"addressof" be implemented as a function, but it is part of the standard
library (so that you can always get the address of an object, even if
its class has overridden the unary & operator).

James Harris

Nov 8, 2021, 1:30:05 PM

On 08/11/2021 10:36, Bart wrote:
> On 08/11/2021 07:53, James Harris wrote:
>> On 08/11/2021 00:25, Bart wrote:
>
>>> In X = X, both sides are dereferenced, one for reading, one for writing:
>>>
>>> mov D0, [x]
>>> mov [x], D0
>>
>> Rather than dereferenced do you mean that both sides are /accessed/?
>>
>> I should explain what I mean by dereferencing. I mean, effectively,
>> following a pointer.
>
> Following a pointer and then doing what? If you have a chain of derefs
> like this:
>
> ***p = 0;
>
> The first two will be read, the last used for writing.
>
>> IOW EAX contains an address and that instruction replaces it with the
>> value at that address, aka it follows a pointer, aka it 'dereferences'!
>
> OK, now I understand. If you have a machine with one register which
> contains a pointer, and read the address at the pointer:
>
> mov R, [R]
>
> then R is replaced with the target.

Yes, that's approximately the model. Your R could, in practice, be the
value at the top of the evaluation stack - even if the top word of the
evaluation stack is kept in a register, if you see what I mean. But,
yes, what I am calling a dereference would replace TOS with what TOS
points at.

Every variable reference such as

X

would (in terms of the parse tree) add a node for the address of X. Then
in certain contexts only it would add a node to replace TOS with what
TOS points at.
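
A minimal C sketch of that model, offered as an illustration rather than as anyone's actual compiler: every variable reference becomes an address node, and a separate dereference node is added only where a value is wanted. The node kinds and the names `Node`, `eval` and `assign` are assumptions made up for this example, and addresses are carried as `intptr_t` for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node kinds for a tiny expression tree. */
typedef enum { NODE_ADDR, NODE_DEREF, NODE_CONST, NODE_ADD } NodeKind;

typedef struct Node {
    NodeKind kind;
    intptr_t *addr;               /* NODE_ADDR: address of the named variable */
    intptr_t value;               /* NODE_CONST: literal value */
    const struct Node *lhs, *rhs; /* NODE_DEREF uses lhs; NODE_ADD uses both */
} Node;

/* Evaluate a tree. An address is carried as an integer, so a DEREF node
   simply replaces the address with the word stored at that address. */
static intptr_t eval(const Node *n)
{
    switch (n->kind) {
    case NODE_ADDR:  return (intptr_t)n->addr;
    case NODE_DEREF: return *(intptr_t *)eval(n->lhs);
    case NODE_CONST: return n->value;
    case NODE_ADD:   return eval(n->lhs) + eval(n->rhs);
    }
    assert(0 && "unknown node kind");
    return 0;
}

/* Generic assignment: the target tree must yield an address and the
   source tree must yield a value. */
static void assign(const Node *target, const Node *source)
{
    *(intptr_t *)eval(target) = eval(source);
}
```

For `X = X + 1` the target would be a bare address node for X, while the source would be an add node whose left operand is a dereference node wrapped around a second address node for X.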

> But that doesn't happen here:
>
> mov [R], 0 # R is unchanged
>
> It needn't happen here either:
>
> mov R2, [R] # R is unchanged

For both of those consider a fully generic model of assignment which
strips out any recognition of particular cases:

(expression 0) = (expression 1)

In that, expression 0 can be absolutely anything legal which results in
an address. Similarly, expression 1, type checking permitting, can be
absolutely anything legal which results in the value to be stored at the
aforementioned address. In register terms you could have the evaluated
result of expression 0 in R0 and the evaluated result of expression 1 in
R1. Then the assignment would be

mov [R0], R1

>
> I see 'dereferencing' as something to do with type system.

Perhaps that's because an explicit dereference does, indeed, always
convert one type to another, as you point out below. But the important
point, here, is that a dereference replaces TOS with what TOS points at.

>
> If P is a pointer, it might have type T*. If you dereference it, the
> value you get has type T. The '*' reference has disappeared! But that
> happens whether reading or writing:
>
> *Q = *P
>
> Both P and Q have type T*. During and after the assigning, they will
> still have type T*.
>
> To implement the assignment, * is used to dereference P's value of type
> T* to get a value X of type T, and * is used to dereference Q's value of
> type T*, to store X of type T.

The pointers complicate matters a little but don't change anything. Your
*Q=*P expression as used in C would still insert an automatic
dereference after P. If @ indicates where that dereference happens then
the expression would be

*Q = *P@

And it's not the assignment operator which inserts the dereference.
Consider

*Q = *P@ + *P@

C would add those two auto dereferences. Why? In simple terms because
neither P is followed by one of the operators which inhibit auto
dereference. By contrast, Q is followed by = and so it gets no auto
dereference.

>
>>
>> lea eax, [X] ;the & operator
>>
>> But the * operator on the LHS would be suppressed.
>
> Only because you haven't shown it. But to write to the address in
> eax to complete the assignment, you have to use [eax].
>
> You seem to want to distinguish between an address used for reading
> ([eax]), and an address used for writing ([eax]).
>

I can only suggest to think of it in terms of R0 and R1, as above, where
the expression on the LHS is evaluated to produce R0 and the expression
on the RHS is evaluated to produce R1. R0 has to end up holding an
address (because it will be used in the assignment operation). By
contrast, R1 has to end up holding a value (because it could have been
formed by operators which work on values - e.g. addition - and thus have
no addressable location).

--
James Harris

Bart

Nov 8, 2021, 2:54:45 PM

So for you, dereferencing can only ever produce an rvalue.

Using an analogy of numbered lockers, if you had a card in your hand
with locker number 37 on it, dereferencing is the process of opening
door 37, and extracting some artefact.

But if you had the card in one hand, and already had an artefact in the
other, what would you call the process of opening door 37, and
/inserting/ that object?

To me, acting on that '37' by opening the door to the locker is
'dereferencing' whether you put something in or take something out.

Going back to code, take this example:

*Q += *P

Now, *Q has to be dereferenced to extract a value, modify it with *P,
and put it back.

>> But that doesn't happen here:
>>
>> mov [R], 0 # R is unchanged
>>
>> It needn't happen here either:
>>
>> mov R2, [R] # R is unchanged
>
> For both of those consider a fully generic model of assignment which
> strips out any recognition of particular cases:
>
> (expression 0) = (expression 1)
>
> In that, expression 0 can be absolutely anything legal which results in
> an address. Similarly, expression 1, type checking permitting, can be
> absolutely anything legal which results in the value to be stored at the
> aforementioned address. In register terms you could have the evaluated
> result of expression 0 in R0 and the evaluated result of expression 1 in
> R1. Then the assignment would be
>
> mov [R0], R1

At one time I used to transform my assignments so that:

A := B

was processed as:

(&A)^ := B

(&A)^ is exactly equivalent to the auto-dereferencing of variables that
would go on (as lvalue or rvalue), but it was done like that to ensure
the LHS was actually an lvalue. So trying:

345 := B

wouldn't work.

>
>>
>> I see 'dereferencing' as something to do with type system.
>
> Perhaps that's because an explicit dereference does, indeed, always
> convert one type to another, as you point out below. But the important
> point, here, is that a dereference replaces TOS with what TOS points at.

It depends on the 'instruction' set of the virtual machine.

I normally do A := B with:

push B
pop A

I could also do it like this:

push &B
pushptr # replace TOS with *TOS - your 'deref'
pop A

or doing it both sides:

push &B
pushptr
push &A
popptr

In the case of C := A := B, the last popptr would be replaced with:

storeptr # does not pop the stack
pop C # or push &C; popptr

So for me, it's also about the operations. Simple loads and stores use
PUSH and POP (or STORE), which use immediate operands; anything more
elaborate uses the more general purpose PUSHPTR and POPPTR (and
STOREPTR), whose operands are addresses.
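
A rough C sketch of how such stack operations might behave, written only to make the sequences above concrete. The exact semantics are assumptions inferred from the listings: `pushptr` replaces the address on top of the stack with the word it points at, `popptr` pops an address and then a value and stores the value there, and `storeptr` does the same but leaves the value on the stack. The function names and the fixed-size stack are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_MAX 16

static intptr_t stack[STACK_MAX];
static int sp = 0;                 /* index of the next free slot */

static void push(intptr_t v) { assert(sp < STACK_MAX); stack[sp++] = v; }
static intptr_t pop(void)    { assert(sp > 0); return stack[--sp]; }

/* push &X */
static void push_addr(intptr_t *var) { push((intptr_t)var); }

/* pushptr: replace TOS (an address) with the word it points at */
static void pushptr(void)
{
    intptr_t *p = (intptr_t *)pop();
    push(*p);
}

/* popptr: pop an address, then pop a value and store it at that address */
static void popptr(void)
{
    intptr_t *p = (intptr_t *)pop();
    *p = pop();
}

/* storeptr: like popptr, but the value stays on the stack */
static void storeptr(void)
{
    intptr_t *p = (intptr_t *)pop();
    assert(sp > 0);
    *p = stack[sp - 1];
}

/* C := A := B using only the address-based operations */
static void chained_assign(intptr_t *c, intptr_t *a, intptr_t *b)
{
    push_addr(b); pushptr();   /* value of B */
    push_addr(a); storeptr();  /* A := B, value kept */
    push_addr(c); popptr();    /* C := that value */
}
```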

James Harris

Nov 8, 2021, 3:31:53 PM

On 08/11/2021 19:54, Bart wrote:
> On 08/11/2021 18:30, James Harris wrote:
>> On 08/11/2021 10:36, Bart wrote:

...

>>> OK, now I understand. If you have a machine with one register which
>>> contains a pointer, and read the address at the pointer:
>>>
>>> mov R, [R]
>>>
>>> then R is replaced with the target.
>>
>> Yes, that's approximately the model. Your R could, in practice, be the
>> value at the top of the evaluation stack - even if the top word of the
>> evaluation stack is kept in a register, if you see what I mean. But,
>> yes, what I am calling a dereference would replace TOS with what TOS
>> points at.
>
> So for you, dereferencing can only ever produce an rvalue.

I think l/r relates to how a value is treated rather than to anything
intrinsic about it. Dereferencing could produce a pointer, for example!

>
> Using an analogy of numbered lockers, if you had a card in your hand
> with locker number 37 on it, dereferencing is the process of opening
> door 37, and extracting some artefact.

No, if I had a card with 37 on it dereferencing would be opening locker
37 and replacing the card with what's in the locker. In asm

mov eax, 37 <== get card in hand
mov eax, [eax] <== dereference

and there could be as many of the latter as necessary

mov eax, [eax] <== dereference
mov eax, [eax] <== dereference
mov eax, [eax] <== dereference
mov eax, [eax] <== dereference

In that sense, the general case is really simple. Every dereference
would be exactly that one instruction.
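
The same idea as a small C sketch, assuming addresses fit in an `intptr_t` and that every word in the chain holds either an address or the final value. The name `deref_n` is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Apply "replace the card with what is in the locker" n times.
   Each loop iteration corresponds to one  mov eax, [eax]. */
static intptr_t deref_n(intptr_t start, int n)
{
    intptr_t r = start;
    assert(n >= 0);
    while (n-- > 0)
        r = *(intptr_t *)r;
    return r;
}
```

With zero steps the starting address comes back unchanged, which matches the idea that a bare identifier stands for its address until something dereferences it.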

>
> But if you had the card in one hand, and already had an artefact in the
> other, what would you call the process of opening door 37, and
> /inserting/ that object?

I would call that /accessing/.

>
> To me, acting on that '37' by opening the door to the locker is
> 'dereferencing' whether you put something in or take something out.

Fine but that's not what I was thinking of when I used the term. See

https://en.wikipedia.org/wiki/Dereference_operator

where it speaks about *returning the value at the pointer address*.

>
> Going back to code, take this example:
>
> *Q += *P
>
> Now, *Q has to be dereferenced to extract a value, modify it with *P,
> and put it back.
>

...

>>> I see 'dereferencing' as something to do with type system.
>>
>> Perhaps that's because an explicit dereference does, indeed, always
>> convert one type to another, as you point out below. But the important
>> point, here, is that a dereference replaces TOS with what TOS points at.
>
> It depends on the 'instruction' set of the virtual machine.
>
> I normally do A := B with:
>
> push B
> pop A

Fine but that wouldn't work if A and B were arbitrary expressions.

>
> I could also do it like this:
>
>     push &B
>     pushptr      # replace TOS with *TOS - your 'deref'
>     pop B

What would that look like if the assignment were

A := B + C

?

>

> or doing it both sides:
>
>     push &B
>     pushptr
>     push &A
>     popptr

That looks close to what I have been talking about but what if the terms
were expressions such as

A[2] := B[3] + C[4]

?

--
James Harris

Bart

Nov 8, 2021, 4:08:40 PM

On 08/11/2021 20:31, James Harris wrote:

> On 08/11/2021 19:54, Bart wrote:

> I would call that /accessing/.

OK. We'll have to disagree on that point. Except, what would you call
what happens on the LHS here:

*Q += *P

>>
>> To me, acting on that '37' by opening the door to the locker is
>> 'dereferencing' whether you put something in or take something out.
>
> Fine but that's not what I was thinking of when I used the term. See
>
> https://en.wikipedia.org/wiki/Dereference_operator
>
> where it speaks about *returning the value at the pointer address*.

That looks a poorly written article.

Note that it uses examples of "*" and "^" for dereference operators, but
fails to address the fact those same operators are also used on the LHS
of an assignment.

However look on the section on Pascal, where it mentions 'dereference'
but the only examples are on the LHS of an assignment, notably:

Complex^ := Complex

>> I could also do it like this:
>>
>>      push &B
>>      pushptr      # replace TOS with *TOS - your 'deref'
>>      pop B
>
> What would that look like if the assignment were
>
> A := B + C

If the pushes were done via PUSHPTR, then:

Stack (grows LTR)

push &B &B
pushptr B
push &C B &C
pushptr B C
add B+C
pop A -

> That looks close to what I have been talking about but what if the terms
> were expressions such as
>
> A[2] := B[3] + C[4]

That gets complicated to do by hand. The actual IR I generate for that is:

push &b
push 3 i64
pushptroff i64 8 -8
push &c
push 4 i64
pushptroff i64 8 -8
add i64
push &a
push 2 i64
popptroff i64 8 -8

PUSHPTROFF is like PUSHPTR but takes an offset, which can be scaled and
a further constant offset added (the 8 and -8 shown). It's equivalent to
this C:

*(int64_t*)((char*)A+2*8-8) = *(int64_t*)((char*)B+3*8-8) + *(int64_t*)((char*)C+4*8-8)

The -8 is due to my arrays being 1-based. This reduces down to this x64
code:

mov D0, [b+16]
mov D1, [c+24]
add D0, D1
mov [a+8], D0
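
For readers who want to check the offsets, here is a small C sketch of the address calculation, assuming 64-bit elements, a scale of 8 and a constant offset of -8 for 1-based indexing as described above. `elem_addr` and `assign_sum` are hypothetical helper names, not part of the IR.

```c
#include <assert.h>
#include <stdint.h>

/* Address of an element: base + index*scale + offset, in bytes. */
static int64_t *elem_addr(int64_t *base, int64_t index,
                          int64_t scale, int64_t offset)
{
    assert(scale > 0);
    return (int64_t *)((char *)base + index * scale + offset);
}

/* A[2] := B[3] + C[4] with 1-based indices (scale 8, offset -8). */
static void assign_sum(int64_t *a, int64_t *b, int64_t *c)
{
    *elem_addr(a, 2, 8, -8) = *elem_addr(b, 3, 8, -8)
                            + *elem_addr(c, 4, 8, -8);
}
```

With these numbers the byte offsets come out as 8 for A[2], 16 for B[3] and 24 for C[4], which agrees with the `[a+8]`, `[b+16]` and `[c+24]` operands in the x64 listing.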

James Harris

Nov 9, 2021, 5:42:53 AM

On 08/11/2021 21:08, Bart wrote:
> On 08/11/2021 20:31, James Harris wrote:
>> On 08/11/2021 19:54, Bart wrote:
>
>> I would call that /accessing/.
>
> OK. We'll have to disagree on that point.

Sure.

> Except,

:-)

> what would you call
> what happens on the LHS here:
>
> *Q += *P

I don't know. I don't support augmented assignment at the moment but
thinking it through I guess that in the general case that would resolve to

R0 = LHS expression
R1 = RHS expression

then

add [R0], R1

There's no dereference in the assignment itself. But in

*Q += *P * *P

I'd say that the two arguments to * would be dereferenced before the
multiplication takes place. Therefore, in your example,

*Q += *P

while there would be a dereference after P it would be due to P's context
(i.e. not being followed by a symbol which suppresses dereferences)
rather than to the assignment operation.
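
A tiny C sketch of that reading, assuming `int` operands. The names `add_assign` and `q_plus_equals_p` are invented for the example: the left-hand side is reduced to an address (R0), the right-hand side to a value (R1), and the augmented assignment itself is a single read-modify-write.

```c
#include <assert.h>
#include <stddef.h>

/* add [R0], R1 : r0 is an address, r1 is a value */
static void add_assign(int *r0, int r1)
{
    assert(r0 != NULL);
    *r0 += r1;
}

/* *Q += *P : q and p are the pointer values held in Q and P */
static void q_plus_equals_p(int *q, const int *p)
{
    int *r0 = q;   /* LHS stops at the address, no automatic dereference */
    int r1 = *p;   /* RHS gets the automatic dereference */
    add_assign(r0, r1);
}
```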

...

>>> I could also do it like this:
>>>
>>>      push &B
>>>      pushptr      # replace TOS with *TOS - your 'deref'
>>>      pop B
>>
>> What would that look like if the assignment were
>>
>>    A := B + C
>
> If the pushes were done via PUSHPTR, then:
>
>                      Stack (grows LTR)
>
>         push &B      &B
>         pushptr      B
>         push &C      B &C
>         pushptr      B C
>         add          B+C
>         pop A        -

OK. Then as it replaces &B with B I'd say your pushptr instructions are
dereference operations.

Note that your code results in the /value/ B+C on the stack, not an
address.

>
>> That looks close to what I have been talking about but what if the
>> terms were expressions such as
>>
>>    A[2] := B[3] + C[4]
>
> That gets complicated to do by hand. The actual IR I generate for that is:
>
>     push           &b
>     push           3          i64
>     pushptroff                i64 8 -8
>     push           &c
>     push           4          i64
>     pushptroff                i64 8 -8
>     add                       i64
>     push           &a
>     push           2          i64
>     popptroff                 i64 8 -8
>
>
> PUSHPTROFF is like PUSHPTR but takes an offset, which can be scaled and
> a further constant offset added (the 8 and -8 shown). It's equivalent to
> this C:
>
>     *((char*)A+2*8-8) = *((char*)B+3*8-8) + *((char*)C+4*8-8)
>
> The -8 is due to my arrays being 1-based. This reduces down to this x64
> code:
>
>           mov D0, [b+16]
>           mov D1, [c+24]
>           add D0, D1
>           mov [a+8], D0

OK. In resolving to [a+8] your optimiser has defeated part of the point
I wanted to make which was that the LHS can be some arbitrarily complex
expression and so would end up being an address stored in a register but
your example does still show that the RHS gets resolved to D0 holding a
value rather than an address.

--
James Harris

James Harris

Feb 12, 2022, 1:02:44 PM

On 08/11/2021 11:38, Charles Lindsey wrote:

> On 07/11/2021 23:57, James Harris wrote:
>> I'll set out below what to my knowledge is a novel way of looking at
>> certain aspects of expression parsing. Don't be alarmed, it doesn't
>> parse Martian. In fact, I think (subject to correction) that it
>> implements the normal kind of parsing that a programmer would be
>> familiar with. But AISI it handles some of it in a simpler, more
>> natural, and more understandable way than I've seen anywhere else.
>>
>>
>> To explain, since the 1960s it has been traditional to think of some
>> identifiers are resolving to lvalues and others to rvalues. However, I
>> suggest below that another way of looking at matters is that when
>> parsing an expression the presence of an identifier name such as
>>
>> X
>>
>> /always/ results not in the value but in the address of the named
>> identifier X. An address is, of course, how it is interpreted in
>> certain contexts. But programmers find it natural if in other contexts
>> X is implicitly and automatically dereferenced to yield a value.
>> Classically, in the assignment
>>
>> X = X
>>
>> even though they look the same the last X is dereferenced while the
>> first is not.
>
> I think you have just re-invented Algol68.
>

It's funny you should say that. I do think there's a similarity to
Algol68 in returns from functions which I hadn't even mentioned but will
do so now.

Consider a subexpression such as

A + B

I suggested before that both A and B should initially be taken
semantically as being addresses and that in the context in which they
appear both would be dereferenced because, in simple terms, they are not
followed by one of the symbols which inhibit dereferences:

. member selection
( function call
[ array indexing
= assignment
* do nothing (other than inhibit dereference)

Now consider

A[1] + B(0)

To be consistent, the subexpression A[1] would also yield the /address/
of the element rather than its value. Then assignment to an element
would happen naturally:

A[1] = A[2]

Because of the = sign the A[1] would not be dereferenced. By contrast,
because there's no inhibiting symbol the A[2] /would/ be dereferenced -
exactly as for simple variables and as expected in most familiar
programming languages.

Now, here's the point I wanted to add: To be even more consistent the
same would be true of B(0). It, too, would result in the address of the
return value rather than the value itself. AIUI that is what Algol68
also does but it could lead to some strange expressions such as

B(0) = 5

To 'tame' that I am thinking that a function would have to explicitly
mark any return as being writeable if it could be assigned to by the
caller. Any return which was not thus marked would only be treatable as
a value. I think that offers the best of both worlds: consistent
treatment but with the default option being safe.
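
C has no such marking on returns, but a loose analogy can be sketched with pointer-returning functions, using `const` as a stand-in for "not writeable". This is only an approximation of the idea. `writable_slot` and `readonly_slot` are hypothetical names, and the explicit `*` at the call site is where the proposed language would decide for itself whether to dereference.

```c
#include <assert.h>

static int table[4];

/* Return explicitly marked as writeable: the caller may assign through it. */
static int *writable_slot(int i)
{
    assert(i >= 0 && i < 4);
    return &table[i];
}

/* Default, safe form: the caller can only read the value. */
static const int *readonly_slot(int i)
{
    assert(i >= 0 && i < 4);
    return &table[i];
}
```

Here `*writable_slot(0) = 5;` plays the role of `B(0) = 5`, whereas `*readonly_slot(0) = 5;` is rejected by a C compiler because the pointed-to object is const-qualified.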

--
James Harris

James Harris

Feb 12, 2022, 1:07:34 PM

On 08/11/2021 13:55, Andy Walker wrote:
> On 08/11/2021 11:38, Charles Lindsey wrote:
>> On 07/11/2021 23:57, James Harris wrote:
>>> To explain, since the 1960s it has been traditional to think of
>>> some identifiers are resolving to lvalues and others to rvalues.
>
> I think you mean the '70s, or perhaps even the '80s? It
> didn't become in any way "traditional" until well after C became
> popular. Also, I suspect you meant "expression" rather than
> "identifier"?

I meant 'identifier' in the context of the post but you are right that
this is meant to apply to subexpressions. I've just broached the wider
subject in my reply to Charles (qv).

If I have gone down a path the designers of Algol went down then I am in
good company, albeit a long way behind them.

As also mentioned to Charles, ISTM such flexibility needs to be tamed
and made safe to use.

--
James Harris

Charles Lindsey

Feb 13, 2022, 10:26:11 AM

Yes, Algol 68 takes care of all that. The LHS of an assignment MUST be a
reference (otherwise you are trying to assign to a constant). That means you
know, at compile time, the exact type expected on the RHS (hence it is a
"strong" context), so you can use any known coercion to make it so (usually
dereferencing as in your examples). In the case of an operator in an expression,
the context of each operand is "weak", so fewer coercions are permitted; the
reason for this is that operators can be overloaded (there is no overloading of
functions in Algol 68).

C gets to more or less the same result by mumbling about LHS and RHS values,
which is harder to get your head around.

When it comes to arrays (and structures too), if the type of A is
reference-to-row-of-something, then the type of A[0] is reference-to-something,
so you can assign to it, dereferencing the RHS if necessary. But if the type of
A is just row-of-something, then it is a constant array, and A[0] is a constant
something.

Alexei A. Frounze

Feb 13, 2022, 4:49:23 PM

On Sunday, November 7, 2021 at 3:57:40 PM UTC-8, James Harris wrote:
[Joining late, haven't read all of the conversation.]

> I'll set out below what to my knowledge is a novel way of looking at
> certain aspects of expression parsing. Don't be alarmed, it doesn't
> parse Martian. In fact, I think (subject to correction) that it
> implements the normal kind of parsing that a programmer would be
> familiar with. But AISI it handles some of it in a simpler, more
> natural, and more understandable way than I've seen anywhere else.

Not sure about simpler/more natural. There may be "some" regularization,
true.

> To explain, since the 1960s it has been traditional to think of some
> identifiers are resolving to lvalues and others to rvalues.

C enums are never lvalues by design.
C arrays are somewhat an artificial construct and so can be viewed
kind of as both or neither (you can't directly assign to an entire array
with = unless you're assigning to a struct that contains an array,
you can't pass an array by value, unless it's again wrapped in a
struct, but an array still contains some other lvalues in the end).

> However, I
> suggest below that another way of looking at matters is that when
> parsing an expression the presence of an identifier name such as
>
> X
>
> /always/ results not in the value but in the address of the named
> identifier X. An address is, of course, how it is interpreted in certain
> contexts. But programmers find it natural if in other contexts X is
> implicitly and automatically dereferenced to yield a value. Classically,
> in the assignment
>
> X = X
>
> even though they look the same the last X is dereferenced while the
> first is not.

Um... It may be somewhat confusing because dereferences for the
purpose of reading from memory and dereferences for the purpose of
writing to memory appear somewhat different.

C's assign operators require their left operand to be an lvalue.
What is an lvalue (in, perhaps, a somewhat mechanistic view)?
It's an expression formed by dereferencing an address. And you
naturally need a memory address to both read and write memory.

But your = operator by itself screams in your face "I'm a memory
writing dereference!". Effectively, you may think that the lvalue's
own dereference and the one implied by the = operator are
duplicating one another or are two parts of one thing. Either way,
when you're writing a C compiler, once you've checked the types
in the assignment expression, you end up either eliminating the
lvalue's own dereference or you somehow fuse it with =
because in the end you generate just a single memory store
instruction that represents both the dereference and =.

Given this you may indeed think that the left operand of =
needs to be no more than an address and it's somehow
different from the right operand of =. But you need to consider
both, the dereference and =, together.

> What matters is semantics but contexts are easiest to discuss in terms
> of the syntax so I'll do that. In simple terms one could say that if an
> expression (of any sort) is followed by one of
>
> = (assignment)
> . (field selection)
> ( (function invocation)
> [ (array lookup)
>
> or is tweaked with increment or decrement operators (as in C's ++ and
> --) then the /address/ is used. In all other contexts, however, an
> implicit deference is automatically inserted by a compiler such that the
> value at the designated address is used instead. To illustrate, consider
>
> A[2][4]
>
> Note that after both A *and* the first closing square bracket there is
> no dereference.

Strictly speaking, there is and its result is, as usual, an element of the
array, which happens to be another array, which luckily needs no memory
read/write (yet) and only pointer arithmetic is needed here.
But with a further dereference you will have to access memory because
that array element is not an array anymore.

But in a different language every array element access (dereference/
subscript) may involve memory access. Java's multidimensional
arrays are implemented that way: an element of one array is a pointer to
(or an address of) another array. And you need to fetch addresses of
subarrays, you can't simply compute them by adding an index to
the pointer to the enclosing array.
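
The two layouts can be contrasted in C itself. This is a sketch with made-up names (`flat_elem`, `rows_elem`) and arbitrary dimensions: in the contiguous case the element address is pure arithmetic, while in the Java-like case the row pointer has to be loaded from memory first.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 5 };

/* C-style contiguous array: the address of a[i][j] is computed from the
   base address alone, with no memory read along the way. */
static int *flat_elem(int (*a)[COLS], int i, int j)
{
    assert(i >= 0 && i < ROWS && j >= 0 && j < COLS);
    return (int *)((char *)a
                   + ((size_t)i * COLS + (size_t)j) * sizeof(int));
}

/* Java-like array of row pointers: rows[i] must be fetched from memory
   before the element address can be formed. */
static int *rows_elem(int **rows, int i, int j)
{
    int *row;
    assert(i >= 0 && i < ROWS && j >= 0 && j < COLS);
    row = rows[i];   /* this is a real memory read */
    return row + j;
}
```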

> In syntax terms one can consider that that's because
> each is followed by one of the aforementioned symbols. IOW both A and
> the first closing square bracket are followed by an opening square
> bracket so there is no deference. But there /is/ an automatic
> dereference after the final square bracket because it is not followed by
> one of the listed symbols. So the key as to whether an automatic
> dereference is inserted or not is what comes next after an expression.
>
> That's very flexible, allowing expressions to work with an arbitrary
> number of addresses. For example,
>
> B = A[2][4][6][8][10]
>
> etc. That expression uses addresses all the way through. Each array
> lookup results in yet another address. Only after the final square
> bracket would there be a dereference.

Doesn't have to be that way. Java is an example.

> Of course, it's not just array indexing. Anything which /produces/ an
> address can have its output fed into anything which /uses/ an address
> and such operators can be combined arbitrarily. For example,
>
> vectors[1](2).data[3] = y
>
> Such an expression may be horrendous but illustrates how a programmer
> could combine addresses in any way desired. Only after the y would there
> be a dereference.

It's probably important to note that the above expression in C produces
a temporary value (the function return value) that needs to hang around
for a while in order for .data to be accessed off it. Mechanically
it needs to be an lvalue, but it's short lived and messing with it is
therefore troublesome, hence the standard says modifying the return
value yields undefined behavior. That is, if that data member is a
pointer, the expression may be well formed. If data is an array, you
have UB right there where you attempt to modify its element at index 3.

> (Perhaps it's strange that as programmers we accept the inconsistency
> that some contexts get implicit dereferences and some don't. But we
> would probably not want to write all deref or no-deref points in code.
> So we are where we are.)

Definitely, you don't have to expose the underlying mechanics when
it creates unnecessary friction (e.g. in form of verbosity and mental
effort). But with enough shortcuts you may end up looking at a
collection of nonuniform things. C arrays have their own problems,
C string literals add to this, then again you don't have to have both .
and -> to access members of a structure. And then there's C++ with
a mess of different ways to construct and initialize objects using
different syntaxes.
I particularly like Pascal's approach to passing variables by reference:
just prepend "var" before the parameter and the additional associated
dereferences will be generated by the compiler.

> Importantly, it is always possible to dereference an address to get a
> value but there is no way to operate on a value to get its address.

When implementing a C compiler you may treat most (if not all)
expressions as trees with operators in non-leaf nodes and
integer/float numbers and addresses in leaf nodes.
That's all there is, pretty much.
Structures, arrays, complex numbers don't map onto the CPU registers
and don't make it to the backend level.
So, your "cond ? struct1 : struct2" transform into
"*(cond ? &struct1 : &struct2)" under the hood just like
"struct1 = struct2" transforms into "memcpy(&struct1, &struct2)"
or something similar that can more readily be translated into
CPU instructions and mapped into its registers.
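
Spelled out in C, the lowering might look roughly like the sketch below. The struct `Pair` and the function `copy_selected` are hypothetical, and the size argument that the shorthand above leaves out is written explicitly.

```c
#include <assert.h>
#include <string.h>

typedef struct { int x, y; } Pair;

/* dst = cond ? s1 : s2, lowered to addresses plus a block copy:
   first *(cond ? &s1 : &s2), then memcpy(&dst, src, size). */
static void copy_selected(Pair *dst, int cond,
                          const Pair *s1, const Pair *s2)
{
    const Pair *src = cond ? s1 : s2;  /* select an address, not a value */
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);     /* struct assignment as a copy */
}
```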

> For
> that reason my precedence table has all the address-consuming operators
> first. That's probably true of most other languages as well but I've not
> seen that set out as a rationale.
>
> Consider how C uses its 'address of' operator, & as a prefix.
>
> &X gets the address of X
> &X[4] gets the address of X[4]
> &X.f gets the address of field f
>
> Yet C's & is not a normal operator. It does not transform its argument.
> As stated, it is not possible to get from a value to an address. So &E
> cannot evaluate E and then take its address. Therefore & is not an
> operator in the normal sense that it manipulates a value. Instead, &E
> inhibits the automatic dereference that would have been inserted at the
> end of E: it prevents emission of the dereference that the compiler
> would otherwise have emitted.
>
> There is, perhaps, an additional oddity that an 'operator' at the
> beginning of a subexpression really applies at the end of that
> subexpression.

Well, if it helps to read the code, you could use parens, e.g. &(X[4]),
but they are meaningless here. You could also prohibit large and complex
expressions and require them to be broken down into shorter and
simpler ones with e.g. temporary variables at every step, but
that (temporaries and low code density) in itself is problematic.
If you read X[4] into a temporary, taking its (temporary's) address wouldn't
give you the address within X[], which is kinda bad.
I think postfix expressions (and I mean not just postfix ++ and -- but
all of this subscripting, calling, member accessing) in C are more useful
than not.

> It may be more straightforward for & to be placed at the location where
> the dereference would otherwise have been.
>
> Assuming for discussion purposes that trailing & and infix & can be
> distinguished (so we don't need to use another symbol) the above
> expressions would become
>
> X& the address of X
> X[4]& the address of X[4]
> X.f& the address of field f

Should we also use numeric negation this way, e.g. X[4]- in place of
-X[4]? That would look pretty awkward to mathy people (not that
they'd find it an insurmountable obstacle, I hope).
I think what we've got in C here is good enough.

> Then the unary trailing & joins the symbols in the list above and
> becomes just another of the operators which, when it appears after an
> expression, inhibits the automatic dereference that would otherwise have
> occurred at that point:
>
> = assign
> . field selection
> ( function call
> [ index
> & nothing except, like all the others, inhibit dereference
>
> To summarise, there would no longer be the conceptual difference between
> lvalues and rvalues. All identifiers would be considered as producing
> their addresses, never their values. There would instead be contexts in
> which automatic dereference takes place, and the programmer would put &
> in any of those places where the automatic dereference was to be inhibited.

Then you also need to distinguish pointer arithmetic from non-pointer arithmetic
if you still want to keep both.
If a-b now gives me the distance between a and b in memory instead of the
numeric difference of the values stored at addresses a and b, it's kinda bad.
Similarly, I don't always mean a pointer when I write a+1.
No?

> AFAIK that's a new way of looking at addresses in expressions but maybe
> you know otherwise.
>
> More importantly, as a programmer how easy would you find it to think in
> those terms?

I'd keep implicit pointers hidden. Seems like you want to expose them for no
good reason.

Alex

James Harris

Feb 14, 2022, 1:38:51 PM

On 13/02/2022 21:49, Alexei A. Frounze wrote:
> On Sunday, November 7, 2021 at 3:57:40 PM UTC-8, James Harris wrote:

> [Joining late, haven't read all of the conversation.]

No problem. Welcome!

...

>> However, I
>> suggest below that another way of looking at matters is that when
>> parsing an expression the presence of an identifier name such as
>>
>> X
>>
>> /always/ results not in the value but in the address of the named
>> identifier X. An address is, of course, how it is interpreted in certain
>> contexts. But programmers find it natural if in other contexts X is
>> implicitly and automatically dereferenced to yield a value. Classically,
>> in the assignment
>>
>> X = X
>>
>> even though they look the same the last X is dereferenced while the
>> first is not.
>
> Um... It may be somewhat confusing because dereferences for the
> purpose of reading from memory and dereferences for the purpose of
> writing to memory appear somewhat different.

Bart said something similar but by 'dereference' I mean essentially what
C's '*' prefix operator does.

As humans we understand assignment so perhaps we focus on that specific
case but consider that /in general/ an assignment requires two
expressions. The first has to result in an address (or reference, if you
prefer) but the second expression naturally results in a value. For
example, in

X = X + 1

the LHS has to result in an address whereas the RHS naturally results in
a value rather than an address.

> C's assign operators require their left operand to be an lvalue.
> What is an lvalue (in, perhaps, a somewhat mechanistic view)?
> It's an expression formed by dereferencing an address. And you
> naturally need a memory address to both read and write memory.
>
> But your = operator by itself screams in your face "I'm a memory
> writing dereference!". Effectively, you may think that the lvalue's
> own dereference and the one implied by the = operator are
> duplicating one another or are two parts of one thing. Either way,
> when you're writing a C compiler, once you've checked the types
> in the assignment expression, you end up either eliminating the
> lvalue's own dereference or you somehow fuse it with =
> because in the end you generate just a single memory store
> instruction that represents both the dereference and =.

I should be clearer about terms:

* reference: an address (or the equivalent)
* dereference: fetch the value at the reference

Such a dereference is /a monadic operation/ which takes what it assumes
to be an address and yields the value at that address.

As mentioned, it's akin to C's monadic asterisk operator (except that
it's how code is processed; it's not present in source code).

>
> Given this you may indeed think that the left operand of =
> needs to be no more than an address and it's somehow
> different from the right operand of =. But you need to consider
> both, the dereference and =, together.

In the example of

X = X + 1

note that both X's would initially be addresses but the second would be
dereferenced (as defined above) because it is not followed by one of the
operators which inhibit dereferences.
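
As a rough illustration of the terms above, here is a minimal sketch in Python. It is not from the thread: the dict-based `memory`, the `symbols` table and the helper names `ref`, `deref` and `assign` are all made up for the example, and a real compiler would of course emit loads and stores rather than call functions.

```python
# Hypothetical model: memory is a dict from integer addresses to values,
# and `symbols` maps each name to its address. None of these names come
# from the thread.
memory = {100: 41}
symbols = {"X": 100}

def ref(name):
    """An identifier always yields its address (a 'ref T')."""
    return symbols[name]

def deref(address):
    """Monadic dereference: fetch the value stored at an address."""
    return memory[address]

def assign(address, value):
    """Assignment consumes an address on the left and a value on the right."""
    memory[address] = value

# X = X + 1: the first X is followed by '=', so it stays an address;
# the second X is not followed by an inhibiting operator, so it is
# dereferenced.
assign(ref("X"), deref(ref("X")) + 1)
```

Both occurrences of X start out as `ref("X")`; the only difference is whether a `deref` is wrapped around the result.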

...

>> In all other contexts, however, an
>> implicit dereference is automatically inserted by a compiler such that the
>> value at the designated address is used instead. To illustrate, consider
>>
>> A[2][4]
>>
>> Note that after both A *and* the first closing square bracket there is
>> no dereference.
>
> Strictly speaking, there is and its result is, as usual, an element of the
> array, which happens to be another array, which luckily needs no memory
> read/write (yet) and only pointer arithmetic is needed here.

Ah, no. I am proposing that

A[2][4] = A[2][4] + 1

would parse in exactly the same way as X = X + 1, above. The inner
subexpression

A[2][4]

appears twice just as X appeared twice and the latter instance would be
dereferenced just as the latter X was dereferenced because it is not
followed by an operator which inhibits dereferences.

Neither A nor A[2] would be dereferenced (as defined above) at any
point. The expressions A, A[2] and A[2][4] would manipulate only addresses.
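
To make that concrete, here is a small Python sketch under assumed conditions: a 3 x 5 array stored row by row at a made-up base address, one memory cell per element, and a `loads` list that records every address actually read. The names and sizes are illustrative only.

```python
# Assumed layout: A is a 3 x 5 array of one-cell elements stored row by row
# starting at address 1000. All names and sizes are illustrative.
ROWS, COLS, ELEM_SIZE = 3, 5, 1
BASE_A = 1000
memory = {BASE_A + i: i * 10 for i in range(ROWS * COLS)}
loads = []  # records every address that is actually read

def index(address, i, stride):
    """Indexing maps an address to another address; it reads no memory."""
    return address + i * stride

def deref(address):
    """Fetch the value at an address, noting that a read took place."""
    loads.append(address)
    return memory[address]

def element_address():
    row = index(BASE_A, 2, COLS * ELEM_SIZE)   # A[2]    -> still an address
    return index(row, 4, ELEM_SIZE)            # A[2][4] -> still an address

# A[2][4] = A[2][4] + 1
memory[element_address()] = deref(element_address()) + 1
```

In this model the only address ever read is that of A[2][4] on the right-hand side; A and A[2] contribute address arithmetic and nothing else.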

> But with a further dereference you will have to access memory because
> that array element is not an array anymore.

Yes, a dereference changes the type from 'ref T' to T.

...

>> Of course, it's not just array indexing. Anything which /produces/ an
>> address can have its output fed into anything which /uses/ an address
>> and such operators can be combined arbitrarily. For example,
>>
>> vectors[1](2).data[3] = y
>>
>> Such an expression may be horrendous but illustrates how a programmer
>> could combine addresses in any way desired. Only after the y would there
>> be a dereference.
>
> It's probably important to note that the above expression in C produces
> a temporary value (the function return value) that needs to hang around
> for a while in order for .data to be accessed off it. Mechanically
> it needs to be an lvalue, but it's short lived and messing with it is
> therefore troublesome, hence the standard says modifying the return
> value yields undefined behavior. That is, if that data member is a
> pointer, the expression may be well formed. If data is an array, you
> have UB right there where you attempt to modify its 3rd element.

That sounds important but I can't parse it. If vectors[1] holds the
address of a function and that function returns an address why is a
temporary needed?

Here's how the expression may be parsed:

get the /address/ of the 'vectors' array
because it's followed by "[" don't dereference it
add 1 * sizeof a vector
call the function at that address (with 2 as a parameter)
add the offset of the field called 'data'
add 3 * sizeof each element of data
use that address in the assignment

Each stage produces an address, even the function call.
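
The steps above can be sketched as a chain of address computations. This is only a model in Python: every constant is invented, `record_for` is a stand-in for whatever function lives at vectors[1], and how a call actually locates its code is glossed over by keying a table of callables on the slot address.

```python
# Every number and name below is an assumption made for illustration.
VECTORS_BASE = 200      # address of the 'vectors' array
VECTOR_SIZE = 8         # size of one element of 'vectors'
DATA_OFFSET = 4         # offset of field 'data' within the returned record
DATA_ELEM_SIZE = 2      # size of one element of 'data'

memory = {}

def record_for(arg):
    """Stand-in for the called function: returns the address of a record."""
    return 5000 + arg * 100

# Callable things, keyed by the address at which they live.
functions = {VECTORS_BASE + 1 * VECTOR_SIZE: record_for}

def target_address():
    addr = VECTORS_BASE                 # 'vectors' ('[' follows: no deref)
    addr = addr + 1 * VECTOR_SIZE       # [1]
    addr = functions[addr](2)           # (2): call the function at that address
    addr = addr + DATA_OFFSET           # .data
    addr = addr + 3 * DATA_ELEM_SIZE    # [3]
    return addr

y_address = 300
memory[y_address] = 7

# vectors[1](2).data[3] = y : only y is dereferenced.
memory[target_address()] = memory[y_address]
```

Whether the record returned by the call outlives the expression is a separate question which this model does not address.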

>
>> (Perhaps it's strange that as programmers we accept the inconsistency
>> that some contexts get implicit dereferences and some don't. But we
>> would probably not want to write all deref or no-deref points in code.
>> So we are where we are.)
>
> Definitely, you don't have to expose the underlying mechanics when
> it creates unnecessary friction (e.g. in form of verbosity and mental
> effort).

Though it's inconsistent. For example,

X[Y]

Even though the two names have the same form we dereference one (Y) but
not the other (X).

I'm not complaining about that, BTW, just pointing out that that's what
we as programmers have got used to so a compiler has to deal with it.

...

>> Consider how C uses its 'address of' operator, & as a prefix.
>>
>> &X gets the address of X
>> &X[4] gets the address of X[4]
>> &X.f gets the address of field f

...

>> There is, perhaps, an additional oddity that an 'operator' at the
>> beginning of a subexpression really applies at the end of that
>> subexpression.
>
> Well, if it helps to read the code, you could use parens, e.g. &(X[4]),
> but they are meaningless here. You could also prohibit large and complex
> expressions and require them to be broken down into shorter and
> simpler ones with e.g. temporary variables at every step, but
> that (temporaries and low code density) in itself is problematic.
> If you read X[4] into a temporary, taking its (temporary's) address wouldn't
> give you the address within X[], which is kinda bad.
> I think postfix expressions (and I mean not just postfix ++ and -- but
> all of this subscripting, calling, member accessing) in C are more useful
> than not.

I don't think there's a problem. ISTM that

X&

is a good way to yield X's address.

>
>> It may be more straightforward for & to be placed at the location where
>> the dereference would otherwise have been.
>>
>> Assuming for discussion purposes that trailing & and infix & can be
>> distinguished (so we don't need to use another symbol) the above
>> expressions would become
>>
>> X& the address of X
>> X[4]& the address of X[4]
>> X.f& the address of field f
>
> Should we also use numeric negation this way, e.g. X[4]- in place of
> -X[4]? That would look pretty awkward to mathy people (not that
> they'd find it an insurmountable obstacle, I hope).

If you want consistently all unaries to be at the front you'd end up
with something like

-x
&x
[4]x
()x

:-(

In reality programmers expect some operators to be prefix and some to be
postfix. I didn't invent this!

...

>> To summarise, there would no longer be the conceptual difference between
>> lvalues and rvalues. All identifiers would be considered as producing
>> their addresses, never their values. There would instead be contexts in
>> which automatic dereference takes place, and the programmer would put &
>> in any of those places where the automatic dereference was to be inhibited.
>
> Then you also need to distinguish pointer arithmetic from non-pointer arithmetic
> if you still want to keep both.
> If a-b now gives me the distance between a and b in memory instead of the
> numeric difference of the values stored at addresses a and b, it's kinda bad.

That's not the proposal. An expression such as a - b would use the
/values/ of both.

> Similarly, I don't always mean a pointer when I write a+1.
> No?

Ditto. That would parse to 'the /value/ of a' plus 1.

Maybe you've not understood the proposal. Not blaming you. I should
reiterate it. A name such as N in these contexts:

N=
N(
N[
N.
N$

would use the address of N. In all other contexts it would result in the
value stored at N.

There's nothing novel in that, BTW. It's what programmers expect
expressions to mean. If anything, this is just a certain way to look at
expressions to understand them.

A useful way to look at this is that /all/ occurrences of N result in
the address of N followed by a dereference, and that each of the above
forms requires that dereference operation to be present and then deletes
it. That is, in fact, how I intend to parse it.
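
One possible reading of that insert-then-delete idea, sketched in Python at the token level. This is an assumption about how it could be done, not the actual parser: the `@deref` marker, the function name `mark_derefs` and the tokenising regex are invented here, grouping parentheses are not handled (a ')' is assumed to close a call), and a trailing & is not distinguished from an infix one. Both & and $ are listed because the thread writes the trailing operator both ways.

```python
import re

DEREF = "@deref"  # marker token; '@' cannot start an identifier

# Tokens which, when they follow an address, delete its automatic dereference.
INHIBITORS = {"=", ".", "(", "[", "&", "$"}

# Identifiers, integers, two-character comparisons, then any other symbol.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[=!<>]=|\S")

def mark_derefs(source):
    """Return the token list with the surviving dereferences marked.

    Pass 1 inserts a dereference after every identifier and after every
    closing ']' or ')'. Pass 2 deletes each dereference that is directly
    followed by an inhibiting token.
    """
    with_derefs = []
    for tok in TOKEN.findall(source):
        with_derefs.append(tok)
        if tok[0].isalpha() or tok[0] == "_" or tok in ("]", ")"):
            with_derefs.append(DEREF)

    result = []
    for i, tok in enumerate(with_derefs):
        following = with_derefs[i + 1] if i + 1 < len(with_derefs) else None
        if tok == DEREF and following in INHIBITORS:
            continue  # the inhibitor consumes the address itself
        result.append(tok)
    return result
```

For example, `mark_derefs("a - b")` keeps a dereference after both names, so the subtraction works on values, while `mark_derefs("X[4]&")` keeps none.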

--
James Harris