On Sunday, November 7, 2021 at 3:57:40 PM UTC-8, James Harris wrote:
[Joining late, haven't read all of the conversation.]
> I'll set out below what to my knowledge is a novel way of looking at
> certain aspects of expression parsing. Don't be alarmed, it doesn't
> parse Martian. In fact, I think (subject to correction) that it
> implements the normal kind of parsing that a programmer would be
> familiar with. But AISI it handles some of it in a simpler, more
> natural, and more understandable way than I've seen anywhere else.
Not sure about simpler/more natural. There may be "some" regularization,
true.
> To explain, since the 1960s it has been traditional to think of some
> identifiers as resolving to lvalues and others to rvalues.
C enumeration constants are never lvalues by design.
C arrays are somewhat of an artificial construct and so can be viewed
kind of as both or neither (you can't directly assign to an entire array
with = unless you're assigning to a struct that contains an array,
you can't pass an array by value, unless it's again wrapped in a
struct, but an array still contains some other lvalues in the end).
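A rough sketch of the array/struct point, in case it helps. The names
(struct wrapped, sum_and_clobber, array_in_struct_demo) are made up for
illustration only:

```c
#include <assert.h>

/* Hypothetical wrapper: the struct exists only to make the array copyable. */
struct wrapped { int a[4]; };

/* Receives the struct (and the array inside it) by value. */
static int sum_and_clobber(struct wrapped w)
{
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += w.a[i];
        w.a[i] = 0;             /* modifies the callee's copy only */
    }
    return sum;
}

int array_in_struct_demo(void)
{
    struct wrapped x = { {1, 2, 3, 4} };
    struct wrapped y;

    /* int p[4], q[4]; p = q;      would not compile: arrays are not assignable */
    y = x;                      /* fine: struct assignment copies the array */
    y.a[0] = 10;                /* each element is an ordinary lvalue */

    assert(x.a[0] == 1);        /* x was copied, not aliased */

    int sum = sum_and_clobber(y);   /* 10 + 2 + 3 + 4 */
    return sum + y.a[0];            /* y itself is untouched by the call */
}
```

So the array as a whole behaves like neither a plain lvalue nor a plain
rvalue, while its elements and the enclosing struct are ordinary lvalues.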
> However, I
> suggest below that another way of looking at matters is that when
> parsing an expression the presence of an identifier name such as
>
> X
>
> /always/ results not in the value but in the address of the named
> identifier X. An address is, of course, how it is interpreted in certain
> contexts. But programmers find it natural if in other contexts X is
> implicitly and automatically dereferenced to yield a value. Classically,
> in the assignment
>
> X = X
>
> even though they look the same the last X is dereferenced while the
> first is not.
Um... It may be somewhat confusing because dereferences for the
purpose of reading from memory and dereferences for the purpose of
writing to memory appear somewhat different.
C's assign operators require their left operand to be an lvalue.
What is an lvalue (in, perhaps, a somewhat mechanistic view)?
It's an expression formed by dereferencing an address. And you
naturally need a memory address to both read and write memory.
But your = operator by itself screams in your face "I'm a memory
writing dereference!". Effectively, you may think that the lvalue's
own dereference and the one implied by the = operator are
duplicating one another or are two parts of one thing. Either way,
when you're writing a C compiler, once you've checked the types
in the assignment expression, you end up either eliminating the
lvalue's own dereference or you somehow fuse it with =
because in the end you generate just a single memory store
instruction that represents both the dereference and =.
Given this you may indeed think that the left operand of =
needs to be no more than an address and it's somehow
different from the right operand of =. But you need to consider
both, the dereference and =, together.
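Here's a toy model of what I mean, not how any real compiler is
structured. The memory array, load(), store() and the "addresses" X and Y
are all invented for the sketch:

```c
#include <assert.h>

static int memory[8];           /* toy memory; an index plays the role of an address */

static int  load(int addr)             { return memory[addr]; }
static void store(int addr, int value) { memory[addr] = value; }

int assign_demo(void)
{
    enum { X = 2, Y = 5 };      /* hypothetical addresses of variables X and Y */

    store(X, 41);               /* X = 41                                      */
    store(Y, load(X) + 1);      /* Y = X + 1: right-hand X is loaded           */
    store(X, load(X));          /* X = X: one load, one store, same address    */

    assert(load(X) == 41);
    return load(Y);
}
```

The left operand reaches store() as a bare address, but only because the
lvalue's dereference and the = have been merged into that one store().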
> What matters is semantics but contexts are easiest to discuss in terms
> of the syntax so I'll do that. In simple terms one could say that if an
> expression (of any sort) is followed by one of
>
> = (assignment)
> . (field selection)
> ( (function invocation)
> [ (array lookup)
>
> or is tweaked with increment or decrement operators (as in C's ++ and
> --) then the /address/ is used. In all other contexts, however, an
> implicit dereference is automatically inserted by a compiler such that the
> value at the designated address is used instead. To illustrate, consider
>
> A[2][4]
>
> Note that after both A *and* the first closing square bracket there is
> no dereference.
Strictly speaking, there is, and its result is, as usual, an element of the
array. That element happens to be another array, which luckily needs no
memory read/write (yet); only pointer arithmetic is needed here.
But with a further dereference you will have to access memory because
that array element is not an array anymore.
But in a different language every array element access (dereference/
subscript) may involve memory access. Java's multidimensional
arrays are implemented that way: an element of one array is a pointer to
(or an address of) another array. And you need to fetch addresses of
subarrays, you can't simply compute them by adding an index to
the pointer to the enclosing array.
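Both layouts can be imitated in C. This is only a sketch; rows[] stands in
for the Java-style layout and all names are invented:

```c
#include <assert.h>

int subscript_demo(void)
{
    int a[3][5] = {{0}};                    /* C array of arrays: one contiguous block */
    int row0[5] = {0}, row1[5] = {0}, row2[5] = {0};
    int *rows[3] = { row0, row1, row2 };    /* Java-like: array of pointers to rows    */

    a[2][4]    = 7;
    rows[2][4] = 7;

    /* a[2] is located by arithmetic alone, no memory read involved. */
    assert((char *)a[2] == (char *)a + 2 * sizeof a[0]);
    assert((char *)&a[2][4] == (char *)a + (2 * 5 + 4) * sizeof(int));

    /* rows[2] must be fetched from memory before [4] can be applied. */
    assert(rows[2] == row2);

    return a[2][4] + rows[2][4];
}
```

Same source syntax for both, but the first subscript is a pure address
computation in one case and a real load in the other.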
> In syntax terms one can consider that that's because
> each is followed by one of the aforementioned symbols. IOW both A and
> the first closing square bracket are followed by an opening square
> bracket so there is no dereference. But there /is/ an automatic
> dereference after the final square bracket because it is not followed by
> one of the listed symbols. So the key as to whether an automatic
> dereference is inserted or not is what comes next after an expression.
>
> That's very flexible, allowing expressions to work with an arbitrary
> number of addresses. For example,
>
> B = A[2][4][6][8][10]
>
> etc. That expression uses addresses all the way through. Each array
> lookup results in yet another address. Only after the final square
> bracket would there be a dereference.
Doesn't have to be that way. Java is an example.
> Of course, it's not just array indexing. Anything which /produces/ an
> address can have its output fed into anything which /uses/ an address
> and such operators can be combined arbitrarily. For example,
>
> vectors[1](2).data[3] = y
>
> Such an expression may be horrendous but illustrates how a programmer
> could combine addresses in any way desired. Only after the y would there
> be a dereference.
It's probably important to note that the above expression in C produces
a temporary value (the function return value) that needs to hang around
for a while in order for .data to be accessed off it. Mechanically
it needs to be an lvalue, but it's short lived and messing with it is
therefore troublesome, hence the standard says modifying the return
value yields undefined behavior. That is, if that data member is a
pointer, the expression may be well formed. If data is an array, you
have UB right there where you attempt to modify its element at index 3.
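A sketch of the two cases, as I read the C11 rules on temporary lifetime.
The types and functions (with_ptr, with_arr, get_ptr, get_arr) are
hypothetical:

```c
#include <assert.h>

static int storage[8];

struct with_ptr { int *data; };     /* data points somewhere else     */
struct with_arr { int data[8]; };   /* data lives inside the struct   */

static struct with_ptr get_ptr(void)
{
    struct with_ptr r = { storage };
    return r;
}

static struct with_arr get_arr(void)
{
    struct with_arr r = { {0, 1, 2, 3} };
    return r;
}

int return_value_demo(void)
{
    get_ptr().data[3] = 42;         /* OK: the store lands in storage[], not in the temporary */

    /* get_arr().data[3] = 42;         compiles, but modifies the temporary: undefined behavior */

    int v = get_arr().data[3];      /* only reading the temporary within the expression */

    return storage[3] + v;
}
```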
> (Perhaps it's strange that as programmers we accept the inconsistency
> that some contexts get implicit dereferences and some don't. But we
> would probably not want to write all deref or no-deref points in code.
> So we are where we are.)
Definitely, you don't have to expose the underlying mechanics when
it creates unnecessary friction (e.g. in form of verbosity and mental
effort). But with enough shortcuts you may end up looking at a
collection of nonuniform things. C arrays have their own problems,
C string literals add to this, then again you don't have to have both .
and -> to access members of a structure. And then there's C++ with
a mess of different ways to construct and initialize objects using
different syntaxes.
I particularly like Pascal's approach to passing variables by reference:
just prepend "var" before the parameter and the additional associated
dereferences will be generated by the compiler.
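For comparison, roughly what the C programmer has to spell out by hand.
The Pascal in the comment is from memory, and swap/swap_demo are just
illustrative names:

```c
#include <assert.h>

/* Pascal: procedure Swap(var a, b: Integer);
           var t: Integer;
           begin t := a; a := b; b := t end;
   In C the dereferences and the address-taking are explicit. */
static void swap(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

int swap_demo(void)
{
    int x = 1, y = 2;
    swap(&x, &y);               /* Pascal call site would be just Swap(x, y) */
    assert(x == 2 && y == 1);
    return x * 10 + y;
}
```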
> Importantly, it is always possible to dereference an address to get a
> value but there is no way to operate on a value to get its address.
When implementing a C compiler you may treat most (if not all)
expressions as trees with operators in non-leaf nodes and
integer/float numbers and addresses in leaf nodes.
That's all there is, pretty much.
Structures, arrays, complex numbers don't map onto the CPU registers
and don't make it to the backend level.
So, your "cond ? struct1 : struct2" transforms into
"*(cond ? &struct1 : &struct2)" under the hood just like
"struct1 = struct2" transforms into
"memcpy(&struct1, &struct2, sizeof struct1)" or something similar
that can be more readily translated into CPU instructions and
mapped onto its registers.
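Written out in C itself, the source form and the lowered form look
something like this. It's a sketch of the idea, not what any particular
compiler emits; struct pair and lowering_demo are made-up names:

```c
#include <assert.h>
#include <string.h>

struct pair { int a, b; };

int lowering_demo(int cond)
{
    struct pair s1 = { 1, 2 }, s2 = { 3, 4 };
    struct pair r1, r2;

    /* Source form: a struct-valued conditional assigned to a struct. */
    r1 = cond ? s1 : s2;

    /* Lowered form: only addresses are selected, then the bytes are copied. */
    memcpy(&r2, cond ? &s1 : &s2, sizeof r2);

    assert(r1.a == r2.a && r1.b == r2.b);
    return r1.a;
}
```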
> For
> that reason my precedence table has all the address-consuming operators
> first. That's probably true of most other languages as well but I've not
> seen that set out as a rationale.
>
> Consider how C uses its 'address of' operator, & as a prefix.
>
> &X gets the address of X
> &X[4] gets the address of X[4]
> &X.f gets the address of field f
>
> Yet C's & is not a normal operator. It does not transform its argument.
> As stated, it is not possible to get from a value to an address. So &E
> cannot evaluate E and then take its address. Therefore & is not an
> operator in the normal sense that it manipulates a value. Instead, &E
> inhibits the automatic dereference that would have been inserted at the
> end of E: it prevents emission of the dereference that the compiler
> would otherwise have emitted.
>
> There is, perhaps, an additional oddity that an 'operator' at the
> beginning of a subexpression really applies at the end of that
> subexpression.
Well, if it helps to read the code, you could use parens, e.g. &(X[4]),
but they are meaningless here. You could also prohibit large and complex
expressions and require them to be broken down into shorter and
simpler ones with e.g. temporary variables at every step, but
that (temporaries and low code density) in itself is problematic.
If you read X[4] into a temporary, taking its (temporary's) address wouldn't
give you the address within X[], which is kinda bad.
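To make the temporary problem concrete (names invented for the example):

```c
#include <assert.h>

int temp_address_demo(void)
{
    int X[8] = {0};

    int *direct  = &X[4];       /* address of the element inside X[]        */
    int  tmp     = X[4];        /* "simplified" version: value copied out   */
    int *via_tmp = &tmp;        /* address of the copy, not of X[4]         */

    *direct  = 1;               /* changes X[4]                             */
    *via_tmp = 2;               /* changes only tmp                         */

    assert(via_tmp != direct);
    assert(tmp == 2);
    return X[4];
}
```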
I think postfix expressions (and I mean not just postfix ++ and -- but
all of this subscripting, calling, member accessing) in C are more useful
than not.
> It may be more straightforward for & to be placed at the location where
> the dereference would otherwise have been.
>
> Assuming for discussion purposes that trailing & and infix & can be
> distinguished (so we don't need to use another symbol) the above
> expressions would become
>
> X& the address of X
> X[4]& the address of X[4]
> X.f& the address of field f
Should we also use numeric negation this way, e.g. X[4]- in place of
-X[4]? That would look pretty awkward to mathy people (not that
they'd find it an insurmountable obstacle, I hope).
I think what we've got in C here is good enough.
> Then the unary trailing & joins the symbols in the list above and
> becomes just another of the operators which, when it appears after an
> expression, inhibits the automatic dereference that would otherwise have
> occurred at that point:
>
> = assign
> . field selection
> ( function call
> [ index
> & nothing except, like all the others, inhibit dereference
>
> To summarise, there would no longer be the conceptual difference between
> lvalues and rvalues. All identifiers would be considered as producing
> their addresses, never their values. There would instead be contexts in
> which automatic dereference takes place, and the programmer would put &
> in any of those places where the automatic dereference was to be inhibited.
Then you also need to distinguish pointer arithmetic from non-pointer arithmetic
if you still want to keep both.
If a-b now gives me the distance between a and b in memory instead of the
numeric difference of the values stored at addresses a and b, it's kinda bad.
Similarly, I don't always mean a pointer when I write a+1.
No?
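In today's C the two kinds of arithmetic are told apart by whether I wrote
the dereference. A small sketch with made-up names:

```c
#include <assert.h>
#include <stddef.h>

int arithmetic_demo(void)
{
    int v[4] = { 10, 20, 30, 40 };
    int *a = &v[3], *b = &v[1];

    ptrdiff_t distance   = a - b;     /* pointer arithmetic: elements apart   */
    int       difference = *a - *b;   /* value arithmetic: 40 - 20            */

    int *next     = b + 1;            /* pointer arithmetic: points at v[2]   */
    int  plus_one = *b + 1;           /* value arithmetic: 20 + 1             */

    assert(distance == 2);
    assert(next == &v[2]);
    return difference + plus_one;
}
```

If every name yields an address and dereferences are implicit, something
else has to say which of the two meanings of a-b and a+1 is wanted.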
> AFAIK that's a new way of looking at addresses in expressions but maybe
> you know otherwise.
>
> More importantly, as a programmer how easy would you find it to think in
> those terms?
I'd keep implicit pointers hidden. Seems like you want to expose them for no
good reason.
Alex