Why are ambiguous grammars usually a bad idea? Why are languages usually defined and implemented with ambiguous grammars?


Roger L Costello

Dec 26, 2021, 10:19:08 PM
Hello Compiler Experts!

I am reading the (excellent) book "flex & bison" by John Levine. Chapter 7
talks about conflicts in grammars: shift/reduce and reduce/reduce conflicts.

At the end of the chapter are a few exercises/questions. I'd like to check
with you on whether my answers to the questions are accurate and complete.

Question: Beyond the fact that bison doesn't like them, why are ambiguous
grammars usually bad?

My answer: Ambiguous grammars make it hard for users of the grammar to write
correct instances of the grammar. Stated another way, ambiguous grammars make
it easy to introduce errors into instances of the grammar.

Question: Opine about why languages are usually defined and implemented with
ambiguous grammars.

My answer: Writing a grammar that completely avoids ambiguities might result
in a grammar that is nightmarishly complex. It is easier/better to create a
simple grammar and then add rules/descriptions which explain and resolve the
ambiguities.

Are my answers correct? Are they complete?

/Roger
[Sounds good to me. Also keep in mind that the LALR parser cannot parse all unambiguous
grammars, only the ones that fit the LALR model, so sometimes as you say it's easier
to handle some of the structure outside the grammar. -John]

Anton Ertl

Dec 27, 2021, 9:27:13 PM
Roger L Costello <cost...@mitre.org> writes:
>Question: Beyond the fact that bison doesn't like them, why are ambiguous
>grammars usually bad?

An ambiguous grammar means that the same input can be parsed in at
least two different ways. In many cases the two different ways have
different meanings.

The classical example is the dangling else. In EBNF:

S -> if E then S [ else S ]

if you see

if E then if E then S else S

which if does the else belong to? The two parses usually have very
different meanings.
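The difference in meaning is easy to see by treating the two parse trees
as data and running them; a minimal Python sketch (the names here are
made up purely for illustration, this is not bison output):

```python
# Build the two parse trees of "if A then if B then S1 else S2" as
# data and evaluate them, showing the two readings behave differently.

def if_(cond, then, else_=None):
    """Evaluate one if-statement node of a tiny AST."""
    if cond:
        return then()
    return else_() if else_ is not None else None

S1 = lambda: "S1"
S2 = lambda: "S2"

def parse_inner_else(A, B):
    # else attached to the inner if (the usual rule):
    #   if A then (if B then S1 else S2)
    return if_(A, lambda: if_(B, S1, S2))

def parse_outer_else(A, B):
    # else attached to the outer if:
    #   if A then (if B then S1) else S2
    return if_(A, lambda: if_(B, S1), S2)

print(parse_inner_else(True, False))  # S2 runs
print(parse_outer_else(True, False))  # nothing runs (None)
```

With A true and B false, one parse executes S2 and the other executes
nothing at all, which is the "very different meanings" in miniature.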

>My answer: Ambiguous grammars make it hard for users of the grammar to write
>correct instances of the grammar. Stated another way, ambiguous grammars make
>it easy to introduce errors into instances of the grammar.

Sounds wrong to me. The if-Statement above is a correct Pascal
statement (apart from the unexpanded Es and Ss) and not erroneous (and
likewise for other languages with similar syntaxes, e.g., Algol 60 and C).

Of course, one way to deal with ambiguities would be to make the input
a syntax error, but that's usually not done (and I know of no parser
generator that supports this approach).

>Question: Opine about why languages are usually defined and implemented with
>ambiguous grammars.
>
>My answer: Writing a grammar that completely avoids ambiguities might result
>in a grammar that is nightmarishly complex. It is easier/better to create a
>simple grammar and then add rules/descriptions which explain and resolve the
>ambiguities.

Writing the grammar above in an unambiguous way to match up the else
with the closest unresolved if is more complex, but not nightmarishly
complex. But the better solution was used in Modula-2: require an
ending for if:

S -> if E then S [ else S ] end

[Modula-2 also uses a Statement-Sequence instead of a Statement in the
then-branch and else-branch, avoiding the need for begin...end.]

This grammar is unambiguous and the programmer can write either of

if E then if E then S else S end end
if E then if E then S end else S end

depending on what is intended.

Hmm, makes me wonder where that style comes from; Lisp has had a
closed COND from the beginning (i.e., 1960), Forth also had a closed
IF from early on (i.e., early 1970s), but I guess neither of those
inspired Wirth when he did Modula-2. Dijkstra's guarded command
language (which closed the if with fi) was published in 1975, and
maybe that was the inspiration, but OTOH, the Modula-2 if syntax was
already present in Modula, developed 1975, report in March 1976
<https://www.research-collection.ethz.ch/handle/20.500.11850/68669>.
Wirth writes in his HOPL-III paper:

|In planning Modula-2, I saw it as a new version of Pascal, updated to
|the requirements of the time, and I seized the opportunity to correct
|various mistakes in Pascal’s design, such as, for example, the
|syntactic anomaly of the dangling “else”

but he does not give any background on where the new syntax is coming
from.

I recommend to my students to get rid of conflicts, because the
default resolution of the conflict by Bison may be other than what is
required. But I also design the language they have to implement such
that not having conflicts is not too hard; e.g., I give them
Modula-2-like rather than Algol-like syntax to implement.

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[Algol68 had if ... then ... else .. fi in 1968. -John]

Fernando

Dec 27, 2021, 9:27:49 PM
Hi Roger,

One problem with ambiguous grammars is that ambiguities might confuse the
semantics of the programming language. One example is with the associativity
and precedence of operators. Concerning associativity, the ambiguous grammar
below does not specify if subtraction is left or right associative:

E ::= E - E | num
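With the tokens 8 - 3 - 2, the two readings give different values; a
quick Python check of both folds (an illustrative sketch, not a parser):

```python
from functools import reduce

nums = [8, 3, 2]  # the expression "8 - 3 - 2" as a token stream

# Left-associative reading: (8 - 3) - 2
left = reduce(lambda acc, n: acc - n, nums)

# Right-associative reading: 8 - (3 - 2)
right = reduce(lambda acc, n: n - acc, reversed(nums))

print(left, right)  # 3 7
```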

And, there is this classic example from C-like languages involving conditional
statements:

<cmd> ::= if <bool_expr> then <cmd>
| if <bool_expr> then <cmd> else <cmd>

What would be the meaning of a statement like the one below? Depending on how
you fix the ambiguity, it's possible to make the else refer to the innermost
or the outermost conditional.

if (a > b) then if (c > d) then print(1) else print(2)

Regards,

Fernando

gah4

Dec 27, 2021, 9:29:01 PM
On Sunday, December 26, 2021 at 7:19:08 PM UTC-8, Roger L Costello wrote:

> I am reading the (excellent) book "flex & bison" by John Levine. Chapter 7
> talks about conflicts in grammars: shift/reduce and reduce/reduce conflicts.

> At the end of the chapter are a few exercises/questions. I'd like to check
> with you on whether my answers to the questions are accurate and complete.

As far as I know, the main cause of ambiguous grammar in programming languages
is the nested if-then-optional-else structure. If you require else, then it isn't ambiguous,
but people like the optional else. That usually comes out as a shift-reduce conflict,
and parser generators know how to handle that.

Otherwise, the usual regular expression has an ambiguity which is often cured
by taking the longest of the possible matches. It seems to me that more often
I want the shorter match, though.
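The longest-vs-shortest tension shows up with greedy vs. non-greedy
quantifiers in Python's re module (not quite the same thing as a lexer's
maximal-munch rule, but it illustrates the two choices):

```python
import re

s = "<a><b>"

# Greedy: .* grabs as much as it can while still matching.
longest = re.match(r"<.*>", s).group()

# Non-greedy: .*? grabs as little as it can.
shortest = re.match(r"<.*?>", s).group()

print(longest, shortest)  # <a><b> <a>
```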

But you already have the reply from John Levine ...
[See previous message, where we fixed that with "fi" in 1968. -John]

gah4

Dec 28, 2021, 12:39:45 PM
(snip about conflicts in grammars)
(then I wrote)
> As far as I know, the main cause of ambiguous grammar in programming languages
> is the nested if-then-optional-else structure. If you require else, then it isn't ambiguous,
> but people like the optional else. That usually comes out as a shift-reduce conflict,
> and parser generators know how to handle that.

(snip)

> But you already have the reply from John Levine ...
> [See previous message, where we fixed that with "fi" in 1968. -John]

I first learned if-then-else from PL/I, where it was designed in
about 1963. I now suspect that people get used to the one they
learned first, and find it more obvious.

There is also some language, and I forget now which, with an ELIF
construct, such that you can make a nested if-then-else sequence
without the increase of nesting level, and so need only one ENDIF.

IF e THEN s; ELIF e THEN s; ELIF e THEN s; ENDIF;

(No comment on my preference for ENDIF vs. FI.)

[That would also be Algol 68, using "fi"
It didn't need the semicolons because they are separators,
not terminators as in PL/I and C.
If you've written scripts in the Bourne shell or its descendants
such as bash and zsh, it deliberately looks a lot like Algol 68.
A request from your moderator: if anyone is planning another round
in the semicolon terminator vs. separator war, please don't.
-John]

Christopher F Clark

Dec 28, 2021, 12:41:40 PM
You already have lots of relevant commentary on your answer.

However, I want to offer some counterpoint.

The 1973 paper by Aho, Johnson, and Ullman: Deterministic Parsing of
Ambiguous Grammars,
gives the background for how this is handled in yacc.

https://www.researchgate.net/publication/234791287_Deterministic_parsing_of_ambiguous_grammars

Thus, the use of the various directives is a long-known solution to
the issue. However, LR conflicts frighten people. Their resolution
hides the fact that the grammar is ambiguous and at least two parses
are possible. But, with practice, one can get good at using them
judiciously and knowing when they are simply expressing the desired
intent or hiding a potential source of confusion.

More recently, GLR parsing has deferred the resolution and builds a
parse forest when the grammar is truly ambiguous and not the result of
the limitations of the LR parsing methodology. When that happens, one
can use semantics to determine which parse was correct or declare a
relevant error.

By the way, an LR conflict doesn't guarantee that the grammar is
actually ambiguous, just that resolving whether it is or not is beyond
the capacity of the LR algorithm.

So, while knowing when your grammar is ambiguous (or even just
potentially ambiguous) is useful for checking that your grammar
describes the language you think it describes, ambiguity is not always
the worst thing.

Ambiguous grammars can have valuable properties. As you noted, writing
an unambiguous grammar can be quite complicated (and in some cases not
even possible), and the comparable ambiguous grammar can be quite
concise, simple, and easy to read. And, putting the disambiguation
outside the grammar can be a more maintainable solution.

This is particularly true if you want a language with user-defined
operators, or one where the user might want to adjust the precedence
or associativity of the existing operators. Using numbers to indicate
precedence is a well-known technique. It maps well onto things people
understand. And the ability to compare the precedence of two operators
based upon an associated number is easy to implement (while strictly
outside the capacity of most parser generators).
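A minimal precedence-climbing sketch in Python (hypothetical, not taken
from any particular tool) showing how a numeric table drives the
comparison: renumbering the table re-disambiguates the expressions
without changing any grammar rules:

```python
# Operator precedence lives in a plain table, so a user could add
# operators or renumber precedences without touching a grammar.
PREC = {"+": 10, "-": 10, "*": 20, "/": 20, "**": 30}
RIGHT_ASSOC = {"**"}

def apply_op(op, a, b):
    return {"+": a + b, "-": a - b, "*": a * b,
            "/": a / b, "**": a ** b}[op]

def parse_expr(tokens, min_prec=0):
    """Precedence-climbing over a token list like [2, "+", 3, "*", 4]."""
    lhs = tokens.pop(0)
    while tokens and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # For right-associative ops, recurse at the same level;
        # otherwise require strictly higher precedence on the right.
        next_min = PREC[op] if op in RIGHT_ASSOC else PREC[op] + 1
        rhs = parse_expr(tokens, next_min)
        lhs = apply_op(op, lhs, rhs)
    return lhs

print(parse_expr([2, "+", 3, "*", 4]))    # 14
print(parse_expr([2, "**", 3, "**", 2]))  # 512 (right-associative)
```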

Kind regards,
Chris

--
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

Jan Ziak

Dec 28, 2021, 12:45:22 PM
On Monday, December 27, 2021 at 4:19:08 AM UTC+1, Roger L Costello wrote:
> Question: Opine about why languages are usually defined and implemented with
> ambiguous grammars.
>
> My answer: Writing a grammar that completely avoids ambiguities might result
> in a grammar that is nightmarishly complex. It is easier/better to create a
> simple grammar and then add rules/descriptions which explain and resolve the
> ambiguities.

Another issue is that creating an unambiguous grammar (without extending the
parser generator with new features) would require inventing new abstract names
in grammar rules to handle cases resolved by compilation passes executed after
the parsing phase - but inventing new abstract names is very hard.

For example, in C/C++ the meaning of the statement "a*b;" depends on the code
parsed before the statement. If the code-before is "int a" then "a*b;" is a
statement containing an expression - if the code-before is "typedef int a;"
then "a*b;" is a variable declaration.

Another C/C++ example: "(a)+1.23" is a binary expression if "a" is an integer
variable - but it is a type conversion if "a" has been defined as "typedef
double a;".

In order to handle the above examples, the left-hand side of the [bison]
grammar rules might want to use names such as STMT__MUL_EXP_OR_VAR_DECL and
EXP__ADD_OR_CONVERT.

In some languages derived from C/C++ but different from C/C++, declarations in
the top-level scope and the package-level scope are order-independent. The
meaning of the statement "a*b;" might depend on the code parsed before the
statement or parsed after the statement.

The other option (that is: extending the parser generator with new features,
and thus avoiding abstract names in the rules of the grammar, and thus
avoiding a nightmarishly complex grammar) means that the parser
postpones the resolution of ambiguities until information computed in a
subsequent compilation phase resolves the ambiguity. This basically requires
both [the code generated by the parser generator] and [the programming
language used to implement the compiler] to be concurrent programming
languages.

In summary: in my opinion, the answer to the question "Why are languages
usually defined and implemented with ambiguous grammars?" comes down to how
well the parser generator integrates with the compilation stages executed
after the parsing stage.

-atom

Kaz Kylheku

Dec 29, 2021, 5:28:34 PM
On 2021-12-16, Roger L Costello <cost...@mitre.org> wrote:
> Question: Opine about why languages are usually defined and implemented with
> ambiguous grammars.

But they aren't.

Languages are processed as a stream of characters or tokens, with hidden
rules about how those relate together and the meaning that emerges.
All of the rules are hidden, including the entire grammar.

If you're only aware of some of the hidden rules, but not others, then
you see ambiguity.

But if you're only aware of some of the hidden rules, but not others,
then you are not working with the correct language.

For instance, I don't know of any mainstream language in which if/else
is actually ambiguous. They have a hidden rule like that the else goes
with the closest preceding if statement. This is no more or less hidden
than the rule which says that the token "if" heads a phrase structure
called an if statement.

I think what the question is really asking is why computer languages are
designed in layers, such as an ambiguous grammar, with rules added to
it.

That simply has to do with the convenience of specification in relation
to the available tooling.

If you have a parser generator which lets you write an ambiguous grammar
like:

E := E + E | E - E | E * E | E / E | E ** E | (E) | id | num

and then add precedence/associativity specifications, then it behooves
you to take advantage of it, rather than breaking out separate rules
like "additive expression", "multiplicative expression", ...

When you add those rules, though, you no longer have an ambiguous
grammar.

There is another effect at play which is that designers are infatuated
with complicated grammars that have lots of hidden rules. Thus we have
languages whose programs can look ambiguous to someone who isn't an
expert in all their rules. Keeping up the full expertise can require
regular practice: constantly working with the language. (Use it or lose
it).

Thus, even though two languages we may be looking at are formally
unambiguous, one may be informally more ambiguous than the other, due to
being more loaded with hidden rules of syntax that one must internalize
to read the code.

So we can interpret the question as, why do we have all these languages
with baroque syntax which give rise to ambiguity the moment you forget
any of it?

Languages are designed this way because of the belief that there is a
notational advantage in it. If you have some hidden rule which causes
some symbols to be related in such and such a way, it means that you
have omitted the need for additional symbols which would otherwise
indicate that structure. For instance in C, we can deduce from the
hidden rules that A << B | C means (A << B) | C which is obvious to
someone who has memorized the precedence rules and works with this
stuff daily. Yet, we tend to more or less reject this philosophy in our
coding standards; we call for disambiguating parentheses. The GNU C
compiler won't let you write A && B || C if you have -Wall warnings
enabled: you get the "suggest parentheses" warning.

(It's a kind of ironic situation: why do we have hidden rules that allow
parentheses to be omitted, only to turn around and write tooling and
coding standards which asks for them to be put in.)

Novice programmers have historically been attracted to cryptic-looking
languages. It is one of the main reasons for the success of languages
like C and Perl.

For novice programmers, syntax is a barrier before semantics, and
if you make the barrier sufficiently, though not impossibly, high, that
creates motivation. Novices feel they are really learning something and
getting ahead when all they are doing is absorbing the rules of syntax.
Simply being able to work out the syntax of some code example, or write
one that has no errors, is an accomplishment.

If you give most people a language in which the syntax is easy with few
opportunities for informal ambiguity, they will just rush through the
syntax and hit the brick wall of semantics: confronting the fact that
programming is semantically hard. Of course, because people most often
blame external factors for their failings, they will blame the language.
Since they are not heavily invested in it, they can easily move on to
something else. Maybe they will return to programming later, using
a different language, and then pin their better success on that language
rather than their own improved maturity.

Informally ambiguous languages are needed to create a kind of tar pit
to slow down newbies and keep them motivated. Then by the time they
hit the real difficulties, they are too invested in it to quit.

"But I know all this syntax after months of learning! How can it be that
my program doesn't work? I'm too far along not to stick with it and get
it working. Doggone it, I now have a self-image as a programmer to
defend!"

I also believe there is one more element at play: mathematics. People
study mathematics in school, and those who go on to do programming tend
to be ones who were more exposed to it or paid more attention.

People who are programmers actually had a first contact with formal
syntax in mathematics.

The conflation between syntax and semantics may ultimately come from
that place. Mathematicians design their notations deliberately, in such
ways that when they manipulate symbols, while observing certain rules,
they are actually preserving semantics. The notation directly enables
semantically meaningful manipulation, as a tool of thought.

There is a psychological effect at play that a programming language
designed with lots of syntactic rules will somehow also serve as a tool
of thought, similarly to math notation. It cannot be denied that, to
some extent, that plan pans out. Programmers play with the symbols and
discover idioms similar to algebraic rules. You look at C code and
recognize "Duff's device" similarly to how you might recognize some
Lagrangian thing in a math formula.


--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Jan Ziak

Dec 29, 2021, 7:13:07 PM
On Wednesday, December 29, 2021 at 11:28:34 PM UTC+1, Kaz Kylheku wrote:
> On 2021-12-16, Roger L Costello wrote:
> > Question: Opine about why languages are usually defined and implemented with
> > ambiguous grammars.
> But they aren't.
>
> Languages are processed as a stream of characters or tokens, with hidden
> rules about how those relate together and the meaning that emerges.
> All of the rules are hidden, including the entire grammar.
>
> If you're only aware of some of the hidden rules, but not others, then
> you see ambiguity.
>
> But if you're only aware of some of the hidden rules, but not others,
> then you are not working with the correct language.
>
> For instance, I don't know of any mainstream language in which if/else
> is actually ambiguous. They have a hidden rule like that the else goes
> with the closest preceding if statement.

When designing a grammar and implementing a parser: the grammar can
either be unambiguous by design or unambiguous by accident. The
viewpoint that "there isn't any mainstream language in which if/else
is actually ambiguous" is actually the latter option: unambiguous by
accident.

A primary reason why grammars in many mainstream languages (that don't
have a parser generated straight from a verified grammar) are
unambiguous isn't intentional design, but rather it is a consequence
of the fact that those parsers are directly implemented in a language
that is executing statements/expressions/instructions without
verifying consequences of the executions. Some examples of languages
with such execution properties: assembly language (such as: i386,
ARM), C, Haskell. Contrary to the accidental approach, a parser
generator by design cares about consequences and it is verifying that
the specification actually is unambiguous despite the fact that in the
end the parser gets compiled down into machine instructions.
Verification means to search the whole search space (or at least a
reasonably large subspace of it) - but asm/C/Haskell will run a search
only if it is explicitly (step by step) forced by the programmer to
perform a search.

-atom

gah4

Dec 29, 2021, 10:28:33 PM
On Wednesday, December 29, 2021 at 2:28:34 PM UTC-8, Kaz Kylheku wrote:

(snip)

> I also believe there is one more element at play: mathematics. People
> study mathematics in school, and those who go on to do programming tend
> to be ones who were more exposed to it or paid more attention.

This reminds me of learning associativity of exponentiation (**)
in Fortran IV (I believe it isn't in the Fortran 66 standard) before I
learned it in algebra class. I suspect that there are others I learned
from programming before learning them in math class.

> People who are programmers actually had a first contact with formal
> syntax in mathematics.

> The conflation between syntax and semantics may ultimately come from
> that place. Mathematicians design their notations deliberately, in such
> ways that when they manipulate symbols, while observing certain rules,
> they are actually preserving semantics. The notation directly enables
> semantically meaningful manipulation, as a tool of thought.

I suspect that people learn some things in the first programming language
that they learn, and then expect it to be the same in others. When it isn't,
people get surprised or confused.

When I started with unix, I learned csh programming, and mostly
avoided sh (and successors). One reason for that is, as well as I
knew at the time, differences in the syntax and semantics of them.
[Fortran has always had ** exponentiation, starting with the original
version in 1956. It always bound tighter than +-*/ but wasn't
associative, A**B**C not allowed, -John]
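For comparison, languages that do allow chained ** usually make it
right-associative; a quick check in Python:

```python
# Chained exponentiation groups right-to-left in Python.
right = 2 ** 3 ** 2    # parsed as 2 ** (3 ** 2)
left = (2 ** 3) ** 2   # forced left grouping

print(right, left)  # 512 64
```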

Kaz Kylheku

Dec 30, 2021, 1:55:04 PM
On 2021-12-30, Jan Ziak <0xe2.0x...@gmail.com> wrote:
> On Wednesday, December 29, 2021 at 11:28:34 PM UTC+1, Kaz Kylheku wrote:
>> On 2021-12-16, Roger L Costello wrote:
>> > Question: Opine about why languages are usually defined and implemented with
>> > ambiguous grammars.
>> But they aren't.
>>
>> Languages are processed as a stream of characters or tokens, with hidden
>> rules about how those relate together and the meaning that emerges.
>> All of the rules are hidden, including the entire grammar.
>>
>> If you're only aware of some of the hidden rules, but not others, then
>> you see ambiguity.
>>
>> But if you're only aware of some of the hidden rules, but not others,
>> then you are not working with the correct language.
>>
>> For instance, I don't know of any mainstream language in which if/else
>> is actually ambiguous. They have a hidden rule like that the else goes
>> with the closest preceding if statement.
>
> When designing a grammar and implementing a parser: the grammar can
> either be unambiguous by design or unambiguous by accident. The
> viewpoint that "there isn't any mainstream language in which if/else
> is actually ambiguous" is actually the latter option: unambiguous by
> accident.

Languages can be ambiguous at the specification level, even if a
given implementation behaves unambiguously:

E.g.:

1. The implementation is completely unambiguous and says that else goes
with closest preceding if.
2. The documentation says something else. (So the users figure this out
and get their code working based on the implementation.)
3. (But) a new implementation appears, based on the documentation.

Or:

1. The language spec doesn't say anything about which way something
is parsed.
2. Multiple implementations do it willy-nilly.

Clearly, for instance in C, we have semantic ambiguities like
a[i] = i++. (A parser depending on the behavior of something like that
could have a grammar ambiguity: something is parsed in two or more
different ways based on which way the undefined construct behaves. That
would be a defect, of course.)

> A primary reason why grammars in many mainstream languages (that don't
> have a parser generated straight from a verified grammar) are
> unambiguous isn't intentional design, but rather it is a consequence
> of the fact that those parsers are directly implemented in a language
> that is executing statements/expressions/instructions without
> verifying consequences of the executions. Some examples of languages
> with such execution properties: assembly language (such as: i386,
> ARM), C, Haskell.

I don't quite follow this; you seem to be saying that Turing machines
are deterministic, and so if we implement a parser as a Turing process,
it will be "accidentally" unambiguous because of determinism.

However, it may so happen that the lack of ambiguity depends on a whole
lot of context. For instance, a Turing process parsing a language,
proceeding left to right, could decide the rule for "if/then/else" on
a case-by-case basis, influenced by everything it has parsed before.

Suppose you have a language which parses a sequence of top-level
expressions from a terminal or file, and executes each one before moving
to the next. Those expressions could be used to invoke an API in the
language implementation to change the treatment of subsequent syntax.

Sure, everything is still unambiguous, if we take into account the
entire stream of expressions from the beginning and understand its
effect on the parsing machine.

If you don't do anything of this sort, and just write, say, a recursive
descent parser which isn't influenced by any weird state flags (whether
purely internal or external too) that change the parsing, and in a safe,
very high level language in which there are few risks of nonportable or
undefined behaviors, then you end up with a "simply unambiguous"
grammar. You should be able to investigate the behavior of if/else
with several test cases and be confident that the observations hold
in all relevant contexts where that construct can appear.
You might not exactly know what the grammar is for the entire language,
but if you figure it out from the code and accurately document it,
then you're good. (Unless the thing is required to conform to some
external specification and fails.)

> Contrary to the accidental approach, a parser
> generator by design cares about consequences and it is verifying that
> the specification actually is unambiguous despite the fact that in the
> end the parser gets compiled down into machine instructions.
> Verification means to search the whole search space (or at least a
> reasonably large subspace of it) - but asm/C/Haskell will run a search
> only if it is explicitly (step by step) forced by the programmer to
> perform a search.

Yes, e.g. a LALR(1) shift-reduce parser generator generates the entire
space of LR(0) items that drive the stack machine, and then when it
populates tables, it discovers conflicts there.

With someone's hand-written parser, we cannot be sure without
searching with test cases (instances of the language), which is
intractable to do exhaustively.

Just because if/else is behaving in certain ways in certain test cases
doesn't mean that a new, more complex test case with more/different
context cannot be found in which the if/else behaves differently.

The procedural/recursive parser code doesn't declare to us what the
grammar is. It could be hiding multiple rules for if/else active in
different contexts which differ from each other.

That's not an actual ambiguity, but for the purposes of working with the
language, it's an informal ambiguity (related to my earlier observation
of the user not knowing what all hidden rules are).

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
[If you wrote a hand-written recursive descent parser, which was typical
before yacc made it easy to use an LALR parser, it was common
for the language the parser accepted to be somewhat different from the
one the programmer intended to accept due to less than complete checking
for unexpected inputs.
On the other hand, that C example isn't ambiguous, it's deliberately
indeterminate. Early Fortran allowed the compiler to compile any
mathematically equivalent, not just numerically equivalent, version
of an expression so A*B+A*C could turn into A*(B+C) which was great
for optimizing, not so much for predictable results. -John]

Kaz Kylheku

Dec 30, 2021, 1:56:15 PM
On 2021-12-30, gah4 <ga...@u.washington.edu> wrote:
> On Wednesday, December 29, 2021 at 2:28:34 PM UTC-8, Kaz Kylheku wrote:
>
> (snip)
>
>> I also believe there is one more element at play: mathematics. People
>> study mathematics in school, and those who go on to do programming tend
>> to be ones who were more exposed to it or paid more attention.
>
> This reminds me of learning associativity of exponentiation (**)
> in Fortran IV (I believe it isn't in the Fortran 66 standard) before I
> learned it in algebra class. I suspect that there are others I learned
> from programming before learning them in math class

In Common Lisp, the expt function is strictly binary, so it eliminates
the question of associativity. Some basic arithmetic functions are
n-ary, like (+ a b c d e f), where that is documented (and readily
understood) that in cases where it matters, it is left-to-right
reduction.

In TXR Lisp, I made expt n-ary, so you can write

(expt x y z w)

But! The associativity is right-to-left, making that equivalent to:

(expt x (expt y (expt z w)))

This is for two reasons. One is math: (expt x y z w) defined this
way follows:

    x^(y^(z^w))

secondly, it is more useful, because the left-to-right interpretation
is:

    ((x^y)^z)^w

and that is just

(expt x (* y z w))

which is easy enough to write if that's what you want! It's not much
more verbiage than the (expt x y z w) you may have wanted. Regardless of
the number of operands, it's just an extra set of parentheses and a *
operator.

Whereas if you want (expt x (expt y (expt z w))) and (expt x y z w)
doesn't give it to you, that *is* a lot of verbiage, whose nesting grows
with each additional argument.

The associativity rule that saves the most verbiage is the better one,
even if it is opposite to many other arithmetic functions.
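The collapse of the left-to-right fold into a single multiplied
exponent is easy to confirm numerically (a Python check with arbitrary
values):

```python
x, y, z, w = 2, 2, 2, 3

right = x ** (y ** (z ** w))   # right-to-left fold, the useful one
left = ((x ** y) ** z) ** w    # left-to-right fold

# The left fold is just x raised to the product of the other operands:
print(left == x ** (y * z * w))  # True
print(left)                      # 4096
```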

>> People who are programmers actually had a first contact with formal
>> syntax in mathematics.
>
>> The conflation between syntax and semantics may ultimately come from
>> that place. Mathematicians design their notations deliberately, in such
>> ways that when they manipulate symbols, while observing certain rules,
>> they are actually preserving semantics. The notation directly enables
>> semantically meaningful manipulation, as a tool of thought.
>
> I suspect that people learn some things in the first programming language
> that they learn, and then expect it to be the same in others. When it isn't,
> people get surprised or confused.
>
> When I started with unix, I learned csh programming, and mostly
> avoided sh (and successors). One reason for that is, as well as I
> knew at the time, differences in the syntax and semantics of them.
> [Fortran has always had ** exponentiation, starting with the original
> version in 1956. It always bound tighter than +-*/ but wasn't
> associative, A**B**C not allowed, -John]

When I started programming from nothing, I saw BASIC examples in a
book which was doing things like:

10 X = 2
20 X = X + 1

The only language with formulas that I was coming from was math.
(Though I was only in grade 6, I knew how to solve systems of linear
equations from school, because I had recently come to Canada from
Slovakia.)

So, I thought, what? How can X be equal to X + 1; you cannot solve
this absurdity!

From then I knew that the people who program computers to understand
symbols are free thinkers who make them mean anything they want.

Kaz Kylheku

Dec 30, 2021, 3:37:29 PM
On 2021-12-30, Kaz Kylheku <480-99...@kylheku.com> wrote:
John> On the other hand, that C example isn't ambiguous, it's deliberately
John> indeterminate.

The C example isn't ambiguous in its parse, but the semantics is
ambiguous. Expressions can be evaluated in multiple orders in C,
and if we follow different orders for that expression, we get different
results (in such a way that it's deemed undefined).

Here is something merely unspecified:

a() + b() + c()

Suppose a prints "a" to stdout, b prints "b" and c prints "c":

int a() { putchar('a'); return 0; }
int b() { putchar('b'); return 0; }
int c() { putchar('c'); return 0; }

The behavior is not undefined, but unspecified: we don't know
which of six permutations is printed, but we know it's one of them:
abc, acb, bac, bca, cab, cba.

That's a kind of ambiguity in the language (syntax + semantics).

We know that the parse is

(a() + b()) + c()

But the meaning requires that a, b and c are executed in order
to produce the operands to the + operator, the order in which
those calls take place is not specified.

That's a clear ambiguity.

Just like if I say "pick up the dry cleaning and fill up the gas
tank", there is no grammar ambiguity; yet we don't know whether
the sequencing is required, or whether the gas tank can be
filled first. The request describes multiple possible scenarios,
including going to gas station that does dry-cleaning, and
picking up while an attendant fills the tank.

> indeterminate. Early Fortran allowed the compiler to compile any
> mathematically equivalent, not just numerically equivalent, version
> of an expression so A*B+A*C could turn into A*(B+C) which was great
> for optimizing, not so much for predictable results. -John]

Why wait for early Fortran to arrive? If you write in C, you can
use gcc -ffast-math (an umbrella option for turning on a whole lot
of stuff, among it -fassociative-math and -freciprocal-math).

gah4

Dec 30, 2021, 6:40:40 PM
On Wednesday, December 29, 2021 at 7:28:33 PM UTC-8, gah4 wrote:

(snip, I wrote)

> This reminds me of learning associativity of exponentiation (**)
> in Fortran IV (I believe it isn't in the Fortran 66 standard) before I
> learned it in algebra class. I suspect that there are others I learned
> from programming before learning them in math class

(snip)

> [Fortran has always had ** exponentiation, starting with the original
> version in 1956. It always bound tighter than +-*/ but wasn't
> associative, A**B**C not allowed, -John]

It was, at least, in Fortran IV for IBM 360/370:

https://atariwiki.org/wiki/attach/Fortran/IBM_FORTRAN_IV-Language_1973.pdf

My 8th grade graduation present was the above manual, though maybe
one year earlier. I used to read IBM reference manuals like books,
from start to finish. By the end of summer, I had run many Fortran programs.

As well as I know it, IBM Fortran IV was the input to the 1966 standard,
but not all features were included. It might also be that extensions were
added later.
[I used Fortran H on Princeton's 360/91 in a summer job I had in
college in about 1973. -John]

Jan Ziak

Dec 30, 2021, 6:45:56 PM
On Thursday, December 30, 2021 at 7:56:15 PM UTC+1, Kaz Kylheku wrote:
> When I started programming from nothing, I saw BASIC examples in a
> book which was doing things like:
>
> 10 X = 2
> 20 X = X + 1
>
> The only language with formulas that I was coming from was math.
>
> So, I thought, what? How can X be equal to X + 1; you cannot solve
> this absurdity!
>
> From then I knew that the people who program computers to understand
> symbols are free thinkers who make them mean anything they want.

"X = X + Y" means "X[t+1] = X[t] + Y[t]" where t is time. Time had to be
omitted from the notation of the BASIC programming language because otherwise
the source code would consume a much larger amount of computer memory and it
would complicate GOTO and FOR/NEXT statements.

-atom
[Interesting take. In reality, of course, BASIC borrowed that from Fortran. Algol
used := for assignment, different from = for equality comparison. -John]

Jan Ziak

Dec 30, 2021, 8:17:29 PM
On Friday, December 31, 2021 at 12:45:56 AM UTC+1, Jan Ziak wrote:
> On Thursday, December 30, 2021 at 7:56:15 PM UTC+1, Kaz Kylheku wrote:
> > When I started programming from nothing, I saw BASIC examples in a
> > book which was doing things like:
> >
> > 10 X = 2
> > 20 X = X + 1
> >
> > The only language with formulas that I was coming from was math.
> >
> > So, I thought, what? How can X be equal to X + 1; you cannot solve
> > this absurdity!
> >
> > From then I knew that the people who program computers to understand
> > symbols are free thinkers who make them mean anything they want.
>
> "X = X + Y" means "X[t+1] = X[t] + Y[t]" where t is time. Time had to be
> omitted from the notation of the BASIC programming language because otherwise
> the source code would consume a much larger amount of computer memory and it
> would complicate GOTO and FOR/NEXT statements.
>
> -atom
>
> [Interesting take. In reality, of course, BASIC borrowed that from Fortran.
> Algol used := for assignment, different from = for equality comparison. -John]

@John: Indeed, BASIC wasn't the first programming language. To generalize, I
wanted to point out that the notion of time is implicit in almost all
programming languages, of course not just BASIC. In my opinion, contrary to
Kaz's opinion, most children who will later become programmers can quite
easily understand what "X=X+1" means in a language like BASIC/Python/etc.
(Thus, I disagree with the belief that "people who program computers to
understand symbols are free thinkers who make them mean anything they want".)

-atom

Jan Ziak

Dec 31, 2021, 12:30:40 PM
On Wednesday, December 29, 2021 at 11:28:34 PM UTC+1, Kaz Kylheku wrote:
> On 2021-12-16, Roger L Costello
> > Question: Opine about why languages are usually defined and implemented with
> > ambiguous grammars.
>
> Novice programmers have historically been attracted to cryptic-looking
> languages. It is one of the main reasons for the success of languages
> like C and Perl.
> ....

I know that what I am about to write does not answer the original question
about ambiguous grammars, but I feel I have to respond to the claim that
novices are attracted to cryptic-looking languages. If that was true then the
brainf**k language would be in the top 10 languages in use today.

People new to programming aren't attracted to C because it is cryptic, but
because - for example - in the 1990s they learned that C was used to
implement the game Doom with only a few elements of assembly
(https://en.wikipedia.org/wiki/Development_of_Doom#Programming). Doom was
implemented in C and wasn't implemented in Lisp/Pascal/Smalltalk - which
increased the popularity of C and decreased the popularity of
Lisp/Pascal/Smalltalk.

Some young programmers were attracted to Smalltalk after the year 2002 because
they watched the Squeakers movie (I believe it is this one:
https://www.imdb.com/title/tt2172065/).

In summary: Novice programmers are attracted to particular programming
languages because those languages are popular in their social networks.

-atom
[Sigh. You're probably right. Historically, novices started with a toy
language which left out more advanced but important ideas like data
structures and name scope, and gave them an unfortunately blinkered
idea of what programming involves. One time when I was a grad student
I had to explain to one of the undergrads why you really didn't want
to write all your programs in Tiny Basic. -John]

mac

Jan 3, 2022, 2:58:39 PM
> [Interesting take. In reality, of couse, BASIC borrowed that from Fortran. Algol
> used := for assignment, different from = for equality comparison. -John]

Indeed.
Unfortunately, assignment is probably the single most common operator.
The ASCII committee should have kept the left-arrow character instead of
replacing it with underscore.

gah4

Jan 4, 2022, 1:15:42 PM
The assignment statement in BASIC, at least in the versions I know, has an
(optional) LET keyword, so it might say:

10 LET A=3

Most people leave it off, though.

Is PL/I the only language that uses = for both assignment and the relational
operator? Since expressions are not statements in PL/I, it avoids the
ambiguity that would otherwise occur. I believe some BASICs also use = for both.

Underscore is a pretty useful character.

The two ASCII characters that don't exist in EBCDIC are ^ and ~.
Two EBCDIC characters that don't exist in ASCII are ¢ (cent)
and ¬ (logical NOT sign). Conversion tables usually cross-map
those pairs. (PL/I, at least, uses ¬ and ¬= operators.)

[In original Dartmouth BASIC the LET was mandatory, but it was a considerably
smaller and fully compiled language than the later dialects. On the other
hand, PL/I made a fetish of nothing being a reserved word, e.g.

IF IF = THEN THEN ELSE = BEGIN; ELSE END = IF;

-John]

Thomas Koenig

Jan 4, 2022, 2:46:25 PM
> [In original Dartmouth BASIC the LET was mandatory, but it was a considerably
> smaller and fully compiled language than the later dialects. On the other
> hand, PL/I made a fetish of nothing being a reserved word, e.g.
>
> IF IF = THEN THEN ELSE = BEGIN; ELSE END = IF;

Fortran shares this property.

This may sound slightly odd to people brought up on languages with
reserved keywords, but it has a big advantage: You can extend the
language with new keywords without making existing programs invalid.

There is a cost to this, in several aspects:

- More CPU time needed for parsing (important earlier, now it
can generally be neglected).

- More complexity in the parser. This cost is paid once, and
by the compiler developers, not the users.

- Similarity to FORTRAN may induce fear and loathing in computer
scientists (the last remark is not to be taken too seriously :-)

[Fortran barely had tokens since it ignored spaces outside of Hollerith
strings. Having written a few F77 parsers, I can say it was possible
to tokenize using hints from the parser about what lexical kludge to
apply, but it wasn't pleasant. The yacc parser was straightforward
other than figuring out when to send which kludge ID to the lexer.

It also meant that one-character typos could cause large semantic
changes, notably DO 5 I = 1,10 is a loop while DO 5 I = 1.10
is an assignment. Legend says we lost a satellite to that one.

-John]

gah4

Jan 4, 2022, 5:38:08 PM
On Tuesday, January 4, 2022 at 10:15:42 AM UTC-8, gah4 wrote:

(snip, our moderator wrote)

> [In original Dartmouth BASIC the LET was mandatory, but it was a considerably
> smaller and fully compiled language than the later dialects. On the other
> hand, PL/I made a fetish of nothing being a reserved word, e.g.
>
> IF IF = THEN THEN ELSE = BEGIN; ELSE END = IF;
>
> -John]

I never used anything close to the original BASIC, but did use, for some time,
the HP TSB2000 version. HP stored programs after tokenizing, so I suspect
that even if you don't put in LET, the tokenizer will add it.

As for PL/I, it borrowed many features from COBOL, but not the use
of reserved words. For one, they wanted people not to have to know the whole
language, nor even all of its keywords. Stories are that COBOL programmers always
keep the list of reserved words nearby, to avoid using them.

Counting from a recent IBM web page on their COBOL compiler, there are
over 400 reserved words, many common English words that people might
like to use. Somehow out of 50 years of programming, I have managed
never to even type in and run a COBOL program, and especially not to
write one.

As for Fortran parsing, I do remember that WATFIV reserves the sequence
'FORMAT(' at the beginning of a statement for actual FORMAT statements.
You can't assign to elements of an array named FORMAT. That might not
be so bad, except that Fortran 66, in its run-time format feature, requires the
format data to be in an array. And the obvious name is FORMAT!
[COBOL doesn't have that many reserved words. See https://www.ibm.com/docs/en/i/7.1?topic=list-reserved-words
Re FORMAT statements, WATFOR/FIV punted for some reason. It's not that
hard to tell a format statement from a statement like FORMAT(I5,A4) =
42 but I realize no sane programmer would do that. -John]

Martin Ward

Jan 5, 2022, 12:11:34 PM
On 04/01/2022 21:26, gah4 wrote:
> Stories are that COBOL programmers always
> keep the list of reserved words nearby, to avoid using them.

Our esteemed moderator claims:
> [COBOL doesn't have that many reserved words

I count 510 reserved words in IBM COBOL. Adding a few other dialects
can push the total to 700 or more. By comparison, C has about 32
reserved words.

The story I heard was of a COBOL shop where it was mandatory to
include a hyphen in every data name: in effect, *every* unhyphenated
word was treated as a reserved word. The slightly more manageable list
of *hyphenated* reserved words (149 in IBM COBOL, but 46 of these are
of the form COMP-0, COMP-1, COMP-2 etc) was printed out and posted on
the wall.

I just noticed that if you include a digit in the part of the name
before the first hyphen, you can guarantee to avoid all
the reserved words!

PL/I went to the other extreme of no reserved words in reaction
to COBOL. Also, the aim of PL/I was to be a language which does
everything: business programming (like COBOL) and scientific
programming (like FORTRAN). In theory, if you only wanted
to do, say, business programming, you only needed to learn
part of the language and you would not get tripped up by keywords
from the other part of the language that you didn't know about yet.

Using a language that you don't know in its entirety might seem
dangerous, but everybody seems to do it these days:
how many C programmers have read the entire 500+ pages of
the latest C standard and memorised the 200+ varieties
of "undefined behaviour" so that they can avoid all of them
in every line of code that they write?
--
Martin

Dr Martin Ward | Email: mar...@gkc.org.uk | http://www.gkc.org.uk
G.K.Chesterton site: http://www.gkc.org.uk/gkc | Erdos number: 4
[IBM hoped everyone would switch from Fortran and COBOL to PL/I and
it was obvious Fortran programmers would not put up with reserved
words, particularly ones unrelated to scientific programming.
As far as the size of languages, that seems a matter of point of
view. Python is a large language if you consider the standard
library to be part of the language, a very small one if you don't.
-John]

David Brown

Jan 6, 2022, 10:49:13 AM
On 05/01/2022 11:25, Martin Ward wrote:

> Using a language that you don't know in its entirety might seem
> dangerous, but everybody seems to do it these days:
> how many C programmers have read the entire 500+ pages of
> the latest C standard and memorised the 200+ varieties
> of "undefined behaviour" so that they can avoid all of them
> in every line of code that they write?

I think it is normal not to know everything about the language you use.
And if you include the language's standard library, then there are very
few currently used languages where it would even be possible to learn it
all. By the time you learned all of the language and default libraries
of C++, Java, Python, etc., there would be a new version out and you'd
have more to learn.

The important things for writing code are to know enough to be able to
write the kind of code you are doing, and to avoid accidentally doing
things you didn't intend. Static warning tools are vital here - from
syntax-highlighting and check-as-you-type editors and IDEs, through
compiler warning flags, to stand-alone checkers. Your tools should
tell you if you are accidentally using a reserved word as an
identifier.

There is no need to memorize undefined behaviours for a language -
indeed, such a thing is impossible since everything not defined by a
language standard is, by definition, undefined behaviour. (C and C++
are not special here - the unusual thing is just that their standards
say this explicitly.)

The trick is to memorize the /defined/ behaviours, and stick to them.
You generally don't need to know if a language leaves (1 / 0) as
undefined, or gives a specific value, or prints an error message -
usually it is sufficient to know the values for which (x / y) /is/
defined, and stick to those values.

Basically, trying to execute undefined behaviour is no more and no less
than a bug in the program - whether it is "undefined" in terms of the
language, the library, the code you wrote yourself, the customer's
specification, or anything else. People program primarily by trying to
write correct code - not by trying to think of all the ways they could
write incorrect code!


The real challenge from big languages and big standard libraries is not
/writing/ code, it is /reading/ it. It doesn't really matter if a C
programmer, when writing some code, does not know what the syntax "void
foo(int a[static 10]);" means. (Most C programmers don't know it, and
never miss it.) But it can be a problem if they have to read and
understand code that uses something they don't know.

Thomas Koenig

Jan 6, 2022, 1:11:23 PM
David Brown <david...@hesbynett.no> schrieb:

> There is no need to memorize undefined behaviours for a language -
> indeed, such a thing is impossible since everything not defined by a
> language standard is, by definition, undefined behaviour. (C and C++
> are not special here - the unusual thing is just that their standards
> say this explicitly.)

This is a rather C-centric view of things. The Fortran standard
uses a different model.

There are constraints, which are numbered. Any violation of such
a constraint needs to be reported by the compiler ("processor",
in Fortran parlance). If it fails to do so, this is a bug in
the compiler.

There are also phrases which have "shall" or "shall not". If this
is violated, this is an error in the program. Catching such a
violation is a good thing from quality of implementation standpoint,
but is not required. Many run-time errors such as array overruns
fall into this category.

[...]

> The real challenge from big languages and big standard libraries is not
> /writing/ code, it is /reading/ it. It doesn't really matter if a C
> programmer, when writing some code, does not know what the syntax "void
> foo(int a[static 10]);" means. (Most C programmers don't know it, and
> never miss it.) But it can be a problem if they have to read and
> understand code that uses something they don't know.

Agreed.

Robert Prins

Jan 6, 2022, 1:14:09 PM
On 2022-01-06 08:11, David Brown wrote:
> On 05/01/2022 11:25, Martin Ward wrote:
>
> Your tools should tell you if you are accidentally using a reserved word as an
> identifier.

Your language should not have reserved words, if PL/I (AD 1964) could already do
without them...

'nuff said!

Robert
--
Robert AH Prins
robert(a)prino(d)org
The hitchhiking grandfather - https://prino.neocities.org/
Some REXX code for use on z/OS - https://prino.neocities.org/zOS/zOS-Tools.html
[Just because it's possible to do something doesn't mean it is a good idea. A
lot of us think a reasonable number of reserved words are fine and make it less
likely that a typo will silently change the meaning of a program. -John]

David Brown

Jan 7, 2022, 8:24:06 PM
On 06/01/2022 17:43, Thomas Koenig wrote:
> David Brown <david...@hesbynett.no> schrieb:
>
>> There is no need to memorize undefined behaviours for a language -
>> indeed, such a thing is impossible since everything not defined by a
>> language standard is, by definition, undefined behaviour. (C and C++
>> are not special here - the unusual thing is just that their standards
>> say this explicitly.)
>
> This is a rather C-centric view of things. The Fortran standard
> uses a different model.
>
> There are constraints, which are numbered. Any violation of such
> a constraint needs to be reported by the compiler ("processor",
> in Fortran parlance). If it fails to do so, this is a bug in
> the compiler.

C has basically the same concept.

(IIRC, C++ has a few constraints, such as the "one definition rule",
where the standard says no diagnostics are necessary, because
identifying the error would mean the compiler has to see multiple
translation units at once. Compilers often diagnose these if they have
some kind of link-time optimisation or program-at-once mode.)

>
> There are also phrases which have "shall" or "shall not". If this
> is violated, this is an error in the program. Catching such a
> violation is a good thing from quality of implementation standpoint,
> but is not required. Many run-time errors such as array overruns
> fall into this category.

That is the same in C. From 4.2 "Conformance" :

"""
If a “shall” or “shall not” requirement that appears outside of a
constraint or runtime-constraint is violated, the behavior is undefined.
Undefined behavior is otherwise indicated in this International Standard
by the words “undefined behavior” or by the omission of any explicit
definition of behavior. There is no difference in emphasis among these
three; they all describe “behavior that is undefined”.
"""

The only difference I see from what you describe of Fortran (I have not
read any Fortran standards) is that the C standards also note that
behaviour that is not defined in the standards is undefined behaviour as
far as the standards are concerned. That is a tautology, of course, and
applies equally to Fortran and any other language.


It is quite possible that the details of which behaviours are defined or
not varies between the languages - things like division by 0,
out-of-bounds array access, etc., may be different. As I understand it,
passing aliased pointers or array references as different parameters to
the same function can lead to undefined behaviour in Fortran, whereas it
is defined in C (unless you use "restrict").

Spiros Bousbouras

Jan 7, 2022, 8:25:15 PM
On Thu, 6 Jan 2022 16:43:05 -0000 (UTC)
Thomas Koenig <tko...@netcologne.de> wrote:
> David Brown <david...@hesbynett.no> schrieb:
>
> > There is no need to memorize undefined behaviours for a language -
> > indeed, such a thing is impossible since everything not defined by a
> > language standard is, by definition, undefined behaviour. (C and C++
> > are not special here - the unusual thing is just that their standards
> > say this explicitly.)
>
> This is a rather C-centric view of things. The Fortran standard
> uses a different model.
>
> There are constraints, which are numbered. Any violation of such
> a constraint needs to be reported by the compiler ("processor",
> in Fortran parlance). If it fails to do so, this is a bug in
> the compiler.
>
> There are also phrases which have "shall" or "shall not". If this
> is violated, this is an error in the program. Catching such a
> violation is a good thing from quality of implementation standpoint,
> but is not required. Many run-time errors such as array overruns
> fall into this category.

This seems to me exactly like the C model. What difference do you see ?

Regarding the more general issue, it seems to me that undefined behaviour is
a red herring (which I think is the point David was making). Every time one
writes code in any language, one must have an expectation of how the code is
supposed to behave and some reasoning about why the code they wrote will behave
according to their expectations. The reasoning will be based (apart from
general rules of logic and mathematics) on what the standard of the
programming language specifies (if the language has a standard), what the
translator/compiler documentation specifies, what the documentation of any
libraries they use specifies, and so forth.

For example, let's say that I write in C

int a = INT_MAX + 1 ;

with the expectation that a will get the value INT_MIN. The onus is on me
to provide a reasoning why the code above will meet my expectation. If I
cannot provide such a reasoning then from my point of view the code is
already undefined. The fact that the C standard also says that the code is
undefined is irrelevant. Even if the C standard specified for example that
signed integer arithmetic uses wraparound, unless I could point to the place
in the standard where it said so, the code is still undefined from my point
of view so I should not use it.

But let's say that I have the above code and I intend to compile it with
GCC using the -fwrapv flag. Then my expectation is actually justified
based on the GCC documentation for what -fwrapv means and the parts
of the C standard which define what the various symbols in

int a = INT_MAX + 1 ;

mean. I'm not going to provide a proof because it should be obvious. But
any such proof would not need to cite any part of the C standard which
explicitly mentions undefined behaviour.


The only occasion where an explicit mention of undefined behaviour would be
relevant would be if the C standard (or any standard) were contradictory i.e.
it said in some place that some construct has a certain defined behaviour and
it said in some other place that the same construct has undefined behaviour.
But with a popular language like C, if such contradictions existed, they
would be caught early and corrected.

Martin Ward

Jan 7, 2022, 8:25:45 PM
On 06/01/2022 08:11, David Brown wrote:
> The trick is to memorize the /defined/ behaviours, and stick to them.

Isn't the set of defined behaviours bigger than the set
of undefined behaviours? How do you know what is defined
if you don't know what is undefined?

For example, a = b + c is precisely defined in C and C++ for
floating point variables, but the result can be "undefined behaviour"
for ordinary 32 bit signed integer values.

If you want to stick to defined behaviours then you need
to add extra code. For example, CERT recommends:

  if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
      ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
    /* Handle error */
  } else {
    sum = si_a + si_b;
  }

David Brown

Jan 7, 2022, 8:27:08 PM
On 07/01/2022 15:02, Martin Ward wrote:
> On 06/01/2022 08:11, David Brown wrote:
>> The trick is to memorize the /defined/ behaviours, and stick to them.
>
> Isn't the set of defined behaviours bigger than the set
> of undefined behaviours? How do you know what is defined
> if you don't know what is undefined?

You know what is "defined" because you can find the definition for it -
everything else is undefined. You could enumerate all defined
behaviours for a language - after all, the documentation (language
standards, compiler manual, library documentation, etc.) is finite. It
doesn't really make sense to try to find how many undefined behaviours
there are - it's like asking how many things are there that are apples.

Language standards tell you the defined behaviour for a language.
Anything that is not there, is undefined - that's simply what the word
"undefined" means.

Note that there are many other things besides language standards that
define behaviour of code in practice - compilers or interpreters can add
their own definitions to things that are not defined by the language
standards, as can additional standards such as POSIX.

If you write a function "foo" - perhaps written in the same language
(such as C), perhaps in a completely different language - then its
behaviour is not defined by the language standards. It is not mentioned
anywhere in those documents, so it is undefined. (That is different
from functions whose behaviour is specified in the standard, such as
"memcpy".)

Undefined behaviour, as far as language standards are concerned, are
omnipresent in programming - for all languages. The problem only comes
when you attempt to execute something that does not have its behaviour
defined /anywhere/. Then it is incorrect code - a bug.


When I learned to program (i.e., during my university education rather
than from books, magazines and trial and error previous to that), we
were very clear about how a function is specified. You have a
pre-condition and a post-condition. The function can assume the
pre-condition is logically "true", and it will guarantee that the
post-condition is true at the exit. (Typically you also have an
"invariant" that is a clause in both parts, but that is just for
convenience.) If the function is called when the pre-condition is
false, the function has no obligation to do anything - it can give an
error, launch nasal daemons, give the answer it thinks the programmer
hoped for, or anything else. The behaviour is undefined.

This concept has existed since the dawn of programming:

"""
On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into
the machine wrong figures, will the right answers come out?' I am not
able rightly to apprehend the kind of confusion of ideas that could
provoke such a question.

Charles Babbage
"""


The C standards contain a fair number of explicit undefined behaviours.
They do that for convenience and clarity, and often to encourage
compiler developers towards greater efficiency rather than run-time
checks, and to encourage programmers towards not assuming particular
behaviours even if one compiler happens to define the behaviour. So a
compiler writer knows that they can assume "a + b" never overflows (for
integer arithmetic), and a programmer knows that they can't assume
signed arithmetic is wrapping even if the compiler they are using at the
time /guarantees/ wrapping behaviour. (I have never seen a C compiler
that guarantees this without explicit flags.)

C is a language that expects the programmer to take responsibility for
his or her code, and ensure that it is correct. Fortunately, good
compiler developers know this is difficult and provide tools to help
people find their bugs. Thus you have a language that can give
efficient results, /and/ provide good debugging and run-time checking,
as long as you get good tools and understand how to use them.


>
> For example, a = b + c is precisely defined in C and C++ for
> floating point variables, but the result can be "undefined behaviour"
> for ordinary 32 bit signed integer values.
>

Actually, it is not precisely defined for floating point operations - if
there is an "exceptional condition" during the evaluation (the result is
not mathematically defined or not in the range of representable values
for its type), the behaviour is undefined. That applies to all
expressions - integer and floating point.

Now, it is very common (but certainly not universal) for C
implementations to use IEEE floating point formats and rules. These
provide the "mathematical definitions" for floating point operations,
including handling of calculations outside the normal ranges. But if
you are not using these, such calculations could result in undefined
behaviour. (For example, if you use "gcc -ffast-math", the compiler
will assume that all expressions are normal finite numbers - that's
perfectly valid for C, and can be very much more efficient on a lot of
targets.)

Signed integer overflow is undefined behaviour on most compilers (the
size is not necessarily 32-bit). The only one I know that defines the
behaviour is gcc (and compatibles, such as clang and icc) with the
"-fwrapv" flag enabled.

And of course that makes perfect sense. It is logical to assume that if
you add two positive numbers, you get a positive number - it is
illogical to suppose that sometimes the "correct" answer will be
negative. Some programming languages (such as Java) specifically define
signed integer arithmetic to be wrapping - the result is that sometimes
you get the wrong answer in Java, while in C you would get undefined
behaviour. Wrong answers are less helpful - leaving the behaviour
undefined means you get more efficient code and that you can use
debugging tools (such as gcc's -fsanitize=undefined) to help find the
errors in your code.


> If you want to stick to defined behaviours then you need
> to add extra code. For example, CERT recommends:
>
>   if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
>       ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
>     /* Handle error */
>   } else {
>     sum = si_a + si_b;
>   }
>

That is /not/ code to "stick to defined behaviours". It is code to
identify problems and perhaps find some way to handle them (depending on
what the "handle error" code is).

You can "stick to defined behaviour" much more simply:

int sum = (unsigned int) si_a + (unsigned int) si_b;

The behaviour is fully defined, and the result will be wrong if there is
an overflow - just like when you use a language that has fully defined
signed integer arithmetic by wrapping.


The answer here is /not/ to worry about what happens when your
expressions overflow and you get undefined behaviour. The answer is to
think about the code you are writing, and make sure that the types and
expressions you write are appropriate for the values you have. Check
your values for validity when you get them in (from files, user input,
etc.), then write code that is correct for the full range of values.
Simple. (Well, as simple as any programming!)

Spiros Bousbouras

unread,
Jan 7, 2022, 11:12:54 PM
to
On Fri, 7 Jan 2022 14:02:50 +0000
Martin Ward <mar...@gkc.org.uk> wrote:
> On 06/01/2022 08:11, David Brown wrote:
> > The trick is to memorize the/defined/ behaviours, and stick to them.
>
> Isn't the set of defined behaviours bigger than the set
> of undefined behaviours?

That depends on how you define those sets. For example, any finite string is
a potential C source code and, of strings of length N (for any value of N),
only a very small percentage have defined behaviour. But regardless, you
need to know at least some defined behaviours to be able to programme at all
and, as long as you stick to those, you are not using any undefined
behaviours.

> How do you know what is defined
> if you don't know what is undefined?

As David has already said, you know by reading the definitions. And this is
the only way to know. Trying to guess what you're getting at, perhaps you
are thinking of someone who learns some C, then makes some unwarranted
assumptions from what they have learned and then has those assumptions scaled
back by coming across explicit mentions of "undefined behaviour" in the C
standard. Perhaps some people do behave this way. For example someone who
already knows assembly and begins to learn C may assume that all address
manipulations which would be legal in assembly are also legal using C
pointers. The correct remedy is not to make unwarranted assumptions to begin
with, whether one learns C or any other programming language. There is an
infinite number of unwarranted assumptions one can make and the C standard
can only caution against a finite number of them.

> For example, a = b + c is precisely defined in C and C++ for
> floating point variables, but the result can be "undefined behaviour"
> for ordinary 32 bit signed integer values.
>
> If you want to stick to defined behaviours then you need
> to add extra code. For example, CERT recommends:
>
> if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
> ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
> /* Handle error */
> } else {
> sum = si_a + si_b;
> }

Whether you need to add code as the above will depend on what you already
know about the types and values of si_a and si_b .

Thomas Koenig

unread,
Jan 8, 2022, 1:11:55 PM
to
Spiros Bousbouras <spi...@gmail.com> schrieb:
> On Thu, 6 Jan 2022 16:43:05 -0000 (UTC)
> Thomas Koenig <tko...@netcologne.de> wrote:
>> David Brown <david...@hesbynett.no> schrieb:
>>
>> > There is no need to memorize undefined behaviours for a language -
>> > indeed, such a thing is impossible since everything not defined by a
>> > language standard is, by definition, undefined behaviour. (C and C++
>> > are not special here - the unusual thing is just that their standards
>> > say this explicitly.)
>>
>> This is a rather C-centric view of things. The Fortran standard
>> uses a different model.
>>
>> There are constraints, which are numbered. Any violation of such
>> a constraint needs to be reported by the compiler ("processor",
>> in Fortran parlance). If it fails to do so, this is a bug in
>> the compiler.
>>
>> There are also phrases which have "shall" or "shall not". If this
>> is violated, this is an error in the program. Catching such a
>> violation is a good thing from quality of implementation standpoint,
>> but is not required. Many run-time errors such as array overruns
>> fall into this category.
>
> This seems to me exactly like the C model. What difference do you see ?

First, I see a difference in result. Highly intelligent and
knowledgeable people argue vehemently about whether a program should be
able to use undefined behavior or not, and a lot of vitriol is directed
against compiler writers who use the assumption that undefined
behavior cannot happen in their compilers for optimization,
especially if it turns out that existing code was broken and no
longer works after a compiler upgrade (just read a few of Linus
Torvalds's comments on that matter).

I see C conflating two separate concepts: Program errors and
behavior that is outside the standard. "Undefined behavior is
always a programming error" does not work; that would make

#include <unistd.h>
#include <string.h>

int main()
{
    char a[] = "Hello, world!\n";
    write (1, a, strlen(a));
    return 0;
}

not more and not less erroneous than

int main()
{
    int *p = 0;
    *p = 42;
}

whereas I would argue that there is an important difference between
the two.

If the C standard replaced "the behavior is undefined" with "the
program is in error, and the subsequent behavior is undefined"
or something along those lines, the discussion would be much
muted.

(Somebody may point out to me that this is what the standard is
actually saying. If so, that would sort of reinforce my argument
that it should be clearer :-)

Anton Ertl

unread,
Jan 8, 2022, 1:20:53 PM
to
David Brown <david...@hesbynett.no> writes:
>Undefined behaviour, as far as language standards are concerned, are
>omnipresent in programming - for all languages.

Please prove this astounding assertion. My impression is that managed
languages define everything, at least to some extent, and leave
nothing undefined. If they allowed nasal demons, the appeal of
managed languages would evaporate instantly.

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/
[Things like .NET define a lot but they still are at the mercy
of their envronment when you ask for a variable sized chunk of
storage. -John]

Spiros Bousbouras

unread,
Jan 8, 2022, 5:58:32 PM
to
On Sat, 8 Jan 2022 09:31:06 -0000 (UTC)
Thomas Koenig <tko...@netcologne.de> wrote:
> I see C conflating two separate concepts: Program errors and
> behavior that is outside the standard. "Undefined behavior is
> always a programming error" does not work; that would make
The C standard is in no position to say that some programme is in
error. This would require near omniscience from the standard
writers.

> #include <unistd.h>
> #include <string.h>
>
> int main()
> {
> char a[] = "Hello, world!\n";
> write (1, a, strlen(a));
> return 0;
> }
>
> not more and not less erroneous than
>
> int main()
> {
> int *p = 0;
> *p = 42;
> }
>
> whereas I would argue that there is an important difference between
> the two.

The only difference I see between the two is that the first is defined
by POSIX and the second is not. According to POSIX the first is required
to print something on stdout. I cannot imagine any extension which
would make the second programme do something useful and a conforming
implementation may well compile it as essentially a no-op.

But with something like

int main(voidd) {
int *p = 0 ;
*p = 42 ;
.... do other stuff ...
return 0 ;
}

the C standard allows for a conforming implementation to do something
useful like perhaps store 42 to address 0.

> If the C standard replaced "the behavior is undefined" with "the
> program is in error, and the subsequent behavior is undefined"
> or something along those lines, the discussion would be much
> muted.
>
> (Somebody may point out to me that this is what the standard is
> actually saying. If so, that would sort of reinforce my argument
> that it should be clearer :-)

No , it most definitely does not say that nor could it possibly say
that.

Thomas Koenig

unread,
Jan 8, 2022, 9:29:40 PM
to
A standard (or other specification document) is certainly able to
state that some construct is in error. To grab an often-quoted
example:

J3/18-007r1, the Fortran 2018 interpretation documents, states in
subclause 9.5.3, "Array elements and array sections",

# The value of a subscript in an array element shall be within the
# bounds for its dimension.

No omniscience required to write or understand that sentence.

This puts the burden on the programmer. The compiler might catch
such an error and abort the program, or other unpredictable
things such as overwriting an unrelated variable might also happen.

Reading a language standard can be hard. Quite often, information
is scattered throughout the text and needs to be pieced together
to find the necessary information, especially definition of terms
which are crucial to understanding. Most programmers do not
read standards (at least final committee drafts can usually be
found these days on the Internet), but compiler writers should at
least be familiar with what they are implementing.

Programmers often rely on books, but these can also get things wrong.

Because programmers are human, they also can get ticked off when being
told that a construct they have used for years has been illegal
for decades :-|

Having a good standard is crucial to being able to write good compilers.

Spiros Bousbouras

unread,
Jan 9, 2022, 4:53:28 PM
to
On Sun, 9 Jan 2022 00:09:19 -0000 (UTC)
Thomas Koenig <tko...@netcologne.de> wrote:
> Spiros Bousbouras <spi...@gmail.com> schrieb:
> > On Sat, 8 Jan 2022 09:31:06 -0000 (UTC)
> > Thomas Koenig <tko...@netcologne.de> wrote:
> >> I see C conflating two separate concepts: Programm errors and
> >> behavior that is outside the standard. "Undefined behavior is
> >> always a programming error" does not work; that would make
>
> > The C standard is in no position to say that some programme is in
> > error. This would require near omniscience from the standard
> > writers.
>
> A standard (or other specification document) is certainly able to
> state that some construct is in error. To grab an often-quoted
> example:
>
> J3/18-007r1, the Fortran 2018 interpretation documents, states in
> subclause 9.5.3, "Array elements and array sections",
>
> # The value of a subscript in an array element shall be within the
> # bounds for its dimension.
>
> No omnicience required to write or understand that sentence.
>
> This puts the burden on the programmer. The compiler might catch
> such an error error and abort the program, or other unpredictable
> things such as overwriting an unrelated variable might also happen.

I haven't read any Fortran standards so I can only go by the above quote.
Only the programmer knows what their requirements are and why they think that
the code they wrote will meet those requirements. My idea of error is that
either the code does not meet the requirements or it does so only by accident
and the programmer does not have a correct reasoning as to why their code
will meet those requirements. You seem to be reading the quote as saying

No matter what the programmer's requirements and no matter what extensions
their Fortran implementation offers, the programmer's requirements will
not be justifiably met if they use an array subscript outside the bounds
for its dimension.

Perhaps some Fortran implementation gives information as to the layout of
distinct variables so that one knows what will be overwritten by writing off
the bounds of some array and it will be overwritten in the way the programmer
wants. Unlikely (especially for Fortran) but it cannot be excluded. I can
imagine a C implementation for small embedded systems which does provide such
information and a programmer using it to reduce the number of instructions to
achieve a desired result. A more realistic example is the following :

#include <stdio.h>

int main(void) {
int a = 12 , b = 14 ;
printf("%2$d %1$d\n" , a , b) ;
return 0 ;
}

The above code has undefined behaviour according to the C standard. It is
defined according to POSIX. Whether it is in error depends on whether the
programmer really wanted to print
14 12

and no standards committee can possibly know this. So I still think that your
reading requires omniscience from the Fortran standard writers. But perhaps
there are other parts of the standard which justify your reading. For example
some parts of the Common Lisp standard do state that an implementation must
not extend some construct to provide useful functionality beyond what the
standard specifies. I don't remember precisely how it states it and I can't
find those parts now.

> Reading a language standard can be hard. Quite often, information
> is scattered throughout the text and needs to be pieced together
> to find the necessary information, especially definition of terms
> which are crucial to understanding. Most programmers do not
> read standards (at least final committee drafts can usually be
> found these days on the Internet), but compiler writers should at
> least be familiar with what they are implementing.
>
> Programmers often rely on books, but these can also get things wrong.

C books at least usually don't go into the fine details of undefined
behaviour. To hone one's instincts in this area one should spend a few
months systematically reading comp.lang.c while consulting a draft
of the standard!

> Because programmers are human, they also can get ticked off when being
> told that a construct they have used for years has been illegal
> for decades :-|

This may happen but my impression with C is that the strongest complaints
come from people who

- have read the C standard (or at least the relevant parts of it)

- know that their code has undefined behaviour and know what the term means

- they do not rely on any compiler extensions

yet still feel certain (dare I say "entitled" ?) that their code ought to
behave in a certain way. For an extreme example see Robert M. Hyatt of
crafty fame (a chess programme which has won awards in the past) :
http://www.open-chess.org/viewtopic.php?f=5&t=2519 .
[Fortran used to require that arrays were stored in column major order, that
double precision took twice the space of real and integer, and you were allowed
to use EQUIVALENCE and adjustable dimensions in argument arrays to do overlaying
assuming that layout. Dunno how much more modern Fortran has deprecated it. -John]

David Brown

unread,
Jan 9, 2022, 5:44:28 PM
to
On 08/01/2022 10:31, Thomas Koenig wrote:
> Spiros Bousbouras <spi...@gmail.com> schrieb:

>> This seems to me exactly like the C model. What difference do you see ?
>
> First, I see a difference in result. Highly intelligent and
> knowledgeable people argue vehemently about whether a program should be
> able to use undefined behavior or not, and a lot of vitriol is directed
> against compiler writers who use the assumption that undefined
> behavior cannot happen in their compilers for optimization,
> especially if it turns out that existing code was broken and no
> longer works after a compiler upgrade (just read a few of Linus
> Torvalds's comments on that matter).

People want compilers to do what the programmer meant, not what he or
she wrote. And in particular, if a compiler did one thing once, they
want it to continue to do the same thing with the same code - as long as
they got what they wanted the first time round.

This is, of course, entirely natural for humans. But it is not natural
for computer programs like compilers.

Linus Torvalds is known for blowing his top on matters that he either
does not understand, or when he has mixed his personal opinions with
facts, or while only looking at a small part of the big picture. (He is
also known as an incredible programmer, a world-class project leader,
and a charismatic visionary who revolutionised the software world - but
that's beside the point here!).

A key example of his complaints in this area revolves around a function
that was something equivalent to:

int foo(int *p) {
    int x = *p;
    if (!p) return -1;
    return x;
}

His complaint was that the compiler saw that "*p" was accessed, and
therefore assumed "p" could not be zero and optimised away the test.
The compiler did exactly what it was asked to do - the optimisation is
perfectly valid according to the C standards and additional definitions
given by the compiler. But it was not what the programmer wanted, and
not what older versions of the compiler had done.

Of course, when a new optimisation simply makes object code more
efficient, programmers want that - they don't /always/ want the compiler
to handle things the way older versions did. They want the compiler to
read their minds and see what they meant to write, and generate optimal
code for that.


None of this is helped by the fact that C code often has to work
efficiently on a variety of targets and compilers, and some compilers
give extra guarantees about how they interpret code beyond the
definitions given in the C standards. Many more compilers can be relied
upon in practice to work in particular ways, though they don't guarantee
or document it, and this means the most efficient code that works in
practice on one compiler may be wrong and give incorrect results on
another compiler. You can write C code that is correct and widely
portable, but you can't write C code that is correct, optimally
efficient, and widely portable.



The big question here is why you think Fortran is any different. In
theory, there isn't a difference - nothing you have said here convinces
me that there is any fundamental difference between Fortran and C in
regards to undefined behaviour. (And there's no difference in the
implementations - the most commonly used Fortran compilers also handle
C, C++, and perhaps other languages.)

I believe it is a matter of who writes Fortran programs, and what these
programs do. Now, I don't know or use Fortran myself, so I might be
wrong here. However, it seems to me that Fortran is typically used by
experienced professional programmers and for scientific or numerical
programming. C is used by a much wider range of programmers, for a much
wider range of programming tasks. I think it is inevitable that you'll
get more people programming in C when they are not fully sure of what
they are doing, more code where subtle mistakes can be made, more people
using C when other languages would have been better choices, and more C
programmers who are likely to blame their tools for their own mistakes.



>
> I see C conflating two separate concepts: Programm errors and
> behavior that is outside the standard. "Undefined behavior is
> always a programming error" does not work; that would make
>
> #include <unistd.h>
> #include <string.h>
>
> int main()
> {
> char a[] = "Hello, world!\n";
> write (1, a, strlen(a));
> return 0;
> }
>

C does not have a "write" function in the standard library. So the
behaviour of "write" is not defined by the C standards - but that does
not mean the behaviour is undefined. It just means it is defined
elsewhere, not in the C standards. If the programmer doesn't know what
the "write" function does or how it is specified, then it might be
undefined behaviour - certainly it is bad programming.


> not more and not less erroneous than
>
> int main()
> {
> int *p = 0;
> *p = 42;
> }
>
> whereas I would argue that there is an important difference between
> the two.
>

There is no fundamental difference - if you know the behaviour is
defined, it is defined. (The program is then correct or incorrect
depending on how that definition matches your requirements.) If not, it
is undefined (and incorrect). In neither case is the behaviour defined
by the C standard, but the behaviour could be defined by something else
(library documentation or external definition of "write", or a C
compiler that specifically says it defines the behaviour of
dereferencing null pointers).

> If the C standard replaced "the behavior is undefined" with "the
> program is in error, and the subsequent behavior is undefined"
> or something along those lines, the discussion would be much
> muted.
>

That sounds like you dislike the "time travel" aspect of C's undefined
behaviour. Many would agree with that - they don't like the idea that
undefined behaviour later in the program can be used to change the
behaviour of code earlier on. The C standard considers undefined
behaviour to be program-wide - if you execute something that has
undefined behaviour (remembering that this means there is no definition
/anywhere/ of what will happen), the whole program is wrong and you
can't expect anything from it.

People often find this disturbing. They think perhaps it is fair enough
that dereferencing a null pointer can crash a program, but it shouldn't
affect things that came before it.

However, there are two key points to think about. First, the standards
handling of undefined behaviour means that a compiler /can/ use UB to
change the object code generated for earlier source code, not that it
/must/ do so. A compiler always balances efficient code generation with
ease-of-use and ease-of-debugging. The ideal balance point will depend
on the programmer writing the code, so compiler flags are used to tune
it, but surprises can still happen.

The other point is to consider how the standards could say anything
else. If the standards required observable behaviour to be completed
before undefined behaviour occurred, the results would be terrible.
Dereferencing a null pointer or dividing by zero could cause a complete
crash (remember the "Windows for Warships" affair? A single divide by
zero brought the whole ship network down, leaving it dead in the water
for hours). That means the compiler would need to make sure any
volatile writes had hit main memory before reading a pointer. It would
have to ensure all file stream buffers were flushed to disk before doing
a division. You can be sure Linus Torvalds would have a thing or two to
say about such a compiler.

> (Somebody may point out to me that this is what the standard is
> actually saying. If so, that would sort of reinforce my argument
> that it should be clearer :-)
[Fortran has in principle historically allowed rather aggressive optimization,
e.g., A*B+A*C can turn into A*(B+C). On the other hand, in the real world,
when IBM improved their optimizing compiler Fortran H into Fortran X, the
developers said any new optimization had to produce bit identical results
to what the old compiler did. So this is not a new issue. -John]

David Brown

unread,
Jan 9, 2022, 6:34:43 PM
to
On 08/01/2022 18:52, Anton Ertl wrote:
> David Brown <david...@hesbynett.no> writes:
>> Undefined behaviour, as far as language standards are concerned, are
>> omnipresent in programming - for all languages.
>
> Please prove this astounding assertion. My impression is that managed
> languages define everything, at least to some extent, and leave
> nothing undefined. If they allowed nasal demons, the appeal of
> managed languages would evaporate instantly.
>

Certainly managed languages define far more than unmanaged languages.
But equally certainly, they do not define everything.

In Python, I can write :

x = flooble(123)

Nowhere in any part of the documentation for Python is a definition of
what the function "flooble" should do. Calling it is /undefined
behaviour/ as far as the language standards are concerned.

Certainly some aspects of calling it - such as the calling convention -
are defined. What should happen if the function does not exist is
defined. But the language and the standards do not define the behaviour
of "flooble".


Being "undefined behaviour as far as the language standards are
concerned" does not mean you can get nasal daemons, it means that the
language standards do not say what will happen. When one says "Division
by 0 is undefined behaviour in C", that is what is meant - as a compiler
or a host OS could give you well-defined and predictable behaviour when
you attempt to divide by 0.

A managed language may put limits on the kind of effect of undefined
behaviour. In Python (at least, CPython), it is possible to call
externally defined functions in shared libraries - even if the Python
bytecode interpreter limits possible effects of pure Python code,
calling external functions gets around those limits. I suppose you
could have a more locked-down managed language that does not allow any
external code, and has additional tracking on things like data space
usage, time usage, and other resources to stop run-away code.

Within such a closed language, you could have defined behaviour for all
code, since any code run or functions called would be in the same
language and have their definitions clear to the interpreter.


Personally, I don't see minimising undefined behaviour as part of the
appeal of managed languages. I make as much effort not to divide by
zero or work with invalid references in my Python code as I do in my C
code - it doesn't much matter if the program stops with Python exception
or a crash. I use Python for the convenience of working with strings,
dictionaries, and other data structures with little concern for memory
management, for its libraries, and other high-level features.

When running unknown code - such as javascript from a website - it is
vital that the effect of any code is limited. Code may have behaviour
that is undefined by the language standards, but it will be defined by
other parts of the code or by its environment (browser, built-in
libraries, etc.). And while it may crash the javascript program or hang
the browser, it should never be able to launch nasal daemons.

Thomas Koenig

unread,
Jan 10, 2022, 2:39:19 PM
to
David Brown <david...@hesbynett.no> schrieb:

> The big question here, is why do you think Fortran is any different? In
> theory, there isn't a difference - nothing you have said here convinces
> me that there is any fundamental difference between Fortran and C in
> regards to undefined behaviour.

I am not sure how to better explain it. I will try a bit, but
this will be my last reply to you in this thread. We seem to have
a fundamental difference in our understanding, and seem to be
unable to resolve it.

> (And there's no difference in the
> implementations - the most commonly used Fortran compilers also handle
> C, C++, and perhaps other languages.)

Sort of.

At the risk of boring most readers of this group, a very short, but
(hopefully) pertinent introduction of how modern compilers work:

A front end translates the source to an abstract syntax tree (which
you can view with gfortran with -fdump-fortran-original) and from
that into an intermediate representation (which you can view with
gfortran, or with gcc in general, with -fdump-tree-original).
This intermediate representation is then optimized, in
an architecture-independent way (usually using SSA) and then
translated into assembler or directly to object code using a
"back end", of which many compilers also have several.

An example: The program

print *,"Hello, world"
end

is translated into (code only)

WRITE UNIT=6 FMT=-1
TRANSFER 'Hello, world'
DT_END

and then into the intermediate representation:

MAIN__ ()
{
{
struct __st_parameter_dt dt_parm.0;

dt_parm.0.common.filename = &"hello.f90"[1]{lb: 1 sz: 1};
dt_parm.0.common.line = 2;
dt_parm.0.common.flags = 128;
dt_parm.0.common.unit = 6;
_gfortran_st_write (&dt_parm.0);
_gfortran_transfer_character_write (&dt_parm.0, &"Hello, world"[1]{lb: 1 sz: 1}, 12);
_gfortran_st_write_done (&dt_parm.0);
}
}

There is no compiler (if you mean a single binary) that handles both
C and Fortran. They are separate front ends to common middle
and back ends.

And there are certainly differences in the code that the front
ends hand to the middle end, so saying that there is "no
difference in the implementations" is not correct.

>> I see C conflating two separate concepts: Programm errors and
>> behavior that is outside the standard. "Undefined behavior is
>> always a programming error" does not work; that would make
>>
>> #include <unistd.h>
>> #include <string.h>
>>
>> int main()
>> {
>> char a[] = "Hello, world!\n";
>> write (1, a, strlen(a));
>> return 0;
>> }
>>
>
> C does not have a "write" function in the standard library. So the
> behaviour of "write" is not defined by the C standards - but that does
> not mean the behaviour is undefined.

When interpreting a language standard, you _must_ follow the
definitions in the standards if they exist, you cannot use everyday
interpretations.

Subclause 3.4.3 (N2596) defines

# undefined behavior

# behavior, upon use of a nonportable or erroneous program
# construct or of erroneous data, for which this document imposes
# no requirements

write() is nonportable and the C standard imposes no requirements
on it. Therefore, the program above invokes undefined behavior.



> It just means it is defined
> elsewhere, not in the C standards.

Nope, see above.

(If you replaced every occurence of "undefined behavior" in the C
standard with "WRTLPFMFT behavior" and "the behavior is undefined"
with "the behavior is WRTLPFMFT", the meaning of the standard
would not change.)
[It seems like nitpicking here. Yes, the C and POSIX standards are
different things, but we all know how common it is to use them
together. -John]

gah4

unread,
Jan 10, 2022, 9:28:19 PM
to
On Saturday, January 8, 2022 at 10:11:55 AM UTC-8, Thomas Koenig wrote:

(snip)

> I see C conflating two separate concepts: Programm errors and
> behavior that is outside the standard. "Undefined behavior is
> always a programming error" does not work; that would make

> #include <unistd.h>
> #include <string.h>
> int main()
> {
> char a[] = "Hello, world!\n";
> write (1, a, strlen(a));
> return 0;
> }

Without the:

#include <unistd.h>

I agree that this would be undefined behavior. But with the include file,
you are agreeing to use whatever standard the include file belongs to.

The include file defines the arguments to write(), and more importantly
indicates that you either supply write() in another file or use an
otherwise supplied library that defines it.

Kaz Kylheku

unread,
Jan 11, 2022, 1:07:35 PM
to
On 2022-01-08, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> David Brown <david...@hesbynett.no> writes:
>>Undefined behaviour, as far as language standards are concerned, are
>>omnipresent in programming - for all languages.
>
> Please prove this astounding assertion. My impression is that managed
> languages define everything, at least to some extent, and leave
> nothing undefined. If they allowed nasal demons, the appeal of
> managed languages would evaporate instantly.

The Lisp-like programming language Scheme has unspecified order of
argument evaluation. And you can stuff side effects into argument
expressions, like in C.

Its built-in imperative operations have undefined return values.

ANSI Common Lisp leaves the effects undefined of modifying literals,
just like C. ANSI Lisp code that perpetrates some kind of error is
safe only if compiled in safe mode; if you compile with reduced safety,
e.g. (declare (optimize (safety 0))), then errors become undefined
behavior, including type errors. If you declare that some quantity is
a fixnum integer, and request safety 0 speed 3, and then it turns
out that it's other than an integer, woe to that code.
However, in these cases you're invoking the safety escape hatch;
it's not like C where you are shackled by chains of undefined behavior
which make themselves felt every time you squirm.

David Brown
Jan 11, 2022, 1:08:33 PM
On 10/01/2022 13:04, Thomas Koenig wrote:
> David Brown <david...@hesbynett.no> schrieb:
>
>> The big question here, is why do you think Fortran is any different? In
>> theory, there isn't a difference - nothing you have said here convinces
>> me that there is any fundamental difference between Fortran and C in
>> regards to undefined behaviour.
>
> I am not sure how to better explain it. I will try a bit, but
> this will be my last reply to you in this thread. We seem to have
> a fundamental difference in our understanding, and seem to be
> unable to resolve it.
>

Fair enough. Maybe in a future discussion, one of us will have an
"Aha!" moment and understand the other's viewpoint, and progress will be
made - until then, there's no point in going around in circles. I'll
snip bits of your post here, and try to minimise new points (unless I
get that "Aha!") - but be sure I am reading and appreciating your entire
post.

>> (And there's no difference in the
>> implementations - the most commonly used Fortran compilers also handle
>> C, C++, and perhaps other languages.)
>
> Sort of.
>
> At the risk of boring most readers of this group, a very short, but
> (hopefully) pertinent introduction of how modern compilers work:
>
>
> There is no compiler (if you mean a single binary) that handles both
> C and Fortran. They are separate front ends to common middle
> and back ends.

Yes. But it is the middle end that handles most of the optimisations,
including those based on undefined behaviour. The front end determines
whether code can have undefined behaviour and in what circumstances.

>> C does not have a "write" function in the standard library. So the
>> behaviour of "write" is not defined by the C standards - but that does
>> not mean the behaviour is undefined.
>
> When interpreting a language standard, you _must_ follow the
> definitions in the standards if they exist, you cannot use everyday
> interpretations.
>
> Subclause 3.4.3 (N2596) defines
>
> # undefined behavior
>
> # behavior, upon use of a nonportable or erroneous program
> # construct or of erroneous data, for which this document imposes
> # no requirements
>
> write() is nonportable and the C standard imposes no requirements
> on it. Therefore, the program above invokes undefined behavior.

No. (As always, this is based on my interpretation of the standards -
consider everything to have "IMHO" attached.) The implementation of
"write" is outside the scope of the standards, and is therefore
undefined as far as the standards are concerned. That does not make it
undefined behaviour in the program - it just means the standards don't
say what "write" should do.

Kaz Kylheku
Jan 11, 2022, 2:47:26 PM
More precisely, optimizations are based on the absence of undefined
behavior: the assumption that contracts are being upheld.

More precisely, that contracts are being upheld in the face of the
inability to determine and diagnose statically whether they are
violated; i.e. there is a "blind trust". (Though there do exist
situations in which, in principle, undefined behavior is easily
deducible at translation time, without a requirement to do so.)

Front-ends for different languages are written to the respective
requirements of those languages. Their first aim is to handle
well-defined constructs and situations. They target the intermediate
language of the compiler middle. That language has its own contracts.
The front end for each respective language has to ensure that every
situation in which behavior is defined (contract is upheld) is
translated to reliable intermediate code whose contract is upheld.
Care has to be taken that the intermediate code is expressed in the
right way so that it will not change behavior in invalid ways due to
optimizations.

This leaves a lot of room for Fortran and C to have entirely different
defined/undefined behaviors.

Even the front end for one single language can have a lot of switches
affecting what is defined or not.

There could be a switch which says that overflowing integer addition has
two's complement wrapping behavior. In that case, the compiler then
selects the intermediate instructions which provide that behavior
reliably (possibly simulating signed arithmetic with unsigned), and
also disables any inferences in the front end that might be based on the
assumption that overflow has not occurred.

>>> C does not have a "write" function in the standard library. So the
>>> behaviour of "write" is not defined by the C standards - but that does
>>> not mean the behaviour is undefined.
>>
>> When interpreting a language standard, you _must_ follow the
>> definitions in the standards if they exist, you cannot use everyday
>> interpretations.
>>
>> Subclause 3.4.3 (N2596) defines
>>
>> # undefined behavior
>>
>> # behavior, upon use of a nonportable or erroneous program
>> # construct or of erroneous data, for which this document imposes
>> # no requirements
>>
>> write() is nonportable and the C standard imposes no requirements
>> on it. Therefore, the program above invokes undefined behavior.
>
> No. (As always, this is based on my interpretation of the standards -

Yes; using any function that is not in the C program, or in the
standard, is ISO C undefined behavior.

A program which includes <unistd.h> is not required to compile
according to ISO C; it can fail with an error message about the
header not being defined. Or, #include <unistd.h> is allowed, in
a conforming implementation, to bring in tokens which have nothing
to do with POSIX.

Furthermore, a program which calls write, and does not provide such a
function itself, is not required to successfully link. If it does link,
there is no requirement that this symbol is a function described by
POSIX.

POSIX implementations have to go out of their way to allow C programs
to use write as an external name, which ISO C allows.

For instance, the GNU C Library defines write as a weak symbol for
some identifier which resembles __libc_write: the "strong" symbol.

The C library internally uses only that __libc_write: it never calls
write, because user code could replace it:

int write(char *x) { ... }

double write = 42.0;

When the application defines the external name write, the weak symbol
coming from glibc yields; it is suppressed in favor of the program's
definition.

> consider everything to have "IMHO" attached.) The implementation of
> "write" is outside the scope of the standards, and is therefore
> undefined as far as the standards are concerned. That does not make it
> undefined behaviour in the program - it just means the standards don't
> say what "write" should do.

Right; it's "ISO C formal undefined behavior", not "behavior that is
not defined by any party whatsoever" ... though it could well be.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

gah4
Jan 11, 2022, 8:08:05 PM
On Tuesday, January 11, 2022 at 11:47:26 AM UTC-8, Kaz Kylheku wrote:

(big snip)

> This leaves a lot of room for Fortran and C to have entirely different
> defined/undefined behaviors.

> Even the front end for one single language can have a lot of switches
> affecting what is defined or not.

I suppose so. But more usually, the compiler works to the least
common denominator.

For one, C requires static variables, and especially external ones, to
initialize to zero, but Fortran doesn't. Fortran compilers that use C
compiler middle and back ends, tend to zero such variables.

I suspect that there are many more that I don't know about.
As long as the cost is small, and it satisfies both standards,
not much reason not to do it.

Fortran has stricter rules on aliasing than C. I don't actually know
of any effect on C programs, though it might be that
compilers do the same for C.

One that is not C or Fortran, but IEEE 754, is the effect of
relational operators with NaN. Comparisons with NaN,
except for "not equal", return false. That means that compilers
have to be careful optimizing such, and especially that
"greater than or equal" is not the logical complement of "less than".
(I haven't looked at how compilers handle this, or, even more,
how the hardware handles it.)

George Neuner
Jan 12, 2022, 5:51:09 PM
On Tue, 11 Jan 2022 16:55:54 -0000 (UTC), Kaz Kylheku
<480-99...@kylheku.com> wrote:

>On 2022-01-08, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> David Brown <david...@hesbynett.no> writes:
>>>Undefined behaviour, as far as language standards are concerned, are
>>>omnipresent in programming - for all languages.
>>
>> Please prove this astounding assertion. My impression is that managed
>> languages define everything, at least to some extent, and leave
>> nothing undefined. If they allowed nasal demons, the appeal of
>> managed languages would evaporate instantly.
>
>The Lisp-like programming language Scheme has unspecified order of
>argument evaluation. And you can stuff side effects into argument
>expressions, like in C.

In Scheme the order of evaluation for let expressions similarly is
unspecified.

There is at least one Scheme which deliberately randomizes the order
of function argument and let evaluation. And there are parallel
Schemes which evaluate function arguments and lets in parallel.


>Its built-in imperative have undefined return values.
>
>ANSI Common Lisp leaves the effects undefined of modifying literals,
>just like C. ANSI Lisp code that perpetrates some kind of error is
>safe only if compiled in safe mode; if you compile with reduced safety,
>e.g. (declare (optimize (safety 0))), then error become undefined
>behavior, including type errors. If you declare that some quantity is
>a fixnum integer, and request safety 0 speed 3, and then it turns
>out that it's other than an integer, woe to that code.
>However, in these cases you're invoking the safety escape hatch;
>it's not like C where you are shackled by chains of undefined behavior
>which make themselves felt every time you squirm.

And Lisp's optimization settings can be changed per function or per
compilation unit as well as globally. ["declaim" vs "declare"]

Thomas Koenig
Jan 12, 2022, 5:53:22 PM
gah4 <ga...@u.washington.edu> schrieb:
> On Tuesday, January 11, 2022 at 11:47:26 AM UTC-8, Kaz Kylheku wrote:
>
> (big snip)
>
>> This leaves a lot of room for Fortran and C to have entirely different
>> defined/undefined behaviors.
>
>> Even the front end for one single language can have a lot of switches
>> affecting what is defined or not.
>
> I suppose so. But more usually, the compiler works to the least
> common denominator.
>
> For one, C requires static variables, and especially external ones, to
> initialize to zero, but Fortran doesn't. Fortran compilers that use C
> compiler middle and back ends, tend to zero such variables.

This is more a matter of operating system and linker conventions
than of compilers.

Looking at the ELF standard, one finds

.bss

This section holds uninitialized data that contribute to the program's
memory image. By definition, the system initializes the data with zeros
when the program begins to run. The section occupies no file space, as
indicated by the section type, SHT_NOBITS.

which, unsurprisingly, matches exactly what C is doing.

Anybody who writes a Fortran compiler for an ELF system will
use .bss for COMMON blocks, because it is easiest. Initialization
with zeros then happens automatically.

> I suspect that there are many more that I don't know about.
> As long as the cost is small, and it satisfies both standards,
> not much reason not to do it.
>
> Fortran has stricter rules on aliasing than C. I don't actually know
> about any effect on C programs, though, but it might be that
> compilers do the same for C.

The rules are different, and unless C is the intermediate language,
a good compiler will hand the corresponding hints to the middle end.
[I have used Fortran systems that initialized otherwise undefined data to a value that would
trap, to help find use-before-set errors. -John]

David Brown
Jan 14, 2022, 12:39:09 PM
On 12/01/2022 20:02, Thomas Koenig wrote:
> gah4 <ga...@u.washington.edu> schrieb:
>> On Tuesday, January 11, 2022 at 11:47:26 AM UTC-8, Kaz Kylheku wrote:

>> For one, C requires static variables, and especially external ones, to
>> initialize to zero, but Fortran doesn't. Fortran compilers that use C
>> compiler middle and back ends, tend to zero such variables.
>
> This is more a matter of operating system and linker conventions
> than of compilers.
>
> Looking at the ELF standard, one finds
>
> .bss
>
> This section holds uninitialized data that contribute to the program's
> memory image. By definition, the system initializes the data with zeros
> when the program begins to run. The section occupies no file space, as
> indicated by the section type, SHT_NOBITS.
>
> which, unsurprisingly, matches exactly what C is doing.
>
> Anybody who writes a Fortran compiler for an ELF system will
> use .bss for COMMON blocks, because it is easiest. Initialization
> with zeros then happens automatically.

I was under the impression that FORTRAN compilers typically put data in
the ".common" section of object files. A key difference between .common
and .bss is that (with standard linker setup) duplicate symbols in .bss
are an error, while duplicate symbols in .common are merged. But in C
startup code, .common is also zeroed (FORTRAN may have different startup
code here - with no experience of the language, I don't know such details).

The use of ".common" by C compilers such as gcc was common practice
precisely to improve compatibility with FORTRAN in the early days, and
it let people write "int global_x;" in headers and have everything work,
rather than the correct practice of "extern int global_x;" in headers
and a single "int global_x;" in one object file. The big disadvantages
are that if you have "int local_x;" in two files, and don't use static,
they'll be merged with no error, and if you have "int global_x;" in one
file and "double global_x;" in another, it's a mess. Modern gcc now
uses "-fno-common" to avoid this.

>
>> I suspect that there are many more that I don't know about.
>> As long as the cost is small, and it satisfies both standards,
>> not much reason not to do it.
>>
>> Fortran has stricter rules on aliasing than C. I don't actually know
>> about any effect on C programs, though, but it might be that
>> compilers do the same for C.
>
> The rules are different, and unless C is the intermediate language,
> a good compiler will hand the corresponding hints to the middle end.

AFAIUI the difference in aliasing rules is that in FORTRAN, pointer or
array parameters are assumed not to alias, while in C the compiler must
assume that they might alias, unless you use "restrict". Are there
other differences?

Thomas Koenig
Jan 14, 2022, 12:39:32 PM
Thomas Koenig <tko...@netcologne.de> schrieb:

> [I have used Fortran systems that initialized otherwise undefined
> data to a value that would trap, to help find use-before-set errors.
> -John]

That is usually still available, but optional. A short example:

$ cat a.f90
program main
print *,a
end program main
$ gfortran -g -ffpe-trap=invalid -finit-real=snan a.f90
$ ./a.out

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

with a backtrace pointing to the offending line.

It does not necessarily work on COMMON blocks, though.