
LLVM and Forth


krishna...@ccreweb.org

Apr 1, 2017, 9:36:52 AM
Question:

Are there any existing Forth systems built from the LLVM tools (link below)? What advantages might LLVM provide for a low-level language such as Forth?

Krishna

http://llvm.org/

Anton Ertl

Apr 1, 2017, 11:12:08 AM
krishna...@ccreweb.org writes:
>Question:
>
>Are there any existing Forth systems built from the LLVM tools (link below)?

AFAIK no.

> What advantages might LLVM provide for a low-level language such as Forth?

None that justify the disadvantages.

The advantage is that LLVM has at least two targets (some ARM
architecture and some Intel architecture).

The disadvantages are:

- It compiles slowly.

- It's oriented towards nasal demon C, the dialect of C that's
  currently popular with C compiler maintainers. As a consequence,

  - You have to divide your program into (the LLVM representation
    of) C functions, that call each other with C calling
    conventions. There has been one attempt to use a different
    calling convention, but that's apparently very difficult.

- If you lie down with dogs, you get up with fleas. I expect that
  your compiler can easily become possessed by nasal demons (i.e.,
  compile programs differently than what you and the programmer
  intended).

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2016: http://www.euroforth.org/ef16/

rickman

Apr 1, 2017, 11:21:41 AM
"Despite its name, LLVM has little to do with traditional virtual
machines"... "The name "LLVM" itself is not an acronym; it is the full
name of the project."

So if the name is just some unpronounceable letters, why would that
imply anything about virtual machines???

--

Rick C

krishna...@ccreweb.org

Apr 1, 2017, 2:55:55 PM
On Saturday, April 1, 2017 at 10:12:08 AM UTC-5, Anton Ertl wrote:
> krishna...@ccreweb.org writes:
> >Question:
> >
> >Are there any existing Forth systems built from the LLVM tools (link below)?
>
> AFAIK no.
>
> > What advantages might LLVM provide for a low-level language such as Forth?
>
> None that justify the disadvantages.
>

I'm always wary of tying programs to complex libraries, unless the benefits warrant it. In this case, I didn't know much about the complexity of the libraries, the LLVM language, and the benefits, other than being able to target x86, PowerPC, and ARM all at the same time, from the same source code.

> The advantage is that LLVM has at least two targets (some ARM
> architecture and some Intel architecture).
>

+ PowerPC (?)

> The disadvantages are:
>
> - It compiles slowly.
>

That's not necessarily a show stopper.

> - It's oriented towards nasal demon C, the dialect of C that's
> currently popular with C compiler maintainers. As a consequence,
>
> - You have to divide your program into (the LLVM representation
> of) C functions, that call each other with C calling
> conventions. There has been one attempt to use a different
> calling convention, but that's apparently very difficult.
>
> - If you lie down with dogs, you get up with fleas. I expect that
> your compiler can easily become possessed by nasal demons (i.e.,
> compile programs differently than what you and the programmer
> intended).
>

Hmm... your assessment seems pretty harsh. However, the shifting interpretation of previously solid C code by recent C compilers, as we discussed previously here in c.l.f., is a prevalent problem. I would be very concerned about using the LLVM tools if they prevented me from mixing machine-specific assembly code into the compiler.

Krishna

krishna...@ccreweb.org

Apr 1, 2017, 2:57:25 PM
Supposedly Java virtual machines have been developed with LLVM.

Krishna

Andrew Haley

Apr 2, 2017, 5:02:26 AM
krishna...@ccreweb.org wrote:
>
> Supposedly Java virtual machines have been developed with LLVM.

It is very difficult, partly for the reasons described: LLVM uses C
semantics for its intermediate language, so if your language is a long
way from C this can be problematic. Optimizing without specific
language semantics is somewhere between hard and impossible.

Andrew.

krishna...@ccreweb.org

Apr 2, 2017, 9:35:26 AM
The fundamental question appears to be independent of LLVM:

If one were to write a Forth to C translator, how likely is it that the translator would result in C code which is considered undefined in the new C standard, i.e. how problematic would "nasal demons" be in the translated code?

I have had one incident of manually implementing a Forth word in C that resulted in undefined code, with the result that a dictionary word broke without warning at some point. Anton may have had a larger share of these types of problems. Who would have thought that we would reach a stage where Fortran became a saner language than C?! The scientific community, at least, was proven correct in hanging on to Fortran.

Krishna

Anton Ertl

Apr 2, 2017, 11:23:39 AM
krishna...@ccreweb.org writes:
>On Sunday, April 2, 2017 at 4:02:26 AM UTC-5, Andrew Haley wrote:
>> Optimizing without specific
>> language semantics is somewhere between hard and impossible.

Not in the least. On the contrary, it's harder to "optimize" using
nasal demon semantics, and that's why such "optimizations" are a
relatively recent development.

You can translate the source code to machine code according to the
semantics of the programming language, and then optimize the machine
code. In practice, they used to translate to an intermediate
representation (IR) with machine-level operations, then optimize that,
and then generate code from that.

The newer developments are that they added stuff such as "undefined",
assertions about, e.g., the possible values of things (derived from
nasal demon C semantics), and operations with language-specific nasal
demons to the IR. One could try to avoid all these cliffs, and just
use the machine-level operations (if they are still there), but I
would not bet that every optimization pass (there are several hundred
in LLVM and GCC) actually observes the machine-level semantics rather
than taking liberties based on C semantics.

>The fundamental question appears to be independent of LLVM:
>
>If one were to write a Forth to C translator, how likely is it that the
>translator would result in C code which is considered undefined in the
>new C standard, i.e. how problematic would "nasal demons" be in the
>translated code?

For a straightforward translation, the likelihood of undefined
behaviour from many Forth programs is 100%; maybe you are lucky, and
the C compiler still generates the code you intended, maybe not; in
the next compiler version it may be different.
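
(A hypothetical illustration of such a straightforward translation, with
invented names: Forth "+" mapped onto C signed arithmetic is undefined
on overflow, while routing the addition through the unsigned type is
fully defined, since unsigned arithmetic wraps modulo 2^N by
definition:)

```c
#include <stdint.h>

/* Naive translation of Forth "+" on signed cells: signed
   overflow is undefined behaviour in C, so the compiler may
   assume it never happens. */
intptr_t forth_plus_naive(intptr_t a, intptr_t b)
{
    return a + b;               /* UB if the sum overflows */
}

/* UB-free variant: unsigned arithmetic wraps by definition;
   converting the result back to intptr_t is merely
   implementation-defined (modular on common compilers). */
intptr_t forth_plus_defined(intptr_t a, intptr_t b)
{
    return (intptr_t)((uintptr_t)a + (uintptr_t)b);
}
```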

You could go for a more roundabout translation, and with lots of
perseverance and some luck get a translation without undefined
behaviour, but it would be slow and interfacing with the rest of the
world would probably be a problem.

But that's a nasal-demon-C-specific problem. The LLVM people could
have chosen to just have a machine-level IR without any nasal demon
nonsense, and then LLVM would not have had this problem. So the
problem is not really independent of LLVM.

Anton Ertl

Apr 2, 2017, 11:35:32 AM
krishna...@ccreweb.org writes:
>On Saturday, April 1, 2017 at 10:12:08 AM UTC-5, Anton Ertl wrote:
[...]
>In this case, I didn't know much about the complexity of the
>libraries, the LLVM language, and the benefits, other than being able
>to target x86, PowerPC, and ARM all at the same time, from the same
>source code.

If that's what's of interest to you, you could also take a look at GNU
Lightning.

>> The advantage is that LLVM has at least two targets (some ARM
>> architecture and some Intel architecture).
>>
>
>+ PowerPC (?)

Maybe. Apple removed OS support for PowerPC in 2009, so they probably
no longer use the PowerPC back end. So while the PowerPC port may
still be there, it may suffer from bit rot.

Alex

Apr 2, 2017, 4:15:22 PM
Early in the development of LLVM there was this: Stacker. It was very
Forth-like, but compiled, with no interpretation.

http://web.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/docs/Stacker.html

It appears not to have been maintained, and I don't know where the
source is.

--
Alex

hughag...@gmail.com

Apr 2, 2017, 6:18:16 PM
On Sunday, April 2, 2017 at 1:15:22 PM UTC-7, Alex wrote:
> On 4/1/2017 14:36, krishna...@ccreweb.org wrote:
> > Question:
> >
> > Are there any existing Forth systems built from the LLVM tools (link
> > below)? What advantages might LLVM provide for a low-level language
> > such as Forth?
> >
> > Krishna
> >
> > http://llvm.org/
> >
>
> Early in the development of LLVM there was this; Stacker. It was very
> Forth like, but compiled and no interpretation.
>
> http://web.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/docs/Stacker.html

This is primarily why I'm not interested in LLVM --- because code has to be cross-compiled --- you can't have the traditional interactive Forth development system.

minf...@arcor.de

Apr 3, 2017, 4:10:00 AM
I am using a Forth to C translator, where only a rather small subset of (pseudo) Forth words is translated to create an application kernel, including interactivity if the application requires it. The rest is built on top of the kernel without the translator. One proof-of-concept application is a full Forth system that passes the Forth200x test suite successfully including all wordsets.

So there is no undefined code. I have never encountered C compiler problems because kernels are small and the resulting C code is straightforward and doesn't use any esoteric constructs. KISS is always a good principle.

I do not understand why Forth-to-C cross-compilers are so poorly regarded here.

Andrew Haley

Apr 3, 2017, 4:17:59 AM
krishna...@ccreweb.org wrote:
> On Sunday, April 2, 2017 at 4:02:26 AM UTC-5, Andrew Haley wrote:
>> krishna...@ccreweb.org wrote:
>> >
>> > Supposedly Java virtual machines have been developed with LLVM.
>>
>> It is very difficult, partly for the reasons described: LLVM uses C
>> semantics for its intermedieate language, so if your language is along
>> way from C this can be problematic. Optimizing without specific
>> language semantics is somewhere between hard and impossible.
>
> The fundamental question appears to be independent of LLVM:
>
> If one were to write a Forth to C translator, how likely is it that
> the translator would result in C code which is considered undefined
> in the new C standard, i.e. how problematic would "nasal demons" be
> in the translated code?

Single-threaded, it wouldn't be problematic at all. The nasal daemons
tend to happen if you're pushing the envelope of optimization. As
long as you translated into well-defined C, which isn't difficult,
you'd be fine. I don't think there are any Forth constructs that
wouldn't translate easily. The real problem occurs when you're trying
to generate highly efficient C.

Multi-threading would be harder, at least in theory.

Andrew.

minf...@arcor.de

Apr 3, 2017, 8:04:24 AM
There is no highly efficient C per se, only good algorithms and good compilers.

Only occasionally do I do some disassembling or profiling of the final code. As a last resort there is always the possibility of putting some assembler lines within the C code.

But this was _never_ required to solve performance issues, only to write special hardware drivers.

Alex

Apr 3, 2017, 8:07:19 AM
Cross compiled? What?

Gforth demonstrates it's more than possible to write an interactive
Forth in a high level language. Stacker demonstrates that it's possible
to use LLVM to generate code from a Forth-like language. So what
technical reason is there that you can't have an interactive Forth
development system that uses LLVM as its code generator?

--
Alex

krishna...@ccreweb.org

Apr 3, 2017, 8:20:43 AM
Excellent. There was mention of a repository for the source code, but I believe the link no longer worked. It should be possible to build an interpreter using the same tools.

Krishna

krishna...@ccreweb.org

Apr 3, 2017, 8:32:54 AM
A couple of us have been burned by changes to the C compiler (GCC) which affected (broke) the Forth systems we maintained. These compiler changes did not output warnings, unless one specified a high enough warning level, and, therefore, I learned of the side effects much later. This was a case of transforming an unambiguous C statement into undefined behavior in the new C standard, for the sake of optimization. The new C standard apparently has plenty of such undefined behaviors, sacrificing language safety for the sake of compiler optimization. From the LLVM project blog:

"Undefined behavior exists in C-based languages because the designers of C wanted it to be an extremely efficient low-level programming language. In contrast, languages like Java (and many other 'safe' languages) have eschewed undefined behavior because they want safe and reproducible behavior across implementations, and willing to sacrifice performance to get it. While neither is "the right goal to aim for," if you're a C programmer you really should understand what undefined behavior is."

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

Personally, I think the extra CPU power of modern processors should be weighted much more towards language safety rather than maximal efficiency. As people get burned, the glaring deficiencies of the insane drive to efficiency are driving some of us away from C-like languages.

Krishna

Anton Ertl

Apr 3, 2017, 9:51:35 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>krishna...@ccreweb.org wrote:
>> If one were to write a Forth to C translator, how likely is it that
>> the translator would result in C code which is considered undefined
>> in the new C standard, i.e. how problematic would "nasal demons" be
>> in the translated code?
>
>Single-threaded, it wouldn't be problematic at all. The nasal daemons
>tend to happen if you're pushing the envelope of optimization. As
>long as you translated into well-defined C, which isn't difficult,
>you'd be fine. I don't think there are any Forth constructs that
>wouldn't translate easily. The real problem occurs when you're trying
>to generate highly efficient C.

Bullshit. Bullshit. Bullshit. Bullshit. Bullshit. A whole
paragraph of bullshit, with every single sentence being bullshit. You
don't have to believe me; in the following I refute them one by one:

1) All the nasal demon examples I came across have been in
single-threaded programs. (You can find some of them in
<http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf>,
and more in the referenced papers).

2) Nasal demon compilers "optimized" away the memset() that was
intended to prevent Heartbleed, and the programmers did not put this
memset() in to push the envelope of optimization. Likewise,
programmers do not put in bounds checks to push the envelope of
optimization, yet nasal demon compilers often "optimize" these bounds
checks away. Actually, most of the examples of nasal demons I know of
have nothing to do with "pushing the envelope of optimization".

3) If it is not difficult to produce well-defined C, you would expect
that at least the hardcore members of the nasal demon church, the
developers of GCC and Clang/LLVM, would be able to do it, yet both of
these compilers exercised undefined behaviour even when compiling an
empty program with optimization off.

4) Anything to do with treating cells as either addresses or as
integers can easily turn into undefined behaviour. Dealing with
memory can also easily incur undefined behaviour. Implementing Forth
addresses as C constructs without incurring undefined behaviour is
also challenging, and certainly not easy (e.g., how do you implement
ALLOCATE and "-"?).

5) Nasal demon compilers "optimized" away the memset() that was
intended to prevent Heartbleed, and the programmers did not put this
memset() in to generate highly efficient C. Likewise, programmers do
not put in bounds checks to generate highly-efficient C, yet nasal
demon compilers often "optimize" these bounds checks away. Actually,
most of the examples of nasal demons I know of have nothing to do with
"highly-efficient C".

It may be, though, that "highly-efficient" C programs that "push the
envelope of optimization" may be even more frequently afflicted by
nasal demons than the rest. You suggest to avoid going there. Yet
nasal demon fans claim (without empirical evidence) that nasal demons
are a good thing for efficiency reasons. This does not compute.
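
(To make point 4 concrete, a sketch with invented names: C defines
pointer subtraction only within a single object, so Forth "-" applied
to addresses from two different ALLOCATEd regions is undefined unless
the translator routes all address arithmetic through integers:)

```c
#include <stddef.h>
#include <stdint.h>

/* Undefined when p and q point into different objects:
   C defines pointer subtraction only within one array. */
ptrdiff_t addr_minus_naive(char *p, char *q)
{
    return p - q;
}

/* Defined alternative: convert both addresses to integers
   first and subtract with (wrapping) unsigned arithmetic. */
uintptr_t addr_minus_cells(void *p, void *q)
{
    return (uintptr_t)p - (uintptr_t)q;
}
```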

Anton Ertl

Apr 3, 2017, 10:27:46 AM
krishna...@ccreweb.org writes:
>On Monday, April 3, 2017 at 3:10:00 AM UTC-5, minf...@arcor.de wrote:
>> So there is no undefined code. I have never encountered C compiler
>> problems because kernels are small and the resulting C code is
>> straightforward and doesn't use any esoteric constructs. KISS is
>> always a good principle.

It does not help against nasal demons. You can get them in just a few
characters, and with code that most people would see as
straightforward. E.g., try to write rotl32(int x, unsigned n) or
memmove() in C.
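
(A sketch of the rotl32 case, not from the original posting: the
obvious formulation shifts a 32-bit operand by 32 when n is 0, which is
undefined behaviour in C; masking both shift counts into 0..31 makes it
well-defined for every n:)

```c
#include <stdint.h>

/* Obvious formulation: when n == 0 this computes x >> 32,
   which is undefined behaviour for a 32-bit operand. */
uint32_t rotl32_naive(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Well-defined for every n: mask both shift counts into the
   range 0..31, so neither shift can be out of range. */
uint32_t rotl32_safe(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
```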

>The new C standard apparently has plenty of such undefined behaviors,
>sacrificing language safety for the sake of compiler optimization.
>From the LLVM project blog:
>
>"Undefined behavior exists in C-based languages because the designers
>of C wanted it to be an extremely efficient low-level programming
>language. In contrast, languages like Java (and many other 'safe'
>languages) have eschewed undefined behavior because they want safe and
>reproducible behavior across implementations, and willing to sacrifice
>performance to get it. While neither is "the right goal to aim for,"
>if you're a C programmer you really should understand what undefined
>behavior is."
>
>http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

Yes, the nasal demon fans claim that they invite nasal demons for
performance reasons, but they rarely, if ever, provide empirical
evidence of the performance effects.

That's because the performance effects of nasal demon "optimizations"
are so small that most production programmers would boycott the newer
C compilers with more nasal demons if they knew it. However, Wang et
al. actually measured the performance effect: Even on the SPECint
benchmarks, which are in a privileged position (GCC and LLVM
maintainers optimize for them and don't dare miscompile them, and
therefore they don't have to be "sanitized", which could reduce their
performance), the speedup is <2%.

>Personally, I think the extra CPU power of modern processors should be
>weighted much more towards language safety rather than maximal
>efficiency.

This is a false dichotomy. Nasal demons hurt efficiency: If you want
to avoid undefined behaviour, you avoid many efficient idioms; in
addition, you have to invest lots of time in "sanitizing" your
program, and you miss that time when it comes to making your program
faster; plus, every change to your program (such as implementing an
optimization idea) incurs the danger of introducing undefined
behaviour into your carefully sanitized program, so better not go
there.

This is actually not specific to newer C standards. Older C standards
also had lots of undefined behaviour. The problem is actually in the
newer C compilers and their maintainers, who think it is a good idea
to put nasal demons in their compilers.

Anton Ertl

Apr 3, 2017, 10:41:06 AM
minf...@arcor.de writes:
>I am using a Forth to C translator, where only a rather small subset
>of (pseudo) Forth words is translated to create an application kernel,
>including interactivity if the application requires it. The rest is
>built on top of the kernel without the translator. One
>proof-of-concept application is a full Forth system that passes the
>Forth200x test suite successfully including all wordsets.
>
>So there is no undefined code.

How do you know? Just because the program works as intended does not
guarantee that there is no undefined behaviour. The next version of
your C compiler could turn the undefined behaviour into nasal demons.

Gforth is chock full of undefined behaviour, yet it works as intended
with the compilers we tested. Now and then newer compiler versions
miscompile Gforth, then we need to deal with that by disabling (if
possible) the "optimization" that caused this, or by changing the
source code of Gforth.

minf...@arcor.de

Apr 3, 2017, 10:59:17 AM
On Monday, April 3, 2017 at 16:27:46 UTC+2, Anton Ertl wrote:
> krishna...@ccreweb.org writes:
> >On Monday, April 3, 2017 at 3:10:00 AM UTC-5, minf...@arcor.de wrote:
> >> So there is no undefined code. I have never encountered C compiler
> >> problems because kernels are small and the resulting C code is
> >> straightforward and doesn't use any esoteric constructs. KISS is
> >> always a good principle.
>
> It does not help against nasal demons. You can get them in just a few
> characters, and with code that most people would see as
> straightforward. E.g., try to write rotl32(int x, unsigned n) or
> memmove() in C.
>

I am interested in bug-free results, not so much in perfect tools. Primarily I have to test my _applications_ thoroughly.

Thereby I hope that wrong binaries caused by compiler bugs or poor language specifications are detected as well. In the desktop world one can choose from many compiler vendors, but in the embedded or controller domain there is often just one: the compiler that you have to live with.

BTW is there any "nasal demon" free programming language specification at all??
How do Ada, Haskell, Misra-C, FORTRAN fare here?



Andrew Haley

Apr 3, 2017, 11:08:59 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>krishna...@ccreweb.org wrote:
>>> If one were to write a Forth to C translator, how likely is it that
>>> the translator would result in C code which is considered undefined
>>> in the new C standard, i.e. how problematic would "nasal demons" be
>>> in the translated code?
>>
>>Single-threaded, it wouldn't be problematic at all. The nasal daemons
>>tend to happen if you're pushing the envelope of optimization. As
>>long as you translated into well-defined C, which isn't difficult,
>>you'd be fine. I don't think there are any Forth constructs that
>>wouldn't translate easily. The real problem occurs when you're trying
>>to generate highly efficient C.
>
> Bullshit. Bullshit. Bullshit. Bullshit. Bullshit. A whole
> paragraph of bullshit, with every single sentence being bullshit. You
> don't have to believe me, in the following I refute them by one:

LOL! Bullshit yourself.

> 1) All the nasal demon examples I came across have been in
> single-threaded programs. (You can find some of them in
> <http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf>,
> and more in the referenced papers).
>
> 2) Nasal demon compilers "optimized" away the memset() that was
> intended to prevent Heartbleed, and the programmers did not put this
> memset() in to push the envelope of optimization. Likewise,
> programmers do not put in bounds checks to push the envelope of
> optimization, yet nasal demon compilers often "optimize" these bounds
> checks away. Actually, most of the examples of nasal demons I know of
> have nothing to do with "pushing the envelope of optimization".

I've seen the Heartbleed bug, and it was a missing bounds check.
It was fixed by adding the bounds check. I can't find any reference
to a miscompiled memset.

> 3) If it is not difficult to produce well-defined C, you would expect
> that at least the hardcore members of the nasal demon church, the
> developers of GCC and Clang/LLVM, would be able to do it, yet both of
> these compilers exercised undefined behaviour even when compiling an
> empty program with optimization off.

Nobody is perfect. There are bugs in all large programs. There's
nothing special about undefined behaviour: it's just another kind of
error.

> 4) Anything to do with treating cells as either addresses or as
> integers can easily turn into undefined behaviour. Dealing with
> memory can also easily incur undefined behaviour. Implementing Forth
> addresses as C constructs without incurring undefined behaviour is
> also challenging, and certainly not easy (e.g., how do you implement
> ALLOCATE and "-"?).

I'd do it like this: Forth cells are unsigned ints, allocated on top
of a byte array which is data space. All integer arithmetic is modulo
N, where N is the word size. Division has to be handled by careful
conversion, to make sure there aren't any overflows. That's a little
bit fiddly, but all perfectly practicable.
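
(A minimal sketch of that scheme, with invented names: cells are an
unsigned type whose arithmetic wraps modulo 2^N, data space is a byte
array, and fetch/store go through memcpy, which is defined for any
properly sized, in-bounds region regardless of alignment or the
effective type of what was stored there before:)

```c
#include <stdint.h>
#include <string.h>

typedef uintptr_t cell;        /* unsigned: arithmetic wraps mod 2^N */

static unsigned char dataspace[65536];   /* Forth data space */

/* Forth @ : read one cell from data space. */
cell fetch(cell addr)
{
    cell c;
    memcpy(&c, &dataspace[addr], sizeof c);
    return c;
}

/* Forth ! : write one cell to data space. */
void store(cell addr, cell c)
{
    memcpy(&dataspace[addr], &c, sizeof c);
}
```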

What problem do you foresee with ALLOCATE? The only integer types seen
in memory by C are bytes and unsigned ints. I suppose there might be
some corner cases if you, say, store an int which you then read as a
floating-point type, but I don't think that is fully defined by Forth
either.

> 5) Nasal demon compilers "optimized" away the memset() that was
> intended to prevent Heartbleed, and the programmers did not put this
> memset() in to generate highly efficient C. Likewise, programmers do
> not put in bounds checks to generate highly-efficient C, yet nasal
> demon compilers often "optimize" these bounds checks away. Actually,
> most of the examples of nasal demons I know of have nothing to do with
> "highly-efficient C".

So don't write undefined programs, then. The question, as you seem to
have forgotten, was:

>>> If one were to write a Forth to C translator, how likely is it that
>>> the translator would result in C code which is considered undefined
>>> in the new C standard, i.e. how problematic would "nasal demons" be
>>> in the translated code?

And the correct answer is that it wouldn't be problematic at all.
There's no need for a Forth translator to have undefined behaviour.

Of course it is likely that any program of this complexity will have
bugs to begin with, and some of these bugs will probably exhibit
undefined behaviour. However, it's then just a matter of fixing the
bugs, and you're done. And there's an undefined behaviour checker in
GCC to help with that.

Andrew.

Alex

Apr 3, 2017, 12:02:55 PM
On 4/3/2017 13:55, Anton Ertl wrote:
> Bullshit. Bullshit. Bullshit. Bullshit. Bullshit. A whole
> paragraph of bullshit, with every single sentence being bullshit.

Steady on Anton. I think we got the point after the first incantation.

--
Alex

Alex

Apr 3, 2017, 12:11:41 PM

hughag...@gmail.com

Apr 3, 2017, 12:45:28 PM
Performance.

It is possible to write a traditional Forth in any language --- John Passaniti's claim to fame was to have written a "Forth-like interpreter" in Perl (afaik, Perl was the only programming language he knew, as all of his code posted on comp.lang.forth was in Perl) --- this doesn't mean that it is a good idea.

With something like LLVM, I would write a cross-compiler and generate LLVM code --- like STC except that it would be LLVM assembly-language --- that is the only way to get any performance, but the downside is there is no interactive Forth system.

I wouldn't actually use LLVM anyway --- it isn't designed for Forth --- any Forth programmer could design a VM that is for Forth (has the Forth stacks).

hughag...@gmail.com

Apr 3, 2017, 1:03:55 PM
On Monday, April 3, 2017 at 6:51:35 AM UTC-7, Anton Ertl wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >krishna...@ccreweb.org wrote:
> >> If one were to write a Forth to C translator, how likely is it that
> >> the translator would result in C code which is considered undefined
> >> in the new C standard, i.e. how problematic would "nasal demons" be
> >> in the translated code?
> >
> >Single-threaded, it wouldn't be problematic at all. The nasal daemons
> >tend to happen if you're pushing the envelope of optimization. As
> >long as you translated into well-defined C, which isn't difficult,
> >you'd be fine. I don't think there are any Forth constructs that
> >wouldn't translate easily. The real problem occurs when you're trying
> >to generate highly efficient C.
>
> Bullshit. Bullshit. Bullshit. Bullshit. Bullshit. A whole
> paragraph of bullshit, with every single sentence being bullshit. You
> don't have to believe me, in the following I refute them by one: ...

Anton --- why don't you tell us how you really feel?

Also, why don't you just use 64-bit x86 assembly-language? In the old days, there were multiple processors in use:
1.) First, we had the Z80 and the 65c02, plus some others such as the MC6809 etc..
2.) Then we had the i8086 and the MC68000, plus some others such as the i8096 etc..
3.) Then we had the Pentium and the PowerPC and the ARM, plus some others such as the DEC Alpha etc..
4.) Now we have the 64-bit x86 for desktop computers and the 32-bit ARM for micro-controllers, and the others (the PIC24 PIC32 etc.) have such a small market share as to be irrelevant.

Gforth is for desktop computers, so just write in 64-bit x86 --- I like FASM, but NASM etc. have their fans --- there isn't going to be any other desktop-computer processor within our lifetimes.

Of course, computers such as the Raspberry Pi can be used like handheld desktop-computers --- they have such a small market share as to be irrelevant --- if a gnu 32-bit ARM Forth is developed it should be designed to run without an OS so it could be used in a micro-controller, but that would be useful for writing commercial software in competition with MPE and Forth Inc., so you can't do that and continue to be on the Forth-200x committee.

Anton Ertl

Apr 3, 2017, 1:05:35 PM
minf...@arcor.de writes:
>BTW is there any "nasal demon" free programming language specification
>at all??

Sure; many higher-level languages define everything, i.e., every
operation produces either a result or a defined error condition. Java
even tries to guarantee the same result, and uses the slogan "Write
once, run anywhere" for that.

Machine language also defines everything.

For other low-level languages, Forth and probably many others do not
have the time-traveling twist that C has and that plays a significant
role in conjuring up nasal demons. And most don't have compilers that
do such nonsense.

>How do Ada, Haskell, Misra-C, FORTRAN fare here?

Fortran is notorious for undefined behaviour; however, nasal demon C
may have overtaken it in the nasal demon department.

Andrew Haley

Apr 3, 2017, 1:22:39 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> minf...@arcor.de writes:
>>BTW is there any "nasal demon" free programming language
>>specification at all??
>
> Sure; many higher-level languages define everything, i.e., every
> operation produces either a result or a defined error condition. Java
> even tries to guarantee the same result, and uses the slogan "Write
> once, run anywhere" for that.
>
> Machine language also defines everything.

Not quite: IIRC you've said that before, and I thought I corrected you
then. ARM says, for code which is modified and then executed without
proper synchronization:

"Concurrent modification and execution of instructions can lead to
the resulting instruction performing any behavior that can be
achieved by executing any sequence of instructions that can be
executed from the same Exception level, except where the instruction
before modification and the instruction after modification is a B,
BL, NOP, BRK, SVC, HVC, or SMC instruction."

Granted, this isn't quite nasal daemons, but "any sequence of
instructions" is pretty close.

Andrew.

Anton Ertl

Apr 3, 2017, 1:31:52 PM4/3/17
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> 2) Nasal demon compilers "optimized" away the memset() that was
>> intended to prevent Heartbleed, and the programmers did not put this
>> memset() in to push the envelope of optimization. Likewise,
>> programmers do not put in bounds checks to push the envelope of
>> optimization, yet nasal demon compilers often "optimize" these bounds
>> checks away. Actually, most of the examples of nasal demons I know of
>> have nothing to do with "pushing the envelope of optimization".
>
>I've seen the Heartbleed bug, and it was was a missing bounds check.
>It was fixed by adding the bounds check. I can't find any reference
>to a miscompiled memset.

I am sorry, I confused Heartbleed with an OpenSSH bug
<https://lwn.net/Articles/672465/>, where GCC "optimized" away the
memset() that was intended to mitigate the effects that other
vulnerabilities had.

>> 3) If it is not difficult to produce well-defined C, you would expect
>> that at least the hardcore members of the nasal demon church, the
>> developers of GCC and Clang/LLVM, would be able to do it, yet both of
>> these compilers exercised undefined behaviour even when compiling an
>> empty program with optimization off.
>
>Nobody is perfect. There are bugs in all large programs. There's
>nothing special about undefined behaviour: it's just another kind of
>error.

As I wrote: "well-defined C, which isn't difficult" is bullshit.

>> 4) Anything to do with treating cells as either addresses or as
>> integers can easily turn into undefined behaviour. Dealing with
>> memory can also easily incur undefined behaviour. Implementing Forth
>> addresses as C constructs without incurring undefined behaviour is
>> also challenging, and certainly not easy (e.g., how do you implement
>> ALLOCATE and "-"?).
>
>I'd do it like this: Forth cells are unsigned ints, allocated on top
>of a byte array which is data space. All integer arithmetic is modulo
>N, where N is the word size. Division has to be handled by careful
>conversion, to make sure there aren't any overflows. That's a little
>bit fiddly, but all perfectly practicable.
>
>What problem do you forsee with ALLOCATE?

How do you translate it "easily"?

>The only integer types seen
>in memory by C are bytes and unsigned ints.

How do you implement @ and C@?

>> 5) Nasal demon compilers "optimized" away the memset() that was
>> intended to prevent Heartbleed, and the programmers did not put this
>> memset() in to generate highly efficient C. Likewise, programmers do
>> not put in bounds checks to generate highly-efficient C, yet nasal
>> demon compilers often "optimize" these bounds checks away. Actually,
>> most of the examples of nasal demons I know of have nothing to do with
>> "highly-efficient C".
>
>So don't write undefined programs, then. The question, as you seem to
>have forgotten it, was:
>
>>>> If one were to write a Forth to C translator, how likely is it that
>>>> the translator would result in C code which is considered undefined
>>>> in the new C standard, i.e. how problematic would "nasal demons" be
>>>> in the translated code?
>
>And the correct answer is that it wouldn't be problematic at all.
>There's no need for a Forth translator to have undefined behaviour.
>
>Of course it is likely that any program of this complexity will have
>bugs to begin with, and some of these bugs will probably exhibit
>undefined behaviour. However, it's then just a matter of fixing the
>bugs, and you're done. And there's an undefined behaviour checker in
>GCC to help with that.

And after all this effort (which I wouldn't call "not problematic at
all" nor "not difficult"; it's not just like a -Wall test, instead you
have lots of mutually exclusive -fsanitize flags, and then the
undefined behaviours only show up at run-time with appropriate
inputs), at the end you have a program with undefined behaviours that
have not been caught by the checker in GCC. But sure, if you love
nasal demons, you also find that not problematic at all.

Alex

Apr 3, 2017, 5:52:20 PM4/3/17
to
So you think that most of the time in Forth is spent compiling?

>
> It is possible to write a traditional Forth in any language --- John
> Passaniti's claim to fame was to have written a "Forth-like
> interpreter" in Perl (afaik, Perl was the only programming language
> he knew, as all of his code posted on comp.lang.forth was in Perl)
> --- this doesn't mean that it is a good idea.
>
> With something like LLVM, I would write a cross-compiler and
> generate LLVM code --- like STC except that it would be LLVM
> assembly-language --- that is the only way to get any performance,
> but the downside is there is no interactive Forth system.

Again, why write a "cross compiler"? Why would it not be possible to
have an interactive Forth system?

>
> I wouldn't actually use LLVM anyway --- it isn't designed for Forth
> --- any Forth programmer could design a VM that is for Forth (has
> the Forth stacks).
>

Probably just as well. Do you actually know what LLVM is? In the LLVM
case, the clue is not in the name. It's not a VM.

--
Alex

hughag...@gmail.com

Apr 3, 2017, 6:25:10 PM4/3/17
to
On Monday, April 3, 2017 at 2:52:20 PM UTC-7, Alex wrote:
> On 4/3/2017 17:45, hughag...@gmail.com wrote:
> > Performance.
>
> So you think that most of the time in Forth is spent compiling?

run-time performance

Rod Pemberton

Apr 3, 2017, 8:38:47 PM4/3/17
to
On Sun, 2 Apr 2017 06:35:25 -0700 (PDT)
krishna...@ccreweb.org wrote:

> On Sunday, April 2, 2017 at 4:02:26 AM UTC-5, Andrew Haley wrote:
> > krishna...@ccreweb.org wrote:

> If one were to write a Forth to C translator, how likely is it that
> the translator would result in C code which is considered undefined
> in the new C standard,

IMO, unlikely, especially if the Forth was mostly written in high-level
Forth, being built up from "primitives" like eForth. However, it is
entirely possible that there are complicated interactions.

Typically, eForths have 30 to 60 or so low-level words or
"primitives". All the high-level words are coded in terms of the
primitives. The advantage here is that the high-level words translate
easily to C, and most of an eForth like Forth is coded as high-level
Forth words. So, much of your work is done for you. You just need to
implement the "primitives" in C and the Forth data structures, like
stacks, etc.

In this scenario, high-level Forth words would become a procedure call,
where each procedure has a list of calls to other procedures. Forth
"primitive" words would be C procedures that actually do some work on
the stacks, instead of just calling other procedures. I think the only
issue here would be conversion of the names for Forth words, e.g., '+'
'\' etc, to procedure names which are valid in C, e.g., 'add' 'divide'
etc.

E.g., a high-level Forth word using a "primitive" twice:

: DUP2 DUP DUP ;

In C,

void DUP2(void)          /* high-level Forth word */
{
    DUP();               /* list of procedures */
    DUP();
}

void DUP(void)           /* "primitive" Forth word */
{                        /* actual code to modify Forth stacks */
    TOS=*SP;             /* retrieve TOS value */
    SP++;                /* adjust stack pointer to add space for new TOS */
    *SP=TOS;             /* save TOS value into new TOS */
}

As you can see, the high-level definition of DUP2 can be converted to C
simply by inserting or replacing text. E.g., ':' becomes 'void', ';'
becomes '}', the definition name has '(void) {' inserted after it, and
the called Forth words have '();' inserted afterwards.

The "primitive" DUP actually needs C code to do work on the Forth
stacks. The functions have no parameters in C because the code uses a
data stack set up in C for Forth, and not C's internal stack.

Of course, the code would need to be self-contained to be a compiled
program, otherwise, it's just a library of functions. I.e., you'd need
to have all interactive commands saved as text to be compiled. E.g.,
you'd not only need the code for DUP2 but also the interactive line
using DUP2 such as:

: DUP2 DUP DUP ;
5 DUP2 . . .

So, the "5 DUP2 . . ." would become your main() procedure in C:

int main(void)
{
    LIT(5);
    DUP2();
    DOT();
    DOT();
    DOT();
}

As you can see '.' is an example of the name conversion issue, i.e.,
'.' in Forth to DOT() in C. Implementing DOT() is left as an exercise
for the reader.

Also, all control-flow words in Forth, e.g., IF ELSE THEN DO LOOP BEGIN
AGAIN UNTIL WHILE REPEAT etc, will map directly onto existing C
control-flow constructs. So, converting Forth control-flow to C is not
a problem either.

One issue is that the Forth interactive commands and Forth definitions
can occur in any order, so your conversion program must filter out the
Forth definitions and define them first. This is an issue if Forth
words are redefined. The control-flow of the Forth program will be
changed by re-ordering to fit C's requirements, and C will complain
about redefining a procedure. I.e., your conversion program will need
to keep track of both DUP2's below and need to treat each of them as
being distinct procedures:

: DUP2 DUP DUP ;
5 DUP2 . . .

: DUP2 DUP DUP + ;
6 DUP2 . .


void DUP2_1(void)        /* high-level Forth word */
{
    DUP();               /* list of procedures */
    DUP();
}

void DUP2_2(void)        /* high-level Forth word */
{
    DUP();               /* list of procedures */
    DUP();
    PLUS();
}

void DUP(void)           /* "primitive" Forth word */
{                        /* actual code to modify Forth stacks */
    TOS=*SP;             /* retrieve TOS value */
    SP++;                /* adjust stack pointer to add space for new TOS */
    *SP=TOS;             /* save TOS value into new TOS */
}

void PLUS(void)
{
    TOS=*SP;
    SP--;
    *SP+=TOS;
}

int main(void)
{
    LIT(5);
    DUP2_1();
    DOT();
    DOT();
    DOT();

    LIT(6);
    DUP2_2();
    DOT();
    DOT();
}

As you can see, both DUP2's are defined first, kept distinct, and then
the respective interactive commands for each DUP2 are called in main(),
thereby preserving the correct control-flow of the program. Obviously,
this will require that you implement a Forth dictionary as a look-up
table to determine the correct DUP2 to use during the conversion ...
I.e., when "6 DUP2" is seen, the look-up will need to correctly return
DUP2_2() for DUP2 instead of DUP2_1(). Depending on how you convert
code for string words, you may also need to filter out strings and
define them first too.

To prevent naming collisions in C, it may make sense to convert all
Forth procedure names to numbers and emit a comment as to what the
original name was. The 'X' is because C requires the first character of a
procedure name to be an alphabetic character or an underscore. A counter
would be incremented upon each new colon-def in Forth in order to
generate the numeric names.

void X0002(void)   /* DUP */
{
    ...
}

void X0003(void)   /* DUP */
{
    ...
}

int main(void)
{
    X0001(5);      /* LIT */
    X0002();       /* DUP */
    ...

    X0001(6);      /* LIT */
    X0003();       /* DUP */
    ...
}

> i.e. how problematic would "nasal demons" be
> in the translated code?

AISI, one operation in Forth which might be an issue is the
CREATE ... DOES> construct. It's somewhat object-oriented. I haven't
attempted to convert DOES> code, either by hand or by program
conversion. However, I suspect the conversion is rather straightforward,
as DOES> basically redirects the list of procedures which are called to
another set.

Another area of problems might be words with different run-time versus
compile-time actions, etc. The word DUP2 in the example obviously
works the same way for both, as will most "primitives". So, I suspect
much of the compile-time actions will be taken care of by the code
conversion to C. The resulting C code obviously should be the run-time
actions for the Forth code, with compile-time actions taken care of in
the code.

I hope that helps. I'm sure that's a rather rudimentary presentation,
but you'll discover and solve other issues as you work on the problem.
So, I'd guess that solving those basic issues above, and implementing
the "primitives" will solve most of the conversion problems, especially
if the high-level words are constructed using "primitives" like eForth.


Rod Pemberton

--
All it takes for humanity to conquer the world's toughest problems
is to hope, to believe, to share, and to do, in cooperation.

Paul Rubin

Apr 3, 2017, 11:10:29 PM4/3/17
to
minf...@arcor.de writes:
> BTW is there any "nasal demon" free programming language specification
> at all?? How do Ada, Haskell, Misra-C, FORTRAN fare here?

Misra-C has the same issues as regular C. Haskell might not have nasal
demons but its specification is somewhat loose. Ada is extremely
solidly specified and I don't know of any UB in it. One of the design
goals of Algol 60 was also to have no UB. ML or something related to it
is completely formally specified. Rust seems to go far compared with C
in avoiding UB but I don't know if UB is completely absent in it.

minf...@arcor.de

Apr 4, 2017, 1:19:36 AM4/4/17
to
Is having a formal specification equivalent to being free of unspecified behaviour?

Altogether it seems somewhat academic. Otherwise all safety-critical system software would have to be programmed in certified languages. And I never heard of ML being used for that.

krishna...@ccreweb.org

Apr 4, 2017, 6:25:29 AM4/4/17
to
On Monday, April 3, 2017 at 3:17:59 AM UTC-5, Andrew Haley wrote:
> krishna...@ccreweb.org wrote:
> > On Sunday, April 2, 2017 at 4:02:26 AM UTC-5, Andrew Haley wrote:
> >> krishna...@ccreweb.org wrote:
> >> >
> >> > Supposedly Java virtual machines have been developed with LLVM.
> >>
> >> It is very difficult, partly for the reasons described: LLVM uses C
> >> semantics for its intermedieate language, so if your language is along
> >> way from C this can be problematic. Optimizing without specific
> >> language semantics is somewhere between hard and impossible.
> >
> > The fundamental question appears to be independent of LLVM:
> >
> > If one were to write a Forth to C translator, how likely is it that
> > the translator would result in C code which is considered undefined
> > in the new C standard, i.e. how problematic would "nasal demons" be
> > in the translated code?
>
> Single-threaded, it wouldn't be problematic at all. The nasal daemons
> tend to happen if you're pushing the envelope of optimization. As
> long as you translated into well-defined C, ...

Let's consider an example. Here is the Forth code:

variable v
0 v !

v @ 1+ 2 + v !

Now this could be translated into C in a couple of ways:

T1)

int v = 0;
v = ++v + 2;

T2)

int v = 0;
v = v++ + 2;

My understanding is that, per the C standard, translation T1 is well-defined, but translation T2 results in undefined code. I'm still trying to wrap my head around sequence points, but understanding why one translation is well-defined while the other is undefined seems to present a challenge for writing a translator.

Krishna

Spiros Bousbouras

Apr 4, 2017, 7:24:18 AM4/4/17
to
No, they are both undefined behaviour. A correct translation is in fact the
most straightforward one , namely

int v = 0;
v++ ;
v += 2 ;

As for wrapping one's head around sequence points, you make it sound as if
it's some complicated topic. It's actually trivial. As I was expecting, there
is a Wikipedia article, en.wikipedia.org/wiki/Sequence_point (not the most
authoritative source, but with a quick look I don't see any obvious errors).
A draft of the C standard can be found online, and it even helpfully collects
all the sequence points in Annex C. And if you have doubts about anything in
there or in the Wikipedia article, you can always ask on comp.lang.c. So if
one has any confusion about sequence points, it must be because they didn't
bother to ask or search in the most obvious places. Obviously, if one wants
to write a Forth (or anything else) to C translator, then they must know
enough C (and Forth!).

--
South Dakota today passed the most restrictive abortion law in the country.
It includes the requirement that pregnant wives notify both their husband
AND the baby's father.
<S1797...@netfunny.com>

krishna...@ccreweb.org

Apr 4, 2017, 7:55:41 AM4/4/17
to
On Tuesday, April 4, 2017 at 6:24:18 AM UTC-5, Spiros Bousbouras wrote:
> On Tue, 4 Apr 2017 03:25:27 -0700 (PDT)
> krishna...@ccreweb.org wrote:
> > On Monday, April 3, 2017 at 3:17:59 AM UTC-5, Andrew Haley wrote:
> > > Single-threaded, it wouldn't be problematic at all. The nasal daemons
> > > tend to happen if you're pushing the envelope of optimization. As
> > > long as you translated into well-defined C, ...
> >
> > Let's consider an example. Here is the Forth code:
> >
> > variable v
> > 0 v !
> >
> > v @ 1+ 2 + v !
> >
> > Now this could be translated into C in a couple of ways:
> >
> > T1)
> >
> > int v = 0;
> > v = ++v + 2;
> >
> > T2)
> >
> > int v = 0;
> > v = v++ + 2;
> >
> > My understanding is that, per the C standard, translation T1 is well-defined,
> > but translation T2 results in undefined code. I'm still trying to wrap my
> > head around sequence points, but understanding why one translation is
> > well-defined while the other is undefined seems to presents a challenge for
> > writing a translator.
>
> No , they are both undefined behaviour. ...

See http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html#637

Translation T1 is supposedly not undefined behavior, even though gcc 4.8.x issues a warning with -Wall.

> Obviously if one wants to write a Forth (or anything else) to C translator then they must know
> enough C (and Forth !).

Agreed.

Krishna

Spiros Bousbouras

Apr 4, 2017, 8:28:34 AM4/4/17
to
On Tue, 4 Apr 2017 04:55:40 -0700 (PDT)
Ok, thank you. It appears it was me who hadn't done his homework; I'm well
familiar with the C99 standard (where both are undefined behaviour) but I
have only briefly skimmed the C11 one. But what I said in my previous post ,
namely "A correct translation is in fact the most straightforward one" ,
still stands. One only needs to understand esoteric stuff if one *must* use
esoteric stuff. But in the Forth code you presented one does not need to use
the distinction between

v = ++v + 2;
and
v = v++ + 2;

to translate the Forth code into C. In fact I can't think of an occasion
where such a distinction would be relevant to practical programming whether
one writes C directly or uses computer translation of something else into C.
Such contrived examples are only useful for clarifying what the standard
specifies.

--
What the liberal elites do now is not moral. It is self-exaltation disguised as
piety. It is part of the carnival act.
http://www.truthdig.com/report/item/donald_trumps_greatest_allies_are_the_liberal_elites_20170305

Albert van der Horst

Apr 4, 2017, 9:37:46 AM4/4/17
to
In article <72a319fa-8035-4a17...@googlegroups.com>,
<krishna...@ccreweb.org> wrote:
>On Monday, April 3, 2017 at 3:17:59 AM UTC-5, Andrew Haley wrote:
>> krishna...@ccreweb.org wrote:
>> > On Sunday, April 2, 2017 at 4:02:26 AM UTC-5, Andrew Haley wrote:
>> >> krishna...@ccreweb.org wrote:
>> >> >
>> >> > Supposedly Java virtual machines have been developed with LLVM.
>> >>
>> >> It is very difficult, partly for the reasons described: LLVM uses C
>> >> semantics for its intermedieate language, so if your language is along
>> >> way from C this can be problematic. Optimizing without specific
>> >> language semantics is somewhere between hard and impossible.
>> >
>> > The fundamental question appears to be independent of LLVM:
>> >
>> > If one were to write a Forth to C translator, how likely is it that
>> > the translator would result in C code which is considered undefined
>> > in the new C standard, i.e. how problematic would "nasal demons" be
>> > in the translated code?
>>
>> Single-threaded, it wouldn't be problematic at all. The nasal daemons
>> tend to happen if you're pushing the envelope of optimization. As
>> long as you translated into well-defined C, ...
>
>Let's consider an example. Here is the Forth code:
>
>variable v
>0 v !
>
>v @ 1+ 2 + v !
>
>Now this could be translated into C in a couple of ways:
>
>T1)
>
>int v = 0;
>v = ++v + 2;
>
>T2)
>
>int v = 0;
>v = v++ + 2;
>

Both are wrong in my book.

v++ implies that v is a pointer
A good translation is

v = v+1+2;

Having a special action for 1+ is silly anyway.
The ++ operator is a remnant of the PDP11 era; there is not a shred
of a reason to use it in the output of a translator.
It mixes address manipulation into an expression. We know that
for a variable v the expression v in Forth is an address.
In C sometimes it is and sometimes it isn't.
The ++ operator optimises that confusion.

1 V +!
could become
v++
but that is stupid because it only works for 1.
I think
12 V +!
should become
v += 12;
and that is about it.
None of the other ++ -- += -= /= etc. symbols need be present in
the c-output.

>My understanding is that, per the C standard, translation T1 is well-defined,
>but translation T2 results in undefined code. I'm still trying to wrap my
>head around sequence points, but understanding why one translation is
>well-defined while the other is undefined seems to present a challenge for
>writing a translator.

That story applies if v is a pointer. I don't go into that,
because the conclusion is already:

Everyone using ++ or -- in the output of a Forth to C translator
is an idiot.

>
>Krishna
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Albert van der Horst

Apr 4, 2017, 10:05:50 AM4/4/17
to
In article <oc07kj$il8$1...@cherry.spenarnc.xs4all.nl>,
But one gives the intended outcome.
------------
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    i=12;
    printf("%d\n",i++);
    i=12;
    printf("%d\n",++i);
    return 0;
}
----------
This gives
12
13

>
>v++ implies that v is a pointer

Oops, that is absolutely not true. I meant to say that
using ++ in an expression normally is done for pointers.
Please ignore that line.

Of course you can use v++ instead of (v) in an expression and have v
changed at the same time, in other words it involves the address of v
too.
Of course you can use ++v instead of (v+1) in an expression and have v
changed at the same time.
If you use that expression then to store it into the v that you're
in the process of changing already, it may or may not work.
To find that out is more of a headache than I want to spend
right now. You know it: "Doctor, it aches when I do this."
"Then don't do that!"

> I don't go into that,
>because the conclusion is already:
>
> Everyone using ++ or -- in the output of a Forth to C translator
> is an idiot.

I stand by that conclusion.

Albert van der Horst

Apr 4, 2017, 10:12:22 AM4/4/17
to
In article <87zifw5...@nightsong.com>,
Also Algol 68 is completely formally specified, using a van Wijngaarden
grammar. This went so far that a program with what in other languages
would lead to an undefined identifier would be an ungrammatical
program. (The infamous two-level grammar. I'm not aware of a
compiler that could take advantage of this.)

Groetjes Albert

Anton Ertl

Apr 4, 2017, 12:27:19 PM4/4/17
to
Spiros Bousbouras <spi...@gmail.com> writes:
>On Tue, 4 Apr 2017 03:25:27 -0700 (PDT)
>krishna...@ccreweb.org wrote:
>> Let's consider an example. Here is the Forth code:
>>
>> variable v
>> 0 v !
>>
>> v @ 1+ 2 + v !
>>
>> Now this could be translated into C in a couple of ways:
>>
>> T1)
>>
>> int v = 0;
>> v = ++v + 2;
>>
>> T2)
>>
>> int v = 0;
>> v = v++ + 2;
>>
>> My understanding is that, per the C standard, translation T1 is well-defined,
>> but translation T2 results in undefined code. I'm still trying to wrap my
>> head around sequence points, but understanding why one translation is
>> well-defined while the other is undefined seems to presents a challenge for
>> writing a translator.
>
>No , they are both undefined behaviour. A correct translation is in fact the
>most straightforward one , namely
>
>int v = 0;
>v++ ;
>v += 2 ;

That's not straightforward. It changes v at every step, unlike the
Forth code. Consider:

v @ 1+ 2 + w !

A correct translation looks very different from your version above.

I think a straightforward translation of Krishna Myneni's code is:

Cell v, s0, s1;
v = 0;
s0 = v;
s0 = s0+1;
s1 = 2;
s0 = s0+s1;
v = s0;

And for my variant, you only need to change the last statement.

Andrew Haley

Apr 4, 2017, 12:29:19 PM4/4/17
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>
>>Nobody is perfect. There are bugs in all large programs. There's
>>nothing special about undefined behaviour: it's just another kind of
>>error.
>
> As I wrote: "well-defined C, which isn't difficult" is bullshit.

Note that "not difficult" wasn't in the OP's question. But "C which
isn't difficult" doesn't smell much better. C is what it is; C is not
an easy language.

>>> 4) Anything to do with treating cells as either addresses or as
>>> integers can easily turn into undefined behaviour. Dealing with
>>> memory can also easily incur undefined behaviour. Implementing Forth
>>> addresses as C constructs without incurring undefined behaviour is
>>> also challenging, and certainly not easy (e.g., how do you implement
>>> ALLOCATE and "-"?).
>>
>>I'd do it like this: Forth cells are unsigned ints, allocated on top

(My error: this should have been "unsigned integers of the same size as
size_t and intptr_t", not "unsigned ints".)

>>of a byte array which is data space. All integer arithmetic is modulo
>>N, where N is the word size. Division has to be handled by careful
>>conversion, to make sure there aren't any overflows. That's a little
>>bit fiddly, but all perfectly practicable.
>>
>>What problem do you forsee with ALLOCATE?
>
> How do you translate it "easily"?

I'd be tempted to allocate all of Forth's data space into a single
char array, and implement ALLOCATE with a custom allocator in that
space. That fits nicely with Forth's flat memory space. But it
certainly could be done with malloc(), although care would have to be
taken not to hit UB comparing pointers to malloc'd data and Forth's
data space.

>>The only integer types seen in memory by C are bytes and unsigned
>>ints.
>
> How do you implement @ and C@?

Simple memory reads. It's legitimate to alias character types and
other types, so a suitably aligned character address can be cast to a
size_t* or a uintptr_t*.

I think it'd probably make the most sense to use uintptr_t for Forth
cells. Of course, this requires a system to be able to cast freely
between uintptr_t and pointer types, but that is implementation-
defined rather than undefined behaviour. Every C compiler I care
about (or, in fact, know about) supports it.

Alternatively, Forth addresses could be integers and used as indexes
into the data array; this might be more convenient, if slightly
slower.

>>Of course it is likely that any program of this complexity will have
>>bugs to begin with, and some of these bugs will probably exhibit
>>undefined behaviour. However, it's then just a matter of fixing the
>>bugs, and you're done. And there's an undefined behaviour checker in
>>GCC to help with that.
>
> And after all this effort (which I wouldn't call "not problematic at
> all" nor "not difficult"; it's not just like a -Wall test, instead you
> have lots of mutually exclusive -fsanitize flags, and then the
> undefined behaviours only show up at run-time with appropriate
> inputs), at the end you have a program with undefined behaviours that
> have not been caught by the checker in GCC.

I think you'd be fine with "-fsanitize=undefined". I agree that it
wouldn't catch everything: there are no magic bullets.

> But sure, if you love nasal demons, you also find that not
> problematic at all.

Andrew.

Albert van der Horst

Apr 4, 2017, 1:11:22 PM4/4/17
to
In article <a-SdnVgDEexEVH7F...@supernews.com>,
Andrew Haley <andr...@littlepinkcloud.invalid> wrote:
<SNIP>
>Simple memory reads. It's legitimate to alias character types and
>other types, so a suitably aligned character address can be cast to a
>size_t* or a uintptr_t*.
>
>I think it'd probably make the most sense to use uintptr_t for Forth
>cells. Of course, this requires a system to be able to cast freely
>between uintptr_t and and pointer types, but that is implementation-
>defined rather than undefined behaviour. Every C compiler I care
>about (or, in fact, know about) supports it.

Casts to a different type are problematic. As I said before,
the proper C idiom is unions. This probably results in fewer
problems, because you tell the C compiler precisely what is going
on.

<SNIP>

>Andrew.

krishna...@ccreweb.org

Apr 4, 2017, 8:57:53 PM4/4/17
to
On Tuesday, April 4, 2017 at 8:37:46 AM UTC-5, Albert van der Horst wrote:
> In article <72a319fa-8035-4a17...@googlegroups.com>,
> <krishna...@ccreweb.org> wrote:
...
> >Let's consider an example. Here is the Forth code:
> >
> >variable v
> >0 v !
> >
> >v @ 1+ 2 + v !
> >
> >Now this could be translated into C in a couple of ways:
> >
> >T1)
> >
> >int v = 0;
> >v = ++v + 2;
> >
> >T2)
> >
> >int v = 0;
> >v = v++ + 2;
> >
>
> Both are wrong in my book.
>

T2 is certainly an incorrect translation and does not have the meaning of the original Forth code. But my point here was not about the correctness of the translation, but about the boundary between well-defined and undefined behavior of C code per the standard.

...
> Having a special action for 1+ is silly anyway.
> The ++ operator is a remnance of the PDP11 era, there is not a shred
> of a reason to use it in output for a translator. ...

Isn't there an INC operation in x86?


> None of the other ++ -- += -= /= etc. symbols need be present in
> the c-output.
>

In effect, you are recommending that all of these operations be stricken from the C language itself. Translating from some source language into C is just a place holder for transcribing a mental model of a computation or algorithm directly into the target language, i.e. for coding in C.


> Everyone using ++ or -- in the output of a Forth to C translator
> is an idiot.
>

I would think the criterion for being an idiot is more general than that, at least with present-day C compilers.

Krishna


Rod Pemberton

Apr 5, 2017, 12:43:37 AM4/5/17
to
On Tue, 4 Apr 2017 17:57:52 -0700 (PDT)
krishna...@ccreweb.org wrote:

> On Tuesday, April 4, 2017 at 8:37:46 AM UTC-5, Albert van der Horst
> wrote:
> > In article <72a319fa-8035-4a17...@googlegroups.com>,
> > <krishna...@ccreweb.org> wrote:

> > Having a special action for 1+ is silly anyway.
> > The ++ operator is a remnance of the PDP11 era, there is not a shred
> > of a reason to use it in output for a translator. ...
>
> Isn't there an INC operation in x86?
>

16-bit. Yes.
32-bit. Yes.
64-bit. No. (It was reassigned for the REX prefix.)

You can see this on Sandpile.org's "1 byte opcodes" page:
http://www.sandpile.org/x86/opc_1.htm

Sandpile.org
"The world's leading source for technical x86 processor information."
http://www.sandpile.org/

> > Everyone using ++ or -- in the output of a Forth to C translator
> > is an idiot.
> >
>
> I would think the criteria for being an idiot is more general than
> that, at least with present day C compilers.
>

Today, adding one or subtracting one optimizes just as well, but isn't
as compact to code.

There is nothing wrong with ++ or -- , AS LONG AS you follow the ANSI C
rules for using them. Of course, almost no one knows or remembers or
uses these rules ...

ANSI C introduced rules for their use via its "sequence points"
concept. As a result, pre- and post-decrement or pre- and
post-increment operations can unexpectedly affect the result of a
computed value when the expression is optimized or compiled.
I.e., ANSI C changed the order of operations, which is no longer
guaranteed to be identical to that of K&R C or earlier.

The ANSI C rules say that you must separate out the usage of pre- and
post-decrement and pre- and post-increment from other C code. These
rules carry over into ISO C, even if not stated.

This, for ANSI C compliance,
v = ++v + 2;

should be rewritten as this:
++v;
v = v + 2;


This, for ANSI C compliance,
v = v++ + 2;

should be rewritten as this:
v = v + 2;
v++;


Of course, these are equivalent:
v += 2;
v = v + 2;


The rules are because of situations like this one:
v += ++v + 2; /* coded */
v = v + ++v + 2; /* equivalent */

If v is 3, is the result 9 or 10?

Historically, the expected result in C is 10. I.e., it's equivalent to:
++v;
v += v + 2;

This is because historically pre-increment was supposed to occur first,
prior to any other operations. However, ANSI C's sequence points
concept meant that the operation could occur anywhere prior to the
semicolon. It doesn't have to explicitly come first. I.e., ANSI C
fiddled with the expected order of operations and completion.

++v first: v=4, then v = 4 + 4 + 2, so v=10; /* pre-ANSI C: ++v occurs first */
v = 3 + 3 + 2 = 8, then ++v, so v=9; /* valid for ANSI C ... */
++v applied to the 2nd read only: v = 3 + 4 + 2 = 9; /* valid for ANSI C ... */

It becomes more complicated with sequence points if multiple increments
or decrements are involved:

v = ++v + ++v + 2 ;

The compiler can apply both pre-increments before, both after, one
before and one after ... as long as the operations are applied prior to
the semicolon.

C also has the rarely used comma ',' operator (sequential evaluation),
which evaluates multiple expressions in order with a sequence point
between them. IMO, the sequence point concept is an abomination, but is
needed for compilers. And, it is not the only mistake the ANSI C
committee introduced into C. They spent years correcting a number of
other serious mistakes.

minf...@arcor.de

unread,
Apr 5, 2017, 3:00:45 AM4/5/17
to
So what the heck? Your contrived examples violate the basic software engineering rule not to write obscure code.

C and Assembler are well suited to shoot one's own foot. But that does not imply that you have to protect the shootist. Forth is full of unspecified corners as well. No reason for language talibanism.

peter....@gmail.com

unread,
Apr 5, 2017, 8:29:27 AM4/5/17
to
On Wednesday, 5 April 2017 06:43:37 UTC+2, Rod Pemberton wrote:
> On Tue, 4 Apr 2017 17:57:52 -0700 (PDT)
> krishna...@ccreweb.org wrote:
>
> > On Tuesday, April 4, 2017 at 8:37:46 AM UTC-5, Albert van der Horst
> > wrote:
> > > In article <72a319fa-8035-4a17...@googlegroups.com>,
> > > <krishna...@ccreweb.org> wrote:
>
> > > Having a special action for 1+ is silly anyway.
> > > The ++ operator is a remnant of the PDP11 era, there is not a shred
> > > of a reason to use it in output for a translator. ...
> >
> > Isn't there an INC operation in x86?
> >
>
> 16-bit. Yes.
> 32-bit. Yes.
> 64-bit. No. (It was reassigned for the REX prefix.)

It also works perfectly well in 64-bit mode:

inc rax produces 48 FF C0

It is not a one-byte opcode, but it is still shorter than

add rax, 1 which produces 48 83 C0 01

Peter
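The encodings Peter quotes can be reproduced by hand. A small sketch, using the usual x86-64 register numbering (rax=0 ... r15=15); the function name is mine:

```python
def enc_inc_r64(reg):
    """Encode 'inc <r64>' as REX.W + FF /0 + ModRM (mod=11, reg=/0, rm)."""
    rex = 0x48 | ((reg >> 3) & 1)        # REX.W, plus REX.B for r8-r15
    modrm = 0xC0 | (0 << 3) | (reg & 7)  # mod=11, reg field holds the /0 extension
    return bytes([rex, 0xFF, modrm])

print(enc_inc_r64(0).hex())   # inc rax -> 48ffc0
print(enc_inc_r64(8).hex())   # inc r8  -> 49ffc0
```

Note that even `inc r8` stays at three bytes; only the one-byte 40h-47h forms were lost to REX.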

krishna...@ccreweb.org

unread,
Apr 5, 2017, 11:47:53 AM4/5/17
to
On Wednesday, April 5, 2017 at 2:00:45 AM UTC-5, minf...@arcor.de wrote:
...
> C and Assembler are well suited to shoot one's own foot. But that does not imply that you have to protect the shootist. Forth is full of unspecified corners as well. No reason for language talibanism.

You miss an important distinction between C and assembler -- the ability to predict the actual instructions executed by the processor and the order in which they are executed. One can look at assembly language code and predict exactly which machine instructions will be executed and in what order. That ability is gone with optimized C compilers which are given license to eliminate and/or reorder source code instructions during compilation. C is very different from assembly language in this sense. With Forth, because of its stack-based nature, I think the predictability of run-time execution of source code is much higher than it is for C, but perhaps I'm wrong on this point.

Incidentally, I always compile my Forth system, written in a mix of C++, C, and assembly language, at the lowest optimization level (-O0) for the C and C++ source files. The assembler of course has no optimization settings (of which I'm aware). The speed of my system is largely independent of the C/C++ compiler optimization anyway since the bulk of the dictionary is implemented in assembly code.

Krishna

Andrew Haley

unread,
Apr 5, 2017, 12:04:31 PM4/5/17
to
krishna...@ccreweb.org wrote:
> On Wednesday, April 5, 2017 at 2:00:45 AM UTC-5, minf...@arcor.de wrote:
> ...

>> C and Assembler are well suited to shoot one's own foot. But that
>> does not imply that you have to protect the shootist. Forth is full
>> of unspecified corners as well. No reason for language talibanism.
>
> You miss an important distinction between C and assembler -- the
> ability to predict the actual instructions executed by the processor
> and the order in which they are executed. One can look at assembly
> language code and predict exactly which machine instructions will be
> executed and in what order.

Not necessarily, no. There are lots of things one can do in assembly
code which lead to behaviour you or I, the programmer, can't predict.
It happens to me all the time. All you have to do is overwrite some
critical register or forget to initialize it. Or forget to insert a
memory fence instruction. That's one of the favourites because it
causes rare and hard-to-find bugs. Or, or...

> That ability is gone with optimized C compilers which are given
> license to eliminate and/or reorder source code instructions during
> compilation. C is very different from assembly language in this
> sense.

I don't think there's really any difference.

Andrew.

john

unread,
Apr 5, 2017, 1:05:08 PM4/5/17
to
In article <CZednRNE7OwXiHjF...@supernews.com>,
andr...@littlepinkcloud.invalid says...

>
> > That ability is gone with optimized C compilers which are given
> > license to eliminate and/or reorder source code instructions during
> > compilation. C is very different from assembly language in this
> > sense.
>
> I don't think there's really any difference.
>
> Andrew.

I seem to remember when I was looking at assemblers that fasm does
multiple passes to perform some optimisation. Perhaps a fasm user could
confirm that?


--

john

=========================
http://johntech.co.uk

"Bleeding Edge Forum"
http://johntech.co.uk/forum/

=========================

Paul Rubin

unread,
Apr 5, 2017, 1:25:37 PM4/5/17
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> I think you'd be fine with "-fsanitize=undefined". I agree that it
> wouldn't catch everything: there are no magic bullets.

http://blog.regehr.org/archives/523 might help some.

Anton Ertl

unread,
Apr 5, 2017, 1:31:43 PM4/5/17
to
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>
>>>Nobody is perfect. There are bugs in all large programs. There's
>>>nothing special about undefined behaviour: it's just another kind of
>>>error.
>>
>> As I wrote: "well-defined C, which isn't difficult" is bullshit.
>
>Note that "not difficult" wasn't in the OP's question.

Not in these words. Anyway, you found it necessary to make that claim
in your answer.

>But "C which
>isn't difficult" doesn't smell much better. C is what it is; C is not
>an easy language.

I know it does not smell much better to you. That's because the only
C that you accept as C is "well-defined C"; for everything else you
pretend that you don't know what it means.

Translating Forth into traditional C (that compiles as intended by PCC
and early versions of GCC) is indeed not difficult; after all,
traditional C and Forth are semantically close to each other. Take a
look at Gforth or forth2c to get an idea on how to do it.

The resulting code is, of course, full of undefined behaviour, like
most of the other C code around; and in this case I think it's
impossible to get rid of the undefined behaviour without losing
functionality and performance, and even if you are willing to pay that
price, it requires significantly more effort, and in the end, you will
probably still have some undefined behaviour somewhere.

In general, programming is not easy, but there are different levels of
difficulty:

1) Programmers tend to be able to satisfy the requirements that the
users use a lot, and/or care about a lot. Corner cases, less so (and
in the Forth community, some people celebrate being wrong in corner
cases, no matter how easy it would have been to do the right thing;
actually, that's very similar to the attitude that gives us nasal
demons).

2) Programmers are pretty bad at dealing with cases that do not occur
during the course of normal operations. That's why security
vulnerabilities occur so often and are so long-lived. Writing tests
for unusual cases helps a little, but only shines a light on the cases
that are tested. There is no guarantee that tests cover everything.
Even with automated fuzz testing these days, the amount of automation
is limited: a specialized setup for a certain system call or something
caught things that the general fuzz tester had not caught.

3) Complete correctness proofs of programs are outside the competence
of nearly all programmers (and pretty pointless, because they would
prove only the correctness wrt the specification, which itself may be
buggy).

Now, to write a C program that is guaranteed to never exercise
undefined behaviour requires competences on the level 3. The
much-hyped sanitizers of GCC and Clang allow finding some (not all)
undefined behaviours using the well-established testing methodology.

So you will find undefined behaviours in the cases exercised by normal
users, but you will tend not to find undefined behaviour in those
cases exercised by an attacker; it's as difficult as finding all
the vulnerabilities in the first place. E.g., in the OpenSSH case, if
you thought about the vulnerability, you would fix it right away,
rather than writing a test case that exploited it, and waiting for the
sanitizer to report that freed memory was accessed (if the sanitizer
actually does check that), and then fixing it.

>>>> 4) Anything to do with treating cells as either addresses or as
>>>> integers can easily turn into undefined behaviour. Dealing with
>>>> memory can also easily incur undefined behaviour. Implementing Forth
>>>> addresses as C constructs without incurring undefined behaviour is
>>>> also challenging, and certainly not easy (e.g., how do you implement
>>>> ALLOCATE and "-"?).
>>>
>>>I'd do it like this: Forth cells are unsigned ints, allocated on top
>
>(My error: this should have been "unsigned integers of the same size as
>size_t and intptr_t" not "unsigned ints".)
>
>>>of a byte array which is data space. All integer arithmetic is modulo
>>>N, where N is the word size. Division has to be handled by careful
>>>conversion, to make sure there aren't any overflows. That's a little
>>>bit fiddly, but all perfectly practicable.
>>>
>>>What problem do you foresee with ALLOCATE?
>>
>> How do you translate it "easily"?
>
>I'd be tempted to allocate all of Forth's data space into a single
>char array, and implement ALLOCATE with a custom allocator in that
>space.

That's "easy" in your book? The resulting Forth system would be
severely restricted: E.g., you cannot use it to call system calls like
mmap() and make use of the results (but then, that's undefined
in the C standard anyway).

> That fits nicely with Forth's flat memory space. But it
>certainly could be done with malloc(), although care would have to be
>taken not to hit UB comparing pointers to malloc'd data and Forth's
>data space.

Right. Anything but "not difficult"; I don't think it's possible.
Subtraction is also a problem.

>> How do you implement @ and C@?
>
>Simple memory reads.

What C code?

>It's legitimate to alias character types and
>other types, so a suitably aligned character address can be cast to a
>size_t* or a uintptr_t*.

That contradicts things I have read about accessing casted pointers.

>I think it'd probably make the most sense to use uintptr_t for Forth
>cells. Of course, this requires a system to be able to cast freely
>between uintptr_t and and pointer types, but that is implementation-
>defined rather than undefined behaviour. Every C compiler I care
>about (or, in fact, know about) supports it.

It's interesting how quick you come off your high horse once things
like writing actual C code come into play. "It works as intended on
the compilers and/or machines the programmer used" has been my
yardstick for determining the correctness of programs all along, yet
nasal demon fans insist that most (possibly all) of these programs are
"buggy".

>Alternatively, Forth addresses could be integers and used as indexes
>into the data array; this might be more convenient, if slightly
>slower.

Convenient in what way? Win32Forth once used relative pointers to
achieve relocatability. They soon returned to absolute pointers.

>>>Of course it is likely that any program of this complexity will have
>>>bugs to begin with, and some of these bugs will probably exhibit
>>>undefined behaviour. However, it's then just a matter of fixing the
>>>bugs, and you're done. And there's an undefined behaviour checker in
>>>GCC to help with that.
>>
>> And after all this effort (which I wouldn't call "not problematic at
>> all" nor "not difficult"; it's not just like a -Wall test, instead you
>> have lots of mutually exclusive -fsanitize flags, and then the
>> undefined behaviours only show up at run-time with appropriate
>> inputs), at the end you have a program with undefined behaviours that
>> have not been caught by the checker in GCC.
>
>I think you'd be fine with "-fsanitize=undefined". I agree that it
>wouldn't catch everything: there are no magic bullets.

I.e., it's not "not difficult" to write "well-defined C".

Looking at
<https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html>, I
see

|-fsanitize=all is not allowed, as some sanitizers cannot be used together.

Alex

unread,
Apr 5, 2017, 1:42:55 PM4/5/17
to
On 4/5/2017 16:01, Anton Ertl wrote:
> Convenient in what way? Win32Forth once used relative pointers to
> achieve relocatability. They soon returned to absolute pointers.

The decision was mine, and the reasons were pretty simple.

1. Relocation was a requirement of the C wrapper that loaded the Forth
image by allocating a lump of storage but with no guarantee of getting
the same address every time. With a proper EXE format for the image,
absolute addresses were possible, and the Windows loader is instructed
to use a specific load address (normally $400000).

2. Every external access to the OS required those parameters that were
pointers to be relocated/derelocated on each access. It is painful
coding with REL>ABS and ABS>REL.

3. Run time relocation required the permanent reservation of a register.
The x86 in 32 bit mode is not overendowed with them.

--
Alex

minf...@arcor.de

unread,
Apr 5, 2017, 3:16:26 PM4/5/17
to
Using -O0 on an optimizing compiler is certainly a good way to reduce optimization bugs. ;-) OTOH with one assembler you are stuck with one processor; you lose portability. Luckily most CPU vendors maintain backward compatibility between CPU generations, notably the x86 line since the 90s. But as others have pointed out, even then you don't always get the same opcodes, and of course different cycle timings. Things get even worse when you have to transcode between non-compatible CPUs. So I don't follow your argument that assembler coding will generate more predictable results.

hughag...@gmail.com

unread,
Apr 5, 2017, 4:04:16 PM4/5/17
to
On Wednesday, April 5, 2017 at 10:05:08 AM UTC-7, john wrote:
> In article <CZednRNE7OwXiHjF...@supernews.com>,
> andr...@littlepinkcloud.invalid says...
>
> >
> > > That ability is gone with optimized C compilers which are given
> > > license to eliminate and/or reorder source code instructions during
> > > compilation. C is very different from assembly language in this
> > > sense.
> >
> > I don't think there's really any difference.
> >
> > Andrew.
>
> I seem to remember when I was looking at assemblers that fasm does
> multiple passes to perform some optimisation. Perhaps a fasm user could
> confirm that?

FASM is multi-pass --- FASM does not eliminate and/or reorder source-code instructions --- the multiple passes are mostly for reducing Jxx operand size.

My MFX included an assembler. I did reorder the source-code instructions significantly to maximize parallelization --- but I did so in a manner that would guarantee that the program did the same thing as if the instructions were executed sequentially in the same order that they appeared in the source-code.

Eliminating instructions is an extremely bad idea --- it assumes that the programmer is an idiot who writes needless instructions --- there are only two people in the world who know how to program in MFX assembly (myself and my coworker), neither of whom are idiots (in my estimation, although I'm obviously biased on the matter).

Walter Banks

unread,
Apr 5, 2017, 5:27:40 PM4/5/17
to
Your point is well taken, but MISRA-C significantly reduces many of the
C language issues.

A second point is that many automotive applications are written using
implementation rules that are a subset of MISRA-C.

There are implementation process standards for most industries that
require reliable application implementation.

This is true for automotive, medical devices, and aviation. It is
possible to make applications reliable independent of the implementation
languages.

One of the simplest ways of making applications reliable is having an
application beta tested for a significant amount of time on a real
application following the last change before release.

Automotive engine controllers require two years data after the internal
release until the code is shipped.

w..

Paul Rubin

unread,
Apr 5, 2017, 9:20:30 PM4/5/17
to
john <an...@example.com> writes:
> I seem to remember when I was looking at assemblers that fasm does
> multiple passes to perform some optimisation. Perhaps a fasm user could
> confirm that?

I don't know about fasm, but there are instruction sets where branch
instructions can be encoded more compactly if the branch target is
within a certain distance. But that distance might depend on the
encodings of intervening branch instructions, so it can take a few
iterations to get the shortest possible encoding.
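The iteration Paul describes reaches a fixed point because branch sizes only ever grow. A toy model (instruction names, sizes, and the 2-byte/5-byte split are invented for illustration; short means the displacement fits in a signed byte):

```python
def size_branches(instrs):
    """instrs: list of ('op', length) or ('br', target_index).
    Returns {index: size} for branches (2 = short, 5 = long)."""
    sizes = {i: 2 for i, ins in enumerate(instrs) if ins[0] == 'br'}  # optimistic
    while True:
        # compute each instruction's offset under the current size guesses
        offs, pos = [], 0
        for i, ins in enumerate(instrs):
            offs.append(pos)
            pos += sizes[i] if ins[0] == 'br' else ins[1]
        changed = False
        for i, ins in enumerate(instrs):
            if ins[0] != 'br' or sizes[i] == 5:
                continue
            disp = offs[ins[1]] - (offs[i] + sizes[i])
            if not -128 <= disp <= 127:
                sizes[i] = 5   # widen; this may push other branches long too
                changed = True
        if not changed:        # stable: every short branch still fits
            return sizes

prog = [('br', 3), ('op', 100), ('op', 100), ('op', 1)]
print(size_branches(prog))   # the 200-byte displacement forces the long form
```

Starting from "all short" and only widening yields the minimal stable assignment, which is why a few passes suffice.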

anti...@math.uni.wroc.pl

unread,
Apr 5, 2017, 9:36:49 PM4/5/17
to
hughag...@gmail.com wrote:
> On Sunday, April 2, 2017 at 1:15:22 PM UTC-7, Alex wrote:
> >
> > Early in the development of LLVM there was this; Stacker. It was very
> > Forth like, but compiled and no interpretation.
> >
> > http://web.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/docs/Stacker.html
>
> This is primarily why I'm not interested in LLVM --- because it has to be cross-compiled into --- you can't have the traditional interactive Forth development system.

Apparently you do not understand how LLVM works. You can use LLVM
as a traditional compiler. But you can also use it for run-time
code generation. Basically, you type some input to your Forth,
your Forth translates it and calls LLVM to generate code. LLVM
puts machine code in a buffer and you can immediately call this
code. In other words once you pressed Enter you see results
of running your code -- as interactive as it can be.

Several big name products that used to contain their own
interpreters now use LLVM to get faster code for interactive
use.

--
Waldek Hebisch
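The loop Waldek describes (type a definition, get callable native code back before the prompt returns) can be mimicked in miniature. A hedged sketch that uses Python's own compile()/exec() as a stand-in for handing LLVM an IR module and asking its JIT for a function pointer; everything here is illustrative:

```python
# Each Forth-style definition is translated to source, compiled at run
# time, and is callable the moment "Enter" is pressed -- the same shape
# as LLVM putting machine code in a buffer for immediate use.
def compile_word(name, body_src):
    src = f"def {name}(stack):\n"
    for line in body_src.splitlines():
        src += f"    {line}\n"
    ns = {}
    exec(compile(src, f"<word:{name}>", "exec"), ns)   # the "JIT" step
    return ns[name]

# : SQUARE  DUP * ;  hand-translated into stack operations
square = compile_word("square",
                      "stack.append(stack[-1])\n"
                      "b = stack.pop(); a = stack.pop()\n"
                      "stack.append(a * b)")
s = [7]
square(s)
print(s)   # [49]
```

The interactivity is preserved; only the code generator behind the prompt changes.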

krishna...@ccreweb.org

unread,
Apr 5, 2017, 11:14:28 PM4/5/17
to
On Wednesday, April 5, 2017 at 11:04:31 AM UTC-5, Andrew Haley wrote:
> krishna...@ccreweb.org wrote:
> > On Wednesday, April 5, 2017 at 2:00:45 AM UTC-5, minf...@arcor.de wrote:
> > ...
>
> >> C and Assembler are well suited to shoot one's own foot. But that
> >> does not imply that you have to protect the shootist. Forth is full
> >> of unspecified corners as well. No reason for language talibanism.
> >
> > You miss an important distinction between C and assembler -- the
> > ability to predict the actual instructions executed by the processor
> > and the order in which they are executed. One can look at assembly
> > language code and predict exactly which machine instructions will be
> > executed and in what order.
>
> Not necessarily, no. There are lots of things one can do in assembly
> code which lead to behaviour you or I, the programmer, can't predict.
> It happens to me all the time. All you have to do is overwite some
> critical register or forget to initialize it. Or forget to insert a
> memory fence instruction. That's one of the favourites because it
> causes rare and hard-to-find bugs. Or, or...
>

Given the full initial state of the processor+memory at a given instant, prior to executing a finite set of instructions, expressed as a set of source code lines in assembly, it is possible to simulate the executed sequence of instructions and the successive states of the processor+memory. I'm not talking about solving the halting problem. All I'm saying is that the assembly code source permits such a trace of the system.

Now, given a specific C compiler and a specific set of lines of C code, it is also possible to do the same, but only by knowing its optimization rules in full detail, or by examining the intermediate assembly code. Two different C compilers on the same system, or a single compiler at different optimization levels, will generate a different set of machine instructions. Two different assemblers on the same system, with the same assembly source, and allowing for syntax differences, better generate the same sequence of machine instructions, except for maybe minor differences such as near or long jumps. It is in that sense that I'm saying assembly language source is more predictable than C source code.

> > That ability is gone with optimized C compilers which are given
> > license to eliminate and/or reorder source code instructions during
> > compilation. C is very different from assembly language in this
> > sense.
>
> I don't think there's really any difference.
>

It's a matter of trusting the compiler will code what you ask it to do, and understanding the cases of undefined behavior when the compiler may not generate expected code. It's not clear to me that mere mortals can keep track of all of the cases of UB in the C standard, and the latitude given to compilers to do different things in those cases. Certainly coding guidelines, e.g. NASA's C coding guidelines (ref. 1) can help avoid such difficulties. But, it's also clear that mere mortal programmers may be easily confused. See the old thread on comp.std.c (ref. 2). Even some of the experts had problems with the interpretation of the standard; but, to be fair, that's an old thread.

I intend to continue using the GNU C++ and C compilers, but with -O0 and -Wall flags turned on. Some of the code it generates at -O0 is fairly ugly, but I expect it's reliable when the source is not UB.

Krishna

1. http://sdtimes.com/nasas-10-rules-developing-safety-critical-code/

2. https://groups.google.com/d/msg/comp.std.c/9lYbeLVaZCA/__behqP09dIJ



krishna...@ccreweb.org

unread,
Apr 5, 2017, 11:27:23 PM4/5/17
to
On Wednesday, April 5, 2017 at 2:16:26 PM UTC-5, minf...@arcor.de wrote:
...
> Using -O0 on an optimizing compiler is certainly a good way to reduce optimization bugs. ;-) OTOH with one assembler you are stuck to one processor, you lose portability.

True. That's one of the drawbacks of assembly language, that you're tied to one set of instructions.

> Luckily most CPU vendors follow down-compatibility between CPU generations, notably then x86 line since the 90s. But as others have pointed out even then you don't always get the same opcodes and of course different cycle timings.

Predictability at the clock cycle level is not achievable unless all hardware and software interrupts are disabled. However, predictability of the executed instruction sequence, or being able to simulate it just from knowing the assembler source, without regard to the implementation of the assembler is far more probable than for C source code, without knowing about the C compiler implementation.

> Things get even worse when you have to transcode between different non-compatible CPUs. So I don't follow your argument that assembler coding will generate more predictable results.

Please see my response to Andrew above.

Krishna

minf...@arcor.de

unread,
Apr 6, 2017, 1:00:58 AM4/6/17
to
I've never heard of a Forth system with a JIT compiler, LLVM-based or otherwise. Seems a software monstrosity to me.

The other way round makes more sense: a JIT compiler spitting out Forth code. ;-)

Alex

unread,
Apr 6, 2017, 4:35:28 AM4/6/17
to
On 4/3/2017 23:25, hughag...@gmail.com wrote:
> On Monday, April 3, 2017 at 2:52:20 PM UTC-7, Alex wrote:
>> On 4/3/2017 17:45, hughag...@gmail.com wrote:
>>> Performance.
>>
>> So you think that most of the time in Forth is spent compiling?
>
> run-time performance
>

I just saw this. Clueless.

--
Alex

Alex

unread,
Apr 6, 2017, 4:52:52 AM4/6/17
to
Why is it a monster? An ITC is an example of an intermediate
representation of the code that's eventually executed, rather than
compiled & executed; but it could be another intermediate that's
compiled and executed just as readily.

>
> The other way round makes more sense: a JIT compiler spitting out
> Forth code. ;-)
>

Again, why? A stack machine might benefit having Forth as an
intermediate language (IL) but it makes many activities hard to do
efficiently in a register based machine. The only reason folks generate
C as an IL is the ready availability of compilers, most of which
generate pretty decent code (nasal demons acknowledged).

--
Alex

Andrew Haley

unread,
Apr 6, 2017, 5:01:35 AM4/6/17
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>>>
>>>>Nobody is perfect. There are bugs in all large programs. There's
>>>>nothing special about undefined behaviour: it's just another kind of
>>>>error.
>>>
>>> As I wrote: "well-defined C, which isn't difficult" is bullshit.
>>
>>Note that "not difficult" wasn't in the OP's question.
>
> Not in these words. Anyway, you found it necessary to make that claim
> in your answer.
>
>>But "C which isn't difficult" doesn't smell much better. C is what
>>it is; C is not an easy language.
>
> I know it does not smell much better to you. That's because the only
> C that you accept as C is "well-defined C"; for everything else you
> pretend that you don't know what it means.

Au contraire, I know that it does not mean anything as a C program.
It might mean something on an actual implementation, running on
particular hardware, if you decide that such an operational definition
satisfies you. But we must separate our concerns: language
specifications exist independently of implementations.

> Now, to write a C program that is guaranteed to never exercise
> undefined behaviour requires competences on the level 3.

That is precisely the same problem as writing a program in any
language which is guaranteed to be correct.

>>I'd be tempted to allocate all of Forth's data space into a single
>>char array, and implement ALLOCATE with a custom allocator in that
>>space.
>
> That's "easy" in your book?

No. But it's not a huge part of implementing a Forth-to-C translator
for the entire standard Forth language and all of the optional
wordsets.

> The resulting Forth system would be severely restricted: E.g., you
> cannot use it to call system calls like mmap() and make use of the
> results (but then, that's undefined in the C standard anyway).

You can't call mmap() from standard Forth or standard C anyway. But
if POSIX is allowed, a lot more possibilities become available. With
POSIX it would be a lot easier: you could mmap() a single flat space
for all of Forth's memory and remap parts of that as required.

> Right. Anything but "not difficult"; I don't think it's possible.

It is possible. You'd have to modify @ and ! to know about the
different spaces.

>>> How do you implement @ and C@?
>>
>>Simple memory reads.
>
> What C code?

I'm not going to write the damn thing for you.

>>It's legitimate to alias character types and other types, so a
>>suitably aligned character address can be cast to a size_t* or a
>>uintptr_t*.
>
> That contradicts things I have read about accessing casted pointers.

Then what you have read is wrong. You can't cast between pointers to
incompatible types and access the data.

>>I think it'd probably make the most sense to use uintptr_t for Forth
>>cells. Of course, this requires a system to be able to cast freely
>>between uintptr_t and and pointer types, but that is implementation-
>>defined rather than undefined behaviour. Every C compiler I care
>>about (or, in fact, know about) supports it.
>
> It's interesting how quick you come off your high horse once things
> like writing actual C code come into play. "It works as intended on
> the compilers and/or machines the programmer used" has been my
> yardstick for determining the correctness of programs all along, yet
> nasal demon fans insist that most (possibly all) of these programs
> are "buggy".

The question was unambiguously limited to undefined behaviour. It was
not about translating Forth into totally portable C. Using
implementation-defined behaviour is just fine. But implementation-
defined behaviour does not mean "it seemed to work the last time I
tried it."

Andrew.

Rod Pemberton

unread,
Apr 6, 2017, 5:22:27 AM4/6/17
to
On Wed, 5 Apr 2017 00:00:43 -0700 (PDT)
minf...@arcor.de wrote:

> So what the heck? Your contrived examples violate basic software
> engineering rules to not write obscure code.

My examples? Most were elsewhere in the thread, perhaps by Krishna.

> C and Assembler are well suited to shoot one's own foot.

Do they program robots in C or Assembly? They might use Forth for its
"real-time" capability ... So, you might be able to shoot your own
foot, literally, with Forth, by using a robot.

Rod Pemberton

unread,
Apr 6, 2017, 5:23:11 AM4/6/17
to
Of course, you use -O0 with GCC specifically because doing so allows
you to "predict the actual instructions executed by the processor and
the order in which they are executed." Correct? So, if you can
disable assembly optimizations for C and the result is approximately
the expected one-to-one translation of C to assembly, does the
aforementioned important distinction between C and assembler actually
exist? ...

Andrew Haley

unread,
Apr 6, 2017, 5:44:36 AM4/6/17
to
Rod Pemberton <NeedNotR...@xrsevnneqk.cem> wrote:
>
> Do they program robots in C or Assembly? They might use Forth for
> it's "real-time" capability ... So, you might be able to shoot your
> own foot, literally, with Forth, by using a robot.

Probably. I have worked on a system which had the property where if I
made a mistake in the control software, boiling water was thrown
across the room. That concentrated the mind, I can tell you.

Andrew.

Alex

unread,
Apr 6, 2017, 5:57:58 AM4/6/17
to
I've had co-workers like that.

--
Alex

krishna...@ccreweb.org

unread,
Apr 6, 2017, 6:26:40 AM4/6/17
to
On Thursday, April 6, 2017 at 4:23:11 AM UTC-5, Rod Pemberton wrote:
> On Wed, 5 Apr 2017 08:47:51 -0700 (PDT)
> krishna...@ccreweb.org wrote:
>
...
> > One can look at assembly
> > language code and predict exactly which machine instructions will be
> > executed and in what order. That ability is gone with optimized C
> > compilers which are given license to eliminate and/or reorder source
> > code instructions during compilation. C is very different from
> > assembly language in this sense. With Forth, because of its
> > stack-based nature, I think the predictability of run-time execution
> > of source code is much higher than it is for C, but perhaps I'm wrong
> > on this point.
> >
> > Incidentally, I always compile my Forth system, written in a mix of
> > C++, C, and assembly language, at the lowest optimization level (-O0)
> > for the C and C++ source files. The assembler of course has no
> > optimization settings (of which I'm aware). The speed of my system is
> > largely independent of the C/C++ compiler optimization anyway since
> > the bulk of the dictionary is implemented in assembly code.
> >
>
> Of course, you use -O0 with GCC specifically because doing so allows
> you to "predict the actual instructions executed by the processor and
> the order in which they are executed." Correct? So, if you can
> disable assembly optimizations for C and the result is approximately
> the expected one-to-one translation of C to assembly, does the
> aforementioned important distinction between C and assembler actually
> exist? ...
>

Yes, the distinction still exists at the source code level comparison. With no optimizations for the C compiler (-O0), you have a better chance of obtaining object code that maps well to your source without omission or resequencing. But it is still not guaranteed because of cases where the C source is considered undefined behavior in the standard. Then, the C compiler can compile any damn thing it wants into the object code. The -Wall flag at least gives one a chance to see if the C compiler detects undefined behavior, but the compiler is not required to report anything, and possibly might not be able to when the optimizer is turned on.

When I write C code, I don't want to play chess with the compiler. I want it to compile what I ask it to compile, or tell me there's a problem. With optimizations, the compiler is substituting its judgement about the need for the code, and the order in which execution happens, over your judgement. An assembler will compile assembly source code with exactly the instructions you wrote in the source. No nasal demons. Your assembly code may not make any sense, and might result in a processor exception -- the assembler can't check for that.


minf...@arcor.de

unread,
Apr 6, 2017, 7:05:08 AM4/6/17
to
Good that we are not talking about optimising Forth compilers. ;-)))

Andrew Haley

unread,
Apr 6, 2017, 7:50:13 AM4/6/17
to
minf...@arcor.de wrote:

> Good that we are not talking about optimising Forth compilers. ;-)))

They exist, and some are pretty good. What's not to like?

Andrew.

Spiros Bousbouras

unread,
Apr 6, 2017, 8:33:45 AM4/6/17
to
Since this has become about C now , I'm crossposting to comp.lang.c
and setting follow-ups to the same.

On Wed, 5 Apr 2017 20:14:27 -0700 (PDT)
krishna...@ccreweb.org wrote:
> It's a matter of trusting the compiler will code what you ask it to do,
> and understanding the cases of undefined behavior when the compiler may
> not generate expected code. It's not clear to me that mere mortals can
> keep track of all of the cases of UB in the C standard, and the latitude
> given to compilers to do different things in those cases.

I'm a mere mortal and I can certainly keep track of UB relevant to the code
I write. In most cases it's actually common sense. For example I don't need
to consult the C standard to know that trying to read beyond the bounds of
an array will probably not result in anything useful because , even *conceptually*
(i.e. without referring to the standard of any specific language) , it doesn't
make sense. Perhaps the confusion arises because some people when programming
in C are not thinking in terms of C but in terms of assembly. So they translate
mentally the C code in some generic assembly instructions and then try to decide
if the C code does something meaningful based on what those hypothetical assembly
instructions will do.

Example : the expression array[i] where i is beyond the end of the array.

Thinking in terms of C : Not defined which means you aren't going to get anything
useful so you shouldn't use it. Pondering or testing what's the most crazy thing
which might happen , may be a bit of fun every now and again but is not relevant to
practical programming.

Thinking in terms of assembly : "array[i] will be translated into an assembly
instruction which calculates an address of the form base + i * constant .
This will always point into some address in memory so I should get some value
or at least a segmentation fault ; so not a totally unpredictable result."
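
A minimal sketch of the "thinking in terms of C" style, for what it's
worth: do the bounds check in C, so that array[i] is simply never
evaluated out of range. (The function and names below are mine, purely
illustrative.)

```c
#include <stddef.h>   /* size_t */

/* Defined behaviour by construction: the range check happens before
   array[i] can ever be evaluated with an out-of-range index. */
static int lookup(const int *array, size_t len, size_t i, int fallback)
{
    if (i >= len)
        return fallback;   /* out of range: never touch array[i] */
    return array[i];
}
```

With int table[4] = {10, 20, 30, 40}; a call lookup(table, 4, 2, -1)
yields 30 and lookup(table, 4, 9, -1) yields the fallback, with nothing
left to the whims of the address arithmetic.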

Note first that not thinking in terms of C is more complicated and requires more
knowledge than thinking in terms of C. Second , if one wants to think in terms of
assembly then , from a practical point of view , it makes sense to program in
assembly , not C.


> Certainly coding guidelines, e.g. NASA's C coding guidelines (ref. 1) can
> help avoid such difficulties. But, it's also clear that mere mortal
> programmers may be easily confused. See the old thread on comp.std.c (ref. 2).
> Even some of the experts had problems with the interpretation of the standard;
> but, to be fair, that's an old thread.

[ References are
1. http://sdtimes.com/nasas-10-rules-developing-safety-critical-code
and
2. https://groups.google.com/d/msg/comp.std.c/9lYbeLVaZCA/__behqP09dIJ
]

Regarding the NASA guidelines , they are unsuitable for a lot of desktop software.
For example "Do not use dynamic memory allocation after initialization." .How
are you going to implement for example the Unix sort utility without using
realloc() (depending on the size of the input) ?
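
(For completeness, the grow-on-demand pattern such a utility needs is
short; the struct and function names here are mine, just a sketch:)

```c
#include <stdlib.h>

/* Grow-on-demand pointer array, the pattern sort(1) needs: the final
   size depends on the input, so it cannot be allocated once at
   initialization.  Capacity doubles, making pushes amortized O(1). */
struct lines {
    const char **v;
    size_t len, cap;
};

static int lines_push(struct lines *ls, const char *s)
{
    if (ls->len == ls->cap) {
        size_t ncap = ls->cap ? 2 * ls->cap : 16;
        const char **nv = realloc(ls->v, ncap * sizeof *nv);
        if (nv == NULL)
            return -1;   /* old block is still valid on failure */
        ls->v = nv;
        ls->cap = ncap;
    }
    ls->v[ls->len++] = s;
    return 0;
}
```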

Your second reference has
char *index;
....
*index++ = toupper(*index);

.He doesn't explain why he decided to write code like this as opposed to
*index = toupper(*index);
index++ ;

.The latter seems to me more readable. So unless he had good reasons to prefer the
first version then worrying whether it's undefined behaviour is irrelevant for
programming ; one simply shouldn't write code like this. Do you have an example
of C code which satisfies the following 3 criteria ?

1. Is not contrived (according to your judgement).
2. You are not sure whether it has defined behaviour.
3. You do not see an obvious and easy way to change it so that it does what you
want *and* doesn't look contrived (according to your judgement) *and* you are
sure is not undefined behaviour.
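
For what it's worth, the two-statement version drops straight into a
complete, defined function; the cast through unsigned char (toupper's
argument must be representable as unsigned char or be EOF) closes off
another corner of undefined behaviour:

```c
#include <ctype.h>

/* Upper-case a NUL-terminated string in place.  Each statement either
   accesses through the pointer or advances it -- never both at once,
   so there is no sequence-point question to argue about. */
static void upcase(char *index)
{
    while (*index != '\0') {
        *index = (char)toupper((unsigned char)*index);
        index++;
    }
}
```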

> I intend to continue using the GNU C++ and C compilers, but with -O0 and
> -Wall flags turned on. Some of the code it generates at -O0 is fairly ugly,
> but I expect it's reliable when the source is not UB.

I don't think the -O0 part makes sense. If the compiler has bugs and miscompiles
correct code then you want to find this out as early as possible rather than at a
later point where you may need a higher optimisation level because otherwise the
code does not run fast enough. If your code has undefined behaviour then if it does
what you want then it's only by coincidence and your luck may run out even if you
always use -O0 .So in my opinion , one should at least conduct some tests with
other optimisation levels (and preferably using multiple compilers) to see if the
code continues to run correctly.

--
Every application has an inherent amount of irreducible complexity. The only
question is: Who will have to deal with it -- the user, the application
developer, or the platform developer?
http://www.nomodes.com/Larry_Tesler_Consulting/Complexity_Law.html

Alex

unread,
Apr 6, 2017, 12:43:19 PM4/6/17
to
On 4/1/2017 14:36, krishna...@ccreweb.org wrote:
> Question:
>
> Are there any existing Forth systems built from the LLVM tools (link
> below)? What advantages might LLVM provide for a low-level language
> such as Forth?
>
> Krishna
>
> http://llvm.org/
>

I have also just come across FIRM. It looks interesting in that it
appears a lot simpler than LLVM and has a well documented C library
(libfirm) that would appear to be compilable on a number of platforms.

It also has a Forth implementation (FirmForth) that could be used as a
basis for your evaluation.

<quote>
libFirm is a C library that provides a graph-based intermediate
representation, optimizations, and assembly code generation suitable for
use in compilers.

Features

Completely graph-based, source- and target-independent intermediate
representation in SSA form

Accompanying GCC-compatible C frontend with full C99 support

Extensive set of optimizations

High-quality register allocation

Mature code generation support for x86 (32-bit) and SPARC

<links>
Paper: http://pp.info.uni-karlsruhe.de/uploads/publikationen/braun11wir.pdf

@techreport{braun11wir,
title = {Firm---A Graph-Based Intermediate Representation},
booktitle = {Karlsruhe Reports in Informatics},
year = {2011},
publisher = {Karlsruhe},
number = {35},
author = {Matthias Braun and Sebastian Buchwald and Andreas Zwinkau},
institution = {Karlsruhe Institute of Technology},
url = {http://digbib.ubka.uni-karlsruhe.de/volltexte/1000025470},
}

Website: http://pp.ipd.kit.edu/firm/

FirmForth: https://github.com/anse1/firmforth
Presentation on FirmForth:
https://korte.credativ.com/~ase/firm-postgres-jit-forth.pdf

--
Alex

krishna...@ccreweb.org

unread,
Apr 6, 2017, 1:49:54 PM4/6/17
to
On Thursday, April 6, 2017 at 7:33:45 AM UTC-5, Spiros Bousbouras wrote:
> Since this has become about C now , I'm crossposting to comp.lang.c
> and setting follow-ups to the same.
>
> On Wed, 5 Apr 2017 20:14:27 -0700 (PDT)
> krishna...@ccreweb.org wrote:
> > It's a matter of trusting the compiler will code what you ask it to do,
> > and understanding the cases of undefined behavior when the compiler may
> > not generate expected code. It's not clear to me that mere mortals can
> > keep track of all of the cases of UB in the C standard, and the latitude
> > given to compilers to do different things in those cases.
>
> I'm a mere mortal and I can certainly keep track of UB relevant to the code
> I write. In most cases it's actually common sense. For example I don't need
> to consult the C standard to know that trying to read beyond the bounds of
> an array will probably not result in anything useful because , even *conceptually*
> (i.e. without referring to the standard of any specific language) , it doesn't
> make sense. ...

There are over 200 undefined behavior cases documented in the C11 standard. I haven't reviewed the full list, and without doing so, I won't comment further on how "common sense" such cases are. However, the cases which are of most concern are the ones which were idioms typically in use, and relied upon by programmers. Destroying that backwards compatibility seems foolish.

> Note first that not thinking in terms of C is more complicated and requires more
> knowledge than thinking in terms of C. Second , if one wants to think in terms of
> assembly then , from a practical point of view , it makes sense to program in
> assembly , not C.
>

I'm not sure what the above means. I don't think in a computer language when I have a computation problem at hand that I want to express in code. Although I'm guessing many programmers skip this part, it's useful to write down the computation problem in pseudo-code, which is usually a high-level abstraction. I don't immediately start writing C or assembly code.

>
> [ References are
> 1. http://sdtimes.com/nasas-10-rules-developing-safety-critical-code
> and
> 2. https://groups.google.com/d/msg/comp.std.c/9lYbeLVaZCA/__behqP09dIJ
> ]
>
> Regarding the NASA guidelines , they are unsuitable for a lot of desktop software.
> For example "Do not use dynamic memory allocation after initialization." ...

Good point. Not all of the guidelines apply for every case. Their focus is on embedded systems which must run continuously for years.

> Your second reference has
> char *index;
> ....
> *index++ = toupper(*index);
>
> .He doesn't explain why he decided to write code like this as opposed to
> *index = toupper(*index);
> index++ ;
...
The point of referring to this thread was not the coding problem at hand, which was use of an idiom that was common. It was the pervasive confusion among many C experts, including at least one who was apparently present for the standards meeting, about how to classify the statement as undefined behavior in terms of the language of the (developing) standard at that time. Even Dennis Ritchie chimes in, apparently with some comments not relevant to the discussion at hand. Reading the full thread will reveal that it may not be so easy as you think to determine when code will have undefined behavior based on the standards language.


> So unless he had good reasons to prefer the first version then
> worrying whether it's undefined behaviour is irrelevant for
> programming ; ...
>

For the rest of your query, I will refer you to Anton's paper,

http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf


> > I intend to continue using the GNU C++ and C compilers, but with -O0 and
> > -Wall flags turned on. Some of the code it generates at -O0 is fairly ugly,
> > but I expect it's reliable when the source is not UB.
>
> I don't think the -O0 part makes sense. If the compiler has bugs and miscompiles
> correct code then you want to find this out as early as possible rather than at a
> later point where you many need a higher optimisation level because otherwise the
> code does not run fast enough. If your code has undefined behaviour then if it does
> what you want then it's only by coincidence and your luck may run out even if you
> always use -O0 .

The -O0 is to prevent the compiler from optimizing the code, although as Anton shows in his paper, GCC still seems to make some optimizations at this level. If the compiler is buggy, that will hopefully come out in the testing of the code. The -Wall option is used to find undefined behavior in the C source. For example, for the source statement

i = i++ + 2;

GCC with -Wall -O0 flags gives the following output:

warning: operation on ‘i’ may be undefined [-Wsequence-point]
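
If the warning fires, either plausible intent has a perfectly defined
spelling; a sketch (the helper names are mine):

```c
/* i = i++ + 2; is undefined: i is modified twice with no intervening
   sequence point.  Both readings a programmer might have meant can be
   written unambiguously instead. */

static int drop_increment(int i)  /* "the ++ was meant to be overwritten" */
{
    i = i + 2;
    return i;
}

static int keep_increment(int i)  /* "old value + 2, and the increment too" */
{
    i = i + 3;
    return i;
}
```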


> So in my opinion , one should at least conduct some tests with
> other optimisation levels ...

Agreed, but, as I mentioned earlier, the efficiency of the C++ and C code is not relevant to my Forth system, since the words are mostly in assembler, as well as the virtual machine.

When very high efficiency is a concern, I've come to prefer implementing the algorithm in assembler, including in the Forth assembler. For example, the following numerical integration algorithm is implemented in assembler and can be called from within the Forth environment. No doubt an optimizing Forth compiler can match or perhaps exceed the efficiency of this code. However, I can maintain, debug, or extend this code, which is something I can't do with the C optimizer output.

ftp://ccreweb.org/software/kforth/examples/fsl/extras/numerov_x86.4th

Krishna


krishna...@ccreweb.org

unread,
Apr 6, 2017, 1:53:47 PM4/6/17
to
On Thursday, April 6, 2017 at 11:43:19 AM UTC-5, Alex wrote:
> On 4/1/2017 14:36, krishna...@ccreweb.org wrote:
> > Question:
> >
> > Are there any existing Forth systems built from the LLVM tools (link
> > below)? What advantages might LLVM provide for a low-level language
> > such as Forth?
> >
> > Krishna
> >
> > http://llvm.org/
> >
>
> I have also just come across FIRM. It looks interesting in that it
> appears a lot simpler than LLVM and has a well documented C library
> (libfirm) that would appear to be compilable on a number of platforms.
>
> It also has a Forth implementation (FirmForth) that could be used as a
> basis for your evaluation.
> ...

Thanks, Alex. I'll have a look as time permits.

Krishna


Alex

unread,
Apr 6, 2017, 1:55:37 PM4/6/17
to
On 4/6/2017 18:49, krishna...@ccreweb.org wrote:
> For the rest of your query, I will refer you to Anton's paper,
>
> http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29..pdf

Corrected link
http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf

--
Alex

Spiros Bousbouras

unread,
Apr 6, 2017, 2:48:57 PM4/6/17
to
On Wed, 05 Apr 2017 15:01:45 GMT
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>
> Translating Forth into traditional C (that compiles as intended by PCC
> and early versions of GCC) is indeed not difficult; after all,
> traditional C and Forth are semantically close to each other. Take a
> look at Gforth or forth2c to get an idea on how to do it.
>
> The resulting code is, of course, full of undefined behaviour, like
> most of the other C code around; and in this case I think it's
> impossible to get rid of the undefined behaviour without losing
> functionality and performance, and even if you are willing to pay that
> price, it requires significantly more effort, and in the end, you will
> probably still have some undefined behaviour somewhere.

Here we need to make a distinction : I will term "weak undefined behaviour"
(WUB) C code which is not defined according to the C standard and "strong
undefined behaviour" (SUB) C code which is not defined according to
*anything* where "anything" could mean some standard which enhances the C
standard (like POSIX) and also relevant documentation of the compiler or
libraries one uses or whatever. Our main concern here is predictable behaviour
of the code so I believe the deciding concept is SUB rather than WUB. So if
"The resulting code is, of course, full of undefined behaviour" means WUB
then that's probably correct (there's a lot of useful stuff you can't do just
by <C according to the standard>) but if you mean SUB then I doubt it. In any
case , if code has SUB then I cannot imagine on what basis one may think it
works and it will reliably continue to do so in the future : even if it
passes testing , then that's by coincidence. So if code exhibits SUB then
it's buggy and it needs to be fixed.

> In general, programming is not easy, but there are different levels of
> difficulty:
>
> 1) Programmers tend to be able to satisfy the requirements that the
> users use a lot, and/or care about a lot. Corner cases, less so (and
> in the Forth community, some people celebrate being wrong in corner
> cases, no matter how easy it would have been to do the right thing;
> actually, that's very similar to the attitude that gives us nasal
> demons).
>
> 2) Programmers are pretty bad at dealing with cases that do not occur
> during the course of normal operations. That's why security
> vulnerabilities occur so often and are so long-lived. Writing tests
> for unusual cases helps a little, but only shines a light on the cases
> that are tested. There is no guarantee that tests cover everything.
> Even with automated fuzz testing these days, the amount of automation
> is limited: a specialized setup for a certain system call or something
> caught things that the general fuzz tester had not caught.
>
> 3) Complete correctness proofs of programs are outside the competence
> of nearly all programmers (and pretty pointless, because they would
> prove only the correctness wrt the specification, which itself may be
> buggy).
>
> Now, to write a C program that is guaranteed to never exercise
> undefined behaviour requires competences on the level 3.

Not in the slightest. Consider the following piece of C code :

int is_leap_year(unsigned int year) {
    return 1;
}

It's trivial to see that this is defined but , judging from the function's
name , it's almost certainly buggy. A complete correctness proof is many
orders of magnitude harder than simply ensuring defined behaviour.
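
For contrast, a version that is both fully defined *and* (for the
Gregorian calendar) correct is only marginally harder to verify:

```c
/* Gregorian rule: every 4th year is a leap year, except centuries,
   except every 400th year. */
int is_leap_year(unsigned int year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}
```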

> The
> much-hyped sanitizers of GCC and Clang allow finding some (not all)
> undefined behaviours using the well-established testing methodology.
>
> So you will find undefined behaviours in the cases exercized by normal
> users, but you will tend not to find undefined behaviour in those
> cases exercized by an attacker; it's as difficult as finding the all
> the vulnerabilities in the first place. E.g., in the OpenSSH case, if
> you thought about the vulnerability, you would fix it right away,
> rather than writing a test case that exploited it, and waiting for the
> sanitizer to report that freed memory was accessed (if the sanitizer
> actually does check that), and then fixing it.

The best write-up I could find on the vulnerability is at
https://www.qualys.com/2016/01/14/cve-2016-0777-cve-2016-0778/openssh-cve-2016-0777-cve-2016-0778.txt .

The crucial bug occurred much earlier than compiler optimisations eliminating
code which wrote 0's on some memory. From the previous link

- "out_start == out_last" (lines 205-206): no data was ever written to
out_buf (and both out_start and out_last are still equal to 0) because
no data was ever sent to the server after roaming_reply() was called,
but the client sends (leaks) the entire uninitialized out_buf to the
server (line 214), as if out_buf_size bytes of data were available.

This would still be a bug (albeit a less serious one) even if the memory had
been zeroed because it sends bogus information to the server. An analogous
bug would be sending over the network unencrypted data which should have been
encrypted. In both cases , to see the bugs one needs to think on the level of
what the application wants and does not want to do , not on the low level
details of the programming language. Some low level detail may accidentally
eliminate some bug caused by a high level thinking oversight but this is
entirely random so it could just as well make it worse.

Another bug was with not handling correctly C arithmetic (see the page).
Finally the not zeroing of memory happened because (from the link)

Unfortunately, an optimizing compiler may remove this
memset() or bzero() call, because the buffer is written to, but never
again read from (an optimization known as Dead Store Elimination).

so it's not due to undefined behaviour. So your reasoning is totally wrong both
in terms of the large picture *and* the details.
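
(Incidentally, the standard workaround for that Dead Store Elimination
problem is to make the stores observable. One widely used idiom is
sketched below with a name of my choosing; C11 Annex K's memset_s and
the BSD/glibc explicit_bzero serve the same purpose:)

```c
#include <stddef.h>

/* Zero a buffer through a pointer-to-volatile: each store then counts
   as a volatile access, which the optimizer must not elide, even
   though the buffer is never read again. */
static void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    while (len--)
        *p++ = 0;
}
```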

> >I'd be tempted to allocate all of Forth's data space into a single
> >char array, and implement ALLOCATE with a custom allocator in that
> >space.
>
> That's "easy" in your book? The resulting Forth system would be
> severely restricted: E.g., you cannot use it to call system calls like
> mmap() and make use of the results (but then, that's undefined
> in the C standard anyway).
>
> > That fits nicely with Forth's flat memory space. But it
> >certainly could be done with malloc(), although care would have to be
> >taken not to hit UB comparing pointers to malloc'd data and Forth's
> >data space.
>
> Right. Anything but "not difficult"; I don't think it's possible.
> Subtraction is also a problem.
>
> >> How do you implement @ and C@?
> >
> >Simple memory reads.
>
> What C code?
>
> >It's legitimate to alias character types and
> >other types, so a suitably aligned character address can be cast to a
> >size_t* or a uintptr_t*.
>
> That contradicts things I have read about accessing casted pointers.

I couldn't tell you off the top of my head but I'm sure you could get the
answer on comp.lang.c very quickly. Looking at the big picture , it could
be that C is not a suitable back-end for Forth after all even if you got
lucky for a number of years and your code (which presumably was exhibiting
SUB) worked as intended. Note that this may turn out to be an opportunity for
Forth : it may be that Forth itself is a more suitable back-end than C for other
programming languages and if so then this needs to be advertised. A
*constructive* thing to do (as opposed to complaining) would be to describe
what guarantees with regard to C pointer comparisons on top of the C standard
would be needed to do what you want to do. Then you could ask on comp.lang.c
or some compiler mailing list whether these additional guarantees can be
offered by using some appropriate flag and if not how easy would be to implement
them as an extension.

> >I think it'd probably make the most sense to use uintptr_t for Forth
> >cells. Of course, this requires a system to be able to cast freely
> >between uintptr_t and and pointer types, but that is implementation-
> >defined rather than undefined behaviour. Every C compiler I care
> >about (or, in fact, know about) supports it.
>
> It's interesting how quick you come off your high horse once things
> like writing actual C code come into play. "It works as intended on
> the compilers and/or machines the programmer used" has been my
> yardstick for determining the correctness of programs all along, yet
> nasal demon fans insist that most (possibly all) of these programs are
> "buggy".

He didn't get off any high horse. Implementation defined behaviour is
stronger than "It works as intended on the compilers and/or machines the
programmer used" which may simply be by accident (and if it's SUB then it
almost certainly is by accident). As for this being your yardstick , that's
just shocking especially coming from someone who teaches computer science and
would be expected to teach students correct thinking. Testing is a necessary
but not sufficient condition to trust code : one must also have abstract
reasons (i.e. reasons which follow from some standard , specification or
relevant documentation) to think that the code will behave as intended.
Otherwise it's just programming superstition.
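
Concretely, the implementation-defined guarantee being leaned on is the
uintptr_t round trip (C99 7.18.1.4): the integer value you get is
implementation-defined, but converting it back yields a pointer that
compares equal to the original. A sketch:

```c
#include <stdint.h>

/* Round-trip a pointer through uintptr_t.  What the integer "means"
   is implementation-defined; that the round trip restores an equal
   pointer is guaranteed wherever uintptr_t exists at all. */
static int roundtrip_ok(int *p)
{
    uintptr_t bits = (uintptr_t)(void *)p;
    int *back = (void *)bits;
    return back == p;
}
```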

And speaking of "nasal demon fans" , I don't think such a thing exists.
Undefined behaviour is an unfortunate side effect of 3 factors :

1. The desire to allow implementors optimisations.
2. Inability to reconcile previous (i.e. before the standard) existing practices
therefore any attempt to define the behaviour more narrowly would unfairly
privilege some implementations (or architectures) over others.
3. The desire to not make the standard too large.

(Note that the above 3 factors apply to most programming languages , not just
C.) But noone *likes* undefined behaviour , it is just a trade off. The talk
of nasal demons is to drive home the point (especially to new programmers)
that when dealing with undefined behaviour (that is SUB) there's no
"reasonable" behaviour one can expect therefore *** Just say no *** .If
programmers get convinced to avoid undefined behaviour then this ultimately
will make code more predictable not less. But then there are always some
stubborn ones like yourself or
http://www.open-chess.org/viewtopic.php?f=5&t=2519 where Robert Hyatt
insists that SUB should still do what he wants just because it did in the past.

--
I can tell what you're thinking. Did he click Send or Save Now? Well, to tell you the
truth, in all the excitement of composing that angry email, I kind of lost track myself.
Good thing we can easily undo a sent mail! Oh wait, we totally can't.
https://blog.codinghorror.com/the-opposite-of-fitts-law/

Spiros Bousbouras

unread,
Apr 6, 2017, 4:41:21 PM4/6/17
to
Krishna's link was already correct. To be precise , his post has
http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29=
.pdf
and
Content-Transfer-Encoding: quoted-printable
in the header so after the required unfolding
http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29=
.pdf
becomes
http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf

It would seem that your newsreader has a bug (or undefined behaviour ;-)
w3m also displays it wrong ("29..pdf" instead of "29.pdf"). Interesting ;
I can't imagine what coding error would create the same mistake in 2 different
readers.
https://groups.google.com/d/msg/comp.lang.forth/pPVv1nVH0ZI/hw7J_1KhAQAJ
shows it right.

Alex

unread,
Apr 6, 2017, 4:47:58 PM4/6/17
to
On 4/6/2017 21:41, Spiros Bousbouras wrote:
> It would seem that your newsreader has a bug

Thunderbird 45.8.0

--
Alex

Alex

unread,
Apr 6, 2017, 4:56:25 PM4/6/17
to
As does Google Groups.

--
Alex

Joel Rees

unread,
Apr 6, 2017, 8:10:58 PM4/6/17
to
On Thursday, April 6, 2017 at 9:33:45 PM UTC+9, Spiros Bousbouras wrote:
> Since this has become about C now , I'm crossposting to comp.lang.c
> and setting follow-ups to the same.

I'm using Google groups because I'm lazy, so I have no idea whether this will maintain the cross-post.

> On Wed, 5 Apr 2017 20:14:27 -0700 (PDT)
> krishna(something)@(something).(something) wrote:
> > It's a matter of trusting the compiler will code what you ask it to do,
> > and understanding the cases of undefined behavior when the compiler may
> > not generate expected code. It's not clear to me that mere mortals can
> > keep track of all of the cases of UB in the C standard, and the latitude
> > given to compilers to do different things in those cases.
>
> I'm a mere mortal and I can certainly keep track of UB relevant to the code
> I write. In most cases it's actually common sense. For example I don't need
> to consult the C standard to know that trying to read beyond the bounds of
> an array will probably not result in anything useful because , even *conceptually*
> (i.e. without referring to the standard of any specific language) , it doesn't
> make sense. Perhaps the confusion arises because some people when programming
> in C are not thinking in terms of C but in terms of assembly. So they translate
> mentally the C code in some generic assembly instructions and then try to decide
> if the C code does something meaningful based on what those hypothetical assembly
> instructions will do.
>
> Example : the expression array[i] where i is beyond the end of the array.

The problem here is knowing whether your loop properly accesses the first and last elements of an array. Too many programmers are too lazy to understand the concept of where the exit comes in a loop, and too lazy to learn to track the state of the variables relevant to the loop control.

"Undefined Behavior" has two meanings here. Perhaps the C standard should have said "undefinable behavior", but that has certain potential misinterpretations as well.

> Thinking in terms of C : Not defined which means you aren't going to get anything
> useful so you shouldn't use it.

That's the approach Anton seems to be rather militant about. (And I don't blame him.)

But it's actually a third meaning for UB.

According to a common-sense interpretation (unless you are one who spews demons from your nose when faced with code that does not generalize well in certain corner-cutting hardware contexts), the standard defines UB as having the potential to mean something different, or nothing at all, in certain contexts that the standards committee chose to accept as permissible relative to the standard, mostly because of the charismatic influence of the practitioners of said hardware.

("Charismatic" here refers to the sort of the charisma that includes the Kim family among leaders who rule by "charisma".)

> Pondering or testing what's the most crazy thing
> which might happen , may be a bit of fun every now and again but is not relevant to
> practical programming.

Being unable to specify target hardware is another way to invite bugs.

> Thinking in terms of assembly : "array[i] will be translated into an assembly
> instruction which calculates an address of the form base + i * constant .
> This will always point into some address in memory so I should get some value
> or at least a segmentation fault ; so not a totally unpredictable result."

Some people conflate segmentation fault with UB, partly, we might guess, because they do not understand why, uhm, segmentation should fault. ;->

> Note first that not thinking in terms of C is more complicated and requires more
> knowledge than thinking in terms of C.

I'm not sure I could agree with this. Current "standard" C is way more complicated than most of the practitioners thereof have any idea.

> Second , if one wants to think in terms of
> assembly then , from a practical point of view , it makes sense to program in
> assembly , not C.

Hear! Hear!

(Which is what attracts some people to Forth, really.)

> > Certainly coding guidelines, e.g. NASA's C coding guidelines (ref. 1) can
> > help avoid such difficulties. But, it's also clear that mere mortal
> > programmers may be easily confused.

No one who claims to be an engineer should ever refer to him/herself as a mere mortal as an excuse for not understanding the tools he or she uses as fully as possible.

Sure, it easily becomes idolatry, and is among the sins of society, to require certain occupations to operate in hero mode, but if engineers don't operate in hero mode, airplanes fall from the sky, etc.

> > See the old thread on comp.std.c (ref. 2).
> > Even some of the experts had problems with the interpretation of the standard;
> > but, to be fair, that's an old thread.
>
> [ References are
> 1. http://sdtimes.com/nasas-10-rules-developing-safety-critical-code
> and
> 2. https://groups.google.com/d/msg/comp.std.c/9lYbeLVaZCA/__behqP09dIJ
> ]
>
> Regarding the NASA guidelines , they are unsuitable for a lot of desktop software.
> For example "Do not use dynamic memory allocation after initialization." .How
> are you going to implement for example the Unix sort utility without using
> realloc() (depending on the size of the input) ?
>
> Your second reference has
> char *index;
> ....
> *index++ = toupper(*index);
>
> .He doesn't explain why he decided to write code like this as opposed to
> *index = toupper(*index);
> index++ ;
>
> .The latter seems to me more readable. So unless he had good reasons to prefer the
> first version then worrying whether it's undefined behaviour is irrelevant for
> programming ; one simply shouldn't write code like this.

I'll say amen to that, too.

I used to be a fan of auto-increment modes, but then I realized how many times I had been writing i++ when I really meant ++i, just because some pedant was scared of surprising later programmers with an idiom that probably wasn't taught in schools. (And ten years later, those who breathe demons have finally begun to recognize that i = i + 1 in a C program is ++i, not i++. And get too excited about the fact.)

> Do you have an example
> of C code which satisfies the following 3 criteria ?
>
> 1. Is not contrived (according to your judgement).
> 2. You are not sure whether it has defined behaviour.
> 3. You do not see an obvious and easy way to change it so that it does what you
> want *and* doesn't look contrived (according to your judgement) *and* you are
> sure is not undefined behaviour.

I'm a late-comer to the conversation, but I have a few sitting around in my piles of old code.

I remember being excited to recognize that the reason my old assembler for the 6800, which I had written in college, was failing to run on my modern stuff was that I had, without realizing it, depended on a run-time semantic of dereferencing a NULL pointer. That one I understand now. But I know there are still dark corners hiding in code that I think I understand.

Yes, illuminating those dark corners is important. But it would be nice for the drunk policeman to allow me to continue looking where my wallet fell, rather than insist that I look under the lamp where there is light. So to speak.

> > I intend to continue using the GNU C++ and C compilers, but with -O0 and
> > -Wall flags turned on. Some of the code it generates at -O0 is fairly ugly,
> > but I expect it's reliable when the source is not UB.
>
> I don't think the -O0 part makes sense. If the compiler has bugs and miscompiles
> correct code then you want to find this out as early as possible rather than at a
> later point where you many need a higher optimisation level because otherwise the
> code does not run fast enough. If your code has undefined behaviour then if it does
> what you want then it's only by coincidence and your luck may run out even if you
> always use -O0 .So in my opinion , one should at least conduct some tests with
> other optimisation levels (and preferably using multiple compilers) to see if the
> code continues to run correctly.

Wish management would be more willing to put up with this kind of exploration, but that almighty false proxy of value is the bottom line.

> --
> Every application has an inherent amount of irreducible complexity. The only
> question is: Who will have to deal with it — the user, the application
> developer, or the platform developer?
> http://www.nomodes.com/Larry_Tesler_Consulting/Complexity_Law.html

--
Joel Rees

http://reiisi.blogspot.com

Joel Rees

Apr 6, 2017, 9:42:32 PM
On Friday, April 7, 2017 at 3:48:57 AM UTC+9, Spiros Bousbouras wrote:
> On Wed, 05 Apr 2017 15:01:45 GMT
> anton(something)@(something).(something)(Anton Ertl) wrote:
> > Andrew Haley <andrew(something)@(something).(something)> writes:
> >
> > Translating Forth into traditional C (that compiles as intended by PCC
> > and early versions of GCC) is indeed not difficult; after all,
> > traditional C and Forth are semantically close to each other. Take a
> > look at Gforth or forth2c to get an idea on how to do it.
> >
> > The resulting code is, of course, full of undefined behaviour, like
> > most of the other C code around; and in this case I think it's
> > impossible to get rid of the undefined behaviour without losing
> > functionality and performance, and even if you are willing to pay that
> > price, it requires significantly more effort, and in the end, you will
> > probably still have some undefined behaviour somewhere.
>
> Here we need to make a distinction : I will term "weak undefined behaviour"
> (WUB) C code which is not defined according to the C standard and "strong
> undefined behaviour" (SUB) C code which is not defined according to
> *anything* where "anything" could mean some standard which enhances the C
> standard (like POSIX) and also relevant documentation of the compiler or
> libraries one uses or whatever.

I'm not up-to-date on the C standard. Does the standard make the same distinction?

If not, sorry to be a demon breather of human languages, but you just introduced UB into the discussion.

> Our main concern here is predictable behaviour
> of the code so I believe the deciding concept is SUB rather than WUB.

And who do you represent in making that the deciding concept?

> So if
> "The resulting code is, of course, full of undefined behaviour" means WUB
> then that's probably correct (there's a lot of useful stuff you can't do just
> by <C according to the standard>) but if you mean SUB then I doubt it. In any
> case , if code has SUB then I cannot imagine on what basis one may think it
> works and it will reliably continue to do so in the future : even if it
> passes testing , then that's by coincidence. So if a code exhibits SUB then
> it's buggy and it needs to be fixed.

Having criticized you for making the distinction, I'll agree with your intent, partially. But only if you go look up my bif-c project and read the header file that I use to clear the problems Anton is frustrated about, and the configuration code I use to generate that header file. (... which I now discover, on current 64 bit machines, has some potential UB that occurs because the compiler writers have trouble understanding whether a C int should ever be 64 bits on a nominally 64 bit architecture, and make an assumption that didn't match mine, and I'm still not sure their assumption matches the future realities better than mine.)

And then you may contemplate how that header perverts some of the hidden corners of my interpreter, and whether the bugs it introduces now are worth the reward of illuminating those corners some years from now when I can finally turn my attention back to the code for a day or two.
But that doesn't mean that correctness is any closer to being achievable.

Rather, it requires the compiler writers (and standards committees) to act on a level that is a bit beyond mere hero -- perhaps a worthy goal, but not achievable.
I'm shooting from the hip, and my memory is sure to be faulty, but my memory is that the compiler's assumption that this was valid dead store elimination was derived from an interpretation elsewhere of UB.

Now, there is a potential for conflating the general principle of UB with the dead-code optimizations that became significantly more aggressive around that time.

But I'm not sure the conflation is entirely spurious.

K&R C and the first ANSI standard acknowledged that a compiler should never try to interpret too far beyond a pointer, because, especially in a multiprocessing run-time, the compiler had no way of knowing what was happening to the memory being pointed to when the code was not accessing it.

Microsoft insisted they could predict such things in their OS. We know why. We also know what happens later.

Intel makes their market by helping Microsoft do stupid things. So Intel wants us to believe Microsoft's charismatic delusions of omnipotence, and they find it convenient to optimize their hardware to do so. And fudge around the problems with the next shrink in the die and the accompanying speed bump, which, by the way, hasn't happened in several years, and, by the way, is a significant part of the motivation for trying to get the compiler to juice every last ounce of performance from a CPU that has serious bugs in its spec.

In other words, from one point of view, the question of dead store is not separable from the question of what happens when a pointer to undefined memory is sniffed (which is one of the bogies of UB) or of what happens when a pointer to an object in one allocation area is compared to a pointer to an object in another allocation area (which is another).

(Yes, I still grudge Intel those four-bit pointer jaggies, along with the imposition of the LSB addressing perversion. We are finally becoming able, as an industry, to describe why Intel was wrong when they said it wouldn't matter, but we now have a huge burden of false assumptions in an entire industry to overcome, to get rid of the effects of those two specific false optimizations.)
I'll again refer you to the source code for my bif-c project, the same place I mentioned above.

There are some extent bugs in my source code which are derived from something in that huge union I made to get around the problem Anton is pointing at. And I kind of implied it above, but that union is itself a straitjacket that prevented me from pushing that project forward into the realm of production-usable code.

Maybe, if I'd been able to get someone to pay me while I thought about it, I could have found a way around the straitjacket.

As it was, getting it that far took time away from my family which my children are now paying for. (And I am paying for in terms of going back to help them learn things now that I should have been helping them learn when they were much younger. I'm fortunate to have the chance to do so. Not all engineers who are parents end up having the chance.)

> Note that this may turn out to be an opportunity for
> Forth : it may be that Forth itself is a more suitable back-end than C for other
> programming languages and if so then this needs to be advertised.

One of the hidden features of Forth is the difficulty of imposing a standard on any level of the language families. It would be nice, perhaps, but it would have to have a different name than Forth.

Uhm, they would have different names than Forth.

(Which is really the solution that should have been used with C -- split the standard into three or more separate languages with separate target classes of run-time environments. If C had a split context stack like Forth, and kept its symbol table on-line like Forth, and provided default access to the debugger like Forth, and had a per-user segment like many versions of Forth, etc., maybe it would have been more obvious that they shouldn't be trying to conflate all the standards into one.)

> A
> *constructive* thing to do (as opposed to complaining) would be to describe
> what guarantees with regard to C pointer comparisons on top of the C standard
> would be needed to do what you want to do. Then you could ask on comp.lang.c
> or some compiler mailing list whether these additional guarantees can be
> offered by using some appropriate flag and if not how easy would be to implement
> them as an extension.

I think that was what he was trying to do. Didn't you see that when you read his paper?

The pragmas have generally been available in the past. Some of them are no longer available, due to the insistence that standard-undefined behavior is equivalent to non-existing behavior. (See the confusion about the SSH bug you mention above. The authors of ssh are no less critical of the arrogance of the current batch of compiler engineers than Anton, and if you were looking for a group who understand the implications of compiler semantics on security, the authors of ssh are about as qualified as any group of engineers on the planet. BTW, the authors of ssh were never confused about this. The compiler engineers have just plain gone random.)

> > >I think it'd probably make the most sense to use uintptr_t for Forth
> > >cells. Of course, this requires a system to be able to cast freely
> > >between uintptr_t and and pointer types, but that is implementation-
> > >defined rather than undefined behaviour. Every C compiler I care
> > >about (or, in fact, know about) supports it.
> >
> > It's interesting how quick you come off your high horse once things
> > like writing actual C code come into play. "It works as intended on
> > the compilers and/or machines the programmer used" has been my
> > yardstick for determining the correctness of programs all along, yet
> > nasal demon fans insist that most (possibly all) of these programs are
> > "buggy".
>
> He didn't get off any high horse. Implementation defined behaviour is
> stronger than "It works as intended on the compilers and/or machines the
> programmer used" which may simply be by accident (and if it's SUB then it
> almost certainly is by accident

When a pointer was an integer, what worked because a pointer was an integer was not by accident.

Pointers are still integers.

We, as a community have some deeper understanding about the nature of integers than we did before.

But the standards have declared now that pointers are not integers compatible with computation integers, and that makes problems, and if the compiler engineers keep dumping code that is UB under the assumption that pointers are not integers, well, in several years, all C compilers will be very quick.

Every program will compile to NOOP.

> ). As for this being your yardstick , that's
> just shocking especially coming from someone who teaches computer science and
> would be expected to teach students correct thinking. Testing is a necessary
> but not sufficient condition to trust code : one must also have abstract
> reasons (i.e. reasons which follow from some standard , specification or
> relevant documentation) to think that the code will behave as intended.
> Otherwise it's just programming superstition.

Abstract vs. real. Where do you set the fulcrum to get the balance?

Balance is the wrong question. Having both halfway doesn't really work very well.

> And speaking of "nasal demon fans" , I don't think such a thing exists.

I'm guessing he gets that idiom from a more polite restatement of the monkeys flying out of the dark place idiom that was used as the general proxy symbol for talking about UB in certain subgroups of our industry in the not-so-distant past.

> Undefined behaviour is an unfortunate side effect of 3 factors :
>
> 1. The desire to allow implementors optimisations.

Restated as Intel's need to find some way to keep claiming that their patent-laden archaic architecture can still be kept up to speed, if nothing else.

> 2. Inability to reconcile previous (i.e. before the standard) existing practices
> therefore any attempt to define the behaviour more narrowly would unfairly
> privilege some implementations (or architectures) over others.

An unwillingness to acknowledge that the requirements of various industries who have found a commonality in C now need to refine their interests in the common language.

> 3. The desire to not make the standard too large.

What?

It was already too large as of ANSI C.

> (Note that the above 3 factors apply to most programming languages , not just
> C.) But noone *likes* undefined behaviour , it is just a trade off.

So every application program now has to correctly guide rockets into orbit, correctly interpret inane tax law, correctly play chess, correctly diagnose your mother's health issues, correctly perform open-heart surgery on my friend, and correctly tend the baby in the nursery? (Just for starters.)

If we can't learn to deal with context, we are, as a society and a race, completely unworthy of the technologies that allow us to define anything beyond a context-free grammar.

> The talk
> of nasal demons is to drive home the point (especially to new programmers)
> that when dealing with undefined behaviour (that is SUB) there's no
> "reasonable" behaviour one can expect therefore *** Just say no *** .

Why should admitting that pointers are integers be equated with random unprotected sex or crack cocaine?

> If
> programmers get convinced to avoid undefined behaviour then this ultimately
> will make code more predictable not less. But then there are always some
> stubborn ones like yourself or
> http://www.open-chess.org/viewtopic.php?f=5&t=2519 where Robert Hyatt
> insists that SUB should still do what he wants just because it did in the past.
>
> --
> I can tell what you're thinking. Did he click Send or Save Now? Well, to tell you the
> truth, in all the excitement of composing that angry email, I kind of lost track myself.
> Good thing we can easily undo a sent mail! Oh wait, we totally can't.
> https://blog.codinghorror.com/the-opposite-of-fitts-law/

Is that your siggy?

Are you even subliminally aware of the irony implicit?

hughag...@gmail.com

Apr 6, 2017, 11:15:19 PM
Thanks for telling me about that --- I'll look into LLVM.

Rod Pemberton

Apr 6, 2017, 11:58:11 PM
On Thu, 6 Apr 2017 21:56:24 +0100
Alex <al...@rivadpm.com> wrote:

> On 4/6/2017 21:47, Alex wrote:
> > On 4/6/2017 21:41, Spiros Bousbouras wrote:

> >> It would seem that your newsreader has a bug
> >
> > Thunderbird 45.8.0
> >
>

The OP's link appears correct here. Claws-mail 3.13.1.
It appears correct in the message source view too.

> As does Google Groups.

Google Groups link appears correct here. Firefox 52.0.2.
It appears correct in the message source view too.

Rod Pemberton

Apr 7, 2017, 3:32:11 AM
On Thu, 06 Apr 2017 12:33:44 GMT
Spiros Bousbouras <spi...@gmail.com> wrote:


This post was intended for c.l.f. since the issues dealt with
comparison of Forth and C functionality. I've re-posted it here since
it made it to c.l.c. instead of c.l.f. Spiros redirected which was not
expected as I took note of the topic from a reply to him on c.l.f. ...
Hopefully, those on c.l.c. will not redirect their crap here. I'm not
even interested in seeing what type of shit those idiots posted in
response.

> Example : the expression array[i] where i is beyond the end of the
> array.

FYI, C explicitly allows you to access one element past the end of the
array. This is to allow you to access the nul character when comparing
strings to check for string termination. If you go beyond that though,
it's undefined ... However, many C implementations use the ability to
access outside a C object behind the scenes without any issues, e.g.,
to access a memory header prior to malloc'd memory blocks.

> Thinking in terms of assembly : "array[i] will be translated into an
> assembly instruction which calculates an address of the form base +
> i * constant .

This is a requirement of the subscript operator [] for K&R, ANSI, and
ISO C. It's in all the standards. They define "E1[E2]" to be
equivalent to "*((E1)+(E2))". The subscript operator takes a pointer
to a type and also an index. The index is scaled by the size of the
data type pointed to. The pointer cannot be a pointer-to-void
('void *'), since 'void' has no size; it is a pointer-only type.
The pointer used in the subscript operator can be any pointer to any
valid type, not just a pointer to an array type. The operands of the
subscript operator can be in either order, e.g., "0123456789"[i] is
equivalent to i["0123456789"]. Technically, you should think of the
subscript operator as just the left bracket; then you'll understand it's
like any other operator in C, e.g., arithmetic + or -> component
selection.

BTW, the other required equivalence is "E1->MOS" is equivalent to
"(*E1).MOS". This is the equivalence between the two component
selection operators . and ->

> This will always point into some address in memory so
> I should get some value or at least a segmentation fault ; so not a
> totally unpredictable result."

C's objects don't have bounds checks built in. It's not an
object-oriented language. So, it's up to the programmer to make sure
that they're within range.

> Note first that not thinking in terms of C is more complicated and
> requires more knowledge than thinking in terms of C. Second , if one
> wants to think in terms of assembly then , from a practical point of
> view , it makes sense to program in assembly , not C.

Non sequitur.

IMO, the key reasons to use C instead of assembly are the ability to
use variables and the portability of the code. Even so, you need to
understand how your code is converted to assembly or at least check the
compiler output.

> How are you going to implement for example
> the Unix sort utility without using realloc() (depending on the
> size of the input) ?

I don't follow. I've never had to use realloc() for anything, ever.
If you're manipulating large data sets in C, it's better to use files
instead of memory.

> Your second reference has
> char *index;
> ....
> *index++ = toupper(*index);
>
> .He doesn't explain why he decided to write code like this as opposed
> to *index = toupper(*index);
> index++ ;

I'm guessing that this is most likely just pre-ANSI C. Most C style
guides etc are older and are for K&R C. E.g., most will recommend
that you use #defines and mask with logical-and instead of using
bit-fields, since many earlier C compilers were broken. They also
recommend avoiding enums for similar reasons. They also recommend
placing numbers in comparisons as "if(0==value)" instead of
"if(value==0)" to detect incorrect assignments such as "if(value=0)".
etc. They'll also say not to use the address of a struct by itself,
i.e., as a pointer, since an early, widespread C compiler, PCC, didn't
keep track of the struct's name or somesuch issue ... ANSI C added
other problems, e.g., obsoleted implicit int's since they caused a
syntax conflict with typedef's, messed up pre- and post- increment and
decrement operators, etc. Of course, C has some required behavior,
which is not part of some of the standards, such as the C struct hack
and Duff's device, which must work correctly.

Andrew Haley

Apr 7, 2017, 5:02:19 AM
Joel Rees <joel...@gmail.com> wrote:
>
> K&R C and the first ANSI standard acknowledged that a compiler
> should never try to interpret too far beyond a pointer, because,
> especially in a multiprocessing run-time, the compiler had no way of
> knowing what was happening to the memory being pointed to when the
> code was not accessing it.

I don't believe it. Chapter and verse of the first ANSI standard,
please.

Andrew.

Spiros Bousbouras

Apr 8, 2017, 6:43:28 PM
On Thu, 6 Apr 2017 18:42:30 -0700 (PDT)
Joel Rees <joel...@gmail.com> wrote:

I'll give a more complete answer tomorrow but for the time being...

> Having criticized you for making the distinction, I'll agree with your intent,
> partially. But only if you go look up my bif-c project and read the header file that
> I use to clear the problems Anton is frustrated about, and the configuration code I
> use to generate that header file. (... which I now discover, on current 64 bit
> machines, has some potential UB that occurs because the compiler writers have
> trouble understanding whether a C int should ever be 64 bits on a nominally 64 bit
> architecture, and make an assumption that didn't match mine, and I'm still not sure
> their assumption matches the future realities better than mine.)

You could have provided a link you know. I will guess that you mean
https://sourceforge.net/projects/bif-c/files/latest/download .But I see
several *.h files so I don't know what "the header file" is. Also if
you want my input you must explain which parts of your code correspond
to which of Anton's complaints. Surely you don't expect me to try all
possible combinations of lines in your code with claims in Anton's paper
and see if I come up with a match which makes sense.

By the way , README.TXT seems to be missing text at the end (assuming I got
the correct download).

Spiros Bousbouras

Apr 8, 2017, 6:48:56 PM
On Fri, 7 Apr 2017 03:33:09 -0400
Rod Pemberton <NeedNotR...@xrsevnneqk.cem> wrote:
> This post was intended for c.l.f. since the issues dealt with
> comparison of Forth and C functionality. I've re-posted it here since
> it made it to c.l.c. instead of c.l.f. Spiros redirected which was not
> expected as I took note of the topic from a reply to him on c.l.f. ...

I did say I set follow-up to comp.lang.c .

> Hopefully, those on c.l.c. will not redirect their crap here. I'm not
> even interested in seeing what type of shit those idiots posted in
> response.

It's a pity you're not interested because they corrected your numerous mistakes.

Joel Rees

Apr 8, 2017, 11:54:43 PM
On Sunday, April 9, 2017 at 7:43:28 AM UTC+9, Spiros Bousbouras wrote:
> On Thu, 6 Apr 2017 18:42:30 -0700 (PDT)
> Joel Rees <joel.rees@(something).(something)> wrote:
>
> I'll give a more complete answer tomorrow but for the time being...
>
> > Having criticized you for making the distinction, I'll agree with your intent,
> > partially. But only if you go look up my bif-c project and read the header file that
> > I use to clear the problems Anton is frustrated about,

Sorry, I had forgotten that I had not finished moving that part of the code into more meaningfully named files.

The union is in

https://sourceforge.net/p/bif-c/code/52/tree/trunk/bifu_i.h

> > and the configuration code I
> > use to generate that header file.

The configurations are in configs/celltype.h, which is generated by

https://sourceforge.net/p/bif-c/code/52/tree/trunk/configs/makecelltype.c

> > (... which I now discover, on current 64 bit
> > machines, has some potential UB that occurs because the compiler writers have
> > trouble understanding whether a C int should ever be 64 bits on a nominally 64 bit
> > architecture, and make an assumption that didn't match mine, and I'm still not sure
> > their assumption matches the future realities better than mine.)
>
> You could have provided a link you know.

8-*

> I will guess that you mean
> https://sourceforge.net/projects/bif-c/files/latest/download .But I see
> several *.h files so I don't know what "the header file" is. Also if
> you want my input you must explain which parts of your code correspond
> to which of Anton's complaints. Surely you don't expect me to try all
> possible combinations of lines in your code with claims in Anton's paper
> and see if I come up with a match which makes sense.

I had intended to move it into either biftypes.h or bifsym.h, and ran out of time while I was holding a debate with myself on the question. (No one else has joined the project, so I have to debate such things with myself, you see. ;)

> By the way , README.TXT seems to be missing text at the end (assuming I got
> the correct download).

Hmm. Now I need to make some time to do some magnetic domain dumpster diving. Or maybe just write a new readme, since the plans I had then are now quite apparently unachievable.

I assume you will do the make in configs/, to get the generated header. For the list, the generated header looks like this on my AMD64 running Debian Wheezy, compiled with gcc 4.7.2:
----------------------------
/*
** celltype.h
** bif-c
**
** Template created by Joel Rees on 2013/06/18.
** Copyright 2013 __Reiisi_Kenkyuu__. All rights reserved.
**
** Type information for cell proto-type.
*/


#if !defined CELLTYPE_H /* recursive inclusion blocker */
#define CELLTYPE_H


typedef char * prCharPtr_p;
typedef void * prVoidPtr_p;
typedef void (* icode_f )( void );
#define PROBE_POINTER_BYTE 8
#define PROBE_POINTER_BIT 64
#define PROBE_POINTER_TYPE icode_f


typedef unsigned char ubyte_t;
typedef signed char sbyte_t;
/* For now, assuming two's complement: */
#define UBYTE_MAX ((ubyte_t) 0xff)
#define BITSPERBYTE 8
#define BYTE_HIGH_BIT ((ubyte_t) 0x80)
#define SBYTE_MAX ((sbyte_t) 0x7f)
#define SBYTE_MIN ((sbyte_t) BYTE_HIGH_BIT)


typedef unsigned short ushortw_t;
typedef signed short sshortw_t;
/* For now, assuming two's complement: */
#define USHORTW_MAX ((ushortw_t) 0xffff)
#define BITSPERSHORTW 16
#define SHORTW_HIGH_BIT ((ushortw_t) 0x8000)
#define SSHORTW_MAX ((sshortw_t) 0x7fff)
#define SSHORTW_MIN ((sshortw_t) SHORTW_HIGH_BIT)


typedef unsigned int uintw_t;
typedef signed int sintw_t;
/* For now, assuming two's complement: */
#define UINTW_MAX ((uintw_t) 0xffffffff)
#define BITSPERINTW 32
#define INTW_HIGH_BIT ((uintw_t) 0x80000000)
#define SINTW_MAX ((sintw_t) 0x7fffffff)
#define SINTW_MIN ((sintw_t) INTW_HIGH_BIT)


typedef unsigned long ulongw_t;
typedef signed long slongw_t;
/* For now, assuming two's complement: */
#define ULONGW_MAX ((ulongw_t) 0xffffffffffffffffL)
#define BITSPERLONGW 64
#define LONGW_HIGH_BIT ((ulongw_t) 0x8000000000000000L)
#define SLONGW_MAX ((slongw_t) 0x7fffffffffffffffL)
#define SLONGW_MIN ((slongw_t) LONGW_HIGH_BIT)


/* *** long long type supported on this architecture. *** */
typedef unsigned long long ullongw_t;
typedef signed long long sllongw_t;
/* For now, assuming two's complement: */
#define ULLONGW_MAX ((ullongw_t) 0xffffffffffffffffLL)
#define BITSPERLLONGW 64
#define LLONGW_HIGH_BIT ((ullongw_t) 0x8000000000000000LL)
#define SLLONGW_MAX ((sllongw_t) 0x7fffffffffffffffLL)
#define SLLONGW_MIN ((sllongw_t) LLONGW_HIGH_BIT)


#define SINGLE_C_CELL_DOUBLE


/* This is the primitive cell,
** not the target cell.
*/
typedef slongw_t probe_scell_t;
typedef ulongw_t probe_ucell_t;
#define PROBE_CELL_BYTE 8
#define PROBE_CELL_BIT BITSPERLONGW
#define PROBE_CELL_HIGH_BIT ((probe_ucell_t) LONGW_HIGH_BIT)
#define PROBE_UCELL_MAX ((probe_ucell_t) ULONGW_MAX)
#define PROBE_SCELL_MAX ((probe_scell_t) SLONGW_MAX)
#define PROBE_SCELL_MIN ((probe_scell_t) SLONGW_MIN)


/* A manufactured double has to be constructed on the target.
** All we can do here is provide the context.
**
** Cells are never smaller than C int.
*/
#endif /* defined CELLTYPE_H recursive inclusion blocker */
---------------------

--
Joel Rees

http://reiisi.blogspot.jp/p/novels-i-am-writing.html

Joel Rees

Apr 9, 2017, 12:18:47 AM
On Friday, April 7, 2017 at 6:02:19 PM UTC+9, Andrew Haley wrote:
I might be misremembering just when they decided to allow Microsoft to pretend that it is possible to always know what a pointer is pointing at. It might have been ANSI C that provided the volatile keyword to specifically prevent the compiler from looking deeper than it should.

There was, of course, no standard for K&R, but quite a lot of discussion concerning optimization lamented the fact that C pointers could point to hardware registers and shared data and other things that could change behind the compiler's back. And that therefore, ForTran allowed parallel processing on array elements where C did not.

It would have been far more correct, and have postponed a lot of standard-rot, to have provided, instead, a "non-volatile" keyword to tell the compiler it could decide it wouldn't need to repeat a dereference of such as pixels[ 10 ] every time it occurred.

Local optimizations are much safer and can be tightened to a far greater degree than global optimizations.

--
Joel Rees

Joel Rees

unread,
Apr 9, 2017, 1:45:52 AM4/9/17
to
On Friday, April 7, 2017 at 4:32:11 PM UTC+9, Rod Pemberton wrote:
> On Thu, 06 Apr 2017 12:33:44 GMT
> Spiros Bousbouras <spibou@(something).(something)> wrote:
>
>
> [...]
> > Example : the expression array[i] where i is beyond the end of the
> > array.
>
> FYI, C explicitly allows you to access one element past the end of the
> array. This is to allow you to access the nul character when comparing
> strings to check for string termination.

Now we know who has been writing all that vulnerable code. (Just kidding. Sort of.)

I have heard teachers say that setting a pointer to NULL and then dereferencing it is safe.

(There are, in fact, some embedded environments where it is safe, but those are special cases. Still not what I call clean coding practice.)

C array syntax allows you to access whatever a pointer points at, and pointer parameters can be named as if they were indefinitely allocated arrays. Moreover, the grammar specifically allows arbitrary array references even when the l-value is an array with a specific declared size.

In the past, certain compiler vendors allocated extra space around arrays to "help protect" the end users from programmers who did not understand that C does not bounds check arrays, and therefore assumed that no breakage this time meant their loops or other array processing was correct. This would give the appearance of allowing dereferencing one past the end of an array.

It also allowed programmers to get thoroughly confused when they started working with multi-dimension arrays, because the compiler and libraries can't cheat like that for you on multi-dimension arrays.

If you read, for instance, the library documentation for string processing functions in the standard library (fgets(), for instance), you will note that the buffer size passed in must include room for a trailing NUL if you want to be sure the trailing NUL gets stored.

This kind of thing should help you understand that something doesn't match what you just said.

The standard does specify that the run-time has to allow a pointer to point one past the end of an allocated area, as long as the pointer is not dereferenced in that state. This allows a loop that has a post-increment in the test to complete safely.

But that means, no, you must not actually access one beyond. That way lies one-off pointer errors which become vulnerabilities which allow servers and workstations to be pwned by malicious access.
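A minimal sketch of the rule being described; the function and array names are illustrative:

```c
#include <stddef.h>

/* Sketch: the loop pointer legally takes the one-past-the-end value
 * a + n as its final value; it is compared there but never
 * dereferenced, so the loop completes safely. */
int sum(const int *a, size_t n)
{
    int total = 0;
    for (const int *p = a; p != a + n; p++)  /* p == a + n only at exit */
        total += *p;
    return total;
}
```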

> If you go beyond that though,
> it's undefined ... However, many C implementations use the ability to
> access outside a C object behind the scenes without any issues, e.g.,
> to access a memory header prior to malloc'd memory blocks.

When you talk about memory allocation headers, you imply two levels (at least) of allocation -- a mass block allocation, and a fine-grain allocation.

If you try to access beyond either end of a mass block allocation, you should expect segment faults.

Within a mass block, the library maintains an allocation list of some sort, and that list might use headers that sit at either edge of the fine-grained allocation block. The grammar and syntax do allow you to take a look at those headers, but you really should not, because other library architects don't want to risk allocation headers being overwritten by stray pointers. So the allocation lists are elsewhere in those run-times.

The grammar and syntax allow the accesses, and it is not just one past. A million past is also allowed by the grammar and syntax. The run-time, however, is only required to complete an operation safely if it is within the declared bounds.

> > Thinking in terms of assembly : "array[i] will be translated into an
> > assembly instruction which calculates an address of the form base +
> > i * constant .
>
> This is a requirement of the subscript operator [] for K&R, ANSI, and
> ISO C.

Not really required, just the most natural interpretation of a one-dimensional array reference.

> It's in all the standards. They define "E1[E2]" to be
> equivalent to "*((E1)+(E2))". The subscript operator takes a pointer
> to a type and also an index. The index is scaled by the size of the
> data type pointed to. The pointer which cannot be a pointer-to-void
> 'void *' since 'void' has no data type, i.e., no size, pointer only.
> The pointer used in the subscript operator can be any pointer to any
> valid type, not just pointer for array types.

Up to here you are just talking about arbitrary math.

> The arguments to the
> subscript operator can be either order, e.g., "0123456789"[i] is
> equivalent to i["0123456789"] .

This does not make sense. I know what "string"[index] means, but I and the compiler are both extremely puzzled by index["string"]. This is not perl. "string" may resolve to a pointer, but pointers and indexes have two different sorts of existence, which you refer to when you mention void pointers. You can't use a pointer as an index very meaningfully.

> Technically, you should think of the
> subscript operator as just the left-brace, then you'll understand it's
> like any other operator in C, e.g., arithmetic +, -> component
> selection, or whatever.

Syntactically similar, semantically significantly different.
The implementation of toupper() is not really relevant to the semantics. The statement

*index = toupper( *index++ );

demonstrates the need for sequence points. The assignment is a sequence point; there is the assignment operator, "=".

++index is equivalent to index = index + 1 in several ways, but the increment does not contain a sequence point.
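A sketch of the unambiguous rewrite implied above, with the increment separated into its own statement (the function name is made up):

```c
#include <ctype.h>

/* Sketch: upper-cases a nul-terminated string in place. Each loop
 * body modifies *index exactly once, and the increment stands alone,
 * so there is no dependence on sequence-point subtleties. */
void upcase(char *index)
{
    while (*index != '\0') {
        *index = (char)toupper((unsigned char)*index);
        index++;  /* the increment no longer shares an expression */
    }
}
```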

> They also
> recommend avoiding enums for similar reasons. They also recommend
> placing numbers in comparisons as "if(0==value)" instead of
> "if(value==0)" to detect incorrect assignments such as "if(value=0)".
> etc. They'll also say not to use the address of a struct by itself,
> i.e., as a pointer, since an early, widespread C compiler, PCC, didn't
> keep track of the struct's name or somesuch issue ... ANSI C added
> other problems, e.g., obsoleted implicit int's since they caused a
> syntax conflict with typedef's, messed up pre- and post- increment and
> decrement operators, etc. Of course, C has some required behavior,
> which is not part of some of the standards, such as the C struct hack
> and Duff's device, which must work correctly.

I'm finding the above too hard to parse. Try again?

> Rod Pemberton
> --
> All it takes for humanity to conquer the world's toughest problems
> is to hope, to believe, to share, and to do, in cooperation.

Anton Ertl

unread,
Apr 9, 2017, 10:43:35 AM4/9/17
to
Joel Rees <joel...@gmail.com> writes:
>I have heard teachers say that setting a pointer to NULL and then
>dereferencing it is safe.
>
>(There are, in fact, some embedded environments where it is safe, but
>those are special cases. Still not what I call clean coding practice.)

The problem is that nasal-demon-C compilers "optimize" this
irrespective of the environment for which the program is compiled.

BTW, looking at point 1 of
<http://catb.org/jargon/html/V/vaxocentrism.html>, environments where
this was safe used to be very common once upon a time. Of course,
most user-level programs probably have been compiled and (tried to)
run on systems where dereferencing a null pointer causes a SIGSEGV, so
that practice is probably pretty uncommon these days. However, the
Linux kernel is an environment where it does not cause an exception,
and where it is used for good purposes, and which is miscompiled by
gcc unless you tell it not to (at least in this case there is a flag
for telling it not to).

>> The arguments to the
>> subscript operator can be either order, e.g., "0123456789"[i] is
>> equivalent to i["0123456789"] .
>
>This does not make sense. I know what "string"[index] means, but I and
>the compiler are both extremely puzzled by index["string"].

Obviously you. But not the compiler:

[/tmp:93578] cat xxx.c
#include <stdio.h>
int main()
{
printf("%c\n",5["abcdefghijk"]);
return 0;
}
[/tmp:93579] gcc -Wall -O xxx.c
[/tmp:93580] a.out
f

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2016: http://www.euroforth.org/ef16/

Anton Ertl

unread,
Apr 9, 2017, 1:08:43 PM4/9/17
to
Spiros Bousbouras <spi...@gmail.com> writes:
>On Wed, 05 Apr 2017 15:01:45 GMT
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>>
>> Translating Forth into traditional C (that compiles as intended by PCC
>> and early versions of GCC) is indeed not difficult; after all,
>> traditional C and Forth are semantically close to each other. Take a
>> look at Gforth or forth2c to get an idea on how to do it.
>>
>> The resulting code is, of course, full of undefined behaviour, like
>> most of the other C code around; and in this case I think it's
>> impossible to get rid of the undefined behaviour without losing
>> functionality and performance, and even if you are willing to pay that
>> price, it requires significantly more effort, and in the end, you will
>> probably still have some undefined behaviour somewhere.
>
>Here we need to make a distinction : I will term "weak undefined behaviour"
>(WUB) C code which is not defined according to the C standard and "strong
>undefined behaviour" (SUB) C code which is not defined according to
>*anything* where "anything" could mean some standard which enhances the C
>standard (like POSIX) and also relevant documentation of the compiler or
>libraries one uses or whatever. Our main concern here is predictable behaviour
>of the code so I believe the deciding concept is SUB rather than WUB.

I don't think that this distinction helps in this discussion, because

1) Different people consider different things to be "not defined
according to *anything*". E.g., assembly language and most Forth
programmers consider signed integer overflow to have a pretty
natural definition (wraparound aka modulo arithmetics), some more
mathematically-inclined people think they deserve nasal demons.

2) Nasal demon fans want license to produce nasal demons for every
undefined behaviour. E.g., in the general case both gcc and
Clang/LLVM generate code for 2s-complement arithmetic with
wraparound for signed integer addition, subtraction, and
multiplication, so you might classify it as WUB. Nasal demon fans
still want to keep signed integer overflow undefined even in those
compilers so that they make use of it for "optimizations".

3) People aware of the evils of nasal demons don't want them anywhere,
no matter what undefined behaviour there is. E.g., for an
out-of-bounds array read, I expect either an unspecified, but
stable value, or a segmentation violation/general protection fault;
that's not fully defined, but it excludes nasal demons. E.g., I
don't expect it to turn a counted loop into an endless loop (and
this is actually a nasal demon "optimization" that occurs in
practice).
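The signed-overflow case in point 2 typically bites in counted loops; a sketch (with an explicit break so the sketch itself stays defined):

```c
#include <limits.h>

/* Sketch: with wraparound semantics (e.g. gcc -fwrapv) a loop like
 * "for (i = 0; i <= limit; i++)" terminates even for limit == INT_MAX,
 * because i wraps to INT_MIN; under "overflow is undefined" a compiler
 * may assume i + 1 > i always holds and emit an endless loop. The
 * explicit break below keeps this particular sketch well-defined. */
long count_iterations(int limit)
{
    long n = 0;
    for (int i = 0; i <= limit; i++) {
        n++;
        if (i == INT_MAX)  /* avoid the overflow in this sketch */
            break;
    }
    return n;
}
```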

>In any
>case , if code has SUB then I cannot imagine on what basis one may think it
>works and it will reliably continue to do so in the future : even if it
>passes testing , then that's by coincidence. So if a code exhibits SUB then
>it's buggy and it needs to be fixed.

Of course that depends on what you consider SUB. Anyway, here's an
example.

int d[16];

int SATD (void)
{
int satd = 0, dd, k;
for (dd=d[k=0]; k<16; dd=d[++k]) {
satd += (dd < 0 ? -dd : dd);
}
return satd;
}

If there is some other global data defined earlier and later in the
program, then d[16] is a mapped (i.e., readable) memory location. And
if it is an unmapped memory location, you get a SIGSEGV, not some
nasal demon.
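For comparison, a sketch of an equivalent loop that never reads d[16]:

```c
int d[16];

/* Sketch: same sum of absolute values as the SATD code above, but the
 * index is tested before every load, so d[16] is never accessed. */
int SATD_fixed(void)
{
    int satd = 0, k;
    for (k = 0; k < 16; k++) {
        int dd = d[k];
        satd += (dd < 0 ? -dd : dd);
    }
    return satd;
}
```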

>> 3) Complete correctness proofs of programs are outside the competence
>> of nearly all programmers (and pretty pointless, because they would
>> prove only the correctness wrt the specification, which itself may be
>> buggy).
>>
>> Now, to write a C program that is guaranteed to never exercise
>> undefined behaviour requires competences on the level 3.
>
>Not in the slightest. Consider the following piece of C code :
>
> int is_leap_year(unsigned int year) {
> return 1 ;
> }
>
>It's trivial to see that this is defined but , judging from the function's
>name , it's almost certainly buggy. A complete correctness proof is many
>orders of magnitude harder than simply ensuring defined behaviour.

I don't think it is trivial to see that this is defined. I once
<2015May...@mips.complang.tuwien.ac.at> wrote:

|It seems to me that
|
| if ((size_t)(dest-src) < len)
|
|is a more efficient way to test whether dest is inside [src, src+len).
|
|Of course, since this code contains more than three tokens, it is
|likely to contain undefined behaviour

That was mostly in jest, but a post or two later, someone explained
that yes, this code is undefined. So while I do not see anything
undefined about the code you posted, just let the language lawyers at
it, and they will point it out.
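The quoted test, fleshed out as a sketch (the function name is made up):

```c
#include <stddef.h>

/* Sketch of the overlap test quoted above: relies on unsigned
 * wraparound of the pointer difference, so dest below src maps to a
 * huge size_t value that fails the comparison. Pointer subtraction
 * across distinct objects is undefined by the letter of the standard,
 * though it behaves as intended on flat-memory targets. */
int dest_in_range(const char *dest, const char *src, size_t len)
{
    return (size_t)(dest - src) < len;
}
```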

In any case, in your example it is trivial to see that it does not
satisfy the requirement that is implied by the name, so even if it
really is trivial to see that it is defined, it only shows that both
correctness and definedness are trivial to determine for trivial
programs.
But actually the bug resulted in reading from the buffer. So the
assumption by the compiler that the buffer is never again read from is
wrong, and applying dead-store elimination to this memset() is wrong;
in other words, the compiler is buggy.

But of course the GCC maintainers don't accept this as a bug: They say
that reading from the buffer after the free() is undefined behaviour,
therefore does not happen, and therefore they are in their rights to
conjure up nasal demons and "optimize" the memset()/bzero() away. And
this is not an issue that could turn out one way or the other, this
memset()/bzero() is there exactly for mitigating information leaks,
such as the one that is the main reason of the CVE.

It is telling that you do not understand how the nasal-demon
interpretation of C allowed the compiler to "optimize" the
memset()/bzero() away. So much for Andrew Haley's claim that it is
"not difficult" to write "well-defined" C, and your defense of that
stance.

>I couldn't tell you off the top of my head but I'm sure you could get the
>answer on comp.lang.c very quickly. Looking at the big picture , it could
>be that C is not a suitable back-end for Forth after all even if you got
>lucky for a number of years and your code (which presumably was exhibiting
>SUB) worked as intended.

I think that nasal-demon C is not a suitable language for anything.

> Note that this may turn out to be an opportunity for
>Forth : it may be that Forth itself is a more suitable back-end than C for other
>programming languages

Yes.

>A
>*constructive* thing to do (as opposed to complaining) would be to describe
>what guarantees with regard to C pointer comparisons on top of the C standard
>would be needed to do what you want to do. Then you could ask on comp.lang.c
>or some compiler mailing list whether these additional guarantees can be
>offered by using some appropriate flag and if not how easy would be to implement
>them as an extension.

If you think that's a promising approach, go ahead. My experience is
that they always tell you to go to the C standards committee.

>Undefined behaviour is an unfortunate side effect of 3 factors :
>
>1. The desire to allow implementors optimisations.
>2. Inability to reconcile previous (i.e. before the standard) existing practices
> therefore any attempt to define the behaviour more narrowly would unfairly
> privilege some implementations (or architectures) over others.
>3. The desire to not make the standard too large.

That was all fine and dandy in the good old days of PCC and early GCC,
but in recent years the compiler maintainers went insane and use them
as justification for miscompiling (they call it "optimizing") as many
programs as possible, with some exceptions, in particular, standard
benchmarks. Paying customers are probably also not treated as badly
as the rest of us; hmm, so nasal demons may be part of the business
models.

> But noone *likes* undefined behaviour , it is just a trade off.

The GCC and Clang compiler maintainers love it. For them reason 2 and
3 don't apply, yet with every release they miscompile more and more
cases that are undefined in the standard for reasons 2 and 3; e.g.,
signed integer overflow.

>If
>programmers get convinced to avoid undefined behaviour then this ultimately
>will make code more predictable not less.

So you love nasal demons, too.

Many programmers are convinced. But because there is actually no good
way to write code without undefined behaviour, we will get programs by
convinced programmers that have fewer undefined behaviours, but it
will still be there; and the next release of the compiler breaks some
code, the convinced programmers get busy fixing it, and the next
release breaks some more code. Bottom line: slightly faster
benchmarks, more programs that break, and more work for the C
programmers without any benefit.

Of course, if you are a programmer from a paying customer, you may be
able to exert enough pressure that they don't break your program. Or
if you are a programmer from a company that distributes binaries only,
you just use a compiler version that works, and don't downgrade to
newer versions. It's free software that's written by volunteers that
is hit the hardest by nasal demons. That is probably in line with the
sponsors of Clang/LLVM, but for GCC, I doubt it.

>But then there are always some
>stubborn ones like yourself or
>http://www.open-chess.org/viewtopic.php?f=5&t=2519 where Robert Hyatt
>insists that SUB should still do what he wants just because it did in the past.

So you think that using strcpy to do an overlapping backwards copy is
SUB, i.e., "not defined according to *anything*". Interesting.
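For an overlapping copy that is defined according to the standard, memmove() is the tool; a sketch:

```c
#include <string.h>

/* Sketch: overlapping copies are defined for memmove() but not for
 * strcpy()/memcpy(); this shifts a string left one byte in place,
 * including the trailing '\0'. */
void shift_left(char *s)
{
    memmove(s, s + 1, strlen(s + 1) + 1);
}
```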

Rod Pemberton

unread,
Apr 9, 2017, 1:34:16 PM4/9/17
to
On Sat, 8 Apr 2017 22:45:51 -0700 (PDT)
Joel Rees <joel...@gmail.com> wrote:

> On Friday, April 7, 2017 at 4:32:11 PM UTC+9, Rod Pemberton wrote:
> > On Thu, 06 Apr 2017 12:33:44 GMT
> > Spiros Bousbouras <spibou@(something).(something)> wrote:

> > > Example : the expression array[i] where i is beyond the end of
> > > the array.
> >
> > FYI, C explicitly allows you to access one element past the end of
> > the array. This is to allow you to access the nul character when
> > comparing strings to check for string termination.
>
> Now we know who has been writing all that vulnerable code. (Just
> kidding. Sort of.)

This is used within the C libraries, e.g., string and memory functions,
as well as by programmers.

> I have heard teachers say that setting a pointer to NULL and then
> dereferencing it is safe.

Setting a pointer to an explicit address value is implementation
defined behavior in C. Most OSes and resulting C
compilers are reasonable in that they allow the programmer to access
memory mapped devices. Hence, they allow you to assign values to
pointers.

> (There are, in fact, some embedded environments where it is safe, but
> those are special cases. Still not what I call clean coding practice.)

In many OS environments, you'll end up with a page fault. The OS will
not map the zero'th page in order to detect NULL pointer dereferences.

> C array syntax allows you to access whatever a [pointer] points at,
> and pointer parameters can be named as if they were indefinitely
> allocated arrays. Moreover, the grammar specifically allows arbitrary
> array references even when the l-value is an array with a specific
> declared size.

I'm not sure I quite follow here.

The C language has array declarations, but doesn't have arrays as a
fundamental type as part of the language. C uses the [] subscript
operator with a pointer and offset to effect array usage. The name of
an array reduces to a pointer to the zero'th element of the array. This
allows the array name to be used with the subscript operator.

> In the past, certain compiler vendors allocated extra space around
> arrays to "help protect" the end users from programmers who did not
> understand that C does not bounds check arrays, and therefore assumed
> that no breakage this time meant their loops or other array
> processing was correct. This would give the appearance of allowing
> dereferencing one past the end of an array.

C explicitly allows dereferencing one past the end of an array. As
stated previously, this requirement is in ALL of the C specifications.

> It also allowed programmers to get thoroughly confused when they
> started working with multi-dimension arrays, because the compiler and
> libraries can't cheat like that for you on multi-dimension arrays.
...

> If you read, for instance, the library documentation for string
> processing functions in the standard library (fgets(), for instance),
> you will note that the buffer size passed in must include room for a
> trailing NUL if you want to be sure the trailing NUL gets stored.
>
> This kind of thing should help you understand that something doesn't
> match what you just said.

You're confusing strings and arrays. C doesn't have strings as a
fundamental type, but uses arrays, which are a partial type. Part of
the parsing process of C requires that an extra character for nul be
added onto the end of the string. This is done automatically by the
compiler unless a specific size is declared for the string. In the case
of fgets(), the programmer must account for the extra character needed
to terminate strings.
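A sketch of the fgets() sizing rule just described (the wrapper name is made up):

```c
#include <stdio.h>

/* Sketch: fgets(buf, size, fp) stores at most size - 1 characters and
 * always appends a terminating '\0' on success, so the size passed in
 * must include room for that extra byte. */
int read_line(char *buf, int size, FILE *fp)
{
    return fgets(buf, size, fp) != NULL;
}
```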

> The standard does specify that the run-time has to allow a pointer to
> point one past the end of an allocated area, as long as the pointer
> is not dereferenced in that state. This allows a loop that has a
> post-increment in the test to complete safely.
...

> But that means, no, you must not actually access one beyond. That way
> lies one-off pointer errors which become vulnerabilities which allow
> servers and workstations to be pwned by malicious access.

Accessing one beyond is required to check for a nul character. E.g.,

while(*str++); /* find end-of-string *//* This is not UB */

> > If you go beyond that though,
> > it's undefined ... However, many C implementations use the ability
> > to access outside a C object behind the scenes without any issues,
> > e.g., to access a memory header prior to malloc'd memory blocks.
>
> When you talk about memory allocation headers, you imply two levels
> (at least) of allocation -- a mass block allocation, and a fine-grain
> allocation.

It doesn't imply that. The memory block and header can be allocated at
once.

> If you try to access beyond either end of a mass block allocation,
> you should expect segment faults.

This assumes that paging is enabled and memory isn't contiguously
mapped. The latter is an implicit requirement of C. Every C object is
required to be contiguous. This is usually effected by making memory
contiguously mapped.

> Within a mass block, the library maintains an allocation list of some
> sort, and that list might use headers that sit either edge of the
> fine-grained allocation block. The grammar and syntax do allow you to
> take a look at those headers, but you really should not, because
> other library architects don't want to risk allocation headers being
> overwritten by stray pointers. So the allocation lists are elsewhere
> in those runt-times.
...

> The grammar and syntax allow the accesses, and it is not just one
> past. A million past is also allowed by the grammar and syntax. The
> run-time, however, is only required to complete an operation safely
> if it is within the declared bounds.

This is not true. For C, you can legally only access data within C
objects or malloc'd memory. Any other access is undefined behavior. C
makes an exception for strings to allow accessing one past the
end-of-string to check for nul characters. Assigning a value directly
to a pointer is also undefined behavior for C. Many environments allow
this to access memory mapped devices.
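A sketch of the memory-mapped-device idiom, with an ordinary variable standing in for a hardware register so the sketch stays runnable; on real hardware the integer would be a fixed device address:

```c
#include <stdint.h>

/* Sketch: converting an integer to a pointer is implementation-
 * defined; volatile prevents the compiler from caching the "register".
 * fake_register stands in for real hardware here. */
static uint32_t fake_register;

volatile uint32_t *map_register(uintptr_t addr)
{
    return (volatile uint32_t *)addr;  /* implementation-defined */
}
```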

> > > Thinking in terms of assembly : "array[i] will be translated into
> > > an assembly instruction which calculates an address of the form
> > > base + i * constant .
> >
> > This is a requirement of the subscript operator [] for K&R, ANSI,
> > and ISO C.
>
> Not really required, just the most natural interpretation of a
> one-dimensional array reference.
>
> > It's in all the standards. They define "E1[E2]" to be
> > equivalent to "*((E1)+(E2))". The subscript operator takes a
> > pointer to a type and also an index. The index is scaled by the
> > size of the data type pointed to. The pointer which cannot be a
> > pointer-to-void 'void *' since 'void' has no data type, i.e., no
> > size, pointer only. The pointer used in the subscript operator can
> > be any pointer to any valid type, not just pointer for array
> > types.
>
> Up to here you are just talking about arbitrary math.

No, I'm talking about how the subscript operator [] in C works. Most
people think it's for array indexes (or indices), but it works with
pointers too. C doesn't have arrays within the language itself, but
has array declarations. These are reduced to pointers by the compiler,
as mentioned elsewhere.

> > The arguments to the
> > subscript operator can be either order, e.g., "0123456789"[i] is
> > equivalent to i["0123456789"] .
>
> This does not make sense. I know what "string"[index] means, but I
> and the compiler are both extremely puzzled by index["string"]. This
> is not perl. "string" may resolve to a pointer, but pointers and
> indexes have two different sorts of existence, which you refer to
> when you mention void pointers. You can't use a pointer as an index
> very meaningfully.

This is valid C. The compiler knows the type of both parameters for
the subscript operator [] . The compiler knows that the string is a
pointer to char and it knows the index is one of the integer types.

> > Technically, you should think of the
> > subscript operator as just the left-brace, then you'll understand
> > it's like any other operator in C, e.g., arithmetic +, -> component
> > selection, or whatever.
>
> Syntactically similar, semantically significantly different.
>
...

> > > Your second reference has
> > > char *index;
> > > ....
> > > *index++ = toupper(*index);
> > >
> > > .He doesn't explain why he decided to write code like this as
> > > opposed to *index = toupper(*index);
> > > index++ ;
> >
> > I'm guessing that this is most likely just pre-ANSI C. Most C style
> > guides etc are older and are for K&R C. E.g., most will recommend
> > that you use #defines and mask with logical-and instead of using
> > bit-fields, since many earlier C compilers were broken.
>
> The implementation of toupper() is not really relevant to the
> semantics. The statement
>
> *index = toupper( *index++ );
>
> demonstrates the need for sequence points. The assignment is a
> sequence point; there is the assignment operator, "=".
>
> ++index is equivalent to index = index + 1 in several ways, but the
> increment does not contain a sequence point.
>

It indicates the author didn't follow ANSI C coding rules which require
that pre- and post- increment and pre- and post-decrement be removed
from surrounding code. This is because they introduced the sequence
point concept with ANSI C and broke the expected order of operations.

> > They also
> > recommend avoiding enums for similar reasons. They also recommend
> > placing numbers in comparisons as "if(0==value)" instead of
> > "if(value==0)" to detect incorrect assignments such as
> > "if(value=0)". etc. They'll also say not to use the address of a
> > struct by itself, i.e., as a pointer, since an early, widespread C
> > compiler, PCC, didn't keep track of the struct's name or somesuch
> > issue ... ANSI C added other problems, e.g., obsoleted implicit
> > int's since they caused a syntax conflict with typedef's, messed up
> > pre- and post- increment and decrement operators, etc. Of course,
> > C has some required behavior, which is not part of some of the
> > standards, such as the C struct hack and Duff's device, which must
> > work correctly.
>
> I'm finding the above too hard to parse. Try again?
>

The point was that there are a variety of things recommended by C style
guidelines due to problems with C compilers. Also, some required C
behavior is not required by the specifications.

Rod Pemberton

unread,
Apr 9, 2017, 3:05:52 PM4/9/17
to
No, it's a pity that you've attempted to discredit me or anger me,
instead of learning something.

I used to post comp.lang.c a decade ago. Weren't you there then? I
seem to recall your name.

What I posted is either directly from the C specifications, the C
Rationale, Harbison and Steele's "C: A Reference Manual" or P.J.
Plauger's "The Standard C Library", or papers by D.M. Ritchie
et.al., or historical behavior. I also know that my understanding of
C corresponds with that of one of the original ANSI C committee members
(not P.J. Plauger), as I've had conversations with this other fellow.
I've been programming since 1981, about 1992 for C.

So, yeah, since I posted to c.l.c. in the past, I know how c.l.c. works
or did. A few self-proclaimed C experts, like Richard Heathfield or
Keith Thompson and their clique, e.g., CBFalconer RIP **PLONK**, beat up
other people in an attempt to prove themselves correct by harassing
innocent newbs. At that time, Keith knew the C specification well, but
didn't understand C itself or how it was compiled. The authors of the
failed c.l.c. book "C Unleashed" attempted to prove themselves correct
too on occasion. No one there was familiar with the Bell Laboratories
papers on C by D.M. Ritchie, B.W. Kernighan, and S.C. Johnson which
explained low-level details of how C works. None of them had read
ANSI's or ISO's "The C Rationale" and everyone dismissed it. No one
there understood that a nul character can be larger in size than a char
even though DMR wrote about it in his papers as to why the C
specification required it. No one there was familiar with the ANSI C
committee's TC's that obsoleted implicit ints due to conflicts with
typedef's or that C was no longer an LALR(1) grammar because of
typedef's and the implicit int conflict. Many of them would cite Steve
Summit's C FAQ which had some major errors at the time which have since
been corrected. Everyone there complains about some unknown C author
by the name of Schildt. People harass Heathfield by asking the same
questions over and over because he can't remember anything for more
than a few months. AIR, there were only two people there besides me
who truly understood C, but they were infrequent posters. One of them
might have been Chris Torek ... Perhaps, the other was David Tribble?
Or, Eric Sosman? Sorry, I don't recall the other person's name
anymore. I'd have to re-read a bunch of posts from the era. Of
course, I'm not really interested in rereading the posts from all the
assholes, CBFalconer, Flash Gordon, Martin Ambuhl, Mark McIntyre, Keith
Thompson, Richard Heathfield, ... Nor am I interested in reading any
posts from the current batch of them. There was only one person there
who was funny from time to time, Kenny McCormack.

Paul Rubin

unread,
Apr 9, 2017, 5:07:22 PM4/9/17
to
Rod Pemberton <NeedNotR...@xrsevnneqk.cem> writes:
> C explicitly allows dereferencing one past the end of an array. As
> stated previously, this requirement is in ALL of the C specifications.

I don't think that is right. My understanding was you're allowed to set
a pointer to point one past the end of the array, but you're not allowed
to dereference that pointer.

http://stackoverflow.com/questions/988158/take-the-address-of-a-one-past-the-end-array-element-via-subscript-legal-by-the

discusses this at some length.

> Accessing one beyond is required to check for a nul character. E.g.,
> while(*str++); /* find end-of-string *//* This is not UB */

The nul character is supposed to be within the array, not past it.
After the exit from that loop, str points one past the end of the array,
which is not UB, since you don't actually dereference it. Dereferencing
it would be UB according to everything I remember.

Spiros Bousbouras

unread,
Apr 9, 2017, 8:38:55 PM4/9/17
to
[ Crossposted to comp.lang.c since we need some C lawyers ;-) ]

On Sun, 09 Apr 2017 14:53:01 GMT
I guess I wasn't clear enough. By "anything" I mean anything related to the context.
So if I'm translating some C code (note that "translating" doesn't necessarily mean
compiling ; C is usually compiled but there are C interpreters) relevant stuff are

- C standard.
- Translator documentation. For example by using the appropriate options one may get
  additional guarantees beyond the C standard, like the -fwrapv option gcc supports,
  which makes signed integer overflow defined.
- If one uses headers or functions or macros , etc. which do not appear in the C
standard then these must be defined somewhere else like in some other standard
or they must be part of some library and be defined in the documentation of the
library.
- There may be more which is relevant. I don't see how Forth or assembly are relevant
  unless one calls an external process which is written in one of these languages.
Mathematics is relevant only to the extent that the related documentation or
standard(s) says that it's relevant. For example it follows from the C standard
that for integers x and y of the same type T , if the mathematically correct result
of x+y fits into T then that's what x+y evaluates to. But if it didn't specify
this then one should not have assumed that the mathematical definition of + is
relevant even if they felt that it would be the natural way to define C's +
operator.

The underlying principle is that one must have some rational reason to expect
code to behave in a certain way. For example if I have C code which includes
frooboz(123) ;

I must have some rational reason to expect this to translate without errors and
do something specific I have in mind. Since the C standard does not define frooboz()
, this reason must come from somewhere else. So we definitely need some stronger
concept than WUB. What counts as "pretty natural definition" for signed integer
overflow (or anything else) seems subjective to me and I don't even see how it is
relevant whether someone is a Forth programmer or assembly programmer (of which
architecture ?) or mathematically inclined. Even in mathematics , what x+y means
depends on the context. The definition is different if x and y are elements of ℤ
(integers) as opposed to say ℤ/5ℤ. So what is defined in some context depends
on the context rather than the overall knowledge and experiences one has. To get to
one of your favorite examples , if your C code has signed integer overflow what's
your rational reason to expect that it will do something specific (like wraparound)?
The reason cannot be the C standard because the C standard says it's undefined
behaviour. If your compiler documentation does not say that it will produce assembly
which does wraparound then you have no reason to expect the code to behave in a
predictable manner so that code needs to be changed. You say above "to get rid of
the undefined behaviour without losing functionality and performance". Without
rational reason to believe that your code will do what you want then you don't
actually have functionality ; if it seems to behave the way you want then it is just
a coincidence and the behaviour may go away without warning like if you use a
different version of compiler. So there's nothing to lose (apart perhaps from
occasions where you urgently want to achieve something and you're willing to kludge
it) , either you fix the code or you have nothing you can depend on.

> 2) Nasal demon fans want license to produce nasal demons for every
> undefined behaviour. E.g., in the general case both gcc and
> Clang/LLVM generate code for 2s-complement arithmetic with
> wraparound for signed integer addition, subtraction, and
> multiplication, so you might classify it as WUB. Nasal demon fans
> still want to keep signed integer overflow undefined even in those
> compilers so that they make use of it for "optimizations".

As I said in a previous post , I don't believe there are "nasal demon fans". As for
wanting license I find this puzzling because I don't believe they need a license.
For example , one might decide to write a C compiler which supports all other parts
of the C standard but not the for statement. One does not need any kind of license
to do this , one simply needs (from a moral point of view) to be honest about what
they offer. And then other people can decide if they want to use the compiler or
not. But if they decide to use it then their expectations must be based on what the
compiler does claim to offer , not on what they would wish it to offer. They might
wish that it also supported the for statement but since it doesn't claim to do so
then they shouldn't use for in their code (if they want to translate their code
with this compiler).

By the way , regarding gcc and signed integer operations , what do you mean "general
case" ? My impression is that without using -ftrapv or -fwrapv then overflow for
signed integer operations is undefined.
Are you translating this with a C translator which defines in its documentation
things like "mapped memory location" ? If not then these terms are meaningless (in
the specific context of using this specific code with this specific translator) , it
is SUB and you can't have any expectations whatsoever regarding how this code will
behave. I certainly wouldn't. I take it you would ? If yes on what basis ?

> >> 3) Complete correctness proofs of programs are outside the competence
> >> of nearly all programmers (and pretty pointless, because they would
> >> prove only the correctness wrt the specification, which itself may be
> >> buggy).
> >>
> >> Now, to write a C program that is guaranteed to never exercise
> >> undefined behaviour requires competences on the level 3.
> >
> >Not in the slightest. Consider the following piece of C code :
> >
> > int is_leap_year(unsigned int year) {
> > return 1 ;
> > }
> >
> >It's trivial to see that this is defined but , judging from the function's
> >name , it's almost certainly buggy. A complete correctness proof is many
> >orders of magnitude harder than simply ensuring defined behaviour.
>
> I don't think it is trivial to see that this is defined.

Well , I'm not going to locate and quote all the parts of the C standard which
support my view but I have no doubts that it is defined.

> I once
> <2015May...@mips.complang.tuwien.ac.at> wrote:
>
> |It seems to me that
> |
> | if ((size_t)(dest-src) < len)
> |
> |is a more efficient way to test whether dest is inside [src, src+len).
> |
> |Of course, since this code contains more than three tokens, it is
> |likely to contain undefined behaviour
>
> That was mostly in jest, but a post or two later, someone explained
> that yes, this code is undefined. So while I do not see anything
> undefined about the code you posted, just let the language lawyers at
> it, and they will point it out.

Ok , we'll see what the language lawyers say.

> In any case, in your example it is trivial to see that it does not
> satisfy the requirement that is implied by the name, so even if it
> really is trivial to see that it is defined, it only shows that both
> correctness and definedness are trivial to determine for trivial
> programs.

My point was simply to provide a counterexample to your statement
Now, to write a C program that is guaranteed to never exercise
undefined behaviour requires competences on the level 3.

I provided a reference above and it does not support your claims that this
is what caused the bug. Do you have a reference which says otherwise ?

> It is telling that you do not understand how the nasal-demon
> interpretation of C allowed the compiler to "optimize" the
> memset()/bzero() away. So much for Andrew Haley's claim that it is
> "not difficult" to write "well-defined" C, and your defense of that
> stance.

I understand what you're claiming but you haven't provided any reference that this
is what caused the bug in the case of OpenSSH.

> >I couldn't tell you off the top of my head but I'm sure you could get the
> >answer on comp.lang.c very quickly. Looking at the big picture , it could
> >be that C is not a suitable back-end for Forth after all even if you got
> >lucky for a number of years and your code (which presumably was exhibiting
> >SUB) worked as intended.
>
> I think that nasal-demon C is not a suitable language for anything.

I use it all the time with perfectly predictable results. I have my complaints
but they are not caused by undefined behaviour.

> > Note that this may turn out to be an opportunity for
> >Forth : it may be that Forth itself is a more suitable back-end than C for other
> >programming languages
>
> Yes.
>
> >A
> >*constructive* thing to do (as opposed to complaining) would be to describe
> >what guarantees with regard to C pointer comparisons on top of the C standard
> >would be needed to do what you want to do. Then you could ask on comp.lang.c
> >or some compiler mailing list whether these additional guarantees can be
> >offered by using some appropriate flag and if not how easy would be to implement
> >them as an extension.
>
> If you think that's a promising approach, go ahead. My experience is
> that they always tell you to go to the C standards committee.

Before you go to the C standards committee you need to have a specific proposal. I
take it that for C to be a suitable back end for implementing Forth , you want
comparison of C pointers pointing to different objects to be defined. Could you
provide a set of axioms that such comparisons ought to satisfy to be adequate for
your purposes ? If not then even if C compiler writers or the C standards committee
wanted to accommodate you , they have no way to do so. You can't expect them to read
your mind , can you ? "Do the natural thing" is not a meaningful definition. I don't
think that the following from your paper is a meaningful definition either :

C* A language (or family of languages) where language constructs correspond
directly to things that the hardware does. E.g., * corresponds to what a
hardware multiply instruction does. In terms of the C standard, conforming
programs are written in C*.

I couldn't provide such a set of axioms because I've never implemented a Forth so I
don't know in what ways standard C as a back end is lacking.

> >Undefined behaviour is an unfortunate side effect of 3 factors :
> >
> >1. The desire to allow implementors optimisations.
> >2. Inability to reconcile previous (i.e. before the standard) existing practices
> > therefore any attempt to define the behaviour more narrowly would unfairly
> > privilege some implementations (or architectures) over others.
> >3. The desire to not make the standard too large.
>
> That was all fine and dandy in the good old days of PCC and early GCC,
> but in recent years the compiler maintainers went insane and use them
> as justification for miscompiling (they call it "optimizing") as many
> programs as possible, with some exceptions, in particular, standard
> benchmarks. Paying customers are probably also not treated as badly
> as the rest of us; hmm, so nasal demons may be part of the business
> models.

If it's SUB then it's impossible to miscompile it because correct compilation is not
defined.

> > But noone *likes* undefined behaviour , it is just a trade off.
>
> The GCC and Clang compiler maintainers love it. For them reason 2 and
> 3 don't apply, yet with every release they miscompile more and more
> cases that are undefined in the standard for reasons 2 and 3; e.g.,
> signed integer overflow.

gcc provides 2 different options for getting predictable behaviour for signed
integer overflow. So this contradicts your statement.

> >If
> >programmers get convinced to avoid undefined behaviour then this ultimately
> >will make code more predictable not less.
>
> So you love nasal demons, too.

I avoid SUB so they are indifferent to me. For the point I was making , you snipped
relevant context. I said

The talk of nasal demons is to drive home the point (especially to new
programmers) that when dealing with undefined behaviour (that is SUB) there's no
"reasonable" behaviour one can expect therefore *** Just say no ***. If
programmers get convinced to avoid undefined behaviour then this ultimately will
make code more predictable not less.

To add to the above , as a metaphor which is used as an educational tool , I
consider it unfortunate because it ignores the fact that a computer exists and
operates in the real world so it can only cause things which are physically
possible. There are other ways to drive home the point that SUB is unpredictable
without mentioning physical impossibilities.

> Many programmers are convinced. But because there is actually no good
> way to write code without undefined behaviour, we will get programs by
> convinced programmers that have fewer undefined behaviours, but it
> will still be there; and the next release of the compiler breaks some
> code, the convinced programmers get busy fixing it, and the next
> release breaks some more code. Bottom line: slightly faster
> benchmarks, more programs that break, and more work for the C
> programmers without any benefit.

There is a way to write code without SUB and that is to be familiar with the
standards and documentation which is relevant to the code one writes. Bugs may exist
of course but this would be the case just as much even with no undefined behaviour.

[...]

> >But then there are always some
> >stubborn ones like yourself or
> >http://www.open-chess.org/viewtopic.php?f=5&t=2519 where Robert Hyatt
> >insists that SUB should still do what he wants just because it did in the past.
>
> So you think that using strcpy to do an overlapping backwards copy is
> SUB, i.e., "not defined according to *anything*". Interesting.

I read the thread a while ago and I can't be bothered to reread it but I don't think
that Hyatt said that he was using a C library which , as an extension to the C
standard , defined strcpy() for overlapping objects. If that's the case then the
behaviour was not defined according to anything relevant to the code he was writing
so I would classify it as SUB and I don't believe that he had any rational reason to
expect a specific behaviour (but he thought otherwise). What one might consider
"natural" or useful behaviour for strcpy() does not affect the definition of SUB. I
will present a more general principle :

For the tools you use , you cannot expect what has not been promised to you.

A C compiler is a tool and so is a C library (whether it's the standard C library or
some other) ; they were written by humans. If those humans do not tell you for
example "we made sure that strcpy() copies correctly even overlapping objects" then
you have no rational reason to expect it.

--
This, of course, is a straw man. As the Stanford Encyclopedia of Philosophy observes
Moral relativism has the unusual distinction - both within philosophy and outside
it - of being attributed to others, almost always as a criticism, far more often than
it is explicitly professed by anyone.
http://www.mathpages.com/home/kmath557/kmath557.htm

James Kuyper

unread,
Apr 9, 2017, 10:46:08 PM4/9/17
to
On 04/09/2017 08:38 PM, Spiros Bousbouras wrote:
> [ Crossposted to comp.lang.c since we need some C lawyers ;-) ]
>
> On Sun, 09 Apr 2017 14:53:01 GMT
> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
...
>> I once
>> <2015May...@mips.complang.tuwien.ac.at> wrote:
>>
>> |It seems to me that
>> |
>> | if ((size_t)(dest-src) < len)
>> |
>> |is a more efficient way to test whether dest is inside [src, src+len).
>> |
>> |Of course, since this code contains more than three tokens, it is
>> |likely to contain undefined behaviour
>>
>> That was mostly in jest, but a post or two later, someone explained
>> that yes, this code is undefined. So while I do not see anything
>> undefined about the code you posted, just let the language lawyers at
>> it, and they will point it out.
>
> Ok , we'll see what the language lawyers say.

It would help to know what the relevant data types are, but the comment
about "dest is inside [src, src+len)" suggests that dest and src are both
pointers. The C standard defines subtraction of pointers in terms of
positions in the SAME array. If src and dest don't both point into or
one past the end of the same array, subtraction of those pointers has
undefined behavior (6.5.6p9).
Consider an implementation of C targeting a machine with a
segment-offset architecture, where every pointer into a given object
(including pointers to sub-objects) has the same segment value (which
implies that no object can be bigger than a single segment), while
pointers into completely unrelated objects will, in general, have
different segments. On such a machine, the fact that this behavior is
undefined means that a fully conforming implementation of C is allowed
to implement pointer subtraction by simply subtracting the offsets and
IGNORING the segment numbers. Feel free to figure out the implications
that would have for your dest-src test.

...
>>> A
>>> *constructive* thing to do (as opposed to complaining) would be to describe
>>> what guarantees with regard to C pointer comparisons on top of the C standard
>>> would be needed to do what you want to do. Then you could ask on comp.lang.c
>>> or some compiler mailing list whether these additional guarantees can be
>>> offered by using some appropriate flag and if not how easy would be to implement
>>> them as an extension.
>>
>> If you think that's a promising approach, go ahead. My experience is
>> that they always tell you to go to the C standards committee.

Redirecting you to the standards committee would be an inappropriate
response to an inquiry about the existence or feasibility of an extension
to C. Are you sure you didn't word your inquiry in such a way that it
sounded like you were asking for a change to the standard itself?

> Before you go to the C standards committee you need to have a specific proposal. I
> take it that for C to be a suitable back end for implementing Forth , you want
> comparison of C pointers pointing to different objects to be defined. Could you

Note: C++ requires that std::less<T*> must provide a total ordering of
all pointers of type T*, even if the "<" operator does not provide such
an ordering (20.9.5p14). C++ has a sufficiently strong shared heritage
with C that the fact that the C++ standard can successfully impose this
requirement strongly suggests that the C standard could impose a similar
requirement.
However, it's telling that C++ did NOT impose this requirement on the
"<" operator itself, but only on std::less<T*>. That's because there are
some systems where meeting this requirement makes std::less<T*>(p,q)
substantially slower than p<q.

Robert Wessel

unread,
Apr 9, 2017, 11:28:03 PM4/9/17
to
It also does not (necessarily) produce the same ordering as does the
built-in operator <. It does if the built-in < comparison is between
objects that can be compared without undefined behavior, but if not,
the two can produce different results.

It also does not require any particular order for those objects, just
a consistent one - that might become visible if the objects overlap in
memory. IOW, if you have two overlapping objects of 10 bytes each, A
and B, with an overlap of five bytes, such that A[5] is the same
position in storage as B[0], there's no requirement that std::less
actually order A before B. Of course, that kind of overlap is
undefined behavior itself.