Pointer to address zero


James Harris

Jul 20, 2021, 1:33:45 PM
For my OS project I have been looking at program loading and that has
led me to query what would be required in a language to support address
zero being accessible and a pointer to it being considered to be valid.

If I use paging I can reserve the lowest page so that address zero is
inaccessible. A dereference of a zeroed pointer would be caught by the
CPU triggering a fault.

However, if on x86 I don't use paging then a reference to address zero
would not trigger a fault so it would not be caught and diagnosed. And
it's not just that case. Other CPUs or microcontrollers may, presumably,
allow access to address zero. Therefore there are cases where a program
may have a pointer to address zero and that pointer could be legitimate.

Hence the question: how should a language support access to address
zero? Any ideas?


--
James Harris


Dmitry A. Kazakov

Jul 20, 2021, 2:01:25 PM
On 2021-07-20 19:33, James Harris wrote:

> Hence the question: how should a language support access to address
> zero? Any ideas?

Where is a problem? Machine address is neither integer nor pointer, any
resemblance to persons living or dead is purely coincidental as they
write before the movie starts...

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

James Harris

Jul 20, 2021, 4:40:55 PM
On 20/07/2021 19:01, Dmitry A. Kazakov wrote:
> On 2021-07-20 19:33, James Harris wrote:
>
>> Hence the question: how should a language support access to address
>> zero? Any ideas?
>
> Where is a problem?

The problem is in the implementation. Should a language recognise such a
thing as an invalid address and, if so, what value or range of values
should indicate that a given address is invalid?

Compiled code often uses address zero to indicate 'invalid' but that
would not be possible if address zero were to be accessible.


>
> Machine address is neither integer nor pointer, any
> resemblance to persons living or dead is purely coincidental as they
> write before the movie starts...
>

How are you distinguishing between pointer and address? In C a pointer
is usually, though it does not have to be always, implemented as an
address.

(Doesn't that annotation usually come at the end?)


--
James Harris

Dmitry A. Kazakov

Jul 21, 2021, 3:19:52 AM
On 2021-07-20 22:40, James Harris wrote:
> On 20/07/2021 19:01, Dmitry A. Kazakov wrote:
>> On 2021-07-20 19:33, James Harris wrote:
>>
>>> Hence the question: how should a language support access to address
>>> zero? Any ideas?
>>
>> Where is a problem?
>
> The problem is in the implementation. Should a language recognise such a
> thing as an invalid address and, if so, what value or range of values
> should indicate that a given address is invalid?

All addresses are valid. Some cannot be converted to some pointer types.

The system-dependent package would usually provide a constant
Null_Address that is guaranteed to never indicate accessible memory on
the given machine. The representation of Null_Address is irrelevant.

> Compiled code often uses address zero to indicate 'invalid' but that
> would not be possible if address zero were to be accessible.

In a properly designed language you would not be able to write a program
in such a manner.

> How are you distinguishing between pointer and address?

That thing is called the type.

> In C a pointer
> is usually, though it does not have to be always, implemented as an
> address.

So what?

Rod Pemberton

Jul 21, 2021, 12:52:09 PM
On Tue, 20 Jul 2021 21:40:51 +0100
James Harris <james.h...@gmail.com> wrote:

> On 20/07/2021 19:01, Dmitry A. Kazakov wrote:
> > On 2021-07-20 19:33, James Harris wrote:

> >> Hence the question: how should a language support access to
> >> address zero? Any ideas?
> >
> > Where is a problem?
>
> The problem is in the implementation. Should a language recognise
> such a thing as an invalid address and, if so, what value or range of
> values should indicate that a given address is invalid?

What exactly do you mean by,
"recognize such a thing as an invalid address"?

e.g., prohibiting any of the following,

a) assignment of a zero value to an address pointer
b) comparison of a zero value with an address pointer
c) writing to the location at address zero, i.e., a dereference


First, let's for sake of argument say the language's NULL pointer (or
equivalent) is actually of value zero, as it doesn't have to be, at
least for the C language, but usually is implemented as a zero value.
This eliminates the "thought" complexity over NULL in C etc being a
non-zero value.

I'd strongly argue that allowing b) is a language requirement.
I'd argue that a) is up to the language implementation.
I'd argue that c) is up to the operating system implementation or
hardware.
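
For concreteness, and still assuming NULL is represented as zero, the
three cases look roughly like this in C:

void example(int *p)
{
    p = 0;                        /* a) assign a zero value to a pointer */
    if (p == 0) { /* ... */ }     /* b) compare a pointer with zero */
    *(volatile int *)0 = 99;      /* c) write through address zero - faults on
                                     a paged system, may "work" on bare metal */
}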

> Compiled code often uses address zero to indicate 'invalid' but that
> would not be possible if address zero were to be accessible.

Sure it is.

The 'invalid' condition of which you speak is the result of a pointer
to pointer comparison, as nothing is written to address zero.

The writing to address zero is a dereference, not a pointer comparison,
and will be allowed if the hardware is incapable of blocking the write
to the address location, i.e., zero, e.g., via an invalid or unmapped
page.



James Harris

Jul 21, 2021, 1:24:09 PM
On 21/07/2021 08:19, Dmitry A. Kazakov wrote:
> On 2021-07-20 22:40, James Harris wrote:
>> On 20/07/2021 19:01, Dmitry A. Kazakov wrote:
>>> On 2021-07-20 19:33, James Harris wrote:
>>>
>>>> Hence the question: how should a language support access to address
>>>> zero? Any ideas?
>>>
>>> Where is a problem?
>>
>> The problem is in the implementation. Should a language recognise such
>> a thing as an invalid address and, if so, what value or range of
>> values should indicate that a given address is invalid?
>
> All addresses are valid. Some cannot be converted to some pointer types.
>
> The system-dependent package would usually provide a constant
> Null_Address that is guaranteed to never indicate accessible memory on
> the given machine. The representation of Null_Address is irrelevant.
>
>> Compiled code often uses address zero to indicate 'invalid' but that
>> would not be possible if address zero were to be accessible.
>
> In a properly designed language you would not be able to write a program
> in such a manner.

I said /compiled/ code, not program source. In the /source/ the
programmer would still be able to write tests akin to

if p != null

I may even allow

if p

as meaning the same as the above though I guess some folk (e.g. Dmitry?)
will not like the idea of treating a pointer as a boolean.

As I say, compiled code often uses address zero as a null pointer,
knowing that the OS will make the lowest page inaccessible so that an
attempt to access it would generate a fault. And that's what I have long
intended to do. But when looking at different ways of loading a program
I realised that it was not always possible to reserve address zero.

In fact, sometimes a lot more than just one page is reserved. According
to a video explainer I watched a while ago 32-bit Linux sets the lowest
accessible address to something like 0x0040_0000 so that the hardware
will trap not just

*p

but also

p[q]

for some significant size of q (all where p is null). The trouble is
that this takes away 4MB of the address space and, more importantly,
means that the addresses a programmer will see in a debugging session or
a dump have more significant digits than they need and are therefore
harder to read than necessary.

If, by contrast, null is set a little higher than the accessible memory,
program data areas could be at lower addresses, making debugging
sessions and dumps easier to read.

The hardware would still trap on both of

*p
p[q]

In fact, q could potentially be a lot higher than in the 'normal' model
because not just 4M but all the addresses from p[0] to the highest
memory address would be inaccessible and would trap (given a suitable CPU).

To make this work I would have to have null (the null address)
determined not at compile time but at program load time.

That ought to cope well with various OS memory models though it does
have a downside. If a structure containing a pointer is mapped over
zeroed memory the pointer will not be null but will be considered to be
valid. (It will point at location zero.)
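
To sketch the idea (the names here are illustrative, nothing is settled):
the loader, or the startup code, would set one global value before user
code runs, and compiled null tests would compare against that rather than
against a literal zero.

#include <stdint.h>

/* Set by the loader or startup code to an address just above the
   program's accessible memory. */
uintptr_t null_addr;

int is_null(const void *p)
{
    /* the comparison compiled code would perform instead of testing for 0 */
    return p == (const void *)null_addr;
}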

>
>> How are you distinguishing between pointer and address?
>
> That thing is called the type.

OK ... then what, to you, distinguishes them? Alignment? Range? History?
Something else?


--
James Harris

Dmitry A. Kazakov

Jul 21, 2021, 1:44:34 PM
On 2021-07-21 19:24, James Harris wrote:

> I said /compiled/ code, not program source. In the /source/ the
> programmer would still be able to write tests akin to
>
>   if p != null
>
> I may even allow
>
>   if p
>
> as meaning the same as the above though I guess some folk (e.g. Dmitry?)
> will not like the idea of treating a pointer as a boolean.
>
> As I say, compiled code often uses address zero as a null pointer,

No, it uses the representation of null, whatever it be [*].

Furthermore, in a decent language with memory pool support, each pool
could have its own null.

>> That thing is called the type.
>
> OK ... then what, to you, distinguishes them? Alignment? Range? History?
> Something else?

https://en.wikipedia.org/wiki/Nominal_type_system

-------------------------
In an advanced language pointer comparisons could be non-trivial, e.g.
when two pointers refer to different classes of the same object under
multiple inheritance. In that case the memory representations of p and q
could differ, yet semantically p = q because both ultimately point to
the same object [provided the language lets p and q be compared]. An
implementation would convert both p and q to pointers of a specific
type and then compare those.

James Harris

Jul 21, 2021, 5:24:03 PM
On 21/07/2021 18:53, Rod Pemberton wrote:
> On Tue, 20 Jul 2021 21:40:51 +0100
> James Harris <james.h...@gmail.com> wrote:
>
>> On 20/07/2021 19:01, Dmitry A. Kazakov wrote:
>>> On 2021-07-20 19:33, James Harris wrote:
>
>>>> Hence the question: how should a language support access to
>>>> address zero? Any ideas?
>>>
>>> Where is a problem?
>>
>> The problem is in the implementation. Should a language recognise
>> such a thing as an invalid address and, if so, what value or range of
>> values should indicate that a given address is invalid?
>
> What exactly do you mean by,
> "recognize such a thing as an invalid address"?
>
> e.g., prohibiting any of the following,
>
> a) assignment of zero value to an address pointer
> b) comparison of zero value with a address pointer
> c) writing to location with address zero, i.e., dereference

I'm not really sure what you mean by the above so my replies below may
not be in line with what you were thinking of.

I was really asking whether a language ought to have the concept of one
address which is invalid. Imagine a 64k machine. One might want to point
to any of those 64k memory cells. In which case it's maybe a bad idea to
reserve one address as invalid.

But the rest of the topic is assuming that one address would be reserved
as the null address.

>
>
> First, let's for sake of argument say the language's NULL pointer (or
> equivalent) is actually of value zero, as it doesn't have to be, at
> least for the C language, but usually is implemented as a zero value.
> This eliminates the "thought" complexity over NULL in C etc being a
> non-zero value.
>
> I'd strongly argue that allowing b) is a language requirement.

Do you mean as in C's

if (p == 0)

?

Why not, instead, require the comparison to be against NULL, instead of
zero, as in

if (p == NULL)

or, indeed, just allow

if (p)

?

> I'd argue that a) is up to the language implementation.

I presume you mean

p = 0;

but is zero really needed when that could be, instead,

p = NULL;

?

> I'd argue that c) is up to the operating system implementation or
> hardware.

OK.

>
>> Compiled code often uses address zero to indicate 'invalid' but that
>> would not be possible if address zero were to be accessible.
>
> Sure it is.

I don't understand what you mean.

>
> The 'invalid' condition of which you speak is the result of a pointer
> to pointer comparison, as nothing is written to address zero.

No, I was really thinking of paging reserving page 0 as inaccessible -
so that any dereference of a null pointer (null being 0, in this case)
whether read or write would generate a fault.

But it's only really possible to prohibit access to address zero if one
is using paging. And as you know, that's not the only way to design an
OS. That's why I was thinking to make null an address /above/ the
user-accessible memory space. That would work whether one was using
paging or not. (The actual address for null would be set when a program
was loaded rather than being a constant known at compile time. That's a
bit different from normal but I think it could be done.)

>
> The writing to address zero is a dereference, not a pointer comparison,
> and will be allowed if the hardware is incapable of blocking the write
> to the address location, i.e., zero, e.g., via an invalid or unmapped
> page.

Interesting. So you are suggesting that if p is zero then p will compare
as NULL (and, hence, invalid) but that

*p = 99;

would still work?


--
James Harris


Bart

Jul 21, 2021, 6:48:19 PM
On 20/07/2021 18:33, James Harris wrote:
> For my OS project I have been looking at program loading and that has
> led me to query what would be required in a language to support address
> zero being accessible and a pointer to it being considered to be valid.
>
> If I use paging I can reserve the lowest page so that address zero is
> inaccessible. A dereference of a zeroed pointer would be caught by the
> CPU triggering a fault.
>
> However, if on x86 I don't use paging then a reference to address zero
> would not trigger a fault so it would not be caught and diagnosed.

How is that different from any other access to invalid memory? Or an
access of address 1 (or 8 if aligned)?

> And
> it's not just that case. Other CPUs or microcontrollers may, presumably,
> allow access to address zero. Therefore there are cases where a program
> may have a pointer to address zero and that pointer could be legitimate.
>
> Hence the question: how should a language support access to address
> zero? Any ideas?

If the hardware allows a meaningful dereference to address 0, and you
need to have your HLL access that same location via a pointer, then you
need to make it possible.

However, if zero is also used for the null pointer value, so that for
example P=null means that P has not been assigned to anything, then the
two uses might interfere with each other.

Then you might look at using an alternate representation for a null
pointer value.

Personally, I'd just make the first few bytes of memory special. Make
sure address 0 never occurs as a heap allocation, and rarely comes up as
the address of an object in the HLL.

My latest language has a 'nil' value for pointers (you can't use 0),
whose representation is not specified, but it is generally understood to
be all zeros.

That means that data structures that exist in the zero-data segment
(BSS?) will be guaranteed to have any embedded pointers set to nil.
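
In C terms the effect is something like this (assuming nil really is all
zeros):

struct node { struct node *next; int value; };

/* Static objects live in the zero-filled segment, so every 'next' field
   starts out as all-zero bits - i.e. nil for free, given that choice of
   representation. */
static struct node table[16];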

Just stick something at address 0 that is not going to be dereferenced
via a HLL pointer. But if you really need to access that location, then
just do it.

Rod Pemberton

Jul 21, 2021, 11:36:15 PM
On Wed, 21 Jul 2021 22:24:00 +0100
James Harris <james.h...@gmail.com> wrote:

> On 21/07/2021 18:53, Rod Pemberton wrote:
> > On Tue, 20 Jul 2021 21:40:51 +0100
> > James Harris <james.h...@gmail.com> wrote:
> >> On 20/07/2021 19:01, Dmitry A. Kazakov wrote:
> >>> On 2021-07-20 19:33, James Harris wrote:

> >>>> Hence the question: how should a language support access to
> >>>> address zero? Any ideas?
> >>>
> >>> Where is a problem?
> >>
> >> The problem is in the implementation. Should a language recognise
> >> such a thing as an invalid address and, if so, what value or range
> >> of values should indicate that a given address is invalid?
> >
> > What exactly do you mean by,
> > "recognize such a thing as an invalid address"?
> >
> > e.g., prohibiting any of the following,
> >
> > a) assignment of zero value to an address pointer
> > b) comparison of zero value with a address pointer
> > c) writing to location with address zero, i.e., dereference
>
> I'm not really sure what you mean by the above so my replies below
> may not be in line with what you were thinking of.
>
> I was really asking whether a language ought to have the concept of
> one address which is invalid.

Invalid for what? i.e., comparison? writing? reading?

> Imagine a 64k machine. One might want
> to point to any of those 64k memory cells. In which case it's maybe a
> bad idea to reserve one address as invalid.
>
> But the rest of the topic is assuming that one address would be
> reserved as the null address.
>

The fact that one address is reserved as the null address doesn't
prohibit writing to that address or dereferencing that address.
Reserving the address only ensures that valid pointers never compare
equal to it, so that a pointer's initialization, or lack thereof, can be
detected.

> > First, let's for sake of argument say the language's NULL pointer
> > (or equivalent) is actually of value zero, as it doesn't have to
> > be, at least for the C language, but usually is implemented as a
> > zero value. This eliminates the "thought" complexity over NULL in C
> > etc being a non-zero value.
> >
> > I'd strongly argue that allowing b) is a language requirement.
>
> Do you mean as in C's
>
> if (p == 0)
>
> ?
>
> Why not, instead, require the comparison to be against NULL, instead
> of zero, as in
>
> if (p == NULL)
>
> or, indeed, just allow
>
> if (p)
>
> ?

Since I defined NULL as zero for argument's sake, these are equivalent.
There is no difference between using NULL and using zero. If we had
chosen to use or allow a non-zero value for NULL, then we'd need to do
the latter, i.e., use (p==NULL), to ensure the correct non-zero value
is used for NULL in the comparison.

> > I'd argue that a) is up to the language implementation.
>
> I presume you mean
>
> p = 0;
>
> but is zero really needed when that could be, instead,
>
> p = NULL;
>
> ?

Since I defined NULL as zero for argument's sake, these are equivalent.
There is no difference between using NULL and using zero. If we had
chosen to use or allow a non-zero value for NULL, then, yes, we'd
really need to have a distinct zero value (since NULL would be
non-zero), as there would be no other way to access address zero.

> > [snip]
>
> No, I was really thinking of paging reserving page 0 as inaccessible
> - so that any dereference of a null pointer (null being 0, in this
> case) whether read or write would generate a fault.
>
> But it's only really possible to prohibit access to address zero if
> one is using paging. And as you know, that's not the only way to
> design an OS. That's why I was thinking to make null an address
> /above/ the user-accessible memory space. That would work whether one
> was using paging or not. (The actual address for null would be set
> when a program was loaded rather than being a constant known at
> compile time. That's a bit different from normal but I think it could
> be done.)

Yes. Didn't I link you to a post by one of the C specification's authors
about an example non-zero implementation of NULL?
https://groups.google.com/g/comp.std.c/c/ez822gwxxYA/m/Jt94XH7AVacJ

> > The writing to address zero is a dereference, not a pointer
> > comparison, and will be allowed if the hardware is incapable of
> > blocking the write to the address location, i.e., zero, e.g., via
> > an invalid or unmapped page.
>
> Interesting. So you are suggesting that if p is zero then p will
> compare as NULL (and, hence, invalid) but that
>

Let's at least initialize p. Should we declare it with some type?

p = NULL;

> *p = 99;
>
> would still work?
>

In general, that's "Yes", but it's conditional. On Linux or Windows,
I'd expect a page fault, since they use paging and probably mark the
zero'th page as invalid. For DOS, I'd **usually** - but not always -
expect a write to address location zero with value 99.

If NULL is defined as zero, and writing to address zero isn't blocked
due to paging etc, then the answer is: Yes. That is how most C
compilers work/worked on environments a) without paging or prior to
paging, and b) where NULL is defined as zero (which is most of them).
For example, it works this way for many DOS C compilers, as most of
them are set up for 16-bit DOS, or for most 32-bit DOS DPMI hosts which
don't use paging. However, some 32-bit DOS DPMI hosts do use paging,
where such a dereference would page fault, just like for Windows or
Linux.

If you remember, assignment of an integer to a pointer is explicitly
undefined behavior (UB) for C, but most C compilers define this
behavior to be valid in order to access memory locations outside of C's
memory allocations. This is especially useful for memory-mapped
hardware or data structures, e.g., BIOS BDA, EBDA, IVT, etc. Typically,
you'll need to declare such access to be "volatile" as well, as the
compiler doesn't recognize the region as being allocated or in-use, and
may optimize away access.
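
For example, something along these lines for reading a word out of the
BIOS Data Area - the address is the usual real-mode one, and this is only
meaningful where that memory is actually mapped and accessible:

#include <stdint.h>

/* The BDA starts at physical address 0x400; its first word holds the COM1
   base I/O port. 'volatile' stops the compiler optimising the read away. */
volatile uint16_t *const bda_com1 = (volatile uint16_t *)0x0400;

uint16_t com1_base_port(void)
{
    return *bda_com1;
}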

Rod Pemberton

Jul 21, 2021, 11:38:02 PM
On Wed, 21 Jul 2021 18:24:01 +0100
James Harris <james.h...@gmail.com> wrote:

> As I say, compiled code often uses address zero as a null pointer,
> knowing that the OS will make the lowest page inaccessible so that an
> attempt to access it would generate a fault. And that's what I have
> long intended to do. But when looking at different ways of loading a
> program I realised that it was not always possible to reserve address
> zero.

Yes.

You'll have problems on any environment without paging, e.g., 8-bit
computers (C64, Apple II), 16-bit computers (Macintosh, Amiga), 16-bit
DOS, 32-bit DOS DPMI without a paging DPMI host, etc.

> In fact, sometimes a lot more than just one page is reserved.
> According to a video explainer I watched a while ago 32-bit Linux
> sets the lowest accessible address to something like 0x0040_0000 so
> that the hardware will trap not just
>
> *p
>
> but also
>
> p[q]
>
> for some significant size of q (all where p is null).

Good idea. That will prohibit C programmers from directly programming
the hardware. This is only needed for rudimentary OSes like DOS, and
isn't needed and shouldn't be allowed for advanced OSes like Linux or
Windows, which have paging. I.e., for safety, the modern C programmer
should be restricted to just the C application space - no hardware. Of
course, if the programmer was doing OS development, they'd need a way
around that restriction. E.g., DJGPP has special functions to allow
access to memory below 1MB.

> The trouble is that that takes away 4M of the address space and, more
> importantly, means that the addresses a programmer will see in a
> debugging session or a dump would have more significant digits than
> they need to have and, therefore, be harder to read than necessary.

To me, this is irrelevant. I use printf() to debug.

> If, by contrast, null is set to a little higher than the accessible
> memory program data areas could be at lower addresses making
> debugging sessions and dumps easier to read.
>
> The hardware would still trap on both of
>
> *p
> p[q]
>
> In fact, q could potentially be a lot higher than in the 'normal'
> model because not just 4M but all the addresses from p[0] to the
> highest memory address would be inaccessible and would trap (given a
> suitable CPU).
>
> To make this work I would have to have null (the null address)
> determined not at compile time but at program load time.

Why would it need to be any value above 1MB? ... It's usually only RAM
above 1MB, especially for older computers. I.e., in general, the older
BIOS, I/O ports, VESA BIOS, IVT, BDA, EBDA, RTC, CMOS, etc. are all
located low, below 1MB. It's only modern memory-mapped devices, e.g.,
video frame buffers, which are located high above 1MB and indicated via
the E820h memory map. Don't ask me about UEFI.

> That ought to cope well with various OS memory models though it does
> have a downside. If a structure containing a pointer is mapped over
> zeroed memory the pointer will not be null but will be considered to
> be valid. (It will point at location zero.)

True. If that's a concern, you'd need to write NULL's value.

luserdroog

Jul 22, 2021, 11:42:09 PM
I think the point others are trying to make is that your compiler should
make this distinction at a higher level, using higher level information
that the language describes.

The compiled code should have no need to make this determination
at runtime using only the machine representation of a memory address.

James Harris

Jul 23, 2021, 6:50:41 AM
By "invalid" I mean "guaranteed not to be the address of any object" and
that it would be an error to attempt to dereference an invalid pointer.

>
>> Imagine a 64k machine. One might want
>> to point to any of those 64k memory cells. In which case it's maybe a
>> bad idea to reserve one address as invalid.
>>
>> But the rest of the topic is assuming that one address would be
>> reserved as the null address.
>>
>
> The fact that one address is reserved as the null address doesn't
> prohibit writing to that address or dereferencing that address. The
> address only ensures pointers don't compare equal, in order to detect
> their initialization or lack thereof.

That's an interesting idea. How would it work? Say one had a routine
which checked whether a pointer was valid before dereferencing it, along
the lines of

if (p != NULL) {
    /* work with the object at p */
}

Most of the time that routine would work properly but if it were to be
passed the address of an object which just happened to sit at the null
address then the routine would fail - a nasty bug which might not show
up in testing.

>
>>> First, let's for sake of argument say the language's NULL pointer
>>> (or equivalent) is actually of value zero, as it doesn't have to
>>> be, at least for the C language, but usually is implemented as a
>>> zero value. This eliminates the "thought" complexity over NULL in C
>>> etc being a non-zero value.
>>>
>>> I'd strongly argue that allowing b) is a language requirement.
>>
>> Do you mean as in C's
>>
>> if (p == 0)
>>
>> ?
>>
>> Why not, instead, require the comparison to be against NULL, instead
>> of zero, as in
>>
>> if (p == NULL)
>>
>> or, indeed, just allow
>>
>> if (p)
>>
>> ?
>
> Since I defined NULL as zero for argument's sake, these are equivalent.

Yes, though the "== NULL" variant would always work, whether NULL was
zero or not. Further, in a language other than C the simple

if p

could, if p was of pointer type, be defined as

if bool(p)

where bool() of a pointer would return false if the pointer was to the
null address.

> There is no difference between using NULL and using zero. If we had
> chosen to use or allow a non-zero value for NULL, then we'd need to do
> the latter, i.e., use (p==NULL), to ensure the correct non-zero value
> is used for NULL in the comparison.
>

Yes.


>>> I'd argue that a) is up to the language implementation.
>>
>> I presume you mean
>>
>> p = 0;
>>
>> but is zero really needed when that could be, instead,
>>
>> p = NULL;
>>
>> ?
>
> Since I defined NULL as zero for argument's sake, these are equivalent.

That's the easy case!


> There is no difference between using NULL and using zero. If we had
> chosen to use or allow a non-zero value for NULL, then, yes, we'd
> really need to have a distinct zero value (since NULL would be
> non-zero), as there would be no other way to access address zero.

Yes.

>
>>> [snip]
>>
>> No, I was really thinking of paging reserving page 0 as inaccessible
>> - so that any dereference of a null pointer (null being 0, in this
>> case) whether read or write would generate a fault.
>>
>> But it's only really possible to prohibit access to address zero if
>> one is using paging. And as you know, that's not the only way to
>> design an OS. That's why I was thinking to make null an address
>> /above/ the user-accessible memory space. That would work whether one
>> was using paging or not. (The actual address for null would be set
>> when a program was loaded rather than being a constant known at
>> compile time. That's a bit different from normal but I think it could
>> be done.)
>
> Yes. Didn't I link you to one of the C specification's authors on a
> example non-zero implementation of NULL?
> https://groups.google.com/g/comp.std.c/c/ez822gwxxYA/m/Jt94XH7AVacJ

I've been reading that but I'm not sure I understand it or see how it helps.

>
>>> The writing to address zero is a dereference, not a pointer
>>> comparison, and will be allowed if the hardware is incapable of
>>> blocking the write to the address location, i.e., zero, e.g., via
>>> an invalid or unmapped page.
>>
>> Interesting. So you are suggesting that if p is zero then p will
>> compare as NULL (and, hence, invalid) but that
>>
>
> Let's at least initialize p. Should we declare it with some type?
>
> p = NULL;
>
>> *p = 99;
>>
>> would still work?
>>

That would be required to fail.

>
> In general, that's "Yes", but it's conditional. On Linux or Windows,
> I'd expect a page fault, since they use paging and probably mark the
> zero'th page as invalid. For DOS, I'd **usually** - but not always -
> expect a write to address location zero with value 99.

True, but for DOS the compiler could insert a check so that it, too,
would lead to an exception. (I would like attempts to dereference a null
pointer to throw an exception on all platforms so that program behaviour
is consistent.)

>
> If NULL is defined as zero, and writing to address zero isn't blocked
> due to paging etc, then the answer is: Yes. That is how most C
> compilers work/worked on environments a) without paging or prior to
> paging, and b) where NULL is defined as zero (which is most of them).
> For example, it works this way for many DOS C compilers, as most of
> them are set up for 16-bit DOS, or for most 32-bit DOS DPMI hosts which
> don't use paging. However, some 32-bit DOS DPMI hosts do use paging,
> where such a dereference would page fault, just like for Windows or
> Linux.
>
> If you remember, assignment of an integer to a pointer is explicitly
> undefined behavior (UB) for C, but most C compilers define this
> behavior to be valid in order to access memory locations outside of C's
> memory allocations.

Isn't it /implementation-defined/ behaviour rather than UB?

> This is especially useful for memory-mapped
> hardware or data structures, e.g., BIOS BDA, EBDA, IVT, etc. Typically,
> you'll need to declare such access to be "volatile" as well, as the
> compiler doesn't recognize the region as being allocated or in-use, and
> may optimize away access.

Agreed.


--
James Harris

anti...@math.uni.wroc.pl

Jul 23, 2021, 9:44:13 AM
You mix many different things. As others noted, the null pointer in a
programming language may be address 0, but there are also different
possible representations. It is for the programming language to decide
what happens when a null pointer is dereferenced. AFAIK in VAX C page 0
was filled with 0 and the null pointer was represented by 0, so a read
access via a null pointer gave 0. Effectively, the null pointer was a
pointer to a canonical empty string. I guess that trying to write gave a
fault. On a machine with true ROM, attempts to write to ROM are normally
ignored. If there is a 0 byte in ROM you could represent the null
pointer as the address of a zero byte in ROM. In such a case reads via
the null pointer would give 0 and writes would be ignored. In Lisp,
instead of a null pointer there is NIL, and dereferencing (reading via)
NIL has a defined effect (IIRC writes are illegal).

So it is really up to you what you decide. Most modern languages make
dereferencing null pointers illegal, but there are notable examples that
do otherwise.

Another thing is safety, namely the question of whether your
implementation will detect illegal program behaviour. There is the "C
attitude", which basically says "do what the machine does". Actually,
this attitude goes back at least to Fortran and to some degree is
present in Pascal. According to this attitude it is the programmer's
responsibility to avoid dereferencing a null pointer (original Fortran
had no pointers, but a similar attitude applies in other places). If the
hardware has appropriate support the language may arrange to trap
dereferences via a null pointer, but otherwise the programmer has to
handle this. Some languages, notably Ada, have a different attitude:
they promise to catch all illegal actions. In the case of null pointers
this may require extra checks before each memory access via a pointer.
A "safe" compiler is supposed to insert such checks. Expensive, but
optimizing compilers can see that most checks are unnecessary and insert
checks only in case of doubt. And when the hardware is capable of
checking, the compiler can depend on hardware checks.
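
Roughly, for an access the compiler cannot prove safe, the generated
code behaves as if the source had been written like this (C used only as
illustration; the names are made up):

#include <stdlib.h>

int deref_checked(const int *p)
{
    if (p == NULL)
        abort();       /* stand-in for raising the language's exception */
    return *p;         /* the access the programmer actually wrote */
}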

Personally, I am against making null pointer dereference legal; IMO the
gains from this are negligible and the loss (mainly the inability to
catch errors) is large. I am for having as much checking as possible;
if you feel that having checks always on is too expensive, at least
make them available for debugging.

--
Waldek Hebisch

Rod Pemberton

Jul 24, 2021, 7:03:06 AM
On Fri, 23 Jul 2021 11:50:38 +0100
If you meant to say, "... any C object ..." or
" ... any object in my language ...", then

Yes.

Obviously, if you're linking to code from another language, or the
address being accessed is that of a memory-mapped device or data, then
you can't make that claim of the address being "invalid", because it's
outside the scope of what the C compiler, or your language's compiler,
controls.

> and that it would be an error to attempt to dereference an invalid
> pointer.

I guess this depends on how you happen to define "error" for this
statement.


If this is just a coding error to be avoided by the programmer, much
like a syntax error but without a warning, and nothing else, that's
acceptable.

If it's compilation error, it could be problematic, as the pointer's
value may not be defined until run time, and could be changed to NULL
at some random point during execution. I.e., the compiler wouldn't
always be able to detect during compilation that the pointer would be
dereferenced after being set to NULL.

If it's a run time error, this would really require hardware support,
such as paging to be effective. Software checks on the pointer's value
could add excessive overhead to identify this case.

> >> Imagine a 64k machine. One might want
> >> to point to any of those 64k memory cells. In which case it's
> >> maybe a bad idea to reserve one address as invalid.
> >>
> >> But the rest of the topic is assuming that one address would be
> >> reserved as the null address.
> >>
> >
> > The fact that one address is reserved as the null address doesn't
> > prohibit writing to that address or dereferencing that address. The
> > address only ensures pointers don't compare equal, in order to
> > detect their initialization or lack thereof.
>
> That's an interesting idea. How would it work?

AISI, it would work just like any other pointer dereference.

I.e., the value zero isn't special, nor is NULL as zero, nor is NULL as
non-zero. For C, NULL is only special because valid C objects can't be
located at the same address. I.e., the address of C objects must
compare unequal to NULL.

> Say one had a routine
> which checked whether a pointer was valid before dereferencing it,
> along the lines of
>
> if (p != NULL) {
> /* work with the object at p */
> }

Sigh. Why would you (or anyone) ever do this? ...

As the programmer, you should be coding your program so that pointers
are not left set to NULL, especially if the pointer is to be
dereferenced. I.e., initialize the pointer prior to use.

Also, you should be tracking when and where your pointers get set to
NULL to prevent this, e.g., set to NULL by library functions.

While this is clearly defensive programming (CYA) against a potential
error, you're not doing your job properly if you need to do this as a
matter of course, IMO.

> Most of the time that routine would work properly but if were to be
> passed the address of an object which just happened to sit at the
> null address then the routine would fail - a nasty bug which might
> not show up in testing.

I think your assumption that simply accessing junk data at the NULL
address would cause the routine to fail is incorrect.

If the routine fails for junk data, then it fails for junk data no
matter where it's located or retrieved from. The NULL address isn't
special in this regard. The data there (at NULL) may be invalid or
junk for the routine, but there are millions or billions of other
memory locations where the data could be invalid or junk for the
routine too, thereby causing it to fail.

How do you intend to filter out millions or billions of other "invalid"
addresses from being passed to this routine, knowing that they would
cause the routine to fail? If you're doing this for x86, shouldn't
you, at a minimum, filter out every pointer address below 1MB?

> True, but for DOS the compiler could insert a check so that it, too,
> would lead to an exception.

Why? Why would you generate an exception for a NULL dereference?

E.g., on x86, the RM IVT starts at zero (by default). It's a
memory-mapped data structure which would have a zero address in C,
which would likely match the NULL pointer address. Address zero
corresponds to the divide-by-zero interrupt vector, which may need to
be changed by DOS programs.
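
For example, under a flat, unpaged address space (or identity-mapped
memory), reading the divide-error vector literally means dereferencing
address zero - a sketch, not portable C:

#include <stdint.h>

/* Each real-mode IVT entry is a 4-byte offset:segment pair and the table
   starts at physical address 0; vector 0 is the divide-error handler. */
uint32_t read_divide_error_vector(void)
{
    volatile uint32_t *ivt = (volatile uint32_t *)0;
    return ivt[0];     /* this is, deliberately, a dereference of address 0 */
}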

Let's get back to the question of why would you generate an exception
for that. Does the ability to detect a NULL pointer dereference
outweigh the ability to manipulate the interrupt vectors? What about a
memory-mapped device or data? If you answered "Yes" in favor of the
dereference, I'd have to ask you, "Since when?"

If you can't use the language to program the hardware, then there is no
point in using the language at all. If you don't understand this,
you're clearly not familiar with Pascal. Pascal effectively died because
it couldn't be used to directly program the hardware or to access memory
outside the scope of the language, e.g., memory-mapped devices, because
it lacked pointers.

> (I would like attempts to dereference a null pointer to throw an
> exception on all platforms [so] that program behaviour is consistent.)

I simply don't agree with this. That is only valid for platforms where
the NULL address doesn't correspond to valid memory-mapped data or a
valid memory-mapped device, which admittedly is most of them nowadays,
but clearly not all of them. I.e., you're going to break or restrict
something by doing this for every platform. Obviously, this would
affect DOS for x86, and old 8-bits like the Apple II or C64, whose
6502/6510 microprocessors use the zero page as a register file.

> > If NULL is defined as zero, and writing to address zero isn't
> > blocked due to paging etc, then the answer is: Yes. That is how
> > most C compilers work/worked on environments a) without paging or
> > prior to paging, and b) where NULL is defined as zero (which is
> > most of them). For example, it works this way for many DOS C
> > compilers, as most of them are set up for 16-bit DOS, or for most
> > 32-bit DOS DPMI hosts which don't use paging. However, some 32-bit
> > DOS DPMI hosts do use paging, where such a dereference would page
> > fault, just like for Windows or Linux.
> >
> > If you remember, assignment of an integer to a pointer is explicitly
> > undefined behavior (UB) for C, but most C compilers define this
> > behavior to be valid in order to access memory locations outside of
> > C's memory allocations.
>
> Isn't it /implementation-defined/ behaviour rather than UB?

Sigh ...

James Harris

Jul 24, 2021, 8:40:57 AM
On 24/07/2021 13:04, Rod Pemberton wrote:
> On Fri, 23 Jul 2021 11:50:38 +0100
> James Harris <james.h...@gmail.com> wrote:

...

Rod, your replies on this topic are puzzling me a bit. Perhaps I
misunderstand what you are saying but ISTM that you have the wrong idea
of how a null pointer is frequently used. I don't think that's likely to
be the case so to try to clear up any confusion I'll reply to this part
of your post specifically and will come back to the rest of your post
later.


>> Say one had a routine
>> which checked whether a pointer was valid before dereferencing it,
>> along the lines of
>>
>> if (p != NULL) {
>> /* work with the object at p */
>> }
>
> Sigh. Why would you (or anyone) ever do this? ...

It's standard programming which is used all the time. For example, if
you wanted to process a tree - let's say in preorder - you could write

void preorder(node *n)
{
    process(n->data);
    if (n->left != NULL) preorder(n->left);
    if (n->right != NULL) preorder(n->right);
}

or, if you prefer,

void preorder(node *n)
{
    if (n != NULL)
    {
        process(n->data);
        preorder(n->left);
        preorder(n->right);
    }
}

In either case the point is that the code tests for a pointer being null
because null means that there is no node.

As with the theme of this topic, there is a particular address which is
guaranteed not to refer to an object.

>
> As the programmer, you should be coding your program to not have
> pointers set to NULL especially if the pointer is to be dereferenced.
> I.e., initialize the pointer prior to use.

Null doesn't mean uninitialised. A pointer can be initialised to null,
as it would be for the tree-walking code, above.

Similarly, someone might walk a linked list by

while (n != NULL)
{
    process(n->data);
    n = n->next;
}

Again, the pointer being null means that there is no object being
referred to. You can imagine that n->next will have explicitly been set
to NULL in the last node in the list.


Does that change your view of the topic and where a pointer may be null?


--
James Harris

Bart

Jul 24, 2021, 8:57:43 AM
On 24/07/2021 13:04, Rod Pemberton wrote:
> On Fri, 23 Jul 2021 11:50:38 +0100

>> Say one had a routine
>> which checked whether a pointer was valid before dereferencing it,
>> along the lines of
>>
>> if (p != NULL) {
>> /* work with the object at p */
>> }
>
> Sigh. Why would you (or anyone) ever do this? ...
>
> As the programmer, you should be coding your program to not have
> pointers set to NULL especially if the pointer is to be dereferenced.
> I.e., initialize the pointer prior to use.

You don't seem to understand what NULL is for.

How do /you/ use NULL in C, or do you never actually use it?

Actually, how would you code a linked list without using some sentinel
to mark the end of the list?

The use of NULL like this is EVERYWHERE and in every API.

For example the return value of C's fopen() function is NULL when the
operation failed.

David Brown

Jul 24, 2021, 9:29:35 AM
It is useful for a language to have a concept of pointers or references
that are guaranteed valid - then you don't need to check them like this.
(C does not have this concept, but C++ does - you can't create a
reference "pointer" without it being a reference to /something/.)

It is also useful for a language to have a concept of "optional" values.
That is, a way of saying "x is either NULL or a valid value of type T".
Sometimes this is so useful that it applies to all types (like SQL).
Sometimes you force the programmer to do this manually, like in C using
"struct maybe_int { bool valid; int x; };". Sometimes you make it a
convenient part of the standard library, like C++ "std::optional<int>".
Sometimes you have it through sum or algebraic types fully supported in
the language, like Haskell's "data MaybeInt = Invalid | Valid Int".
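
The manual C version might be used along these lines (just a sketch):

#include <stdbool.h>

struct maybe_int { bool valid; int x; };

/* Report "no value" without sacrificing any integer as a sentinel. */
struct maybe_int find_first_negative(const int *a, int n)
{
    for (int i = 0; i < n; i++)
        if (a[i] < 0)
            return (struct maybe_int){ .valid = true, .x = a[i] };
    return (struct maybe_int){ .valid = false, .x = 0 };
}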

There is no simple and efficient way to do this for simple types - for
an integer, you either have to sacrifice a valid integer value, or you
have to add an extra boolean flag to go with it. Sacrificing a value
makes your arithmetic coding a lot more complicated and inefficient.
But for pointers, sacrificing a value to use as an "invalid" indicator
is cheap and easy, and the gains are certainly worth it. The most
efficient "invalid" value to use is 0, since it is quick and easy to
test. An alternative worth considering is to use the highest bit to
indicate invalid - cutting half your address space is often not a
problem, you can use a pointer to address 0, and you can represent many
different invalid values.
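
A sketch of that alternative, assuming user pointers never legitimately
have the top bit set:

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BIT ((uintptr_t)1 << (sizeof(uintptr_t) * CHAR_BIT - 1))

static bool ptr_valid(const void *p)
{
    return ((uintptr_t)p & INVALID_BIT) == 0;   /* address 0 counts as valid */
}

static void *invalid_ptr(uintptr_t reason)
{
    return (void *)(INVALID_BIT | reason);      /* many distinct invalid values */
}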


So IMHO it makes sense for a language to support /both/ concepts of an
always valid pointer, and of optionally valid pointers.

I'd also suggest pointers being viewed as a lot more general than just
holding an address, allowing for references, weak references, shared
pointers, and other ways of referring to objects.

I'd also avoid "if (p != NULL)", and prefer "if (p)" or "if (valid(p))",
making it clearer that you are checking for the validity of the pointer
rather than for it happening to match a particular value. (In C, an
implementation can have more than one null pointer, and "if (p != NULL)"
actually checks for any of them - something that is not apparent from
the syntax.)

Andy Walker

Jul 24, 2021, 11:06:03 AM
On 24/07/2021 13:04, Rod Pemberton wrote:
[...]
[Dereferencing "NULL":]
> If it's compilation error, it could be problematic, as the pointer's
> value may not be defined until run time, and could be changed to NULL
> at some random point during execution. I.e., the compiler wouldn't
> always be able to detect during compilation that the pointer would be
> dereferenced after being set to NULL.

A sufficiently-explicit dereference of "NULL" for it to be a
compile-time error is surely in hen's-teeth territory; as rare as
explicitly writing "i := 1/0". It's clearly an error, but how far
that propagates back up the compilation is to some extent a matter
of taste.

> If it's a run time error, this would really require hardware support,
> such as paging to be effective. Software checks on the pointer's value
> could add excessive overhead to identify this case.

It /is/ a run-time error, in any half-way sensible language.
/Not/ checking is the sort of thing that generates bugs that surface
years later. The time spent debugging and re-issuing software [often
after the bug has been exploited to install malware on millions of
computers] typically far exceeds the overhead of checking in the first
place. It's not just dereferencing "NULL", of course; it's also
buffer over-runs, uninitialised variables, storage used after "free",
and many others. In sensible languages, most of the checks can be
optimised away [see below] even by quite simple compilers.

[James:]
>> Say one had a routine
>> which checked whether a pointer was valid before dereferencing it,
>> along the lines of
>> if (p != NULL) {
>> /* work with the object at p */
>> }
> Sigh. Why would you (or anyone) ever do this? ...

Others have discussed why lots of us would do this. Note
that after the initial check, for as far into "/* work ... */" as
the first potential assignment to "p" [very commonly the whole
statement], no further checks need to be made; IOW, in typical
uses, /no/ run-time checks at all need be added by the compiler.

In similar vein, in Algol's "destination := source", it is
required [Revised Report 5.2.1.2b] that "destination" be not "NIL"
and that "source" be not newer in scope than "destination". Taken
literally, that implies that extravagant checking be done on every
assignment. In Real Life, both checks are commonly trivial to
optimise away at compilation time, and where they aren't they enable
some nasty bugs to be detected as they happen, not much later when
the program eventually fails.

[...]
> If you can't use the language to program the hardware, then there is no
> point in using the language at all. If you don't understand this,
> you're clearly not familiar Pascal. Pascal effectively died because it
> was unable to be used to directly program the hardware or access memory
> outside of the scope of the language, e.g., memory-mapped devices,
> because it lacked pointers.

Almost by definition, no high-level language can be used
/portably/ to program the hardware. You need either or both of an
escape into a lower-level language or a [portable?] library of ways
to access the hardware. But many of us write programs that never
need to program the hardware. For the past ~40 years, I've been
quite content to let others worry about the hardware; I just write
programs in C, Algol, Java, yes even Pascal, and a dozen+ scripting
languages [HTML, Sh, Sed, Awk, CAL, ...]. The idea that there's
no point to any of this is manifestly absurd. Rather the opposite;
for as long as "we" [FSVO] had to worry about the hardware, it was
a sign that computers had not yet matured.

As for Pascal, I don't believe that's the reason it died.
See Brian Kernighan's polemic for better reasons.

--
Andy Walker, Nottingham.
Andy's music pages: www.cuboid.me.uk/andy/Music
Composer of the day: www.cuboid.me.uk/andy/Music/Composers/Rubinstein

James Harris

Jul 24, 2021, 1:15:14 PM
On 24/07/2021 14:29, David Brown wrote:

...

> So IMHO it makes sense for a language to support /both/ concepts of an
> always valid pointer, and of optionally valid pointers.

OK.

...

>
> I'd also avoid "if (p != NULL)", and prefer "if (p)" or "if (valid(p))",

Agreed. But if you reject the test

p != NULL

then presumably you also reject the assignment

p = NULL;

If so, what's your preferred way to make a pointer invalid?


> making it clearer that you are checking for the validity of the pointer
> rather than for it happening to match a particular value. (In C, an
> implementation can have more than one null pointer, and "if (p != NULL)"
> actually checks for any of them - something that is not apparent from
> the syntax.)
>

That's surprising!


--
James Harris

James Harris

Jul 24, 2021, 1:29:21 PM
On 21/07/2021 18:44, Dmitry A. Kazakov wrote:
> On 2021-07-21 19:24, James Harris wrote:
>
>> I said /compiled/ code, not program source. In the /source/ the
>> programmer would still be able to write tests akin to
>>
>>    if p != null
>>
>> I may even allow
>>
>>    if p
>>
>> as meaning the same as the above though I guess some folk (e.g.
>> Dmitry?) will not like the idea of treating a pointer as a boolean.
>>
>> As I say, compiled code often uses address zero as a null pointer,
>
> No, it uses the representation of null, whatever it be [*].

You are trying to correct a correct statement.

>
> Furthermore, in a decent language with memory pools support each pool
> could have its own null.

This is a rather mechanical view, coming from you. I would have thought
you would prefer each /type/ to have its own version of null, especially
given your focus on nominal rather than structural type systems!

>
>>> That thing is called the type.
>>
>> OK ... then what, to you, distinguishes them? Alignment? Range?
>> History? Something else?
>
> https://en.wikipedia.org/wiki/Nominal_type_system

Aka different name for the same thing. ;-)

>
> -------------------------
> In an advanced language pointer comparisons could be non-trivial, e.g.
> when two pointers indicate different classes of the same object under
> multiple inheritance. In that case memory representations of p and q
> could be different, yet semantically p = q because both ultimately point
> to the same object [provided, the language lets p and q be comparable].
> An implementation would convert both p and q to the pointers of specific
> type and then compare these.
>

OK.


--
James Harris

Dmitry A. Kazakov

Jul 24, 2021, 1:43:12 PM
On 2021-07-24 19:29, James Harris wrote:
> On 21/07/2021 18:44, Dmitry A. Kazakov wrote:

>> Furthermore, in a decent language with memory pools support each pool
>> could have its own null.
>
> This is a rather mechanical view, coming from you. I would have thought
> you would prefer each /type/ to have its own version of null, especially
> given your focus on nominal rather than structural type systems!

Null is a value. Each type has values. Values of one type are not values
of another. So, yes, each pointer type has null values of its own.

But you asked about representations of such values. It is possible that
representations differ too.

>> https://en.wikipedia.org/wiki/Nominal_type_system
>
> Aka different name for the same thing. ;-)

How do you know the thing is the same? Is there a serial number on the
back side?

James Harris

Jul 24, 2021, 1:51:26 PM
On 21/07/2021 23:48, Bart wrote:
> On 20/07/2021 18:33, James Harris wrote:
>> For my OS project I have been looking at program loading and that has
>> led me to query what would be required in a language to support
>> address zero being accessible and a pointer to it being considered to
>> be valid.
>>
>> If I use paging I can reserve the lowest page so that address zero is
>> inaccessible. A dereference of a zeroed pointer would be caught by the
>> CPU triggering a fault.
>>
>> However, if on x86 I don't use paging then a reference to address zero
>> would not trigger a fault so it would not be caught and diagnosed.
>
> How is that different from any other access to invalid memory? Or a
> access of address 1 (or 8 if aligned)?

I cannot quite work out what you are asking. Could you say more?

>
>> And it's not just that case. Other CPUs or microcontrollers may,
>> presumably, allow access to address zero. Therefore there are cases
>> where a program may have a pointer to address zero and that pointer
>> could be legitimate.
>>
>> Hence the question: how should a language support access to address
>> zero? Any ideas?
>
> If the hardware allows a meaningful dereference to address 0, and you
> need to have your HLL access that same location via a pointer, then you
> need to make it possible.

Noted.

>
> However if zero is also used for a null pointer value, so that for
> example P=null means that P has not been asigned to anything, then that
> might interfere with that,
>
> Then you might look at using an alternate representation for a null
> pointer value.
>
> Personally, I'd just make the first few bytes of memory special. Make
> sure address 0 never occurs as a heap allocation, and rarely comes up as
> the address of an object in the HLL.

There are two issues with that.

First, dereferences of address zero can be /automatically/ checked for
validity by the hardware if one is using paging, but if paging is not
enabled then address zero cannot be checked automatically. Crucially for
this thread, that would mean that the same executable would run
differently on the two systems - which I want to avoid.

By contrast, and, again, with a given executable, an address /above/ the
program's accessible address space would trigger an exception whether
paging was enabled or not. IOW if I use a high address rather than a low
one then my compiled code can omit checks for null and would work the
same way in both environments.

Second, there are CPUs which, rather unwelcomely, use signed addresses.
For them, a 16-bit address, say, will use the range -32768 to 32767 and
address zero could easily be part of the heap.


>
> My latest language has  'nil' value for pointers (you can't use 0),
> whose value is not specified. But it is generally understood it is all
> zeros.

That sounds good. A minor point but why use the keyword nil rather than
null?

>
> That means that data structures that exist in the zero-data segment
> (BSS?) will be guaranteed to have any embedded pointers set to nil.

The issue with that is that it could run into problems if nil ever
happened to be something other than zero, couldn't it?

>
> Just stick something at address 0 that is not going to be dereferenced
> via a HLL pointer. But if you really need to access that location, then
> just do it.


--
James Harris

Bart

Jul 24, 2021, 5:31:38 PM
On 24/07/2021 18:51, James Harris wrote:
> On 21/07/2021 23:48, Bart wrote:
>> On 20/07/2021 18:33, James Harris wrote:
>>> For my OS project I have been looking at program loading and that has
>>> led me to query what would be required in a language to support
>>> address zero being accessible and a pointer to it being considered to
>>> be valid.
>>>
>>> If I use paging I can reserve the lowest page so that address zero is
>>> inaccessible. A dereference of a zeroed pointer would be caught by
>>> the CPU triggering a fault.
>>>
>>> However, if on x86 I don't use paging then a reference to address
>>> zero would not trigger a fault so it would not be caught and diagnosed.
>>
>> How is that different from any other access to invalid memory? Or a
>> access of address 1 (or 8 if aligned)?
>
> I cannot quite work out what you are asking. Could you say more?


Suppose you are accessing a u16 value at address 0, occupying addresses 0
and 1. That access is illegal, but what about accessing the upper byte
separately, at address 1?

What I'm really saying is that there will be lots of memory addresses that
are not valid; why make address 0 special compared with any of those
(including address 1) when trying to detect an illegal access?

The difference is that an arbitrary address is usually due to some bug,
while address 0 can be deliberately stored in a pointer.

The purpose may be for the software to check that a pointer is in use; I
don't think it's that critical to have a runtime or hardware check for
accessing address zero. But it would be useful while debugging.

>> Personally, I'd just make the first few bytes of memory special. Make
>> sure address 0 never occurs as a heap allocation, and rarely comes up
>> as the address of an object in the HLL.
>
> There are two issues with that.
>
> First, dereferences of address zero can be /automatically/ checked for
> validity by the hardware if one is using paging but if paging is not
> enabled then address zero cannot be checked automatically and, crucially
> for this thread, that would mean that the same executable would run
> differently on the two systems - which I want to avoid.
>
> By contrast, and, again, with a given executable, an address /above/ the
> program's accessible address space would trigger an exception whether
> paging was enabled or not.

An address can be within the program's data space, but can still be
wrong: pointing at the wrong object, or inside an object, or spanning
two objects, or at some unused gap.

This is the same point I made above really. While a pointer value of
null can be easy to check even in software, an invalid one is harder.

But this is also depends on the language: how easy is it for a user
program to allow some random number to be stored in a pointer? A
lower-level one like C makes it very easy. (Or like mine, but it makes
it a little bit harder!)

> IOW if I use a high address rather than a low
> one then my compiled code can omit checks for null and would work the
> same way in both environments.
>
> Second, there are CPUs which, rather unwelcomely, use signed addresses.
> For them, a 16-bit address, say, will use the range -32768 to 32767 and
> address zero could easily be part of the heap.

Which ones are those? (So I can make a note to never use them!)

A language could take care of that aspect (so programs see an address
space of 0 to 65535) but that comes at a cost.

However, remember that C's pandering to weird hardware that no one is
ever going to encounter in real life is probably the cause of half of
its UBs.

>
>>
>> My latest language has  'nil' value for pointers (you can't use 0),
>> whose value is not specified. But it is generally understood it is all
>> zeros.
>
> That sounds good. A minor point but why use the keyword nil rather than
> null?

I think that was copied from Pascal which used 'nil'. I didn't encounter
C until over a decade later.

>>
>> That means that data structures that exist in the zero-data segment
>> (BSS?) will be guaranteed to have any embedded pointers set to nil.
>
> The issue with that is that it could run into problems if nil ever
> happened to be something other than zero, couldn't it?

That's why you should strive to have null as all zeros if possible, in the
same way that you try to have all zeros also for integer 0, or float 0.0.

And, even now, when creating sets of enums I often arrange for the first
to have a value of 0, meaning no-value or not-set, so that when it is
used as a tag (in a manually tagged union), zeroed data won't have
erroneous values. (As might happen if enum 0 meant another field was
expected to have a certain set-up.)
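
A minimal C sketch of that habit (the names are mine, purely for
illustration): the first enumerator is 0 and means "not set", so zeroed
storage never claims to be a kind whose other fields should have been
filled in.

    #include <stdlib.h>

    enum node_kind { NODE_NONE = 0, NODE_INT, NODE_NAME };

    struct node {
        enum node_kind kind;        /* tag for the manually tagged union */
        union {
            long  ival;             /* valid when kind == NODE_INT  */
            char *name;             /* valid when kind == NODE_NAME */
        } u;
    };

    int main(void) {
        /* All-zero storage gives kind == NODE_NONE, i.e. "not set",
           rather than a bogus NODE_INT or NODE_NAME. */
        struct node *n = calloc(1, sizeof *n);
        if (n && n->kind == NODE_NONE) {
            /* safe: nothing pretends to be initialised */
        }
        free(n);
        return 0;
    }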


Bart

unread,
Jul 24, 2021, 7:40:16 PMJul 24
to
On 24/07/2021 14:29, David Brown wrote:
> On 24/07/2021 14:57, Bart wrote:

>> Actually, how would you code a linked list without using some sentinel
>> to mark the end of the list?
>>
>> The use of NULL like this is EVERYWHERE and in every API.
>>
>> For example the return value of C's fopen() function is NULL when the
>> operation failed.
>>
>
> It is useful for a language to have a concept of pointers or references
> that are guaranteed valid - then you don't need to check them like this.

How does that work for my linked list example?


> It is also useful for a language to have a concept of "optional" values.
> That is, a way of saying "x is either NULL or a valid value of type T".
> Sometimes this is so useful that it applies to all types (like SQL).
> Sometimes you force the programmer to do this manually, like in C using
> "struct maybe_int { bool valid; int x; };". Sometimes you make it a
> convenient part of the standard library, like C++ "std::optional<int>".
> Sometimes you have it through summation or algebraic types fully
> supported in the language, like Haskell "data Maybeint = Invalid | Int".

So how does Haskell distinguish, internally, between an Int value, and
an Invalid type?

> There is no simple and efficient way to do this for simple types - for
> an integer, you either have to sacrifice a valid integer value, or you
> have to add an extra boolean flag to go with it.

Exactly.

> Sacrificing a value
> makes your arithmetic coding a lot more complicated and inefficient.
> But for pointers, sacrificing a value to use as an "invalid" indicator
> is cheap and easy, and the gains are certainly worth it. The most
> efficient "invalid" value to use is 0, since it is quick and easy to
> test. An alternative worth considering is to use the highest bit to
> indicate invalid - cutting half your address space is often not a
> problem, you can use a pointer to address 0, and you can represent many
> different invalid values.

The interesting invalid values are the ones in your 'valid' range. Just
randomly pointing somewhere in your memory doesn't mean the pointer is
any good!


>
> So IMHO it makes sense for a language to support /both/ concepts of an
> always valid pointer, and of optionally valid pointers.
>
> I'd also suggest pointers being viewed as a lot more general than just
> holding an address, allowing for references, weak references, shared
> pointers, and other ways of referring to objects.

I'd rather keep them as simple as possible, and preferably use them as
little as possible too.

(I implement some of those things, but transparently so you aren't even
aware of pointers being used.)


> I'd also avoid "if (p != NULL)", and prefer "if (p)" or "if (valid(p))",
> making it clearer that you are checking for the validity of the pointer

What pointer? Because if writing 'if (X)', then X can be anything. Write
'if (X==NULL)', and you can assume (in C) or know (in mine) that X is a
pointer.

> rather than for it happening to match a particular value. (In C, an
> implementation can have more than one null pointer, and "if (p != NULL)"
> actually checks for any of them - something that is not apparent from
> the syntax.)

Never heard of that. And it also sounds like a nightmare to implement. How
many kinds of pointer are we talking about? Which version of NULL do you
get when you do p = NULL?


James Harris

unread,
Jul 25, 2021, 5:02:39 AMJul 25
to
On 24/07/2021 22:31, Bart wrote:
> On 24/07/2021 18:51, James Harris wrote:
>> On 21/07/2021 23:48, Bart wrote:
>>> On 20/07/2021 18:33, James Harris wrote:
>>>> For my OS project I have been looking at program loading and that
>>>> has led me to query what would be required in a language to support
>>>> address zero being accessible and a pointer to it being considered
>>>> to be valid.
>>>>
>>>> If I use paging I can reserve the lowest page so that address zero
>>>> is inaccessible. A dereference of a zeroed pointer would be caught
>>>> by the CPU triggering a fault.
>>>>
>>>> However, if on x86 I don't use paging then a reference to address
>>>> zero would not trigger a fault so it would not be caught and diagnosed.
>>>
>>> How is that different from any other access to invalid memory? Or a
>>> access of address 1 (or 8 if aligned)?
>>
>> I cannot quite work out what you are asking. Could you say more?
>
>
> Suppose you are accessing u16 value at address 0, occupying addresses 0
> and 1. That access is illegal, but what about the accessing the upper
> byte separately, at address 1?
>
> What I'm really saying is there will be lots of memory addresses that
> are not valid; why make address 0 special compared with any of those
> (including address 1), when trying to detect an illegal access.

OK. AISI, although there could be lots of addresses which would be
invalid, it's best to choose one so that it can be used in comparisons
such as

if p != null

For the sake of having something specific to use in discussion I'll pick
-1.

>
> The difference is that an arbitrary address is usually due to some bug,
> while address 0 can be deliberately stored in a pointer.

So can -1.

>
> The purpose may be for the software to check that a pointer is in use; I
> don't think it's that critical for a runtime or hardware check for
> acessing address zero. But it would be useful while debugging.

I agree with what you say below that it's not possible in a callee to
check that a pointer which is, say, passed in as a parameter is valid.
For example, 44, 102, 8230 all look like valid addresses but they may
not point to an object.

But we can have one address which is there to designate 'invalid'.
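
To make the -1 illustration concrete, here is a small hedged C sketch
(INVALID_PTR is a made-up name; POSIX mmap() does something similar with
its MAP_FAILED value of (void *)-1):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical "never valid" address: all bits set, i.e. -1. */
    #define INVALID_PTR ((void *)(uintptr_t)-1)

    static void use(int *p) {
        if ((void *)p != INVALID_PTR)
            printf("%d\n", *p);     /* p is assumed to refer to an object */
        else
            printf("no object\n");
    }

    int main(void) {
        int x = 42;
        use(&x);
        use(INVALID_PTR);
        return 0;
    }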

>
>>> Personally, I'd just make the first few bytes of memory special. Make
>>> sure address 0 never occurs as a heap allocation, and rarely comes up
>>> as the address of an object in the HLL.
>>
>> There are two issues with that.
>>
>> First, dereferences of address zero can be /automatically/ checked for
>> validity by the hardware if one is using paging but if paging is not
>> enabled then address zero cannot be checked automatically and,
>> crucially for this thread, that would mean that the same executable
>> would run differently on the two systems - which I want to avoid.
>>
>> By contrast, and, again, with a given executable, an address /above/
>> the program's accessible address space would trigger an exception
>> whether paging was enabled or not.
>
> An address can be within the program's data space, but can still be
> wrong: pointing at the wrong object, or inside an object, or spanning
> two objects, or at some unused gap.

Indeed.

>
> This is the same point I made above really. While a pointer value of
> null can be easy to check even in software, an invalid one is harder.
>
> But this is also depends on the language: how easy is it for a user
> program to allow some random number to be stored in a pointer? A
> lower-level one like C makes it very easy. (Or like mine, but it makes
> it a little bit harder!)

Agreed.

>
>> IOW if I use a high address rather than a low one then my compiled
>> code can omit checks for null and would work the same way in both
>> environments.
>>
>> Second, there are CPUs which, rather unwelcomely, use signed
>> addresses. For them, a 16-bit address, say, will use the range -32768
>> to 32767 and address zero could easily be part of the heap.
>
> Which ones are those? (So I can make a note to never use them!)

IIRC some of the transputers did that, and maybe other CPUs too. There
were plenty of designs that we, today, would consider to be unusual.

>
> A language could take care of that aspect (so programs see an address
> space of 0 to 65535) but that comes at a cost.

A program wouldn't need to see a particular address, would it?
Continuing the illustration, above, of setting null to -1, the
corresponding number for signed memory would be MOSTPOS.

...

>>>
>>> That means that data structures that exist in the zero-data segment
>>> (BSS?) will be guaranteed to have any embedded pointers set to nil.
>>
>> The issue with that is that it could run into problems if nil ever
>> happened to be something other than zero, couldn't it?
>
> That's why you should strive to have null as all zeros if possible. In
> the same you try to have have all zeros also for integer 0, or float 0.0.

I agree. But what's your preferred alternative if running your code on a
machine or mode which has location zero as accessible?

>
> And I still, now, when creating sets of enums, often arrange to have the
> first have a value of 0, meaning no-value or not-set, so that when it is
> used as a tag (in a manually tagged union), then zeroed data won't have
> erroneous values. (As might happen if enum 0 means another field is
> expected to have a certain set-up.)

Understood.


--
James Harris

David Brown

unread,
Jul 25, 2021, 5:53:52 AMJul 25
to
On 24/07/2021 19:15, James Harris wrote:
> On 24/07/2021 14:29, David Brown wrote:
>
> ...
>
>> So IMHO it makes sense for a language to support /both/ concepts of an
>> always valid pointer, and of optionally valid pointers.
>
> OK.
>
> ...
>
>>
>> I'd also avoid "if (p != NULL)", and prefer "if (p)" or "if (valid(p))",
>
> Agreed. But if you reject the test
>
>   p != NULL
>
> then presumably you also reject the assignment
>
>   p = NULL;

Of course - for comparable values, != is defined as "not =" (or not ==,
if you prefer).

>
> If so, what's your preferred way to make a pointer invalid?

You need a way to indicate one or more invalid values, and a way to test
them. (Note that with pointers, you are usually also going to have
values that are valid but which still don't point to anything, such as
non-existent addresses. And depending on the semantics of your
language, you might have more requirements before the pointer is really
valid. Here we are only looking at "null" pointers.)

If you restrict the language to a /single/ null pointer, then an
equality comparison to null is fair enough. But you might have other
methods. For example, a segmented architecture might represent pointers
as segment + offset, and treat segment 0 as "null". Or you might use
all "negative" pointers as null pointers.

In C, these are valid possibilities. You can also have different null
values for different pointer types - just as you can have different
sizes for different pointer types. (There are real examples of C
implementations where char* pointers are bigger than int* pointers, or
where function pointers are wildly different from data pointers.)

Allowing multiple invalid pointers can have advantages - it could help
tracing errors, for example, as you could encode the origin of the
problem. Or it could allow you to have pointers that are temporarily
invalidated, or to track data that is no longer accessible to program
code but still exists while waiting for garbage collection. (I'm just
throwing these ideas in the air - I have not considered whether they are
useful or possible.)

But they do have a disadvantage in comparisons. In C, if there is more
than one null pointer for a given type, "if (p == NULL)" has to take
that into account - it might not be just a simple comparison. "if (p)",
on the other hand, can be a lot more efficient for a variety of
different implementations of null pointers.
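
As one concrete illustration of the "encode the origin of the problem"
idea above, here is a hedged C sketch loosely modelled on the Linux
kernel's ERR_PTR/IS_ERR convention, which reserves the top 4095 values of
the address space so a small negative error code can travel inside a
pointer-sized value (the helper names follow that convention; nothing
here is specific to the languages under discussion, and it assumes the
usual flat address space where those top addresses are never mapped):

    #include <stdint.h>
    #include <errno.h>
    #include <stdio.h>

    #define MAX_ERRNO 4095

    static void *err_ptr(long err)      { return (void *)(uintptr_t)err; }
    static long  ptr_err(const void *p) { return (long)(intptr_t)p; }
    static int   is_err(const void *p)  { return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO; }

    static void *open_thing(int want_failure)
    {
        static int thing;
        return want_failure ? err_ptr(-ENOMEM) : &thing;
    }

    int main(void)
    {
        void *p = open_thing(1);
        if (is_err(p))
            printf("failed with error %ld\n", -ptr_err(p));
        else
            printf("got a valid pointer\n");
        return 0;
    }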

As for my personal preference, were I designing a language, I'd probably
have 0 as the single null pointer. I'd have some other low-level
mechanism for accessing data at address 0 for those rare situations
where it might be useful (in embedded systems, for example, that might
be a valid address in flash that you read as part of an integrity test).
In C (in real implementations), you can do that using volatile accesses
- other languages might have different methods.
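
For the embedded case just mentioned, a hedged sketch of what "using
volatile accesses" can look like in practice (the function name is made
up; the address goes through a uintptr_t variable rather than a literal
0, since a constant 0 converted to a pointer type is C's null pointer
constant and dereferencing that is formally undefined):

    #include <stdint.h>

    /* Read the 32-bit word at physical address 0 (e.g. the start of flash
       on some microcontrollers). Only meaningful on bare metal where that
       address is actually mapped; on a hosted OS this would simply fault. */
    static uint32_t read_word_at_zero(void)
    {
        volatile uintptr_t addr = 0;                 /* runtime value, not a constant 0 */
        volatile uint32_t *p = (volatile uint32_t *)addr;
        return *p;
    }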

But I would still prefer to write "if (p)", rather than "if (p == NULL)"
- it is simpler, clearer, more general, and says what you actually mean.
I don't want to test if the value of "p" happens to match the value of
some macro - I want to test if "p" holds a valid value or not.

David Brown

unread,
Jul 25, 2021, 6:14:56 AMJul 25
to
On 25/07/2021 01:40, Bart wrote:
> On 24/07/2021 14:29, David Brown wrote:
>> On 24/07/2021 14:57, Bart wrote:
>
>>> Actually, how would you code a linked list without using some sentinel
>>> to mark the end of the list?
>>>
>>> The use of NULL like this is EVERYWHERE and in every API.
>>>
>>> For example the return value of C's fopen() function is NULL when the
>>> operation failed.
>>>
>>
>> It is useful for a language to have a concept of pointers or references
>> that are guaranteed valid - then you don't need to check them like this.
>
> How does that work for my linked list example?
>

It would not work well - see the next paragraph!

>
>> It is also useful for a language to have a concept of "optional" values.
>>   That is, a way of saying "x is either NULL or a valid value of type T".
>>   Sometimes this is so useful that it applies to all types (like SQL).
>> Sometimes you force the programmer to do this manually, like in C using
>> "struct maybe_int { bool valid; int x; };".  Sometimes you make it a
>> convenient part of the standard library, like C++ "std::optional<int>".
>>   Sometimes you have it through summation or algebraic types fully
>> supported in the language, like Haskell "data Maybeint = Invalid | Int".
>
> So how does Haskell distinguish, internally, between an Int value, and
> an Invalid type?

I don't know. My Haskell is quite shaky - I used a related functional
programming language at university, but have done little more than
occasionally playing with Haskell. And I have not looked at generated
code at all.

But I would guess there would be one of three possibilities. One is
that it "disappears" in the compilation. Functional programming
languages do not compile to code that bears any resemblance to the
structure of the original source. A second is that you might have an
implementation like a C tagged union. Or it might be more complicated,
since numbers in Haskell are not (IIRC) fixed size types and thus have
more advanced structures anyway.
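
For what it's worth, the "C tagged union" possibility could look roughly
like the sketch below - only an illustration of the representation idea,
not how a real Haskell implementation lays out Maybe (those typically use
boxed heap objects carrying a constructor tag, which is morally similar
but not this literal layout):

    #include <stdio.h>

    /* One plausible flat representation of  data MaybeInt = Invalid | Valid Int */
    enum maybe_tag { INVALID, VALID };

    struct maybe_int {
        enum maybe_tag tag;
        int value;              /* only meaningful when tag == VALID */
    };

    static struct maybe_int parse_digit(char c) {
        if (c >= '0' && c <= '9')
            return (struct maybe_int){ VALID, c - '0' };
        return (struct maybe_int){ INVALID, 0 };
    }

    int main(void) {
        struct maybe_int m = parse_digit('7');
        if (m.tag == VALID)
            printf("got %d\n", m.value);
        else
            printf("invalid\n");
        return 0;
    }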

>
>> There is no simple and efficient way to do this for simple types - for
>> an integer, you either have to sacrifice a valid integer value, or you
>> have to add an extra boolean flag to go with it.
>
> Exactly.
>
>> Sacrificing a value
>> makes your arithmetic coding a lot more complicated and inefficient.
>> But for pointers, sacrificing a value to use as an "invalid" indicator
>> is cheap and easy, and the gains are certainly worth it.  The most
>> efficient "invalid" value to use is 0, since it is quick and easy to
>> test.  An alternative worth considering is to use the highest bit to
>> indicate invalid - cutting half your address space is often not a
>> problem, you can use a pointer to address 0, and you can represent many
>> different invalid values.
>
> The interesting invalid values are the ones in your 'valid' range. Just
> randomly pointing somewhere in your memory doesn't mean the pointer is
> any good!
>

Yes, indeed - but that's another issue entirely. (It's an important
one, but it would cloud the points in this thread.)

>
>>
>> So IMHO it makes sense for a language to support /both/ concepts of an
>> always valid pointer, and of optionally valid pointers.
>>
>> I'd also suggest pointers being viewed as a lot more general than just
>> holding an address, allowing for references, weak references, shared
>> pointers, and other ways of referring to objects.
>
> I'd rather keep them as simple as possible, and preferably use them as
> little as possible too.

That all depends on the kind of language you are writing, and its purposes.

>
> (I implement some of those things, but transparently so you aren't even
> aware of pointers being used.)
>
>
>> I'd also avoid "if (p != NULL)", and prefer "if (p)" or "if (valid(p))",
>> making it clearer that you are checking for the validity of the pointer
>
> What pointer? Because if writing 'if (X)', then X can be anything. Write
> 'if (X==NULL)', and you can assume (in C) or know (in mine) that X is a
> pointer.

If I write "if (X)" in code, I know what "X" is. And in this case, we
are talking about a pointer "p".

But as I said, a language designer might prefer "if (valid(P))", or
perhaps "if nonull(p)" (maybe the language has different parenthesis
requirements from C). In my view, that is better than "if (p == NULL)".
Such things are subjective, and need to be considered along with the
rest of the language.

>
>> rather than for it happening to match a particular value.  (In C, an
>> implementation can have more than one null pointer, and "if (p != NULL)"
>> actually checks for any of them - something that is not apparent from
>> the syntax.)
>
> Never heard of that. And it also sounds a nightmare to implement. How
> many kinds of pointer are we talking about? Which version of NULL do you
> get when you do p = NULL?
>

As far as the C standards are concerned, "0", or "(void*) 0", is /a/
"null pointer constant". If it is converted to a pointer type, you get
/a/ null pointer of that type. NULL is defined (in <stddef.h>) to
expand to "an implementation-defined null-pointer constant". So for "p
= NULL", you get whichever null pointer constant the implementation
wants to give you.

I can't remember ever hearing about any C implementations that have more
than one null pointer constant - I agree that, on most systems at least,
it would be difficult to implement in a C compiler and have no obvious
benefit. But perhaps there are systems that /did/ support them - such
as systems with sign plus magnitude integers and therefore both positive
and negative integer zero. Or perhaps they have been used in other C
implementations where the flexibility of additional null pointer
constants was considered worth the cost of the implementation. Maybe
someone over at c.l.c. knows of such systems.


David Brown

unread,
Jul 25, 2021, 6:29:09 AMJul 25
to
On 24/07/2021 23:31, Bart wrote:
> On 24/07/2021 18:51, James Harris wrote:

>
>> IOW if I use a high address rather than a low one then my compiled
>> code can omit checks for null and would work the same way in both
>> environments.
>>
>> Second, there are CPUs which, rather unwelcomely, use signed
>> addresses. For them, a 16-bit address, say, will use the range -32768
>> to 32767 and address zero could easily be part of the heap.
>
> Which ones are those? (So I can make a note to never use them!)
>
> A language could take care of that aspect (so programs see an address
> space of 0 to 65535) but that comes at a cost.
>
> However, remember that's C pandering to weird hardware that no one is
> ever going to encounter in real life is probably the cause of half of
> its UBs.

There are many processors (standard ones, not "weird" ones) which have
addressing modes of the form "x + signed_extension(y)". These are
mostly relevant when "y" is immediate, and so the details get hidden by
the compiler, but it is certainly conceivable for a programming language
to have "short pointers" that are, say, 16-bit in length and are used as
signed offsets to a base pointer. I'm not sure it is worth bothering
about too much in a language, but I don't know what James' aims are for
his language.


>
>>
>>>
>>> My latest language has  'nil' value for pointers (you can't use 0),
>>> whose value is not specified. But it is generally understood it is
>>> all zeros.
>>
>> That sounds good. A minor point but why use the keyword nil rather
>> than null?
>
> I think that was copied from Pascal which used 'nil'. I didn't encounter
> C until over a decade later.
>

I would have guessed you'd have picked it up from Algol 68. A quick
look at <https://rosettacode.org/wiki/Null_object> shows there are many
languages that use "nil", though "null" is a bit more popular.

Bart

unread,
Jul 25, 2021, 7:39:50 AMJul 25
to
On 25/07/2021 11:29, David Brown wrote:
> On 24/07/2021 23:31, Bart wrote:
>> On 24/07/2021 18:51, James Harris wrote:
>
>>
>>> IOW if I use a high address rather than a low one then my compiled
>>> code can omit checks for null and would work the same way in both
>>> environments.
>>>
>>> Second, there are CPUs which, rather unwelcomely, use signed
>>> addresses. For them, a 16-bit address, say, will use the range -32768
>>> to 32767 and address zero could easily be part of the heap.
>>
>> Which ones are those? (So I can make a note to never use them!)
>>
>> A language could take care of that aspect (so programs see an address
>> space of 0 to 65535) but that comes at a cost.
>>
>> However, remember that's C pandering to weird hardware that no one is
>> ever going to encounter in real life is probably the cause of half of
>> its UBs.
>
> There are a many processors (standard ones, not "weird" ones) which have
> addressing modes of the form "x + signed_extension(y)". These are
> mostly relevant when "y" is immediate, and so the details get hidden by
> the compiler, but it is certainly conceivable for a programming language
> to have "short pointers" that are, say, 16-bit in length and are used as
> signed offsets to a base pointer. I'm not sure it is worth bothering
> about too much in a language, but I don't know what James' aims are for
> his language.
>


If you have a pointer, and add in a signed offset, then you get a new
pointer value; it doesn't change its type.

With a pointer, it doesn't usually make sense to talk of its being
signed or unsigned. But in the case of that processor with addresses
from -32768 to 0 to 32767, you can consider those 16 bits as being
unsigned, so that there /is/ a contiguous range from 0x0000 to 0xFFFF.

It's just that at 0x7FFF, at the end of memory, the next byte is at
0x8000, which is at the start of memory. But it depends on how the
hardware works: if you took 16 address lines as inputs to a 64K x 8-bit
address chip, it wouldn't care whether the address was signed or not!

All it cares about is that there are 2**16 combinations of 1/0 values on
those 16 pins.

So it might be that a 'signed' address is just how it was represented in
the datasheet.



>>
>>>
>>>>
>>>> My latest language has  'nil' value for pointers (you can't use 0),
>>>> whose value is not specified. But it is generally understood it is
>>>> all zeros.
>>>
>>> That sounds good. A minor point but why use the keyword nil rather
>>> than null?
>>
>> I think that was copied from Pascal which used 'nil'. I didn't encounter
>> C until over a decade later.
>>
>
> I would have guessed you'd have picked it up from Algol 68. A quick
> look at <https://rosettacode.org/wiki/Null_object> shows there are many
> languages that use "nil", though "null" is a bit more popular.
>

Well, I'd used Pascal at that time (early 80s) but not Algol68. But your
link shows that null or nil (both are used) was a common concept.

It's probably trickier in dynamically typed languages; in some, it
corresponds to an 'unassigned' value of a variable.

I've checked my latest dynamic language; that one has four pointer types
(refvar; refpack; refbit; or symbol, used for function pointers), and a
'nil' value that corresponds to a 'refpack' pointer with an all-zeros
address (and a void target).

You don't need to worry about the distinctions, as nil can be used with
all of them, eg for comparisons.

David Brown

unread,
Jul 25, 2021, 12:24:51 PMJul 25
to
Yes.

>
> With a pointer, it doesn't usually make sense to talk of its being
> signed or unsigned. But in the case of that processor with addresses
> from -32768 to 0 to 32767, you can consider those 16 bits as being
> unsigned, so that there /is/ a contiguous range from 0x0000 to 0xFFFF.

I agree that it makes no sense to think of a pointer as being "signed"
or "unsigned". I was referring here to the underlying hardware
implementation, where some processors have addressing modes that
sign-extend smaller sizes to full addresses. (This is common amongst
32-bit RISC systems, where a 16-bit immediate value can be part of an
instruction, but a 32-bit immediate value cannot. So they might have
faster addressing modes for Rx + signed 16-bit offset - including an R0
which always evaluates to 0.) I was not referring to devices that have
only 16-bit addresses.

>
> It's just that at 0x7FFF, at the end of memory, the next byte is at
> 0x8000, which is at the start of memory. But it depends on how the
> hardware works: if you took 16 address lines as inputs to a 64K x 8-bit
> address chip, it wouldn't care whether the address was signed or not!
>
> All it cares about is that there are 2**16 combinations of 1/0 values on
> those 16 pins.
>
> So it might be that a 'signed' address is just how it was represented in
> the datasheet.
>
>
>
>>>
>>>>
>>>>>
>>>>> My latest language has  'nil' value for pointers (you can't use 0),
>>>>> whose value is not specified. But it is generally understood it is
>>>>> all zeros.
>>>>
>>>> That sounds good. A minor point but why use the keyword nil rather
>>>> than null?
>>>
>>> I think that was copied from Pascal which used 'nil'. I didn't encounter
>>> C until over a decade later.
>>>
>>
>> I would have guessed you'd have picked it up from Algol 68.  A quick
>> look at <https://rosettacode.org/wiki/Null_object> shows there are many
>> languages that use "nil", though "null" is a bit more popular.
>>
>
> Well, I'd used Pascal at that time (early 80s) but not Algol68. But your
> link shows that null or nil (both are used) was a common concept.
>

I too am familiar with "nil" from Pascal - but I was under the
impression that you were familiar with Algol 68. Did you work with
Algol 68 /after/ working with Pascal? Or have I merely imagined that
you used that language?

> It's probably tricker in dynamically typed languages; in some, it
> corresponds to an 'unassigned' value of a variable.
>

In a dynamic language, you might have to distinguish between "a null
value of a pointer type" and "not a value of any type", or "a value of a
null type". Yes, plenty of scope for complications!

Bart

unread,
Jul 26, 2021, 11:07:26 AMJul 26
to
On 25/07/2021 17:24, David Brown wrote:
> On 25/07/2021 13:39, Bart wrote:

[About Algol68]

>> Well, I'd used Pascal at that time (early 80s) but not Algol68. But your
>> link shows that null or nil (both are used) was a common concept.
>>
>
> I too am familiar with "nil" from Pascal - but I was under the
> impression that you were familiar with Algol 68. Did you work with
> Algol 68 /after/ working with Pascal? Or have I merely imagined that
> you used that language?

I'd never used Algol 68; my college didn't have a working compiler, and
it didn't really come up with 8-bit micros.

But I'd read about it and was impressed enough to borrow some of its
syntax, although with some tweaks to make it more practical.

When I finally got to use Algol 68 for real, it might have been 30 years
later, and by then I was somewhat less impressed, after some decades of
devising systems languages for actual, productive work.


Fibonacci benchmark in Algol 68 for A68G (notice the semicolon, which
separates things):

----------------------------------
PROC fib=(INT n)INT:BEGIN
    IF n<3 THEN
        1
    ELSE
        fib(n-1)+fib(n-2)
    FI
END;

FOR i TO 36 DO
    print((i,fib(i),newline))
OD
----------------------------------


In my systems language which here will be similar to the 1980s version
(optional 'return' left out to make it match):

----------------------------------
function fib(int n)int=
    if n<3 then
        1
    else
        fib(n-1)+fib(n-2)
    fi
end

proc start=
    for i to 36 do
        println i,fib(i)
    od
end
----------------------------------

And in my current script language (a 'start' function is optional):

----------------------------------
function fib(n)=
    if n<3 then
        1
    else
        fib(n-1)+fib(n-2)
    fi
end

for i to 36 do
    println i,fib(i)
od
----------------------------------

I think there's no doubt that these look like Algol 68, but the
'stropping' needed there, to allow white space in identifiers, the rules
for semicolons, and various details of syntax (I can never remember if
it's : then = or = then : in function definitions), make it a pain to use.

Certainly my syntax isn't Pascal, nor C.

David Brown

unread,
Jul 26, 2021, 11:29:27 AMJul 26
to
OK, thanks for the explanation.

I have never been quite comfortable with the "value of the function is
the value of the last expression" style supported by many languages - I
like an explicit "return" statement (except in functional programming
languages, which have a very different style). But I guess it is a
matter of habit.

Bart

unread,
Jul 26, 2021, 12:02:19 PMJul 26
to
I'm not too keen on it either; I'd normally write that function as:

function fib(n)=
    if n<3 then
        return 1
    else
        return fib(n-1)+fib(n-2)
    fi
end

(Or with the 'return' just before the 'if'.)

But when you make statements and expressions interchangeable, it has to
work without. However, my syntax needs 'return' for early returns,
otherwise you'd have to arrange for that early return value to be the
last thing executed in that particular branch of the function body,
which can be awkward.

As I said, I like my languages to be practical, and not a pita. (Which
seems to be the opposite approach to many new languages now.)

Andy Walker

unread,
Jul 26, 2021, 12:34:33 PMJul 26
to
On 26/07/2021 16:29, David Brown wrote:
> I have never been quite comfortable with the "value of the function is
> the value of the last expression" style supported by many languages - I
> like an explicit "return" statement [...].

FWIW, you can have that in Algol if you want it, at least if it
has "#define" or near equivalent. Just define "RETURN" to be empty, and
add "RETURN" in front of the [unit which is the] procedure body. There
should perhaps be a "smiley" there for non-UK readers?

--
Andy Walker, Nottingham.
Andy's music pages: www.cuboid.me.uk/andy/Music
Composer of the day: www.cuboid.me.uk/andy/Music/Composers/Coleridge-Taylor

Rod Pemberton

unread,
Jul 27, 2021, 1:22:44 AMJul 27
to
On Sat, 24 Jul 2021 13:40:54 +0100
James Harris <james.h...@gmail.com> wrote:

> On 24/07/2021 13:04, Rod Pemberton wrote:
> > On Fri, 23 Jul 2021 11:50:38 +0100
> > James Harris <james.h...@gmail.com> wrote:

> Rod, your replies on this topic are puzzling me a bit. Perhaps I
> misunderstand what you are saying

...

> but ISTM that you have the wrong
> idea of how a null pointer is frequently used.

No.

> I don't think that's
> likely to be the case so to try to clear up any confusion I'll reply
> to this part of your post specifically and will come back to the rest
> of your post later.

...

> >> Say one had a routine
> >> which checked whether a pointer was valid before dereferencing it,
> >> along the lines of
> >>
> >> if (p != NULL) {
> >> /* work with the object at p */
> >> }
> >
> > Sigh. Why would you (or anyone) ever do this? ...
>
> It's standard programming which is used all the time.

No.

> For example, if
> you wanted to process a tree - let's say in preorder - you could write
>
> preorder(node *n)
> {
> process(n->data);
> if (n->left != NULL) preorder(n->left);
> if (n->right != NULL) preorder(n->right);
> }
>
> or, if you prefer,
>
> preorder(node *n)
> {
> if (n != NULL)
> {
> process(n->data);
> preorder(n->left);
> preorder(n->right);
> }
> }
>
> In either case the point is that the code tests for a pointer being
> null because null means that there is no node.
>
> As with the theme of this topic, there is a particular address which
> is guaranteed not to refer to an object.

Well, I see this is a special case - abuse of NULL actually - which was
outside the topic of the prior conversation, which was avoiding NULL
pointer dereferences, in general, for a language.

Anyway, a NULL address is not required to be used here as a node
terminator. E.g., other magic numbers such as -1 (all bits set) or
(zero if not equivalent to NULL) could be used as well. The choice of
using NULL as a sentinel (or magic value or canary etc) is completely
arbitrary here. I.e., the node terminator in the tree only needs to not
point to any valid node in the tree. For example, you could create
your own "NULL node", which points to an empty node, not inserted into
the tree.
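
A hedged C sketch of that "NULL node" idea for a binary tree (the names
are mine; the same trick appears as the nil sentinel in typical red-black
tree implementations):

    #include <stdio.h>

    struct node {
        int          data;
        struct node *left, *right;
    };

    /* One shared sentinel; its children point back at itself, so even a
       stray dereference lands on known, harmless data rather than faulting. */
    static struct node tree_sentinel = { 0, &tree_sentinel, &tree_sentinel };

    static long sum(const struct node *n)
    {
        if (n == &tree_sentinel)          /* the terminator test, no NULL involved */
            return 0;
        return n->data + sum(n->left) + sum(n->right);
    }

    int main(void)
    {
        struct node leaf = { 3, &tree_sentinel, &tree_sentinel };
        struct node root = { 7, &leaf, &tree_sentinel };
        printf("%ld\n", sum(&root));      /* prints 10 */
        return 0;
    }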

> > As the programmer, you should be coding your program to not have
> > pointers set to NULL especially if the pointer is to be
> > dereferenced. I.e., initialize the pointer prior to use.
>
> Null doesn't mean uninitialised.

True.

> A pointer can be initialised to
> null, as it would be for the tree-walking code, above.

IMO, that (initializing a pointer to NULL) is a really bad idea.


Doing this may be acceptable for the binary tree or a graph, if you
place the appropriate checks. However, in general, it's a bad idea
to set a pointer to NULL, especially in C, as this is what leads to the
unintentional dereferencing of NULL pointers, which is what you said
you wanted to avoid. I.e., initialize the pointer - to some non-NULL
value - prior to usage of the pointer - result is no NULL pointer
dereference. Avoid using/abusing NULL for other things too.

> Similarly, someone might walk a linked list by
>
> while (n != NULL)
> {
> process(n->data);
> n = n->next;
> }
>
> Again, the pointer being null means that there is no object being
> referred to. You can imagine that n-next will have explicitly been
> set to NULL in the last node in the list.
>
>
> Does that change your view of the topic and where a pointer may be
> null?

No. Unfortunately, none of the examples do.

The fact that bad coding practice has been accepted and perhaps is
widely used, doesn't make it good practice. Does it?

Interestingly, I think I've only coded a binary tree once since Pascal
in High School some three decades ago ... It's a really useless data
structure, much like a linked-list. Computer Science professors always
force the students to solve every problem using a linked-list.

Rod Pemberton

unread,
Jul 27, 2021, 1:25:25 AMJul 27
to
On Sat, 24 Jul 2021 16:06:01 +0100
Andy Walker <a...@cuboid.co.uk> wrote:

> On 24/07/2021 13:04, Rod Pemberton wrote:

> > If you can't use the language to program the hardware, then there
> > is no point in using the language at all. If you don't understand
> > this, you're clearly not familiar Pascal. Pascal effectively died
> > because it was unable to be used to directly program the hardware
> > or access memory outside of the scope of the language, e.g.,
> > memory-mapped devices, because it lacked pointers.
>
> Almost by definition, no high-level language can be used
> /portably/ to program the hardware.

Likewise, by definition, no low-level hardware is uniform and
ubiquitous enough to be portably programmed, so the insertion of "portably"
into your claim makes no sense to me. So, if I excise the apparently
useless "portably" from your claim, then I must disagree.

> You need either or both of an
> escape into a lower-level language

True, for things like assembly language instructions and I/O ports.

False, for things like memory-mapped devices.

> or a [portable?] library of ways to access the hardware.

Sure.

If you wish to take that approach, there are some C specifications
which help, such as "Embedded C", and there are portable libraries like
libSDL.

> But many of us write programs that never need to program the hardware.

Does that eliminate the need for the language to be able to do so? In
many instances, the answer is: "No, it doesn't," as that same language
and compiler is usually also used to code the operating system.

Rod Pemberton

unread,
Jul 27, 2021, 1:28:50 AMJul 27
to
On Sat, 24 Jul 2021 13:57:27 +0100
Bart <b...@freeuk.com> wrote:

> On 24/07/2021 13:04, Rod Pemberton wrote:
> > On Fri, 23 Jul 2021 11:50:38 +0100

> >> Say one had a routine
> >> which checked whether a pointer was valid before dereferencing it,
> >> along the lines of
> >>
> >> if (p != NULL) {
> >> /* work with the object at p */
> >> }
> >
> > Sigh. Why would you (or anyone) ever do this? ...
> >
> > As the programmer, you should be coding your program to not have
> > pointers set to NULL especially if the pointer is to be
> > dereferenced. I.e., initialize the pointer prior to use.
>
> You don't seem to understand what NULL is for.
>
> How do /you/ use NULL in C, or do you never actually use it?

fopen(), strrchr(), strtod(), and strtoul() seem to be the most common
situations where I use NULL, i.e., return value, required parameter.
There is very little use of it in my code otherwise.

In other words, I generally only use NULL when or where required, e.g.,
many C library functions return it, or where required for unused
parameters.

Otherwise, I want my pointers to be initialized to a valid non-NULL
value so if they're unintentionally dereferenced, nothing bad happens.

Rarely, there were some instances where I've initialized the pointer to
NULL, to make sure the correct code branch initialized the pointer to a
non-NULL value.

> Actually, how would you code a linked list without using some
> sentinel to mark the end of the list?

As stated to James, the choice of using NULL as the node terminator,
whether you call it a sentinel or magic value or canary etc, is a
completely arbitrary choice.

AISI, a linked-list was really outside the topic of what James was
discussing originally, or was just a small subset of it. He wanted to
prevent dereferencing NULL pointers in the generic sense for the
entirety of the language. I.e., initialize pointers to non-NULL, avoid
NULL, etc.

> The use of NULL like this is EVERYWHERE and in every API.

I don't see NULL checks as being the dominant use of NULL in C, at
least for the programmer. I'm sure CYA programming is rife in APIs and
libraries. As stated to James, bad coding is bad coding. C has other
mechanisms, like ERRNO and integer return values, which could be or
could've been used instead of abusing NULL.

> For example the return value of C's fopen() function is NULL when the
> operation failed.

Yes. But should it be, though, for a new language? ...

I.e., I'm sure that at some point, a C library function setting a
pointer to NULL has resulted in unintentionally dereferencing a NULL
pointer.

Bart

unread,
Jul 27, 2021, 10:14:04 AMJul 27
to
On 27/07/2021 07:24, Rod Pemberton wrote:
> On Sat, 24 Jul 2021 13:40:54 +0100
> James Harris <james.h...@gmail.com> wrote:

>> In either case the point is that the code tests for a pointer being
>> null because null means that there is no node.
>>
>> As with the theme of this topic, there is a particular address which
>> is guaranteed not to refer to an object.
>
> Well, I see this is a special case - abuse of NULL actually - which was
> outside the topic of the prior conversation, which was avoiding NULL
> pointer dereferences, in general, for a language.
>
> Anyway, a NULL address is not required to be used here as a node
> terminator. E.g., other magic numbers such as -1 (all bits set) or
> (zero if not equivalent to NULL) could be used as well. The choice of
> using NULL as a sentinel (or magic value or canary etc) is completely
> arbitrary here. I.e., the node terminator in the tree only needs to not
> point to any valid node in the tree. For example, you could create
> your own "NULL node", which points to an empty node, not inserted into
> the tree.

You could do that, but what would you call it? Would you need to create
a custom 'null'-pointer value for every pointer type? So fopen() would
return NULL_NOFILE for failed operation, and so on for every such
function in existence.

>>> As the programmer, you should be coding your program to not have
>>> pointers set to NULL especially if the pointer is to be
>>> dereferenced. I.e., initialize the pointer prior to use.
>>
>> Null doesn't mean uninitialised.
>
> True.
>
>> A pointer can be initialised to
>> null, as it would be for the tree-walking code, above.
>
> IMO, that (initializing a pointer to NULL) is really bad idea.

Do you have the same opinion about 0 (zero) being used to initialise
signed and unsigned ints of any width, various widths of floats, big
integers, and any of a myriad of derived numeric types?


> Doing this may be acceptable for the binary tree or a graph, if you
> place the appropriate checks. However, in general, it's a bad idea
> to set a pointer to NULL, especially in C, as this is what leads to the
> unintentional dereferencing of NULL pointers, which is what you said
> you wanted to avoid.

So instead you get unintentional dereferencing of pointers to dummy
objects that are not meant to be accessed, which is now undetectable. At
least NULL derefs are normally detected, /when a program goes wrong/.

How would you even get a language to insert checks for invalid pointer
accesses, when the knowledge of what is invalid depends on user code
within the application, outside of the language? Languages KNOW about Null.
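
To make that concrete, here is a hedged sketch of the kind of check a
compiler (or a debug build) could insert in front of a dereference;
checked() and the abort() call are made-up stand-ins for whatever a real
runtime would do:

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int data; struct node *next; };

    /* What a compiler-emitted null check could expand to. */
    static void *checked(void *p, const char *where)
    {
        if (p == NULL) {
            fprintf(stderr, "null dereference in %s\n", where);
            abort();                /* stand-in for a trap or exception */
        }
        return p;
    }

    static int first_value(struct node *n)
    {
        /* The front end could rewrite  n->data  as: */
        return ((struct node *)checked(n, "first_value"))->data;
    }

    int main(void) {
        struct node a = { 42, NULL };
        printf("%d\n", first_value(&a));
        return 0;
    }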


> No. Unfortunately, none of the examples do.

You accept that some sentinel value is needed. You do not accept that
that should be NULL, and consider it an error. Most people disagree.

> The fact that bad coding practice has been accepted and perhaps is
> widely used, doesn't make it good practice. Does it?
>
> Interestingly, I think I've only coded a binary tree once since Pascal
> in High School over 3 some decades ago ... It's a really useless data
> structure, much like a linked-list.

You don't use trees at all? I guess you've never implemented a language
then, because the syntax of most languages is tree-shaped.

> Computer Science professors always
> force the students to solve every problem using a linked-list.

What would you use instead for a data structure that can incrementally
grow and only needs to be serially accessed?

In a lower level language, a linked list is the perfect solution!

(A few years ago, I converted a compiler written in a dynamic language,
with resizable lists that could be organised into trees without using
explicit pointers, into a lower level language.

Most of those reduced down to tight, efficient, linked lists. Trees
(symbol tables and ASTs) were linked multiple ways via pointers, to
result in a version that ran about 30 times faster. Null/nil values were
used everywhere.)
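
As an aside, the kind of structure being argued over is tiny; a hedged C
sketch of the classic NULL-terminated singly linked list (names are
arbitrary):

    #include <stdio.h>
    #include <stdlib.h>

    struct cell { int value; struct cell *next; };

    /* Prepend a value; a NULL next field marks the end of the list. */
    static struct cell *push(struct cell *head, int value)
    {
        struct cell *c = malloc(sizeof *c);
        if (c == NULL)
            return head;                    /* allocation failed: list unchanged */
        c->value = value;
        c->next  = head;
        return c;
    }

    int main(void)
    {
        struct cell *list = NULL;                       /* the empty list */
        for (int i = 1; i <= 5; i++)
            list = push(list, i);

        for (struct cell *c = list; c != NULL; c = c->next)
            printf("%d\n", c->value);

        while (list != NULL) {                          /* free the cells */
            struct cell *next = list->next;
            free(list);
            list = next;
        }
        return 0;
    }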

luserdroog

unread,
Jul 27, 2021, 1:07:37 PMJul 27
to
I'm with Bart and the professors on this one. Linked lists are amazing,
versatile, dare-I-say /natural/ data structures and students deserve
the experience to use them early and often. This drags in (or ignores)
other peripheral concerns like memory management and maybe GC.
But IMO those are also valuable topics to learn something about.

(I'd love to read some more details about this compiler conversion effort,
maybe start a new thread to tell the story?)

Rod Pemberton

unread,
Jul 30, 2021, 5:23:48 AMJul 30
to
On Tue, 27 Jul 2021 15:13:57 +0100
Bart <b...@freeuk.com> wrote:

> On 27/07/2021 07:24, Rod Pemberton wrote:
> > On Sat, 24 Jul 2021 13:40:54 +0100
> > James Harris <james.h...@gmail.com> wrote:

> >> In either case the point is that the code tests for a pointer being
> >> null because null means that there is no node.
> >>
> >> As with the theme of this topic, there is a particular address
> >> which is guaranteed not to refer to an object.
> >
> > Well, I see this is a special case - abuse of NULL actually - which
> > was outside the topic of the prior conversation, which was avoiding
> > NULL pointer dereferences, in general, for a language.
> >
> > Anyway, a NULL address is not required to be used here as a node
> > terminator. E.g., other magic numbers such as -1 (all bits set) or
> > (zero if not equivalent to NULL) could be used as well. The choice
> > of using NULL as a sentinel (or magic value or canary etc) is
> > completely arbitrary here. I.e., the node terminator in the tree
> > only needs to not point to any valid node in the tree. For
> > example, you could create your own "NULL node", which points to an
> > empty node, not inserted into the tree.
>
> You could do that, but what would you call it?

Whatever you want. Something appropriate would probably be good. Not
that it really matters, as whatever you name it, it won't help a
non-native language speaker understand it, just like names of
everything else in a program. Have you ever tried reading a program
coded in a language you don't know? ...

> Would you need to create a custom 'null'-pointer value for every
> pointer type?

No, since you don't need to detect NULL for every pointer type. In
general, you only need to detect NULL when a pointer hasn't been
properly initialized or has been changed to NULL. If the pointer is
properly initialized and nothing resets/sets it to NULL, why would you
need to check for NULL? ... This is similar to keeping track of stack
objects in Forth, or free flags for 6502, or buffer size in C, etc. Is
the programmer keeping track or not? If not, then NULL check overload.

> So fopen() would return NULL_NOFILE for failed operation, and
> so on for every such function in existence.

How is that any different from returning NULL for every failed
operation? I.e., you clearly have an arbitrary preference. Does it
matter if the failure sentinel is NULL, 0, -1, 0xDEAD, 0xBEEF, etc?
The programmer has to use whatever the language provides. So, why
should the failure sentinel be NULL if that could cause problems?

> >>> As the programmer, you should be coding your program to not have
> >>> pointers set to NULL especially if the pointer is to be
> >>> dereferenced. I.e., initialize the pointer prior to use.
> >>
> >> Null doesn't mean uninitialised.
> >
> > True.
> >
> >> A pointer can be initialised to
> >> null, as it would be for the tree-walking code, above.
> >
> > IMO, that (initializing a pointer to NULL) is really bad idea.
>
> Do you have the same opinion about 0 (zero) being used to inialise
> signed and unsigned ints of any width, various widths of floats, big
> integers, and any of a myriad of derived numeric types?

No, because non-pointers aren't supposed to be dereferenced.

No, because integers on certain processors and floats on certain
math coprocessors require all bits zero for the hardware to recognize
them as zero.

> > Doing this may be acceptable for the binary tree or a graph, if you
> > place the appropriate checks. However, in general, it's a bad idea
> > to set a pointer to NULL, especially in C, as this is what leads to
> > the unintentional dereferencing of NULL pointers, which is what you
> > said you wanted to avoid.
>
> So instead you get unintentional [dereferencing] of pointers

Yes.

> to dummy objects that are not meant to be accessed,

No. These are valid objects or should be.

> which is now undetectable.

... only because your assumption is incorrect?

> At least NULL derefs are normally detected, /when a program goes
> wrong/.

When NULL derefs are detected, this halts and/or crashes the program.
How is that any better than coding the program to not fail, and not use
bad data?

> How would you even get a language to insert checks for invalid
> pointer accesses, when the knowledge of what is invalid depends on
> user code within the application, outside of the language? Languages
> KNOW about Null.

What invalid pointer accesses? ... Once you initialize your pointers
to point to valid objects, there is no such thing.

If a "Null node" is used instead of NULL, the dereference will be to
known data instead of bad data, which should be designed to work with
the program without failure, and without producing an incorrect result.

> > No. Unfortunately, none of the examples do.
>
> You accept that some sentinel value is needed.

... is needed for terminal nodes in a linked-list or tree.

> You do not accept that that should be NULL

Yes.

> and consider it an error.

Bad programming.

> Most people disagree.

This is going to sound really arrogant, but I assure you that it's not.

People have always disagreed with me. That has never made them correct.

> > The fact that bad coding practice has been accepted and perhaps is
> > widely used, doesn't make it good practice. Does it?
> >
> > Interestingly, I think I've only coded a binary tree once since
> > Pascal in High School over 3 some decades ago ... It's a really
> > useless data structure, much like a linked-list.
>
> You don't use trees at all?

It seems that I have /one/ piece of code which uses a linked-list,
and it used NULL as the terminator, but it's old, and I never used it.

> I guess you've never implemented a language

Presumptuous. False.

> then, because the syntax of most languages is tree-shaped.

They're tree-shaped, but only if you allow them to be represented that
way. You don't have to. Why should you? Natural fit? If you can
represent them in a manner which better fits the program or language,
why wouldn't you do that instead of using the natural fit?

> > Computer Science professors always
> > force the students to solve every problem using a linked-list.
>
> What would you use instead for a data structure [...?]

Array.
Stack.

> What would you use instead for a data structure that can
> incrementally grow and only needs to be serially accessed?

Both of my answers can "incrementally grow", both can be "serially
accessed" too. You can store all the information in a tree or
linked-list as a link-less, node-less stack or array. However, using
just one pointer into the stack/array will allow you to do everything
you need to access, manipulate, delete, etc. The only real issue is
insertion. So, don't insert or use other techniques to locate the data
elsewhere. This technique has less memory overhead too, since the
header structure for the linked-list or nodes containing all the
pointers isn't present.
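
One hedged reading of that suggestion in C: a growable array used where a
linked list might otherwise appear, with a single index doing the job of
the traversal pointer (the names and growth policy are arbitrary):

    #include <stdlib.h>
    #include <stdio.h>

    struct int_vec {
        int   *items;
        size_t count, capacity;
    };

    /* Append, growing the backing store as needed; no per-element links. */
    static int vec_push(struct int_vec *v, int value)
    {
        if (v->count == v->capacity) {
            size_t cap = v->capacity ? v->capacity * 2 : 8;
            int *p = realloc(v->items, cap * sizeof *p);
            if (p == NULL)
                return -1;
            v->items = p;
            v->capacity = cap;
        }
        v->items[v->count++] = value;
        return 0;
    }

    int main(void)
    {
        struct int_vec v = { 0 };
        for (int i = 1; i <= 5; i++)
            vec_push(&v, i * 10);
        for (size_t i = 0; i < v.count; i++)    /* serial access via an index */
            printf("%d\n", v.items[i]);
        free(v.items);
        return 0;
    }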

> In a lower level language, a linked list is the perfect solution!

...

> (A few years ago, I converted a compiler written in a dynamic
> language, with resizable lists that could be organised into trees
> without using explicit pointers, into a lower level language.
>
> Most of those reduced down to tight, efficient, linked lists. Trees
> (symbol tables and ASTs) were linked multiple ways via pointers, to
> result in a version that ran about 30 times faster. Null/nil values
> were used everywhere.)

They probably could've been reduced down to tight, efficient, arrays or
stacks too.


In general, I see linked-lists as over-prescribed, overused, usually
unnecessary, having excess overhead, and used as a crutch by
inexperienced programmers to eliminate needing to think about using a
different solution. They've become a one-size-fits-all technique.

--
...

Rod Pemberton

unread,
Jul 30, 2021, 5:24:19 AMJul 30
to
Well now, I won't let your apparent love of linked-lists taint my
replies to you about Forth on c.l.a.x. or c.l.f.


--
...

Bart

unread,
Jul 30, 2021, 10:57:42 AMJul 30
to
On 30/07/2021 11:25, Rod Pemberton wrote:
> On Tue, 27 Jul 2021 15:13:57 +0100
> Bart <b...@freeuk.com> wrote:
>


> Bad programming.

So, what does your memory allocator return when there was a problem
allocating?

A pointer to a predetermined dummy object which the caller has to compare
the result with? Fine, but if you call that pointer NULL, then that is
little different to what happens now.

The trouble with your scheme is that every library, every function,
might define a different dummy object (although only one can be called
NULL).

If you want to write any generic routines, then it will cause problems.
For example, just to print the value of a pointer of any type, it's
going to be hard to see which of these pointer values:

0F490D8 0F490C0 0F500F8

has a 'null' value. Apparently, in your programs, all of them could be!
Take a wild guess as to which ones are nulls in this version:

0F490D8 0000000 0F500F8

>> I guess you've never implemented a language
>
> Presumptuous. False.
>
>> then, because the syntax of most languages is tree-shaped.
>
> They're tree-shaped, but only if you allow them to be represented that
> way. You don't have to. Why should you? Natural fit? If you can
> represent them in a manner which better fits the program or language,
> why wouldn't you do that instead of using the natural fit?

Most typical compilers convert source of a structured language into
an AST (syntax tree), and may represent the generated symbol table as a
hierarchy, also a tree, before processing further into more linear formats.

You can skip those steps and translate straight to linear code, but IME
those products are inferior. (I've done it both ways.)

>>> Computer Science professors always
>>> force the students to solve every problem using a linked-list.

I didn't have a CS professor looking over my shoulder when devising my
own languages. Back then, the limitations of the hardware dictated their
design and my style of programming.

(Anyway, these days they'll all be teaching functional, immutable
programming; forget using pointers at all, or even loops!)


>> Most of those reduced down to tight, efficient, linked lists. Trees
>> (symbol tables and ASTs) were linked multiple ways via pointers, to
>> result in a version that ran about 30 times faster. Null/nil values
>> were used everywhere.)
>
> They probably could've been reduced down to tight, efficient, arrays or
> stacks too.
>
>
> In general, I see linked-lists as over-prescribed, overused, usually
> unnecessary, having excess overhead

I write million-lines-per-second compilers using linked lists.

Sometimes I use resizable arrays, but they are less flexible, and have
more overhead than a linked list. I only use them when they will have
very large numbers of elements, not when there are only a handful.

Note that my symbol table entries are created within a resizable array,
but have a hierarchical structure imposed via pointer links.
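
Roughly like this (a simplified C sketch; in the real thing the array
grows and the records carry far more fields):

#include <stddef.h>

typedef struct symbol symbol;
struct symbol {
    const char *name;
    symbol     *owner;       /* enclosing scope, or NULL at the top level */
    symbol     *firstdef;    /* first entry defined inside this one */
    symbol     *nextdef;     /* next sibling in the same scope */
};

static symbol symtab[50000]; /* stands in for the resizable array */
static int    nsymbols;

static symbol *addsymbol(const char *name, symbol *owner)
{
    symbol *d = &symtab[nsymbols++];
    d->name = name;
    d->owner = owner;
    d->firstdef = NULL;
    d->nextdef = NULL;
    if (owner) {                         /* impose the hierarchy via links */
        d->nextdef = owner->firstdef;
        owner->firstdef = d;
    }
    return d;
}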

James Harris

unread,
Jul 31, 2021, 6:23:04 AMJul 31
to
On 23/07/2021 14:44, anti...@math.uni.wroc.pl wrote:
> James Harris <james.h...@gmail.com> wrote:
>> For my OS project I have been looking at program loading and that has
>> led me to query what would be required in a language to support address
>> zero being accessible and a pointer to it being considered to be valid.
>>
>> If I use paging I can reserve the lowest page so that address zero is
>> inaccessible. A dereference of a zeroed pointer would be caught by the
>> CPU triggering a fault.
>>
>> However, if on x86 I don't use paging then a reference to address zero
>> would not trigger a fault so it would not be caught and diagnosed. And
>> it's not just that case. Other CPUs or microcontrollers may, presumably,
>> allow access to address zero. Therefore there are cases where a program
>> may have a pointer to address zero and that pointer could be legitimate.
>>
>> Hence the question: how should a language support access to address
>> zero? Any ideas?
>
> You mix many different things. As other noted, null pointer in
> a programming may be address 0, but there are also different
> possible representations. It is for programming language to
> decide what happens with null pointer is dereferenced. AFAIK
> in Vax C page 0 was filled with 0, null pointer was represented
> by 0, so read acccess via null pointer gave 0. Effectively,
> null pointer was pointer to a canonical empty string. I guess
> that trying to write gave a fault. On machine with true ROM
> normally attempts to write to ROM are ignored. If there is
> 0 byte in ROM you could represent null pointer as address of
> zero byte in ROM. In such case reads via null pointer would
> gave 0, writes would be ignored. In Lisp instead of null
> pointer there is NIL. In Lisp dereferencing (reading via) NIL
> has defined effect (IIRC writes are illegal).
>
> So it is really up to you what you decide. Most modern
> languages make dereferencing null pointers illegal, but
> there are notable examples that do differently.

I'd prefer to make dereferencing of null pointers illegal. I cannot see
why anyone would want it to be otherwise because, as I see it, a pointer
which is null means that there's nothing to point at. In such a case,
trying to dereference it is clearly an error.

Such invalid accesses could be detected by the compiled code or by
hardware. It would always be safe for the compiler to emit checks to see
whether a given pointer is null. It's just that on some architectures
such checks could be carried out by the hardware and so could be omitted
from the code.

Whether the attempt to dereference a null pointer were to be detected by
compiled code or by hardware the outcome should, IMO, be the same: a
memory-access exception. Hardware detection just allows some checks to
be omitted from the compiled code.
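
For illustration, a compiler-inserted check might lower to something
like the following C; raise_memory_access_exception is a made-up name
for whatever the runtime provides:

extern void raise_memory_access_exception(void);

int deref_checked(int *p)
{
    if (p == NULL)                   /* omitted where the hardware traps anyway */
        raise_memory_access_exception();
    return *p;
}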

The point remains, though, that in some (perhaps many) architectures
location zero is valid and accessible, meaning that zero is not really an
ideal candidate for null.

The best addresses for null seem to me to be either

* an address which is just above the highest address the program is
permitted to access
* the highest address in memory

If p were null then either scheme could be used to detect

*p

The advantage of the first scheme is that expressions such as

p[n]

would trigger an automatic exception for small n. But that doesn't scale
too well: it could fail if n were to be so large that the resulting
address wrapped round past address zero into accessible space.
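
To put numbers on that (a made-up layout): suppose a program may access
0x0000_0000 to 0xBFFF_FFFF and null is set to 0xC000_0000. For an int
pointer p, p[n] addresses 0xC000_0000 + 4*n, which traps for any n below
0x1000_0000; but at n = 0x1000_0000 the address wraps round to
0x0000_0000 and the access silently succeeds.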

So at the moment I am thinking of using the highest address in memory.

The problem with using /any/ value other than zero is that any pointer
in zeroed memory will not automatically be set to null.


>
> Another thing is safety, namely question if your implementation
> will detect illegal program behaviour. There is "C attitude"
> which basically says that "do what the machine do".

Yes, though in my case I want the language to specify its behaviour so
that it runs in the same way on any machine.

...

> "Safe" compiler is supposed to insert such checks.
> Expensive, but optimizing compilers can see that most
> checks are unnecessary and insert checks only in
> case of doubt. And when hardware is capable of checking
> compiler can depend on hardware checks.
>
> Personaly I am against making null pointer dereference
> legal, IMO gains from this are negligible, and loss
> (mainly inability to catch errors) is large.

I agree.

> I am
> for having as much checking as possible, if you feel
> that having checks always on is too expensive at least
> make them available for debugging.
>

I agree but I'd add that there are ways in the design of a language and
a compiler to avoid checks in inner loops where they matter most.
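
For example (a sketch only, with an invented runtime routine), a single
check can be hoisted so the inner loop stays check-free:

#include <stddef.h>

extern void raise_memory_access_exception(void);

long sum(const int *p, size_t n)
{
    long total = 0;
    if (p == NULL && n > 0)             /* one check, outside the loop */
        raise_memory_access_exception();
    for (size_t i = 0; i < n; i++)      /* no per-element null test needed */
        total += p[i];
    return total;
}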


--
James Harris

James Harris

unread,
Jul 31, 2021, 7:07:24 AMJul 31
to
On 25/07/2021 10:53, David Brown wrote:
> On 24/07/2021 19:15, James Harris wrote:
>> On 24/07/2021 14:29, David Brown wrote:
>>
>> ...
>>
>>> So IMHO it makes sense for a language to support /both/ concepts of an
>>> always valid pointer, and of optionally valid pointers.
>>
>> OK.
>>
>> ...
>>
>>>
>>> I'd also avoid "if (p != NULL)", and prefer "if (p)" or "if (valid(p))",
>>
>> Agreed. But if you reject the test
>>
>>   p != NULL
>>
>> then presumably you also reject the assignment
>>
>>   p = NULL;
>
> Of course - for comparable values, != is defined as "not =" (or not ==,
> if you prefer).

I don't follow. The operators you mention are for comparison rather than
assignment.
As above, those comments seem to be about comparison. I don't
necessarily disagree with them but I was asking how you would prefer a
language to allow a pointer to be set to something invalid if you don't
have a NULL-type constant or macro. For example, if p is a pointer to
int, are you saying you would prefer something like

p = int.invalid

over

p = NULL

?

>
> As for my personal preference, were I designing a language, I'd probably
> have 0 as the single null pointer. I'd have some other low-level
> mechanism for accessing data at address 0 for those rare situations
> where it might be useful (in embedded systems, for example, that might
> be a valid address in flash that you read as part of an integrity test).

Well, if access to location zero were to be treated as a special case
and pointers could be passed to callees, then wouldn't you have to
distinguish between those callees which could work with a pointer to
address zero and those which could not? That could get rather complex
rather quickly.

> In C (in real implementations), you can do that using volatile accesses
> - other languages might have different methods.

OT but I thought volatile was essentially about addresses whose values
could be changed without the compiler's knowledge.
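
I take it you mean something along these lines (my guess; it is outside
standard C and relies on the implementation treating the access
literally):

#include <stdint.h>

uint32_t read_word_at_zero(void)
{
    uintptr_t addr = 0;       /* go via an integer variable to avoid writing
                                 a null pointer constant in the cast */
    volatile uint32_t *p = (volatile uint32_t *)addr;
    return *p;                /* forced, un-optimised read of address zero */
}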

>
> But I would still prefer to write "if (p)", rather than "if (p == NULL)"
> - it is simpler, clearer, more general, and says what you actually mean.
> I don't want to test if the value of "p" happens to match the value of
> some macro - I want to test if "p" holds a valid value or not.

Noted and agreed.


--
James Harris

James Harris

unread,
Jul 31, 2021, 8:08:22 AMJul 31
to
On 27/07/2021 07:24, Rod Pemberton wrote:
> On Sat, 24 Jul 2021 13:40:54 +0100
> James Harris <james.h...@gmail.com> wrote:
>
>> On 24/07/2021 13:04, Rod Pemberton wrote:
>>> On Fri, 23 Jul 2021 11:50:38 +0100
>>> James Harris <james.h...@gmail.com> wrote:
>
>> Rod, your replies on this topic are puzzling me a bit. Perhaps I
>> misunderstand what you are saying
>
> ...
>
>> but ISTM that you have the wrong
>> idea of how a null pointer is frequently used.
>
> No.

It's comforting to know that you, Rod, have a different view from almost
every other programmer in existence! In a world of changing values it's
good to see some things remain the same...! ;-)


--
James Harris

James Harris

unread,
Jul 31, 2021, 8:25:01 AMJul 31
to
On 27/07/2021 07:30, Rod Pemberton wrote:
> On Sat, 24 Jul 2021 13:57:27 +0100
> Bart <b...@freeuk.com> wrote:

...

>> You don't seem to understand what NULL is for.
>>
>> How do /you/ use NULL in C, or do you never actually use it?
>
> fopen(), strrchr(), strtod(), and strtoul() seem to be the most common
> situations where I use NULL, i.e., return value, required parameter.
> There is very little use of it in my code otherwise.
>
> In other words, I generally only use NULL when or where required, e.g.,
> many C library functions return it, or where required for unused
> parameters.
>
> Otherwise, I want my pointers to be initialized to a valid non-NULL
> value so if they're unintentionally dereferenced, nothing bad happens.

Wow, that comment is telling! Is the reason you don't like null
pointers, Rod, that they can cause segfaults?

If an invalid dereference slips through the program logic wouldn't it be
better to detect it? Failing to detect an unintentional dereference
could be far worse than a segfault, surely.

RP: "nothing bad happens"..... Eek! **Failing to detect** the
dereference of a pointer which is not supposed to be dereferenced seems
like something bad to me!

...

>> The use of NULL like this is EVERYWHERE and in every API.
>
> I don't see NULL checks as being the dominant use of NULL in C, at
> least for the programmer. I'm sure CYA programming is rife in APIs and
> libraries. As stated to James, bad coding is bad coding. C has other
> mechanisms, like ERRNO and integer return values, which could be or
> could've been used instead of abusing NULL.

I'll probably regret asking but what's CYA programming?


--
James Harris

James Harris

unread,
Jul 31, 2021, 8:47:59 AMJul 31
to
On 30/07/2021 11:25, Rod Pemberton wrote:
> On Tue, 27 Jul 2021 15:13:57 +0100
> Bart <b...@freeuk.com> wrote:

...

>> At least NULL derefs are normally detected, /when a program goes
>> wrong/.
>
> When NULL derefs are detected, this halts and/or crashes the program.
> How is that any better than coding the program to not fail, and not use
> bad data?

Detection of an attempt to dereference a null pointer does not need to
crash a program. It can be converted to an exception. For example, a
program could include code along the lines of

  try
     (*p)++                        /* Update what p points at */
  catch
     case memory_access_exception
        /* Deal with p being invalid */

It is much better for a program to be informed of an invalid access
attempt than for such an attempt to go undetected.
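
On a hosted system the effect can even be approximated today. This
POSIX sketch (not strictly portable, and a real language runtime would
be far more careful about it) turns the hardware trap into something
the program can recover from instead of a crash:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recover;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(recover, 1);          /* unwind to the "catch" point */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);

    int *volatile p = NULL;          /* volatile so the real access is emitted */

    if (sigsetjmp(recover, 1) == 0) {
        (*p)++;                                       /* the "try" part */
        puts("no fault");
    } else {
        puts("memory access exception caught");       /* the "catch" part */
    }
    return 0;
}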


--
James Harris

James Harris

unread,
Jul 31, 2021, 9:06:10 AMJul 31
to
On 22/07/2021 05:39, Rod Pemberton wrote:
> On Wed, 21 Jul 2021 18:24:01 +0100
> James Harris <james.h...@gmail.com> wrote:
>
>> As I say, compiled code often uses address zero as a null pointer,
>> knowing that the OS will make the lowest page inaccessible so that an
>> attempt to access it would generate a fault. And that's what I have
>> long intended to do. But when looking at different ways of loading a
>> program I realised that it was not always possible to reserve address
>> zero.
>
> Yes.
>
> You'll have problems on any environment without paging, e.g., any 8-bit
> computers (C64, Apple II), 16-bit computers (MacIntosh, Amiga), 16-bit
> DOS, 32-bit DOS DPMI without a paging DPMI host, etc.
>
>> In fact, sometimes a lot more than just one page is reserved.
>> According to a video explainer I watched a while ago 32-bit Linux
>> sets the lowest accessible address to something like 0x0040_0000 so
>> that the hardware will trap not just
>>
>> *p
>>
>> but also
>>
>> p[q]
>>
>> for some significant size of q (all where p is null).
>
> Good idea. That will prohibit C programmers from directly programming
> the hardware. This is only needed for rudimentary OSes like DOS, and
> isn't needed and shouldn't be allowed for advanced OSes like Linux or
> Windows, which have paging. I.e., for safety, the modern C programmer
> should be restricted to just the C application space - no hardware. Of
> course, if the programmer was doing OS development, they'd need a way
> around that restriction. E.g., DJGPP has special functions to allow
> access to memory below 1MB.
>
>> The trouble is that that takes away 4M of the address space and, more
>> importantly, means that the addresses a programmer will see in a
>> debugging session or a dump would have more significant digits than
>> they need to have and, therefore, be harder to read than necessary.
>
> To me, this is irrelevant. I use printf() to debug.

Do you never print addresses?

>
>> If, by contrast, null is set to a little higher than the accessible
>> memory program data areas could be at lower addresses making
>> debugging sessions and dumps easier to read.
>>
>> The hardware would still trap on both of
>>
>> *p
>> p[q]
>>
>> In fact, q could potentially be a lot higher than in the 'normal'
>> model because not just 4M but all the addresses from p[0] to the
>> highest memory address would be inaccessible and would trap (given a
>> suitable CPU).
>>
>> To make this work I would have to have null (the null address)
>> determined not at compile time but at program load time.



>
> Why would it need to be any value above 1MB? ... It's usually only RAM
> above 1MB, especially for older computers. I.e., in general, all the
> older BIOS, I/O ports, Vesa BIOS, IVT, BDA, EBDA, RTC, CMOS, etc
> hardware are all located low, below 1MB. It's only modern
> memory-mapped devices, e.g., video frame buffers etc, which are located
> high above 1MB, and indicated via E820h memory map. Don't ask me about
> UEFI.

On x86-32 without paging a program's accessible data space will run from
address zero up to whatever has been defined as the limit of the DS
segment.

On x86-32 with paging the accessible data space will still have an upper
limit but the start of accessible memory could be at an address other
than zero.

Ideally, the same 32-bit code could run in either mode and still detect
attempts to dereference a null pointer.
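
One way to sketch the idea (all names invented): the loader fills in a
global null value once it knows the memory layout, and the emitted
checks compare against that rather than against a literal zero:

#include <stdint.h>

extern void raise_memory_access_exception(void);

uintptr_t null_addr;                 /* written by the loader/startup code */

int checked_load(int *p)
{
    if ((uintptr_t)p == null_addr)
        raise_memory_access_exception();
    return *p;
}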



--
James Harris

anti...@math.uni.wroc.pl

unread,
Aug 1, 2021, 10:44:03 AMAug 1
to
I do not argue for making null pointer dereference legal.
My point was that "nothing to point at" is how you define it.
Other folks (I would call some of them brilliant) decided
to define the meaning of such a dereference. If you look at Lisp
NIL, it points at a real thing in memory. This real thing
is simultaneously a list node and a (Lisp) symbol. You may
call this crazy or smart depending on your view. However,
in Lisp you can do

* (car NIL)

NIL
* (cdr NIL)

NIL
* (symbol-name NIL)

"NIL"

True Lispers will show you various cute snippets of code
which work only due to those properties of NIL, and the alternatives
in a world where dereferencing NIL is an error are more
complicated. So AFAICS the question is whether the ability to write
such cute code is more important than the errors due to unwanted
dereferences of NIL. IMO no, but reasonable people may have
a different opinion.

> Such invalid accesses could be detected by the compiled code or by
> hardware. It would always be safe for the compiler to emit checks to see
> whether a given pointer is null. It's just that on some architectures
> such checks could be carried out by the hardware and so could be omitted
> from the code.
>
> Whether the attempt to dereference a null pointer were to be detected by
> compiled code or by hardware the outcome should, IMO, be the same: a
> memory-access exception. Hardware detection just allows some checks to
> be omitted from the compiled code.
>
> The point remains, though, that in some (perhaps many) architectures
> location zero is valid and accessible meaning that zero is not really an
> ideal candidate for null.

If you are going beyond "high-level assembler", it should be invisible
to a program in your language which address (if any) is used for the null
pointer. The concrete value may be chosen separately for each machine,
based on its properties.

> The best addresses for null seem to me to be either
>
> * an address which is just above the highest address the program is
> permitted to access
> * the highest address in memory
>
> If p were null then either scheme could be used to detect
>
> *p
>
> The advantage of the first scheme is that expressions such as
>
> p[n]
>
> would trigger an automatic exception for small n. But that doesn't scale
> too well: it could fail if n were to be so large that the resulting
> address wrapped round past address zero into accessible space.
>
> So at the moment I am thinking of using the highest address in memory.
>
> The problem with using /any/ value other than zero is that any pointer
> in zeroed memory will not automatically be set to null.

That is quite a different aspect. Some languages provide for
automatic initialization. You may declare a special integer
type so that each variable of this type will be automatically
initialized to 42. Or a special pointer type that will be
initialized so that it points to the string "Uninitialized pointer".
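
Written out by hand in C, the effect could amount to something like
this (the language itself would insert the initializers; the names are
invented):

static const char uninitialized_msg[] = "Uninitialized pointer";

int         answer = 42;                 /* "special" integer type default */
const char *text   = uninitialized_msg;  /* "special" pointer type default */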

> >
> > Another thing is safety, namely question if your implementation
> > will detect illegal program behaviour. There is "C attitude"
> > which basically says that "do what the machine do".
>
> Yes though in my case I want the language to specify its behaviour so
> that it runs in the same way on any machine.

Do you really mean "on any machine"? Or maybe "on any machine
that runs Java well"? Mainstream machines have paging and many
errors will be detected by hardware as page faults (for example,
if you set things up properly, stack overflow will be detected
as a page fault). And you may assume that you will have at least
32-bit arithmetic. But you cannot make such assumptions when
you want to code for small embedded processors.

--
Waldek Hebisch

luserdroog

unread,
Aug 1, 2021, 10:06:03 PMAug 1
to
Thanks. I agree also with your points about stacks and arrays. I think
students should use them a lot, too. Maybe my point is that there ought
to be more /programming/ in the CS curriculum.