End of File (EOF) - multiple meanings

James Harris

Aug 31, 2022, 4:40:37 AM

It's been a while since we had a discussion. I hope you guys are all well.

Am I right that a programmer might want to distinguish between multiple
potential meanings of EOF, as below? How should a programming language
or library support such different indications?

1. In the simplest case (that which is most familiar) one has a file of
8-bit bytes that nothing else has open for writing. If we are reading
that file and there are no bytes left then we get an EOF indication.
That's fine.

2. But what if we have a file or a stream that something else is
potentially still writing to? The reader finding that there are no bytes
left to read is just being told that there are no bytes /now/ but there
may be more soon. (To an extent, that's true of the simple case, too,
where some other task could append to a file.)

3. There's also record-based IO where the file ends with a record which
is so-far incomplete. Something may still be writing that last record
(but doing so without full-record writes).

You may think that strict record-based IO is unusual - and it is in many
contexts - but it also applies with byte-by-byte IO where, for example,
one program wants to read a line at a time and yet that file is being
written to character-by-character. If the last line of the file doesn't
/yet/ have a terminating newline does that mean that it's still being
written or that the writer has closed the file without adding the newline?

4. There's a further case of a composite or multiplexed stream which
consists of multiple logical streams but I'll not go into that in this
post in order to try to keep the post short.

So on a read, if there are /at that moment/ no more bytes left to read
what indication(s) should be returned to the program?

IOW, what facilities should a programming language or library provide to
a programmer to help him to handle different cases of EOF?
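
Case 2 is easy to reproduce with an ordinary file. The following Python sketch (the function name `read_append_read` is made up for illustration) reads a temporary file to its current end, appends through a second handle, and reads again. It suggests that, for a plain file, "no bytes left" only describes the moment of the read:

```python
import os
import tempfile


def read_append_read():
    """Read a file to its current end, append to it, then read again."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as writer, open(path, "rb") as reader:
            writer.write(b"first\n")
            writer.flush()            # make the bytes visible to the reader
            first = reader.read()     # everything written so far
            at_end = reader.read()    # empty: nothing left *at this moment*
            writer.write(b"second\n")
            writer.flush()
            later = reader.read()     # the appended bytes now show up
        return first, at_end, later
    finally:
        os.remove(path)
```

The empty result in the middle looks exactly like the one a reader would get from a finished file, which is the ambiguity the question is about.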

--
James Harris

Dmitry A. Kazakov

Aug 31, 2022, 5:36:06 AM

There is only one case of EOF, namely the end of file (:-))

Other scenarios you described are not.

#2. This is called potential blocking. Blocking /= end of file, as
simple as that. The potential blocking scenario is treated depending on
the I/O mode. Blocking I/O occasionally blocks, non-blocking (also
called immediate) I/O faults with "more data" error. See terminal
interfaces, overlapped I/O under Windows etc.

#3. This is either encoding/buffering or concurrency. It is unclear
which one you meant but either has nothing to do with end of file.

Encoding is transparent to the higher I/O level, so however you encode
the file end it does not matter. Even the QLC SSDs have all files ended
(:-))

Concurrency in I/O just means that you have a race condition while
determining the end of file status. Normally, when undesired, this sort
of stuff is prevented using transactions. Again, that is transparent and
thus does not really matter. The reader starts one transaction, the
writer another, so whatever the writer does the reader sees the old end
of file and a consistent stream. For pipes etc. see #2.

#4. However you combine non-ends they remain unended... (:-))

> IOW, what facilities should a programming language or library provide to
> a programmer to help him to handle different cases of EOF?

Do not let the programmer test for EOF. It is either greatly inefficient
or impossible and non-portable. So just do not provide the test. Force
exceptions for exceptional cases:

   begin
      loop
         Foo := Read (File);
         ... -- Do something useful
      end loop;
   exception
      when End_Error =>
         null; -- We are done
      when Timeout_Error =>
         ... -- Hey, the peer is too slow
      when Data_Error =>
         ... -- Oops, the file is corrupted
   end;
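
For comparison, here is roughly the same shape in Python, as a sketch rather than a recommendation. `EOFError` is Python's built-in exception; `DataError`, `read_u32` and `sum_all` are hypothetical names, and the record format (4-byte little-endian integers) is an assumption made only so that there is something to read:

```python
import io
import struct


class DataError(Exception):
    """Hypothetical: the input ended in the middle of a record."""


def read_u32(stream):
    """Read one 4-byte little-endian integer or raise.

    Assumes a buffered or in-memory stream, where a short read
    only happens at the end of the data.
    """
    raw = stream.read(4)
    if len(raw) == 0:
        raise EOFError("end of stream")
    if len(raw) < 4:
        raise DataError("truncated record")
    return struct.unpack("<I", raw)[0]


def sum_all(stream):
    """Loop without any explicit EOF test; the exception ends the loop."""
    total = 0
    try:
        while True:
            total += read_u32(stream)
    except EOFError:
        pass  # we are done
    return total
```

A truncated final record surfaces as `DataError` and is left for the caller to handle, mirroring the `Data_Error` branch above.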

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

James Harris

Sep 1, 2022, 9:57:11 AM

So you would prefer EOF to mean "there are zero more bytes available to
read at the moment"?

>
> Other scenarios you described are not.
>
> #2. This is called potential blocking. Blocking /= end of file, as
> simple as that. The potential blocking scenario is treated depending on
> the I/O mode. Blocking I/O occasionally blocks, non-blocking (also
> called immediate) I/O faults with "more data" error. See terminal
> interfaces, overlapped I/O under Windows etc.

Have to say that "blocking" is possibly a bad name as it has another
meaning of "assembling into blocks". Maybe "waiting" would be better
meaning that the caller doesn't mind waiting for data.

>
> #3. This is either encoding/buffering or concurrency. It is unclear
> which one you meant but either has nothing to do with end of file.

To make up an example, say a process is reading a Unix-form text file
expecting whole lines but the last line in the file doesn't have a
terminating newline character. Its options for each 'line read' are

1) Don't wait.
2) Wait forever.
3) Wait a limited amount of time.

Each has its disadvantages. Under option 1 a program may conclude that
it has reached EOF (and the last line in the file doesn't have a
terminating newline) even when another program is still writing to the file.

Imagine a file which had been created years ago but the last line of
which omitted the trailing newline which it should have. Under option 2
a program reading it would wait forever and never see EOF.

Option 3 could have the disadvantages of the other two but in addition
it could ignore an unterminated trailing line even though a user looking
at the file would see text which he expected to be processed.
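
The three options can be folded into one timeout parameter. This is a minimal sketch assuming a POSIX system and a pipe (or similar descriptor) that `select.select` can wait on; `read_line` and its status strings are invented for illustration:

```python
import os
import select


def read_line(fd, timeout):
    """Read bytes from fd up to and including a newline.

    timeout: 0 means don't wait, None means wait forever,
    any other value is the number of seconds to wait for each byte.
    Returns (data, status), status being "line", "eof" or "timeout".
    """
    buf = bytearray()
    while True:
        ready, _, _ = select.select([fd], [], [], timeout)
        if not ready:
            return bytes(buf), "timeout"   # partial line, writer still open
        chunk = os.read(fd, 1)             # one byte at a time: simple, slow
        if not chunk:
            return bytes(buf), "eof"       # writer closed; buf may be partial
        buf += chunk
        if chunk == b"\n":
            return bytes(buf), "line"
```

Note that this only works because a pipe can report that its writer has gone away. On a regular file `select` always reports "ready" and a read at the current end returns nothing, so "eof" there would still mean "nothing more just now".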

I am not trying to be awkward, by the way. :) Just to think about
situations a programmer might want to distinguish between.

>
> Encoding is transparent to the higher I/O level, so however you encode
> the file end it does not matter. Even the QLC SSDs have all files ended
> (:-))
>
> Concurrency in I/O just means that you have a race condition while
> determining the end of file status. Normally, when undesired, this sort
> of stuff is prevented using transactions. Again, that is transparent and
> thus does not really matter. The reader starts one transaction, the
> writer another, so whatever the writer does the reader sees the old end
> of file and a consistent stream. For pipes etc. see #2.

Yes, a higher-level concept can be imposed on a simple stream.

>
> #4. However you combine non-ends they remain unended... (:-))

I don't understand that. [:(] Is it important?

>
>> IOW, what facilities should a programming language or library provide
>> to a programmer to help him to handle different cases of EOF?
>
> Do not let the programmer test for EOF. It is either greatly inefficient
> or impossible and non-portable. So just do not provide the test. Force
> exceptions for exceptional cases:
>
>    begin
>       loop
>          Foo := Read (File);
>          ... -- Do something useful
>       end loop;
>    exception
>       when End_Error =>
>          null; -- We are done
>       when Timeout_Error =>
>          ... -- Hey, the peer is too slow
>       when Data_Error =>
>          ... -- Oops, the file is corrupted
>    end;
>

That's good. My language has exceptions and they are the mechanism I
planned to use for EOF and the other cases. I just wasn't sure which
conditions (including exceptions) a programmer might want to test for.

In your code I take it that there being zero bytes to read would lead to
End_Error.

--
James Harris

Dmitry A. Kazakov

Sep 1, 2022, 11:12:16 AM

On 2022-09-01 15:57, James Harris wrote:
> On 31/08/2022 10:36, Dmitry A. Kazakov wrote:

>> There is only one case of EOF, namely the end of file (:-))
>
> So you would prefer EOF to mean "there are zero more bytes available to
> read at the moment"?

No. EOF means the file/container ends here, like 'Z' is the last letter
of the alphabet.

>> Other scenarios you described are not.
>>
>> #2. This is called potential blocking. Blocking /= end of file, as
>> simple as that. The potential blocking scenario is treated depending
>> on the I/O mode. Blocking I/O occasionally blocks, non-blocking (also
>> called immediate) I/O faults with "more data" error. See terminal
>> interfaces, overlapped I/O under Windows etc.
>
> Have to say that "blocking" is possibly a bad name as it has another
> meaning of "assembling into blocks". Maybe "waiting" would be better
> meaning that the caller doesn't mind waiting for data.

No, blocking vs. non-blocking I/O is more or less the official term. The
thing you meant is just called "block", e.g. "block device" under Linux,
"block I/O" (I/O in blocks).

>> #3. This is either encoding/buffering or concurrency. It is unclear
>> which one you meant but either has nothing to do with end of file.
>
> To make up an example, say a process is reading a Unix-form text file
> expecting whole lines but the last line in the file doesn't have a
> terminating newline character. Its options for each 'line read' are
>
> 1) Don't wait.
> 2) Wait forever.
> 3) Wait a limited amount of time.

If EOF™ has been reached, none of the above happens. Instead you get:

0) Data_Error exception: file is corrupt, no end line delimiter found
before the file end.

You are trying to break an encoding abstraction here: a line is a sequence
of characters ending with LF. Arguably a poor one, but most UNIX
stuff is... (:-))
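
As a sketch of option 0 in Python: a line reader that treats a final line without LF as corrupt data rather than as a short line. `get_line` and `DataError` are hypothetical names, and the policy is the one proposed above, not what Python's own `readline` does:

```python
import io


class DataError(Exception):
    """Hypothetical: the last line has no terminating newline."""


def get_line(stream):
    """Return the next line without its LF, or raise.

    EOFError: the stream ended cleanly after the previous line.
    DataError: the stream ended in the middle of a line.
    """
    line = stream.readline()
    if line == "":
        raise EOFError("end of file")
    if not line.endswith("\n"):
        raise DataError("no line delimiter before end of file")
    return line[:-1]
```

With the input `"a\nb"`, the first call returns `"a"` and the second raises `DataError`, so the caller decides whether the dangling `b` is usable.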

[...]

> I am not trying to be awkward, by the way. :) Just to think about
> situations a programmer might want to distinguish between.

He need not, because in properly designed software he would call
"gets" or "Get_Line" and let the library take care of it.

>> #4. However you combine non-ends they remain unended... (:-))
>
> I don't understand that. [:(] Is it important?

Yes, abstractions are important, just in order to keep things
consistent. If you break an abstraction you might get into a situation
where no answer exists.

>>> IOW, what facilities should a programming language or library provide
>>> to a programmer to help him to handle different cases of EOF?

As I said. There is just one case and it is handled pretty much
consistently across different OSes. E.g.

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setendoffile

>> Do not let the programmer test for EOF. It is either greatly
>> inefficient or impossible and non-portable. So just do not provide the
>> test. Force exceptions for exceptional cases:
>>
>>     begin
>>        loop
>>           Foo := Read (File);
>>           ... -- Do something useful
>>        end loop;
>>     exception
>>        when End_Error =>
>>           null; -- We are done
>>        when Timeout_Error =>
>>           ... -- Hey, the peer is too slow
>>        when Data_Error =>
>>           ... -- Oops, the file is corrupted
>>     end;
>>
>
> That's good. My language has exceptions and they are the mechanism I
> planned to use for EOF and the other cases. I just wasn't sure which
> conditions (including exceptions) a programmer might want to test for.
>
> In your code I take it that there being zero bytes to read would lead to
> End_Error.

AFAIK, that is a convention deployed in TCP sockets only. If you read
empty payload from TCP socket that is an indication that the connection
was gracefully closed by the peer, an equivalent of EOF.

This is a quirk of the socket library. It could have used error states to
signal a closed socket and so allowed a zero-length payload. Zero-length
payloads are used in many protocols, e.g. for pinging, keeping a connection
alive, time synchronization etc. So the choice is rather unfortunate. On
the other hand, never returning zero bytes is very helpful for parsers and
messaging: that you never stay in the same place is a guarantee against
livelocks.
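
The empty-read convention can be seen with a connected pair of stream sockets. This sketch uses `socket.socketpair()` for convenience, so on most systems the pair is not literally TCP, but a stream socket reports a peer's orderly close the same way; `recv_until_closed` is a made-up helper name:

```python
import socket


def recv_until_closed(sock):
    """Collect data until recv() returns b"", i.e. the peer has closed."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if data == b"":        # zero-length result: orderly shutdown by peer
            break
        chunks.append(data)
    return b"".join(chunks)


a, b = socket.socketpair()
a.sendall(b"hello")
a.close()                      # the reader will now see the empty result
received = recv_until_closed(b)
b.close()
```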

In serial communication the same idea led to the introduction of the EOT
character in ASCII. So you could read EOT instead of nothing when there
is nothing to read but you still wanted to... (:-)) Of course, in
practice nobody ever uses EOT. If I remember correctly, Microsoft used
EOT at the end of text files in MS-DOS?

James Harris

Sep 3, 2022, 8:12:51 AM

On 01/09/2022 16:12, Dmitry A. Kazakov wrote:
> On 2022-09-01 15:57, James Harris wrote:
>> On 31/08/2022 10:36, Dmitry A. Kazakov wrote:
>
>>> There is only one case of EOF, namely the end of file (:-))
>>
>> So you would prefer EOF to mean "there are zero more bytes available
>> to read at the moment"?
>
> No. EOF means the file/container ends here, like 'Z' is the last letter
> of the alphabet.

How is "the file ends here" different from there being nothing left to
read? If the file ends here (your definition) then there's nothing left,
surely.

There's a further case, too, as follows.

Imagine that the offset of the next byte to read is the same as the
file's size. Take that as the standard condition for a read to return EOF.

What if the offset is set to some byte /after/ where the file ends, e.g.
the file length is 50 and the offset is 55.

As a programmer, would you want to get EOF in that case, too, or would
you want some separate exception such as 'read past EOF'?

...

>> To make up an example, say a process is reading a Unix-form text file
>> expecting whole lines but the last line in the file doesn't have a
>> terminating newline character. Its options for each 'line read' are
>>
>> 1) Don't wait.
>> 2) Wait forever.
>> 3) Wait a limited amount of time.
>
> If EOF™ has been reached, none of the above happens. Instead you get:
>
> 0) Data_Error exception: file is corrupt, no end line delimiter found
> before the file end.

I like that. Throwing an exception is a good way to address the problem
of an incomplete last line. Then the caller can choose how to respond.

>
> You are trying to break an encoding abstraction here: a line is a sequence
> of characters ending with LF. Arguably a poor one, but most UNIX
> stuff is... (:-))
>
> [...]
>
>> I am not trying to be awkward, by the way. :) Just to think about
>> situations a programmer might want to distinguish between.
>
> He need not, because in properly designed software he would call
> "gets" or "Get_Line" and let the library take care of it.

Well, gets or Get_Line will still have the same issues to deal with. In
fact, the programmer might want to be able to tell such a function how
it should respond to different conditions: throw exception, return a
partial result, return nothing, etc.

...

>>> Do not let the programmer test for EOF. It is either greatly
>>> inefficient or impossible and non-portable. So just do not provide
>>> the test. Force exceptions for exceptional cases:
>>>
>>>     begin
>>>        loop
>>>           Foo := Read (File);
>>>           ... -- Do something useful
>>>        end loop;
>>>     exception
>>>        when End_Error =>
>>>           null; -- We are done
>>>        when Timeout_Error =>
>>>           ... -- Hey, the peer is too slow
>>>        when Data_Error =>
>>>          ... -- Oops, the file is corrupted
>>>     end;
>>>
>>
>> That's good. My language has exceptions and they are the mechanism I
>> planned to use for EOF and the other cases. I just wasn't sure which
>> conditions (including exceptions) a programmer might want to test for.
>>
>> In your code I take it that there being zero bytes to read would lead
>> to End_Error.
>
> AFAIK, that is a convention deployed in TCP sockets only. If you read
> empty payload from TCP socket that is an indication that the connection
> was gracefully closed by the peer, an equivalent of EOF.

At the point of the call

Foo := Read (File);

let's say the file has ended. What would you expect to happen? I thought
that was where your End_Error exception would be thrown.

--
James Harris

Dmitry A. Kazakov

Sep 3, 2022, 10:47:38 AM

On 2022-09-03 14:12, James Harris wrote:
> On 01/09/2022 16:12, Dmitry A. Kazakov wrote:
>> On 2022-09-01 15:57, James Harris wrote:
>>> On 31/08/2022 10:36, Dmitry A. Kazakov wrote:
>>
>>>> There is only one case of EOF, namely the end of file (:-))
>>>
>>> So you would prefer EOF to mean "there are zero more bytes available
>>> to read at the moment"?
>>
>> No. EOF means the file/container ends here, like 'Z' is the last
>> letter of the alphabet.
>
> How is "the file ends here" different from there being nothing left to
> read? If the file ends here (your definition) then there's nothing left,
> surely.

Reading is an operation, EOF is a state. The semantics of read depends
on the state.

> There's a further case, too, as follows.
>
> Imagine that the offset of the next byte to read is the same as the
> file's size. Take that as the standard condition for a read to return EOF.

Unnecessary assumptions. There could be no bytes and the file never
constructed as a whole as in the case with pipes.

> What if the offset is set to some byte /after/ where the file ends, e.g.
> the file length is 50 and the offset is 55.

Then you have a wrong offset, provided an offset exists at all, since that
depends on the type of file, e.g. a random access file.

> As a programmer, would you want to get EOF in that case, too, or would
> you want some separate exception such as 'read past EOF'?

You cannot read past EOF by definition. Whether reading past the
file end causes an exception or like in the case of the C library
returns a special value is up to the designer of the API.
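
To illustrate the "special value" style with the numbers from the question: on a regular file, Python's file objects (like the underlying POSIX `read`) make no distinction between an offset at the end and an offset beyond it. A small sketch, with `read_at` and `demo` as invented helpers:

```python
import tempfile


def read_at(f, offset, count):
    """Seek to an absolute offset and try to read count bytes."""
    f.seek(offset)
    return f.read(count)


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 50)              # file length is 50
        tail = read_at(f, 45, 10)       # only 5 bytes remain
        at_end = read_at(f, 50, 10)     # offset == length
        past_end = read_at(f, 55, 10)   # offset > length
        return tail, at_end, past_end, f.tell()
```

Both `at_end` and `past_end` come back empty and no error is raised, so an API that wants a separate "read requested past EOF" condition has to check the offset itself.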

>> You are trying to break an encoding abstraction here: a line is a
>> sequence of characters ending with LF. Arguably a poor one, but most
>> UNIX stuff is... (:-))
>>
>> [...]
>>
>>> I am not trying to be awkward, by the way. :) Just to think about
>>> situations a programmer might want to distinguish between.
>>
>> He need not, because in properly designed software he would call
>> "gets" or "Get_Line" and let the library take care of it.
>
> Well, gets or Get_Line will still have the same issues to deal with. In
> fact, the programmer might want to be able to tell such a function how
> it should respond to different conditions: throw exception, return a
> partial result, return nothing, etc.

If you are not satisfied with the abstraction use a different one.

Note that before the Dark Age, advanced file systems supported
lines/records physically. E.g. in VMS the line would be a varying
record, no delimiters. If you make an abstraction transparent you limit
the possible implementations.

What about the case of sockets? They are not proper files, so you need to
reformulate the question. E.g. suppose you wanted to add a stream interface
to a socket. The answer is that you could do that, but usually such
streams are useless in the sense that most network protocols are
packet-oriented. Thus you read packets rather than a raw stream of
octets, and you always know how much there is to read. If the peer
closes the connection in the middle of a packet, you have a Data_Error
rather than End_Error. Furthermore production code tends to use socket
select rather than blocking reads:

https://man7.org/linux/man-pages/man2/select.2.html

It is a totally inverse abstraction. You get a socket signaled when
there is something to read and then take the buffered stuff from the
socket. So if Foo is a composite object you could not just read as:

Foo := Read (Socket);

because you do not know if the encoded instance of the object is all
there. It is like the situation with lines. The line might be incomplete
and there is nothing to read yet. You cannot recover from that unless
you block but you are not allowed to block. The abstraction (stream of
octets) leaks. This is why socket streams have very limited use in
practice, namely only for quick and dirty implementations deploying
blocking I/O.
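
A common way to live with this is to buffer whatever has arrived and only hand complete packets upward. The sketch below assumes a made-up framing (a 4-byte big-endian length followed by the payload); `FrameDecoder` and `DataError` are hypothetical names, and the socket and `select` parts are left out so that only the buffering logic is shown:

```python
import struct


class DataError(Exception):
    """Hypothetical: the peer closed the connection mid-packet."""


class FrameDecoder:
    """Accumulates bytes as they arrive and yields complete packets only."""

    HEADER = struct.Struct("!I")   # assumed framing: 4-byte length prefix

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        """Add newly received bytes; return the packets completed so far."""
        self._buf += data
        frames = []
        while len(self._buf) >= self.HEADER.size:
            (length,) = self.HEADER.unpack_from(self._buf)
            end = self.HEADER.size + length
            if len(self._buf) < end:
                break              # packet incomplete: wait for more data
            frames.append(bytes(self._buf[self.HEADER.size:end]))
            del self._buf[:end]
        return frames

    def close(self):
        """Call when the peer closes; leftover bytes mean a broken packet."""
        if self._buf:
            raise DataError("connection closed in the middle of a packet")
```

The caller would invoke `feed` each time `select` reports the socket readable, and `close` when a read returns nothing, which gives the `Data_Error` versus `End_Error` distinction described above without ever blocking.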

luserdroog

Sep 11, 2022, 9:58:14 PM

On Wednesday, August 31, 2022 at 3:40:37 AM UTC-5, James Harris wrote:
> It's been a while since we had a discussion. I hope you guys are all well.
>
> Am I right that a programmer might want to distinguish between multiple
> potential meanings of EOF, as below? How should a programming language
> or library support such different indications?
>

Ignoring all of your actual questions, I've been using a few different means of
indicating an EOF condition in my parser combinator library (which doubles as
the compiler's data structure library or the interpreter runtime -- depending on
what application is built from the parsers).

An EOF condition that is produced while reading the bytes of a file, where the
file is represented as a lazy list of its bytes, is represented by the <symbol> EOF.
Although this symbol can also be chopped off in which case the end of file is
indicated by a NIL instead of another cons node. While the symbol EOF is *backed*
by its code value of -1, it has a distinguished type from the bytes of the file which
will be <integer> typed objects. A file stream might even contain a 32bit -1 encoded
in extended UTF8 if the file is fed through the UTF8 decoder, but the <integer> -1 and
the <symbol> -1 (whose print name is "EOF") are different things.
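
A rough Python analogue of that idea, not the library's actual code: the stream is a lazy sequence of integers terminated by a marker object whose type differs from `int`, even though it carries the code -1. The names `Symbol`, `EOF` and `lazy_bytes` are assumptions for this sketch:

```python
import io


class Symbol:
    """A named marker; deliberately not an int, even if its code is -1."""

    def __init__(self, name, code):
        self.name = name
        self.code = code

    def __repr__(self):
        return self.name


EOF = Symbol("EOF", -1)


def lazy_bytes(stream):
    """Lazily yield each byte as an int, then the EOF symbol, then stop."""
    while True:
        chunk = stream.read(1)
        if not chunk:
            yield EOF
            return
        yield chunk[0]
```

Because `EOF` is its own type, a consumer can never confuse it with an integer -1 that happens to appear in decoded data.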

James Harris

Nov 16, 2022, 5:39:23 AM

On 03/09/2022 15:47, Dmitry A. Kazakov wrote:
> On 2022-09-03 14:12, James Harris wrote:
>> On 01/09/2022 16:12, Dmitry A. Kazakov wrote:
>>> On 2022-09-01 15:57, James Harris wrote:
>>>> On 31/08/2022 10:36, Dmitry A. Kazakov wrote:

Going back to this thread as I had to make a choice for some code I
wrote recently.

>>>
>>>>> There is only one case of EOF, namely the end of file (:-))
>>>>
>>>> So you would prefer EOF to mean "there are zero more bytes available
>>>> to read at the moment"?
>>>
>>> No. EOF means the file/container ends here, like 'Z' is the last
>>> letter of the alphabet.
>>
>> How is "the file ends here" different from there being nothing left to
>> read? If the file ends here (your definition) then there's nothing
>> left, surely.
>
> Reading is an operation, EOF is a state. The semantics of read depends
> on the state.

My best guess at what you are driving at is that to you EOF is a
higher-level, logical state rather than a physical one. That's fine.
Sometimes one needs to recognise that "/at the moment/ the file ends
here" is different from "we are at the end of a complete file".

For instance, a file may currently end at byte 49 but a nanosecond later
something is going to write byte 50 and the file is not complete until
byte 50 is also present.

The problem is that the OS may well not know, so it could not tell a
program anything other than "there are currently no more bytes to read".

If the OS had some way to know that byte 50 was needed it could block
the reader until byte 50 arrived or return an indication that more data
was to follow. But that's not always possible.

>
>> There's a further case, too, as follows.
>>
>> Imagine that the offset of the next byte to read is the same as the
>> file's size. Take that as the standard condition for a read to return
>> EOF.
>
> Unnecessary assumptions. There could be no bytes and the file never
> constructed as a whole as in the case with pipes.

Some streams of data (e.g. TCP and pipes) do have out-of-band indication
of the difference between "there is nothing more to read now" and "the
stream has ended; nothing more will or can be added".

But plain files do not. Which comes back to the question of what are the
best indications to return to a program.

>
>> What if the offset is set to some byte /after/ where the file ends,
>> e.g. the file length is 50 and the offset is 55.
>
> Then you have a wrong offset, provided an offset exists at all, since that
> depends on the type of file, e.g. a random access file.
>
>> As a programmer, would you want to get EOF in that case, too, or would
>> you want some separate exception such as 'read past EOF'?
>
> You cannot read past EOF by definition.

OK, "read requested when file pointer is past EOF", if you prefer.

> Whether reading past the
> file end causes an exception or like in the case of the C library
> returns a special value is up to the designer of the API.

Sure. I was just wondering what a programmer would find most convenient
and useful to cover the different cases. There are two parts which must
be brought together:

* what the programmer would like to know
* what the environment (the RTS or OS) may be able to say

The programmer may want full information but the environment in which
the program runs may not be able to give that much detail. Yet the same
program will need to be able to run in different environments and on
different types of stream.

This doesn't sound easy to reconcile. For example, a program may want to
know when the logical end of the data has been reached but the
environment may only be able to say "there's nothing more just now" as
with the case of files, above.

So, trying to brainstorm what a program could be told when it tries to
read from a stream of data:

For blocking reads
* Here's all the data you asked for.
* Here's some but less than you asked for.
* I have x amount of the data but it's less than you asked for so I am
returning nothing.
* I have nothing more for you to read just now.
* I have nothing more for you to read just now but the stream is
closable and is not closed so there may be more.
* The stream is closable and is closed; there will be nothing more.
* There was an unrecoverable input error.
* There was an input error which may be correctable but could not be
corrected before the timeout expired.
Or the environment could block until enough data arrive or there's a
timeout.

For nonblocking reads the responses would probably be the same except,
of course, for the potential for blocking. Put another way, nonblocking
reads may be the same as blocking reads with a timeout of zero (?).
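
Several of these replies can already be told apart on a POSIX pipe in non-blocking mode. The following is only a sketch under that assumption; `ReadStatus` and `read_nonblocking` are invented names, and the error and timeout cases from the list are not covered:

```python
import enum
import os


class ReadStatus(enum.Enum):
    FULL = "all the data asked for"
    SHORT = "some, but less than asked for"
    NOTHING_YET = "nothing just now; the stream is still open"
    CLOSED = "the stream is closed; there will be nothing more"


def read_nonblocking(fd, count):
    """Try to read count bytes from a non-blocking descriptor.

    Returns (data, status). Assumes os.set_blocking(fd, False)
    has already been applied to fd.
    """
    try:
        data = os.read(fd, count)
    except BlockingIOError:
        return b"", ReadStatus.NOTHING_YET   # writer exists, no data yet
    if not data:
        return b"", ReadStatus.CLOSED        # every writer has closed
    status = ReadStatus.FULL if len(data) == count else ReadStatus.SHORT
    return data, status
```

On a regular file neither `NOTHING_YET` nor a trustworthy `CLOSED` is available, since the OS cannot say whether another writer will append, which is the gap described earlier in this post.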

Basically, AISI an ostensibly simple 'read' call could get a reply of
any of the above responses and maybe others. The problem is to work out
how such info could be returned so as to make a programmer's life as
easy as possible, especially bearing in mind that his program may run in
different environments and on different types of stream.

Can anyone see where I am going with this - and save me a few steps!
There must already be some standard model of reads that resolves these
issues.

--
James Harris

Dmitry A. Kazakov

Nov 16, 2022, 7:02:42 AM

On 2022-11-16 11:39, James Harris wrote:
> On 03/09/2022 15:47, Dmitry A. Kazakov wrote:

>> Unnecessary assumptions. There could be no bytes and the file never
>> constructed as a whole as in the case with pipes.
>
> Some streams of data (e.g. TCP and pipes) do have out-of-band indication
> of the difference between "there is nothing more to read now" and "the
> stream has ended; nothing more will or can be added".

No, in the general case there is no way to tell, unless you close the
connection or close the pipe.

> But plain files do not. Which comes back to the question of what are the
> best indications to return to a program.

Whatever way suitable in the language. E.g. if you have exceptions, then
an exception. Note that streams are not files, but in an OO language you
can put a file interface on top of a stream and conversely. E.g. if you
create a file interface to a stream, then the interface implementation
is responsible for signaling end of file if that is determinable from the
stream state. For example, when the stream is memory resident, e.g. backed
by a string, then end of the string = end of the file.

>> You cannot read past EOF by definition.
>
> OK, "read requested when file pointer is past EOF", if you prefer.

File offsets can be a burden in some cases. E.g. null device, stock
device, terminal device may have no pointers at all.

> Basically, AISI an ostensibly simple 'read' call could get a reply of
> any of the above responses and maybe others. The problem is to work out
> how such info could be returned so as to make a programmer's life as
> easy as possible, especially bearing in mind that his program may run in
> different environments and on different types of stream.

You pass an array in and get the index of the last overwritten element.
Errors raise exceptions:

- Timeout
- Data error
- End of file, when requested data unavailable
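
One possible reading of that interface, sketched in Python on a memory-backed stream (where end of the string is end of the file). `read_into` is a hypothetical wrapper around the standard `readinto` method; raising `EOFError` only when nothing at all could be read is an assumption of this sketch, and timeout and data errors are omitted:

```python
import io


def read_into(stream, buffer):
    """Fill a non-empty buffer from stream.

    Returns the index of the last element overwritten.
    Raises EOFError if no data was available at all.
    """
    count = stream.readinto(buffer)
    if count == 0:
        raise EOFError("end of file")
    return count - 1


source = io.BytesIO(b"abcdef")   # memory-resident "file"
buf = bytearray(4)
last_first = read_into(source, buf)    # overwrites buf[0..3]
last_second = read_into(source, buf)   # overwrites buf[0..1] only
```

A short read is visible from the returned index alone, so no separate "here's some but less than you asked for" status is needed.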