poll(2) on a pipe fd

Alexander Burger

Sep 18, 2010, 3:15:41 AM

Under which circumstances can it happen that poll(2) on a pipe file
descriptor says "no data available", when in fact data are available?

The real situation is more complicated, but I could trace the problem
and reduce it to the following:

We have two processes
P (parent) and C (child)
P and C communicate via a pipe(2)
C holds an exclusive lock on some file record using fcntl(2)
C writes some data into a pipe to P
C releases the lock
P waits until the lock is released
P calls poll(2) on that pipe
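
The sequence above can be sketched in C as follows. This is a minimal reconstruction under stated assumptions, not the PicoLisp code: `run_scenario`, `set_lock`, `child`, the scratch lock file, and the extra `ready` pipe (used only so that P does not try for the lock before C holds it) are all hypothetical names introduced for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_WRLCK) or drop (F_UNLCK) a lock on the whole file.
 * F_SETLKW blocks while another process holds a conflicting lock. */
static int set_lock(int fd, short type)
{
    struct flock fl = {0};

    assert(type == F_WRLCK || type == F_UNLCK);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                       /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/* C: lock, announce, write to the data pipe, unlock. Never returns. */
static void child(int lockfd, int ready_wr, int data_wr)
{
    if (set_lock(lockfd, F_WRLCK) < 0) _exit(1);
    if (write(ready_wr, "r", 1) != 1) _exit(1);   /* "I hold the lock" */
    if (write(data_wr, "msg", 3) != 3) _exit(1);
    set_lock(lockfd, F_UNLCK);
    _exit(0);
}

/* Returns what the zero-timeout poll() in P reported (0 or 1),
 * or -1 if the setup itself failed. */
int run_scenario(void)
{
    char path[] = "/tmp/pollsketchXXXXXX";  /* hypothetical lock file */
    int data[2], ready[2], result = -1;
    char c;
    int lockfd = mkstemp(path);

    if (lockfd < 0)
        return -1;
    unlink(path);                       /* the open descriptor is enough */
    if (pipe(data) < 0 || pipe(ready) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(data[0]);
        close(ready[0]);
        child(lockfd, ready[1], data[1]);
    }
    close(data[1]);
    close(ready[1]);

    /* P: wait until C holds the lock, then block until C releases it. */
    if (read(ready[0], &c, 1) == 1 && set_lock(lockfd, F_WRLCK) == 0) {
        struct pollfd pfd = { .fd = data[0], .events = POLLIN };
        result = poll(&pfd, 1, 0);      /* zero timeout, as in the report */
        set_lock(lockfd, F_UNLCK);
    }
    waitpid(pid, NULL, 0);
    close(data[0]);
    close(ready[0]);
    close(lockfd);
    return result;
}
```

A return value of 0 from `run_scenario()` would correspond to the "no data available" case described here. The sketch runs a single parent and child and does not attempt to reproduce the heavy load under which the problem was observed.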

Now, poll(2) sometimes - under heavy load and with many child processes
- says that no data are available. But we know from the traces that C
wrote something to the pipe before it released the lock, and thus before
poll(2) was called by P.

If poll(2) is called by P again a little later, it says that there are
indeed data available. But why didn't it say so immediately, given that
the data were definitely in the pipe from C?

The question is: Is the assumption wrong that data in a pipe should be
immediately available to the receiving process, after it made sure that
the sending process successfully finished its job (via the lock)?

It is difficult to provide a simple code example that demonstrates this.
The above effect was detected in PicoLisp, where the 32-bit version uses
select(2) and has run stably for many years, while the 64-bit version
uses poll(2) and shows the observed behavior. It can be reproduced by
running the database stress test "misc/stress.l" in the distribution.

After changing the 64-bit version to select(2) too, it also runs stably
now. Nothing else was changed, so this suggests that there is some
difference in behavior between select(2) and poll(2), but which?

I already posted the same question to this newsgroup 10 years ago,
when I developed that mechanism for the 32-bit version:

http://www.mail-archive.com/linux-deve...@senator-bedfellow.mit.edu/msg01641.html

At that time, Kaz Kylheku replied that data should be immediately
available. How does he know? Is this documented somewhere? From all I
can see when tracing the above processes, this is not the case. If it
is, then only for select(2) but not for poll(2)?

Cheers,
- Alex
--
Software Lab. Alexander Burger
Bahnhofstr. 24a, D-86462 Langweid
a...@software-lab.de, www.software-lab.de, +49 8230 5060

Ersek, Laszlo

Sep 18, 2010, 8:30:57 AM

On Sat, 18 Sep 2010, Alexander Burger wrote:

> Under which circumstances can it happen that poll(2) on a pipe file
> descriptor says "no data available", when in fact data are available?
>
> The real situation is more complicated, but I could trace the problem
> and reduce it to the following:
>
> We have two processes
> P (parent) and C (child)
> P and C communicate via a pipe(2)
> C holds an exclusive lock on some file record using fcntl(2)
> C writes some data into a pipe to P
> C releases the lock
> P waits until the lock is released
> P calls poll(2) on that pipe
>
> Now, poll(2) sometimes - under heavy load and with many child processes
> - says that no data are available. But we know from the traces that C
> wrote something to the pipe before it released the lock, and thus before
> poll(2) was called by P.
>
> If poll(2) is called by P a little later again, it says that there are
> indeed data available, but why it didn't say so immediately, as the data
> was definitely in the pipe from C?
>
> The question is: Is the assumption wrong that data in a pipe should be
> immediately available to the receiving process, after it made sure that
> the sending process successfully finished its job (via the lock)?

The assumption is wrong, in my opinion. I think the system is under no
obligation to treat fcntl() locks as a synchronization primitive wrt. pipe
buffer contents (or memory in general). That is, it doesn't guarantee
immediate visibility of "changes".

The system guarantees, however, that the data will become available
*eventually*.

http://www.opengroup.org/onlinepubs/9699919799/functions/pipe.html

Specify infinite timeout to poll() instead of zero timeout. Once you
acquire the fcntl() lock in P, you know for certain that some data was at
least *submitted*. It becoming visible doesn't depend on any "external"
factors, hence poll() won't block indefinitely -- you just have to give
time to the OS so one asynchronously running part of it can catch up with
your process. In practice it should happen almost immediately. If worried,
use 5 seconds or so, instead of infinite timeout.
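
A bounded wait along these lines might look like the sketch below. The name `wait_for_data` and the choice to treat POLLHUP as "readable" are assumptions made for illustration; note also that retrying after EINTR restarts the full timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Block until fd is readable or timeout_ms elapses (-1 = wait forever).
 * Returns 1 if readable (or at EOF), 0 on timeout, -1 on error. */
int wait_for_data(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n;

    assert(fd >= 0);
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);  /* restart if a signal interrupted */

    if (n <= 0)
        return n;                       /* timeout (0) or error (-1) */
    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}
```

A caller following the suggestion above would use something like `wait_for_data(pipe_fd, 5000)` in place of a zero-timeout poll().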

I'm not sure whether you genuinely need said fcntl() locks, i.e. to
synchronize access to regular file contents. If you don't, and (1) per
individual pipe, you have many writers and a single reader, and (2)
"messages" can fit in _POSIX_PIPE_BUF (512) bytes, then you could remove
fcntl() completely, and block in poll().

http://www.opengroup.org/onlinepubs/9699919799/functions/write.html
http://www.opengroup.org/onlinepubs/9699919799/basedefs/limits.h.html

The 512 bytes mentioned above is a guaranteed minimum. You can figure out
the actual limit (and perhaps increase efficiency) with fpathconf(pipe_fd,
_PC_PIPE_BUF).

http://www.opengroup.org/onlinepubs/9699919799/functions/fpathconf.html
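
The fpathconf() query and the one-write-per-message rule can be sketched as follows. `pipe_atomic_limit` and `send_message` are hypothetical helper names, and falling back to _POSIX_PIPE_BUF when fpathconf() does not report a positive limit is an assumption of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Largest write that will not be interleaved with other writers on
 * this pipe. POSIX guarantees at least _POSIX_PIPE_BUF (512) bytes. */
long pipe_atomic_limit(int pipe_fd)
{
    long limit = fpathconf(pipe_fd, _PC_PIPE_BUF);
    return (limit > 0) ? limit : _POSIX_PIPE_BUF;
}

/* Send one message as a single write(); refuse anything over the
 * atomic limit so messages from several writers cannot be mixed.
 * Returns 0 on success, -1 on error with errno set. */
int send_message(int pipe_fd, const void *msg, size_t len)
{
    long limit = pipe_atomic_limit(pipe_fd);

    assert(msg != NULL);
    if (len > (size_t)limit) {
        errno = EMSGSIZE;
        return -1;
    }
    ssize_t n = write(pipe_fd, msg, len);
    return (n == (ssize_t)len) ? 0 : -1;
}
```

With this in place, each writer calls `send_message()` and the single reader blocks in poll() on the read end, with no fcntl() lock involved.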

lacos

Alexander Burger

Sep 18, 2010, 1:02:45 PM

Hi lacos,

thanks for your answer!

> The assumption is wrong, in my opinion. I think the system is under no
> obligation to treat fcntl() locks as a synchronization primitive wrt. pipe
> buffer contents (or memory in general). That is, it doesn't guarantee
> immediate visibility of "changes".

Perhaps I didn't explain it well. The locking per se is not relevant
here. It is just to guarantee the _sequence_ of operations, i.e. the
first process completely wrote all its data into the pipe before the
second one starts to call poll().

> The system guarantees, however, that the data will become available
> *eventually*.

So this is the critical point. How do you know that? Where are the data
then? A pipe is not much more than a buffer in the kernel after all.

> http://www.opengroup.org/onlinepubs/9699919799/functions/pipe.html

I think I know the mechanics of pipes well. My question is just why the
data are not seen by poll() even though they were written by another
process, while they are seen by select().

jack

Sep 18, 2010, 1:51:54 PM

Alexander Burger wrote:
> Hi lacos,
>
> thanks for your answer!
>

>> The system guarantees, however, that the data will become available
>> *eventually*.
>
> So this is the critical point. How do you know that? Where are the data
> then? A pipe is not much more than a buffer in the kernel after all.
>
>
>> http://www.opengroup.org/onlinepubs/9699919799/functions/pipe.html
>
> I think I know the mechanics of pipes well. My question is just why the
> data are not seen by poll() despite they were written by another
> process, while they are seen by select().
>

Because it takes a finite but non-zero time inside the kernel between
data entering the pipe on one end and becoming available on the other.
When the write call to the pipe returns, all we know is that the
kernel has copied the data, and established that there is enough space
in the pipe. When internal pointers are updated, and threads sitting in
a blocking read are woken up, is another story.

Depending on how the scheduler schedules A (the writer, C above) and B
(the reader, P above), there might not be
enough time for the kernel to do its internal housekeeping. The moment A
releases the lock the scheduler might realize B was waiting for that
lock, and switch to B, without spending any time in 'housekeeping mode'.

Bottom line: in process B, one has to treat acquiring the lock as a
signal that there *might* be data in the pipe. In a solid design, the
subsequent poll/read should still use timeouts.

-j

Alexander Burger

Sep 18, 2010, 3:21:05 PM

jack <jcfma...@yahoo.com> wrote:
> Depending on how the scheduler schedules A and B, there might not be
> enough time for the kernel to do its internal housekeeping. The moment A
> releases the lock the scheduler might realize B was waiting for that
> lock, and switch to B, without spending any time in 'housekeeping mode'.

OK, this sounds reasonable. Still, I can't believe that "updating a
pointer" takes a significant time. To my understanding, a pipe is one of
the simplest and most efficient data structures.

> Bottom line: in process B, one has to treat acquiring the lock as a
> signal that there *might* be data in the pipe. In a solid design, the
> subsequent poll/read should still use timeouts.

This is not an option. How long would you wait? 100 milliseconds? 5
seconds? A minute? There is no "safe" limit. An application depending on
such an assumption would be a bad design. The behavior must be
predictable. Unfortunately, fsync() doesn't work on pipes.

Ersek, Laszlo

Sep 18, 2010, 4:56:29 PM

On Sat, 18 Sep 2010, Alexander Burger wrote:

> jack <jcfma...@yahoo.com> wrote:

>> Depending on how the scheduler schedules A and B, there might not be
>> enough time for the kernel to do its internal housekeeping. The moment
>> A releases the lock the scheduler might realize B was waiting for that
>> lock, and switch to B, without spending any time in 'housekeeping
>> mode'.
>
> OK, this sounds reasonable. Still, I can't believe that "updating a
> pointer" takes a significant time. To my understanding, a pipe is one of
> the simplest and most efficient data structures.

http://lxr.linux.no/#linux+v2.6.35.4/fs/pipe.c

I didn't spend too much time with it, so I can neither prove nor disprove my
suspicion, but you might find the source useful.

>> Bottom line: in process B, one has to treat acquiring the lock as a
>> signal that there *might* be data in the pipe. In a solid design, the
>> subsequent poll/read should still use timeouts.
>
> This is not an option. How long would you wait? 100 milliseconds? 5
> seconds? A minute? There is no "safe" limit.

The wakeup should happen as soon as the kernel gets there. It doesn't
depend on any external factor, so it will happen quickly. An infinite
timeout is a "safe limit", because that's never shorter than the time the
kernel needs in order to catch up. If you're worried about a hung poll()
-- perhaps due to a programming error elsewhere --, you can use a timeout
of a few seconds and perhaps a loop around poll().

What happens if you modify your source, as in: wrapping the poll() call in
a busy loop (thus keeping the zero timeout intact), incrementing a 64-bit
counter? Does the data arrive? At which counter value? Is it affected if
you renice the processes to lower or higher priorities?
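
The experiment proposed here could be instrumented roughly as below. `spins_until_readable` is a hypothetical helper; it keeps the zero timeout and simply counts how many polls come back empty before the data shows up. A poll() error is counted as an empty poll in this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Poll fd with a zero timeout until it is readable or max_spins
 * attempts are used up. Returns the number of polls that came back
 * empty before the data became visible, or UINT64_MAX if it never did. */
uint64_t spins_until_readable(int fd, uint64_t max_spins)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    assert(fd >= 0);
    for (uint64_t spins = 0; spins < max_spins; spins++) {
        if (poll(&pfd, 1, 0) > 0)
            return spins;
    }
    return UINT64_MAX;
}
```

Logging the returned counter right after the lock is acquired, and repeating the run at different nice levels, would answer the questions asked above.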

> An application depending on such an assumption would be a bad design.
> The behavior must be predictable. Unfortunately, fsync() doesn't work on
> pipes.

I believe timing (scheduling) is never predictable in a non-realtime OS.

Perhaps you should trace both processes in kernel space too -- I admit I
don't know how to do that (yet).

lacos

David Schwartz

Sep 19, 2010, 2:08:23 AM

On Sep 18, 12:15 am, Alexander Burger <a...@software-lab.de> wrote:

> We have two processes
> P (parent) and C (child)
> P and C communicate via a pipe(2)
> C holds an exclusive lock on some file record using fcntl(2)
> C writes some data into a pipe to P
> C releases the lock
> P waits until the lock is released
> P calls poll(2) on that pipe

> Now, poll(2) sometimes - under heavy load and with many child processes
> - says that no data are available. But we know from the traces that C
> wrote something to the pipe before it released the lock, and thus before
> poll(2) was called by P.

Umm, no, you don't. The word "before" has no meaning in your last
sentence.

> If poll(2) is called by P a little later again, it says that there are
> indeed data available, but why it didn't say so immediately, as the data
> was definitely in the pipe from C?

Because it wasn't in the pipe. There is no guarantee that these
operations have to occur in any particular order. They can even
overlap or happen in no order at all.

> The question is: Is the assumption wrong that data in a pipe should be
> immediately available to the receiving process, after it made sure that
> the sending process successfully finished its job (via the lock)?

It did not make sure the sending process successfully finished its
job. In fact, it made no attempt whatsoever to do that. All it
determined is that the thread started a different job -- that is not
the same thing.

> After changing the 64-bit version to select(2) too, it also runs stable
> now. Nothing else was changed, so this suggests that there is some
> difference in behavior between select(2) and poll(2), but which?

When you do things that are not defined, sometimes it does what you
expect and sometimes not.

A return from a 'write' just means the data has been queued by the
kernel and you are free to modify the buffer. There are some specific
guarantees of global ordering, but this happens not to be one of them.

DS

Alexander Burger

Sep 19, 2010, 3:03:58 AM

David Schwartz <dav...@webmaster.com> wrote:
>> Now, poll(2) sometimes - under heavy load and with many child processes
>> - says that no data are available. But we know from the traces that C
>> wrote something to the pipe before it released the lock, and thus before
>> poll(2) was called by P.
>
> Umm, no, you don't. The word "before" has no meaning in your last
> sentence.

No, it has. All involved processes log their operations using usec
timestamps, and the final output is sorted by these timestamps. When I
refer to "before" or "after", it is in regard to these timestamps.

>> If poll(2) is called by P a little later again, it says that there are
>> indeed data available, but why it didn't say so immediately, as the data
>> was definitely in the pipe from C?
>
> Because it wasn't in the pipe. There is no guarantee that these
> operations have to occur in any particular order. They can even
> overlap or happen in no order at all.

No, as I wrote, the involved processes synchronize using locks on a
database file.

Only after the first process has finished writing to the db (while at the
same time sending messages through the pipe) does it release the lock, and
only after that will the next process start to look at the pipe.

Cheers,
- Alex

Alexander Burger

Sep 19, 2010, 3:21:00 AM

Ersek, Laszlo <la...@caesar.elte.hu> wrote:
> http://lxr.linux.no/#linux+v2.6.35.4/fs/pipe.c
>
> I didn't spend too much time with it, so I can't prove nor disprove my
> suspicion, but you might find the source useful.

OK, thanks.

> What happens if you modify your source, as in: wrapping the poll() call in
> a busy loop (thus keeping the zero timeout intact), incrementing a 64-bit
> counter? Does the data arrive? At which counter value? Is it affected if
> you renice the processes to lower or higher priorities?

Yes, I did similar things. The error condition is hard to reproduce; it
happens only when many child processes are involved and on relatively
fast machines. I could observe, for example, that some processes had
received the data while others had not (yet).

As I wrote in my first post, the situation is more complicated than the
reduced model I described. The relevant code is around the line

http://code.google.com/p/picolisp/source/browse/src/io.c#1378

But the old version, which uses select() instead of poll(), has run for
ten years without flaw in several commercial installations, and I hoped
to learn whether (1) I somehow used poll() in a wrong way in the new
version, or (2) there is an inherent difference between poll() and
select(). (1) cannot be excluded, and (2) is obviously the case, but I
don't know if select() is safe.

> I believe timing (scheduling) is never predictable in a non-realtime OS.

Yep.

> Perhaps you should trace both processes in kernel space too -- I admit I
> don't know how to do that (yet).

Neither do I, though this would indeed be a good idea.

Cheers,
- Alex

Christof Meerwald

Sep 19, 2010, 3:43:07 AM

On Sat, 18 Sep 2010 07:15:41 +0000 (UTC), Alexander Burger wrote:
[...]

> After changing the 64-bit version to select(2) too, it also runs stable
> now. Nothing else was changed, so this suggests that there is some
> difference in behavior between select(2) and poll(2), but which?

That's interesting, because AFAIK select is just a wrapper around poll
in Linux.

Christof

--
http://cmeerw.org sip:cmeerw at cmeerw.org
mailto:cmeerw at cmeerw.org xmpp:cmeerw at cmeerw.org

Alexander Burger

Sep 19, 2010, 4:07:35 AM

Christof Meerwald <NOSPAM-see...@usenet.cmeerw.org> wrote:
>> difference in behavior between select(2) and poll(2), but which?
>
> That's interesting, because AFAIK select is just a wrapper around poll
> in Linux.

I thought so too. So the suspicion arises that it perhaps just changes
the timing somehow, and thus covers the actual problem.

Christof Meerwald

Sep 19, 2010, 5:09:44 AM

On Sat, 18 Sep 2010 22:56:29 +0200, Ersek, Laszlo wrote:
[...]

> http://lxr.linux.no/#linux+v2.6.35.4/fs/pipe.c
>
> I didn't spend too much time with it, so I can't prove nor disprove my
> suspicion, but you might find the source useful.

line 683 (http://lxr.linux.no/#linux+v2.6.35.4/fs/pipe.c#L683) might
be relevant here - there is no locking, so maybe there is some memory
visibility issue.

David Schwartz

Sep 20, 2010, 11:00:29 AM

On Sep 19, 12:03 am, Alexander Burger <a...@software-lab.de> wrote:

> > Because it wasn't in the pipe. There is no guarantee that these
> > operations have to occur in any particular order. They can even
> > overlap or happen in no order at all.

> No, as I wrote, the involved processes synchronize using locks on a
> database file.

No, they don't. Really. They don't.

> Only after the first process finished writing to the db (while at the
> same time sending messages through the pipe), it releases the lock, and
> only after that the next process will start to look at the pipe.

No. No.

If I tell you to do something then I tell Jeff to do something, is it
guaranteed that you finish before Jeff? You are assuming each
operation takes place at a well-defined time, but there is no requirement
that this be the case.

Say your code does this:

a();
b();

There is no free-floating universal guarantee that if you see any
effects of calling b(), you will see all the effects of having called
a(). There are guarantees for specific ways of detecting effects, but
this is not one of those.

Your code relies on a guarantee that does not exist. It assumes a
universal concept of "after" which would only exist if computers did
not start one task until they had finished all previous tasks. This is
simply not how computers operate.

DS

Alexander Burger

Sep 20, 2010, 1:31:02 PM

David Schwartz <dav...@webmaster.com> wrote:
>> No, as I wrote, the involved processes synchronize using locks on a
>> database file.
>
> No, they don't. Really. They don't.

Come on. Then no multi-user database will ever work. The system in
question has been running with tens of users for ten years, all accessing
that database, where the logic absolutely relies on the discussed IPC.

>> Only after the first process finished writing to the db (while at the
>> same time sending messages through the pipe), it releases the lock, and
>> only after that the next process will start to look at the pipe.
>
> No. No.
>
> If I tell you to do something then I tell Jeff to do something, is it
> guaranteed that you finish before Jeff? You are assuming each
> operation takes place at a well-defined time, but there is no requirement
> that this be the case.

You must be talking about something completely different. It doesn't
matter for this sequence of operations who tells something to anybody.
It is simply that the first process does

    ...
    fl.l_type = F_WRLCK;
    fcntl(fd, F_SETLKW, &fl);    /* block until the lock is granted */
    /* ... do something ... */
    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);     /* release the lock */
    ...

while the second process does the same. However, it is guaranteed that
when the second process returns from

    fl.l_type = F_WRLCK;
    fcntl(fd, F_SETLKW, &fl);

the first process is done with

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);

because the second process will block (because of F_SETLKW) until it
gets the lock.

Besides, why are you ignoring my explanation about the log files, merged
and sorted by timestamps? Do you assume that the common children of a
single parent (and thus on a single machine) have different timers
(return values of gettimeofday())?

Anyway, thanks for your input!

Cheers,
- Alex

David Schwartz

Sep 20, 2010, 2:45:57 PM

On Sep 20, 10:31 am, Alexander Burger <a...@software-lab.de> wrote:
> David Schwartz <dav...@webmaster.com> wrote:

> >> No, as I wrote, the involved processes synchronize using locks on a
> >> database file.

> > No, they don't. Really. They don't.

> Come on. Then no multi-user database will ever work. The system in
> question runs with tens of users since ten years, all accessing that
> database where the logic absolutely relies on the discussed IPC.

And ... surprise ... it doesn't work! File locks do not provide some
kind of global "before" and "after". They just don't. They're not
documented to do so and they don't do so in practice. Expecting them
to do so is reasonable but relying on them to do so is not.

> > If I tell you to do something then I tell Jeff to do something, is it
> > guaranteed that you finish before Jeff? You are assuming each
> > operation takes place at a well-defined time, but there is no requirement
> > that this be the case.

> You must be talking about something completely different. It doesn't
> matter for this sequence of operations who tells something to anybody.
> It is simply that the first process does
> ...
> fl.l_type = F_WRLCK;
> fcntl(fd, F_SETLKW, &fl);
> .. do something ..
> fl.l_type = F_UNLCK;
> fcntl(fd, F_SETLK, &fl);
> ...
>
> while the second process does the same. However, it is guaranteed that
> when the second process returns from
>
> fl.l_type = F_WRLCK;
> fcntl(fd, F_SETLKW, &fl);
>
> the first process is done with
>
> fl.l_type = F_UNLCK;
> fcntl(fd, F_SETLK, &fl);
>
> because the second process will block (because of F_SETLKW) until it
> gets the lock.

No, it is not guaranteed that it is done with it. The general concept
you are arguing is that if one process releases a lock and then
another process acquires it, this guarantees the second process will
see all changes the first process might have made. This is simply
*not* a guarantee that is actually made *anywhere*.

File locks do not provide some kind of global ordering because
*nothing* provides global ordering. There simply is no concept of
global ordering. You will not find it in any standard or guarantee
EVER. It simply doesn't exist.

> Besides, why are you ignoring my explanation about the log files, merged
> and sorted by timestamps? Do you assume that the common children of a
> single parent (and thus on a single machine) have different timers
> (return values of gettimeofday())?

Because the timestamps only tell you the time the function call was
entered or returned. They do not provide any global ordering because
such ordering does not exist.

> Anyway, thanks for your input!

You're welcome. This is a very serious misconception that leads to a
*lot* of bad code, so anything I can do to stamp it out should be
appreciated.

A lock being released by one thread and then acquired by another
thread does *not* provide any global ordering of other operations
performed by those two threads.

DS

Ersek, Laszlo

Sep 20, 2010, 5:00:03 PM

On Mon, 20 Sep 2010, David Schwartz wrote:

> On Sep 20, 10:31 am, Alexander Burger <a...@software-lab.de> wrote:

>> Besides, why are you ignoring my explanation about the log files,
>> merged and sorted by timestamps? Do you assume that the common children
>> of a single parent (and thus on a single machine) have different timers
>> (return values of gettimeofday())?
>
> Because the timestamps only tell you the time the function call was
> entered or returned. They do not provide any global ordering because
> such ordering does not exist.

FWIW, I tried to look up write()'s spec in SUSv4 [0], and a "special
guarantee" seems to exist for read()/write() done to *regular* files.

----v----
After a write() to a regular file has successfully returned:

* Any successful read() from each byte position in the file that was
modified by that write shall return the data specified by the write() for
that position until such byte positions are again modified.
----^----

That is, if whatever mechanism ensures that write() returns before read()
commences, and the written/read object is a regular file, then visibility
is guaranteed. (*)
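
For a regular file, the quoted guarantee can be illustrated within a single process, where the ordering of write() and read() is not in doubt. `regular_file_roundtrip` and the scratch file path are hypothetical; the sketch says nothing about pipes or about ordering across processes, which is the disputed part.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write through one descriptor, then read the same bytes back through
 * a second, independently opened descriptor.
 * Returns 1 if the read-back matched, 0 if not, -1 on setup failure. */
int regular_file_roundtrip(void)
{
    static const char msg[] = "visible";
    char buf[sizeof msg];
    char path[] = "/tmp/roundtripXXXXXX";   /* hypothetical scratch file */
    int ok = -1;

    int wfd = mkstemp(path);
    if (wfd < 0)
        return -1;
    int rfd = open(path, O_RDONLY);         /* separate descriptor, own offset */
    unlink(path);
    if (rfd < 0) {
        close(wfd);
        return -1;
    }

    if (write(wfd, msg, sizeof msg) == (ssize_t)sizeof msg) {
        /* write() has returned, so a successful read() of these byte
         * positions is required to return the data just written. */
        ssize_t n = read(rfd, buf, sizeof buf);
        assert(n <= (ssize_t)sizeof buf);
        ok = (n == (ssize_t)sizeof msg && memcmp(buf, msg, sizeof msg) == 0);
    }
    close(rfd);
    close(wfd);
    return ok;
}
```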

Now, the spec goes on to say,

----v----
Write requests to a pipe or FIFO shall be handled in the same way as a
regular file with the following exceptions:

* There is no file offset associated with a pipe, hence each write
request shall append to the end of the pipe.
----^----

I honestly can't tell whether this (or any other part I didn't quote)
would extend the visibility promise made for regular files to pipes.

(*) There's also this in the Rationale section:

----v----
Writes can be serialized with respect to other reads and writes. If a
read() of file data can be proven (by any means) to occur after a write()
of the data, it must reflect that write(), even if the calls are made by
different processes. A similar requirement applies to multiple write
operations to the same file position. This is needed to guarantee the
propagation of data from write() calls to subsequent read() calls. This
requirement is particularly significant for networked file systems, where
some caching schemes violate these semantics.
----^----

This passage of the Rationale seems to occur after "finishing" the
treatise of pipes -- the last occurrence of the word "pipe" is before the
quoted paragraph within the Rationale. I'm tempted to think the visibility
guarantee applies only to regular files.

lacos

[0] http://www.opengroup.org/onlinepubs/9699919799/functions/write.html

David Schwartz

Sep 20, 2010, 6:43:52 PM

On Sep 20, 2:00 pm, "Ersek, Laszlo" <la...@caesar.elte.hu> wrote:

> FWIW, I tried to look up write()'s spec in SUSv4 [0], and a "special
> guarantee" seems to exist for read()/write() done to *regular* files.
>
> ----v----
> After a write() to a regular file has successfully returned:
>
> * Any successful read() from each byte position in the file that was
> modified by that write shall return the data specified by the write() for
> that position until such byte positions are again modified.
> ----^----

The problem is the word "after" in that guarantee. That requires some
defined ordering mechanism.

> That is, if whatever mechanism ensures that write() returns before read()
> commences, and the written/read object is a regular file, then visibility
> is guaranteed. (*)

I don't think that guarantee can apply to "whatever mechanism". The
guarantee can only apply to a mechanism that is defined to provide the
"after" relationship required. He tries to use file locks to establish
this relationship, so he needs some guarantee that a file lock creates
global ordering, not just ordering with respect to the file locked.

You always need two halves to have an ordering guarantee. One that
guarantees that the thing tested obeys ordering and one that guarantees
that the "after" relationship establishes ordering. This is the first
half of that, for regular files only. I believe it also holds for
pipes on sensible systems. But file locks do not establish global
ordering because there is no such thing.

> Now, the spec goes on to say,
>
> ----v----
> Write requests to a pipe or FIFO shall be handled in the same way as a
> regular file with the following exceptions:
>
> * There is no file offset associated with a pipe, hence each write
> request shall append to the end of the pipe.
> ----^----
>
> I honestly can't tell whether this (or any other part I didn't quote)
> would extend the visibility promise made for regular files to pipes.

I think that's a reasonable argument. So anything that creates
ordering with respect to that file or pipe (such as locking/unlocking
*that* pipe) would have to make the data visible.

> (*) There's also this in the Rationale section:
>
> ----v----
> Writes can be serialized with respect to other reads and writes. If a
> read() of file data can be proven (by any means) to occur after a write()
> of the data, it must reflect that write(), even if the calls are made by
> different processes. A similar requirement applies to multiple write
> operations to the same file position. This is needed to guarantee the
> propagation of data from write() calls to subsequent read() calls. This
> requirement is particularly significant for networked file systems, where
> some caching schemes violate these semantics.
> ----^----

The problem is that in general, a read can't be proven to occur after
because that notion of "after" is, generally, not meaningful.

> This passage of the Rationale seems to occur after "finishing" the
> treatise of pipes -- the last occurrence of the word "pipe" is before the
> quoted paragraph within the Rationale. I'm tempted to think the visibility
> guarantee applies only to regular files.

I think it applies to pipes as well, but I guess you can argue it
either way. In any event, there's no guarantee of any kind that a
device will become ready in a particular time frame.

DS

Ersek, Laszlo

Sep 20, 2010, 9:59:08 PM

On Mon, 20 Sep 2010, David Schwartz wrote:

> You always need two halves to have an ordering guarantee. One that
> guarantees that the thing tested obeys ordering and one that guarantees
> that the "after" relationship establishes ordering. This is the first
> half of that, for regular files only. I believe it also holds for pipes
> on sensible systems. But file locks do not establish global ordering
> because there is no such thing.

Great description. To restate my point, my doubts concern whether the
first half extends to pipes. I'm claiming that file locks (in this case,
but they could be something else) do establish relative ordering between
write() and read(), hence the second half should be covered.

> On Sep 20, 2:00 pm, "Ersek, Laszlo" <la...@caesar.elte.hu> wrote:

>> (*) There's also this in the Rationale section:
>>
>> ----v----

>> Writes can be serialized with respect to other reads and writes. If a
>> read() of file data can be proven (by any means) to occur after a
>> write() of the data, it must reflect that write(), even if the calls
>> are made by different processes. A similar requirement applies to
>> multiple write operations to the same file position. This is needed to
>> guarantee the propagation of data from write() calls to subsequent
>> read() calls. This requirement is particularly significant for
>> networked file systems, where some caching schemes violate these
>> semantics.

>> ----^----

> The problem is that in general, a read can't be proven to occur after
> because that notion of "after" is, generally, not meaningful.

Suppose process A holds the lock and does this:

    if (SUCCESS == write_to_pipe()) {
        release_lock();
    }

while process B does this:

    if (SUCCESS == acquire_lock()) {
        read_from_pipe();
    }

The relative ordering within an individual process is established by
logical dependency. The relative ordering between process A and process B
only involves release_lock() and acquire_lock(), nothing else. Given these
dependency edges:

write_to_pipe() -> release_lock()
release_lock() -> acquire_lock()
acquire_lock() -> read_from_pipe()

a path from write_to_pipe() to read_from_pipe() emerges -- a different
ordering of these two would violate at least one direct edge.

I mean this as a global ordering over only these four operations, not
other things the processes might otherwise do. Once process B calls
read_from_pipe(), it can be sure process A has successfully finished
write_to_pipe() -- otherwise, process B could have never acquired the lock
and attempt to read from the pipe.

For me this supplies the second half (the ordering).
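
A minimal C sketch of the four-step chain above, for anyone who wants to experiment. The function name lock_ordered_transfer, the temporary lock file and the extra "ready" pipe are assumptions made for illustration; they are not taken from the PicoLisp code. Note that process B uses a blocking read() here, so the sketch does not depend on the data being visible immediately after the lock is acquired.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_WRLCK) or drop (F_UNLCK) a whole-file record lock, blocking if needed. */
static int set_lock(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;            /* l_start = 0 and l_len = 0: whole file */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Hypothetical demo: returns the bytes B read from the pipe, or -1 on error. */
ssize_t lock_ordered_transfer(char *buf, size_t len)
{
    char path[] = "/tmp/lockdemoXXXXXX";
    int data[2], ready[2];
    int lockfd = mkstemp(path);
    if (lockfd < 0)
        return -1;
    unlink(path);                      /* the lock file only needs to live as an fd */
    if (pipe(data) < 0 || pipe(ready) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                                 /* process A */
        if (set_lock(lockfd, F_WRLCK) < 0)
            _exit(1);
        if (write(ready[1], "r", 1) != 1)           /* tell B the lock is held */
            _exit(1);
        if (write(data[1], "hello", 5) == 5)        /* write_to_pipe() */
            set_lock(lockfd, F_UNLCK);              /* release_lock() */
        _exit(0);
    }

    /* process B */
    char c;
    ssize_t n = -1;
    close(data[1]);
    close(ready[1]);
    if (read(ready[0], &c, 1) == 1 &&               /* wait until A holds the lock */
        set_lock(lockfd, F_WRLCK) == 0)             /* acquire_lock() */
        n = read(data[0], buf, len);                /* read_from_pipe(), blocking */
    waitpid(pid, NULL, 0);
    close(data[0]);
    close(ready[0]);
    close(lockfd);
    return n;
}
```

Replacing the blocking read() with a zero-timeout poll() turns this into the case the thread is arguing about.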

And now the rebuttal :)

Since write_to_pipe() and release_lock() operate on different things, and
the system is not required by specs to synchronize one with the other,
process A might execute release_lock() speculatively before
write_to_pipe(), *all the while* making it look *within* process A as if
write_to_pipe() happened first. But considering process A and B together
as a set, the direct edge

write_to_pipe() -> release_lock()

is no more. (It is invisible from B.)

What if release_lock() and acquire_lock() are functions listed under
"Memory Synchronization" [0]? Or isn't a pipe "memory"? :)

Thanks,
lacos

[0] http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

Alexander Burger

Sep 21, 2010, 1:21:59 AM

David Schwartz <dav...@webmaster.com> wrote:
> And ... surprise ... it doesn't work! File locks do not provide some
> kind of global "before" and "after". They just don't. They're no

No, this part works perfectly. You should actually read my posts. My
initial question to this group was *not* because I doubted the order of
the sequence of operations.

>> because the second process will block (because of F_SETLKW) until it
>> gets the lock.
>
> No, it is not guaranteed that it is done with it. The general concept
> you are arguing is that if one process releases a lock and then
> another process acquires it, this guarantees the second process will
> see all changes the first process might have made. This is simply

*This* was my initial question to this group. This was *not* my
argumentation.

So it is absolutely sure that the first process, after the second
process obtained the lock with F_SETLKW, has finished writing. And when
this writing was to a disk file, it is also sure that the second process
will see these data. My question was only if this also applies to pipes.

> Because the timestamps only tell you the time the function call was
> entered or returned. They do not provide any global ordering because
> such ordering does not exist.

They do. If one process does a(), *then* releases the lock, *then* the
second process waits for the lock and *then* does b(), it is guaranteed
that b() happens completely after a(). The timestamps are generated
at the end of a() and at the beginning of b().

Alexander Burger

Sep 21, 2010, 1:29:19 AM

Ersek, Laszlo <la...@caesar.elte.hu> wrote:
> FWIW, I tried to look up write()'s spec in SUSv4 [0], and a "special
> guarantee" seems to exist for read()/write() done to *regular* files.

> ...

> Write requests to a pipe or FIFO shall be handled in the same way as a
> regular file with the following exceptions:
>
> * There is no file offset associated with a pipe, hence each write
> request shall append to the end of the pipe.

Thanks. That's interesting, especially the phrase "... shall be handled
in the same way ...". This would support my initial assumption.

> I honestly can't tell whether this (or any other part I didn't quote)
> would extend the visibility promise made for regular files to pipes.

Yep. That's the question.

Alexander Burger

Sep 21, 2010, 1:58:05 AM

Ersek, Laszlo <la...@caesar.elte.hu> wrote:
> Since write_to_pipe() and release_lock() operate on different things, and
> the system is not required by specs to synchronize one with the other,
> process A might execute release_lock() speculatively before
> write_to_pipe(), *all the while* making it look *within* process A as if
> write_to_pipe() happened first. But considering process A and B together
> as a set, the direct edge
>
> write_to_pipe() -> release_lock()
>
> is no more. (It is invisible from B.)

This is the problem. If the spec is not clear about the way pipes should
be treated in this context, it might well depend on the kernel
implementation.

And, I'm still wondering about the different behavior of poll() vs.
select().

Thank you,
- Alex

Alexander Burger

Sep 21, 2010, 3:45:09 AM

Thanks to all who responded so far!

In summary: We are back at the starting point. Though David has a
different opinion, I'm still convinced that data which are *in* the pipe
are not seen by poll() for some reason.

Ersek, Laszlo <la...@caesar.elte.hu> wrote:
> FWIW, I tried to look up write()'s spec in SUSv4 [0], and a "special
> guarantee" seems to exist for read()/write() done to *regular* files.

Thus, if we would use regular files instead of pipes, we would be on the
safe side. But this would contradict the purpose of pipes. So I must
assume that either

1. I am calling poll() in a wrong way or with a wrong setup, or
2. poll() does not return all available data for some other reason

I'll stay with select() for now.

Thanks again,
- Alex

Rainer Weikusat

Sep 21, 2010, 7:31:27 AM

Alexander Burger <a...@software-lab.de> writes:

[...]

> Ersek, Laszlo <la...@caesar.elte.hu> wrote:
>> FWIW, I tried to look up write()'s spec in SUSv4 [0], and a "special
>> guarantee" seems to exist for read()/write() done to *regular* files.
>
> Thus, if we would use regular files instead of pipes, we would be on the
> safe side. But this would contradict the purpose of pipes. So I must
> assume that either
>
> 1. I am calling poll() in a wrong way or with a wrong setup, or
> 2. poll() does not return all available data for some other reason
>
> I'll stay with select() for now.

Everything in this subthread was concerned with visibility of data
written by a call of write from the perspective of a subsequent
read. Neither are 'select' or 'poll' 'read' (this difference in
spelling should suggest that) nor does the statement

Any successful read() from each byte position in the file that
was modified by that write shall return the data specified by
the write() for that position

preclude read from returning unsuccessfully with errno == EAGAIN for
an arbitrary amount of time after a write to a file which supports a
notion of non-blocking I/O (such as a pipe) has occurred. I haven't
seen the poll-using code (and I certainly spent no time with trying to
follow logic of the select-using code) but when your description is
correct, your code is simply broken: It is supposed to read input but
never waits until input is actually available, relying on the
assumption that poll will immediately indicate that data can be read
from the pipe after some other process has finished a call to
write.
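
To make the EAGAIN point concrete, here is a small single-process C sketch. The function name eagain_then_data is made up for this illustration. It shows that an unsuccessful non-blocking read() on an empty pipe reports EAGAIN, and that a later successful read() returns the written byte. It says nothing about how quickly data written by another process becomes visible.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a non-blocking read on an empty pipe fails with EAGAIN and a
 * later read, after a write, returns the byte; 0 otherwise; -1 on setup error.
 */
int eagain_then_data(void)
{
    int p[2];
    char c = 0;
    if (pipe(p) < 0)
        return -1;
    int flags = fcntl(p[0], F_GETFL);
    if (flags < 0 || fcntl(p[0], F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    ssize_t first = read(p[0], &c, 1);      /* nothing has been written yet */
    int saw_eagain = (first == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    if (write(p[1], "x", 1) != 1)
        return -1;
    ssize_t second = read(p[0], &c, 1);     /* a successful read returns the data */

    close(p[0]);
    close(p[1]);
    return saw_eagain && second == 1 && c == 'x';
}
```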

Alexander Burger

Sep 21, 2010, 8:07:31 AM

Rainer Weikusat <rwei...@mssgmbh.com> wrote:
> Everything in this subthread was concerned with visibility of data
> written by a call of write from the perspective of a subsequent
> read.

No, not at all. Since the first posting I'm talking about the
*availability* of data, e.g.:

> Now, poll(2) sometimes - under heavy load and with many child processes
> - says that no data are available. But we know from the traces that C

> Neither are 'select' or 'poll' 'read' (this difference in
> spelling should suggest that) nor does the statement
>
> Any successful read() from each byte position in the file that
> was modified by that write shall return the data specified by
> the write() for that position
>
> preclude read from returning unsuccessfully with errno == EAGAIN for

Sure. I know that. However, EAGAIN is returned only if the file
descriptor is set to non-blocking, which is not the case here.

> seen the poll-using code (and I certainly spent no time with trying to
> follow logic of the select-using code) but when your description is
> correct, your code is simply broken: It is supposed to read input but

Yes, that's what I suspected in my last posting too. However, again,
'read' is not the issue. We are just talking about poll() or select()
signaling the availability of data.

Cheers,
- Alex

Alexander Burger

Sep 21, 2010, 8:11:15 AM

Rainer Weikusat <rwei...@mssgmbh.com> wrote:
> never waits until input is actually available, relying on the
> assumption that poll will immediately indicate that data can be read
> from the pipe after some other process has finished a call to
> write.

Still the two questions (which popped up several times in the
discussion) remain:

1. Why can we rely on data being immediately available for write to a
regular file, but not to a pipe?

2. Why do poll() and select() show a different behavior?

Rainer Weikusat

Sep 21, 2010, 8:22:39 AM

Alexander Burger <a...@software-lab.de> writes:
> Rainer Weikusat <rwei...@mssgmbh.com> wrote:
>> Everything in this subthread was concerned with visibility of data
>> written by a call of write from the perspective of a subsequent
>> read.
>
> No, not at all. Since the first posting I'm talking about the
> *availability* of data, e.g.:
>
>> Now, poll(2) sometimes - under heavy load and with many child processes
>> - says that no data are available. But we know from the traces that
>> C

Yes. And I wrote "this _sub_thread" because I meant it: You were
writing about poll/ select. The quoted parts of the standard talked
about write and read.

>> Neither are 'select' or 'poll' 'read' (this difference in
>> spelling should suggest that) nor does the statement
>>
>> Any successful read() from each byte position in the file that
>> was modified by that write shall return the data specified by
>> the write() for that position
>>
>> preclude read from returning unsuccessfully with errno == EAGAIN for
>
> Sure. I know that. However, EAGAIN is returned only if the file
> descriptor is set to non-blocking, which is not the case here.

Yes again. Otherwise, your process will block until the data is
available, provided that this makes sense for the file descriptor in
question. But (according to your description) your process does not
block to wait for data being available. Consequently, depending on
details of the implementation and arbitrary external factors, you
either get the data or you don't get it.

Rainer Weikusat

Sep 21, 2010, 8:23:51 AM

Alexander Burger <a...@software-lab.de> writes:
> Rainer Weikusat <rwei...@mssgmbh.com> wrote:
>> never waits until input is actually available, relying on the
>> assumption that poll will immediately indicate that data can be read
>> from the pipe after some other process has finished a call to
>> write.
>
> Still the two questions (which popped up several times in the
> discussion) remain:
>
> 1. Why can we rely on data being immediately available for write to a
> regular file, but not to a pipe?

And the answer to that is: You can't. The next successful read system
call is supposed to return the data. Possibly next year. That's not
specified.

Alexander Burger

Sep 21, 2010, 9:27:32 AM

Rainer Weikusat <rwei...@mssgmbh.com> wrote:
>> 1. Why can we rely on data being immediately available for write to a
>> regular file, but not to a pipe?
>
> And the answer to that is: You can't. The next successful read system
> call is supposed to return the data. Possibly next year. That's not
> specified.

OK, that's a clear statement. Makes sense. So we have a guarantee only
for the read().

Concerning the original question, the pure indication of availability,
you seem to assume that it is _not_ guaranteed. So this may be, but do
you have any proof for or against it? Meanwhile we have all kinds of
opinions, ranging from "The data are immediately available" (Kaz
Kylheku) to "They are not immediately available". Is everybody just
wildly guessing? For a reliable IPC we need hard facts.

Ersek, Laszlo

Sep 21, 2010, 12:25:30 PM

On Tue, 21 Sep 2010, Rainer Weikusat wrote:

> Alexander Burger <a...@software-lab.de> writes:
>
> [...]
>
>> Ersek, Laszlo <la...@caesar.elte.hu> wrote:
>>> FWIW, I tried to look up write()'s spec in SUSv4 [0], and a "special
>>> guarantee" seems to exist for read()/write() done to *regular* files.
>>
>> Thus, if we would use regular files instead of pipes, we would be on the
>> safe side. But this would contradict the purpose of pipes. So I must
>> assume that either
>>
>> 1. I am calling poll() in a wrong way or with a wrong setup, or
>> 2. poll() does not return all available data for some other reason
>>
>> I'll stay with select() for now.
>
> Everything in this subthread was concerned with visibility of data
> written by a call of write from the perspective of a subsequent
> read. Neither are 'select' or 'poll' 'read' (this difference in
> spelling should suggest that) nor does the statement
>
> Any successful read() from each byte position in the file that
> was modified by that write shall return the data specified by
> the write() for that position
>
> preclude read from returning unsuccessfully with errno == EAGAIN for
> an arbitrary amount of time after a write to a file which supports a
> notion of non-blocking I/O (such as a pipe) has occurred.

Ahhh. I've got to learn to read.

I attempted to summarize the question faithfully on the Austin Group
mailing list -- see the thread "visibility of write() to a pipe", starting
with sequences 14583-14584 [0]. I've made an analogy between a
zero-timeout poll() returning with no readiness and an O_NONBLOCK read()
returning with -1/EAGAIN. But, as you point out, the spec doesn't talk
about *any* first read() following write() -- setting the definition of
"following" aside for a moment --; it talks about *successful* reads.

What we all seem to have established so far:

- Ordering is hard to define and/or achieve.

- If ordering is proven, it was not obvious if the visibility guarantee
covers pipes.

- If the standard requires (by intent or by chance) pipes to conform to
said visibility, it's unclear what implementations satisfy that
requirement.

- Visibility is described in terms of *successful* write() and read(), not
successful write and { select(), poll(), failed read() }.

Alexander: the code seemed quite involved, so I was too lazy to dive into
it, but: *why* can't you use infinite timeout? Or a finite nonzero timeout
with a loop around it? That should take care of all of the above.
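
A sketch of the "finite nonzero timeout with a loop around it" variant, in C. The helper name wait_readable and its parameters are illustrative assumptions; a real event loop would do its other periodic work where the comment indicates.

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait up to total_ms for fd to become readable, polling in slices of
 * slice_ms. Returns 1 if readable (or hung up), 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int total_ms, int slice_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int waited = 0;

    if (slice_ms <= 0)
        return -1;                  /* a zero slice would never make progress */

    while (waited < total_ms) {
        int r = poll(&pfd, 1, slice_ms);
        if (r > 0)
            return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
        if (r < 0 && errno != EINTR)
            return -1;
        if (r == 0)
            waited += slice_ms;     /* slice expired; other periodic work could run here */
    }
    return 0;
}
```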

Cheers,
lacos

[0] https://www.opengroup.org/sophocles/show_archive.tpl?listname=austin-group-l

Alexander Burger

Sep 21, 2010, 2:12:03 PM

Ersek, Laszlo <la...@caesar.elte.hu> wrote:
> - Ordering is hard to define and/or achieve.
>
> - If ordering is proven, it was not obvious if the visibility guarantee
> covers pipes.
>
> - If the standard requires (by intent or by chance) pipes to conform to
> said visibility, it's unclear what implementations satisfy that
> requirement.
>
> - Visibility is described in terms of *successful* write() and read(), not
> successful write and { select(), poll(), failed read() }.

Yes, I fully agree. So we have neither a precise specification nor a
guarantee that an implementation fully observes it. I was just assuming
that a pipe is a memory data structure with immediate effect
(visibility) after the data arrived in the kernel.

> Alexander: the code seemed quite involved, so I was too lazy to dive into

Don't worry. I didn't expect that. That's why I initially only presented
a reduced model.

> it, but: *why* can't you use infinite timeout? Or a finite nonzero timeout
> with a loop around it? That should take care of all of the above.

This select is a generic event loop, which is responsible for an
arbitrary number of timeout-triggered tasks and for listening at an
arbitrary number of file descriptors. These file descriptors also
include those to and from child processes, and to and from parent
processes.

The functionality in question, which depends on that "immediate
availability" (and where I'm not sure if its base assumption is correct)
is the Lisp-level function (sync)

http://software-lab.de/doc/refS.html#sync

That is, this function is called by a Lisp process after it got
exclusive control (typically with the described lock on the database,
but possibly also other means), with the purpose to receive all messages
sent over certain pipes from all other child processes. In "reality",
this sync request goes over a parent process which does the actual
polling (so this is actually a communication between three processes),
but for a simplified view we can just assume that the process needs to
check whether there are any messages from certain processes pending.

We might say, the purpose of 'sync' is to "drain" all pipes. This should
be as fast as possible, because the speed of the database depends on it.
We cannot block or wait here, because there may be (and most of the time
is) no message at all pending. If we block, then 'sync' will never
succeed, and if we wait for a certain time then this time might be too
short and/or still slow down everything.

When the assumption of immediate availability of data in the pipe does
not hold, there is no direct fix. Instead, the spec of the 'sync'
function must be changed, to make it less general. It cannot be assumed
any longer that you can call 'sync' to receive all possible messages
that were sent before you acquired the exclusive state, and which have
not arrived yet.

The fix would be relatively simple (each child needs a 'sync' flag, and
must send a special message when it is done), but then the Lisp level
'sync' can only be called in the context of database operations in the
future. This is not a big problem, I presume, at least I haven't ever
used 'sync' in another context, but it *is* a change in that function's
spec.
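
The "special message when it is done" idea can be sketched in C as a reader that blocks until an end marker arrives, instead of asking whether data happens to be available right now. The single-byte marker and the name drain_until_marker are assumptions for illustration only; the actual PicoLisp message format is not shown in this thread.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read from fd until the marker byte arrives, storing the payload in out.
 * Returns the payload length, or -1 on error, early EOF, or overflow.
 */
ssize_t drain_until_marker(int fd, char marker, char *out, size_t cap)
{
    size_t used = 0;
    for (;;) {
        char c;
        ssize_t n = read(fd, &c, 1);    /* blocking read, one byte at a time */
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;                  /* error, or writer closed before the marker */
        if (c == marker)
            return (ssize_t)used;
        if (used == cap)
            return -1;                  /* payload does not fit */
        out[used++] = c;
    }
}
```

Because the reader knows a marker must come from every child that was asked, blocking is safe here: there is always something to wait for.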

So it would be nice if I could avoid that fix because we could prove
somehow that the availability assumption holds.

David Schwartz

Sep 21, 2010, 6:03:10 PM

On Sep 20, 10:21 pm, Alexander Burger <a...@software-lab.de> wrote:

> They do. If one process does a(), *then* releases the lock, *then* the
> second process waits for the lock and *then* does b(), it is guaranteed
> that b() happens completely after a().

No, it is not. File locks do not enforce global ordering.

> The timestamps are generated
> at the end of a() and at the beginning of b().

The timestamps do not establish global ordering either.

Both of your arguments are based on establishing and enforcing global
ordering. There is no such thing as 'global ordering'. It does not
exist. You will not find it guaranteed *anywhere*.

In particular, if one process does:
some_kernel_call(); unlock();
and another process does:
lock(); other_kernel_call();

There is *no* free-floating guarantee that the second process call to
other_kernel_call will see the results of the first process'
some_kernel_call. There is no generic requirement that the kernel
completely finish performing a request before it returns to the
calling process.

Although I think in practice, that's not what's actually biting you
here. What's biting you here is that pipes do not provide "timely
delivery" guarantees. However, the subtle bugs and bad software
created by mistaken assumptions about global ordering are a more
serious problem, so I think it's more important to point them out.

DS

David Schwartz

Sep 21, 2010, 7:09:52 PM

On Sep 21, 12:45 am, Alexander Burger <a...@software-lab.de> wrote:

> In summary: We are back at the starting point. Though David has a
> different opinion, I'm still convinced that data which are *in* the pipe
> are not seen by poll() for some reason.

I agree that you are correct that data which are in the pipe are not
seen by 'poll'. I also think that there is another, much more serious,
issue in his code in that it assumes global ordering.

However, I think the reason his code is failing in practice is because
it assumes timely delivery of data in pipes. In principle, there's no
reason an implementation couldn't impose a delay between when it accepts
data at the input of a pipe and when it makes that data available at
the output of the pipe.

I would argue that for standards purposes, the input and output ends
of a pipe are two different files. So the regular file read/write
guarantee doesn't help.

DS

Alexander Burger

Sep 22, 2010, 1:31:28 AM

You seem to be mixing here two different types of "global ordering".

While you may be right that there is no guarantee for the
visibility of effects (be it by 'poll' or by 'read') created by a
previous process, the ordering in time can be guaranteed.

David Schwartz <dav...@webmaster.com> wrote:
>> They do. If one process does a(), *then* releases the lock, *then* the
>> second process waits for the lock and *then* does b(), it is guaranteed
>> that b() happens completely after a().
>
> No, it is not. File locks do not enforce global ordering.

Then can you give an example how one process can run earlier in *time*
after it obtained a lock released by another process?

>> The timestamps are generated
>> at the end of a() and at the beginning of b().
>
> The timestamps do not establish global ordering either.

I didn't say that they "establish" the order. But they "reflect" it. I
used them only in log files to monitor what happened in sequence.
Assuming that the clock in each process is based on the same internal
clock, and that each process executes its own instructions in timely
order, then the merged and sorted timestamp log will show the global
ordering.

> Although I think in practice, that's not what's actually biting you
> here. What's biting you here is that pipes do not provide "timely
> delivery" guarantees.

Yes, that's the question.

> However, the subtle bugs and bad software
> created by mistaken assumptions about global ordering are a more
> serious problem, so I think it's more important to point them out.

Agreed. Many thanks!

Cheers,
- Alex

David Schwartz

Sep 22, 2010, 2:01:34 AM

On Sep 21, 10:31 pm, Alexander Burger <a...@software-lab.de> wrote:

> Then can you give an example how one process can run earlier in *time*
> after it obtained a lock released by another process?

Sure. An operation may be stored in some kind of cache or holding area
and only be flushed by a subsequent operation that is defined to see
its effects. This is the reason multithreaded code needs mutexes.

You perform an operation and then assume that a 'subsequent' operation
will see its effects. However, there is no interposing operation that
guarantees that its effects will be visible. So the operation could be
held somewhere and not be visible at that time.

For example, suppose one process does a 'write' to a regular file that
enlarges it. That 'write' could be held somewhere to be accumulated
with possible later writes. That means a 'stat' may not show the new
size. But a 'read' is defined to see the results, so if the 'write'
was held in some way, a 'read' to the same file would have to flush
it.

You assume that the operation can't actually take place after the
system call to perform it returns. However, it absolutely can, unless
some specific requirement prevents delaying the effects in that
specific case.

DS

Alexander Burger

Sep 22, 2010, 3:21:36 AM

David Schwartz <dav...@webmaster.com> wrote:
> On Sep 21, 10:31 pm, Alexander Burger <a...@software-lab.de> wrote:
>
>> Then can you give an example how one process can run earlier in *time*
>> after it obtained a lock released by another process?
>
> Sure. An operation may be stored in some kind of cache or holding area
> and only be flushed by a subsequent operation that is defined to see
> its effects. This is the reason multithreaded code needs mutexes.
>
> You perform an operation and then assume that a 'subsequent' operation
> will see its effects. However, there is no interposing operation that

Sigh. I give up. Again, you are confusing "sequential in time" with
"seeing the effects".

David Schwartz

Sep 22, 2010, 5:58:37 AM

My example is the same for both cases. If you think of the queue as
holding the operation itself, then the operation takes place later in
time. If you think of the queue as holding the results, then though
the operation has taken place, you cannot see its effects.

I'm only confusing them in this case because they are the same thing.
There is no meaningful distinction between "the operation has not
taken place yet" and "the operation has taken place but you cannot yet
see its effects".

DS

Jasen Betts

Sep 22, 2010, 7:44:43 AM

On 2010-09-21, Alexander Burger <a...@software-lab.de> wrote:

> 1. Why can we rely on data being immediately available for write to a
> regular file, but not to a pipe?

The two ends of a pipe are different places

> 2. Why do poll() and select() show a different behavior?

undefined behaviour is like that.
different execution time?
different cache hits?

--
¡spuɐɥ ou 'ɐꟽ ʞooꞀ

Ersek, Laszlo

Sep 22, 2010, 8:46:43 AM

I'm glad I was finally given the opportunity to read this from you -- it
bothered me to no end that I couldn't decide whether you debate the
ordering of system call *submissions*, *completions*, or the immediate
visibility of their effects once they completed.

I agree that ultimately there is no practical difference between "not yet
completed" or "completed but invisible". I do find a tiny theoretic
difference between them: in the current specific case, completion seems to
be required by the standard, and prompt visibility is missing only because
a given implementation chose not to ensure it as an extension (with eg.
cache flushes or whatever). In my mind that is somehow different (exists
on a different level, has a different weight etc etc) from when an
implementation delays the entire write() processing, even though the
application gets to see the same behavior.

(With this post I'm not trying to question anything further.)

Cheers,
lacos

Ersek, Laszlo

Sep 22, 2010, 9:03:19 AM

On Tue, 21 Sep 2010, David Schwartz wrote:

> For example, suppose one process does a 'write' to a regular file that
> enlarges it. That 'write' could be held somewhere to be accumulated with
> possible later writes. That means a 'stat' may not show the new size.

While in general I understand the explanation and agree with it, for
write() and stat() there is a visibility requirement:

http://www.opengroup.org/onlinepubs/9699919799/functions/write.html

----v----

On a regular file or other file capable of seeking, the actual writing of
data shall proceed from the position in the file indicated by the file
offset associated with fildes. Before successful return from write(), the
file offset shall be incremented by the number of bytes actually written.
On a regular file, if the position of the last byte written is greater
than or equal to the length of the file, the length of the file shall be
set to this position plus one.

----^----

I believe this does intend to say that a stat() following the successful
completion of such a write() will see the new size. ("Successful
completion" meaning "returning", and "following" meaning the same thing as
elsewhere in this thread, ie. happens-before ensured through logical
dependency between the issuance of system calls.)
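
The single-process case of that write()/stat() requirement is easy to check with a small C sketch. The function name size_after_write and the temporary file template are made up for this example. It only covers a write() and an fstat() issued by the same process, so it does not settle the cross-process question debated in this thread.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the file size reported by fstat() right after a 10-byte write, or -1. */
off_t size_after_write(void)
{
    char path[] = "/tmp/sizedemoXXXXXX";
    struct stat st;
    off_t size = -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                              /* file vanishes once fd is closed */
    if (write(fd, "0123456789", 10) == 10 && fstat(fd, &st) == 0)
        size = st.st_size;
    close(fd);
    return size;
}
```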

lacos

Alexander Burger

Sep 22, 2010, 9:09:42 AM

Jasen Betts <ja...@xnet.co.nz> wrote:
> On 2010-09-21, Alexander Burger <a...@software-lab.de> wrote:
>
>> 1. Why can we rely on data being immediately available for write to a
>> regular file, but not to a pipe?
>
> The two ends of a pipe are different places

OK, I didn't see it that way.

>> 2. Why do poll() and select() show a different behavior?
>
> undefined behaviour is like that.
> different execution time?
> different cache hits?

I didn't talk about undefined behaviour. As I wrote, I see a very well
defined behavior, in that select() never gives an error, while poll()
does easily. This doesn't prove anything, but is reproducible.

Actually, I was hoping I did something wrong with poll(), and people
here would ask things like "did you only set the POLLIN bit, or did you
also pay attention to POLLHUP?", "How do you react to POLLHUP?", or "if
you are polling the same file descriptors for POLLIN and POLLOUT did you
take care that they end up in the same poll structure?" ... Anything
which might be relevant in this context.
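
As a sketch of the kind of check those questions are about, here is a zero-timeout poll() on a pipe's read end that distinguishes "data ready" from "writer hung up" instead of testing POLLIN alone. The enum and the function name check_pipe are illustrative assumptions, not taken from the PicoLisp sources.

```c
#include <poll.h>

enum pipe_state { PIPE_ERROR = -1, PIPE_IDLE = 0, PIPE_DATA = 1, PIPE_HANGUP = 2 };

/* Zero-timeout check of a pipe's read end that looks at more than POLLIN. */
enum pipe_state check_pipe(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int r = poll(&pfd, 1, 0);

    if (r < 0 || (pfd.revents & (POLLERR | POLLNVAL)))
        return PIPE_ERROR;
    if (r == 0)
        return PIPE_IDLE;
    if (pfd.revents & POLLIN)
        return PIPE_DATA;          /* may be combined with POLLHUP: drain first */
    if (pfd.revents & POLLHUP)
        return PIPE_HANGUP;        /* writer gone and nothing left to read */
    return PIPE_IDLE;
}
```

POLLHUP, POLLERR and POLLNVAL are reported in revents even when they are not requested in events, which is why the sketch inspects them although only POLLIN was asked for.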

Thanks,
- Alex

Rainer Weikusat

Sep 22, 2010, 9:13:08 AM

Alexander Burger <a...@software-lab.de> writes:

[...]

> Actually, I was hoping I did something wrong with poll(), and people

> here would ask things like "did you only set the POLLIN bit, or did you
> also pay attention to POLLHUP?", "How do you react to POLLHUP?", or "if
> you are polling the same file descriptors for POLLIN and POLLOUT did you
> take care that they end up in the same poll structure?" ... Anything
> which might be relevant in this context.

How is this supposed to happen for code no one except you has ever
seen?

David Schwartz

Sep 22, 2010, 9:44:36 AM

On Sep 22, 6:03 am, "Ersek, Laszlo" <la...@caesar.elte.hu> wrote:

> I believe this does intend to say that a stat() following the successful
> completion of such a write() will see the new size. ("Successful
> completion" meaning "returning", and "following" meaning the same thing as
> elsewhere in this thread, ie. happens-before ensured through logical
> dependency between the issuance of system calls.)

I agree with the first half of what you said but not the second half.
I agree that this does intend to say that a 'stat' following will see
the new size. But I disagree that you can use any logical dependency
you choose to establish that the 'stat' follows the write.

For example, if you release a lock on a file on one filesystem and
then perform a 'write' on another filesystem, the implementation is
completely free to reorder those operations as they have no defined
dependencies. Another process that acquires the lock cannot assume
that a call to 'stat' will see the effects of the 'write'.

DS

Rainer Weikusat

Sep 22, 2010, 10:55:57 AM

David Schwartz <dav...@webmaster.com> writes:
> On Sep 22, 6:03 am, "Ersek, Laszlo" <la...@caesar.elte.hu> wrote:
>> I believe this does intend to say that a stat() following the successful
>> completion of such a write() will see the new size. ("Successful
>> completion" meaning "returning", and "following" meaning the same thing as
>> elsewhere in this thread, ie. happens-before ensured through logical
>> dependency between the issuance of system calls.)
>
> I agree with the first half of what you said but not the second half.
> I agree that this does intend to say that a 'stat' following will see
> the new size. But I disagree that you can use any logical dependency
> you choose to establish that the 'stat' follows the write.
>
> For example, if you release a lock on a file on one filesystem and
> then perform a 'write' on another filesystem, the implementation is
> completely free to reorder those operations as they have no defined
> dependencies.

Can you please quote a passage of the standard which not only
abolishes the concept of an independent flow of time for networked
computers, as you are wont to desire, but also for mankind as such?

Alexander Burger

Sep 22, 2010, 10:59:54 AM

Rainer Weikusat <rwei...@mssgmbh.com> wrote:
>> Actually, I was hoping I did something wrong with poll(), and people

>> here would ask things ...

>
> How is this supposed to happen for code no one except you has ever
> seen?

The code can be looked up in an older version of the link I posted
before. But I don't want to trouble anyone, the code is in assembly
anyway. No, what I hoped for is somebody having experienced a similar
problem, or some insight into possible errors in that context. Not just
messages like "it is undefined" when what is actually meant is "it is
unknown". Not just guessing, but practical experience. We're here in a
newsgroup called "...development.apps" after all ;-)

Rainer Weikusat

Sep 22, 2010, 11:32:03 AM

My practical experience makes me assume that your code is buggy. I
wouldn't usually claim that without looking at this code first, but
I'm not going to go hunting for it. In any case, the assumption that
the next call to read referring to a file which supports blocking/
non-blocking I/O which provably occurs after a write to the same file
will return the written data without any implementation-induced delay
is not supported by the UNIX(*) standard although the code of the
pipe-implementation of Linux I have looked at ought to have this
property.
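
A minimal sketch of the property in question, assuming Linux and a
single process: one byte is written to a pipe and the read end is then
polled with a zero timeout. The function name is hypothetical. On the
Linux pipe implementation this is expected to report POLLIN right away;
whether the standard requires that is exactly what is disputed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/* Writes one byte into a fresh pipe, then polls the read end with a
 * zero timeout. Returns 1 if POLLIN was reported, 0 if it was not,
 * and -1 if setting up the pipe or the write itself failed. */
int pipe_readable_after_write(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int result = -1;
    const char byte = 'x';
    if (write(fds[1], &byte, 1) == 1) {
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
        int n = poll(&pfd, 1, 0);          /* timeout 0: do not wait */
        if (n >= 0)
            result = (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
    }
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

A result of 0 from a call like this would be the behaviour reported at
the start of the thread, reproduced without any second process.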

David Schwartz

Sep 22, 2010, 7:00:08 PM

On Sep 22, 7:55 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:

> Can you please quote a passage of the standard which not only
> abolishes the concept of an independent flow of time for networked
> computers, as you are wont to desire, but also for mankind as such?

The standard doesn't have to (and shouldn't) list all the things that
are not guaranteed.

DS

Rainer Weikusat

Sep 23, 2010, 7:40:54 AM

At best, it is your (IMHO somewhat weird) opinion that terms like
'before' and 'after' would have a mysterious, undefined and decidedly
uncommon meaning. My opinion on that is that understanding a text
requires making sensible assumptions regarding everything which isn't
spelled out explicitly, and apart from that, I can prove you wrong
:->: Assuming some processes execute the (fictional) function calls

get_exclusive_lock();
do_what_the_heck();
release_exclusive_lock();

and another process executes the command sequence

get_exclusive_lock();
look_for_what_the_heck();
release_exclusive_lock();

it is guaranteed that, by the time the second process gets to calling
look_for_what_the_heck(), all callers of do_what_the_heck() which
acquired the exclusive lock prior to the second process are done with
do_what_the_heck (similar to the before/after requirements for write
and read). First, as an appeal to reason: Without this guarantee, file
locking could not be used to serialize anything except the locking
calls themselves. It is unreasonable to assume that the people who
wrote the standard text intended this feature to be useless and even
more unreasonable to assume that every implementor has just mindlessly
implemented it in this way. But more important, according to ISO/IEC
9899:1999 (E), 5.1.2.3|2, each function call in C is by definition 'a
side effect' and it is also a sequence point (6.5.2.2|10). It is
required that

At certain specified points in the execution sequence called
sequence points, all side effects of previous evaluations
shall be complete and no side effects of subsequent
evaluations shall have taken place.

Thus, the function calls are ordered with respect to each other for
each individual process, and insofar as calling a particular function
enforces a time-based ordering between different processes also
calling this function, the 'contents' of the respective critical
sections will obey the same time-based ordering (I hope this makes
any sense at all in English ...).
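
The fictional calls above can be mapped onto fcntl(2) record locks
roughly as follows. This is only a sketch, assuming Linux: set_lock()
and locked_handoff() are hypothetical names, the lock covers byte 0 of
a temporary file, and the extra "ready" pipe is an added handshake (not
part of the scenario as described) so that the parent only asks for the
lock once the child already holds it.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Apply a blocking fcntl(2) lock operation to byte 0 of fd. */
static int set_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;              /* F_WRLCK (exclusive) or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;
    return fcntl(fd, F_SETLKW, &fl);
}

int get_exclusive_lock(int fd)     { return set_lock(fd, F_WRLCK); }
int release_exclusive_lock(int fd) { return set_lock(fd, F_UNLCK); }

/* Child: lock, write to the data pipe, unlock. Parent: wait for the
 * lock, then poll the data pipe with a zero timeout.
 * Returns 1 if the parent saw POLLIN, 0 if not, -1 on setup errors. */
int locked_handoff(void)
{
    char path[] = "/tmp/lock-demo-XXXXXX";
    int data[2], ready[2];
    int lock_fd = mkstemp(path);
    if (lock_fd < 0)
        return -1;
    unlink(path);                  /* the open descriptor keeps the file */
    if (pipe(data) != 0 || pipe(ready) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                                   /* child C */
        if (get_exclusive_lock(lock_fd) != 0) _exit(1);
        if (write(ready[1], "r", 1) != 1) _exit(1);   /* lock is held */
        if (write(data[1], "d", 1) != 1) _exit(1);    /* do_what_the_heck() */
        release_exclusive_lock(lock_fd);
        _exit(0);
    }

    close(ready[1]);               /* parent keeps only the read ends */
    close(data[1]);
    int seen = -1;
    char c;
    if (read(ready[0], &c, 1) == 1 && get_exclusive_lock(lock_fd) == 0) {
        /* look_for_what_the_heck(): the child has released the lock. */
        struct pollfd pfd = { .fd = data[0], .events = POLLIN, .revents = 0 };
        int n = poll(&pfd, 1, 0);
        if (n >= 0)
            seen = (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
        release_exclusive_lock(lock_fd);
    }
    waitpid(pid, NULL, 0);
    close(ready[0]);
    close(data[0]);
    close(lock_fd);
    return seen;
}
```

The record locks belong to the process rather than to the descriptor,
so the child and the parent contend for the same byte even though they
share the descriptor inherited across fork().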

Ersek, Laszlo

Sep 23, 2010, 10:37:07 AM

On Thu, 23 Sep 2010, Rainer Weikusat wrote:

> David Schwartz <dav...@webmaster.com> writes:
>> On Sep 22, 7:55 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>>> Can you please quote a passage of the standard which not only
>>> abolishes the concept of an independent flow of time for networked
>>> computers, as you are wont to desire, but also for mankind as such?
>>
>> The standard doesn't have to (and shouldn't) list all the things that
>> are not guaranteed.
>
> At best, it is your (IMHO somewhat weird) opinion that terms like
> 'before' and 'after' would have a mysterious, undefined and decidedly
> uncommon meaning. My opinion on that is that understanding a text
> requires making sensible assumptions regarding everything which isn't
> spelled out explicitly,

(I'll drag this out of context:)

> It is unreasonable to assume that the people who wrote the standard text
> intended this feature to be useless and even more unreasonable to assume
> that every implementor has just mindlessly implemented it in this way.

I agree with you that certain assumptions must have been the intention of
the standard developers even without spelling them out explicitly, and
that any implementation not supporting such unwritten assumptions is
borked.

For example, suppose I'm interactively checking and re-checking a status
file on file system FS_1. After some time, program P writes to it "I'm
done with file F on FS_2", and goes on to work on something else. I start
another program P2 to process file F. If the system moved the appearance
of the status message before the last write (and final close) of F --
contrary to the order present in the source code of P --, I could process
an incomplete F with P2. I consider any system allowing this broken.

C and UNIX(R) are not a lazy evaluation functional language environment.

( in P                system wide
  ------------------  ---------------------------------------
  write(data_file)    write(status_file)
  close(data_file)    read(status_file)
  write(status_file)  read(data_file)
                      write(data_file)
                      close(data_file)

Though write()/read() visibility is satisfied, the broken system reordered
two writes, and fooled an interactive user.)
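
For reference, the source order of P and of the observer could look
like the sketch below. All function names, the "done" status text and
the file handling are assumptions made for illustration. The code only
fixes the order in which the calls are issued; whether an outside
observer must see the effects in that order is the point under dispute.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create or truncate path and write text to it; 0 on success. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    int rc = close(fd);
    return (n == (ssize_t)len && rc == 0) ? 0 : -1;
}

/* Program P: write and close the data file, then write the status. */
int publish(const char *data_path, const char *status_path,
            const char *payload)
{
    if (write_file(data_path, payload) != 0)
        return -1;
    return write_file(status_path, "done\n");
}

/* Observer: read the status first, and only then the data file.
 * Returns the number of data bytes seen, or -1 if the status file
 * does not say "done" yet or something could not be read. */
long observe(const char *data_path, const char *status_path)
{
    char buf[256];
    int fd = open(status_path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    if (n < 5 || memcmp(buf, "done\n", 5) != 0)
        return -1;

    fd = open(data_path, O_RDONLY);
    if (fd < 0)
        return -1;
    long total = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

On a sensible system, an observe() that finds "done" also finds the
complete payload; the broken system in the table is one where it could
come up short.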

OTOH, I must agree with David that the letter of the standards would allow
such a brain-damaged implementation. (In the name of performance,
obviously!) So I'll try to argue his point below, even though I actually
agree with the claim that some assumptions *must* hold.

> Assuming some processes execute the (fictional) function calls
>
> get_exclusive_lock();
> do_what_the_heck();
> release_exclusive_lock();
>
> and another process executes the command sequence
>
> get_exclusive_lock();
> look_for_what_the_heck();
> release_exclusive_lock();
>
> it is guaranteed that, by the time the second process gets to calling
> look_for_what_the_heck(), all callers of do_what_the_heck() which
> acquired the exclusive lock prior to the second process are done with
> do_what_the_heck (similar to the before/after requirements for write and
> read). First, as an appeal to reason: Without this guarantee, file
> locking could not be used to serialize anything except the locking calls
> themselves.

... and except other operations referring to the locked file.

> [sentences atop moved from here]

> But more important, according to ISO/IEC 9899:1999 (E), 5.1.2.3|2, each
> function call in C is by definition 'a side effect' and it is also a
> sequence point (6.5.2.2|10). It is required that
>
> At certain specified points in the execution sequence called
> sequence points, all side effects of previous evaluations
> shall be complete and no side effects of subsequent
> evaluations shall have taken place.
>
> Thus, the function calls are ordered with respect to each other for
> each individual process, and insofar as calling a particular function
> enforces a time-based ordering between different processes also
> calling this function, the 'contents' of the respective critical
> sections will obey the same time-based ordering (I hope this makes
> any sense at all in English ...).

At some earlier point in this thread I was going to cite this exact
passage from the C standard, but if you also look at 5.1.2.3p3, p5 and p7:

----v----

3 In the abstract machine, all expressions are evaluated as specified by
the semantics. An actual implementation need not evaluate part of an
expression if it can deduce that its value is not used and that no
needed side effects are produced (including any caused by calling a
function or accessing a volatile object).

4 [...]

5 The least requirements on a conforming implementation are:

- At sequence points, volatile objects are stable in the sense that
previous accesses are complete and subsequent accesses have not yet
occurred.

- At program termination, all data written into files shall be
identical to the result that execution of the program according to the
abstract semantics would have produced.

- The input and output dynamics of interactive devices shall take place
as specified in 7.19.3. The intent of these requirements is that
unbuffered or line-buffered output appear as soon as possible, to
ensure that prompting messages actually appear prior to a program
waiting for input.

6 [...]

7 More stringent correspondences between abstract and actual semantics may
be defined by each implementation.

----^----

For me the above implies: even though sequencing *must look* like you
describe *within a process*, a brain damaged ^W^W very performant system
is allowed to reorder *everything* except volatile accesses in front of
external observers. p7 allows implementations to add further restrictions,
but POSIX still doesn't seem to spell out those assumptions we deem
indispensable.

<rant>

I believe I personally have no other choice than to stick to said
assumptions, even if as conservatively as possible. I don't know how one
could write *any* C UNIX(R) program without them. If everything is allowed
to appear out of order system-wide per default, except the *very few*
explicit guarantees, then we're really headed towards a lazy eval
functional language.

I find it instructive that C99 *still* lacks a memory model and only C1x
is starting to introduce one (*), and also that the first SUS containing
the Memory Synchronization section is SUSv3. This tells me everybody
operated originally with a more or less consistent set of base
assumptions. Then speeds of processor and memory and disk started to
diverge insanely, compilers and kernels started to violate those unwritten
assumptions for "better performance", and now no previously valid
assumption can be codified in a *general* standard (the original "spirit"
can't be presented in the common "letter" anymore), because for each
assumption there exists a system, however fringe, that violates it. (Even
though sensible systems still ensure most of them.) Instead, only the
remnants of said assumptions can be collected and codified, and they, if
we refrain from relying on the informative rationales and the traditional
ways, are in themselves enough for nothing. IMHO.

(*) I clearly remember a C1X committee member posting a remark somewhere
about how hard it was to word the C1X sequencing rules, but for the life
of me I can't find it.

</rant>

lacos

Rainer Weikusat

Sep 23, 2010, 12:28:00 PM

- I'm going to cut this down quite aggressively -

"Ersek, Laszlo" <la...@caesar.elte.hu> writes:
> On Thu, 23 Sep 2010, Rainer Weikusat wrote:
>> David Schwartz <dav...@webmaster.com> writes:
>>> On Sep 22, 7:55 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:

[...]

> ( in P                system wide
>   ------------------  ---------------------------------------
>   write(data_file)    write(status_file)
>   close(data_file)    read(status_file)
>   write(status_file)  read(data_file)
>                       write(data_file)
>                       close(data_file)
>
> Though write()/read() visibility is satisfied, the broken system
> reordered two writes, and fooled an interactive user.)
>
> OTOH, I must agree with David that the letter of the standards would
> allow such a brain-damaged implementation.

In this respect, you're both seriously confused and shouldn't have
read any low-level programming documentation of a modern CPU intended
to be used in a SMP system without the help of a sufficiently learned
adult. Sorry if this is blunt but not everything which occurs in some
contexts (like possible re-ordering of memory accesses which happen in
executed binary code) automatically appears in all other contexts (eg,
assuming that Intel starts to produce Roman-Catholic priest CPUs at some
point in time, this doesn't transitively imply that all
UNIX(*)-systems will, from now on, be required to live in celibacy).

[...]

>> Assuming some processes execute the (fictional) function calls
>>
>> get_exclusive_lock();
>> do_what_the_heck();
>> release_exclusive_lock();
>>
>> and another process executes the command sequence
>>
>> get_exclusive_lock();
>> look_for_what_the_heck();
>> release_exclusive_lock();
>>
>> it is guaranteed that, by the time the second process gets to
>> calling look_for_what_the_heck(), all callers of do_what_the_heck()
>> which acquired the exclusive lock prior to the second process are
>> done with do_what_the_heck (similar to the before/after requirements
>> for write and read). First, as an appeal to reason: Without this
>> guarantee, file locking could not be used to serialize anything
>> except the locking calls themselves.
>
> ... and except other operations referring to the locked file.

They wouldn't be 'advisory locks' then.

This doesn't fly. If you believe that these two passages of the
C-standard contradict each other, you should try to get the text
fixed. Until then, conforming implementations have to conform to all
requirements, no matter how contradictory some of their conceivable
implementations might seem.

David Schwartz

Sep 23, 2010, 3:20:37 PM

On Sep 23, 4:40 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:

> At best, it is your (IMHO somewhat weird) opinion that terms like
> 'before' and 'after' would have a mysterious, undefined and decidedly
> uncommon meaning.

Yes, that is exactly my opinion, except it's not just an opinion, it's
a fact. The naive assumption that A either comes before or after B in
some kind of global ordering is simply not true -- there is no such
thing as global ordering.

> My opinion on that is that understanding a text
> requires making sensible assumptions regarding everything which isn't
> spelled out explicitly,

I agree. Just make sure the assumptions are actually sensible.

> and apart from that, I can prove you wrong
> :->: Assuming some processes execute the (fictional) function calls
>
> get_exclusive_lock();
> do_what_the_heck();
> release_exclusive_lock();
>
> and another process executes the command sequence
>
> get_exclusive_lock();
> look_for_what_the_heck();
> release_exclusive_lock();
>
> it is guaranteed that, by the time the second process gets to calling
> look_for_what_the_heck(), all callers of do_what_the_heck() which
> acquired the exclusive lock prior to the second process are done with
> do_what_the_heck (similar to the before/after requirements for write
> and read).

Done in the sense that the call has returned. They are not necessarily
done in the sense that the call has finished doing whatever work it
may need to do.

> First, as an appeal to reason: Without this guarantee, file
> locking could not be used to serialize anything except the locking
> calls themselves. It is unreasonable to assume that the people who
> wrote the standard text intended this feature to be useless and even
> more unreasonable to assume that every implementor has just mindlessly
> implemented it in this way. But more important, according to ISO/IEC
> 9899:1999 (E), 5.1.2.3|2, each function call in C is by definition 'a
> side effect' and it is also a sequence point (6.5.2.2|10). It is
> required that
>
> At certain specified points in the execution sequence called
> sequence points, all side effects of previous evaluations
> shall be complete and no side effects of subsequent
> evaluations shall have taken place.
>
> Thus, the function calls are ordered with respect to each other for
> each individual process, and insofar as calling a particular function
> enforces a time-based ordering between different processes also
> calling this function, the 'contents' of the respective critical
> sections will obey the same time-based ordering (I hope this makes
> any sense at all in English ...).

Now *that* is a nonsensical interpretation. If that were true, memory
barriers would never be needed, a simple C sequence point would
suffice.

Also, that would imply that if a program writes a byte to one TCP
connection and then writes a byte to another TCP connection, if the
second byte is detected, so must the first byte be. But of course this
is not so, because though the write operation has completed, the
completion of a write operation does not guarantee visibility of the
write on the other end. And that same argument applies to pretty much
everything else -- the completion of a system call does not guarantee
(in general) that the effects of that system call will be visible in
any particular way.

DS

Ersek, Laszlo

Sep 23, 2010, 4:05:40 PM

On Thu, 23 Sep 2010, Rainer Weikusat wrote:

> "Ersek, Laszlo" <la...@caesar.elte.hu> writes:
>> On Thu, 23 Sep 2010, Rainer Weikusat wrote:
>>> David Schwartz <dav...@webmaster.com> writes:
>>>> On Sep 22, 7:55 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>
> [...]
>
>> ( in P                system wide
>>   ------------------  ---------------------------------------
>>   write(data_file)    write(status_file)
>>   close(data_file)    read(status_file)
>>   write(status_file)  read(data_file)
>>                       write(data_file)
>>                       close(data_file)
>>
>> Though write()/read() visibility is satisfied, the broken system
>> reordered two writes, and fooled an interactive user.)
>>
>> OTOH, I must agree with David that the letter of the standards would
>> allow such a brain-damaged implementation.
>
> In this respect, you're both seriously confused and shouldn't have
> read any low-level programming documentation of a modern CPU intended
> to be used in a SMP system without the help of a sufficiently learned
> adult.

You certainly know how to wrap up a discussion, but I resist the urge to
plonk you just yet. I don't know how I deserved this offense by simply
stating "I agree with you, but not because your proof is correct --
because, independently, I find that proof ungrounded in standards".

> Sorry if this is blunt but not everything which occurs in some contexts
> (like possible re-ordering of memory accesses which happen in executed
> binary code) automatically appears in all other contexts (eg, assuming
> that Intel starts to produce Roman-Catholic priest CPUs at some point in
> time, this doesn't transitively imply that all UNIX(*)-systems will,
> from now on, be required to live in celibacy).

Agree completely. This is what I meant. I also tried to express my
impression that this can't be derived from the normative parts of relevant
standards, and that we depend on sensible systems that nonetheless support
it.

>>> Assuming some processes execute the (fictional) function calls
>>>
>>> get_exclusive_lock();
>>> do_what_the_heck();
>>> release_exclusive_lock();
>>>
>>> and another process executes the command sequence
>>>
>>> get_exclusive_lock();
>>> look_for_what_the_heck();
>>> release_exclusive_lock();
>>>
>>> it is guaranteed that, by the time the second process gets to
>>> calling look_for_what_the_heck(), all callers of do_what_the_heck()
>>> which acquired the exclusive lock prior to the second process are
>>> done with do_what_the_heck (similar to the before/after requirements
>>> for write and read). First, as an appeal to reason: Without this
>>> guarantee, file locking could not be used to serialize anything
>>> except the locking calls themselves.
>>
>> ... and except other operations referring to the locked file.
>
> They wouldn't be 'advisory locks' then.

I meant "operations referring to the locked file that are executed with
the lock held".

I didn't imply that they would contradict each other. 5.1.2.3p2 talks
about the abstract machine, that is, what you can expect to happen and to
be present inside the C program, and in the execution environment when
looked at from within the C program.

The rest seems to talk about what has to happen in the guts of the
implementation. Ie. in the execution environment when looked at from
outside the C program.

For example, p3 says that if "x" is a non-volatile-qualified object, then
the expression-statement

x;

doesn't need to trigger a read watchpoint in your debugger. (I checked in
gdb if there is such a thing, just to be sure, and there is, see
"rwatch".)

p5 makes a positive, minimum statement about what must show in the
execution environment when looked at from outside the C program. For
example, whatever file modifications you do through stdio streams don't
need to be observable outside the C program itself at all until the C
program terminates. If we rely on nothing more than the ISO C standard.
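
The stdio case can be observed with a sketch like the following,
assuming a POSIX system for stat(). The function names and the 5-byte
payload are made up for illustration. The size seen from outside the
stream before fflush() depends on the implementation's buffering and is
often 0; only the size after fflush() is checked against the payload.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as seen from outside the stdio stream, or -1. */
static long size_on_disk(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/* Writes 5 bytes through a fully buffered stream and records the file
 * size observed before and after fflush(). Returns 0 on success. */
int buffered_write_visibility(const char *path, long *before, long *after)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    setvbuf(fp, NULL, _IOFBF, BUFSIZ);     /* request full buffering */
    if (fputs("hello", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    *before = size_on_disk(path);  /* data may still sit in the buffer */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    *after = size_on_disk(path);   /* the write has been handed to the OS */
    return fclose(fp) == 0 ? 0 : -1;
}
```

ISO C alone promises nothing about either observation before program
termination; the second one relies on what POSIX says about fflush()
and write().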

Please look at p8-9 as well (which is an example, hence informative only).

But this is already comp.lang.c material; subscribers there could probably
prove my point more aptly -- or disprove it easily.

I wanted to express in my previous post that I find these minimum
guarantees (and those additionally provided by POSIX) painfully scant, and
that we're lucky to have sensible systems that keep lots of our
non-standardized assumptions working.

Whatever, I rest my case. Have a good day, kind sir.
lacos

Alexander Burger

Sep 24, 2010, 3:04:16 AM

Thanks to all!

Concerning the original question, I have to accept that nobody so far
can give a guarantee about data written to a pipe being immediately
visible to other processes.

To be on the safe side, I changed the API of the Lisp 'sync' function,
requiring applications now to send an end-of-transmission message.
This makes 'sync' less general, but only in certain rare cases.

So the original problem is no longer an issue.

Cheers,
- Alex
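
One way such an end-of-transmission scheme could look in C is sketched
below. This is not the PicoLisp implementation (which is in assembly
and not shown in this thread): the EOT byte value, the function names
and the byte-at-a-time read are assumptions chosen for brevity. The
receiver blocks in read(2) until the marker arrives, so it no longer
depends on a zero-timeout poll(2) seeing the data immediately.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

#define EOT 0x04   /* hypothetical end-of-transmission marker byte */

/* Sender: write the payload followed by the EOT marker.
 * The payload must not contain the EOT byte itself. */
int send_message(int fd, const char *data, size_t len)
{
    if (write(fd, data, len) != (ssize_t)len)
        return -1;
    const char eot = EOT;
    return write(fd, &eot, 1) == 1 ? 0 : -1;
}

/* Receiver: block until the EOT marker has been read.
 * Returns the payload length, or -1 on a read error, on end-of-file
 * before the marker, or if the payload does not fit into buf. */
ssize_t receive_message(int fd, char *buf, size_t cap)
{
    size_t used = 0;
    for (;;) {
        char c;
        ssize_t n = read(fd, &c, 1);   /* blocks until a byte arrives */
        if (n != 1)
            return -1;                 /* error or premature EOF */
        if (c == EOT)
            return (ssize_t)used;
        if (used == cap)
            return -1;                 /* payload too large for buf */
        buf[used++] = c;
    }
}
```

A production version would read in larger chunks and retry on EINTR;
the sketch treats any short read as an error.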