I'm a bit confused about the semantics of pwrite(2). As far as I
understand, pwrite() does an atomic lseek-and-write. However, I'm not sure
what happens when the data to be written is larger than the disk block size.
For example, two threads need to flush their log buffer to the same
file:
thread1 needs to write 2048 bytes to /var/log/out.log
thread2 needs to write 2048 bytes to /var/log/out.log
If both threads call pwrite() at the same time, does pwrite() guarantee
that all 2048 bytes from a thread will be written atomically?
Or is it likely that large writes from multiple threads will be broken
down into smaller atomic units and intermingled in the log file?
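In code, the situation is roughly this (just an illustrative sketch; error
checking is omitted, and each thread finds the end of the file by itself):

#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int log_fd;

/* Each thread flushes a 2048-byte buffer at what it thinks is the end
 * of the file.  The question: do all 2048 bytes stay contiguous? */
static void *flush_log(void *tag)
{
    char buf[2048];
    struct stat st;

    memset(buf, *(const char *)tag, sizeof(buf));
    fstat(log_fd, &st);
    pwrite(log_fd, buf, sizeof(buf), st.st_size);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    log_fd = open("/var/log/out.log", O_WRONLY | O_CREAT, 0644);
    pthread_create(&t1, NULL, flush_log, "1");
    pthread_create(&t2, NULL, flush_log, "2");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return close(log_fd);
}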
Depends. As far as I can tell, pwrite itself should be atomic, regardless
of how big the data buffer is.
> If both threads call pwrite() at the same time, does pwrite() guarantee
> that all 2048 bytes from a thread will be written atomically?
Yes, but if you write at a specific offset, you might have a small problem.
Say two threads simultaneously want to write: they both calculate the offset
and then write to that offset. The end result? Possibly one entry lost as
one writes over the other because the calculated offsets have gone stale.
It seems to me that you need to lock before you calculate the offset anyway.
If you are logging, why are you not opening these files in "a" mode and just
using write() anyway?
--Wouter
> Yes, but if you write at a specific offset, you might have a small
> problem. Say two threads simultaneously want to write: they both
> calculate the offset and then write to that offset. The end result?
> Possibly one entry lost as one writes over the other because the
> calculated offsets have gone stale. It seems to me that you need to
> lock before you
> calculate the offset anyway. If you are logging, why are you not
> opening these files in "a" mode and just using write() anyway?
>
> --Wouter
Yeah, I think this is what I was looking for: if you open a file with the
O_APPEND flag, the kernel will atomically position the file pointer at the
end of the file before each write().
There is a constant PIPE_BUF which tells how much data can be written
atomically to a pipe or FIFO. I'm not sure whether the same applies to
regular files.
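In code, something like this is what I have in mind (a sketch only; error
handling left out):

#include <fcntl.h>
#include <unistd.h>

/* Open the log with O_APPEND so the kernel positions every write()
 * at the current end of file for us. */
int open_log(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

ssize_t log_append(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len);   /* no offset bookkeeping needed */
}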
> Depends. As far as I can tell, pwrite itself should be atomic, regardless
> of how big the data buffer is.
POSIX doesn't agree with you, I'm afraid. The Rationale section
of the write() specification mentions the issue specifically
(pwrite() is defined largely in terms of a version of write()
which doesn't change the file position):
| Atomic/non-atomic: A write is atomic if the whole amount
| written in one operation is not interleaved with data from
| any other process. This is useful when there are multiple
| writers sending data to a single reader. Applications need
| to know how large a write request can be expected to be
| performed atomically. This maximum is called
| {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say
| whether write requests for more than {PIPE_BUF} bytes are
| atomic, but requires that writes of {PIPE_BUF} or fewer
| bytes shall be atomic.
So if you want non-interleaved writes:
a) write in amounts of PIPE_BUF bytes or fewer (sketch below), or
b) trust to luck
I've found with computing that Murphy rules and would not
choose 'b' myself. :-)
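For (a), a rough sketch of what I mean; PIPE_BUF comes from <limits.h>, and
the helper name is made up:

#include <errno.h>
#include <limits.h>
#include <unistd.h>

/* Refuse anything larger than PIPE_BUF, the size up to which the
 * standard requires writes to be atomic. */
ssize_t write_record(int fd, const void *rec, size_t len)
{
    if (len > PIPE_BUF) {
        errno = EMSGSIZE;         /* caller has to keep records small */
        return -1;
    }
    return write(fd, rec, len);
}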
> It seems to me that you need to lock before you calculate the offset anyway.
> If you are logging, why are you not opening these files in "a" mode and just
> using write() anyway?
You've still got the atomicity issue with write().
So, yeah, either give up on multiple threads/processes
writing, keep writes at or below PIPE_BUF, or lock the file before
writing; that's all I can think of at the moment.
Giles
Fair enough. I was looking at a UN*X spec that was rather sparse on details.
> > It seems to me that you need to lock before you calculate the offset anyway.
> > If you are logging, why are you not opening these files in "a" mode and just
> > using write() anyway?
>
> You've still got the atomicity issue with write().
>
> So, yeah, either give up on multiple threads/processes
> writing, keep writes at or below PIPE_BUF, or lock the file before
> writing; that's all I can think of at the moment.
There is another option. Have one thread dedicated to writing, and have all
other threads pass it a message with the data they wish to have written and
where it should go. This single thread can then proceed to append to each
file in order without locking issues. As an improvement to parallelism there
could be a small pool of these threads, each dedicated to certain files.
Avoiding threading safety issues is better than dealing with them.
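A rough sketch of what I mean (one queue and one writer thread; the names
are made up and error handling is omitted):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* One queued message: the destination descriptor plus the bytes to append. */
struct log_msg {
    struct log_msg *next;
    int             fd;
    size_t          len;
    char            data[];
};

static struct log_msg  *head, *tail;
static pthread_mutex_t  q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   q_cond = PTHREAD_COND_INITIALIZER;

/* Called by any thread: copy the record and hand it to the writer. */
void log_submit(int fd, const void *buf, size_t len)
{
    struct log_msg *m = malloc(sizeof(*m) + len);

    m->next = NULL;
    m->fd = fd;
    m->len = len;
    memcpy(m->data, buf, len);

    pthread_mutex_lock(&q_lock);
    if (tail != NULL)
        tail->next = m;
    else
        head = m;
    tail = m;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* The single writer thread: drains the queue in order, so records from
 * different threads never interleave within a file. */
void *log_writer(void *arg)
{
    struct log_msg *m;

    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (head == NULL)
            pthread_cond_wait(&q_cond, &q_lock);
        m = head;
        head = m->next;
        if (head == NULL)
            tail = NULL;
        pthread_mutex_unlock(&q_lock);

        write(m->fd, m->data, m->len);   /* only this thread ever writes */
        free(m);
    }
    return NULL;
}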
> Giles
Wouter
> Giles Lean(giles...@pobox.com) said 2010.05.26 23:26:21 +0000:
> > POSIX doesn't agree with you, I'm afraid.
>
> Fair enough. I was looking at a UN*X spec that was rather sparse on
> details.
In another lifetime, I got to play standards lawyer (and language
lawyer) from time to time. ("Your C library is broken!" "No, your
application's expectations are wrong. Sorry.") So I've had practice.
FYI the standard is available freely (you have to register) at
http://www.opengroup.org. (I won't try to give a direct URL; I usually
find my way into their web site via Google.)
> Avoiding threading safety issues is better than dealing with them.
As one who spent uncounted hours dealing with customers' threading problems,
some being real bugs on our (I worked for a vendor) part but many being
application problems, I can only heartily endorse this!
The Go language has some interesting alternatives for concurrency; I've
started on a NetBSD port but got thoroughly distracted. I'll try to get
myself undistracted again.
Rob Pike says porting Go is "easy", but I'm not sure Rob Pike's idea of
"easy" will match mine. Still, at least I'm porting an open source
language to an open source operating system, and it runs on OS X and
FreeBSD already. Of course their threads support (at OS level, below
pthreads) looks quite different to NetBSD's.
Giles
> There is another option. Have one thread dedicated to writing, and have
> all other threads pass it a message with the data they wish to have
> written and where it should go. This single thread can then proceed to
> append to each file in order without locking issues. As an improvement
> to parallelism there could be a small pool of these threads, each
> dedicated to certain files.
> Avoiding threading safety issues is better than dealing with them.
You would still use locking, only this time you would also need
condition variables to signal to other threads when the copy was
complete.
I think it's simpler to lock a mutex, write to the file descriptor and
unlock the mutex when the copy is done. If you buffer your data, then the
overhead of locking should be negligible.
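I.e., something like this sketch (name made up, error handling omitted):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialise the write itself, so whole records never interleave. */
void log_write_locked(int fd, const void *buf, size_t len)
{
    pthread_mutex_lock(&log_lock);
    write(fd, buf, len);          /* nobody else writes while we hold the lock */
    pthread_mutex_unlock(&log_lock);
}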
Of course. It's a threading problem, so locking is involved. The
question here is where you want the locking to happen.
Using a semaphore (or condition variable) isn't terribly complex and may
actually aid code clarity. Being able to maintain your threaded code is an
important issue.
Does your library care when the writing of data is complete, or does it only
care whether or not an error occurred?
> I think it's simpler to lock a mutex, write to the file descriptor and
> unlock the mutex when the copy is done. If you buffer your data, then the
> overhead of locking should be negligible.
Depends how you want the rest of the system to behave when it's blocking your
entire thread because it's writing. I/O can be very slow, and if the OS is
already writing a bunch of data, you may get some serious delays.
Anyway, it's just a thought, and it seems to me that this is hardly a
NetBSD-specific problem.
Wouter
> Depends how you want the rest of the system to behave when it's blocking
> your entire thread because it's writing. I/O can be very slow, and if
> the OS is already writing a bunch of data, you may get some serious
> delays.
If I/O is saturated then there is not much you can do. You can only go
as fast as the slowest component in your system. No matter how big your
buffers are, or how many threads you have, if the producer is overrunning
the consumer with data, sooner or later you'll hit the bottleneck. If
you're logging data and the buffers are full, you have to flush them,
even if you have to wait for disk I/O. The other option is discarding
data.
The logging package I'm writing is designed to solve the following
issues:
- MT safe. When initialising the package, you specify the maximum number of
concurrent threads. Each thread has its own log buffer, which can be
flushed manually, or automatically when it gets full. Threads register
with the log package and get a thread ID, which they use for logging
data.
- Multiple logs with multiple destinations. Each log is identified by
a log ID and each log can be sent to different destinations, e.g.
syslog, /var/log/all.log and so on.
- Support different "flavours" of syslog, i.e. the MT-safe syslog_r()/syslog()
and regular syslog() that needs locking.
This is mainly to provide an easy-to-use logging package for network
servers with multiple logs and large numbers of threads, e.g. on a Sun
Niagara processor.
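The interface will look roughly like this; the names are provisional and
this is only a sketch:

/* Rough sketch of the planned interface (names provisional). */

typedef int log_id_t;      /* identifies a log                */
typedef int log_tid_t;     /* identifies a registered thread  */

/* Initialise the package for at most maxthreads concurrent threads. */
int       log_init(unsigned maxthreads);

/* Each thread registers once and uses the returned ID for all calls. */
log_tid_t log_thread_register(void);

/* Create a log and attach destinations: a file, syslog, and so on. */
log_id_t  log_open(const char *name);
int       log_add_dest_file(log_id_t log, const char *path);
int       log_add_dest_syslog(log_id_t log, int facility);

/* Append to the calling thread's private buffer; it is flushed
 * automatically when full, or explicitly with log_flush(). */
int       log_printf(log_tid_t tid, log_id_t log, const char *fmt, ...);
int       log_flush(log_tid_t tid, log_id_t log);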
> Yeah, I think this is what I was looking for: if you open a file with the
> O_APPEND flag, the kernel will atomically position the file pointer at the
> end of the file before each write().
>
> There is a constant PIPE_BUF which tells how much data can be written
> atomically to a pipe or FIFO. I'm not sure whether the same applies to
> regular files.
Just lock the file, or use a logging thread to do the writes from a queue.
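For the file-locking option, a sketch using flock(2) for the multi-process
case; fcntl(2) byte-range locks would also work, and within one process the
threads still need a mutex:

#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive lock around each write so whole records from
 * different processes never interleave.  (Threads sharing one process
 * still need a mutex; flock locks belong to the open file, not the
 * thread.) */
ssize_t locked_append(int fd, const void *buf, size_t len)
{
    ssize_t n;

    flock(fd, LOCK_EX);        /* blocks until we own the file */
    n = write(fd, buf, len);   /* fd opened with O_APPEND */
    flock(fd, LOCK_UN);
    return n;
}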
--
NetBSD - Simplicity is prerequisite for reliability