You're even less alone, I'm running into the same issue just now. But I
think I've found a way around it, see below.
> > The manual page says "When possible, the file is opened in non-blocking
> > mode". Your write is probably not blocking - but the memory allocation
> > for it is forcing other data to disk to make room. I.e. it didn't
> > block, it was just "slow".
>
> Even though I know quite well what blocking is, I am not sure how we
> define "slowness". Perhaps when we do define it, we can also define
> "immediately" to mean anything less than five seconds ;-)
>
> You are correct that io to the disk is precisely what must happen to
> complete, and last time I checked, that was the very definition of
> blocking. Not only are writes blocking, even reads are blocking. The
> docs for read(2) also say it will return EAGAIN if "Non-blocking I/O
> has been selected using O_NONBLOCK and no data was immediately
> available for reading."
>
The read(2) manpage reads, under NOTES:
"Many file systems and disks were considered to be fast enough that the
implementation of O_NONBLOCK was deemed unnecessary. So, O_NONBLOCK may
not be available on files and/or disks."
The statement ("fast enough") may only reflect the state of affairs
at that time - a 10 ms seek takes an eternity at 3 GHz, and times
100k it takes an eternity IRL as well (100k x 10 ms is about 17
minutes). I would define "immediately" as meaning that the data is
available from kernel (or disk) buffers.
I need to do vast amounts (100k+) of scattered and unordered small reads
from harddisk and want to keep my seeks short through sorting them. I
have done some measurements and it seems perfectly possible to derive
the physical disk layout from statistics on some 10-100k random seeks,
so I can solve everything in userland. But before writing my own I/O
scheduler I'd thought to give the kernel and/or SATA's NCQ tricks a shot.
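To make the sorting idea concrete, here is a minimal sketch in C. It
assumes that ascending byte offset is a usable stand-in for physical
position on the disk, which need not hold on a real drive. struct
read_req and sort_requests() are hypothetical names for illustration,
not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical request descriptor: one small read at a byte offset. */
struct read_req {
    uint64_t offset;  /* byte offset on the device */
    size_t   length;  /* number of bytes wanted */
    void    *buf;     /* destination buffer */
};

/* qsort() comparator: ascending offset, avoiding overflow-prone subtraction. */
static int cmp_offset(const void *a, const void *b)
{
    const struct read_req *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort the batch so that the reads sweep across the disk in one direction. */
void sort_requests(struct read_req *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    if (n > 1)
        qsort(reqs, n, sizeof reqs[0], cmp_offset);
}
```

A userland scheduler built on measured disk layout would replace
cmp_offset() with a comparison on the derived physical position.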
Now the problem is how to tell the kernel/disk which data I want without
blocking. readv(2) apparently reads the requests in array order.
Multithreading doesn't sound too good for just this purpose.
posix_fadvise(2) sounds like something: "POSIX_FADV_WILLNEED initiates a
non-blocking read of the specified region into the page cache."
But there's apparently no signalling to the process that an actual
read() will indeed not block.
readahead(2) blocks until the specified data has been read.
aio_read(3) apparently doesn't issue a real non-blocking read request,
so you will get the unneeded overhead of one thread per outstanding request.
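For the posix_fadvise(2) option, a rough sketch of what hinting a batch
of regions could look like. struct region, prefetch_regions() and
read_region() are made-up names for illustration. As noted above,
nothing reports when the pages have actually arrived, so the later
pread() may still block.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor for one planned read. */
struct region {
    off_t offset;
    off_t length;
};

/*
 * Hint every region to the kernel with POSIX_FADV_WILLNEED.
 * Returns 0 on success, or the first error number reported by
 * posix_fadvise() (it returns the error directly, not via errno).
 */
int prefetch_regions(int fd, const struct region *r, size_t n)
{
    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int err = posix_fadvise(fd, r[i].offset, r[i].length,
                                POSIX_FADV_WILLNEED);
        if (err != 0)
            return err;
    }
    return 0;
}

/*
 * Fetch one region with an ordinary pread() after hinting.
 * There is no way to ask whether this call will still block.
 */
ssize_t read_region(int fd, const struct region *r, void *buf)
{
    return pread(fd, buf, (size_t)r->length, r->offset);
}
```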
mmap(2) / madvise(2) / mincore(2) may be a way around things (although
non-atomic), but I haven't tested it yet. It might also solve the
problem that started this thread, at least for the reading part of it.
Writing a small read()-like function that operates through mmap()
doesn't seem too complicated. As for writing, you could use msync() with
MS_ASYNC to initiate a write. I'm not sure how to find out if a write
has indeed taken place, but at least initiating a non-blocking write is
possible. munmap() might then still block.
Maybe some guru here can tell beforehand if such an approach would work?
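A minimal sketch of the small read()-like function described above,
assuming a read-only MAP_SHARED mapping of a regular file or block
device. map_readonly() and mmap_try_read() are hypothetical helpers, the
64-page limit is arbitrary, and none of this has been tried against a
real block device. The mincore() check is advisory only: a page can be
evicted before the copy, in which case the copy blocks on a page fault.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a whole file or block device read-only; returns NULL on failure. */
const unsigned char *map_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    off_t end = lseek(fd, 0, SEEK_END);   /* size of the file or device */
    void *p = MAP_FAILED;
    if (end > 0)
        p = mmap(NULL, (size_t)end, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                            /* the mapping survives close() */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)end;
    return p;
}

/*
 * Copy len bytes at offset off into buf, but only if mincore() reports
 * every touched page as resident. Otherwise start a read-in with
 * madvise(MADV_WILLNEED) and return -EAGAIN so the caller can retry later.
 */
ssize_t mmap_try_read(const unsigned char *base, size_t map_len,
                      size_t off, void *buf, size_t len)
{
    assert(base != NULL && buf != NULL);
    if (off > map_len || len > map_len - off)
        return -EINVAL;
    if (len == 0)
        return 0;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off & ~(page - 1);          /* round down to a page */
    size_t span = (off + len) - start;         /* bytes from start to end */
    size_t npages = (span + page - 1) / page;

    unsigned char vec[64];                     /* small reads only */
    if (npages > sizeof vec)
        return -EINVAL;

    void *addr = (void *)(uintptr_t)(base + start);
    if (mincore(addr, span, vec) != 0)
        return -errno;

    for (size_t i = 0; i < npages; i++) {
        if (!(vec[i] & 1)) {                   /* bit 0 set: page resident */
            if (madvise(addr, span, MADV_WILLNEED) != 0)
                return -errno;
            return -EAGAIN;
        }
    }
    memcpy(buf, base + off, len);
    return (ssize_t)len;
}
```

A caller would issue mmap_try_read() for every pending request, do other
work while some return -EAGAIN, and retry those later.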
Cheers,
M.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
>> > > If O_NONBLOCK is meaningful whatsoever (see man page docs for
>> > > semantics) against block devices, one would expect a nonblocking io
>> >
>> > It isn't...
>>
>> Thanks for the reply. It's good to get confirmation that I am not all
>> alone in an alternate non blocking universe. The linux man pages
>> actually had me convinced O_NONBLOCK would actually keep a process
>> from blocking on device io :-)
>>
>
> You're even less alone, I'm running into the same issue just now. But
> I think I've found a way around it, see below.
I guess I should note that I've suggested nonblocking I/O for files
before:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-10/0290.html
I'll also note that enabling such a patch broke apps that accessed cd
burners, for example, since O_NONBLOCK had some preexisting semantics
there that I fail to recall.
Cheers,
Jeff
> I guess I should note that I've suggested nonblocking I/O for files
> before:
>
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-10/0290.html
>
> I'll also note that enabling such a patch broke apps that accessed cd
> burners, for example, since O_NONBLOCK had some preexisting semantics
> there that I fail to recall.
Sounds like nonblocking read/write calls are strongly tied to threads
instead of state related to a file descriptor. I haven't poked around
in there, but perhaps the current linux io architecture is just too
set in stone to design an efficient non blocking mechanism. It would
be a shame not to fix it simply because some broken apps depend upon
blocking behavior when they have been explicitly specifying O_NONBLOCK
via open or fcntl.
At least for now we can describe its actual behavior in the man
pages; I will be submitting a man page patch for consideration later
today.
- Mike
For the record I would like to share my very positive experience with
the approach described. Thanks to 64 bit addressing you can mmap() an
entire block device, and madvise() and mincore() work like you would
expect them to. I haven't tried writing.
I also briefly tried aio_* and the libaio interface. The former is not
really asynchronous - all requests are put in one separate thread where
they will be executed in order, i.e. blocking, so you don't get any
advantage from NCQ or data that was cached by the disk or the kernel.
The latter apparently ends in an io_submit() which will block until all
queued reads are finished, but I might have missed something there.
Imagine the orderly world in which O_NONBLOCK would make syscalls
actually non-blocking...
What you missed is that the native aio system calls require O_DIRECT.
Cheers,
Jeff
Thanks, that made it work. It seems without O_DIRECT it's just like
aio_* but without the separate thread. But I now get the "benefits" of
O_DIRECT for free...
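For reference, a sketch of a single native AIO read with O_DIRECT. It
uses the raw io_setup/io_submit/io_getevents system calls rather than
libaio so that it builds without the library. aio_direct_read() is a
hypothetical helper and the 4096-byte alignment of buffer, length and
offset is an assumption; this is not the code used in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>   /* posix_memalign() for callers allocating the buffer */
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read len bytes at offset off from path with one native AIO request.
 * buf, len and off are assumed to be 4096-byte aligned for O_DIRECT.
 * Returns the number of bytes read, or a negative errno value.
 */
ssize_t aio_direct_read(const char *path, void *buf, size_t len, off_t off)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -errno;

    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) {
        int e = errno;
        close(fd);
        return -e;
    }

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    struct iocb *list[1] = { &cb };
    ssize_t result;
    long n = syscall(SYS_io_submit, ctx, 1, list);
    if (n < 0) {
        result = -errno;
    } else if (n != 1) {
        result = -EAGAIN;
    } else {
        /* The request is in flight now; other work could be done here. */
        struct io_event ev;
        long got;
        do {
            got = syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);
        } while (got < 0 && errno == EINTR);
        result = (got == 1) ? (ssize_t)ev.res : -errno;
    }

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return result;
}
```

To benefit from NCQ one would submit many iocbs in a single io_submit()
call and collect them with io_getevents() as they complete.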
Cheers,
M.
> > > > What you missed is that the native aio system calls require O_DIRECT.
> > > >
> > >
> > > Thanks, that made it work. It seems without O_DIRECT it's just like
> > > aio_* but without the separate thread. But I now get the "benefits" of
> > > O_DIRECT for free...
> >
> > That is awesome news; I was worried. I saw that about O_DIRECT in the
> > doc but assumed you were doing it.
> >
> >
> Where did you see that? I reverted to the kernel source where indeed I
> saw __generic_file_aio_read() in mm/filemap.c check for O_DIRECT.
>
> io_submit(3), io_setup(3) etc don't mention O_DIRECT. Even the example
> in io(3) doesn't do O_DIRECT, so it must be broken. The example has no
> means to see if it is in fact a non blocking system call. But io(3)
> states "The libaio library defines a new set of I/O operations which
> can significantly reduce the time an application spends waiting at I/O.
> The new functions allow a program to initiate one or more I/O operations
> and then immediately resume normal work while the I/O operations are
> executed in parallel."
Not in the linux man pages, but a few folks around have web pages I
was able to google about actual behavior:
http://lse.sourceforge.net/io/aio.html
Like you, I trusted that the man pages actually described the behavior
in my software design until the "nonblocking" read and writev calls
choked off the nonblocking sockets :-) Now I am writing extra test code
for all system calls I consider using; not sure if there is actual public
test code or not.
You are absolutely right, the blocking behavior of libaio without
specifying O_DIRECT should definitely be in the man pages too. Why
isn't it an error to omit O_DIRECT when that's the only way libaio
does actual async io on block devs? It seems a bit odd to go to the
trouble of using libaio if synchronous behavior is expected?!
You might want to wait and see if my man patch even gets applied
before going to the trouble to make another one. Alan Cox suggested I
post a patch to spell out the actual behavior of the blocking
"O_NONBLOCK" read and write class of calls. I did that and a number
of us vetted the patch before I posted it to linux-man like a week
ago, but no feedback there from Michael Kerrisk or anyone else yet.
Maybe he's on holiday, or maybe someone else can also carry the man
page pumpkin, I don't know...
Either way I imagine lkml sees this over and over again and fixing the
man pages would go a long way toward cutting down on confusion.
- Mike