On 2019-02-08, Steve Keller <kel...@no.invalid> wrote:
> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2).
This is not always the case. Basically the file has to be large enough
to justify the overhead of setting up a new mapping.
A program that repeatedly processes files by reading them into buffers
from malloc can perform better, because malloc can efficiently re-use
liberated memory without having to make system calls.
A program that repeatedly processes small files using mmap is constantly
making calls to mmap and munmap. These are expensive, and additionally
so because they manipulate the address space.
Basically the cost of the mmap operation has to be amortized somehow:
the best situation is that very large files are processed, and
infrequently so. The case for mmap is stronger still if random access
is required.
> If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory. OTOH, if the processes malloc
> some memory and use read() to fill it with file data, the memory is
> not shared, because (1) it will be aligned differently in these
> processes and (2) each process writes to the memory causing a private
> copy to be created.
However, often we can process an arbitrarily large file with only a
small buffer of a few kilobytes. That includes random access, which is
achieved by seeking around in the file.
Ten processes passing over the same gigabyte file using 4 kilobyte
buffers are allocating only 40 kilobytes in total.
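To make the small-buffer approach concrete, here is a minimal sketch of
one pass over a file of any size through a single 4 kilobyte buffer.
The function name count_newlines is made up for illustration, and a
robust version would also retry read when it fails with EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Count newline bytes in a file of any size using one fixed 4 KB buffer.
 * count_newlines is a hypothetical helper, not something from the post.
 * Returns the count, or -1 if open or read fails. */
long long count_newlines(const char *path)
{
    char buf[4096];
    long long count = 0;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Each read refills the same buffer; memory use does not grow
     * with the size of the file. */
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                count++;
    }

    close(fd);
    return n < 0 ? -1 : count;
}
```

For random access with the same buffer, lseek before the read, or a
single pread at the wanted offset, does the job.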
Ten processes mmapping the same gigabyte file means a gigabyte memory
map exists. The madvise system call can help here.
(To present a balanced view, we must observe that mmap doesn't have to
map the entire file at once, either. Also, a mapping can be destroyed
piece-wise, rather than all at once: munmap can be called on portions of
a mapping that we know we are not going to touch.)
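As a rough sketch of mapping only part of a file at a time, the
function below walks a file one window at a time and unmaps each window
before mapping the next. The name sum_bytes_windowed and the 256-page
window are arbitrary choices of mine. It uses posix_madvise, the
POSIX-standardized counterpart of madvise, purely as a hint.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum all bytes of a file while keeping at most one window mapped.
 * sum_bytes_windowed is a hypothetical helper for illustration.
 * Returns 0 on success (sum stored in *sum_out), -1 on error. */
int sum_bytes_windowed(const char *path, unsigned long long *sum_out)
{
    struct stat st;
    unsigned long long sum = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /* mmap offsets must be multiples of the page size, so the window
     * is a whole number of pages (256 is an arbitrary pick). */
    off_t window = (off_t)sysconf(_SC_PAGESIZE) * 256;

    for (off_t off = 0; off < st.st_size; off += window) {
        off_t left = st.st_size - off;
        size_t len = (size_t)(left < window ? left : window);

        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* Advisory only: tell the kernel we read front to back. */
        (void)posix_madvise(p, len, POSIX_MADV_SEQUENTIAL);

        for (size_t i = 0; i < len; i++)
            sum += p[i];

        /* Drop this window before mapping the next one. */
        munmap(p, len);
    }

    close(fd);
    *sum_out = sum;
    return 0;
}
```

Note that every window costs an mmap and a munmap call, so the window
has to be big enough for that overhead to be amortized, same as before.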
> So I think one should prefer mmap() to access files, but how can
> errors be handled portably, then? On file I/O errors I get an error
> return code from read() (e.g. EIO), but with mmap() I typically get a
> SIGSEGV. How should I handle this?
In a utility program that can just bail on errors, you don't have to
bother too much. Fetch the size of the file upfront (for instance, call
stat(file, &stbuf) and take stbuf.st_size). Then map just that many
bytes. If the file happens to shrink, let the chips land where they may.
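A minimal sketch of that bail-on-error approach might look like the
following. The name map_whole_file is hypothetical, and I have used
fstat on the already-open descriptor rather than stat on the path, so
that the size and the mapping refer to the same file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only, sized by what fstat reports right now.
 * map_whole_file is a hypothetical helper for illustration.
 * On success returns the mapping and stores its length in *len_out.
 * Returns NULL on error, and also for an empty file, because mmap
 * rejects a length of zero. */
void *map_whole_file(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);              /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

The caller releases the mapping with munmap(p, len). Nothing here
protects against the file shrinking afterwards.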
In a robust application, you have to deal with the SIGBUS if you access
the mapping beyond the end of the file.
The signal handling for SIGBUS is about as portable as mmap itself:
either way, you're writing a POSIX application.
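One common way to turn the SIGBUS into an ordinary error return is
sigsetjmp plus siglongjmp around the code that touches the mapping.
What follows is only a sketch for a single-threaded program; the names
install_sigbus_handler and copy_from_mapping are invented here, and a
threaded program would need per-thread jump buffers.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf bus_jmp;               /* where to resume after a fault */
static volatile sig_atomic_t bus_armed;  /* nonzero only during guarded access */

static void on_sigbus(int sig)
{
    (void)sig;
    if (bus_armed)
        siglongjmp(bus_jmp, 1);  /* unwind back into copy_from_mapping */

    /* A SIGBUS we were not expecting: fall back to the default action. */
    signal(SIGBUS, SIG_DFL);
    raise(SIGBUS);
}

/* Install the handler once at startup. Returns 0 on success. */
int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Copy len bytes out of a mapping into ordinary memory.
 * Returns 0 on success, -1 if a SIGBUS arrived during the copy. */
int copy_from_mapping(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Second argument 1: save the signal mask, so that siglongjmp
     * restores it and SIGBUS is unblocked again afterwards. */
    if (sigsetjmp(bus_jmp, 1) != 0) {
        bus_armed = 0;
        return -1;
    }

    bus_armed = 1;
    /* A plain loop, so the handler never jumps out of a library call. */
    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
    bus_armed = 0;
    return 0;
}
```

The idea is to copy what you need out of the mapping through the
guarded function and treat a -1 like a failed read, then do the real
work on the private copy.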