mmap vs. read

Steve Keller

Feb 8, 2019, 6:40:21 AM

AFAIU, reading files using mmap(2) has some performance benefits
compared to read(2). If a number of processes read the same file and
each process mmap()s the file into its address space to read it, then
only one copy of the file is in memory. OTOH, if the processes malloc
some memory and use read() to fill it with file data, the memory is
not shared, because (1) it will be aligned differently in these
processes and (2) each process writes to the memory causing a private
copy to be created.

So I think one should prefer mmap() to access files, but how can
errors be handled portably, then? On file I/O errors I get an error
return code from read() (e.g. EIO), but with mmap() I typically get a
SIGSEGV. How should I handle this?

Steve

Richard Kettlewell

Feb 8, 2019, 10:39:01 AM

Steve Keller <kel...@no.invalid> writes:
> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2). If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory. OTOH, if the processes malloc
> some memory and use read() to fill it with file data, the memory is
> not shared, because (1) it will be aligned differently in these
> processes and (2) each process writes to the memory causing a private
> copy to be created.
>
> So I think one should prefer mmap() to access files,

Profile first; historically at least mmap was not reliably faster than
read/write. Fiddling with page tables can be quite expensive.

> but how can errors be handled portably, then? On file I/O errors I
> get an error return code from read() (e.g. EIO), but with mmap() I
> typically get a SIGSEGV. How should I handle this?

Pass.

--
https://www.greenend.org.uk/rjk/

Casper H.S. Dik

Feb 8, 2019, 11:15:46 AM

Richard Kettlewell <inv...@invalid.invalid> writes:

>Steve Keller <kel...@no.invalid> writes:
>> AFAIU, reading files using mmap(2) has some performance benefits
>> compared to read(2). If a number of processes read the same file and
>> each process mmap()s the file into its address space to read it, then
>> only one copy of the file is in memory. OTOH, if the processes malloc
>> some memory and use read() to fill it with file data, the memory is
>> not shared, because (1) it will be aligned differently in these
>> processes and (2) each process writes to the memory causing a private
>> copy to be created.
>>
>> So I think one should prefer mmap() to access files,

>Profile first; historically at least mmap was not reliably faster than
>read/write. Fiddling with page tables can be quite expensive.

Yeah, though over time, memory closer to the CPU (cache, memory, page
tables) has become much faster, and CPUs have sped up even more quickly.
Storage, however, has lagged behind.

>> but how can errors be handled portably, then? On file I/O errors I
>> get an error return code from read() (e.g. EIO), but with mmap() I
>> typically get a SIGSEGV. How should I handle this?

>Pass.

Catch the signal with an SA_SIGINFO handler and check where the memory
fault is (the siginfo may also say why it failed). Simply returning from
such a signal handler is not possible; you will need to resume somewhere
else.

That is, catching errors is pretty hard in that case, especially when
writing.
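
A minimal sketch of the "resume somewhere else" idea, assuming a POSIX
system and a single guarded mapping. All names here (on_fault,
guarded_sum, demo_truncated_mapping, the guard_* globals) are
hypothetical and made up for illustration. The handler inspects
si_addr and, if the fault lies inside the guarded range, jumps back
with siglongjmp() so the caller gets an error return instead of dying.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical globals describing the one mapping being guarded. */
static sigjmp_buf fault_jmp;
static volatile sig_atomic_t guard_armed;
static volatile uintptr_t guard_lo, guard_hi;

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    (void)ctx;
    if (guard_armed && addr >= guard_lo && addr < guard_hi)
        siglongjmp(fault_jmp, 1);   /* resume inside guarded_sum() */
    /* Not our mapping: restore the default action and return, so the
       faulting instruction re-runs and terminates the process. */
    signal(sig, SIG_DFL);
}

int install_fault_handler(void)
{
    struct sigaction sa;
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Sum len bytes starting at p. Returns 0 on success, or -1 if touching
   the range raised SIGBUS/SIGSEGV (the moral equivalent of EIO). */
int guarded_sum(const unsigned char *p, size_t len, unsigned long *out)
{
    volatile unsigned long sum = 0;
    guard_lo = (uintptr_t)p;
    guard_hi = guard_lo + len;
    if (sigsetjmp(fault_jmp, 1) != 0) {   /* 1: also restore signal mask */
        guard_armed = 0;
        return -1;
    }
    guard_armed = 1;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    guard_armed = 0;
    *out = sum;
    return 0;
}

/* Demo: map a one-page temp file, read it, shrink the file to zero and
   read again. Returns 1 if the first read worked and the second fault
   was caught, 0 if not, -2 on setup failure. */
int demo_truncated_mapping(void)
{
    char path[] = "/tmp/mmapdemoXXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    unsigned long sum = 0;
    int fd = mkstemp(path);
    if (fd < 0 || page <= 0)
        return -2;
    unlink(path);                         /* file lives on through fd */
    if (ftruncate(fd, page) != 0 || install_fault_handler() != 0) {
        close(fd);
        return -2;
    }
    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -2;
    }
    int before = guarded_sum(p, (size_t)page, &sum);  /* file intact */
    int shrunk = ftruncate(fd, 0);                    /* page now past EOF */
    int after = guarded_sum(p, (size_t)page, &sum);
    munmap(p, (size_t)page);
    close(fd);
    if (shrunk != 0)
        return -2;
    return (before == 0 && after == -1) ? 1 : 0;
}
```

Jumping out of the handler is only reasonable here because the
interrupted code is a plain memory loop; if the fault could hit inside
a non-async-signal-safe library call, this approach gets much shakier.
For writes through a shared mapping the situation is worse still, since
the error may only surface at writeback time.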

Casper

Marcel Mueller

Feb 8, 2019, 12:02:14 PM

On 08.02.19 at 12:40, Steve Keller wrote:

> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2). If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory.

This is significant if and only if
(1) the file is sufficiently large,
(2) the file is opened by multiple processes and
(3) the file is not processed as a stream.

But if the file is large, you probably do not want to load it into
memory completely at all. Most large files are processed as a stream
with a limited buffer size.

> So I think one should prefer mmap() to access files, but how can

I do not agree. Quite the contrary. You should use mmap only if you /need/ it.

> errors be handled portably, then?

If you really need mmap, it is likely that any I/O error is fatal for
your application. So the question is less likely to arise.

> On file I/O errors I get an error
> return code from read() (e.g. EIO), but with mmap() I typically get a
> SIGSEGV. How should I handle this?

With a signal handler. Of course you have to examine where the error
occurs and whether it is in your mapped memory area.

Marcel

blt_u...@xvjhmg9ueyj23p1690akks_mo.net

Feb 8, 2019, 12:32:36 PM

On 08 Feb 2019 16:15:44 GMT
Casper H.S. Dik <Caspe...@OrSPaMcle.COM> wrote:
>Richard Kettlewell <inv...@invalid.invalid> writes:
>
>>Steve Keller <kel...@no.invalid> writes:
>>> AFAIU, reading files using mmap(2) has some performance benefits
>>> compared to read(2). If a number of processes read the same file and
>>> each process mmap()s the file into its address space to read it, then
>>> only one copy of the file is in memory. OTOH, if the processes malloc
>>> some memory and use read() to fill it with file data, the memory is
>>> not shared, because (1) it will be aligned differently in these
>>> processes and (2) each process writes to the memory causing a private
>>> copy to be created.
>>>
>>> So I think one should prefer mmap() to access files,
>
>>Profile first; historically at least mmap was not reliably faster than
>>read/write. Fiddling with page tables can be quite expensive.
>
>Yeah, though over time, memory closer to the CPU (cache, memory, page
>tables) has become much faster, and CPUs have sped up even more quickly.
>Storage, however, has lagged behind.

Aren't the higher-level I/O routines, e.g. fread() etc., supposed to be written
to use the best access method on a given architecture?

Mikko Rauhala

Feb 8, 2019, 2:00:28 PM

On Fri, 8 Feb 2019 17:32:33 +0000 (UTC),
blt_uYh21j@xvjhmg9ueyj23p1690akks_mo.net
<blt_uYh21j@xvjhmg9ueyj23p1690akks_mo.net> wrote:
> Aren't the higher-level I/O routines, e.g. fread() etc., supposed to be written
> to use the best access method on a given architecture?

The fread() API necessarily makes at least one copy of the data, and that
copy is not (easily) shareable. Internally, of course, it may use whatever
method it wants to get at the data to be copied.

--
Mikko Rauhala - m...@iki.fi - http://rauhala.org/

Kaz Kylheku

Feb 8, 2019, 2:09:38 PM

On 2019-02-08, Steve Keller <kel...@no.invalid> wrote:
> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2).

This is not always the case. Basically, the file has to be large enough
to justify the overhead of setting up a new mapping.

A program that repeatedly processes files by reading them into buffers
from malloc can perform better, because malloc can efficiently re-use
liberated memory without having to make system calls.

A program that repeatedly processes small files using mmap is constantly
making calls to mmap and munmap. These are expensive, and additionally
so because they manipulate the address space.

Basically the cost of the mmap operation has to be amortized somehow:
the best situation is that very large files are processed, and
infrequently so. Furthermore, random access is required.

> If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory. OTOH, if the processes malloc
> some memory and use read() to fill it with file data, the memory is
> not shared, because (1) it will be aligned differently in these
> processes and (2) each process writes to the memory causing a private
> copy to be created.

However, often we can process an arbitrarily large file with only a
small buffer of a few kilobytes. Including doing random access, achieved
by seeking around in the file.

Ten processes passing over the same gigabyte file using 4 kilobyte
buffers are allocating only 40 kilobytes in total.
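
As a rough sketch of that streaming style, assuming plain POSIX
open()/read(): the function below passes over a file of any size with
one fixed 4 KiB buffer. The name count_newlines and the choice of
counting newlines are made up for illustration. Note that an I/O error
shows up as an ordinary return value with errno set (e.g. EIO), which
is exactly what the mmap approach lacks.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Count '\n' bytes in a file using a single 4 KiB stack buffer.
   Returns the count, or -1 with errno set if open() or read() fails. */
long count_newlines(const char *path)
{
    unsigned char buf[4096];
    long lines = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, just retry */
            int saved = errno;          /* close() may clobber errno */
            close(fd);
            errno = saved;
            return -1;
        }
        for (ssize_t i = 0; i < n; i++)
            lines += (buf[i] == '\n');
    }
    close(fd);
    return lines;
}
```

Memory use stays at one buffer per process no matter how large the
file is; the file data itself is still shared between processes via
the kernel's page cache.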

Ten processes mmapping the same gigabyte file means a gigabyte memory
map exists. The madvise system call can help here.

(To present a balanced view, we must observe that mmap doesn't have to
map the entire file at once, either. Also, a mapping can be destroyed
piece-wise, rather than all at once: munmap can be called on portions of
a mapping that we know we are not going to touch.)

> So I think one should prefer mmap() to access files, but how can
> errors be handled portably, then? On file I/O errors I get an error
> return code from read() (e.g. EIO), but with mmap() I typically get a
> SIGSEGV. How should I handle this?

In a utility program that can just bail on errors, you don't have to
bother too much. Fetch the size of the file upfront (for instance, call
stat(file, &stbuf) on it and take stbuf.st_size). Then map just that
size. If the file happens to shrink, let the chips land where they may.
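
The utility-program approach might look roughly like this. It is a
sketch only: map_file is a hypothetical name, and it swaps stat() on
the path for fstat() on the already opened descriptor so that the size
and the mapping refer to the same file. There is deliberately no
signal handling; if the file shrinks afterwards, the process takes the
signal and dies.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. On success returns the mapping and
   stores its length in *len; returns NULL on any error or if the
   file is empty (mmap() rejects a zero length). */
const unsigned char *map_file(const char *path, size_t *len)
{
    struct stat stbuf;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &stbuf) != 0 || stbuf.st_size <= 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)stbuf.st_size, PROT_READ, MAP_PRIVATE,
                   fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;
    *len = (size_t)stbuf.st_size;
    return p;
}
```

The caller releases the mapping with munmap() using the same length
that was stored in *len.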

In a robust application, you have to deal with the SIGBUS if you access
the mapping beyond the end of the file.

The signal handling for SIGBUS is about equally portable as mmap: you're
writing a POSIX application.

Kaz Kylheku

Feb 8, 2019, 2:13:44 PM

On 2019-02-08, Richard Kettlewell <inv...@invalid.invalid> wrote:
> Steve Keller <kel...@no.invalid> writes:
>> AFAIU, reading files using mmap(2) has some performance benefits
>> compared to read(2). If a number of processes read the same file and
>> each process mmap()s the file into its address space to read it, then
>> only one copy of the file is in memory. OTOH, if the processes malloc
>> some memory and use read() to fill it with file data, the memory is
>> not shared, because (1) it will be aligned differently in these
>> processes and (2) each process writes to the memory causing a private
>> copy to be created.
>>
>> So I think one should prefer mmap() to access files,
>
> Profile first; historically at least mmap was not reliably faster than
> read/write. Fiddling with page tables can be quite expensive.

I saw this recently on current PC hardware running Ubuntu 18.

There is a Debian patch for bsdiff which converts it from malloced
buffers to use mmap. (The patch has a bug in the unmapping, which I
fixed: it uses the compressed size of the source file to unmap it,
rather than the original size.)

I converted the bsdiff utility into a shared library, to use as a
subroutine in a program which calls it millions of times for small-ish
files.

The original read() version was found to be faster than the mmap()
version, so we dropped the patch instead of fixing its bug.

I hypothesized the poorer performance to be caused by the repeated
mapping and unmapping calls which manipulate the virtual address space
and require trips to the kernel. Whereas the malloced buffers can be
recycled without trips to the kernel or tweaking of the address space.
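
To illustrate the recycling (this is not the bsdiff code, just a
sketch with made-up names reuse_buf and slurp): a grow-only buffer
that is handed to every call. After the first few files it stops
growing, so later calls cost only open/read/close and never touch the
address space. In practice the allocator may still call into the
kernel when the buffer grows.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical reusable buffer; initialize with {0}. */
struct reuse_buf {
    unsigned char *data;
    size_t cap;
};

/* Read the whole file into rb, growing rb only when it is too small.
   Returns the number of bytes read, or -1 on error. */
ssize_t slurp(const char *path, struct reuse_buf *rb)
{
    size_t used = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    for (;;) {
        if (used == rb->cap) {              /* full (or never allocated) */
            size_t ncap = rb->cap ? rb->cap * 2 : 4096;
            unsigned char *nd = realloc(rb->data, ncap);
            if (nd == NULL) {
                close(fd);
                return -1;
            }
            rb->data = nd;
            rb->cap = ncap;
        }
        ssize_t n = read(fd, rb->data + used, rb->cap - used);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                          /* end of file */
        used += (size_t)n;
    }
    close(fd);
    return (ssize_t)used;
}
```

A caller processing millions of small files keeps one struct reuse_buf
for the whole run and calls free() on its data member once at the end.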