readsome() vs. fread()

James R. Kuyper

Feb 21, 2018, 12:06:24 PM

I'm writing some code that reads an unspecified number of fixed-length
records from a binary file; it will know how many there are when it
reaches the end of the file. It's normal for the last read to reach EOF,
but a format error if EOF is reached in the middle of a record, and I
need to detect and report if that situation occurs.
In C, I'd use fread() and check the return value for a short read.
Looking over the C++ standard, I came to the conclusion that readsome()
is what I should use for comparable purposes. However, I got results
that I don't understand when using readsome(). I created the following
programs to investigate those results (error handling suppressed for the
sake of clarity):

C version:

#include <stdio.h>

int main(int argc, char *argv[])
{
    /* Checking the result of fopen() is omitted, as noted above. */
    FILE *infile = fopen(argv[1], "rb");
    char buffer[256];
    size_t bytes;
    long records;

    /* fread() returns 0 only at end-of-file or on error, ending the loop. */
    for (records = 0; (bytes = fread(buffer, 1, sizeof(buffer), infile)) != 0;
         records++)
    {
        printf("%ld: %zu\n", records, bytes);
    }
    if (ferror(infile))
        perror(argv[1]);
    if (feof(infile))
        printf("EOF\n");
    return 0;
}

C++ version:
#include <iostream>
#include <fstream>

int main(int argc, char *argv[])
{
    std::ifstream infile(argv[1], std::ios_base::binary);
    char buffer[256];
    int records = 0;
    std::streamsize bytes;

    // The loop ends the first time readsome() returns 0.
    for (records = 0; (bytes = infile.readsome(buffer, sizeof buffer)) != 0;
         records++)
    {
        std::cout << records << ": " << bytes << std::endl;
    }
    // Report which stream state bits are set after the loop.
    if (infile.eof())
        std::cout << "EOF ";
    if (infile.bad())
        std::cout << "bad ";
    if (infile.rdstate() & std::ios_base::failbit)
        std::cout << "fail ";
    std::cout << std::endl;

    return 0;
}

When I run these programs using the same input file, which is 14648
bytes long, I get the following results:

~/testprog(99) gcc -std=c11 -pedantic -Wall -Wpointer-arith -Wcast-align
-Wstrict-prototypes -Wmissing-prototypes read_test_c.c -o read_test_c
~/testprog(100) ./read_test_c read_test
0: 256
1: 256
2: 256
3: 256
4: 256
5: 256
6: 256
7: 256
8: 256
9: 256
10: 256
11: 256
12: 256
13: 256
14: 256
15: 256
16: 256
17: 256
18: 256
19: 256
20: 256
21: 256
22: 256
23: 256
24: 256
25: 256
26: 256
27: 256
28: 256
29: 256
30: 256
31: 256
32: 256
33: 256
34: 256
35: 256
36: 256
37: 256
38: 256
39: 256
40: 256
41: 256
42: 256
43: 256
44: 256
45: 256
46: 256
47: 256
48: 256
49: 256
50: 256
51: 256
52: 256
53: 256
54: 256
55: 256
56: 256
57: 56
EOF

~/testprog(102) g++ -std=c++1y -pedantic -Wall -Wpointer-arith
-Wcast-align -ffor-scope -fno-gnu-keywords -fno-nonansi-builtins
-Wctor-dtor-privacy -Wnon-virtual-dtor -Wold-style-cast
-Woverloaded-virtual -Wsign-promo read_test_c++.cpp -o read_test_c++
~/testprog(103) ./read_test_c++ read_test
0: 256
1: 256
2: 256
3: 256
4: 256
5: 256
6: 256
7: 256
8: 256
9: 256
10: 256
11: 256
12: 256
13: 256
14: 256
15: 256
16: 256
17: 256
18: 256
19: 256
20: 256
21: 256
22: 256
23: 256
24: 256
25: 256
26: 256
27: 256
28: 256
29: 256
30: 256
31: 255
32: 256
33: 256
34: 256
35: 256
36: 256
37: 256
38: 256
39: 256
40: 256
41: 256
42: 256
43: 256
44: 256
45: 256
46: 256
47: 256
48: 256
49: 256
50: 256
51: 256
52: 256
53: 256
54: 256
55: 256
56: 256
57: 57

Could someone explain to me why the C++ version apparently read one more
byte than the C version, which is also one more byte than the file size?
Also, why was infile.eof() false at the end?

Barry Schwarz

Feb 21, 2018, 12:44:44 PM

On Wed, 21 Feb 2018 12:05:57 -0500, "James R. Kuyper"
<james...@verizon.net> wrote:

>I'm writing some code that reads an unspecified number of fixed-length
>records from a binary file; it will know how many there are when it
>reaches the end of the file. It's normal for the last read to reach EOF,
>but a format error if EOF is reached in the middle of a record, and I
>need to detect and report if that situation occurs.
>In C, I'd use fread() and check the return value for a short read.
>Looking over the C++ standard, I came to the conclusion that readsome()
>is what I should use for comparable purposes. However, I got results
>that I don't understand when using readsome(). I created the following
>programs to investigate those results (error handling suppressed for the
>sake of clarity):

>When I run these programs using the same input file, which is 14648
>bytes long, I get the following results:
>
>~/testprog(99) gcc -std=c11 -pedantic -Wall -Wpointer-arith -Wcast-align
>-Wstrict-prototypes -Wmissing-prototypes read_test_c.c -o read_test_c
>~/testprog(100) ./read_test_c read_test
>0: 256

<snip>

>56: 256
>57: 56
>EOF
>
>~/testprog(102) g++ -std=c++1y -pedantic -Wall -Wpointer-arith
>-Wcast-align -ffor-scope -fno-gnu-keywords -fno-nonansi-builtins
>-Wctor-dtor-privacy -Wnon-virtual-dtor -Wold-style-cast
>-Woverloaded-virtual -Wsign-promo read_test_c++.cpp -o read_test_c++
>~/testprog(103) ./read_test_c++ read_test
>0: 256

<snip>

>56: 256
>57: 57
>
>Could someone explain to me why the C++ version apparently read one more
>byte than the C version, which is also one more byte than the file size?
>Also, why infile.eof() was false at the end?

The description for readsome at cplusplus.com says that it will stop
reading when there is no data in the stream buffer, even if end of
file has not been reached. That may answer your second question.

What are the last few characters in the file? What are the last few
characters placed in your array when the short record is read?

--
Remove del for email

James R. Kuyper

Feb 21, 2018, 1:51:17 PM

I read something like that too, but it made readsome() seem rather
useless, unless you're getting excessively intimate with the details of
how the stream buffer gets filled, so I assumed that it was either a
mistake on their part, or a misunderstanding of what they were saying on
my part.

Still, that's probably the explanation for what I found (see below).
Reviewing the description of readsome() with that in mind, it means that
in_avail() has a different meaning than I thought it did.
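A small sketch of the difference, for anyone following along. It uses an in-memory std::istringstream rather than a file, and the names DrainResult, drain_with_readsome and drain_with_read are invented for this illustration. On a file stream readsome() may additionally stop at an internal buffer boundary, as with the 255-byte chunk in the output above; where exactly is up to the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Outcome of draining a stream with one of the two calls.
struct DrainResult {
    std::size_t bytes; // total characters extracted
    bool eof;          // was eofbit set when the loop stopped?
    bool fail;         // was failbit set when the loop stopped?
};

// Drain with readsome(): stops as soon as nothing is immediately available.
DrainResult drain_with_readsome(std::istream& in, std::size_t chunk)
{
    assert(chunk > 0);
    std::string buf(chunk, '\0');
    std::size_t total = 0;
    std::streamsize n;
    while ((n = in.readsome(&buf[0], static_cast<std::streamsize>(chunk))) > 0)
        total += static_cast<std::size_t>(n);
    return {total, in.eof(), in.fail()};
}

// Drain with read(): keeps extracting until the chunk is full or EOF is hit.
DrainResult drain_with_read(std::istream& in, std::size_t chunk)
{
    assert(chunk > 0);
    std::string buf(chunk, '\0');
    std::size_t total = 0;
    while (in.read(&buf[0], static_cast<std::streamsize>(chunk)))
        total += static_cast<std::size_t>(in.gcount());
    total += static_cast<std::size_t>(in.gcount()); // partial final chunk, if any
    return {total, in.eof(), in.fail()};
}
```

Draining std::istringstream("abcdefghij") in 4-byte chunks should extract all 10 characters either way, but only the read() loop is expected to leave eofbit and failbit set. The readsome() loop simply stops once in_avail() reports nothing buffered, which is why eof() can still be false afterwards.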

> What are the last few characters in the file? What are the last few
> characters placed in your array when the short record is read?

The file I was using for the test got erased. I'm using a different
file, of length 14704. I modified the code to print out the last 5
characters read in whenever it was less than the buffer size. As a
result, I discovered the reason for the results I saw:

...
31: 255
buffer[250]=0
buffer[251]=0
buffer[252]=0
buffer[253]=0
buffer[254]=0
...
57: 113
buffer[108]=0
buffer[109]=0
buffer[110]=0
buffer[111]=0
buffer[112]=0

I'd been assuming that all of the calls to readsome() returned 256
characters except the last one, and used that "fact" to calculate the
total number of bytes read, rather than having my program calculate the
actual total. A review of the actual outputs seemed to confirm that
assumption, which implies that I must have misread 255 as 256.
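The arithmetic can be checked by summing the sizes that were actually reported instead of assuming that every chunk but the last was full. A minimal sketch; the helper names total_bytes and observed_chunks are made up for illustration, and the sizes are the ones printed for the first test file.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: add up the sizes that each read call actually returned.
long total_bytes(const std::vector<long>& chunk_sizes)
{
    long total = 0;
    for (long n : chunk_sizes) {
        assert(n >= 0); // a read never reports a negative count
        total += n;
    }
    return total;
}

// Chunk sizes printed by the first C++ run: 58 chunks (numbered 0..57),
// all 256 bytes except chunk 31 (255 bytes) and the final chunk (57 bytes).
std::vector<long> observed_chunks()
{
    std::vector<long> sizes(58, 256);
    sizes[31] = 255;
    sizes[57] = 57;
    return sizes;
}
```

With the observed sizes the sum is 56*256 + 255 + 57 = 14648, which is the file size. Assuming 57 full chunks instead gives 57*256 + 57 = 14649, the apparent extra byte.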

Here are the last 2 lines of the output from od on the input file:
0034540 000001 000000 000000 000000 000000 000000 000000 000000
0034560

The problem is resolved, and readsome() is NOT a suitable replacement
for fread() for this purpose. I replaced

infile.readsome(buffer, sizeof buffer)

with

infile.read(buffer, sizeof buffer).gcount()

and that seemed to work exactly as I need it to work. Is there any danger
of read() stopping at the end of the stream buffer, the same way as
readsome()?
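For reference, a sketch of how the read()/gcount() combination can separate a clean end-of-file from one that falls in the middle of a record. The names ReadStatus, ReadSummary and read_records are assumptions made for this example, not taken from the real program, and the function accepts any std::istream so it can be tried on a string stream as well as a file.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

enum class ReadStatus { CleanEof, TruncatedRecord, IoError };

struct ReadSummary {
    std::size_t records; // complete records consumed
    ReadStatus status;   // why reading stopped
};

// Reads fixed-length records until the stream runs out.
ReadSummary read_records(std::istream& in, std::size_t record_size)
{
    assert(record_size > 0);
    std::vector<char> record(record_size);
    std::size_t count = 0;

    // read() leaves the stream good only if the whole record was extracted.
    while (in.read(record.data(), static_cast<std::streamsize>(record_size)))
        ++count; // a real program would process `record` here

    if (in.bad())
        return {count, ReadStatus::IoError};
    // After the failed read(), gcount() is the size of the partial record.
    if (in.gcount() != 0)
        return {count, ReadStatus::TruncatedRecord};
    return {count, ReadStatus::CleanEof};
}
```

Given 8 bytes and a record size of 4 this should report 2 records and CleanEof; given 10 bytes it should report 2 records and TruncatedRecord, which is the format error described at the top of the thread.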

Jorgen Grahn

Feb 21, 2018, 2:08:50 PM

On Wed, 2018-02-21, James R. Kuyper wrote:
> I'm writing some code that reads an unspecified number of fixed-length
> records from a binary file; it will know how many there are when it
> reaches the end of the file. It's normal for the last read to reach EOF,
> but a format error if EOF is reached in the middle of a record, and I
> need to detect and report if that situation occurs.
> In C, I'd use fread() and check the return value for a short read.
> Looking over the C++ standard, I came to the conclusion that readsome()
> is what I should use for comparable purposes.

Why not istream::read(buf, count)? That's the closest equivalent to
fread() for iostreams.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

James R. Kuyper

Feb 21, 2018, 2:37:35 PM

On 02/21/2018 02:08 PM, Jorgen Grahn wrote:
> On Wed, 2018-02-21, James R. Kuyper wrote:
>> I'm writing some code that reads an unspecified number of fixed-length
>> records from a binary file; it will know how many there are when it
>> reaches the end of the file. It's normal for the last read to reach EOF,
>> but a format error if EOF is reached in the middle of a record, and I
>> need to detect and report if that situation occurs.
>> In C, I'd use fread() and check the return value for a short read.
>> Looking over the C++ standard, I came to the conclusion that readsome()
>> is what I should use for comparable purposes.
>
> Why not istream::read(buf, count)? That's the closest equivalent to
> fread() for iostreams.

1. It doesn't return the count of characters read, though appending
".gcount()" resolves that problem.

2. 30.7.4.3p30 says
"Characters are extracted and stored until either of the following occurs:
(30.1) — n characters are stored;
(30.2) — end-of-file occurs on the input sequence (in which case the
function calls setstate(failbit | eofbit), which may throw
ios_base::failure)."

I'm relying on reaching end-of-file in order to know when I've read all
the records, so having read() throw an exception when that happens would
be inconvenient. However, my testing shows that no exception is thrown,
so I may be misunderstanding something.

Chris Vine

Feb 21, 2018, 2:47:14 PM

On Wed, 21 Feb 2018 13:50:54 -0500

No. In the absence of an error, it is guaranteed to provide the number
of characters requested (in your case, the buffer size) unless eof is
encountered, which will set failbit/eofbit and gcount() will indicate
the number of bytes (on a narrow stream) actually received.

If you are reading binary records you might consider doing without
ifstream entirely and reading directly from a filebuf using
std::basic_streambuf::sgetn(), and carrying on until traits_type::eof()
(-1 on narrow stream) is returned.

Chris

Scott Lurndal

Feb 21, 2018, 2:49:38 PM

I must ask - if the fread code worked, why change it? Valid C is generally
also valid C++.

Chris Vine

Feb 21, 2018, 2:52:11 PM

On Wed, 21 Feb 2018 14:37:14 -0500

"James R. Kuyper" <james...@verizon.net> wrote:

It will not throw on failbit or eofbit unless you explicitly set the
default exception mask to do so using the std::basic_ios::exceptions()
function.

As I have mentioned in another post, in your use I should consider
avoiding all this by using filebuf directly and not instantiate an
ifstream object.
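A minimal sketch of the exception mask at work, using an in-memory stream. The function name read_past_eof_throws is hypothetical; exceptions(), read() and std::ios_base::failure are the standard facilities.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <vector>

// Asks for more characters than the stream holds and reports whether
// std::ios_base::failure was thrown as a result.
bool read_past_eof_throws(std::istream& in, std::size_t request)
{
    assert(request > 0);
    std::vector<char> buf(request);
    try {
        in.read(buf.data(), static_cast<std::streamsize>(request));
    } catch (const std::ios_base::failure&) {
        return true;
    }
    return false;
}
```

On a default-constructed std::istringstream holding "abc", a 16-character request is expected to return false and leave eofbit and failbit set with gcount() equal to 3. If the caller first does in.exceptions(std::ios_base::failbit | std::ios_base::eofbit), the same request should return true.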

Chris

Chris Vine

Feb 21, 2018, 3:13:07 PM

On Wed, 21 Feb 2018 19:46:50 +0000
Chris Vine <chris@cvine--nospam--.freeserve.co.uk> wrote:
[snip]

> No. In the absence of an error, it is guaranteed to provide the
> number of characters requested (in your case, the buffer size) unless
> eof is encountered, which will set failbit/eofbit and gcount() will
> indicate the number of bytes (on a narrow stream) actually received.
>
> If you are reading binary records you might consider doing without
> ifstream entirely and reading directly from a filebuf using
> std::basic_streambuf::sgetn(), and carrying on until
> traits_type::eof() (-1 on narrow stream) is returned.

To correct that, it is uflow() which returns traits_type::eof() on end
of file. sgetn() just returns less than the number of characters
requested.
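A sketch of the streambuf-only approach, under the assumption that a short count from sgetn() is treated as the end of the data. RecordCount and count_records are names invented here; the function takes any std::streambuf, so a std::filebuf opened with std::ios_base::in | std::ios_base::binary or a std::stringbuf both work.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

struct RecordCount {
    std::size_t complete; // full records read
    std::size_t leftover; // bytes of a trailing partial record (0 if none)
};

// Pulls fixed-length records straight from a stream buffer with sgetn().
RecordCount count_records(std::streambuf& sb, std::size_t record_size)
{
    assert(record_size > 0);
    std::vector<char> record(record_size);
    const std::streamsize want = static_cast<std::streamsize>(record_size);
    RecordCount result{0, 0};
    for (;;) {
        const std::streamsize got = sb.sgetn(record.data(), want);
        if (got == want) {
            ++result.complete; // a real program would process `record` here
            continue;
        }
        // Short count: sgetn() has no state bits, so end-of-file and a
        // read error look the same from here.
        result.leftover = static_cast<std::size_t>(got);
        return result;
    }
}
```

One trade-off worth noting: without the istream layer there is no eofbit or badbit to consult, so a caller that must tell an I/O error from end-of-file needs some other means of doing so.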

James R. Kuyper

Feb 21, 2018, 3:30:06 PM

I mentioned fread() code because I'm a lot more familiar with C than
C++, and that's how I would have written this code if I were using C.
The actual code was written by someone else in C++, and interacts with a
variety of libraries written in C++. It has a bug, and I need to fix the
bug, and the minimum fix is sufficiently complicated to justify a
significant re-write. That's good, because the existing code is also
very clumsy - it looks like C code that has been translated into C++
code by someone significantly less familiar with C++ than I am.

Öö Tiib

Feb 21, 2018, 4:16:28 PM

I remember that the default buffers of MSVC were set unfavorably for fread
about 10 years ago, so fread was about half as fast as
ifstream when reading 64KB chunks from a large file. Consumer
file I/O was about 15 times slower then than it is now, so it sometimes
mattered. Setting an optimal buffer (with setvbuf() or
streambuf::pubsetbuf()) solved it, but surprisingly few
programmers were aware of those features.
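A sketch of the pubsetbuf() side of this. What pubsetbuf() does on a file buffer is implementation-defined, so treat the effect as a hint rather than a guarantee; the type BufferedInput and the sizes used are assumptions for illustration. The C counterpart would be a setvbuf() call made right after fopen() and before any other operation on the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Bundles a stream with the buffer it borrows. `buffer` is declared first,
// so it is destroyed last, after `stream` has stopped using it.
struct BufferedInput {
    std::vector<char> buffer;
    std::ifstream stream;

    BufferedInput(const std::string& path, std::size_t buffer_size)
        : buffer(buffer_size)
    {
        assert(buffer_size > 0);
        // Install the buffer before open(): some libraries ignore a call
        // made after I/O has started.
        stream.rdbuf()->pubsetbuf(buffer.data(),
                                  static_cast<std::streamsize>(buffer.size()));
        stream.open(path, std::ios_base::in | std::ios_base::binary);
    }
};
```

Usage would look like BufferedInput in("data.bin", 2 * 1024 * 1024); followed by ordinary in.stream.read() calls. The file name and the 2 MB figure are only examples, and whether a larger buffer helps has to be measured on the platform in question.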

Chris Vine

Feb 21, 2018, 5:06:54 PM

std::streambuf::xsgetn(), and so std::ifstream::read() and
std::filebuf::sgetn(), are allowed by the C++ standard on a large block
read (in effect, when the buffer size passed in is larger than the
streambuffer's own buffer size) to bypass the streambuffer's buffer
entirely. std::ifstream::read() and std::filebuf::sgetn() are then
passed on directly to unix read() or the windows equivalent.

I wonder if that was the reason for the difference with fread(). I am
not sure that fread() is entitled to do the same; or if it is, whether
that is ever done in fact.

Scott Lurndal

Feb 21, 2018, 5:13:41 PM

Chris Vine <chris@cvine--nospam--.freeserve.co.uk> writes:
>On Wed, 21 Feb 2018 13:15:50 -0800 (PST)
>Öö Tiib <oot...@hot.ee> wrote:
>> On Wednesday, 21 February 2018 21:49:38 UTC+2, Scott Lurndal wrote:
>> > "James R. Kuyper" <james...@verizon.net> writes:
>> > >On 02/21/2018 12:44 PM, Barry Schwarz wrote:
>> >
>> > >The problem is resolved, and readsome() is NOT a suitable
>> > >replacement for fread() for this purpose. I replaced
>> > >
>> > > infile.readsome(buffer, sizeof buffer)
>> > >
>> > >with
>> > >
>> > > infile.read(buffer, sizeof buffer).gcount()
>> > >
>> > >and that seemed to work exactly as I need it work. Is there any
>> > >danger of read() stopping at the end of the stream buffer, the
>> > >same way as readsome()?
>> >
>> > I must ask - if the fread code worked, why change it? Valid C is
>> > generally also valid C++.
>>
>> I remember default buffers of MSVC were set unfavorably for fread
>> about 10 years ago so fread performed about twice worse than
>> ifstream when reading 64KB chunks from large file. Consumer
>> file i/o was about 15 times slower than now then, so it sometimes
>> mattered. Setting optimal buffer (with setvbuf() or
>> streambuf::pubsetbuf() ) solved it but surprisingly few
>> programmers were aware of those features.
>
>std::streambuf::xsgetn(), and so std::ifstream::read() and
>std::filebuf::sgetn(), are allowed by the C++ standard on a large block
>read (in effect, when the buffer size passed in is larger than the
>streambuffer's own buffer size) to bypass the streambuffer's buffer
>entirely. std::ifstream::read() and std::filebuf::sgetn() are then
>passed on directly to unix read() or the windows equivalent.
>
>I wonder if that was the reason for the difference with fread(). I am
>not sure that fread() is entitled to do the same; or if it is, whether
>that is ever done in fact.
>

fread will return the number of bytes requested unless EOF or an error
occurs, regardless of the size of the stdio buffer.

Chris Vine

Feb 21, 2018, 5:38:50 PM

As will std::filebuf::sgetn(). The issue is whether the internal
buffers are short-circuited or not. On an optimized block read by
std::filebuf::sgetn(), anything in the buffers will first be extracted
and then a call to unix read() will be made directly into the buffer
passed in to std::filebuf::sgetn() (rather than into the
streambuffer's internal buffer).

I was hypothesizing that the poorer performance of fread() on large
block transfers which was reported may be caused by the fact that it
either cannot, or does not, short-circuit in this way.

Chris Vine

Feb 21, 2018, 5:59:56 PM

So far as I understand the C standard, it looks as if fread() cannot
make this optimization. The C standard says about fread(): "For each
object, 'size' calls are made to the fgetc function and the results
stored, in the order read, in an array of unsigned char exactly
overlaying the object."

fgetc will always go via the internal stream buffer, for efficiency
reasons.

Scott Lurndal

Feb 21, 2018, 6:53:03 PM

I'm not sure why anyone would actually do fread for large block transfers,
but maybe they're limited to windows. On any POSIX system, pread(2)/pwrite(2) are
the easy and efficient[*] way to access and modify fixed sized records.

[*] mmap(2) wins on the efficiency metric.

Chris Vine

Feb 21, 2018, 6:59:59 PM

On Wed, 21 Feb 2018 23:52:52 GMT
sc...@slp53.sl.home (Scott Lurndal) wrote:
[snip]

> I'm not sure why anyone would actually do fread for large block
> transfers, but maybe they're limited to windows. On any POSIX
> system, pread(2)/pwrite(2) are the easy and efficient[*] way to
> access and modify fixed sized records.
>
> [*] mmap(2) wins on the efficiency metric.

Quite so. Presumably the authors of the C standard take the view that
if you want unix read() you should use unix read(). It is not very
different from fread() apart from the fact that read() might do a short
read so you need to put it in a do loop until the request is met or
end-of-file is encountered, and that you need to account for EINTR (both
of which outcomes are handled automatically by fread()).

Presumably the C++ standard authors thought that the C++ abstractions
were better, so the optimization needs to be catered for. If so, I
think I agree with them.
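The loop around unix read() mentioned above might look like the following sketch. It is POSIX-only, and the function name read_fully is an assumption for this example; read(), errno and EINTR are the standard POSIX pieces.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Keeps calling read() until `count` bytes have arrived or end-of-file is hit.
// Returns the number of bytes read (less than `count` only at end-of-file),
// or -1 on error with errno as set by read().
ssize_t read_fully(int fd, void* buffer, std::size_t count)
{
    assert(buffer != nullptr || count == 0);
    char* out = static_cast<char*>(buffer);
    std::size_t done = 0;
    while (done < count) {
        const ssize_t n = ::read(fd, out + done, count - done);
        if (n > 0) {
            done += static_cast<std::size_t>(n); // possibly a short read
        } else if (n == 0) {
            break;                               // end-of-file
        } else if (errno != EINTR) {
            return -1;                           // genuine error
        }
        // n < 0 with EINTR: interrupted before any data arrived, so retry.
    }
    return static_cast<ssize_t>(done);
}
```

A return value that is neither 0 nor the record size would then signal end-of-file in the middle of a record, much like a short count from fread().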

James Kuyper

Feb 21, 2018, 11:30:02 PM

On 02/21/2018 05:59 PM, Chris Vine wrote:
> On Wed, 21 Feb 2018 22:38:33 +0000
> Chris Vine <chris@cvine--nospam--.freeserve.co.uk> wrote:
>
>> On Wed, 21 Feb 2018 22:13:19 GMT
>> sc...@slp53.sl.home (Scott Lurndal) wrote:
>>> Chris Vine <chris@cvine--nospam--.freeserve.co.uk> writes:

...

Yes, but keep in mind that it's only the observable behavior (5.1.2.3p6)
of a program that is constrained by those requirements. The difference
between actually calling fgetc() separately for each byte and doing a
single large block read doesn't involve anything that qualifies as
observable behavior. The speed with which something happens does NOT
qualify as "observable behavior" as that term is defined by the C
standard (even though it is trivially easy to observe such behavior).

James Kuyper

Feb 21, 2018, 11:41:53 PM

The distinction you're making doesn't really exist. Both standards
define "observable behavior" almost identically, and both allow any
optimization that produces observable behavior that's consistent with
the standards' requirements, even if it is not produced by the same
mechanism as that described in the requirements. That is sufficient
freedom to allow the same optimization for fread() and
std::streambuf::xsgetn().

Paavo Helde

Feb 22, 2018, 1:56:08 AM

On 22.02.2018 1:52, Scott Lurndal wrote:
>
> I'm not sure why anyone would actually do fread for large block transfers,
> but maybe they're limited to windows. On any POSIX system, pread(2)/pwrite(2) are
> the easy and efficient[*] way to access and modify fixed sized records.
>
> [*] mmap(2) wins on the efficiency metric.

+1 for mmap. I'm baffled why anybody would discuss the relative speed
of various methods of copying binary file content when there is a way to
avoid the copying step altogether, at least on the more common platforms.
If you are not using mmap it means you are not interested in performance,
so why discuss this at such great length?

Öö Tiib

Feb 22, 2018, 2:19:07 AM

Yes, I did bring up an example from 10 years ago, with Windows as the
target platform. Reading a 1GB file in 64kB chunks with default buffers
took fread about 49 sec and ifstream::read about 24 sec.

>
> [*] mmap(2) wins on the efficiency metric.

When I set the buffer size to 2 MB on a current MacBook, both fread
and ifstream::read read a 1 GB file in 2 sec, and the chunk size does
not seem to affect it. The 2 seconds is likely the limit of the SSD,
so it appears that mmap() is overkill for mundane sequential reads on
this platform. Since mmap() is more error-prone, it should perhaps
only be used when it is easier to understand. For example, for random
access to a large file, code using mmap is easier to understand than
code that winds the streams back and forth.

Paavo Helde

Feb 22, 2018, 4:46:22 AM

What do you mean by mmap being more error-prone? I do not recall having
any problems with it ever.

About read/fread/pread: any reading of file content into the user-space
buffer first reads it into the OS disk cache[*], then copies it over to
the user space buffer. The second step is omitted by mmap(), and the
first step is performed only for pages you touch. Also, if the file
already happens to be in the disk cache, the first step is altogether
omitted.

[*] On some platforms one can specify some flags like O_DIRECT to bypass
the disk cache, but this will likely make the program slower, not faster.
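For comparison with the read-based versions, a POSIX-only sketch of scanning fixed-length records through a read-only mapping. MappedScan and scan_records_mmap are names made up for this example, error reporting is reduced to a single flag, and the per-record "work" is just a sum of each record's first byte so that the pages are actually touched.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedScan {
    bool ok;                      // false if open/fstat/mmap failed
    std::size_t records;          // number of complete records
    std::size_t leftover;         // bytes of a trailing partial record
    unsigned long first_byte_sum; // toy per-record work
};

MappedScan scan_records_mmap(const char* path, std::size_t record_size)
{
    assert(record_size > 0);
    MappedScan result{false, 0, 0, 0};

    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return result;

    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return result; }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    if (size == 0) {              // mmap() rejects zero-length mappings
        ::close(fd);
        result.ok = true;
        return result;
    }

    void* base = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);                  // the mapping stays valid after close()
    if (base == MAP_FAILED) return result;

    // Records are used in place: record i starts at data + i * record_size.
    const unsigned char* data = static_cast<const unsigned char*>(base);
    result.records = size / record_size;
    result.leftover = size % record_size;
    for (std::size_t i = 0; i < result.records; ++i)
        result.first_byte_sum += data[i * record_size];
    result.ok = true;

    ::munmap(base, size);
    return result;
}
```

The truncated-record check becomes a remainder on the file size, since the whole length is known up front. As noted elsewhere in the thread, this only works for regular files, not pipes or terminals.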

Jorgen Grahn

Feb 22, 2018, 6:17:22 AM

You can only mmap "true" files -- that's often my reason for not using it.
(Also, working on Unix, I rarely have to do binary I/O.)

Öö Tiib

Feb 22, 2018, 7:45:36 AM

On Thursday, 22 February 2018 11:46:22 UTC+2, Paavo Helde wrote:
> On 22.02.2018 9:18, Öö Tiib wrote:
> > On Thursday, 22 February 2018 01:53:03 UTC+2, Scott Lurndal wrote:
> >> [*] mmap(2) wins on the efficiency metric.
> >
> > When I set buffer size to 2 MB on current Mac-book then both fread
> > and ifstream::read read 1 GB file with 2 sec and the chunks size does
> > not seemingly affect it. The 2 seconds is likely the limit of SSD
> > so appears that mmap() is overkill for mundane sequential reads on
> > this platform. Since mmap() is more error-prone it should be perhaps
> > only used when it is easier to understand. For example for random
> > access of large file the code using mmap is easier to understand than
> > code that winds the streams back and forth.
>
> What do you mean by mmap being more error-prone? I do not recall having
> any problems with it ever.

mmap() does nothing a good programmer can't handle; it is merely more
complex, and so the mechanism is more error-prone. You likely already
know the details; I'll just give the first 3 that come to mind:
* memory mapping works in fixed-size pages (let's say multiples of 4KB).
In the general case that does not match file sizes (let's say 5KB),
and the mismatch always provides a niche for fun for the next maintainer.
* when the file size exceeds the addressable space (say 3GB on a 32-bit
system whose kernel uses 2GB) then orchestrating the mapped portions
can be fun.
* I/O errors raise SIGSEGV on Mac and EXCEPTION_IN_PAGE_ERROR
on Windows. Handling the user ejecting mapped media during access
is fun.
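To make the page-size point concrete: a mapping has to start at a file offset that is a multiple of the page size, so mapping an arbitrary window of a file means rounding the offset down and remembering how far into the mapping the wanted data begins. A sketch of that arithmetic only; MapWindow and window_for are hypothetical names, and the page size is passed in as a parameter here, whereas real code would ask the system for it.

```cpp
#include <cassert>
#include <cstddef>

struct MapWindow {
    std::size_t map_offset; // page-aligned file offset to hand to the mapping call
    std::size_t map_length; // bytes to map, starting at map_offset
    std::size_t skip;       // where the requested data begins inside the mapping
};

// Computes the aligned window that covers [offset, offset + length).
MapWindow window_for(std::size_t offset, std::size_t length, std::size_t page_size)
{
    assert(page_size > 0);
    const std::size_t skip = offset % page_size; // distance past the page start
    return {offset - skip, length + skip, skip};
}
```

For example, with 4096-byte pages a request for 100 bytes at offset 5000 becomes a mapping of 1004 bytes at offset 4096, with the data starting 904 bytes in. Forgetting the skip is exactly the kind of mismatch a later maintainer can trip over.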

> About read/fread/pread: any reading of file content into the user-space
> buffer first reads it into the OS disk cache[*], then copies it over to
> the user space buffer. The second step is omitted by mmap(), and the
> first step is performed only for pages you touch. Also, if the file
> already happens to be in the disk cache, the first step is altogether
> omitted.
>
> [*] On some platforms one can specify some flags like O_DIRECT to bypass
> the disk cache, but this will likely make the program slower, not faster.

That is copying from memory to memory. Its pace is something like
0.1 sec/GB, so it is not a major fraction of the 2 sec/GB total. 2 sec/GB
roughly matches the pace of the SSD from which the data is read.
Activity Monitor shows about 7.2% CPU for the ifstream-reading
process at that speed. So that 7.2% is the playground where the
alternatives can optimize out redundant memory-to-memory copies and the
like. It might be worth the effort, but it is unlikely to alter the speed
of the media (which seems to be the actual throughput bottleneck).

Manfred

Feb 22, 2018, 12:08:40 PM

On 2/21/2018 7:50 PM, James R. Kuyper wrote:
> On 02/21/2018 12:44 PM, Barry Schwarz wrote:
>> On Wed, 21 Feb 2018 12:05:57 -0500, "James R. Kuyper"

<snip>

>>> Could someone explain to me why the C++ version apparently read one more
>>> byte than the C version, which is also one more byte than the file size?
>>> Also, why infile.eof() was false at the end?

From Bjarne's book, about unformatted input:
"If you have a choice, use formatted input instead of these low-level
input functions." - the low-level input functions mentioned here include
istream::read()

>>
>> The description for readsome at cplusplus.com says that it will stop
>> reading when there is no data in the stream buffer, even if end of
>> file has not been reached. That may answer you second question.
>
> I read something like that too, but it made readsome() seem rather
> useless, unless you're getting excessively intimate with the details of
> how the stream buffer gets filled, so I assumed that it was either a
> mistake on their part, or a misunderstanding of what they were saying on
> my part.

From the same Bjarne's book, following page:
"The following functions depend on the detailed interaction between the
stream buffer and the real data source and should only be used if
necessary and then very carefully" - the functions mentioned here
include istream::readsome()

>
> Still, that's probably the explanation for what I found (see below).
> Reviewing the description of readsome() with that in mind, it means that
> in_avail() has a different meaning than I thought it did.
>

In my experience, the iostream library was historically one of the most
controversial among C++ standard libraries (as opposed to STL which has
been doing great since the beginning). C++11 improved a lot on this, but
still for unformatted (a.k.a. binary) IO, the good old fread()/fwrite()
are hard to beat in terms of code cleanliness and robustness, meaning
istream::read() probably reaches a comparable level, but not much more,
IMHO.
Apart from traps like the one indicated by Öö Tiib, I think the preference
for istream::read() would be more due to uniformity with the rest of the
code.

James R. Kuyper

Feb 22, 2018, 12:43:44 PM

On 02/22/2018 12:08 PM, Manfred wrote:
...

> From Bjarne's book, about unformatted input:
> "If you have a choice, use formatted input instead of these low-level
> input functions." - the low-level input functions mentioned here include
> istream::read()

I don't think I have a choice - this file contains binary data.

...

> still for unformatted (a.k.a. binary) IO, the good old fread()/fwrite()
> are hard to beat in terms of code cleanliness and robustness, meaning
> istream::read() probably reaches a comparable level, but not much more,
> IMHO.
> Apart of traps like that indicated by Öö Tiib, I think the preference
> for istream::read() would be more due to uniformity with the rest of the
> code.

Having compared them, I tend to agree with that. The C++ code I posted
was no simpler nor any more type safe than the C code.
The C++ code is, however, vastly more customizable: could have used
basic_ifstream<charT, traits>

James R. Kuyper

Feb 22, 2018, 12:55:14 PM

I accidentally hit "Send" on the wrong message, before it was complete.

On 02/22/2018 12:08 PM, Manfred wrote:
...

> From Bjarne's book, about unformatted input:
> "If you have a choice, use formatted input instead of these low-level
> input functions." - the low-level input functions mentioned here include
> istream::read()

I don't think I have a choice - this file contains binary data.

...

> still for unformatted (a.k.a. binary) IO, the good old fread()/fwrite()
> are hard to beat in terms of code cleanliness and robustness, meaning
> istream::read() probably reaches a comparable level, but not much more,
> IMHO.
> Apart of traps like that indicated by Öö Tiib, I think the preference
> for istream::read() would be more due to uniformity with the rest of the
> code.

Having compared them, I tend to agree with that. The C++ code I posted
was no simpler nor any more type safe than the C code.

The only significant advantage that the C++ code has is that it's vastly
more customizable: I could have used basic_ifstream<charT, traits> with
my own classes for charT or traits, or provided my own class derived
from basic_istream<>.

Manfred

Feb 22, 2018, 1:19:04 PM

True, but then you would probably not be handling the stream as a pure
bytestream, iow it would not be 'unformatted' in strict terms (and then
the question would be what is the behavior of read() with charT other
than char)
A more significant advantage, obviously, would be the use of a common
istream interface for different kinds of stream e.g. stringstream and
fstream.

Manfred

Feb 22, 2018, 8:35:53 PM

On 2/22/2018 1:45 PM, Öö Tiib wrote:
> On Thursday, 22 February 2018 11:46:22 UTC+2, Paavo Helde wrote:
>>
>> What do you mean by mmap being more error-prone? I do not recall having
>> any problems with it ever.
>
> mmap() does nothing a good programmer can't handle merely it is more
> complex to and so the mechanism is more error-prone. You likely
> already know the details i just give first 3 that pop into mind:
> * memory mapping uses fixed page length (lets say multiplies of 4KB).
> That does not on general case match with (lets say 5KB) file sizes
> and mismatch always provides niche for fun to next maintainer.
> * when file size exceeds the addressable space (say 3GB on 32 bit
> system whose kernel uses 2 GB) then orchestrating portions mapped
> can be fun.
> * i/o errors raise SIGSEGV on Mac and EXECUTE_IN_PAGE_ERROR
> on Windows. Handling user ejecting mapped media during access
> is fun.
>

Your idealization of 'fun' is...

Fun :)

Paavo Helde

Feb 23, 2018, 2:03:28 AM

On 22.02.2018 14:45, Öö Tiib wrote:
> On Thursday, 22 February 2018 11:46:22 UTC+2, Paavo Helde wrote:
>> On 22.02.2018 9:18, Öö Tiib wrote:
>>> On Thursday, 22 February 2018 01:53:03 UTC+2, Scott Lurndal wrote:
>>>> [*] mmap(2) wins on the efficiency metric.
>>>
>>> When I set buffer size to 2 MB on current Mac-book then both fread
>>> and ifstream::read read 1 GB file with 2 sec and the chunks size does
>>> not seemingly affect it. The 2 seconds is likely the limit of SSD
>>> so appears that mmap() is overkill for mundane sequential reads on
>>> this platform. Since mmap() is more error-prone it should be perhaps
>>> only used when it is easier to understand. For example for random
>>> access of large file the code using mmap is easier to understand than
>>> code that winds the streams back and forth.
>>
>> What do you mean by mmap being more error-prone? I do not recall having
>> any problems with it ever.
>
> mmap() does nothing a good programmer can't handle merely it is more
> complex to and so the mechanism is more error-prone. You likely
> already know the details i just give first 3 that pop into mind:
> * memory mapping uses fixed page length (lets say multiplies of 4KB).
> That does not on general case match with (lets say 5KB) file sizes
> and mismatch always provides niche for fun to next maintainer.

In C++ you can always create a wrapper class that provides a more robust
interface, takes care of closing the resources, and keeps track of the
file size as well. There is no need to twiddle with the mmap C interface
at the application level.

> * when file size exceeds the addressable space (say 3GB on 32 bit
> system whose kernel uses 2 GB) then orchestrating portions mapped
> can be fun.

Fortunately our software has been built as 64-bit for many years already.
Got a 10 GB TIFF file (BigTIFF, to be more exact)? No problem, just map it
into the address space. Well, actually there are problems because the
actual address space is not really 64-bit, but we haven't run into the
limits yet.

In principle a C++ abstraction layer can take care of the windowing
as well, but then it might indeed be easier to just use fread() or
something.

> * I/O errors raise SIGSEGV on Mac and EXCEPTION_IN_PAGE_ERROR
> on Windows. Handling user ejecting mapped media during access
> is fun.

That's the most serious argument. Maybe I should review our abstraction
layer and switch off mmap-ing for removable media.

Cheers
Paavo

Juha Nieminen

Feb 27, 2018, 2:18:03 AM

James R. Kuyper <james...@verizon.net> wrote:
> In C, I'd use fread() and check the return value for a short read.

There's nothing wrong in using std::fread() in C++. It's a perfectly
legit standard function.

James R. Kuyper

Feb 27, 2018, 8:52:31 AM

Yes, I understand that. However, most of <cstdio> has equivalent
functionality within <iostream>, and I want to properly understand what
the <iostream> way of doing such things is. In particular, I was curious
whether either of the two ways of doing this was noticeably superior to
the other. What I've learned is that they're pretty similar for this
kind of task.
The code I'm working on right now is not code that I'm entirely free to
change as I desire - there's a couple of different versions of the
program, and when I make my fix to our version, the people responsible
for the other versions will find it easier to incorporate my fix if it
doesn't require them to switch from <iostream> to <cstdio>. I'd guess
that if I did deliver <cstdio>-based code, they'd assume that I was a C
programmer doing so due to my ignorance of <iostream> (which is mostly -
but not entirely - wrong).

Juha Nieminen

Feb 28, 2018, 12:45:47 AM

James R. Kuyper <james...@verizon.net> wrote:
> Yes, I understand that. However, most of <cstdio> has equivalent
> functionality within <iostream>, and I want to properly understand what
> the <iostream> way of doing such things is. In particular, I was curious
> whether either of the two ways of doing this was noticeably superior to
> the other. What I've learned is that they're pretty similar for this
> kind of task.

Understanding <iostream> is ok, but be aware that most, if not all,
implementations of it are less efficient than those of <cstdio>.
Even std::istream::read() can be less efficient than std::fread(),
even though logically they do the exact same thing.

Melzzzzz

Feb 28, 2018, 12:58:51 AM

Why?

--
press any key to continue or any other to quit...

Hergen Lehmann

Feb 28, 2018, 3:30:23 AM

Am 28.02.2018 um 06:45 schrieb Juha Nieminen:

> Understanding <iostream> is ok, but be aware that most, if not all,
> implementations of it are less efficient than those of <cstdio>.
> Even std::istream::read() can be less efficient than std::fread(),
> even though logically they do the exact same thing.

Another problem with <iostream> is that it does too good a job of hiding
the underlying file handle.
It is fine for basic file I/O in simple single-process/single-user
applications, but in "real life" you will quickly run into situations
where you need more control over the opening mode, need to set runtime
flags like record locks, or need to operate on non-file streams like
sockets. <iostream> can't do that, at least not without reinventing the
wheel (by writing your own streambuf with its highly complex and
awkward interface).

Ian Collins

Feb 28, 2018, 4:17:50 AM

On the contrary, it is a very simple interface... I often use
specialised streambuf implementations as a teaching example. All that
is required is an underflow() override.

--
Ian.

Paavo Helde

Feb 28, 2018, 5:02:08 AM

I just stepped through a std::ifstream::read() in the MSVC++2017
debugger. In addition to an indirection (stream -> readbuf) it calls the
following virtual functions:

virtual streamsize basic_streambuf::xsgetn();
virtual int_type basic_filebuf::uflow();

xsgetn() is called once and uflow() is called whenever the read buffer
runs empty. So it's not so bad (I have the impression that some other
std::istream features involve a virtual call per character).

Underneath it uses the FILE* infrastructure, but does not use any public
FILE* functions for reading the bulk of the data; instead it uses the
internal read buffer in the FILE structure directly.

In short: std::ifstream::read() is built on top of the FILE* mechanism
with some indirections and virtual calls added. As such, it cannot be
faster than fread(); whether it is slower depends on a lot of things. Most
probably the actual I/O speed will be the bottleneck anyway, but YMMV. And
in another implementation ifstream::read() might be implemented differently.

If the file is in the disk cache then the ifstream::read() overhead
seems to matter to some extent. I made a test reading an 8 MB binary file
repeatedly and summing every 128th byte. The results are (I also added
mmap/MapViewOfFile out of curiosity):

istream::read: 3.48344 ms
fread: 2.09826 ms
mmap: 0.990501 ms
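
The test program itself was not posted. A rough sketch of how the fread and istream::read variants of such a test might be structured follows; the function names and the 64 KB chunk size are assumptions, and the timing code (for example std::chrono::steady_clock around repeated calls) is left out.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

const std::size_t kChunk = 64 * 1024;  // assumed chunk size, multiple of 128
const std::size_t kStride = 128;       // sample every 128th byte

// Sum every 128th byte of one chunk. Only the final chunk can be short,
// so restarting at offset 0 in each chunk keeps the sampling aligned.
static unsigned long sum_chunk(const char* data, std::size_t n) {
    unsigned long sum = 0;
    for (std::size_t i = 0; i < n; i += kStride)
        sum += static_cast<unsigned char>(data[i]);
    return sum;
}

unsigned long sum_with_fread(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 0;
    std::vector<char> buf(kChunk);
    unsigned long sum = 0;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0)
        sum += sum_chunk(buf.data(), n);
    std::fclose(f);
    return sum;
}

unsigned long sum_with_istream(const char* path) {
    std::ifstream in(path, std::ios_base::binary);
    std::vector<char> buf(kChunk);
    unsigned long sum = 0;
    // The last read() fails but may still have delivered a partial chunk.
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0)
        sum += sum_chunk(buf.data(), static_cast<std::size_t>(in.gcount()));
    return sum;
}
```

Both functions should return the same sum for the same file, which makes a cheap sanity check before comparing timings.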

Cheers
Paavo

Chris Vine

Feb 28, 2018, 6:16:55 AM

I agree that std::basic_streambuf is a reasonably designed and
simple-ish interface to implement. The bare minimum is to inherit from
std::basic_streambuf and override sync() and overflow() for output and
underflow() for input. If you want the most efficient block reads and
writes which bypass the internal buffers you might also want to
override xsputn() and xsgetn(), and you are also going to have to
override seekoff() and seekpos() if you want to implement seekability
on seekable files (irrelevant for things like sockets of course).

My objection is more having to do it at all for the most common case,
namely when you want to construct a stream object for a pre-opened
file descriptor. I find it surprising that, on unix-like platforms at
least, gcc and clang do not provide built in streambuffers for this.
The built-in streams provided by C++ do not actually enable you to
write a safe temporary file.

So in a unix-like environment you either have to write your own
streambuffer (which is what I have done) or you take the easier route
of using POSIX fdopen() and C streams instead.
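
The fdopen() route might look roughly like this, assuming POSIX mkstemp() and fdopen() are available. The helper name open_temp_file is invented for this sketch.

```cpp
#include <cstdio>
#include <cstdlib>   // mkstemp is declared in <stdlib.h> on POSIX systems
#include <string>
#include <unistd.h>  // close, unlink (POSIX)

// Hypothetical helper: creates a uniquely named temporary file and returns
// a C stream for it. path_template must end in "XXXXXX" and is overwritten
// with the name that was actually created.
std::FILE* open_temp_file(std::string& path_template) {
    int fd = ::mkstemp(&path_template[0]);  // creates and opens atomically
    if (fd < 0) return nullptr;

    std::FILE* f = ::fdopen(fd, "w+b");  // wrap the descriptor in a FILE*
    if (!f) {
        ::close(fd);
        ::unlink(path_template.c_str());
    }
    return f;
}
```

The caller owns the returned stream: fclose() closes the descriptor as well, and the file has to be unlinked separately when it is no longer needed.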

Chris

Jorgen Grahn

Feb 28, 2018, 6:36:51 AM

You should teach us ... I've spent a lot of time on it, then barely
understood enough to implement something useful -- and forgotten it
again. The documentation in Stroustrup's book isn't enough, not for
me anyway.

ISTR that James Kanze had a web page on the subject.

Chris Vine

Feb 28, 2018, 7:33:20 AM

On 28 Feb 2018 11:36:29 GMT

Josuttis's "The C++ Standard Library" gives an introduction and sample
implementation, which isn't bad (or at least the first edition does, I
don't have the second).

Melzzzzz

Feb 28, 2018, 7:52:24 AM

Heh, I always use fstream/sstream, didn't know that it is that much
slower ;)
I even implemented a simple database exclusively with fstream ;p
And for parsing strings, I don't like snprintf/sscanf at all ;)

>
> Cheers
> Paavo

Paavo Helde

Feb 28, 2018, 9:16:05 AM

Depends on the usage. If the file is not in the OS disk cache the
timings would probably be much longer and more equal. Also, whenever the
program does something more substantial with the file content then it
also does not matter if there is a millisecond lost somewhere.

> And parsing strings , I don't like snprintf/sscanf at all ;)

snprintf, sscanf et al. are locale-sensitive, meaning that a substantial
part of their work is spent on setting up and locking the global locale.
This can become significant in a massively multithreaded app. At least
C++ streams have the imbue() functionality for binding a non-global
locale which ought to remove this particular bottleneck.
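
A small sketch of the imbue() approach, with an invented helper name (parse_double_classic). It binds the classic "C" locale to one stream so that parsing does not depend on whatever the global locale happens to be; whether it also removes the locking cost depends on the implementation.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parse a double using the classic "C" locale only.
double parse_double_classic(const std::string& text, bool& ok) {
    std::istringstream in(text);
    in.imbue(std::locale::classic());  // per-stream locale, global untouched
    double value = 0.0;
    ok = static_cast<bool>(in >> value);
    return value;
}
```

With this, "3.5" parses as 3.5 even if the global locale uses a decimal comma.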

Of course, if locale dependency is not needed at all, like when parsing
source code of a typical programming language, the fastest way
is to use special functions with built-in ASCII/C-locale behavior. But
again, all this matters only if the speed is important at that point.
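
As an illustration of such a special function, here is a hand-written unsigned decimal parser that only knows ASCII digits and never consults a locale. The name parse_ascii_uint is made up for this sketch; in C++17, std::from_chars from <charconv> is a standard locale-independent alternative.

```cpp
#include <climits>
#include <string>

// Hypothetical helper: parse an unsigned decimal number made of ASCII
// digits only. Returns false on empty input, on any other character,
// or on overflow; out is written only on success.
bool parse_ascii_uint(const std::string& text, unsigned long& out) {
    if (text.empty()) return false;
    unsigned long value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return false;
        unsigned long digit = static_cast<unsigned long>(c - '0');
        // value * 10 + digit must not exceed ULONG_MAX
        if (value > (ULONG_MAX - digit) / 10) return false;
        value = value * 10 + digit;
    }
    out = value;
    return true;
}
```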

red floyd

Feb 28, 2018, 12:52:42 PM

On 2/28/2018 3:36 AM, Jorgen Grahn wrote:
> On Wed, 2018-02-28, Ian Collins wrote:

>> On the contrary, it is a very simple interface... I often use
>> specialised streambuf implementations as a teaching example. All that
>> is required is an underflow() override.
>
> You should teach us ... I've spent a lot of time on it, then barely
> understood enough to implement something useful -- and forgotten it
> again. The documentation in Stroustrup's book isn't enough, not for
> me anyway.
>
> ISTR that James Kanze had a web page on the subject.
>

Langer and Kreft wrote a book on it. I have it somewhere:

"Standard C++ IOStreams and Locales: Advanced Programmer's Guide and
Reference"

https://www.amazon.com/Standard-IOStreams-Locales-Programmers-Reference/dp/0201183951

Alf P. Steinbach

Feb 28, 2018, 2:52:15 PM

But g++ does.

__gnu_cxx::stdio_filebuf

<url: https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a00069.html>

> The built-in streams provided by C++ do not actually enable you to
> write a safe temporary file.
>
> So in a unix-like environment you either have to write your own
> streambuffer (which is what I have done) or you take the easier route
> of using POSIX fdopen() and C streams instead.

Cheers & hth.,

- Alf

Chris Vine

Feb 28, 2018, 5:50:12 PM

Ah, OK, I didn't know that. Thanks.

woodb...@gmail.com

Feb 28, 2018, 11:01:52 PM

That sounds right. Yawn.

Brian
Ebenezer Enterprises - Enjoying programming again.
http://webEbenezer.net