Reading from a large file by multiple reads of small chunks

Paul

unread,

Sep 19, 2015, 4:56:41 PM9/19/15

to

The following question was posted at http://www.careercup.com/question?id=14424684 --

Given API:
int Read4096(char* buf);
It reads data from a file and records the position so that the next time when it is called it read the next 4k chars (or the rest of the file, whichever is smaller) from the file.
The return is the number of chars read.

Todo: Use above API to Implement API
"int Read(char* buf, int n)" which reads any number of chars from the file.

Below is a solution offered by someone posting at that site. I have questions about this as follows. Some of these cases seem anomalous, namely num_read == 0 total != n-remain num_read < remain. Do these cases correspond to failed attempts to read the file? I think I spot some errors. Clearly read4096 should be Read4096. Also I think read4096(readbuf) should be Read4096(buf). I would also make num_copy simply remain instead of min(num_read, remain).

Is the basic idea correct though? Any other comments on this code?

Many thanks for your help,

Paul

CODE FOLLOWS BELOW ******************************
int read(char* buf, int n){
int num_it = n / 4096;
int remain = n % 4096;
int total = 0;

while(num_it--){
int num_read = read4096(buf);
if(num_read == 0) break;
buf += num_read;
total += num_read;
}

if(total != n-remain) return total;

char readbuf[4096];
int num_read = read4096(readbuf);
int num_copy = min(num_read, remain);
copy(readbuf, readbuf + num_copy, buf);
total += num_copy;

return total;
}

Jens Thoms Toerring

unread,

Sep 19, 2015, 6:49:42 PM9/19/15

to

Paul <peps...@gmail.com> wrote:
> The following question was posted at http://www.careercup.com/question?id=14424684 --

> Given API:
> int Read4096(char* buf);
> It reads data from a file and records the position so that the next time
> when it is called it read the next 4k chars (or the rest of the file,
> whichever is smaller) from the file. The return is the number of chars read.

> Todo: Use above API to Implement API
> "int Read(char* buf, int n)" which reads any number of chars from the file.

> Below is a solution offered by someone posting at that site. I have
> questions about this as follows. Some of these cases seem anomalous, namely
> num_read == 0 total != n-remain num_read < remain. Do these cases correspond
> to failed attempts to read the file? I think I spot some errors. Clearly
> read4096 should be Read4096. Also I think read4096(readbuf) should be
> Read4096(buf).

No, definitely not. There's a large chance that 'buf' has not
enough space left to accept another 4096 bytes at this stage
but just 'remain' bytes. To avoid that you need a buffer that
can accept the full 096 bytes.

> I would also make num_copy simply remain instead of
> min(num_read, remain).

No, because you just want to append to 'buf' as much as was
requested, not everything that may have been read (which
might be more than still fits into 'buf').

Both your proposed changes could result in writing past the
end of 'buf' which must be avoided at all costs! That's the
raw material for buffer-overflow attacks, i.e. it not only
makes your program behave in unpredictable ways but may even
allow the "bad guys" to wreak havoc by carefully crafted files
they may get you (or some user of your progam) to read in.

> Is the basic idea correct though? Any other comments on this code?

> CODE FOLLOWS BELOW ******************************
> int read( char * buf, int n )
> {
> int num_it = n / 4096;
> int remain = n % 4096;
> int total = 0;
>
> while ( num_it-- )
> {
> int num_read = read4096( buf );
> if ( num_read == 0 )
> break;
> buf += num_read;
> total += num_read;
> }

This tries to read as many 4096-byte chunks as can be
read from the file (modulo some typing errors like
'read4096' versus 'Read4096').

> if ( total != n - remain )
> return total;

Here things start to get a bit strange. If 'n' is an integer
multiple of 4096 (thus 'remain' is 0 ) and everything went
well one probably should also stop here and not attempt
to read anymore from the file. The code only seems to catch
cases of read errors (or less in the file than requested).
But that could be fixed easily by using '<=' instead of '!='
in the if condition.

The case that 'n' is smaller than 4096 though is handled
correctly - in that case the loop never gets run, thus
'total' is still 0 and so is 'n-remain' - thus the fol-
lowing code will get executed.

> char readbuf[ 4096 ];
> int num_read = read4096( readbuf );
> int num_copy = min( num_read, remain );
> copy( readbuf, readbuf + num_copy, buf );
> total += num_copy;

I don't see anything broken here (assuming that somewhere
before there was a line with 'using std;').

> return total;
> }

What's missing from the specifications is if it should
be possible to call this function several times. The
way it's written it can only be called once - or data
from the file may get lost. If it's to be called again
and again (e.g. with small values of 'n' for a large
file) the stuff read into 'readbuf' would have to be
stored (e.g. by making 'readbuf' a static variable) and
it would also have to be recorded in another static va-
riable how much from the last read in the last call has
not yet been returned to caller, and dealing with that on
entry into the function.

Since Read4096() is obviously meant to be called several
times to read in a file in chunks of that size I would
tend to assume that the same is also intended with this
function. But that won't work when bytes read from the
file but not yet returned to the caller are simply dis-
carded. So the function, as it is, satisfies the requi-
rements by the letter, but probably isn't what was ex-
pected.
Regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de

Paul

unread,

Sep 20, 2015, 4:26:20 AM9/20/15

to

...

Jens,

Many many thanks for such a thoughtful and helpful analysis. However, a few questions still remain. I understand you as saying that the below is correct.
char readbuf[ 4096 ];
int num_read = Read4096( readbuf ); However, it seems (to me) to read from an array of uninitialized chars. That's why I wanted to read from buf rather than from the uninitialized array.

Also, (assuming that we're aiming for an initial simplified version that isn't concerned with repeated calls), I don't see why the final piece of the file (past the chunks of 4096) needs to be treated differently (by being copied into a separate buffer).

Why not the code below?

Many thanks,

Paul

CODE AS ABOVE WITH SUGGESTED (AND PROBABLY ERRONEOUS) SIMPLIFICATIONS

int read(char* buf, int n){

int num_it = n / 4096;

int total = 0;

for(int i = num_it; num_it >= 0; --num_it){
int num_read = read4096(buf);
total += num_read;
buf += num_read;
if(num_read == 0) break;

}

return total;
}

Kalle Olavi Niemitalo

unread,

Sep 20, 2015, 4:59:26 AM9/20/15

to

Paul <peps...@gmail.com> writes:

> I understand you as saying that the below is correct.
> char readbuf[ 4096 ];
> int num_read = Read4096( readbuf );
> However, it seems (to me) to read from an array of uninitialized chars.

That code does not read from the array.
The expression Read4096(readbuf) means the same thing as
Read4096(&readbuf[0]); the array "decays" to a pointer that
points to the first element of the array.
Only that pointer is passed to the function Read4096,
which then uses it to write to the array.
The same thing would happen in fread(readbuf, 1, 4096, file).

> Why not the code below?

Consider what happens if someone calls it like this, and the file
actually is 50 bytes long;

char smallbuf[10];
int got = read(smallbuf, 10);

The intended effect is that the first 10 bytes of the file
should be read to smallbuf.

> int read(char* buf, int n){
> int num_it = n / 4096;
> int total = 0;

This would set num_it = 0 and total = 0.

> for(int i = num_it; num_it >= 0; --num_it){

This would set i = 0 and enter the loop body.

> int num_read = read4096(buf);

This would try to read 4096 bytes to buf, which points to
smallbuf. Because the file is 50 bytes long, read4096 would read
only 50 bytes to smallbuf. But smallbuf only has room for 10
bytes, so read4096 would write past the end of smallbuf
and likely corrupt some other object.

Jens Thoms Toerring

unread,

Sep 20, 2015, 6:13:55 AM9/20/15

to

Paul <peps...@gmail.com> wrote:

> However, a few questions still remain. I understand you as saying that the
> below is correct.

> char readbuf[ 4096 ];
> int num_read = Read4096( readbuf );

> However, it seems (to me) to read from an array of uninitialized chars.
> That's why I wanted to read from buf rather than from the uninitialized
> array.

As Kalle has already pointed out, this doesn't read from
'readbuf' - the Read4096() function reads from a file (up
to 4096 bytes) and puts them into the buffer you call it
with. So it will write what it read into 'readbuf'.

> Also, (assuming that we're aiming for an initial simplified version that
> isn't concerned with repeated calls), I don't see why the final piece of the
> file (past the chunks of 4096) needs to be treated differently (by being
> copied into a separate buffer).

> Why not the code below?

> CODE AS ABOVE WITH SUGGESTED (AND PROBABLY ERRONEOUS) SIMPLIFICATIONS
> int read(char* buf, int n){
> int num_it = n / 4096;
> int total = 0;

> for(int i = num_it; num_it >= 0; --num_it){
> int num_read = read4096(buf);
> total += num_read;
> buf += num_read;
> if(num_read == 0) break;
> }
> return total;
> }

The problem is the buffer the caller passes to the function.
The user isn't supposed to know how the function works in-
ternally. So it's reasonable for him to assume that the
buffer he or she passes to the function doesn't have to be
larger than the number of bytes he requests. And for that
reason alone the function may never try to write more than
as many bytes as the user requested into the buffer supplied
by him (or her).

Consider an attempt by the user to read the file character
by character, similar to fgetc(). So he might do

char c;
while ( read( &c, 1 ) == 1 )
do_something_with_the_character( c )

The user expects to get exactly one character (or nothing
when the end of the file has been reached). But your ver-
sion of the function would try to stuff up to 4096 bytes
into the location of 'c' where there's room for just a single
character (thus writing over the memory following it, pos-
sibly destroying other important data, smashing the stack
etc.) and would return a number between 0 and 4096, depen-
ding on how much was still available in the file. But the
user had no reason to expect that the function would return
anything but 1 or 0.

This type of use case is also why I would consider the
function you originally posted to be deficient. While it
at least doesn't write past the end of the user supplied
buffer and never has a return value larger than the num-
ber of characters the user requested, it discards all
the characters it read in from the file but could not
return to the user. If used in the way shown above, i.
e. in an attempt to read in the file character by cha-
racter, it would return just every 4096th character
from the file, which probably isn't what any user of it
would expect.

To use it properly the user would have to be aware that,
to get everything from the file, it must be called with a
buffer with a length that's an integer multiple of 4096.
And then it probably would be simpler to use Read4096()
directly;-)

Christian Gollwitzer

unread,

Sep 20, 2015, 8:09:16 AM9/20/15

to

Am 19.09.15 um 22:56 schrieb Paul:

> The following question was posted at http://www.careercup.com/question?id=14424684 --
>
> Given API:
> int Read4096(char* buf);
> It reads data from a file and records the position so that the next time when it is called it read the next 4k chars (or the rest of the file, whichever is smaller) from the file.
> The return is the number of chars read.
>
> Todo: Use above API to Implement API
> "int Read(char* buf, int n)" which reads any number of chars from the file.
>
> Below is a solution offered by someone posting at that site. I have questions about this as follows.

Others have already pointed out where this solution is insufficient. I'd
like to add:

this solution tries to optimize away copying by calling read4096
directly into the output bufer, if possible.
For sure this can be done, and it'll be a little more efficient than
copying - but it would be far easier to get it correct with a "static
char readbuf[4096]" and a "static size_t readbufindex" that points to
the first byte not-yet retrieved and a "static char buflength", pointing
to the end of the data in readbuf. You then always only fill the buffer
if the request cannot be fulfilled from the remaining content.
Sure, this will waste a few bytes and make it a bit slower due to
copying - but first you should try to implement a correct version,
profile, and then maybe optimize. If the read4096 takes lots of time to
get the data from the external device, copying might be negligible.

Christian