Fast read of binary files that alternate type

eud9...@yahoo.com

Feb 3, 2006, 10:11:50 PM

I am reading binary files that contain multiple data types that
alternate. That is, each file is organized, for example, as int double
int double int ... and so on. I would like to read this data into two
vectors of type int and double, while avoiding slow handwritten loops
such as below. Is there a way to replace these single element read and
insertions with faster iterator range insertions? Or perhaps, is there
a way to use STL algorithms or member functions to achieve the same
thing? Or some other faster way to read this data properly into
vectors?

#include <iostream>
#include <fstream>
#include <vector>
using namespace std;

int main ()
{
    ifstream file ("C:\\File\\Path\\Here.bin",
                   ios::in|ios::binary|ios::ate);

    if(file.good())
    {
        const long int totbyts = file.tellg();

        int intval;
        double doubleval;

        vector<int> intvec;
        vector<double> doublevec;

        const int bpr = sizeof(intval)+ sizeof(doubleval);
        const int intsiz = sizeof(intval);
        const int doublesiz = sizeof(doubleval);

        //reserve space in vectors
        intvec.reserve(totbyts/bpr); doublevec.reserve(totbyts/bpr);

        // read file
        file.seekg(0);
        for (long int i = 0; i < totbyts; i += bpr)
        {
            file.read((char *) &intval, intsiz);
            file.read((char *) &doubleval, doublesiz);

            intvec.push_back(intval);
            doublevec.push_back(doubleval);
        }

        file.close();

        // do something useful with the vectors
    }

    return 0;
}

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Ulrich Eckhardt

Feb 6, 2006, 8:44:50 AM

eud9...@yahoo.com wrote:
> I am reading binary files that contain multiple data types that
> alternate. That is, each file is organized, for example, as int double
> int double int ... and so on.

Just for your info, there is no standard format for a double or an int, so
your fileformat is inherently nonportable.

> I would like to read this data into two
> vectors of type int and double, while avoiding slow handwritten loops
> such as below. Is there a way to replace these single element read and
> insertions with faster iterator range insertions? Or perhaps, is there
> a way to use STL algorithms or member functions to achieve the same
> thing? Or some other faster way to read this data properly into
> vectors?

Who cares? I'm pretty sure that even if you used a linked list you would
barely be able to measure any slowdown at all, because the bottleneck in
this operation usually is the IO performance and not the CPU performance.
Profile first, optimise later.
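One way to act on the "profile first" advice is to time the existing read loop before changing anything. The following is only a sketch: it assumes a C++11 compiler with `<chrono>` (which postdates this thread), and `time_ms` is a hypothetical helper name, not something from the original code.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename F>
double time_ms(F work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the read loop in a lambda and passing it to `time_ms` gives a baseline. Timing a run that reads the data but skips the push_back calls would then suggest how much of the cost is IO and how much is the vectors.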

> ifstream file ("C:\\File\\Path\\Here.bin",
> ios::in|ios::binary|ios::ate);

You forget to imbue with the classic locale here! ios_base::binary only
turns off translating lineendings.

> const long int totbyts = file.tellg();

This might be unsigned and a long might be too small to hold the size of a
file. Also, I'm not sure this method is guaranteed to work.

> const int bpr = sizeof(intval)+ sizeof(doubleval);
> const int intsiz = sizeof(intval);
> const int doublesiz = sizeof(doubleval);

Use size_t instead, just for consistency and because that is what sizeof
yields. Also, when applied to an expression (such as a variable) rather than
a type, sizeof doesn't need any brackets.

> file.close();

Why? Happens automatically when the file goes out of scope or the program
terminates.

OK, two things:
- Prove that you have a performance problem and then measure where it is.
- The fastest way to copy data is to not copy data. Consider using a vector
of structures. Consider memory mapping the file and providing a
vector-like access to the contained data. Consider using a filebuf
directly or C file IO instead of C++ IOstreams.
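The vector-of-structures and C file IO suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not a drop-in replacement: it assumes each record in the file is an `int` immediately followed by a `double` with no padding, written in the host's own byte order and floating-point format. `Record`, `unpack_records` and `read_records` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

struct Record            // hypothetical in-memory form of one file record
{
    int    id;
    double value;
};

// Size of one record as stored in the file (assumed to have no padding).
const std::size_t kRecSize = sizeof(int) + sizeof(double);

// Copies `count` packed records out of `data`, field by field. The fields
// are copied separately because Record itself may contain padding.
void unpack_records(const unsigned char* data, std::size_t count,
                    std::vector<Record>& out)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        const unsigned char* p = data + i * kRecSize;
        Record r;
        std::memcpy(&r.id, p, sizeof r.id);
        std::memcpy(&r.value, p + sizeof r.id, sizeof r.value);
        out.push_back(r);
    }
}

// Reads records from an already-open binary FILE*, a batch at a time.
// Returns false on a read error; a trailing partial record is ignored.
bool read_records(std::FILE* fp, std::vector<Record>& out)
{
    const std::size_t kBatch = 4096;   // records per fread call (arbitrary)
    std::vector<unsigned char> block(kRecSize * kBatch);

    std::size_t got;
    while ((got = std::fread(block.data(), kRecSize, kBatch, fp)) > 0)
        unpack_records(block.data(), got, out);

    return std::ferror(fp) == 0;
}
```

Each `fread` call fetches up to 4096 whole records, so the per-call overhead is paid once per batch rather than once per field.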

Uli

robert...@gmail.com

Feb 6, 2006, 6:38:24 PM

Ulrich Eckhardt wrote:
> eud9...@yahoo.com wrote:

> > ifstream file ("C:\\File\\Path\\Here.bin",
> > ios::in|ios::binary|ios::ate);
>
> You forget to imbue with the classic locale here! ios_base::binary only
> turns off translating lineendings.

Please elaborate, I want more on this.

eud9...@yahoo.com

Feb 6, 2006, 6:43:02 PM

>> I am reading binary files that contain multiple data types that
>> alternate. That is, each file is organized, for example, as int double
>> int double int ... and so on.

>Just for your info, there is no standard format for a double or an int, so
>your fileformat is inherently nonportable.

Really? I don't think this is true. The file is actually all binary
integers some are 4 byte (long ints) and others are 8 byte (long long
ints). The 8 byte ints are actually doubles that have had their
decimal place shifted and truncated. Surely such a file is portable.

>You forget to imbue with the classic locale here! ios_base::binary only
>turns off translating lineendings.

Imbue, what? Classic locale? I don't know what you are referring to.

>OK, two things:
>- Prove that you have a performance problem and then measure where it is.

As mentioned, the performance problem is in the single-element read and
push_back. This is why I want to read the whole file at once (or in
large chunks) and get this into the right vectors.

>- The fastest way to copy data is to not copy data. Consider using a vector
>of structures. Consider memory mapping the file and providing a
>vector-like access to the contained data. Consider using a filebuf
>directly or C file IO instead of C++ IOstreams.

I will consider this, perhaps a vector of structures will work, thanks.
Also thanks for the "unsigned" and "size_t" points.

Carl Barron

Feb 7, 2006, 6:33:51 AM

eud9...@yahoo.com <eud9...@yahoo.com> wrote:

>
> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.

ever hear of byte ordering or endianness??
if not google for 'endianness' and start reading.... :)

all computers do NOT store the bits of a 32 bit integer [assuming one
exists] in the same internal order, and that is what your 4 byte reads
of binary integers assume!!

There is no guarantee that 32 and 64 bit integers would have the same
endianness, although I know of no examples where they differ!

Better to implement C99's hex format for floating point numbers, and state
what the byte sequence { \x01, \x02, \x03, \x04 } produces as a 32 bit
integer.
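To make the byte-order point concrete, here is a small sketch that assembles a 32-bit value from four bytes with shifts, so the result does not depend on the machine running the code. It assumes 8-bit bytes and the C++11 `<cstdint>` header; `decode_le32` and `decode_be32` are hypothetical names.

```cpp
#include <cstdint>

// Interprets four bytes as an unsigned 32-bit integer,
// least significant byte first (little-endian).
std::uint32_t decode_le32(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Same four bytes, most significant byte first (big-endian).
std::uint32_t decode_be32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}
```

For the byte sequence 01 02 03 04, `decode_le32` yields 0x04030201 and `decode_be32` yields 0x01020304. A file format has to state which of the two it means.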

Ulrich Eckhardt

Feb 7, 2006, 6:34:40 AM

eud9...@yahoo.com wrote:
>>Just for your info, there is no standard format for a double or an int, so
>>your fileformat is inherently nonportable.
>
> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.

- 'long long' is not a C++ type (yet) but a GCC extension to C++ or a C99
feature
- 'long' can be 32 or 64 bit in common systems
- 'double' is 8 byte according to IEEE, but you're not guaranteed to have an
IEEE system by the C++ standard
- The binary layout of floating point numbers is not standardised in C++.

In particular the second and fourth points are showstoppers for the unwary.
If I were you, I'd use the typedefs in stdint.h, i.e. uint32_t and uint64_t
to make clear what you mean. Oh, and did I mention endianness?

>>You forget to imbue with the classic locale here! ios_base::binary only
>>turns off translating lineendings.
>
> Imbue, what? Classic locale? I don't know what you are referring to.

There's a plugin in IOStreams' locale (codecvt facet) that does conversion
between the internally used characterset and the external representation.
This can be used for e.g. translating between internal ISO-8859-1 and
external UTF-8, but in your case you don't want that:
in.imbue( std::locale::classic());
If you really want a long explanation, try to get hold of "C++ IOStreams and
Locales" by Langer and Kreft.

> As mentioned, the performance problem is in the single-element read and
> push_back. This is why I want to read the whole file at once (or in
> large chunks) and get this into the right vectors.

You should be able to eliminate the push_back overhead with reserve(), IIRC
you did so already.

The other thing is more complicated. The call to read will knock loose a
whole cascade of (sometimes virtual) function calls, and that only to
transfer 4 or 8 bytes. Since you don't need anything that happens in these
calls but the raw data, I suggested using raw C-style IO which doesn't have
all this overhead.
However, on a moderately recent system this overhead will still not be as
big as the overhead of transferring data from the hard drive, so don't
expect too much.

Uli

kanze

Feb 7, 2006, 9:11:39 AM

eud9...@yahoo.com wrote:
> >> I am reading binary files that contain multiple data types
> >> that alternate. That is, each file is organized, for
> >> example, as int double int double int ... and so on.

> >Just for your info, there is no standard format for a double
> >or an int, so your fileformat is inherently nonportable.

> Really? I don't think this is true.

Well, strictly speaking, he is wrong -- there's not just one
standard format, there are a lot of them: XDR, ASN.1 BER...

Of course, they are just that: formats. None of them
necessarily correspond to what you get with
ostream::write/istream::read.

> The file is actually all binary integers some are 4 byte (long
> ints) and others are 8 byte (long long ints). The 8 byte ints
> are actually doubles that have had their decimal place shifted
> and truncated. Surely such a file is portable.

In the same way as XDR or ASN.1 BER is portable. You define the
format, and anyone should be able to write code to read and
write it. On the other hand, just using write and read won't do
the trick -- these functions are designed for writing and
reading pre-formatted data, and don't do any formatting.

Tell us what the format of the data is, and we will find a way
to read it correctly. If you don't tell us, however, we can
only guess.

> >You forget to imbue with the classic locale here!
> >ios_base::binary only turns off translating lineendings.

> Imbue, what? Classic locale? I don't know what you are
> referring to.

Standard iostreams use locales for code translation. The
classic locale ("C") is guaranteed to have a degenerate
translation, i.e. the identity function. No other locale is.

The locale used by default in a filebuf/[io]fstream is the
current global locale. Which may or may not be the "C" locale.

Just do
file.imbue( std::locale::classic() ) ;
anytime before the first input.
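A minimal sketch of that call in context, assuming the stream is default-constructed and opened afterwards, so the locale is in place before any input happens. `open_binary` is a hypothetical helper name.

```cpp
#include <fstream>
#include <locale>

// Opens `path` for raw binary input. The classic "C" locale is imbued
// before the file is opened, so no locale-dependent code conversion
// is applied to the bytes that are read.
void open_binary(std::ifstream& file, const char* path)
{
    file.imbue(std::locale::classic());
    file.open(path, std::ios::in | std::ios::binary);
}
```

The caller still has to check `file.is_open()` (or the stream state) afterwards; the helper only fixes the locale and the open mode.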

> >OK, two things:
> >- Prove that you have a performance problem and then measure
> > where it is.

> As mentioned, the performance problem is in the single-element
> read and push_back. This is why I want to read the whole file
> at once (or in large chunks) and get this into the right
> vectors.

So is the performance problem due to the vectors, or to the IO
itself? A filebuf is normally buffered, so whether you read it
one or two bytes at a time, or in large blocks, doesn't
generally make much difference.

(Depending on the implementation, the code might be faster if
you read byte by byte, using istream::get(), and shift, or if
you use the sbumpc() of the filebuf directly. Or it might not
be -- there's no real rule.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

ThosRTanner

Feb 7, 2006, 9:14:59 AM

eud9...@yahoo.com wrote:
> >> I am reading binary files that contain multiple data types that
> >> alternate. That is, each file is organized, for example, as int double
> >> int double int ... and so on.
>
>
> >Just for your info, there is no standard format for a double or an int, so
> >your fileformat is inherently nonportable.
>
> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.

I'm confused about your file format. Is it

4 byte int, 8 byte int, 4 byte int, 8 byte int, etc (which the above
suggests)
or
4 byte int, double, 4 byte int, double, etc (which your original post
implies)
?

If the former, your reading code won't work at all, because you can't
read an 8 byte int into a double. Well, clearly you can, but you
certainly won't get the answer you expect.

ThosRTanner

Feb 7, 2006, 9:12:05 AM

eud9...@yahoo.com wrote:
>
> >Just for your info, there is no standard format for a double or an int, so
> >your fileformat is inherently nonportable.
>
> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.
>

It isn't portable across different architectures. Depending on the
endianness of your architecture, you may get the least significant byte
first in the file, or the most significant byte. As long as you are
running your program on the same architecture machine as you generated
the data you are OK (although, as some CPUs allow you to change their
endianness, even that statement should be taken with a pinch of salt
:-( ).

kanze

Feb 7, 2006, 9:17:29 AM

Ulrich Eckhardt wrote:
> eud9...@yahoo.com wrote:
> > I am reading binary files that contain multiple data types
> > that alternate. That is, each file is organized, for
> > example, as int double int double int ... and so on.

> Just for your info, there is no standard format for a double
> or an int, so your fileformat is inherently nonportable.

You might even mention that it can change from one version of
the compiler to the next, or depending on the options used
during compilation.

> > I would like to read this data into two vectors of type int
> > and double, while avoiding slow handwritten loops such as
> > below. Is there a way to replace these single element read
> > and insertions with faster iterator range insertions? Or
> > perhaps, is there a way to use STL algorithms or member
> > functions to achieve the same thing? Or some other faster
> > way to read this data properly into vectors?

> Who cares? I'm pretty sure that even if you used a linked list
> you would barely be able to measure any slowdown at all,
> because the bottleneck in this operation usually is the IO
> performance and not the CPU performance. Profile first,
> optimise later.

> > ifstream file ("C:\\File\\Path\\Here.bin",
> > ios::in|ios::binary|ios::ate);

> You forget to imbue with the classic locale here!
> ios_base::binary only turns off translating lineendings.

And possibly some special end of file recognition. On some
systems, it might in fact cause a different type of file to be
opened.

> > const long int totbyts = file.tellg();

> This might be unsigned and a long might be too small to hold
> the size of a file. Also, I'm not sure this method is
> guaranteed to work.

It's not. I don't think it's even guaranteed to compile --
there's no guarantee that an fpos<char> can convert to an
integral type. (And long is too small to hold the length of a
file on all the systems I know.)

> > const int bpr = sizeof(intval)+ sizeof(doubleval);
> > const int intsiz = sizeof(intval);
> > const int doublesiz = sizeof(doubleval);

> Use size_t instead, just for consistency and because that is
> what sizeof yields. Also, when applied to an expression (such
> as a variable) rather than a type, sizeof doesn't need any
> brackets.

> > file.close();

> Why? Happens automatically when the file goes out of scope or
> the program terminates.

Well, you might want to check the status of the file after
close(). (In practice, for input, you have to check it after
each read anyway, and it's not too important to verify it after
close.)
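Checking the stream after each read also removes the need to know the file size up front, which sidesteps the tellg() question. A sketch, assuming the data is in the host's own int and double representation; `read_pairs` is a hypothetical name.

```cpp
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Reads alternating int/double values until a read fails. The stream state
// is tested after every read, so a truncated final record is dropped
// instead of being stored half-read.
void read_pairs(std::istream& in,
                std::vector<int>& ints, std::vector<double>& doubles)
{
    int i;
    double d;
    while (in.read(reinterpret_cast<char*>(&i), sizeof i)
        && in.read(reinterpret_cast<char*>(&d), sizeof d))
    {
        ints.push_back(i);
        doubles.push_back(d);
    }
}
```

After the loop, `in.eof()` distinguishes a normal end of data from `in.bad()`, which indicates a genuine read error.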

> OK, two things:
> - Prove that you have a performance problem and then measure
> where it is.
> - The fastest way to copy data is to not copy data. Consider
> using a vector of structures. Consider memory mapping the
> file and providing a vector-like access to the contained
> data. Consider using a filebuf directly or C file IO instead
> of C++ IOstreams.

Totally agreed. In practice, I suspect that if reading directly
to an std::vector isn't sufficient, he's going to need
the mmap solution (which really isn't portable at all).


eud9...@yahoo.com

Feb 8, 2006, 4:56:27 AM

Thanks everyone for this discussion. Here are a few more comments
about the problem:

1) I'll learn about endianness! The binary files were not created by
me and are very likely to be from different architectures. Therefore
they may have different byte orders. I'll have to find out from the
source.
2) I do know that the actual file is all binary integers (4 bytes and 8
bytes, alternating). My current code reads the 8 byte integers into a
'long long'. This is then translated into a double by dividing by
1000000.0. (I was omitting these details to help focus on my problem).
3) My general impression is that my code for reading this data is
pretty good, but I'll also look into the suggestions (such as C IO
and reading byte-by-byte) for possible improvements.
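The conversion described in point 2 can be sketched as below, together with the mistake it avoids. It assumes C++11 (`<cstdint>`, `static_assert`) and an 8-byte double; the function names are hypothetical, and the scale factor is the 1000000.0 given in the post.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::int64_t),
              "this sketch assumes an 8-byte double");

// The stored 64-bit integer holds the value multiplied by one million.
double scaled_to_double(std::int64_t raw)
{
    return static_cast<double>(raw) / 1000000.0;
}

// For contrast: copying the same eight bytes straight into a double
// reinterprets the bit pattern and yields an unrelated value.
double bits_as_double(std::int64_t raw)
{
    double d;
    std::memcpy(&d, &raw, sizeof d);
    return d;
}
```

A stored value of 1500000 becomes 1.5 through `scaled_to_double`, whereas `bits_as_double` on the same input produces a number that has nothing to do with 1.5. That is the failure mode of reading the 8 bytes directly into a `double`.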

Thanks again, I appreciate your comments.

kanze

Feb 8, 2006, 9:19:05 AM

Ulrich Eckhardt wrote:

> The other thing is more complicated. The call to read will
> knock loose a whole cascade of (sometimes virtual)
> function calls, and that only to transfer 4 or 8 bytes.

This depends very heavily on the implementation. A simple
implementation of read might simply loop on calls to
rdbuf()->sgetc() -- in which case, everything could be inline,
and the only time there is an actual function call is when the
buffer is empty.

Typically, however, I would imagine that read is optimized for
longer sequences. Which would probably mean that it uses
rdbuf()->sgetn(), which in turn, in filebuf, is specialized to
try to do direct reading, bypassing the buffer in the case of
large reads. All of which is complicated enough that it won't
be inlined, and that there will be at least one virtual function
call, as you suggest.

This is the reason I suggested something like using four calls
to get(), shifting and or'ing the results -- it seems to me more
likely that get() would be inlined, and get() only calls
rdbuf()->sgetc(), which is often inlined as well, so you skip
the (virtual) function calls and the extra tests, except when
the input buffer is empty.
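The get()-and-shift approach might look like the sketch below. It assumes 8-bit bytes and that the file stores the least significant byte first, which is an assumption here since the thread has not established the byte order. `read_le32` is a hypothetical name, and `<cstdint>` requires C++11.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>

// Reads four bytes with get() and assembles them with shift and or,
// least significant byte first. Returns false, leaving `value` untouched,
// if the stream runs out before four bytes were read.
bool read_le32(std::istream& in, std::uint32_t& value)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        const int c = in.get();   // a byte value 0..255, or eof() on failure
        if (c == std::char_traits<char>::eof())
            return false;
        result |= static_cast<std::uint32_t>(c & 0xFF) << shift;
    }
    value = result;
    return true;
}
```

Because the value is built arithmetically, the same code gives the same result on little-endian and big-endian hosts. Whether it is actually faster than a 4-byte read() depends on the implementation, as noted above.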

Another possible optimization would be to use istream::read to
read a large block, then access the individual bytes using *p++,
where p is a char const*.
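A sketch of the block-read variant, filling the two vectors the original question asked for. It assumes packed (int, double) records in the host's own representation, as in the original post; `read_block` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Reads up to `max_records` packed (int, double) records with a single
// istream::read call, then walks the block with a char pointer.
// Returns the number of complete records appended to the vectors.
std::size_t read_block(std::istream& in, std::size_t max_records,
                       std::vector<int>& ints, std::vector<double>& doubles)
{
    const std::size_t recsize = sizeof(int) + sizeof(double);
    std::vector<char> block(max_records * recsize);
    if (block.empty())
        return 0;

    in.read(block.data(), static_cast<std::streamsize>(block.size()));
    const std::size_t count = static_cast<std::size_t>(in.gcount()) / recsize;

    const char* p = block.data();
    for (std::size_t i = 0; i < count; ++i)
    {
        int iv;
        double dv;
        std::memcpy(&iv, p, sizeof iv);  p += sizeof iv;
        std::memcpy(&dv, p, sizeof dv);  p += sizeof dv;
        ints.push_back(iv);
        doubles.push_back(dv);
    }
    return count;
}
```

The function can be called repeatedly until it returns 0. gcount() reports how many bytes the last read actually delivered, so a short final block is handled without knowing the file size.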

> Since you don't need anything that happens in these calls but
> the raw data, I suggested using raw C-style IO which doesn't
> have all this overhead.

In theory, there isn't much difference in the two. To be sure
of avoiding the overhead, you have to descend to the system API;
something like read() under Unix, for example.

> However, on a moderately recent system this overhead will
> still not be as big as the overhead of transferring data from
> the hard drive, so don't expect too much.

It all depends on the implementation. From what I've heard (and
to some degree seen), there are still some pretty poor
iostream implementations out there. (The one delivered with
Solaris, for example, will execute a pthread_mutex_lock and a
pthread_mutex_unlock for each call to istream::read... or
istream::get, for that matter.)


kanze

Feb 8, 2006, 9:27:49 AM

Carl Barron wrote:
> eud9...@yahoo.com <eud9...@yahoo.com> wrote:

> > Really? I don't think this is true. The file is actually
> > all binary integers some are 4 byte (long ints) and others
> > are 8 byte (long long ints). The 8 byte ints are actually
> > doubles that have had their decimal place shifted and
> > truncated. Surely such a file is portable.

> ever hear of byte ordering or endianness??
> if not google for 'endianness' and start reading.... :)

Endianness is just the tip of the iceberg. At least three
representations of negative numbers have been used, and at least
two are still being used in computers delivered today. And the
number of bits in a byte can vary -- although the only values I
know of in current computers are 8, 9 and 32. (And obviously,
computers with 32 bit bytes don't have 4 byte longs.)

Even with four eight bit bytes, there are 24 different possible
orderings -- I've actually encountered three on machines I've
used. (On one machine, in fact, the ordering depended on the
compiler -- and changed from one version of the compiler to the
next.)


Eugene Kalenkovich

Feb 9, 2006, 9:34:02 PM

"kanze" <ka...@gabi-soft.fr> wrote in message
news:1139396543.6...@z14g2000cwz.googlegroups.com...

> Carl Barron wrote:
>> eud9...@yahoo.com <eud9...@yahoo.com> wrote:
>
> Endianness is just the tip of the iceberg. At least three
> representations of negative numbers have been used, and at least
> two are still being used in computers delivered today. And the
> number of bits in a byte can vary -- although the only values I
> know of in current computers are 8, 9 and 32. (And obviously,
> computers with 32 bit bytes don't have 4 byte longs.)
>

I do not know what you mean by "current", but you can add 6 to the collection
(nice old ICL-1900). My favorite example for those who say that
sizeof(int)==32 :)
BTW, where is 9 used?

-- EK

kanze

Feb 10, 2006, 9:18:03 AM

Eugene Kalenkovich wrote:
> "kanze" <ka...@gabi-soft.fr> wrote in message
> news:1139396543.6...@z14g2000cwz.googlegroups.com...
> > Carl Barron wrote:
> >> eud9...@yahoo.com <eud9...@yahoo.com> wrote:

> > Endianness is just the tip of the iceberg. At least three
> > representations of negative numbers have been used, and at
> > least two are still being used in computers delivered today.
> > And the number of bits in a byte can vary -- although the
> > only values I know of in current computers are 8, 9 and 32.
> > (And obviously, computers with 32 bit bytes don't have 4
> > byte longs.)

> I do not know what you mean by "current", but you can add 6
> to the collection (nice old ICL-1900).

Not for C/C++.

I think that the very first bytes were 6 bits; it was, at any
rate, a popular value back in the late 50's/early 60's -- six 6-bit
bytes in a 36 bit word. (This explains why Fortran 1) uses
such a small character set, and 2) uses six character symbols.)
And of course, the PDP-10 traditionally used 7, with a left over
bit (5 seven bit bytes in a 36 bit word). But the C standard
requires at least 8 bits, and that the size of an int be a whole
number multiple of the bytes.

> My favorite example for those who say that sizeof(int)==32 :)

> BTW, where is 9 used?

Today, only on the Unisys 2200, to my knowledge. (This machine
also uses 1's complement.) But it is the natural byte size for
a 36 bit machine; if the PDP-10 ever had a standard C compiler,
that's the size it would have used.
