I am reading binary files that contain multiple data types that alternate. That is, each file is organized, for example, as int double int double int ... and so on. I would like to read this data into two vectors of type int and double, while avoiding slow handwritten loops such as below. Is there a way to replace these single element read and insertions with faster iterator range insertions? Or perhaps, is there a way to use STL algorithms or member functions to achieve the same thing? Or some other faster way to read this data properly into vectors?
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;

int main ()
{
    ifstream file ("C:\\File\\Path\\Here.bin", ios::in|ios::binary|ios::ate);

    if(file.good())
    {
        const long int totbyts = file.tellg();

        int intval;
        double doubleval;

        vector<int> intvec;
        vector<double> doublevec;

        const int bpr = sizeof(intval)+ sizeof(doubleval);
        const int intsiz = sizeof(intval);
        const int doublesiz = sizeof(doubleval);

        //reserve space in vectors
        intvec.reserve(totbyts/bpr);
        doublevec.reserve(totbyts/bpr);

        // read file
        file.seekg(0);
        for (long int i = 0; i < totbyts; i += bpr)
        {
eud9-...@yahoo.com wrote:
> I am reading binary files that contain multiple data types that
> alternate. That is, each file is organized, for example, as int double
> int double int ... and so on.
Just for your info, there is no standard format for a double or an int, so your file format is inherently nonportable.
> I would like to read this data into two
> vectors of type int and double, while avoiding slow handwritten loops
> such as below. Is there a way to replace these single element read and
> insertions with faster iterator range insertions? Or perhaps, is there
> a way to use STL algorithms or member functions to achieve the same
> thing? Or some other faster way to read this data properly into
> vectors?
Who cares? I'm pretty sure that even if you used a linked list you would barely be able to measure any slowdown at all, because the bottleneck in this operation usually is the IO performance and not the CPU performance. Profile first, optimise later.
You forgot to imbue the stream with the classic locale here! ios_base::binary only turns off the translation of line endings.
> const long int totbyts = file.tellg();
This might be unsigned and a long might be too small to hold the size of a file. Also, I'm not sure this method is guaranteed to work.
> const int bpr = sizeof(intval)+ sizeof(doubleval);
> const int intsiz = sizeof(intval);
> const int doublesiz = sizeof(doubleval);

Use size_t instead, just for consistency and because that is what sizeof yields. Also, when applied to an object rather than to a type, sizeof doesn't need any brackets.
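To make the size_t and bracket remarks concrete, here is a minimal sketch. The function name bytes_per_record is made up for illustration; the variable names follow the original post. Note that the result is what this particular compiler uses for int and double, not a property of the file.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original post): size in bytes of one
// "int, double" pair as this compiler represents the two types.
std::size_t bytes_per_record()
{
    int intval = 0;
    double doubleval = 0.0;

    // The operand is an expression (a variable), so no brackets are needed.
    const std::size_t intsiz = sizeof intval;
    const std::size_t doublesiz = sizeof doubleval;

    // With a type name as operand, e.g. sizeof(int), the brackets are required.
    return intsiz + doublesiz;
}
```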
> file.close();
Why? Happens automatically when the file goes out of scope or the program terminates.
OK, two things:
- Prove that you have a performance problem and then measure where it is.
- The fastest way to copy data is to not copy data. Consider using a vector of structures. Consider memory mapping the file and providing a vector-like access to the contained data. Consider using a filebuf directly or C file IO instead of C++ IOstreams.
>> I am reading binary files that contain multiple data types that
>> alternate. That is, each file is organized, for example, as int double
>> int double int ... and so on.
>Just for your info, there is no standard format for a double or an int, so
>your fileformat is inherently nonportable.
Really? I don't think this is true. The file is actually all binary integers some are 4 byte (long ints) and others are 8 byte (long long ints). The 8 byte ints are actually doubles that have had their decimal place shifted and truncated. Surely such a file is portable.
>You forget to imbue with the classic locale here! ios_base::binary only
>turns off translating lineendings.
Imbue, what? Classic locale? I don't know what you are referring to.
>OK, two things:
>- Prove that you have a performance problem and then measure where it is.

As mentioned, the performance problem is in the single-element read and push_back. This is why I want to read the whole file at once (or in large chunks) and get the data into the right vectors.

>- The fastest way to copy data is to not copy data. Consider using a vector
>of structures. Consider memory mapping the file and providing a
>vector-like access to the contained data. Consider using a filebuf
>directly or C file IO instead of C++ IOstreams.
I will consider this, perhaps a vector of structures will work, thanks. Also thanks for the "unsigned" and "size_t" points.
> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.

Ever heard of byte ordering or endianness? If not, google 'endianness' and start reading... :)

Computers do NOT all store the bytes of a 32-bit integer [assuming one exists] in the same internal order, and a common order is exactly what your 4-byte reads of binary integers assume!

There is no guarantee that 32-bit and 64-bit integers have the same endianness, although I know of no examples where they differ.

Better to implement C99's hex format for floating-point numbers, and to state what the byte sequence[]={ \x01,\x02,\x03,\x04}; produces as a 32-bit integer.
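As a rough illustration of the byte-sequence question, the sketch below assembles the same four bytes in the two most common orders. The function names are invented for this example, and std::uint32_t from <cstdint> is assumed to be available.

```cpp
#include <cstdint>

// Interpret four bytes as a 32-bit unsigned value, least significant byte first.
std::uint32_t from_little_endian(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Interpret the same four bytes, most significant byte first.
std::uint32_t from_big_endian(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}
```

For the bytes 01 02 03 04, the first function yields 0x04030201 and the second 0x01020304. Because the value is built with shifts, the result does not depend on the byte order of the machine running the code.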
eud9-...@yahoo.com wrote:
>>Just for your info, there is no standard format for a double or an int, so
>>your fileformat is inherently nonportable.

> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.
- 'long long' is not a C++ type (yet) but a GCC extension to C++ or a C99 feature
- 'long' can be 32 or 64 bit in common systems
- 'double' is 8 byte according to IEEE, but you're not guaranteed to have an IEEE system by the C++ standard
- The binary layout of floating point numbers is not standardised in C++.

In particular the second and fourth points are showstoppers for the unwary. If I were you, I'd use the typedefs in stdint.h, i.e. uint32_t and uint64_t, to make clear what you mean. Oh, and did I mention endianness?
>>You forget to imbue with the classic locale here! ios_base::binary only
>>turns off translating lineendings.
> Imbue, what? Classic locale? I don't know what you are referring to.
There's a plugin in IOStreams' locale (the codecvt facet) that converts between the internally used character set and the external representation. This can be used for e.g. translating between internal ISO-8859-1 and external UTF-8, but in your case you don't want that:

    in.imbue( std::locale::classic());

If you really want a long explanation, try to get hold of "C++ IOStreams and Locales" by Langer and Kreft.

> As mentioned the performance problem in the single element read and
> push_back. This is why I want to read the whole file at once (or in
> large chuncks) and get this into the right vectors.
You should be able to eliminate the push_back overhead with reserve(), IIRC you did so already.
The other thing is more complicated. The call to read will knock loose a whole cascade of (sometimes virtual) function calls, and that only to transfer 4 or 8 bytes. Since you don't need anything that happens in these calls but the raw data, I suggested using raw C-style IO, which doesn't have all this overhead. However, on a moderately recent system this overhead will still not be as big as the overhead of transferring data from the hard drive, so don't expect too much.
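For reference, a minimal sketch of what the C-style route might look like: one std::fread call fetches a whole block instead of 4 or 8 bytes at a time. The helper name read_block is made up for this example, and the caller is assumed to have opened the file in binary mode with std::fopen. Whether this is measurably faster than istream::read depends on the implementation, so it is something to profile rather than assume.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: read up to max_bytes from an already opened C stream
// with a single fread call. The returned vector holds only the bytes that
// were actually read, which may be fewer than requested at end of file.
std::vector<unsigned char> read_block(std::FILE* f, std::size_t max_bytes)
{
    std::vector<unsigned char> block(max_bytes);
    std::size_t got = 0;
    if (max_bytes > 0) {
        got = std::fread(block.data(), 1, max_bytes, f);
    }
    block.resize(got);
    return block;
}
```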
eud9-...@yahoo.com wrote:
> >> I am reading binary files that contain multiple data types
> >> that alternate. That is, each file is organized, for
> >> example, as int double int double int ... and so on.
> >Just for your info, there is no standard format for a double
> >or an int, so your fileformat is inherently nonportable.
> Really? I don't think this is true.
Well, strictly speaking, he is wrong -- there's not just one standard format, there are a lot of them: XDR, ASN.1 BER...
Of course, they are just that: formats. None of them necessarily correspond to what you get with ostream::write/istream::read.
> The file is actually all binary integers some are 4 byte (long
> ints) and others are 8 byte (long long ints). The 8 byte ints
> are actually doubles that have had their decimal place shifted
> and truncated. Surely such a file is portable.
In the same way as XDR or ASN.1 BER is portable. You define the format, and anyone should be able to write code to read and write it. On the other hand, just using write and read won't do the trick -- these functions are designed for writing and reading pre-formatted data, and don't do any formatting.
Tell us what the format of the data is, and we will find a way to read it correctly. If you don't tell us, however, we can only guess.
> >You forget to imbue with the classic locale here!
> >ios_base::binary only turns off translating lineendings.
> Imbue, what? Classic locale? I don't know what you are
> referring to.
Standard iostreams use locales for code translation. The classic locale ("C") is guaranteed to have a degenerate translation, i.e. the identity function. No other locale is.
The locale used by default in a filebuf/[io]fstream is the current global locale. Which may or may not be the "C" locale.
Just do file.imbue( std::locale::classic() ) ; anytime before the first input.
> >OK, two things:
> >- Prove that you have a performance problem and then measure
> >  where it is.
> As mentioned the performance problem in the single element
> read and push_back. This is why I want to read the whole file
> at once (or in large chuncks) and get this into the right
> vectors.
So is the performance problem due to the vectors, or to the IO itself? A filebuf is normally buffered, so whether you read it one or two bytes at a time, or in large blocks, doesn't generally make much difference.
(Depending on the implementation, the code might be faster if you read byte by byte, using istream::get(), and shift, or if you use the sbumpc() of the filebuf directly. Or it might not be -- there's no real rule.)
-- James Kanze GABI Software Conseils en informatique orientée objet/ Beratung in objektorientierter Datenverarbeitung 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
eud9-...@yahoo.com wrote:
> >> I am reading binary files that contain multiple data types that
> >> alternate. That is, each file is organized, for example, as int double
> >> int double int ... and so on.

> >Just for your info, there is no standard format for a double or an int, so
> >your fileformat is inherently nonportable.

> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.
I'm confused about your file format. Is it
    4 byte int, 8 byte int, 4 byte int, 8 byte int, etc
(which the above suggests) or
    4 byte int, double, 4 byte int, double, etc
(which your original post implies) ?
If the former, your reading code won't work at all, because you can't read an 8 byte int into a double. Well, clearly you can, but you certainly won't get the answer you expect.
> >Just for your info, there is no standard format for a double or an int, so
> >your fileformat is inherently nonportable.

> Really? I don't think this is true. The file is actually all binary
> integers some are 4 byte (long ints) and others are 8 byte (long long
> ints). The 8 byte ints are actually doubles that have had their
> decimal place shifted and truncated. Surely such a file is portable.

It isn't portable across different architectures. Depending on the endianness of your architecture, you may get the least significant byte first in the file, or the most significant byte. As long as you are running your program on a machine with the same architecture as the one that generated the data, you are OK (although, as some CPUs allow you to change their endianness, even that statement should be taken with a pinch of salt :-( ).
Ulrich Eckhardt wrote:
> eud9-...@yahoo.com wrote:
> > I am reading binary files that contain multiple data types
> > that alternate. That is, each file is organized, for
> > example, as int double int double int ... and so on.
> Just for your info, there is no standard format for a double
> or an int, so your fileformat is inherently nonportable.
You might even mention that it can change from one version of the compiler to the next, or depending on the options used during compilation.
> > I would like to read this data into two vectors of type int
> > and double, while avoiding slow handwritten loops such as
> > below. Is there a way to replace these single element read
> > and insertions with faster iterator range insertions? Or
> > perhaps, is there a way to use STL algorithms or member
> > functions to achieve the same thing? Or some other faster
> > way to read this data properly into vectors?
> Who cares? I'm pretty sure that even if you used a linked list
> you would barely be able to measure any slowdown at all,
> because the bottleneck in this operation usually is the IO
> performance and not the CPU performance. Profile first,
> optimise later.
> > ifstream file ("C:\\File\\Path\\Here.bin",
> > ios::in|ios::binary|ios::ate);
> You forget to imbue with the classic locale here!
> ios_base::binary only turns off translating lineendings.
And possibly some special end of file recognition. On some systems, it might in fact cause a different type of file to be opened.
> > const long int totbyts = file.tellg();
> This might be unsigned and a long might be too small to hold
> the size of a file. Also, I'm not sure this method is
> guaranteed to work.
It's not. I don't think it's even guaranteed to compile -- there's no guarantee that an fpos<char> can convert to an integral type. (And long is too small to hold the length of a file on all the systems I know.)
> > const int bpr = sizeof(intval)+ sizeof(doubleval);
> > const int intsiz = sizeof(intval);
> > const int doublesiz = sizeof(doubleval);
> Use size_t instead, just for consistency and because that is
> what sizeof yields. Also, when applied to a reference, sizeof
> doesn't need any brackets.
> > file.close();
> Why? Happens automatically when the file goes out of scope or
> the program terminates.

Well, you might want to check the status of the file after close(). (In practice, for input, you have to check it after each read anyway, and it's not too important to verify it after close.)

> OK, two things:
> - Prove that you have a performance problem and then measure
>   where it is.
> - The fastest way to copy data is to not copy data. Consider
>   using a vector of structures. Consider memory mapping the
>   file and providing a vector-like access to the contained
>   data. Consider using a filebuf directly or C file IO instead
>   of C++ IOstreams.

Totally agreed. In practice, I suspect that if reading directly into an std::vector isn't fast enough, he's going to need the mmap solution (which really isn't portable at all).

--
James Kanze
Thanks everyone for this discussion. Here are a few more comments about the problem:

1) I'll learn about endianness! The binary files were not created by me and are very likely to be from different architectures. Therefore they may have different byte orders. I'll have to find out from the source.
2) I do know that the actual file is all binary integers (4 bytes and 8 bytes, alternating). My current code reads the 8 byte integers into a 'long long'. This is then translated into a double by dividing by 1000000.0. (I was omitting these details to help focus on my problem).
3) My general impression is that my code for reading this data is pretty good, but I'll also look into the suggestions (such as C IO and reading byte-by-byte) for possible improvements.
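Given that description (a 4-byte integer followed by an 8-byte integer that becomes a double after dividing by 1000000.0), decoding a buffer that has already been read into memory could look roughly like the sketch below. The byte order is NOT known from this thread; little-endian is assumed here purely for illustration, and the names load_le and decode_records are made up. The conversion from unsigned to signed also assumes a two's complement machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assemble nbytes (at most 8) into an unsigned value, least significant byte
// first. ASSUMPTION: the file is little-endian; this must be confirmed.
std::uint64_t load_le(const unsigned char* p, int nbytes)
{
    std::uint64_t v = 0;
    for (int i = 0; i < nbytes; ++i) {
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Split a buffer of 12-byte records (4-byte integer, then 8-byte scaled
// integer) into two vectors. Trailing bytes that do not form a complete
// record are ignored. Returns the number of records decoded.
std::size_t decode_records(const std::vector<unsigned char>& buf,
                           std::vector<std::int32_t>& ints,
                           std::vector<double>& doubles)
{
    const std::size_t record_size = 4 + 8;
    const std::size_t count = buf.size() / record_size;
    ints.reserve(ints.size() + count);
    doubles.reserve(doubles.size() + count);

    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char* p = buf.data() + i * record_size;
        const std::uint32_t raw32 = static_cast<std::uint32_t>(load_le(p, 4));
        const std::uint64_t raw64 = load_le(p + 4, 8);
        ints.push_back(static_cast<std::int32_t>(raw32));
        // Undo the decimal shift described in point 2.
        doubles.push_back(static_cast<double>(static_cast<std::int64_t>(raw64)) / 1000000.0);
    }
    return count;
}
```

Because both vectors are reserved once up front, the push_back calls in the loop never reallocate, and the file itself can be fetched with a single large read before decoding.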
Ulrich Eckhardt wrote:
> The other thing is more complicated. The call to read will
> knock loose a whole cascade of (sometimes virtual)
> functioncalls and that only to transfer 4 or 8 bytes.
This depends very heavily on the implementation. A simple implementation of read might simply loop on calls to rdbuf()->sgetc() -- in which case, everything could be inline, and the only time there is an actual function call is when the buffer is empty.
Typically, however, I would imagine that read is optimized for longer sequences, which would probably mean that it uses rdbuf()->sgetn(), which in turn, in filebuf, is specialized to try to do direct reading, bypassing the buffer in the case of large reads. All of which is complicated enough that it won't be inlined, and that there will be at least one virtual function call, as you suggest.
This is the reason I suggested something like using four calls to get(), shifting and or'ing the results -- it seems to me more likely that get() would be inlined, and get() only calls rdbuf()->sgetc(), which is often inlined as well, so you skip the (virtual) function calls and the extra tests, except when the input buffer is empty.
Another possible optimization would be to use istream::read to read a large block, then access the individual bytes using *p++, where p is a char const*.
> Since you don't need anything that happens in these calls but
> the raw data, I suggested using raw C-style IO which doesn't
> have all this overhead.
In theory, there isn't much difference in the two. To be sure of avoiding the overhead, you have to descend to the system API; something like read() under Unix, for example.
> However, on a moderately recent system this overhead will
> still not be as big as the overhead of transferring data from
> the harddrive, so don't expect too much.

It all depends on the implementation. From what I've heard (and to some degree seen), there are still some pretty poor iostream implementations out there. (The one delivered with Solaris, for example, will execute a pthread_mutex_lock and a pthread_mutex_unlock for each call to istream::read... or istream::get, for that matter.)

--
James Kanze

Carl Barron wrote:
> eud9-...@yahoo.com <eud9-...@yahoo.com> wrote:
> > Really? I don't think this is true. The file is actually
> > all binary integers some are 4 byte (long ints) and others
> > are 8 byte (long long ints). The 8 byte ints are actually
> > doubles that have had their decimal place shifted and
> > truncated. Surely such a file is portable.
> ever hear of byte ordering or endianness??
> if not google for 'endianness' and start reading.... :)
Endianness is just the tip of the iceberg. At least three representations of negative numbers have been used, and at least two are still being used in computers delivered today. And the number of bits in a byte can vary -- although the only values I know of in current computers are 8, 9 and 32. (And obviously, computers with 32 bit bytes don't have 4 byte longs.)
Even with four eight bit bytes, there are 24 different possible orderings -- I've actually encountered three on machines I've used. (On one machine, in fact, the ordering depended on the compiler -- and changed from one version of the compiler to the next.)
--
James Kanze
> Carl Barron wrote:
>> eud9-...@yahoo.com <eud9-...@yahoo.com> wrote:

> Endianness is just the tip of the iceberg. At least three
> representations of negative numbers have been used, and at least
> two are still being used in computers delivered today. And the
> number of bits in a byte can vary -- although the only values I
> know of in current computers are 8, 9 and 32. (And obviously,
> computers with 32 bit bytes don't have 4 byte longs.)

I do not know what you mean by "current", but you can add 6 to the collection (nice old ICL-1900). My favorite example for those who say that sizeof(int)==32 :) BTW, where is 9 used?
Eugene Kalenkovich wrote:
> "kanze" <ka...@gabi-soft.fr> wrote in message
> news:1139396543.675121.298790@z14g2000cwz.googlegroups.com...
> > Carl Barron wrote:
> >> eud9-...@yahoo.com <eud9-...@yahoo.com> wrote:
> > Endianness is just the tip of the iceberg. At least three
> > representations of negative numbers have been used, and at
> > least two are still being used in computers delivered today.
> > And the number of bits in a byte can vary -- although the
> > only values I know of in current computers are 8, 9 and 32.
> > (And obviously, computers with 32 bit bytes don't have 4
> > byte longs.)
> I do not know what do you mean by "current", but you can add 6
> to collection (nice old ICL-1900).
Not for C/C++.
I think that the very first bytes were 6 bits; it was, at any rate, a popular value back in the late 50's/early 60's -- 6 six bit bytes in a 36 bit word. (This explains why Fortran 1) uses such a small character set, and 2) uses six character symbols.) And of course, the PDP-10 traditionally used 7, with a left over bit (5 seven bit bytes in a 36 bit word). But the C standard requires at least 8 bits, and that the size of an int be a whole number multiple of the bytes.
> My favorite example for those who tell that sizeof(int)==32 :)
> BTW, where 9 is used?
Today, only on the Unisys 2200, to my knowledge. (This machine also uses 1's complement.) But it is the natural byte size for a 36 bit machine; if the PDP-10 ever had a standard C compiler, that's the size it would have used.
--
James Kanze