string1|string2|number1|number1|float
I need to read each field in this string, process some information
based on this data, and store all of it in an internal data
structure.
What is the most efficient way to do this? Is it better to use istream
and read each line as a std::string and then parse each string? Or
should I use char arrays and C-style parsing?
Should I first load the file into a buffer and then parse the large
buffer? Or parse and process the file line by line?
Help!!
--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
I answered a similar question on Stack Overflow a while back
(which of course I can't find at the moment) by posting the results of
some tests I did that showed that reading into a large buffer was about
twice as fast as reading line by line into a string. You will have to
decide if a 2x increase in code speed (which isn't all that much) is
worth the effort. Personally, I would get the application working with
the simplest code possible and only look at optimising reads if the
performance is actually unacceptable in practice.
Neil Butterworth
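For comparison with the line-by-line approach, here is a minimal sketch of the "read into a large buffer" idea. The function names slurp and split_lines are made up for illustration; this is not the code from the tests mentioned above.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read the whole stream into one string (the "large buffer").
// Hypothetical helper name, not from the original post.
std::string slurp(std::istream& in) {
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

// Break the buffer into lines without going back to the stream.
// A final line with no trailing '\n' is still returned.
std::vector<std::string> split_lines(const std::string& buffer) {
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos) end = buffer.size();
        lines.push_back(buffer.substr(start, end - start));
        start = end + 1;
    }
    return lines;
}
```

Whether this beats std::getline on a given platform and file size is something to measure rather than assume; the sketch still copies every line into its own std::string.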
In your case I'd probably just read the file line by line. I make my
case as follows.
1) You probably wouldn't notice the optimization. 40,000 records
really isn't all that much data, and whatever efficiency you might gain
from blobbing it up and trying to parse it all in RAM will be hard to
detect.
2) You may not get the optimization you think. The object of blobbing
it up in RAM is to get rid of the 40k function calls to read the line
and allocate the string for it. You may well wind up replacing those
40k function calls with some other function calls to break the memory
image up line by line. In any case, your file will likely be read in
blocks behind the scenes anyway. Check out what you can do by
changing your buffer sizes.
3) The line-by-line approach scales better. If by chance your input
file grows to 400 million rows, your utility would still work, and you
would not have to change a thing if you did it line by line.
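The line-by-line approach could look roughly like this. It is a sketch only: the Record layout and field names are assumptions based on the "string1|string2|number1|number1|float" sample, and malformed lines are simply skipped.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed record layout: two strings, two ints, one float.
struct Record {
    std::string name1;
    std::string name2;
    int number1 = 0;
    int number2 = 0;
    double value = 0.0;
};

// Parse one pipe-delimited line; returns false if a field is
// missing or a number does not convert.
bool parse_line(const std::string& line, Record& out) {
    std::istringstream fields(line);
    std::string n1, n2, v;
    if (!std::getline(fields, out.name1, '|')) return false;
    if (!std::getline(fields, out.name2, '|')) return false;
    if (!std::getline(fields, n1, '|')) return false;
    if (!std::getline(fields, n2, '|')) return false;
    if (!std::getline(fields, v)) return false;
    try {
        out.number1 = std::stoi(n1);
        out.number2 = std::stoi(n2);
        out.value = std::stod(v);
    } catch (const std::exception&) {
        return false;
    }
    return true;
}

// Read the stream one line at a time, so memory use does not
// depend on the size of the file.
std::vector<Record> read_records(std::istream& in) {
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        Record r;
        if (parse_line(line, r)) records.push_back(r);
    }
    return records;
}
```

If the file really does grow to hundreds of millions of rows, you would process each Record inside the loop instead of collecting them all in a vector.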
I've been playing with Boost Spirit to do file parsing. Do check it
out! The Boost folks have some giant brains; they are like the
super-brilliant aliens in the original Star Trek, and all you and I
can do is beg for Quatloos and hope to use their stuff. Spirit is
pretty cool because it can easily handle vertical-bar delimiters like
yours, but it can also be used to write your own compiler. That's
pretty cool, I must say.
Our group created a delimiter-separated file reader class which does
exactly this. Its constructor opened the file. It had a nextLine
function which read in the next line, and access functions that
returned each field as an int, double, or string, either by name (the
first line in the file contained column headings) or by position.
HTH
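A reader class along those lines might be sketched as follows. Only nextLine comes from the description above; the class name, the accessor names, and the error handling are guesses for illustration, not the group's actual code.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical delimiter-separated reader; the first line of the
// file is taken to hold the column headings.
class DelimitedReader {
public:
    explicit DelimitedReader(const std::string& path, char delim = '|')
        : in_(path), delim_(delim) {
        if (!in_) throw std::runtime_error("cannot open " + path);
        std::string header;
        if (std::getline(in_, header)) {
            std::vector<std::string> names = split(header);
            for (std::size_t i = 0; i < names.size(); ++i) index_[names[i]] = i;
        }
    }

    // Advance to the next data line; returns false at end of file.
    bool nextLine() {
        std::string line;
        if (!std::getline(in_, line)) return false;
        fields_ = split(line);
        return true;
    }

    // Access by position; at() throws std::out_of_range on a bad index.
    const std::string& getString(std::size_t pos) const { return fields_.at(pos); }
    int getInt(std::size_t pos) const { return std::stoi(fields_.at(pos)); }
    double getDouble(std::size_t pos) const { return std::stod(fields_.at(pos)); }

    // Access by column name; an unknown name throws std::out_of_range.
    const std::string& getString(const std::string& name) const { return getString(index_.at(name)); }
    int getInt(const std::string& name) const { return getInt(index_.at(name)); }
    double getDouble(const std::string& name) const { return getDouble(index_.at(name)); }

private:
    // Split on delim_; note that a trailing empty field is dropped.
    std::vector<std::string> split(const std::string& line) const {
        std::vector<std::string> out;
        std::string field;
        std::istringstream ss(line);
        while (std::getline(ss, field, delim_)) out.push_back(field);
        return out;
    }

    std::ifstream in_;
    char delim_;
    std::map<std::string, std::size_t> index_;
    std::vector<std::string> fields_;
};
```

Typical use would be a loop such as while (reader.nextLine()) { int n = reader.getInt("count"); ... }, where "count" stands for whatever heading the file actually uses.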
Get the code correct first, and then worry about "efficiency". And
BENCHMARK BENCHMARK BENCHMARK before you do "efficiency" tweaks.
I'd use your first method -- read line by line into a std::string.
If benchmarking/profiling then shows that 1) you're too slow, and 2)
the bottleneck is in the parsing; then -- and ONLY THEN -- should you
look at optimizing the parse.
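A crude way to get numbers before reaching for a real profiler is a small timing helper. The name time_ms is made up here, and a single run like this is only a rough indication.

```cpp
#include <chrono>

// Run fn once and return the elapsed wall-clock time in milliseconds.
// steady_clock is used because it never jumps backwards.
template <typename Fn>
double time_ms(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrap each candidate (for example, the line-by-line read and the buffered read) in a lambda, run it several times on a realistic input file, and compare the results. The first run is usually slower because the file is not yet in the OS cache.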
If your target is Linux, then mmap() might help. But I agree with the
others: profile and see where the bottleneck is.
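For reference, mapping a file read-only with mmap() looks roughly like this on POSIX systems. The function count_lines_mmap is a made-up example that only counts newlines; real parsing would walk the same mapped bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count '\n' characters in a file by mapping it into memory (POSIX only).
std::size_t count_lines_mmap(const std::string& path) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed: " + path);

    struct stat st;
    if (::fstat(fd, &st) != 0) {
        ::close(fd);
        throw std::runtime_error("fstat failed: " + path);
    }
    if (st.st_size == 0) {  // mmap() rejects a zero-length mapping
        ::close(fd);
        return 0;
    }

    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);

    const char* data = static_cast<const char*>(p);
    std::size_t lines = static_cast<std::size_t>(std::count(data, data + len, '\n'));
    ::munmap(p, len);
    return lines;
}
```

The mapped region is not NUL-terminated, so C string functions such as strtok or atoi cannot be used on it directly without care about the end of the buffer.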
It seems like boost::spirit has been popping up in the discussions
fairly frequently here.
I'd suggest using spirit for your needs here, as it seems the perfect
fit for it.
Not only will it make your code cleaner but it also has a chance to be
very efficient.
But please, check out the "Using boost::spirit in production code"
thread from Nov 10th before you decide to use it.
HTH,
Andy.
Personally, I think it's overkill for this case. The fields are
clearly delimited by pipe characters and are strings, ints and
floats. Tokenizing is easily done with the various find functions
of std::string. And parsing ints and floats is very straightforward
with C++ standard library tools.
HTH
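A sketch of that approach, using std::string::find for tokenizing and std::stoi/std::stod for the numbers. The helper names split_fields, to_int and to_double are invented for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Split on a single delimiter using std::string::find.
// Empty fields are kept, including a trailing one.
std::vector<std::string> split_fields(const std::string& line, char delim = '|') {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        std::size_t end = line.find(delim, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Convert a whole field to int; rejects empty input and trailing junk
// such as "12abc".
bool to_int(const std::string& s, int& out) {
    try {
        std::size_t used = 0;
        int v = std::stoi(s, &used);
        if (used != s.size()) return false;
        out = v;
        return true;
    } catch (const std::exception&) {
        return false;
    }
}

// Same idea for floating-point fields.
bool to_double(const std::string& s, double& out) {
    try {
        std::size_t used = 0;
        double v = std::stod(s, &used);
        if (used != s.size()) return false;
        out = v;
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

Checking the number of characters consumed is a design choice: it treats a field like "12abc" as an error rather than silently reading it as 12.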
We are using the Intel Threading Building Blocks library
and see drastic improvement in some areas, modest in others,
and sometimes even a drop in performance.
Perhaps your best performance would come from reading the file
and passing the parsing to an idle core.
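The "read here, parse on another core" idea can be sketched with std::async from the standard library instead of TBB. Everything here is an assumption for illustration: count_fields stands in for real parsing work, and the batch size would need tuning by measurement.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for real parsing: count the '|'-separated fields in each line.
std::size_t count_fields(const std::vector<std::string>& lines) {
    std::size_t total = 0;
    for (const std::string& line : lines)
        total += 1 + static_cast<std::size_t>(std::count(line.begin(), line.end(), '|'));
    return total;
}

// Read lines on the calling thread and hand each full batch to
// another thread for parsing. Returns the total field count.
std::size_t parse_file_in_background(const std::string& path, std::size_t batch_size) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<std::future<std::size_t>> pending;
    std::vector<std::string> batch;
    std::string line;
    while (std::getline(in, line)) {
        batch.push_back(line);
        if (batch.size() == batch_size) {
            pending.push_back(std::async(std::launch::async, count_fields, std::move(batch)));
            batch.clear();  // reset the moved-from vector to a known empty state
        }
    }
    if (!batch.empty())
        pending.push_back(std::async(std::launch::async, count_fields, std::move(batch)));

    std::size_t total = 0;
    for (auto& f : pending) total += f.get();  // wait for every batch
    return total;
}
```

With small batches the cost of starting tasks can easily outweigh the parsing itself, which fits the mixed results described above. On some toolchains this needs to be built with thread support enabled (for example -pthread with GCC).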