Fastest way to parse strings from a very big file!

coding junkie

Nov 23, 2009, 4:47:32 PM

I need to parse a file containing about 40,000 records. Each record
would look like

string1|string2|number1|number1|float

I need to read each field in this string and to process some
information based on this data and store all this data in an internal
data structure.

What is the most efficient way to do this? Is it better to use istream
and read each line as std::strings and then parse each string? Or
should I use char arrays and use C-style parsing??

Should I first load the file into a buffer and then parse the large
buffer? Or parse and process the file line by line?

Help!!

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Neil Butterworth

Nov 24, 2009, 12:37:08 AM

coding junkie wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process the file line by line?

I answered a similar question to this on StackOverflow a while back
(which of course I can't find at the moment) by posting the results of
some tests I did that showed that reading into a large buffer was about
twice as fast as reading line by line into a string. You will have to
decide if a 2x increase in code speed (which isn't all that much) is
worth the effort. Personally, I would get the application working with
the simplest code possible and only look at optimising reads if the
performance is actually unacceptable in practice.
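For reference, the two strategies being compared might look roughly like the sketch below. It is not the code from the tests mentioned above; the function names are made up for illustration, and each function only counts non-empty records so that the two reads can be compared like for like.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Strategy 1: read the file line by line into a std::string.
std::size_t count_records_by_line(const std::string& path) {
    std::ifstream in(path.c_str());
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        if (!line.empty()) ++count;
    }
    return count;
}

// Strategy 2: read the whole file into one buffer, then scan the buffer.
std::size_t count_records_in_buffer(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    std::string buffer((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
    std::size_t count = 0;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string::npos) end = buffer.size();
        if (end > start) ++count;  // skip empty lines
        start = end + 1;
    }
    return count;
}
```

Which one wins, and by how much, depends on the library implementation and the platform, so it is worth timing both on the real file.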

Neil Butterworth

stork

Nov 24, 2009, 12:54:32 AM

> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process the file line by line?
>

In your case I'd probably just read the file line by line. I make my
case as follows.

1) You probably wouldn't notice the optimization. 40,000 records
really isn't all that much data and whatever efficiency you might gain
from blobbing it up and trying to parse it all in RAM will be hard to
detect.

2) You may not get the optimization you think. The object of blobbing
it up in RAM is to get rid of the 40k function calls to read the line
and allocate the string for it. You may well wind up replacing those
40k function calls with some other function calls to break the memory
image up line by line. In any case, your file request will likely be
read in blocks behind the scenes anyway. Check out what you can do
by changing your buffer sizes.

3) The line-by-line approach scales better. If perchance your input
file grows to 400 million rows, your utility would still work and you
would not have to change a thing if you did it line by line.

I've been playing with Boost Spirit to do file parsing. Do check it
out! The Boost folks have some giant brains; they are like the super
brilliant aliens in the original Star Trek, and all you and I can do
is beg for Quatloos and hope to use their stuff. Spirit is pretty cool
because it can easily handle fields delimited by vertical bars, like
you have, but it can also be used to write your own compiler. That's
pretty cool, I must say.

anonma...@gmail.com

Nov 24, 2009, 12:52:04 AM

On Nov 23, 4:47 pm, coding junkie <coding.junkie...@gmail.com> wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process the file line by line?
>
> Help!!

I suggest using std::getline() to read a line from an istream into a
std::string and then parsing the string into fields. Reading the
entire file into a buffer uses more space than reading line by line.
Also, you will be able to catch line errors, such as the wrong number
of fields, this way. Reading the entire file at once makes it harder
to catch these types of errors.
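A minimal sketch of that approach, assuming '|' as the delimiter and a fixed number of fields per record. The names split_fields, read_records and LineError are made up for this example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one line on a delimiter, keeping empty fields.
std::vector<std::string> split_fields(const std::string& line, char delim = '|') {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, delim)) {
        fields.push_back(field);
    }
    // std::getline reports no final field when the line ends in a
    // delimiter, so add the trailing empty field back.
    if (!line.empty() && line[line.size() - 1] == delim) {
        fields.push_back(std::string());
    }
    return fields;
}

struct LineError {
    std::size_t line_number;  // 1-based
    std::string text;         // the offending line
};

// Read every line, keep the ones with the expected number of fields,
// and record the others in `errors` instead of aborting.
std::vector<std::vector<std::string> > read_records(
        std::istream& in, std::size_t expected_fields,
        std::vector<LineError>& errors) {
    std::vector<std::vector<std::string> > records;
    std::string line;
    std::size_t line_number = 0;
    while (std::getline(in, line)) {
        ++line_number;
        if (line.empty()) continue;  // blank lines are skipped, not errors
        std::vector<std::string> fields = split_fields(line);
        if (fields.size() != expected_fields) {
            LineError e;
            e.line_number = line_number;
            e.text = line;
            errors.push_back(e);
            continue;
        }
        records.push_back(fields);
    }
    return records;
}
```

Because the line number is known at the point of failure, a bad record can be reported precisely and the rest of the file still gets processed.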

Our group created a delimiter-separated file reader class which does
exactly this. Its constructor opened the file. It contained a
nextLine function which read in the next line, and access functions
that allowed getting each field as an int, double, or string, either
by name (the first line in the file contained column headings) or by
position.
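Our actual class isn't shown here, but a reader along those lines could be sketched as follows. Apart from nextLine, every name in it (DelimitedFileReader, getString, getInt, getDouble) is hypothetical, and the number conversions assume C++11's std::stoi and std::stod, which throw on fields that do not start with a number.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

class DelimitedFileReader {
public:
    // Opens the file and reads the first line as the column headings.
    explicit DelimitedFileReader(const std::string& path, char delim = '|')
        : in_(path.c_str()), delim_(delim) {
        if (!in_) throw std::runtime_error("cannot open " + path);
        std::string header;
        if (std::getline(in_, header)) {
            const std::vector<std::string> names = split(header);
            for (std::size_t i = 0; i < names.size(); ++i) columns_[names[i]] = i;
        }
    }

    // Reads the next line; returns false at end of file.
    bool nextLine() {
        std::string line;
        if (!std::getline(in_, line)) return false;
        fields_ = split(line);
        return true;
    }

    // Access by position or by column name; both throw std::out_of_range
    // if the field does not exist in the current line.
    const std::string& getString(std::size_t pos) const { return fields_.at(pos); }
    const std::string& getString(const std::string& name) const {
        return fields_.at(index(name));
    }
    int getInt(std::size_t pos) const { return std::stoi(getString(pos)); }
    int getInt(const std::string& name) const { return std::stoi(getString(name)); }
    double getDouble(std::size_t pos) const { return std::stod(getString(pos)); }
    double getDouble(const std::string& name) const { return std::stod(getString(name)); }

private:
    std::size_t index(const std::string& name) const {
        std::map<std::string, std::size_t>::const_iterator it = columns_.find(name);
        if (it == columns_.end()) throw std::out_of_range("no column named " + name);
        return it->second;
    }

    std::vector<std::string> split(const std::string& line) const {
        std::vector<std::string> fields;
        std::istringstream in(line);
        std::string field;
        while (std::getline(in, field, delim_)) fields.push_back(field);
        // Restore the empty field that follows a trailing delimiter.
        if (!line.empty() && line[line.size() - 1] == delim_) {
            fields.push_back(std::string());
        }
        return fields;
    }

    std::ifstream in_;
    char delim_;
    std::map<std::string, std::size_t> columns_;
    std::vector<std::string> fields_;
};
```

Typical use would be a loop of the form "while (reader.nextLine()) { ... reader.getInt("count") ... }", with the column names taken from the file's own header line.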

HTH

red floyd

Nov 24, 2009, 12:52:48 AM

On Nov 23, 1:47 pm, coding junkie <coding.junkie...@gmail.com> wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process the file line by line?
>

Get the code correct first, and then worry about "efficiency". And
BENCHMARK BENCHMARK BENCHMARK before you do "efficiency" tweaks.

I'd use your first method -- read line by line into a std::string.
If benchmarking/profiling then shows that 1) you're too slow, and 2)
the bottleneck is in the parsing, then -- and ONLY THEN -- should you
look at optimizing the parse.
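A crude timing harness is enough to get started. The sketch below assumes a C++11 compiler for std::chrono; seconds_taken is a made-up helper name. It runs a piece of work several times and reports the fastest run, which filters out some of the noise from disk caching and other processes.

```cpp
#include <chrono>

// Runs `work` `repetitions` times and returns the fastest wall-clock
// time in seconds. Returns -1.0 if repetitions is not positive.
template <typename Work>
double seconds_taken(Work work, int repetitions = 5) {
    typedef std::chrono::steady_clock clock;
    double best = -1.0;
    for (int i = 0; i < repetitions; ++i) {
        clock::time_point start = clock::now();
        work();
        std::chrono::duration<double> elapsed = clock::now() - start;
        if (best < 0.0 || elapsed.count() < best) best = elapsed.count();
    }
    return best;
}
```

Pass it a lambda or function object that reads and parses the real input file, once per candidate implementation, and compare the numbers. A profiler will tell you more about where the time goes, but this answers the "is it too slow at all?" question first.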

Vladimir Jovic

Nov 24, 2009, 11:04:02 AM

coding junkie wrote:
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??

If your target is Linux, then mmap() might help. But I agree with the
others: profile and see where the bottleneck is.

--
ultrasound www.ezono.com

Andy Venikov

Nov 24, 2009, 3:37:38 PM

coding junkie wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process the file line by line?
>
> Help!!
>

It seems like boost::spirit has been popping up in the discussions
fairly frequently here.

I'd suggest using spirit for your needs here, as it seems the perfect
fit for it.

Not only will it make your code cleaner but it also has a chance to be
very efficient.

But please, check out the "Using boost::spirit in production code"
thread from Nov 10th before you decide to use it.

HTH,
Andy.

anonma...@gmail.com

Nov 25, 2009, 12:53:48 PM

On Nov 24, 3:37 pm, Andy Venikov <swojchelo...@gmail.com> wrote:
> coding junkie wrote:
> > I need to parse a file containing about 40,000 records. Each record
> > would look like
>
> > string1|string2|number1|number1|float
>
> > I need to read each field in this string and to process some
> > information based on this data and store all this data in an internal
> > data structure.
>
> > What is the most efficient way to do this? Is it better to use istream
> > and read each line as std::strings and then parse each string? Or
> > should I use char arrays and use C-style parsing??
>
> > Should I first load the file into a buffer and then parse the large
> > buffer? Or parse and process the file line by line?
>
> > Help!!
>
> It seems like boost::spirit has been popping up in the discussions
> fairly frequently here.
>
> I'd suggest using spirit for your needs here, as it seems the perfect
> fit for it.
>
> Not only will it make your code cleaner but it also has a chance to be
> very efficient.
>
> But please, check out "Using boost::spirit in production code" thread
> from Nov 10th before you decide to use it.
>
> HTH,
> Andy.

I know you're not the only one who suggested boost spirit but I'm
replying to your post.

Personally, I think it's overkill for this case. The lines are
clearly delimited by pipe characters and the fields are strings, ints
and floats. Tokenizing is easily done with the various find functions
of std::string. And parsing ints and floats is very straightforward
with C++ standard library tools.
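To make that concrete, here is a rough sketch using std::string::find for the tokenizing and strtol/strtod for the numbers. The Record layout is a guess at the format from the original post (the second number is called number2 here), and parse_long, parse_double and parse_record are made-up names.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <string>

// Assumed layout for string1|string2|number|number|float.
struct Record {
    std::string first;
    std::string second;
    long number1;
    long number2;
    double value;
};

// Converts the whole of `text` to a long; fails on empty or trailing junk.
bool parse_long(const std::string& text, long& out) {
    if (text.empty()) return false;
    char* end = 0;
    errno = 0;
    long v = std::strtol(text.c_str(), &end, 10);
    if (errno != 0 || *end != '\0') return false;
    out = v;
    return true;
}

// Same idea for floating point values.
bool parse_double(const std::string& text, double& out) {
    if (text.empty()) return false;
    char* end = 0;
    errno = 0;
    double v = std::strtod(text.c_str(), &end);
    if (errno != 0 || *end != '\0') return false;
    out = v;
    return true;
}

// Splits `line` on '|' with std::string::find and fills in `rec`.
// Returns false if the line has too few or too many fields, or if a
// numeric field does not parse.
bool parse_record(const std::string& line, Record& rec) {
    const int field_count = 5;
    std::string fields[field_count];
    std::size_t start = 0;
    for (int i = 0; i < field_count; ++i) {
        const bool last = (i == field_count - 1);
        const std::size_t bar = line.find('|', start);
        // The last field must have no bar after it; the others must.
        if (last != (bar == std::string::npos)) return false;
        fields[i] = last ? line.substr(start) : line.substr(start, bar - start);
        if (!last) start = bar + 1;
    }
    Record parsed;
    parsed.first = fields[0];
    parsed.second = fields[1];
    if (!parse_long(fields[2], parsed.number1)) return false;
    if (!parse_long(fields[3], parsed.number2)) return false;
    if (!parse_double(fields[4], parsed.value)) return false;
    rec = parsed;
    return true;
}
```

That is the whole parser, with error checking, in well under a hundred lines and no extra dependencies.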

HTH

mzdude

Nov 25, 2009, 12:52:43 PM

On Nov 23, 4:47 pm, coding junkie <coding.junkie...@gmail.com> wrote:
> I need to parse a file containing about 40,000 records. Each record
> would look like
>
> string1|string2|number1|number1|float
>
> I need to read each field in this string and to process some
> information based on this data and store all this data in an internal
> data structure.
>
> What is the most efficient way to do this? Is it better to use istream
> and read each line as std::strings and then parse each string? Or
> should I use char arrays and use C-style parsing??
>
> Should I first load the file into a buffer and then parse the large
> buffer? Or parse and process the file line by line?
>

Lately we have been going over our code base looking to
wring out better performance. One way to do that was to
parallelize some of our algorithms and take advantage of
multiple cores. I know C++ doesn't officially support
multiple cores, but there are some libraries that do.

We are using the Intel Threading Building Blocks library
and see drastic improvement in some areas, modest in others
and sometimes even a drop in performance.

Perhaps your best performance would come from reading the file on
one thread and passing the parsing to an idle core.