[boost] [Tokenizer]Usage and documentation

Max

Feb 8, 2011, 8:13:21 AM

to bo...@lists.boost.org

Hello,

I'm using boost::tokenizer to do some simple parsing of a data file in a
format specified by the following rules:

- One record of several fields in a single line

- Adjacent data fields in a record are separated by whitespace characters
(space or tab), with or without a ","

- String without space(s), with or without quotation marks

- String with space(s), with quotation marks

One example of a 5-field-per-record file looks like this:

"string 2" 3 4 5 4.3

"String", 2, 3.04 4 3

AnyOtherText, 2, 3.04 4 3

I am using the following code to split the input into lines first, supposing
'input' holds the contents of the data file:

typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
boost::char_separator<char> sep("\n", " ");
tokenizer tokens(input, sep);

for (tokenizer::iterator beg = tokens.begin(); beg != tokens.end(); ++beg)
{
}

Then, for each *beg, I parse the line with this:

typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
tokenizer tokens(*beg, boost::char_separator<char>(", "));
tokenizer::iterator it = tokens.begin();

But I cannot get the expected output. In the meantime, I found the
boost::tokenizer documentation quite thin, and it is not easy to find the
information I need in it.

Does anybody else have the same feeling? Or is it that nobody actually uses
it, and everyone has turned to some other, better library?

Thanks for any help.

B/Rgds

Max


Michael Caisse

Feb 8, 2011, 12:46:39 PM

to bo...@lists.boost.org

On 2/8/2011 5:13 AM, Max wrote:
> Hello,
>
>
>
> I'm using boost::tokenizer to do some simple parsing of data file in a
> format specified by the following rules:
>
>
>
> - One record of several fields in a single line
>
> - Adjacent data fields in a record separated by space char's(space
> or tab), with or without ","
>
> - String without space(s), with or without quotation marks
>
> - String with space(s), with quotation marks
>
>
>

Hi Max -

I would use Spirit Qi for this task. You can find the documentation here:

<http://www.boost.org/doc/libs/1_45_0/libs/spirit/doc/html/index.html>

michael

--

----------------------------------
Michael Caisse
Object Modeling Designs
www.objectmodelingdesigns.com

Max

Feb 9, 2011, 4:50:43 AM

to bo...@lists.boost.org

Thank you Michael for your pointer.

I did know of Spirit before, but not much. It seemed like a cannon, or
something even bigger, when what I need is a gun.

Its power and elegance seem worth a try, even though the learning curve is
a little bit steep.

(Playing with a simple toy program that closely resembles the samples
presented in the tutorial is not difficult,
but getting a full grasp, or nearly so, is far from an easy task, especially
when one runs into compile errors -
a scenario that I believe everyone here can imagine, and probably the
biggest drawback of this
powerful higher-order programming.)

B/Rgds
Max

Yechezkel Mett

Feb 9, 2011, 6:41:32 AM

to bo...@lists.boost.org

On Tue, Feb 8, 2011 at 3:13 PM, Max <more...@sina.com> wrote:
> I'm using boost::tokenizer to do some simple parsing of data file in a
> format specified by the following rules:
>
> - One record of several fields in a single line
>
> - Adjacent data fields in a record separated by space char's(space
> or tab), with or without ","
>
> - String without space(s), with or without quotation marks
>
> - String with space(s), with quotation marks
>
>
> One example of a 4-field-per-record file is like:
>
> "string 2" 3 4 5 4.3
>
> "String", 2, 3.04 4 3
>
> AnyOtherText, 2, 3.04 4 3

I normally use boost.regex's regex_token_iterator for this sort of task.
Try the following regex:

"([^"]*)"|(?:^|[[:space:],])+([^[:space:],]+)(?:$|[[:space:],])+

and tell regex_token_iterator to extract matches 1 and 2.

The above regex has a couple of quirks: "a""b" will be taken as two
fields, "a" and "b". a,,b will be taken as two fields, not three.

To read the file line by line, simply use std::getline.

Yechezkel Mett

Max

Feb 9, 2011, 8:22:44 AM

to bo...@lists.boost.org

Thank you Yechezkel.

I've indeed tried with boost.Regex, on a slightly different path though - I
was using boost::regex_search instead.

One drawback of the regex approach, IMO, is that the code feels a little
rigid or inflexible; in other words, it is not quite what I feel it should
be, even though I cannot say exactly in what respect.

Thanks for yet another regex approach. I'm trying to rewrite your regex

"([^"]*)"|(?:^|[[:space:],])+([^[:space:],]+)(?:$|[[:space:],])+

in a form that I'm more familiar with:

"([^"]*)"|(?:^|[\s,])+([^\s,]+)(?:$|[\s,])+

But I still cannot understand it, after reading through
http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
The part I could not interpret is:

^|[\s,]

and

$|[\s,]

:-( (this is not part of the regex; it is my facial expression.)

Thanks.

Max

> -----Original Message-----
> From: boost-...@lists.boost.org [mailto:boost-...@lists.boost.org]
> On Behalf Of Yechezkel Mett
> Sent: Wednesday, February 09, 2011 7:42 PM
> To: bo...@lists.boost.org
> Subject: Re: [boost] [Tokenizer]Usage and documentation
>

Stephan T. Lavavej

Feb 9, 2011, 9:46:03 AM

to bo...@lists.boost.org

[Max]

> I'm using boost::tokenizer to do some simple parsing of data file in a format specified by the following rules:
> - One record of several fields in a single line
> - Adjacent data fields in a record separated by space char's(space or tab), with or without ","
> - String without space(s), with or without quotation marks
> - String with space(s), with quotation marks
> One example of a 4-field-per-record file is like:
> "string 2" 3 4 5 4.3
> "String", 2, 3.04 4 3
> AnyOtherText, 2, 3.04 4 3

> I've indeed tried with boost.Regex, on a slightly different path though - I
> was using boost::regex_search instead.

Never call regex_search() in a loop by incrementing iterators - doing so can trigger infinite loops and incorrect results. Read Pete Becker's TR1 book for the gory details (consider what happens with zero-length matches, for example). Always use regex_iterator or regex_token_iterator instead.
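
The zero-length-match hazard can be sketched with std::regex (the thread states that boost::regex behaves the same way; that is taken on trust here). The pattern \d* can match the empty string, so a hand-written regex_search loop that simply resumes at the end of the previous match would stop advancing at the first non-digit. The function name digit_runs is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every non-empty run of digits in `s`.
// The pattern \d* also matches the empty string; std::sregex_iterator
// steps past such zero-length matches itself, so the loop terminates.
std::vector<std::string> digit_runs(const std::string& s) {
    static const std::regex r(R"(\d*)");
    std::vector<std::string> runs;
    for (std::sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
        if (i->length() != 0) {  // ignore the zero-length matches
            runs.push_back(i->str());
        }
    }
    return runs;
}
```

The iterator still reports the empty matches; the sketch filters them out rather than relying on how many there are.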

> But I still cannot understand it, after reading through
> http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
> The part I could not interpret is:
> ^|[\s,]
> And
> $|[\s,]

The docs say:

> A '^' character shall match the start of a line.
> A '$' character shall match the end of a line.

It depends on how strict you want to be (see the unusual examples below, especially involving empty fields). One approach is to describe the fields you're interested in, and let regex_iterator find them. (Another approach, activating regex_token_iterator's magical field splitting ability, doesn't seem to be applicable here because you want to handle quoted strings - if I'm wrong about that I'd love to find out). I suggest the following (I've used VC10 RTM std::regex here, but boost::regex will behave identically):

C:\Temp>type meow.cpp
#include <iostream>
#include <ostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

int main() {
    // Group 1: contents of a quoted string. Group 2: a bare field.
    const string reg("\"([^\"]*)\"|([^\\s,\"]+)");
    const regex r(reg);

    cout << "r: " << reg << endl << endl;

    for (string s; getline(cin, s); ) {
        if (s == "bye") {
            break;
        }

        vector<string> v;

        // Let regex_iterator find each field in the line.
        for (sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
            const smatch& m = *i;

            v.push_back(m[1].matched ? m[1] : m[2]);
        }

        for (vector<string>::const_iterator i = v.begin(); i != v.end(); ++i) {
            cout << "[" << *i << "]";
        }

        cout << endl << endl;
    }
}

C:\Temp>cl /EHsc /nologo /W4 meow.cpp
meow.cpp

C:\Temp>meow
r: "([^"]*)"|([^\s,"]+)

"string 2" 3 4 5 4.3
[string 2][3][4][5][4.3]

"String", 2, 3.04 4 3
[String][2][3.04][4][3]

AnyOtherText, 2, 3.04 4 3
[AnyOtherText][2][3.04][4][3]

commas,without,spaces,"and","cute fluffy kittens"
[commas][without][spaces][and][cute fluffy kittens]

leading whitespace and (invisible) trailing whitespace
[leading][whitespace][and][(invisible)][trailing][whitespace]

empty "" quotes
[empty][][quotes]

really"bizarre"strings"like""this"
[really][bizarre][strings][like][this]

empty,,,fields, , , like this
[empty][fields][like][this]

bye

C:\Temp>
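
For comparison, the field-splitting mode of regex_token_iterator mentioned in this message (submatch index -1, which returns the text between matches) can be sketched as follows. The name split_fields and the separator pattern [\s,]+ are assumptions made for illustration. The sketch shows why that mode does not fit the format: a quoted string containing a space is cut in two.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on runs of whitespace and commas.
// Submatch index -1 asks the token iterator for the pieces *between* matches.
std::vector<std::string> split_fields(const std::string& line) {
    static const std::regex sep(R"([\s,]+)");
    std::vector<std::string> out;
    std::sregex_token_iterator it(line.begin(), line.end(), sep, -1), end;
    for (; it != end; ++it) {
        if (it->length() != 0) {  // a leading separator yields an empty piece
            out.push_back(it->str());
        }
    }
    return out;
}
```

With this sketch, `AnyOtherText, 2, 3.04 4 3` splits into five fields as wanted, but `"string 2" 3` comes back as `"string`, `2"` and `3`, quotes included.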

Stephan T. Lavavej
Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries Developer

Michael Caisse

Feb 9, 2011, 2:43:36 PM

to bo...@lists.boost.org

On 2/9/2011 1:50 AM, Max wrote:
> Thank you Michael for your pointer.
>
> I did knew spirit before but not much. It seemed like a canon or much more
> while what I need is a gun.

This, unfortunately, seems to be a common misperception. If we were
talking about using lex, yacc, or bison I would understand. Spirit is a
DSEL, and a simple include enables the functionality.

It can't be that the compiled result is a 'cannon' compared to the other
solutions you are looking at. The resulting code is tight and fast.

> Its power and elegance seems worth a try, even though the learning curve is
> a little bit steep.
>
> (To play with a simple toy program closely resembling those samples
> presented in the tutorial is not difficult,
> but having a full grasp, or nearly, is far from a easy task, especially when
> one comes across compile errors -
> the scenario that I believe everyone here can imagine, which is probably the
> biggest drawback of the
> powerful high order programming)
>
> B/Rgds
> Max

I think this hits the problem. The library can be intimidating at first.
Any DSEL can just look odd initially. I personally find Qi to be close
enough to EBNF that it reads nicely.

The compiler errors are definitely an issue, especially when you are
first beginning. Looks like you have found the tutorial. If you are
interested in a crash course you can find slides here:
<http://www.objectmodelingdesigns.com/boostcon10/> and the BoostCon
video here: <http://blip.tv/file/4143337>.

There is a Spirit ML, and a bunch of us hang out on the Boost IRC channel.
The community is friendly and always eager to help a new convert.
Regardless of the approach you take, I wish you luck!

designated benevolent spirit evangelist -
michael

Max

Feb 10, 2011, 2:32:09 AM

to bo...@lists.boost.org

Hello Stephan,

Thank you so much for your detailed information, which is exactly what I
need right now.

After a glance at your email address, it occurred to me that you might be
the guy in a series of STL lectures I found on the net.
It's really you! Even though I'm not a complete beginner with the STL, I've
watched some of the videos, both to revisit the STL and for English
practice. :-)

Thank you also for your STL lectures.

> > The part I could not interpret is:
> > ^|[\s,]
> > And
> > $|[\s,]
>
> The docs say:
>
> > A '^' character shall match the start of a line.
> > A '$' character shall match the end of a line.

Yes, I'm aware of this. But even with this in mind, I cannot interpret
"^|[\s,]" and "$|[\s,]".
For the former, I know '|' means alternation, but how can it come after '^'?
For the latter, how can "|[\s,]" be expected after the end of a line (with
the same confusion as above)?

>
> It depends on how strict you want to be (see the unusual examples below,
> especially involving empty fields). One approach is to describe the fields
> you're interested in, and let regex_iterator find them. (Another approach,
> activating regex_token_iterator's magical field splitting ability, doesn't
> seem to be applicable here because you want to handle quoted strings - if
> I'm wrong about that I'd love to find out). I suggest the following (I've
> used VC10 RTM std::regex here, but boost::regex will behave identically):
>
> C:\Temp>type meow.cpp
>

>[code snippet]

>
> C:\Temp>
>
> Stephan T. Lavavej
> Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries
> Developer

One more question - with your code, any empty 'token' between two contiguous
','s is ignored. What if someday I'd like to pick them up?

B/Rgds
Max

Max

Feb 10, 2011, 2:53:15 AM

to bo...@lists.boost.org

Hello Michael

> <please don't top-post>
> <http://www.boost.org/community/policy.html#quoting>

My apologies.

> >
> > I did knew spirit before but not much. It seemed like a canon or much
> > more while what I need is a gun.
>
> This, unfortunately, seems to be a common mis-perception. If we were
> talking about using lex, yacc, or bison I would understand. Spirit in a
> DSEL and a simple include enables the functionality.
>
> It can't be that the compiled result is 'canon' compared to the other
> solutions you are looking at. The resulting code is tight and fast.

It's nice to hear that.

> > Its power and elegance seems worth a try, even though the learning curve
> > is a little bit steep.
> >
> > (To play with a simple toy program closely resembling those samples
> > presented in the tutorial is not difficult,
> > but having a full grasp, or nearly, is far from a easy task, especially
> > when one comes across compile errors -
> > the scenario that I believe everyone here can imagine, which is probably
> > the biggest drawback of the
> > powerful high order programming)
>

> I think this hits the problem. The library can be intimidating at first.
> Any DSEL can just look odd initially. I personally find Qi to be close
> enough to EBNF that it reads nicely.
>
> The compiler errors are definitely an issue, especially when you are
> first beginning. Looks like you have found the tutorial. If you are
> interested in a crash course you can find slides here:
> <http://www.objectmodelingdesigns.com/boostcon10/> and the BoostCon
> video here: <http://blip.tv/file/4143337>.
>
> There is a Spirit ML and a bunch of hang out on the Boost IRC channel.
> The community is friendly and always eager to help a new convert.
> Regardless of the approach you take, I wish you luck!

In case I run into problems, I'll definitely join the ML and ask for your
help.
Thank you very much for your pointer - I'll read carefully through the
presentations on both Spirit and ASIO, in which those two libraries and many
other Boost libraries are shown in a beautiful real-life collaboration.

And... I'm so happy to finally see what you three (Michael, Hartmut and Joel)
look like, from the photo at the bottom left of the page. :-)

I have known of you guys for a long time, but this is the first time I have
seen your photo.

> designated benevolent spirit evangelist -
> michael


Yechezkel Mett

Feb 10, 2011, 4:40:59 AM

to bo...@lists.boost.org

On Thu, Feb 10, 2011 at 9:32 AM, Max <more...@sina.com> wrote:
[Stephan T. Lavavej <s...@exchange.microsoft.com> wrote:]
>> [Max]

>> > The part I could not interpret is:
>> > ^|[\s,]
>> > And
>> > $|[\s,]
>>
>> The docs say:
>>
>> > A '^' character shall match the start of a line.
>> > A '$' character shall match the end of a line.
>
> Yes, I'm aware of this. But even with this in mind, I cannot interpret
> "^|[\s,]" and "$|[\s,]".
> For the former, I know '|' means alteration, but how can it be after '^'?
> For the latter, how can "|[\s,]" be expected after the end of a line (and
> the same confusion as above)?

^|[\s,]

means _either_ the beginning of the line _or_ a space or comma. In
other words the field starts either at the beginning of the line or
after a space or comma.

Likewise

$|[\s,]

The field ends either at the end of the line or before a space or comma.
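
A small sketch of an anchor inside an alternation, written with std::regex. The pattern and the function name are simplified stand-ins chosen for illustration, not the regex from the thread: (?:^|,) accepts either the start of the input or a literal comma in front of a word.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: captures each word that begins either at the very
// start of the input or immediately after a comma.
std::vector<std::string> words_after_start_or_comma(const std::string& s) {
    static const std::regex r(R"((?:^|,)(\w+))");
    std::vector<std::string> words;
    for (std::sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
        words.push_back((*i)[1].str());  // group 1 is the word itself
    }
    return words;
}
```

For the input `x y,z` the sketch yields `x` and `z`: `x` sits at the start of the input, `z` follows a comma, and `y` follows a space, which neither branch of the alternation accepts.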

> One more question - with you code, any empty 'token' between two contiguous
> ',' is ignored, what if someday I'd like to pick them up?

"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

I'm presuming an empty line should count as no tokens; if you don't
mind an empty line being one token it can be simplified to

"([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)

Not really that much simpler.

Yechezkel Mett

Max

Feb 10, 2011, 8:46:03 AM

to bo...@lists.boost.org

> From: boost-...@lists.boost.org [mailto:boost-...@lists.boost.org]
> On Behalf Of Yechezkel Mett

> Sent: Thursday, February 10, 2011 5:41 PM

> To: bo...@lists.boost.org
> Subject: Re: [boost] [Tokenizer]Usage and documentation
>

> ^|[\s,]
>
> means _either_ the beginning of the line _or_ a space or comma. In
> other words the field starts either at the beginning of the line or
> after a space or comma.
>
> Likewise
>
> $|[\s,]
>
> The field ends either at the end of the line or before a space or comma.

I had never realized before that ^ and $ could be combined with | in that
way.
I don't use RE that frequently, though.

>
> > One more question - with you code, any empty 'token' between two
> > contiguous ',' is ignored, what if someday I'd like to pick them up?
>
> "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$
>
> I'm presuming an empty line should count as no tokens; if you don't
> mind an empty line being one token it can be simplified to

I have 3 versions of the RE sitting side by side, trying to figure out
the difference between them.

> "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$   // (1)
> "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)       // (2)
> "([^"]*)"|([^\s,"]+)                           // (3) original version offered by Stephan

But, unfortunately, I still cannot fully grasp the meaning of (1) and (2).
By testing (1) with Stephan's code, I get:

r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

empty,,,fields, , , like this
[empty][][fields][][like][this]

,,,
[][]

There are 2 empty tokens between each 3 contiguous ','s, but only one of
them is detected in each case.

Likewise, for (2), I get:

r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)

empty,,,fields, , , like this
[empty][fields][like][this]

This time, the behavior is no different than the 'original' version.

Thank you, Yechezkel, for your help.

BTW, it seems that by reading
http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
I cannot get a full view of the regex grammar. Maybe I need a whole book on
it? :-)

Is there any *complete* introduction available on the net?

B/Rgds
Max

Yechezkel Mett

Feb 13, 2011, 6:44:17 AM

to bo...@lists.boost.org

On Thu, Feb 10, 2011 at 3:46 PM, Max <more...@sina.com> wrote:
> I have 3 version of the RE's sitting side by side attempting to figure out
> the difference
> between them.
>
>> "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$ // (1)
>> "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,) // (2)
>> "([^"]*)"|([^\s,"]+) // (3) original version offered by Stephen
>
> But, unfortunately, I still cannot fully grasp the meaning of (1) and (2).

,\s*(),

means find a ',' followed by any number of spaces followed by a ','
and capture an empty string.

The others are similar.

>
> r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$
>
> empty,,,fields, , , like this
> [empty][][fields][][like][this]
> ,,,
> [][]
>
> There are 2 empty tokens in between each 3 contiguous ',' but only one for
> each is detected.

Yes, that's a mistake. When matching ,, as an empty field the second
',' is eaten and can no longer be used as the beginning of the next
field.

"([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$

should work. (?=) is a lookahead, it checks that the pattern (',' in
this case) matches at this point, but doesn't eat any input.
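
The difference between consuming the second comma and only peeking at it can be sketched with std::regex (assuming, as stated earlier in the thread, that boost::regex behaves the same). The two patterns are copied from the thread; the names kConsuming, kLookahead and tokenize are made up for illustration, and the raw string literals are a C++11 convenience rather than what was used in 2011.

```cpp
#include <regex>
#include <string>
#include <vector>

// Version (1) from the thread: the empty-field branch eats the second comma.
const char* const kConsuming =
    R"re("([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$)re";
// Corrected version: (?=,) checks for the comma without consuming it.
const char* const kLookahead =
    R"re("([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$)re";

// Hypothetical helper: one token per match. Group 1 is a quoted string,
// group 2 a bare field; any other match is one of the empty-field branches.
std::vector<std::string> tokenize(const std::string& line, const char* pattern) {
    const std::regex r(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator i(line.begin(), line.end(), r), end; i != end; ++i) {
        const std::smatch& m = *i;
        if (m[1].matched)      out.push_back(m[1].str());
        else if (m[2].matched) out.push_back(m[2].str());
        else                   out.push_back(std::string());
    }
    return out;
}
```

For the line `a,,,b`, the consuming pattern yields `a`, one empty token, and `b`, while the lookahead pattern yields `a`, two empty tokens, and `b`.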

>
> Likewise, for (2), I get:
>
> r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)
>
> empty,,,fields, , , like this
> [empty][fields][like][this]
>
> This time, the behavior is no different than the 'original' version.

I get the same results as the first version. Perhaps it wasn't escaped properly?
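
On the escaping point: in an ordinary C++ string literal, every backslash and double quote of the regex has to be escaped again, so the regex \s must be written "\\s". The sketch below writes pattern (2) both ways; the constant names are hypothetical, and the raw literal requires C++11.

```cpp
#include <string>

// The same pattern written two ways. In the ordinary literal, each regex
// backslash is doubled and each double quote is preceded by a backslash.
// The raw literal takes the regex text verbatim.
const std::string kEscaped = "\"([^\"]*)\"|([^\\s,\"]+)|(?:^|,)\\s*()(?:$|,)";
const std::string kRaw     = R"re("([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,))re";
```

Printing the string before constructing the regex, as the "r: ..." line in Stephan's program does, is a cheap way to confirm that the pattern reaching the regex engine is the intended one.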

Yechezkel Mett

Max

Feb 16, 2011, 8:01:19 AM

to bo...@lists.boost.org

[Yechezkel Mett]

>
> ,\s*(),
>
> means find a ',' followed by any number of spaces followed by a ','
> and capture an empty string.

Yes, now I see. Thank you, Yechezkel.

>
> The others are similar.
>
> >
> > r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$
> >
> > empty,,,fields, , , like this
> > [empty][][fields][][like][this]
> > ,,,
> > [][]
> >
> > There are 2 empty tokens in between each 3 contiguous ',' but only one
> > for each is detected.
>
> Yes, that's a mistake. When matching ,, as an empty field the second
> ',' is eaten and can no longer be used as the beginning of the next
> field.
>
> "([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$
>
> should work. (?=) is a lookahead, it checks that the pattern (',' in
> this case) matches at this point, but doesn't eat any input.
>

Yes, its behavior is exactly as you expected.

> >
> > Likewise, for (2), I get:
> >
> > r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)
> >
> > empty,,,fields, , , like this
> > [empty][fields][like][this]
> >
> > This time, the behavior is no different than the 'original' version.
>
> I get the same results as the first version. Perhaps it wasn't escaped
> properly?

Yes, you are right. My different result came from my own unintentionally
incorrect escaping.

B/Rgds
Max

P.S.

I've found some 'complete' references (books) on RE. However, it is this
thread of discussion that has really triggered a leap in my understanding
of RE.
And I have also revisited Spirit.Qi, though not very deeply, following
Michael's direction. (Qi, along with its siblings, is a power tool I believe
I will definitely use.)

Now I'm able to comprehend quite 'complex' expressions, including those that
appeared in this thread.

Thank you Michael, Yechezkel, Stephan for your kind help!
