
C++ String tokenizer


dann...@hotmail.com

Jan 29, 2006, 11:28:35 AM
Is there a recommended best solution for string tokenizing in C++?

I mean, I know strtok() is bad, but what's a good alternative?

I've seen Boost tokenizers, but there's a huge include overhead, which
I don't want. Is there a simple class or function out there that's
recommended by those in the know?

Fanks

DT


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Timo Geusch

Jan 29, 2006, 1:17:45 PM
dann...@hotmail.com wrote:

> Is there a recommended best solution for string tokenizing in C++?
>
> I mean, I know strtok() is bad, but what's a good alternative?
>
> I've seen Boost tokenizers, but there's a huge include overhead, which
> I don't want. Is there a simple class or function out there that's
> recommended by those in the know?

The boost tokeniser actually works very well, so that would be my first
choice. Not quite sure what you mean by the "huge include overhead";
maybe you want to elaborate a tad on this?
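
For reference, a minimal sketch of basic Boost.Tokenizer usage with a
set of delimiter characters (boost::char_separator drops the listed
separators and empty tokens by default):

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
    std::string line("/Ant:::Bee;:Cat:Dog");
    boost::char_separator<char> seps(":;/");
    boost::tokenizer<boost::char_separator<char> > tok(line, seps);
    typedef boost::tokenizer<boost::char_separator<char> >::iterator iter;
    for (iter it = tok.begin(); it != tok.end(); ++it)
        std::cout << *it << '\n';   // Ant Bee Cat Dog
    return 0;
}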

Pete Becker

Jan 29, 2006, 1:20:49 PM
dann...@hotmail.com wrote:

>
> I mean, I know strtok() is bad, but what's a good alternative?
>

strtok works just fine for the things it was designed for. Whether
that's appropriate for your needs depends on what a "string tokenizer"
is supposed to do.
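
For contrast, the classic strtok pattern (a sketch): it writes NUL
bytes into its argument, keeps hidden static state between calls, and
silently collapses adjacent delimiters, which is most of what people
hold against it:

#include <cstdio>
#include <cstring>

int main()
{
    char buf[] = "one;two;;three";   // strtok modifies this buffer
    for (char* p = std::strtok(buf, ";"); p != 0;
         p = std::strtok(0, ";"))
        std::printf("%s\n", p);      // one, two, three; the empty
                                     // field between ";;" is skipped
    return 0;
}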

--

Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)

Carlos Moreno

Jan 29, 2006, 9:35:29 PM
dann...@hotmail.com wrote:
> Is there a recommended best solution for string tokenizing in C++?
>
> I mean, I know strtok() is bad, but what's a good alternative?
>
> I've seen Boost tokenizers, but there's a huge include overhead, which
> I don't want. Is there a simple class or function out there that's
> recommended by those in the know?

If you must do it yourself, you could use a combination of
string::find_first_not_of and find_first_of to identify the
beginning of a new token and the first separator character
after the current token (respectively).

I wrote this little function which you could modify if a
vector<string> for the broken-down tokens does not fit the
bill:


vector<string> tokens (const string & s, const string & separators)
{
    vector<string> result;

    int pos_start, pos_end;

    pos_start = s.find_first_not_of (separators);
    if (pos_start == string::npos)
    {
        return result;
    }

    pos_end = s.find_first_of (separators, pos_start);
    if (pos_end == string::npos)
    {
        result.push_back (s.substr (pos_start));
        return result;
    }

    result.push_back (s.substr (pos_start, pos_end - pos_start));

    while ((pos_start = s.find_first_not_of (separators, pos_end)) !=
           string::npos)
    {
        pos_end = s.find_first_of (separators, pos_start);
        if (pos_end == string::npos)
        {
            result.push_back (s.substr (pos_start));
            return result;
        }

        result.push_back (s.substr (pos_start, pos_end - pos_start));
    }

    return result;
}

I think it is bug-free (though I know, those ints should be at
least unsigned ints or size_t, if not string::size_type), but no
guarantees at all.
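
A quick usage sketch of the function above (assuming the same
using namespace std, plus <iostream>):

vector<string> parts = tokens ("/Ant:::Bee;:Cat:Dog", ":;/");
for (vector<string>::size_type i = 0; i != parts.size(); ++i)
    cout << parts[i] << '\n';   // Ant, Bee, Cat, Dog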

HTH,

Carlos
--

shablool

Jan 30, 2006, 5:52:40 AM
dann...@hotmail.com wrote:
> Is there a recommended best solution for string tokenizing in C++?
>
> I mean, I know strtok() is bad, but what's a good alternative?
>
> I've seen Boost tokenizers, but there's a huge include overhead, which
> I don't want. Is there a simple class or function out there that's
> recommended by those in the know?
>
> Fanks
>
> DT


The Strinx library ( http://strinx.sourceforge.net ) was designed to
solve exactly such problems, without performance penalty (such as
redundant memory allocations behind the scenes), and it is rather small
compared to boost-libs.

Here is an example of string tokenizing using Strinx:

/* Token-parsing Example */
#include <iostream>
#include <strinx/strinx.h>

int main()
{
    using namespace strinx;

    // Parse 'line' into tokens, delimited by ':', ';' or '/'.
    // Output is: Ant Bee Cat Dog Eel Frog Giraffe
    String line("/Ant:::Bee;:Cat:Dog;Eel/Frog:/Giraffe///");
    const char seps[] = ":;/";

    String tok = find_token(line, seps);
    while (tok.size() > 0)
    {
        std::cout << tok << ' ';
        tok = find_next_token(line, tok, seps);
    }

    std::cout << std::endl;

    return 0;
}

Shachar S.

Russ

Jan 30, 2006, 9:23:43 AM
I use stlsoft's string_tokeniser (see
http://synesis.com.au/software/stlsoft/help/classstlsoft_1_1string__tokeniser.html)

It's very efficient. The library's author, Matt Wilson, wrote an
article in CUJ about string views recently, and showed using them with
string_tokeniser to tokenize without any allocations, which got me on
to it.

And since you ask, it's low on compile-time overhead, unlike some other
libs I've tried. ;-o

HTH

Russ; no fuss

Carlos Moreno

Jan 30, 2006, 4:44:44 PM
shablool wrote:

> Here is an example of string tokenizing using Strinx:
>
> /* Token-parsing Example */
> #include <iostream>
> #include <strinx/strinx.h>
>
> int main()
> {
>     using namespace strinx;
>
>     // Parse 'line' into tokens, delimited by ':', ';' or '/'.
>     // Output is: Ant Bee Cat Dog Eel Frog Giraffe
>     String line("/Ant:::Bee;:Cat:Dog;Eel/Frog:/Giraffe///");
>     const char seps[] = ":;/";
>
>     String tok = find_token(line, seps);
>     while (tok.size() > 0)
>     {
>         std::cout << tok << ' ';
>         tok = find_next_token(line, tok, seps);
>     }

How could this possibly work? Would it properly tokenize
the string "a b a b c" when given space as separator? Or
would it fall into an infinite loop finding the second b
after the third a? If it doesn't (fall into an infinite
loop, which would be possible if it uses static data, or
if the internal representation of tok is a really weird
thing that stores the necessary information), I'd almost
say that it is worse than strtok, since its interface is
anything but intuitive -- no sane mind would expect it to
work otherwise. Plus, talk about a
single-class-does-everything scheme.

Carlos
--

Tilman Kuepper

Jan 30, 2006, 4:46:37 PM
Hi there,

> Is there a recommended best solution for string tokenizing in C++?
>
> I mean, I know strtok() is bad, but what's a good alternative?

Sometimes std::getline() is an option:

std::stringstream strm("hello;world;test");
std::string s;
while(std::getline(strm, s, ';'))
{
    std::cout << s << "\n";
}

Best regards,
Tilman

dann...@hotmail.com

Jan 30, 2006, 8:08:15 PM
Oh wow! You live and learn.

Thanks for that, I'll try it.

How's it for efficiency, though? (IOStreams being slow and all.)

Also, I assume there's only one char that can act as the delimiter at a
time?

dann...@hotmail.com

Jan 31, 2006, 3:44:31 AM
Wow! That's too weird. I even grepped for tokenize in all my OS libs.
Sue me for not being smart enough to realise that someone might have
written a component using the British? spelling!! (That's not exactly
clever, is it? I think I'll have to post on their newsgroup and point
out my experiences.)

I've been using STLSoft for ages, and have seen some of his articles,
but never come across the string tokenizer (sorry, "string_tokeniser"
:-o).

I really like their stuff, but man, their docs are terrible. Is it
just Matt Wilson alone? (The code seems to mention mostly him, with an
occasional contribution from others.) IMO he should probably stop
inventing stuff and writing articles and focus awhile on producing
some good docs for his libs, and maybe post about these things on
codeproject. Could be some worthwhile stuff if anyone could see the
wood for the trees. (Or he may not care if other people use it, but
then why bother releasing?)

Anyway, I've just plugged in the tokenizer and it's perfect for what
I'm after. (And it didn't heat up my processor during compilation,
which is a nice bonus.) I think I can stop looking now.

Fanks

DT

Benry

Jan 31, 2006, 10:15:22 AM
> Thanks for that, I'll try it.
>
> How's it for efficiency, though? (IOStreams being slow and all.)


I'm sure you've figured this out by now, but this is from MS's C++
documentation:

"""
The [string::]getline function creates a string containing all of the
characters from the input stream until one of the following situations
occurs:
- End of file.
- The delimiter is encountered.
- is.max_str elements have been extracted.
"""

It's good if you want to strip the beginning before a delimiter, but I
doubt that is what you're looking for in a tokenizer.

Benry

Jan 31, 2006, 10:43:40 AM

I would like to know if you've ever used a parser (I only know Bison)?
I think this implementation is perfectly understandable and intuitive.
You input a string and delimiters, and you get a segmented string on
the output, which can be iterated through like an array using the
"find_next_token()" method. If the overhead is low, I don't understand
why this isn't the best tokenizer listed so far in this thread.

-Ben

Tilman Kuepper

Jan 31, 2006, 11:21:38 AM
Hi Benry,

> [std::getline...]
>
> It's good if you want to strip the beginning before a delimiter, but I
> doubt that is what you're looking for in a tokenizer.

Not quite true; you can really use it to step through a sequence
of strings. Here is a complete example:

#include <iostream>
#include <string>
#include <sstream>

int main()
{
    std::istringstream strm("first;second;third");
    std::string str;
    while(std::getline(strm, str, ';'))
        std::cout << str << "\n";
}

And here is the output of this program:

first
second
third

Best regards,
Tilman

Tilman Kuepper

Jan 31, 2006, 11:21:08 AM
Hi Danny,

> How's it for efficiency, though? (IOStreams being slow and all.)

Fast enough for me. Did you try it...?

> Also, I assume there's only one char that can act as the delimiter
> at a time?

That's right. The delimiter is a single character.
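
If you do need several delimiter characters with the getline approach,
one workaround (my sketch, not part of the getline interface) is to
normalize the delimiters first:

#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::string line("a;b,c;d");
    std::replace(line.begin(), line.end(), ',', ';'); // map ',' to ';'
    std::istringstream strm(line);
    std::string tok;
    while (std::getline(strm, tok, ';'))
        std::cout << tok << '\n';                     // a b c d
    return 0;
}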

Best regards,
Tilman

shablool

Jan 31, 2006, 12:57:35 PM


The above code will work fine with any input string. If using
white-space as a delimiter, it may also be written in the following way
(check for yourself that it works):

#include <iostream>
#include <strinx/strinx.h>

int main()
{
    using namespace strinx;

    // Parse line into tokens, delimited by whitespace.
    // The output is: "a b a b c "
    const String line(" a b a b c ");

    String ss = trim(line);
    while (ss.size() > 0)
    {
        size_t i = ss.find(' ');
        std::cout << substr(ss, 0, i) << ' ';
        ss = trim(substr(ss, i, ss.size()));
    }

    std::cout << std::endl;

    return 0;
}


A description of the way it is done is given in the Strinx library
documentation (see: http://strinx.sourceforge.net/strinx.html#id3). It
is off-topic for this thread to have a detailed discussion of the
underlying design of the Strinx library.


If we go back to the initial problem, I believe we can see that the
author wanted to perform string parsing (tokenizing), but with the
following requirements:

1. Preserve the same level of efficiency he was used to in C (in
particular, without redundant memory allocations behind the scenes).

2. Take advantage of C++'s powerful capabilities, such as object
orientation and type checking.

3. Maintain simple and readable code, which is not prone to errors.

Unfortunately, it is very difficult (if not impossible) to have *all*
of the above using only std::string (and exactly for such reasons the
Strinx library was designed in the first place, see:
http://strinx.sourceforge.net/strinx.html#appendix-b-what-is-wrong-with-the-standard-string)


Shachar S.

John Potter

Feb 1, 2006, 5:09:02 AM
On 29 Jan 2006 21:35:29 -0500, Carlos Moreno
<moreno_at_mo...@mailinator.com> wrote:

> I wrote this little function which you could modify if a
> vector<string> for the broken-down tokens does not fit the
> bill:

Not picking on you, but my first reaction was simply, "YUK!". This code
may be a quick and dirty post or something else.

[blank lines removed from following]

> vector<string> tokens (const string & s, const string & separators)
> {
>     vector<string> result;
>     int pos_start, pos_end;
>     pos_start = s.find_first_not_of (separators);
>     if (pos_start == string::npos)
>     {
>         return result;
>     }
>     pos_end = s.find_first_of (separators, pos_start);
>     if (pos_end == string::npos)
>     {
>         result.push_back (s.substr (pos_start));
>         return result;
>     }
>     result.push_back (s.substr (pos_start, pos_end - pos_start));
>     while ((pos_start = s.find_first_not_of (separators, pos_end))
>            != string::npos)
>     {
>         pos_end = s.find_first_of (separators, pos_start);
>         if (pos_end == string::npos)
>         {
>             result.push_back (s.substr (pos_start));
>             return result;
>         }
>         result.push_back (s.substr (pos_start, pos_end - pos_start));
>     }
>     return result;
> }

> I think it is bug-free (though I know, those ints should be at
> least unsigned ints or size_t, if not string::size_type), but no
> guarantees at all.

It works.

K&M have a couple of versions of this algorithm which avoid standard
string functions and use less code. I wonder why you did not just
reference them. I think that there is an elegant version using the
standard string functions.

vector<string> tokens (string const& text, string const& delim) {
    vector<string> result;
    string::size_type b(text.find_first_not_of(delim));
    while (b != string::npos) {
        string::size_type e(text.find_first_of(delim, b));
        result.push_back(text.substr(b, e - b));
        b = text.find_first_not_of(delim, min(e, text.size()));
    }
    return result;
}

We could also remove the duplicated code as you seemed to want by using
the usual C while loop and a bit more. I prefer the above to this, but
it was fun. This version really belts it out and thus has no need for
braces :)

vector<string> tokens (string const& text, string const& delim) {
    vector<string> result;
    string::size_type b, e(0);
    while ((b = text.find_first_not_of(delim, min(e, text.size())))
           != string::npos)
        result.push_back(text.substr(b,
            (e = text.find_first_of(delim, b)) - b));
    return result;
}

At this point, I wonder if your code might be easier to comprehend than
the elegant version. I find the many special cases to be distractions
from the general solution; however, the new maintenance programmer may
be able to understand it more easily than an elegant solution which
consumes the special cases into the general case. Or is it just that
production demands prevent enough thought to produce an elegant
solution? The first working model is enough?

We could also look at the STL version:

template <class Iiter, class Fiter>
Iiter find_first_not_of (Iiter tb, Iiter te, Fiter db, Fiter de) {
    while (tb != te && std::find(db, de, *tb) != de)
        ++tb;
    return tb;
}

Seems that the standard forgot to include that one. Is there a way to
produce it using the provided algorithms?

vector<string> tokens (string const& text, string const& delim) {
    vector<string> result;
    string::const_iterator b(find_first_not_of(text.begin(), text.end(),
                                               delim.begin(), delim.end()));
    while (b != text.end()) {
        string::const_iterator e(find_first_of(b, text.end(),
                                               delim.begin(), delim.end()));
        result.push_back(string(b, e));
        b = find_first_not_of(e, text.end(),
                              delim.begin(), delim.end());
    }
    return result;
}

Yes, the pieces are a bit more verbose, but there is no need to avoid a
thing (string::npos) which is not usable as a position. Could the
standard string have gained some functionality from the STL? Could the
STL have gained some functionality from the standard string?

Is a special off-the-end value ever more useful than the simple
off-the-end by one value?

John

Russ

Feb 1, 2006, 5:11:22 AM
> I would like to know if you've ever used a parser (I only know Bison)?
> I think this implementation is perfectly understanable and intuitive.

Agreed.

> You input a string and delimiters, and you get a segmented string on
> the output, which can be iterated through like an array using the
> "find_next_token()" method. If the overhead is low, I don't understand
> why this isn't the best tokenizer listed so far in this thread.

Well, for one thing, it offers a different use paradigm. Aside from
just being trendy, the STL sequence/iterator paradigm is well known.
That, in and of itself, is a worthy trait of any tokenizer class.

Second, it looks like Strinx's find_token either mutates the source
string, or has global (or at least thread-specific) state information
for maintaining its current position in the source.

Further, as I said in another post on this thread, the stlsoft
tokenizer can be combined with their "string view" component, meaning
that a process such as that described can be done without any mutation
of the tokenized string and without any copies being taken, i.e.
without any allocation at all. This obviously results in highly
efficient code, which can be important in some circumstances; it may be
that absence of allocation will appeal to more programmers.
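
The no-allocation idea boils down to handing out (pointer, length)
views into the original string instead of copies. A toy sketch of the
concept (not the stlsoft API; the view struct and function name are
made up for illustration):

#include <cstddef>
#include <string>
#include <vector>

struct view { const char* p; std::size_t n; };   // non-owning slice

std::vector<view> tokenize_views(const std::string& s,
                                 const std::string& seps)
{
    std::vector<view> out;
    std::string::size_type b = s.find_first_not_of(seps);
    while (b != std::string::npos) {
        std::string::size_type e = s.find_first_of(seps, b);
        if (e == std::string::npos)
            e = s.size();
        view v = { s.data() + b, e - b };        // no character copies
        out.push_back(v);
        b = s.find_first_not_of(seps, e);
    }
    return out;
}

The usual caveat applies: the views are only valid as long as the
source string is alive and unmodified.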

I agree with the OP in his later response about the poor quality of the
stlsoft docs, but I would say that the code itself is generally
top-notch. Having used the stlsoft::string_tokeniser myself in both
"normal" and "string view" mode, I have to say it leaves the user very
satisfied.

(For anyone considering exploring stlsoft, I would advise that grep is
likely to lead you to the gold a lot faster than your browser.)

HTH

Russ; no fuss

Carlos Moreno

Feb 1, 2006, 5:22:30 AM

Let's see it from the "observable state/behaviour" point of
view.

I have the string "a b c a b" and want to tokenize it with
space as separator.

In the code above, the call find_next_token tells me
that it locates the token that follows the given string
(since "tok" is a String object). But a string object in
anyone's mind is a simple, plain sequence of characters.

The unintuitive part kicks in when I ask myself: what does
it mean to "find the next token after the string "a""? There
are *two* instances of the token "a" in the given string;
how would the statement find_next_token (line, tok, separators)
know which "a" (in line) the object tok is talking about?

> You input a string and delimiters, and you get a segmented string on
> the output, which can be iterated through like an array using the
> "find_next_token()" method. If the overhead is low, I don't understand
> why this isn't the best tokenizer listed so far in this thread.

An additional reason why it's far from the best is that the
interface is not good either (in addition to the fact that I
still do not understand how it can even work -- the one thing
that comes to mind right now is that there was a typo and the
type of tok is not String, but some Token class that
represents a token within another string). The following
is, IMO, a good interface:

string line = "a b c a b";
tokenizer tokens (line);

copy (tokens.begin(), tokens.end(),
      ostream_iterator<string> (cout, "\n"));

// or

tokenizer::iterator pos = find (tokens.begin(), tokens.end(), "b");
if (pos != tokens.end()) // etc.

Compatibility with STL idioms is a must for this sort of thing,
IMHO.

I know, I know... Do take into account that I wrote the code I
posted earlier in this thread more than six years ago. Since it
is working for the particular application I'm using it in, and
I've been lazy about it (:-)), it has remained in its "sucky"
state for all this time after I was shown the above approach.
Oh well, two more threads like this one and I'll probably reach
the critical-mass point to finally go ahead and fix that code :-)

Carlos
--

Benry

Feb 1, 2006, 5:25:47 AM
Tilman Kuepper wrote:
> Hi Benry,
>
> > [std::getline...]
> >
> > It's good if you want to strip the beginning before a delimiter, but I
> > doubt that is what you're looking for in a tokenizer.
>
> Not quite true; you can really use it to step through a sequence
> of strings. Here is a complete example:
>
> #include <iostream>
> #include <string>
> #include <sstream>
>
> int main()
> {
>     std::istringstream strm("first;second;third");
>     std::string str;
>     while(std::getline(strm, str, ';'))
>         std::cout << str << "\n";
> }
>
> And here is the output of this program:
>
> first
> second
> third

Ok, I was thinking that strm would be left alone, and not have the
returned string removed from it. It makes sense :). Thanks.

-Ben

Roland Pibinger

Feb 1, 2006, 7:20:26 PM
On 1 Feb 2006 05:09:02 -0500, John Potter <jpo...@lhup.edu> wrote:

>K&M have a couple of versions of this algorithm which avoid standard
>string functions and use less code. I wonder why you did not just
>reference them. I think that there is an elegant version using the
>standard string functions.
>
>vector<string> tokens (string const& text, string const& delim) {
>    vector<string> result;
>    string::size_type b(text.find_first_not_of(delim));
>    while (b != string::npos) {
>        string::size_type e(text.find_first_of(delim, b));
>        result.push_back(text.substr(b, e - b));
>        b = text.find_first_not_of(delim, min(e, text.size()));
>    }
>    return result;
>}

[...]
Your explanations and solutions are lucid, as always. The
implementations could be much faster for common std::string
implementations. The C++ Standard neither guarantees that copying a
string is 'cheap' nor that returning by value is optimized by the
compiler. IMO, it's an anti-idiom to return a vector<something> by
value. The following is a slightly modified version that avoids
unnecessary string copying. It lets the user program against an
interface, not against the implementation details of the used
std::string and/or compiler. The only assumption in the code is that
copying an empty string is 'cheap', a safe assumption IMO.

void tokens (string const& text, string const& delim,
             vector<string>& result)
{
    string::size_type b(text.find_first_not_of(delim));
    while (b != string::npos) {
        string::size_type e(text.find_first_of(delim, b));
        result.insert (result.end(), string())->assign
            (text.c_str() + b, min(e, text.size()) - b);
        b = text.find_first_not_of(delim, min(e, text.size()));
    }
}

Best regards,
Roland Pibinger

Andrei Alexandrescu (See Website For Email)

Feb 2, 2006, 5:56:18 AM
Roland Pibinger wrote:
> On 1 Feb 2006 05:09:02 -0500, John Potter <jpo...@lhup.edu> wrote:
>
>> K&M have a couple of versions of this algorithm which avoid standard
>> string functions and use less code. I wonder why you did not just
>> reference them. I think that there is an elegant version using the
>> standard string functions.
>>
>> vector<string> tokens (string const& text, string const& delim) {
>>     vector<string> result;
>>     string::size_type b(text.find_first_not_of(delim));
>>     while (b != string::npos) {
>>         string::size_type e(text.find_first_of(delim, b));
>>         result.push_back(text.substr(b, e - b));
>>         b = text.find_first_not_of(delim, min(e, text.size()));
>>     }
>>     return result;
>> }
> [...]
> Your explanations and solutions are lucid, as always. The
> implementations could be much faster for common std::string
> implementations. The C++ Standard neither guarantees that copying a
> string is 'cheap' nor that returning by value is optimized by the
> compiler. IMO, it's an anti-idiom to return a vector<something> by
> value. The following is a slightly modified version that avoids
> unnecessary string copying.

I think you needn't worry about that; the implementation is fine, if of
course you ignore its fatal flaw of insisting on using SESE, which leads
to the utterly inelegant duplication of the call to find_first_not_of.

The more worrisome thing performance-wise is scalability. The algorithm
has complexity O(text.size() * delim.size()). Fortunately, it's often
the case that the delimiter is short, so the function might be what's
needed, if only you could get rid of the code duplication :o).


Andrei

Carlos Moreno

Feb 2, 2006, 12:02:39 PM
John Potter wrote:

> vector<string> tokens (string const& text, string const& delim) {
>     vector<string> result;
>     string::size_type b(text.find_first_not_of(delim));
>     while (b != string::npos) {
>         string::size_type e(text.find_first_of(delim, b));
>         result.push_back(text.substr(b, e - b));

Isn't this last statement kind of walking on thin ice?

I mean, is there somewhere in the standard that guarantees that
string::npos is implemented in such a way that the above works?

Is there an absolute guarantee about the behaviour of e - b when
e is string::npos? In fact, is there any guarantee that arithmetic
involving string::npos works at all? I've always been picky about
that: string::npos to me is a "magic entity" that is only used
with == or != to check if the call to find* found or didn't find
the requested target/pattern. It scares the hell out of me to
use it in arithmetic, even if I know that in many/most(/all?)
Standard Library implementations it will work.

> vector<string> tokens (string const& text, string const& delim) {
>     vector<string> result;
>     string::size_type b, e(0);
>     while ((b = text.find_first_not_of(delim, min(e, text.size())))
>            != string::npos)
>         result.push_back(text.substr(b,
>             (e = text.find_first_of(delim, b)) - b));
>     return result;
> }
>
> At this point, I wonder if your code might be easier to comprehend than
> the elegant version.

Isn't this an oxymoron? If version A is easier to comprehend than
version B, then that automatically disqualifies the adjective
"elegant" for version B... right? (maybe you could refer to this
one as the "less inelegant" version :-))

Plus, again, is there anything in the standard that requires that
comparison works with string::npos? That min call receiving a
string::npos as one parameter makes me shudder.

Carlos
--

Carlos Moreno

Feb 2, 2006, 12:03:10 PM
Andrei Alexandrescu (See Website For Email) wrote:

> I think you needn't worry about that; the implementation is fine, if of
> course you ignore its fatal flaw of insisting on using SESE, which leads
> to the utterly inelegant duplication of the call to find_first_not_of.

More utterly inelegant than placing an if inside the loop? An if that
will be unnecessarily executed each time even if it's only going to be
meaningful once at the end (during the last pass) of the loop?

Sorry, but SESE in this case leads to more efficient code, and it is
equally (in)elegant.

> The more worrisome thing performance-wise is scalability. The algorithm
> has complexity O(text.size() * delim.size()). Fortunately, it's often
> the case that the delimiter is short, so the function might be what's
> needed, if only you could get rid of the code duplication :o).

But that's not a flaw in John's (or my) implementation -- that's a
necessity; tokenizing an *arbitrary* string with an *arbitrary* set of
delimiters requires by definition O(text's size times delim's size);
there's no way around it, SESE or not, std::string or not, object-
oriented or not.

Carlos
--

Andrei Alexandrescu (See Website For Email)

Feb 2, 2006, 3:35:14 PM
Carlos Moreno wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>
>> I think you needn't worry about that; the implementation is fine, if of
>> course you ignore its fatal flaw of insisting on using SESE, which leads
>> to the utterly inelegant duplication of the call to find_first_not_of.
>
> More utterly inelegant than placing an if inside the loop? An if that
> will be unnecessarily executed each time even if it's only going to be
> meaningful once at the end (during the last pass) of the loop?
>
> Sorry, but SESE in this case leads to more efficient code, and it is
> equally (in)elegant.

I've heard about, and I agree with, "Once and Only Once"
(http://xp.c2.com/OnceAndOnlyOnce.html), but not about "placing an if
inside the loop".

That if wouldn't be unnecessarily executed. If the loop is properly
written, it's the only test inside the loop, and it breaks it halfway.
"Think outside the SESE." :o)

>> The more worrisome thing performance-wise is scalability. The algorithm
>> has complexity O(text.size() * delim.size()). Fortunately, it's often
>> the case that the delimiter is short, so the function might be what's
>> needed, if only you could get rid of the code duplication :o).
>
> But that's not a flaw in John's (or my) implementation -- that's a
> necessity; tokenizing an *arbitrary* string with an *arbitrary* set of
> delimiters requires by definition O(text's size times delim's size);
> there's no way around it, SESE or not, std::string or not, object-
> oriented or not.

It's as much of a necessity as SESE is :o). A google for <<fast string
searching>> yields as its first result "The Boyer-Moore Fast String
Searching Algorithm"
(http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/).
Even without using Boyer-Moore, a simple hash for the delimiter
characters decreases the complexity from quadratic to linear.
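
The table idea is simple enough to sketch: pay O(delim.size()) once to
build a 256-entry membership table, after which each character test is
O(1) and the whole scan is O(text.size() + delim.size()):

#include <climits>
#include <string>

// Build a membership table for the delimiter set (a sketch).
void build_table(const std::string& delim, bool (&tbl)[UCHAR_MAX + 1])
{
    for (int i = 0; i <= UCHAR_MAX; ++i)
        tbl[i] = false;
    for (std::string::size_type i = 0; i != delim.size(); ++i)
        tbl[(unsigned char)delim[i]] = true;   // later lookups are O(1)
}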


Andrei

Roland Pibinger

Feb 2, 2006, 3:39:43 PM
On 2 Feb 2006 05:56:18 -0500, "Andrei Alexandrescu (See Website For
Email)" <SeeWebsit...@moderncppdesign.com> wrote:

>Roland Pibinger wrote:
>> Your explanations and solutions are lucid, as always. The
>> implementations could be much faster for common std::string
>> implementations. The C++ Standard neither guarantees that copying a
>> string is 'cheap' nor that returning by value is optimized by the
>> compiler. IMO, it's an anti-idiom to return a vector<something> by
>> value. The following is a slightly modified version that avoids
>> unnecessary string copying.
>
>I think you needn't worry about that; the implementation is fine, if of
>course you ignore its fatal flaw of insisting on using SESE, which leads
>to the utterly inelegant duplication of the call to find_first_not_of.

There is no need to worry when you can measure.
I've compared the two implementations, John Potter's original and my
modified version, WRT the number of dynamic allocations they produce.
The test string consists of 10 tokens of length 16, each delimited by
one delimiter:

const char* text = "1234567890123456;123...  // 10 tokens (len=16 each)
                                             // in a row, delimited by ';'
                                             // total length=170
// test 1 (original)
vector<string> v1;
v1 = tokens (text, ";.:,");
assert (v1.size() == 10);

// test 2 (original)
vector<string> v2 (tokens (text, ";.:,"));
assert (v2.size() == 10);

// test 3 (modified)
vector<string> v3;
tokens (text, ";.:,", v3);
assert (v3.size() == 10);

// test 4 (modified)
vector<string> v4;
v4.reserve(10);
tokens (text, ";.:,", v4);
assert (v4.size() == 10);

Number of dynamic allocations for string using VC++ 7.1
(compiled with /O2):
test 1: 73
test 2: 63
test 3: 36
test 4: 11

Allocations of vector are not included in the numbers.
In VC++ 6.0 (reference-counted string) the number of allocations is 12
for all tests. If tokens with length < 16 are used in VC++ 7.1 the
number probably drops to 1 (because of SSO). Well, programming with
std::string is programming against an implementation, not an
interface.

Best regards,
Roland Pibinger

Andrei Alexandrescu (See Website For Email)

Feb 3, 2006, 7:06:09 AM
Roland Pibinger wrote:
> On 2 Feb 2006 05:56:18 -0500, "Andrei Alexandrescu (See Website For
> Email)" <SeeWebsit...@moderncppdesign.com> wrote:
>> Roland Pibinger wrote:
>>> Your explanations and solutions are lucid, as always. The
>>> implementations could be much faster for common std::string
>>> implementations. The C++ Standard neither guarantees that copying a
>>> string is 'cheap' nor that returning by value is optimized by the
>>> compiler. IMO, it's an anti-idiom to return a vector<something> by
>>> value. The following is a slightly modified version that avoids
>>> unnecessary string copying.
>> I think you needn't worry about that; the implementation is fine, if of
>> course you ignore its fatal flaw of insisting on using SESE, which leads
>> to the utterly inelegant duplication of the call to find_first_not_of.
>
> There is no need to worry when you can measure.
> I've compared the two implementations, John Potter's original and my
> modified version, WRT the number of dynamic allocations they produce.
> The test string consists of 10 tokens of length 16, each delimited by
> one delimiter:

That's a nice set of numbers, but what I was saying was that the number
of character comparisons will increase fast with the size of the input,
to the point where it dominates the total cost. The allocation count is
O(text.size() / delim.size()), while the complexity of the algo is
O(text.size() * delim.size()).


Andrei

John Potter

Feb 3, 2006, 7:01:07 AM
On 2 Feb 2006 05:56:18 -0500, "Andrei Alexandrescu (See Website For
Email)" <SeeWebsit...@moderncppdesign.com> wrote:

> Roland Pibinger wrote:

> > On 1 Feb 2006 05:09:02 -0500, John Potter <jpo...@lhup.edu> wrote:

> >> K&M have a couple of versions of this algorithm which avoid standard
> >> string functions and use less code. I wonder why you did not just
> >> reference them. I think that there is an elegant version using the
> >> standard string functions.

> >> vector<string> tokens (string const& text, string const& delim) {
> >>     vector<string> result;
> >>     string::size_type b(text.find_first_not_of(delim));
> >>     while (b != string::npos) {
> >>         string::size_type e(text.find_first_of(delim, b));
> >>         result.push_back(text.substr(b, e - b));
> >>         b = text.find_first_not_of(delim, min(e, text.size()));
> >>     }
> >>     return result;
> >> }

> > Your explanations and solutions are lucid, as always. The
> > implementations could be much faster for common std::string
> > implementations. The C++ Standard neither guarantees that copying a
> > string is 'cheap' nor that returning by value is optimized by the
> > compiler. IMO, it's an anti-idiom to return a vector<something> by
> > value. The following is a slightly modified version that avoids
> > unnecessary string copying.

> I think you needn't worry about that; the implementation is fine, if of
> course you ignore its fatal flaw of insisting on using SESE, which leads
> to the utterly inelegant duplication of the call to find_first_not_of.

It's a shame that your news server does not seem to carry the whole
thread. The original SEME code contained 2 find_first_of, 2
find_first_not_of, 3 substr, 4 push_back, and 4 return. I also posted a
version with no duplication.

| We could also remove the duplicated code as you seemed to want by using
| the usual C while loop and a bit more. I prefer the above to this, but
| it was fun. This version really belts it out and thus has no need for
| braces :)

| vector<string> tokens (string const& text, string const& delim) {
|     vector<string> result;
|     string::size_type b, e(0);
|     while ((b = text.find_first_not_of(delim, min(e, text.size())))
|            != string::npos)
|         result.push_back(text.substr(b,
|             (e = text.find_first_of(delim, b)) - b));
|     return result;
| }

> The more worrisome thing performance-wise is scalability. The algorithm
> has complexity O(text.size() * delim.size()).

Yes, the solution is general. It could be improved by using a table for
small character sets, but not likely for large ones. The point was
clarity, not speed. Likely the application is I/O bound anyway, and
certainly no profiling data was provided.

> Fortunately, it's often
> the case that the delimiter is short, so the function might be what's
> needed, if only you could get rid of the code duplication :o).

It's gone. Now maybe you can respond to the question in my original
post? Is the original code, with its maximal duplication, easier to
write, read, and maintain than the elegant version? Now, do you think
that the removal of the duplication helps the elegant version? Do you
agree that combining the find_first_of and push_back is just cute, not
helpful? Do you have a version which is clearer? I'm curious about
writing, reading, maintaining. I'm sure that I spent more time on the
code than Carlos did; however, that could also mean that anyone reading
it would also need to spend more time understanding my code than his.
Maybe there is no place for elegance in industry?

John

John Potter

Feb 3, 2006, 7:03:38 AM
On 1 Feb 2006 19:20:26 -0500, rpb...@yahoo.com (Roland Pibinger) wrote:

> The implementations could be much faster for common std::string
> implementations. The C++ Standard neither guarantees that copying a
> string is 'cheap' nor that returning by value is optimized by the
> compiler. IMO, it's an anti-idiom to return a vector<something> by
> value.

Perhaps. I think it has much to do with the calling code. An old C
style will suffer the copy, but a C++ style seems to avoid it in most
compilers today.

vector<string> vs;               // old C style
while (...) {
    vs = tokens(...);
    ...
}

while (...) {                    // C++ style
    vector<string> vs(tokens(...));
    ...
}

> The following is a slightly modified version that avoids
> unnecessary string copying. It lets the user program against an
> interface, not against the implementation details of the used
> std::string and/or compiler. The only assumption in the code is that
> copying an empty string is 'cheap', a safe assumption IMO.

> void tokens (string const& text, string const& delim,
>              vector<string>& result)
> {
>     string::size_type b(text.find_first_not_of(delim));
>     while (b != string::npos) {
>         string::size_type e(text.find_first_of(delim, b));
>         result.insert (result.end(), string())->assign
>             (text.c_str() + b, min(e, text.size()) - b);
>         b = text.find_first_not_of(delim, min(e, text.size()));
>     }
> }

I think that there are also some assumptions about assigning a new
value to an empty string being considerably faster than copying a
string, and c_str being considerably faster than substr. For
curiosity, here are some timed results. The comparison is between
algorithms, not compilers, which use different clocks.

bcc32
609 Carlos
609 pos
391 iterator
593 Roland

gcc
2953 Carlos
2969 pos
3484 iterator
2938 Roland

gcc -O1
2625 Carlos
2625 pos
2531 iterator
2563 Roland

gcc -O2
1969 Carlos
1969 pos
1766 iterator
2094 Roland

I suspect that actual use would be in an I/O bound application and these
times are meaningless. Profile?

In any case, my curiosity was about readability not efficiency. I think
that your insert, assign, c_str would lose to push_back, substr on that
count.

I still have not seen any responses to the write, read question.

John

kanze

Feb 3, 2006, 9:57:49 AM
Carlos Moreno wrote:
> John Potter wrote:

> > vector<string> tokens (string const& text, string const& delim) {
> >     vector<string> result;
> >     string::size_type b(text.find_first_not_of(delim));
> >     while (b != string::npos) {
> >         string::size_type e(text.find_first_of(delim, b));
> >         result.push_back(text.substr(b, e - b));

> Isn't this last statement kind of walking on thin ice?

> I mean, is there somewhere in the standard that guarantees
> that string::npos is implemented in such a way that the above
> works?

Yes. The actual length of the substring is min(n, size() - pos),
and npos is guaranteed to be larger than the largest possible
length.

In fact, the default argument to the second parameter is npos.

> Is there an absolute guarantee about the behaviour of e - b when
> e is string::npos?

Yes. It was designed this way on purpose.

> In fact, is there any guarantee that arithmetic involving
> string::npos works at all?

Well, it has the type size_t, so arithmetic is well defined on
it. It's also guaranteed to be (size_t)( -1 ), which in turn is
guaranteed to be std::numeric_limits<size_t>::max() (since
size_t is guaranteed to be unsigned). So arithmetic on it is
well defined. Whether it makes sense or not is another matter,
but in practically every case where you give a length to be
copied, the standard requires that the implementation actually
use min(available length, given length), so npos can always be
used.

In this special case, it is guaranteed that npos - b will return
a value larger than the number of characters remaining in the
string.
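
A small self-contained illustration of that guarantee (my example, not
from the standard text):

#include <cassert>
#include <string>

int main()
{
    std::string s("hello world");
    std::string::size_type b = 6;
    std::string::size_type e = s.find(';');   // npos: no ';' in s
    // e - b wraps around to a huge unsigned value, but substr clamps
    // the requested length to size() - pos, so this is well defined.
    assert(s.substr(b, e - b) == "world");
    return 0;
}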

> I've always been picky about that: string::npos to me is a
> "magic entity" that is only used with == or != to check if the
> call to find* found or didn't find the requested
> target/pattern. It scares the hell out of me to use it in
> arithmetic, even if I know that in many/most(/all?) Standard
> Library implementations it will work.

It's guaranteed by the standard.

> > vector<string> tokens (string const& text, string const& delim) {
> >     vector<string> result;
> >     string::size_type b, e(0);
> >     while ((b = text.find_first_not_of(delim, min(e, text.size())))
> >            != string::npos)
> >         result.push_back(text.substr(b,
> >             (e = text.find_first_of(delim, b)) - b));
> >     return result;
> > }

> > At this point, I wonder if your code might be easier to
> > comprehend than the elegant version.

[...]

> Plus, again, is there anything in the standard that requires
> that comparison works with string::npos? That min call
> receiving a string::npos as one parameter makes me shudder.

It's legal, although in this case, I agree that it's pretty
ugly. Given that I need to test outside of the function anyway,
I'd move the test up front:

while ( e != string::npos && (b = text.find_...

(Actually, of course, I'd encapsulate this somehow so I didn't
need the nested assignment. There's nothing like unexpected
side effects in a conditional to render code unreadable.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

kanze

Feb 3, 2006, 9:57:23 AM
Andrei Alexandrescu (See Website For Email) wrote:
> Carlos Moreno wrote:

> >> The more worrisome thing performance-wise is scalability.
> >> The algorithm has complexity O(text.size() * delim.size()).
> >> Fortunately, it's often the case that the delimiter is
> >> short, so the function might be what's needed, if only you
> >> could get rid of the code duplication :o).

> > But that's not a flaw in John's (or my) implementation --
> > that's a necessity; tokenizing an *arbitrary* string with
> > an *arbitrary* set of delimiters requires by definition
> > O(text's size times delim's size); there's no way around it,
> > SESE or not, std::string or not, object-oriented or not.

> It's as much of a necessity as SESE is :o). A google for
> <<fast string searching>> yields as its first result "The
> Boyer-Moore Fast String Searching Algorithm"
> (http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/).

Boyer-Moore won't help here -- we're looking to match one of,
not to match a complete sequence. On the other hand,
initializing a bool[UCHAR_MAX+1] according to whether each
character is in the target set is trivial, and reduces the
complexity to O(n).

> Even without using Boyer-Moore, a simple hash for the
> delimiter characters decreases the complexity from quadratic
> to linear.

Not quite sure how a hash applies here, either.

Just for the record, my version of John's code would be
something like:

class Setter
{
public:
    Setter( bool* dest ) : tbl( dest ) {}
    void operator()( char ch ) const
    {
        tbl[ (unsigned char)ch ] = true ;
    }
private:
    bool* tbl ;
} ;

template< bool t >
struct Tester
{
public:
    Tester( bool* dest ) : tbl( dest ) {}
    bool operator()( char ch ) const
    {
        // true iff ch's membership in the delimiter set equals t
        return tbl[ (unsigned char)ch ] == t ;
    }
private:
    bool* tbl ;
} ;

std::vector< std::string >
tokens(
    std::string const& text,
    std::string const& delim )
{
    std::vector< std::string > result ;
    bool tValues[ UCHAR_MAX + 1 ] ;
    std::fill_n( tValues, UCHAR_MAX + 1, false ) ;
    std::for_each( delim.begin(), delim.end(), Setter( tValues ) ) ;
    std::string::const_iterator
        b( std::find_if(
               text.begin(),
               text.end(),
               Tester< false >( tValues ) ) ) ;
    while ( b != text.end() ) {
        std::string::const_iterator
            e( std::find_if(
                   b,
                   text.end(),
                   Tester< true >( tValues ) ) ) ;
        result.push_back( std::string( b, e ) ) ;
        b = std::find_if( e, text.end(), Tester< false >( tValues ) ) ;
    }
    return result ;
}

The extra classes make it look like a lot of work, but in fact,
of course, anyone who has to deal with text will already have a
set-of-characters class which will allow something like:

std::vector< std::string >
tokens(
    std::string const& text,
    std::string const& delim )
{
    std::vector< std::string > result ;
    Gabi::SetOfCharacter d( delim ) ;
    std::string::const_iterator b( std::find_if( text.begin(),
                                                 text.end(),
                                                 std::not1( d.contains() ) ) ) ;
    while ( b != text.end() ) {
        std::string::const_iterator e( std::find_if( b,
                                                     text.end(),
                                                     d.contains() ) ) ;
        result.push_back( std::string( b, e ) ) ;
        b = std::find_if( e, text.end(), std::not1( d.contains() ) ) ;
    }
    return result ;
}

In both cases, the actual complexity is only O(n+m), not O(n*m).

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

philip....@gmail.com

Feb 3, 2006, 10:53:22 AM
I'm impressed with the elegance of the algorithms shown so far but can
they all produce a fixed number of tokens for input like this:

10,12.2,a,bbb,15
12,,a,bbb,12
15,15.1,,bbb,12
,1,11.6,,

I'd have thought that in many usage scenarios (certainly any that I've
ever needed) I'd want to tokenise the data such that empty fields were
still present in the vector otherwise my calling code wouldn't know
which fields were missing.

Or maybe I'm doing something wrong when I try these examples...

Thanks, Phil

Roland Pibinger

Feb 3, 2006, 10:01:15 PM
On 3 Feb 2006 10:53:22 -0500, philip....@gmail.com wrote:
>I'm impressed with the elegance of the algorithms shown so far but can
>they all produce a fixed number of tokens for input like this:
>
>10,12.2,a,bbb,15
>12,,a,bbb,12
>15,15.1,,bbb,12
>,1,11.6,,
>
>I'd have thought that in many usage scenarios (certainly any that I've
>ever needed) I'd want to tokenise the data such that empty fields were
>still present in the vector otherwise my calling code wouldn't know
>which fields were missing.

In your case it is probably better to read the data line by line and
use string::find(',', pos) to identify the possibly empty tokens (you
have two delimiters in your data: the field delimiter ',' and the
record delimiter '\n').
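
A sketch of that approach for a single record, keeping empty fields
(the function name is mine):

#include <string>
#include <vector>

// Split one line on ',' and keep empty fields, so ",1,11.6,,"
// always yields five entries.
std::vector<std::string> fields(const std::string& line)
{
    std::vector<std::string> out;
    std::string::size_type b = 0, e;
    while ((e = line.find(',', b)) != std::string::npos) {
        out.push_back(line.substr(b, e - b));
        b = e + 1;
    }
    out.push_back(line.substr(b));   // last field, possibly empty
    return out;
}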

Best wishes,
Roland Pibinger

Andrei Alexandrescu (See Website For Email)

Feb 3, 2006, 10:03:43 PM
John Potter wrote:
> On 2 Feb 2006 05:56:18 -0500, "Andrei Alexandrescu (See Website For
> Email)" <SeeWebsit...@moderncppdesign.com> wrote:
>> I think you needn't worry about that; the implementation is fine, if of
>> course you ignore its fatal flaw of insisting on using SESE, which leads
>> to the utterly inelegant duplication of the call to find_first_not_of.
>
> It's a shame that your news server does not seem to carry the whole
> thread. The original SEME code contained 2 find_first_of, 2
> find_first_not_of, 3 substr, 4 push_back, and 4 return. I also posted a
> version with no duplication.
>
> | We could also remove the duplicated code as you seemed to want by using
> | the usual C while loop and a bit more. I prefer the above to this, but
> | it was fun. This version really belts it out and thus has no need for
> | braces :)
>
> | vector<string> tokens (string const& text, string const& delim) {
> |     vector<string> result;
> |     string::size_type b, e(0);
> |     while ((b = text.find_first_not_of(delim, min(e, text.size())))
> |            != string::npos)
> |         result.push_back(text.substr(b,
> |             (e = text.find_first_of(delim, b)) - b));
> |     return result;
> | }

Sorry I missed the SESE variant that has an uninitialized variable :o).

>> The more worrisome thing performance-wise is scalability. The algorithm
>> has complexity O(text.size() * delim.size()).
>
> Yes, the solution is general. It could be improved by using a table for
> small character sets but not likely for large ones. The point was
> clarity, not speed. Likely that the application is I/O bound anyway and
> certainly no profiling data was provided.

As I said, a hash is quite what the doctor prescribed for such a tokenizer.

>> Fortunately, it's often
>> the case that the delimiter is short, so the function might be what's
>> needed, if only you could get rid of the code duplication :o).
>
> It's gone. Now maybe you can respond to the question in my original
> post? Is the maximal duplication original code easier to write, read,
> and maintain than the elegant version?

Depends on who's looking at it. I can't speak for others, but for my
money, duplication is bad pretty much no matter how you look at it, and
no matter how innocuous it seems at first. In fact, it's bitten me (and
people I worked with) so often and with so much regularity, that I found
it safe to simply make it a point to avoid duplication. For example,
when I have no other solution, I'd use a macro. To people who say
"macros are evil and should be avoided at all costs", I reply that
duplication has caused me way more problems than macros ever did, and so
I believe that duplication is a greater evil that justifies a low blow
with macros.

> Now, do you think that the
> removal of the duplication helps the elegant version?

Yes.

> Do you agree that
> combining the find_first_of and push_back is just cute not helpful?

It depends on who's looking. If you ask me of all people, I'd say, I
perhaps wouldn't write it that way (because I only see it as a
contortion with roots in SESE), but I can read it, and I definitely
prefer it to the version using duplication.

Duplication jumps at me from the code like a foul word inside a doctoral
thesis.

> Do
> you have a version which is clearer?

I have one that you won't find clearer. If you did, you would have
written it yourself. After all, the task is very simple so we can all
write the code that we really believe is the very best.

vector<string> tokens (string const& text, string const& delim) {
    vector<string> result;

    string::const_iterator delimB = text.begin();
    for (;;) {
        string::const_iterator tokB = find_first_not_of(
            delimB, text.end(), delim.begin(), delim.end());
        if (tokB == text.end()) break;
        delimB = find_first_of(tokB, text.end(),
                               delim.begin(), delim.end());
        result.push_back(string(tokB, delimB));
    }
    return result;
}

I personally find such code very easy to understand, and I'm encouraged
that I could assign informative names to the variables. For example,
initializing delimB with text.begin() tells immediately that we're first
looking for a delimiter. Others, on the other hand, will find the
infinite loop with the hidden "break" in the middle disgusting. (I like
it. :o))

> I'm curious about writing,
> reading, maintaining. I'm sure that I spent more time on the code than
> Carlos did; however, that could also mean that anyone reading it would
> also need to spend more time understanding my code than his. Maybe
> there is no place for elegance in industry?

I believe spending more time on some code will make it easier to get
into for a newcomer, so I'm not sure what to say. I do believe there's
elegance in industry, but there's also a lot of bad taste too, and just
like in life, the pressure of bad taste is discouraging at times. In
this particular case, maybe someone would look at an implementation
bleeding with duplication and say, yeah, I see what this does. But
they'd put value in the wrong place. Code is not meant to be read like
natural language text because natural language text is not optimized for
eternal modification; it's just meant for write once, read many. Someone
with the (IMHO) right set of values will point out in no time that the
code bears duplication, which makes it inelegant.

So I'll solve the internal tension raised by your philosophical question
by simply blaming all obfuscation on SESE :o).


Andrei

Alex Vinokur

Feb 3, 2006, 10:06:27 PM

<dann...@hotmail.com> wrote in message news:1138513102.2...@g47g2000cwa.googlegroups.com...
> Is there a recommended best solution for string tokenizing in C++?
>
> I mean, I know strtok() is bad, but what's a good alternative?
>
> I've seen Boost tokenizers, but there's a huge include overhead, which
> I don't want. Is there a simple class or function out there that's
> recommended by those in the know?
>

[snip]


You can use:
C function strsep() on UNIX,
C++ function using std::getline and std::istream_iterator,
C++ function using istream::getline and std::istream_iterator,
C++ function using for-loop and std::string::find_first_of.
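
As a sketch of the istream_iterator variant from the list above
(whitespace delimiters, which is what operator>> for std::string uses):

#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::istringstream in("one two  three");
    // The extra parentheses around the first argument avoid the
    // "most vexing parse".
    std::vector<std::string> v((std::istream_iterator<std::string>(in)),
                               std::istream_iterator<std::string>());
    std::cout << v.size() << '\n';   // 3
    return 0;
}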

Sources:
http://groups.google.com/group/sources/browse_frm/thread/015c0163b259585c

Test-results:
http://groups.google.com/group/log-files/browse_frm/thread/be6d5155996006ce


--
Alex Vinokur
email: alex DOT vinokur AT gmail DOT com
http://mathforum.org/library/view/10978.html
http://sourceforge.net/users/alexvn

John Potter

Feb 3, 2006, 10:13:45 PM
On 2 Feb 2006 12:02:39 -0500, Carlos Moreno
<moreno_at_mo...@mailinator.com> wrote:

> John Potter wrote:
>
> > vector<string> tokens (string const& text, string const& delim) {
> >     vector<string> result;
> >     string::size_type b(text.find_first_not_of(delim));
> >     while (b != string::npos) {
> >         string::size_type e(text.find_first_of(delim, b));
> >         result.push_back(text.substr(b, e - b));

> Isn't this last statement kind of walking on thin ice?

No.

> I mean, is there somewhere in the standard that guarantees that
> string::npos is implemented in such a way that the above works?

Sure. substr (size_type pos = 0, size_type n = npos). Your if used the
default length which is everything after pos. Pos must be in 0..length,
but n can be anything.

> Is there an absolute guarantee about the behaviour of e - b when
> e is string::npos? In fact, is there any guarantee that arithmetic
> involving string::npos works at all?

Yes. size_type is unsigned and npos is size_type(-1). If e is npos,
you get some large number greater than the remaining size, which acts
the same as npos. Any n greater than the remaining size just gives the
rest of the string.

> I've always been picky about
> that: string::npos to me is a "magic entity" that is only used
> with == or != to check if the call to find* found or didn't find
> the requested target/pattern. It scares the hell out of me to
> use it in arithmetic, even if I know that in many/most(/all?)
> Standard Library implementations it will work.

Doesn't scare me because it is required to work. See 21.3/6 and
21.3.6.7.

> > vector<string> tokens (string const& text, string const& delim) {
> >     vector<string> result;
> >     string::size_type b, e(0);
> >     while ((b = text.find_first_not_of(delim, min(e, text.size())))
> >            != string::npos)
> >         result.push_back(text.substr(b,
> >             (e = text.find_first_of(delim, b)) - b));
> >     return result;
> > }

> > At this point, I wonder if your code might be easier to comprehend than
> > the elegant version.

> Isn't this an oxymoron? If version A is easier to comprehend than
> version B, then that automatically disqualifies the adjective
> "elegant" for version B... right? (maybe you could refer to this
> one as the "less inelegant" version :-))

Elegance is in the eye of the beholder. It takes some sophistication to
appreciate elegance. Elegance is sophisticated simplicity which could
seem complex or ugly to the uninitiated. Quick sort is elegant, but
bubble sort is easier to understand.

> Plus, again, is there anything in the standard that requires that
> comparison works with string::npos? That min call receiving a
> string::npos as one parameter makes me shudder.

Not when you understand. It is valid to start a search at size, which
is off the end by one and is always less than npos and greater than any
pos within the string. Starting at size returns npos; starting at
size+1 throws out_of_range.

See what I mean about elegance? You may also have answered my question.
Your code requires no knowledge of the true meaning of life^H^H^H^Hnpos.
The knowledge removes the needless tests, but may add confusion for the
next programmer who reads the code.

John

Carlos Moreno

Feb 3, 2006, 10:08:30 PM
John Potter wrote:

> It's a shame that your news server does not seem to carry the whole
> thread. The original SEME code contained 2 find_first_of, 2
> find_first_not_of, 3 substr, 4 push_back, and 4 return. I also posted a
> version with no duplication.

Hey HEY!! Carlos' newsreader *is* showing him everything, ok?!! ;-)

> I'm sure that I spent more time on the code than
> Carlos did; however, that could also mean that anyone reading it would
> also need to spend more time understanding my code than his. Maybe
> there is no place for elegance in industry?

Again -- it depends on how you define elegance; I tend to assume a
high correlation with code readability and maintainability; we could
informally say that this is a measure of "objective" elegance, whereas
your code (at least as compared to my original version) has a higher
level of "subjective" elegance, maybe?

Carlos
--

Roland Pibinger

unread,
Feb 3, 2006, 10:10:58 PM2/3/06
to
On 3 Feb 2006 07:06:09 -0500, "Andrei Alexandrescu (See Website For

Email)" <SeeWebsit...@moderncppdesign.com> wrote:
>That's a nice set of numbers, but what I was saying was that the number
>of character comparisons will increase fast with the size of the input,
>to the point where it dominates the total cost.

But where is the break-even point? File-size 10000, 100000 or more?
Delimiter-size?

>The allocation count is O(text.size() / delim.size()),

... neglecting vector re-allocations which are responsible for the
large numbers in 3 of the 4 examples.

>while the complexity of the algo is
>O(text.size() * delim.size()).

Apart from tokenizing, what I try to say is that std::string is
underspecified. It has opposite performance characteristics (cheap vs.
expensive) depending on the implementation for the two most important
member functions, the copy constructor and the assignment operator.
The string implementation determines the string programming style.
Most text-book and newsgroup examples use a programming style that
implicitly assumes cheap string copying and assignment. For no good
reason.

Best regards,
Roland Pibinger

Roland Pibinger

unread,
Feb 3, 2006, 10:12:40 PM2/3/06
to
On 3 Feb 2006 07:03:38 -0500, John Potter <jpo...@lhup.edu> wrote:

>On 1 Feb 2006 19:20:26 -0500, rpb...@yahoo.com (Roland Pibinger) wrote:
>
>> void tokens (string const& text, string const& delim,
>>              vector<string>& result) {
>>     string::size_type b(text.find_first_not_of(delim));
>>     while (b != string::npos) {
>>         string::size_type e(text.find_first_of(delim, b));
>>         result.insert (result.end(), string())->assign
>>             (text.c_str() + b, min(e, text.size()) - b);
>>         b = text.find_first_not_of(delim, min(e, text.size()));
>>     }
>> }
>
>I think that there are also some assumptions about assign new value to
>empty string being considerably faster than copying a string and c_str
>being considerably faster than substr. For curiosity, here are some
>timed results. The comparison is between algorithms not compilers which
>use different clocks.

[test results]

The strings you use are all reference counted, for which assignment and
copying is 'cheap'. Hence the small differences between the
'tokens' implementations (see my other posting for a different
performance test).

>I suspect that actual use would be in an I/O bound application and these
>times are meaningless. Profile?

If C++ had standardized a string class with defined performance
characteristics there would be no discussion.

>In any case, my curiosity was about readability not efficiency. I think
>that your insert, assign, c_str would lose to push_back, substr on that
>count.

Maybe, but I don't see much difference between them (the assign can
be written in a more user-friendly way). Your first solution using the standard
string functions is "obviously" the most readable and elegant (apart
from the return by value ;-) of the three candidates. The simple
solution always wins over the more complicated.

>I still have not seen any responses to the write, read question.

Maybe the readers of this group are already used to the code
obfuscation caused by the Standard library?

Best regards,
Roland Pibinger

Andrei Alexandrescu (See Website For Email)

unread,
Feb 3, 2006, 10:09:21 PM2/3/06
to
kanze wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
> Boyer-Moore won't help here -- we're looking to match one of,
> not to match a complete sequence. On the other hand,
> initializing a bool[UCHAR_MAX+1] according to whether each
> character is in the target set is trivial, and reduces the
> complexity to O(n).

That's the inspiration I was hinting at...

>> Even without using Boyer-Moore, a simple hash for the
>> delimiter characters decreases the complexity from quadratic
>> to linear.
>
> Not quite sure how a hash applies here, either.

Super simple. (1) Create a hash containing the delimiter characters. (2)
For each character in the text, look it up in the hash. If it's there,
it's a delimiter. If not, it's part of a token. The algo is O(n)
assuming a O(1) hash.

> Just for the record, my version of John's code would be
> something like:
>
> class Setter
> {
> public:
> Setter( bool* dest ) : tbl( dest ) {}
> void operator()( char ch ) const
> {
> tbl[ (unsigned char)ch ] = true ;
> }
> private:
> bool* tbl ;
> } ;

[snip]

Alas, yet another sacrifice on the altar of the for_each god... :o)
There are things that I guess I'll never understand.

Andrei

John Potter

unread,
Feb 4, 2006, 7:05:23 AM2/4/06
to
On 3 Feb 2006 10:53:22 -0500, philip....@gmail.com wrote:

> I'm impressed with the elegance of the algorithms shown so far but can
> they all produce a fixed number of tokens for input like this:

> 10,12.2,a,bbb,15
> 12,,a,bbb,12
> 15,15.1,,bbb,12
> ,1,11.6,,

No. They solve a different problem: text processing where the
delimiters might be punctuation. There are no fields, and things like
the period-space-space prior to this sentence are treated as a single
token separator. There is no such thing as a missing token.

> I'd have thought that in many usage scenarios (certainly any that I've
> ever needed) I'd want to tokenise the data such that empty fields were
> still present in the vector otherwise my calling code wouldn't know
> which fields were missing.

The key word is "field". You want the fields, not the tokens. Here is
one way to get them.

vector<string> fields (string const& text, char delim) {
    vector<string> retval;
    string::size_type e(string::npos);
    do {
        string::size_type b(e + 1); // Yes Carlos, npos+1 is 0. :-P
        e = text.find(delim, b);
        retval.push_back(text.substr(b, e - b));
    } while (e != string::npos);
    return retval;
}
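
For the sample input above, a quick check (a hypothetical driver line,
not part of the function):

    vector<string> v = fields(",1,11.6,,", ',');
    // v.size() == 5: "", "1", "11.6", "", ""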

John

John Potter

unread,
Feb 4, 2006, 7:05:48 AM2/4/06
to
On 2 Feb 2006 15:35:14 -0500, "Andrei Alexandrescu (See Website For
Email)" <SeeWebsit...@moderncppdesign.com> wrote:

> That if wouldn't be unnecessarily executed. If the loop is properly
> written, it's the only test inside the loop, and it breaks it halfway.
> "Think outside the SESE." :o)

Surely you didn't intend to write the SESE:

vector<string> tokens (string const& text, string const& delim) {
    vector<string> result;
    string::size_type b, e(0);
    for (;;) {
        b = text.find_first_not_of(delim, min(e, text.size()));
        if (b == string::npos) break;
        e = text.find_first_of(delim, b);
        result.push_back(text.substr(b, e - b));
    }
    return result;
}

The loop enters at the top and exits in the middle. It is SESE. There
is even a structured syntax version.

loop <statement> andwhile (<condition>) <statement>

If you did not intend that, post some code. If you did intend that, how
do you compare it to this?

vector<string> tokens (string const& text, string const& delim) {
    vector<string> result;
    string::size_type b, e(0);
    while ((b = text.find_first_not_of(delim, min(e, text.size())))
           != string::npos) {
        e = text.find_first_of(delim, b);
        result.push_back(text.substr(b, e - b));
    }
    return result;
}

John

de...@antiquark.com

unread,
Feb 4, 2006, 7:02:02 AM2/4/06
to
> I'm impressed with the elegance of the algorithms shown so far but can
> they all produce a fixed number of tokens for input like this:
>
> 10,12.2,a,bbb,15
> 12,,a,bbb,12
> 15,15.1,,bbb,12
> ,1,11.6,,

Maybe you're looking for something like this.
Derek.
========================

#include <vector>
#include <string>
#include <sstream>
#include <iostream>

using namespace std;

int main()
{
    string line;
    vector<vector<string> > table;
    cout << "Enter data, ctrl-d when done:" << endl;
    while(getline(cin,line))
    {
        table.push_back( vector<string>() );
        istringstream is(line);
        string token;
        while(getline(is, token, ','))
        {
            table.back().push_back( token );
        }
    }

    for(size_t r = 0 ; r < table.size() ; r ++ )
    {
        for(size_t c = 0 ; c < table.at(r).size() ; c ++ )
        {
            cout << " row:" << r ;
            cout << " col:" << c ;
            cout << " val:\"" << table.at(r).at(c) << "\"";
            cout << endl;
        }
    }
}

John Potter

unread,
Feb 4, 2006, 7:04:09 AM2/4/06
to
On 3 Feb 2006 09:57:23 -0500, "kanze" <ka...@gabi-soft.fr> wrote:

> Andrei Alexandrescu (See Website For Email) wrote:

> > It's as much of a necessity as SESE is :o). A google for
> > <<fast string searching>> yields as its first result "The
> > Boyer-Moore Fast String Searching Algorithm"
> > (http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/).

> Boyer-Moore won't help here -- we're looking to match one of,
> not to match a complete sequence. On the other hand,
> initializing a bool[UCHAR_MAX+1] according to whether each
> character is in the target set is trivial, and reduces the
> complexity to O(n).

That is a perfect hashing function from char to bool that is called
using [] rather than ().

> > Even without using Boyer-Moore, a simple hash for the
> > delimiter characters decreases the complexity from quadratic
> > to linear.

> Not quite sure how a hash applies here, either.

You just wrote one above and used it below.

> Just for the record, my version of John's code would be
> something like:

[snip]

> The extra classes make it look like a lot of work,

And thus more complex than either of the others?

> but in fact,
> of course, anyone who has to deal with text will already have a
> set of character class which will allow something like:

> std::vector< std::string >
> tokens(
>     std::string const& text,
>     std::string const& delim )
> {
>     std::vector< std::string > result ;
>     Gabi::SetOfCharacter d( delim ) ;
>     std::string::const_iterator b( std::find_if( text.begin(),
>                                                  text.end(),
>                                                  std::not1( d.contains() ) ) ) ;
>     while ( b != text.end() ) {
>         std::string::const_iterator e( std::find_if( b,
>                                                      text.end(),
>                                                      d.contains() ) ) ;
>         result.push_back( std::string( b, e ) ) ;
>         b = std::find_if( e, text.end(), std::not1( d.contains() ) ) ;
>     }
>     return result ;
> }

Which, of course, will never need to be modified by any new maintenance
programmer. My concern was readability, not efficiency.

For my edification, what would you use for wchar_t?

John

philip....@gmail.com

unread,
Feb 4, 2006, 12:47:19 PM2/4/06
to
Roland Pibinger wrote:
> In your case it is probably better to read the data line by line and
> use string.find(',', pos) to identify the possibly empty tokens (you
> have two delimiters in your data: the field delimiter ',' and the
> record delimiter '\n').

FWIW here's the function that I wrote to do this. I'm sure there are
several improvements that you guys could make (go easy on me!) but it
can handle missing data at any point in the input:

void split(const std::string& line, const std::string& delim,
           std::vector<std::string>& tokens)
{
    std::string::size_type curPos(0);
    std::string::size_type endPos;
    do {
        endPos = line.find_first_of(delim, curPos);
        std::string word = line.substr(curPos, endPos - curPos);
        tokens.push_back(word);
        curPos = endPos + 1;
    } while(endPos != std::string::npos);
}
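
For the sample rows above, a quick sanity check (hypothetical driver
code, not part of the function):

    std::vector<std::string> v;
    split(",1,11.6,,", ",", v);
    // v now holds 5 entries: "", "1", "11.6", "", ""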

Thanks, Phil

Andrei Alexandrescu (See Website For Email)

unread,
Feb 4, 2006, 12:45:24 PM2/4/06
to
Roland Pibinger wrote:
> On 3 Feb 2006 07:06:09 -0500, "Andrei Alexandrescu (See Website For
> Email)" <SeeWebsit...@moderncppdesign.com> wrote:
>> That's a nice set of numbers, but what I was saying was that the number
>> of character comparisons will increase fast with the size of the input,
>> to the point where it dominates the total cost.
>
> But where is the break-even point? File-size 10000, 100000 or more?
> Delimiter-size?

Well you were the guy with the numbers :o).

Anyway, I replied to your post because in turn you replied to mine, but
with measurements that (while relevant to the discussion) were not at
all related to my point.

So I'd say, once you have your codebase, it's easy to figure out how the
time spent in the tokenizer grows with the size of the text and that of
the separator.

>> The allocation count is O(text.size() / delim.size()),
>
> ... neglecting vector re-allocations which are responsible for the
> large numbers in 3 of the 4 examples.
>
>> while the complexity of the algo is
>> O(text.size() * delim.size()).
>
> Apart from tokenizing, what I try to say is that std::string is
> underspecified. It has opposite performance characteristics (cheap vs.
> expensive) depending on the implementation for the two most important
> member functions, the copy constructor and the assignment operator.
> The string implementation determines the string programming style.
> Most text-book and newsgroup examples use a programming style that
> implicitly assumes cheap string copying and assignment. For no good
> reason.

I think that's a very valid point.


Andrei

Russ

unread,
Feb 4, 2006, 12:46:35 PM2/4/06
to
philip....@gmail.com wrote:
> I'm impressed with the elegance of the algorithms shown so far but can
> they all produce a fixed number of tokens for input like this:
>
> 10,12.2,a,bbb,15
> 12,,a,bbb,12
> 15,15.1,,bbb,12
> ,1,11.6,,
>
> I'd have thought that in many usage scenarios (certainly any that I've
> ever needed) I'd want to tokenise the data such that empty fields were
> still present in the vector otherwise my calling code wouldn't know
> which fields were missing.

Try:

const char *strs[] = {
    "10,12.2,a,bbb,15",
    "12,,a,bbb,12",
    "15,15.1,,bbb,12",
    ",1,11.6,,"
};

#include <stlsoft/string_tokeniser.hpp>
#include <iostream>
#include <vector>

int main() {
    for(size_t i = 0; i < sizeof(strs) / sizeof(strs[0]); ++i) {
        stlsoft::string_tokeniser<std::string, char,
            stlsoft::skip_blank_tokens<false> > toks(strs[i], ',');

        std::vector<std::string> results(toks.begin(), toks.end());

        std::cout << "tokenizing \"" << strs[i] << '"' << std::endl;

        for(size_t j = 0; j < results.size(); ++j) {
            std::cout << '[' << j << "] " << results[j] << std::endl;
        }
    }
}

HTH

Russ; no fuss

Dave Harris

unread,
Feb 4, 2006, 1:59:26 PM2/4/06
to
ka...@gabi-soft.fr (kanze) wrote (abridged):

> Boyer-Moore won't help here -- we're looking to match one of,
> not to match a complete sequence. On the other hand,
> initializing a bool[UCHAR_MAX+1] according to whether each
> character is in the target set is trivial, and reduces the
> complexity to O(n).

Yes. Although on some platforms a char is 32 bits, and then the array would
take a huge amount of space and probably take longer to initialise than to
scan the text. Even with 16-bit Unicode the array would be excessive. I expect
that's why Andrei suggested using a hash table rather than an array.

-- Dave Harris, Nottingham, UK.

Dave Harris

unread,
Feb 4, 2006, 1:59:51 PM2/4/06
to
SeeWebsit...@moderncppdesign.com (Andrei Alexandrescu (See Website
For Email)) wrote (abridged):

> Others, on the other hand, will find the infinite loop with the
> hidden "break" in the middle disgusting.

It would be less hidden if the body of the "if" statement were indented.

if (tokB == text.end())
    break;

It really does help if the "break" is on a line of its own, rather than
buried in another line.

Also, the code would be clearer generally if it didn't abbreviate "Begin"
to "B". When I first saw it, I spent some time looking for tokA and maybe
a tokC.

Other than that, I agree (which will surprise no-one). This is a classic
loop-and-a-half, with a single exit which comes in the middle.

-- Dave Harris, Nottingham, UK.


Andrei Alexandrescu (See Website For Email)

unread,
Feb 4, 2006, 8:13:46 PM2/4/06
to
Dave Harris wrote:
> SeeWebsit...@moderncppdesign.com (Andrei Alexandrescu (See Website
> For Email)) wrote (abridged):
>> Others, on the other hand, will find the infinite loop with the
>> hidden "break" in the middle disgusting.
>
> It would be less hidden if the body of the "if" statement were indented.
>
> if (tokB == text.end())
>     break;
>
> It really does help if the "break" is on a line of its own, rather than
> buried in another line.
>
> Also, the code would be clearer generally if it didn't abbreviate "Begin"
> to "B". When I first saw it, I spent some time looking for tokA and maybe
> a tokC.

Agreed on both counts.

> Other than that, I agree (which will surprise no-one). This is a classic
> loop-and-a-half, with a single exit which comes in the middle.

Yah, as John Potter revealed, that was indeed a SESE construct. Heh. So
I've been pointing that Gatling gun towards innocent civilians :o).

On the other hand, I do think it's relevant that the "classic
loop-and-a-half" came from a SEME adept and only late in the thread.
(Those innocent civilians... hey, they're pocketing grenades! Put that
down!) The thing is, inhabitants of the SESE world - by the way, did you
know that North Korea only grants "Single-Entry, Single-Exit" visas? :o)
- would do everything to suppress the little
programmer-wanting-to-use-break-continue-and-return struggling to get
out. In contrast, here in the land of freedom, programmers simply focus
on finding the best solution, and that solution will naturally show
itself, be it SESE or SEME, and with or without frowned-upon constructs
such as break, continue, or (early) return.


Andrei

Carlos Moreno

unread,
Feb 4, 2006, 8:35:40 PM2/4/06
to
Andrei Alexandrescu (See Website For Email) wrote:
> kanze wrote:
>
>>Andrei Alexandrescu (See Website For Email) wrote:
>>Boyer-Moore won't help here -- we're looking to match one of,
>>not to match a complete sequence. On the other hand,
>>initializing a bool[UCHAR_MAX+1] according to whether each
>>character is in the target set is trivial, and reduces the
>>complexity to O(n).
>
>
> That's the inspiration I was hinting at...
>
>
>>>Even without using Boyer-Moore, a simple hash for the
>>>delimiter characters decreases the complexity from quadratic
>>>to linear.
>>
>>Not quite sure how a hash applies here, either.
>
>
> Super simple. (1) Create a hash containing the delimiter characters. (2)
> For each character in the text, look it up in the hash. If it's there,
> it's a delimiter. If not, it's part of a token. The algo is O(n)
> assuming a O(1) hash.

I don't mean to be dense... But... How exactly does a hash
apply in here?? Really, if it wasn't because I'm seeing your
name, I'd be tempted to ask if you know what a hash is... So,
that tells me that *I* must be missing something fundamental...

Say, my string is: "First part. Second, third, and fourth."

And the string with the delimiters is: " \n\t.,;!?()"

How does a hash help?

I obviously see how James' suggestion helps -- at the expense
of building an UCHAR_MAX+1-elements array as up-front cost,
you check each character of the string in O(1) instead of O(10)
in this case -- O(size of delimiter string) in the general case.

Thanks,

Carlos
--

Andrei Alexandrescu (See Website For Email)

unread,
Feb 4, 2006, 8:37:12 PM2/4/06
to
John Potter wrote:
> On 2 Feb 2006 15:35:14 -0500, "Andrei Alexandrescu (See Website For
> Email)" <SeeWebsit...@moderncppdesign.com> wrote:
>
>> That if wouldn't be unnecessarily executed. If the loop is properly
>> written, it's the only test inside the loop, and it breaks it halfway.
>> "Think outside the SESE." :o)
>
> Surely you didn't intend to write the SESE:
>
> vector<string> tokens (string const& text, string const& delim) {
>     vector<string> result;
>     string::size_type b, e(0);
>     for (;;) {
>         b = text.find_first_not_of(delim, min(e, text.size()));
>         if (b == string::npos) break;
>         e = text.find_first_of(delim, b);
>         result.push_back(text.substr(b, e - b));
>     }
>     return result;
> }

Oy, I got so used to defending "break" against SESEalots that I forgot
that it could actually be used to write a SESE construct, albeit
unstructured :o).

It's much like the code I wrote in a post that hadn't appeared at the
time you wrote the above, except that mine is untested :o|.

The variable "e" should be defined inside the loop.

> The loop enters at the top and exits in the middle. It is SESE. There
> is even a structured syntax version.
>
> loop <statement> andwhile (<condition>) <statement>

Yah. It's a pity that the construct isn't as used in today's languages.
Perl has it though.

> If you did not intend that, post some code. If you did intend that, how
> do you compare it to this?
>
> vector<string> tokens (string const& text, string const& delim) {
>     vector<string> result;
>     string::size_type b, e(0);
>     while ((b = text.find_first_not_of(delim, min(e, text.size())))
>            != string::npos) {
>         e = text.find_first_of(delim, b);
>         result.push_back(text.substr(b, e - b));
>     }
>     return result;
> }

I'm fine with both. I mildly prefer the former for two reasons:

1. It avoids all uninitialized variables. But that's a limitation of the
language, not a fundamental one. Perl has:

while ((my $e = function()) != something) { ... }

so a variable definition can occur anywhere an expression is needed, and
I find that as cool as it gets.

2. The condition in the while (second implementation) is longer than any
line in the first implementation.


Andrei

Andrei Alexandrescu (See Website For Email)

unread,
Feb 5, 2006, 7:04:03 AM2/5/06
to
Carlos Moreno wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>> kanze wrote:
>>
>>> Andrei Alexandrescu (See Website For Email) wrote:
>>> Boyer-Moore won't help here -- we're looking to match one of,
>>> not to match a complete sequence. On the other hand,
>>> initializing a bool[UCHAR_MAX+1] according to whether each
>>> character is in the target set is trivial, and reduces the
>>> complexity to O(n).
>>
>> That's the inspiration I was hinting at...
>>
>>
>>>> Even without using Boyer-Moore, a simple hash for the
>>>> delimiter characters decreases the complexity from quadratic
>>>> to linear.
>>> Not quite sure how a hash applies here, either.
>>
>> Super simple. (1) Create a hash containing the delimiter characters. (2)
>> For each character in the text, look it up in the hash. If it's there,
>> it's a delimiter. If not, it's part of a token. The algo is O(n)
>> assuming a O(1) hash.
>
> I don't mean to be dense... But... How exactly does a hash
> apply in here?? Really, if it wasn't because I'm seeing your
> name, I'd be tempted to ask if you know what a hash is... So,
> that tells me that *I* must be missing something fundamental...

As an aside, it's nicer to be more conciliatory when the typical Usenet
misunderstanding comes by. It's a misunderstanding; let's figure it out
without giving in to the temptation of putting one another down, whether
or not the other party is perceived (rightly or not) as some authoritative
figure. There are all kinds of temptations in life (free food... mmm...)
but belittling someone else is a temptation IMHO worth resisting. (I
confess I sometimes give in to that, sigh.)

> Say, my string is: "First part. Second, third, and fourth."
>
> And the string with the delimiters is: " \n\t.,;!?()"
>
> How does a hash help?

At the end of the day, for each character in your string, there's a
linear search in the " \n\t.,;!?()" string.

That is, for the input given, the code will linearly look up 'F' in "
\n\t.,;!?()". Conclusion: it's not a separator.

Then the code will linearly look up 'i' in " \n\t.,;!?()". Conclusion,
again: it's not a separator.

Then you linearly look up 'r'... etc.

All of these lookups are hidden inside find_first_of and
find_first_not_of.

So when the dust settles, you compared (on average) each character in
your string with each character in the delimiter string.

The point of the hash is to accelerate that lookup. If you build a hash
table from the characters in " \n\t.,;!?()", then things would go like this:

Build a hash table with ' ', '\n', '\t', etc. as its entries.

Look up 'F' in the hash table. Not found. Conclusion: it's not a separator.

etc.
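
In code, a rough sketch of the idea (std::set stands in for the hash
container here -- or hash_set/tr1::unordered_set where available; any
set-like structure with a fast lookup plays the same role, only the cost
per lookup changes):

    #include <set>
    #include <string>
    #include <vector>

    // Build the separator set once, then classify each character
    // with one lookup instead of a scan of the delimiter string.
    std::vector<std::string> tokens(std::string const& text,
                                    std::string const& delim)
    {
        std::set<char> seps(delim.begin(), delim.end());
        std::vector<std::string> result;
        std::string cur;
        for (std::string::size_type i = 0; i != text.size(); ++i) {
            if (seps.count(text[i])) {   // separator: flush current token
                if (!cur.empty()) result.push_back(cur);
                cur.clear();
            }
            else {
                cur += text[i];          // part of a token
            }
        }
        if (!cur.empty()) result.push_back(cur);
        return result;
    }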


Andrei

Eugene Kalenkovich

unread,
Feb 6, 2006, 5:08:09 AM2/6/06
to
"Carlos Moreno" <moreno_at_mo...@mailinator.com> wrote in message
news:c63Ff.24711$kO4.7...@weber.videotron.net...

> Andrei Alexandrescu (See Website For Email) wrote:
>> kanze wrote:
>>
>> Super simple. (1) Create a hash containing the delimiter characters. (2)
>> For each character in the text, look it up in the hash. If it's there,
>> it's a delimiter. If not, it's part of a token. The algo is O(n)
>> assuming a O(1) hash.
>
> I don't mean to be dense... But... How exactly does a hash
> apply in here?? Really, if it wasn't because I'm seeing your
> name, I'd be tempted to ask if you know what a hash is... So,
> that tells me that *I* must be missing something fundamental...
>
[...]

> I obviously see how James' suggestion helps -- at the expense
> of building an UCHAR_MAX+1-elements array as up-front cost,
> you check each character of the string in O(1) instead of O(10)
> in this case -- O(size of delimiter string) in the general case.
>

I think the basic misunderstanding here is around the term "hash". James'
array _is_ a hash in this case, just extremely memory-inefficient (try to
extend it to wchar_t, as John Potter pointed out).

-- EK

Carlos Moreno

unread,
Feb 6, 2006, 5:06:52 AM2/6/06
to
Andrei Alexandrescu (See Website For Email) wrote:

>>>>>Even without using Boyer-Moore, a simple hash for the
>>>>>delimiter characters decreases the complexity from quadratic
>>>>>to linear.
>>>>
>>>>Not quite sure how a hash applies here, either.
>>>
>>>Super simple. (1) Create a hash containing the delimiter characters. (2)
>>>For each character in the text, look it up in the hash. If it's there,
>>>it's a delimiter. If not, it's part of a token. The algo is O(n)
>>>assuming a O(1) hash.
>>
>>I don't mean to be dense... But... How exactly does a hash
>>apply in here?? Really, if it wasn't because I'm seeing your
>>name, I'd be tempted to ask if you know what a hash is... So,
>>that tellsme that *I* must be missing something fundamental...
>
>
> As an aside, it's nicer to be more conciliatory when the typical Usenet
> misunderstanding comes by. It's a misunderstanding; let's figure it out
> without giving in to the temptation of putting one another down, whether
> or not the other party is perceived (rightly or not) as some
> authoritative figure.

I see that my words may have come across as a bit too strong
and a bit rude (although apparently not too rude, since at least
one moderator thought they were acceptable :-)).

It's really not a matter of "authoritative figure" or not;
it's a matter of "I know you" -- I know you from your posts
here, I know you from your history as a knowledgeable person
in C++, so what sounded to me -- *because of the way I
understood it* (see below) -- like something coming from
someone that did not really understand what they were
talking about, well, it's obvious that there must have
been something wrong in my analysis.

I discovered a long time ago that when I hear something that
sounds like nonsense to me -- even when I know I have enough
knowledge to determine that it *does sound* like nonsense -- it
does not necessarily mean that the other person doesn't
know a thing about the subject. Quite often it means exactly
that; but often enough, it can mean that the other person
knows much, much more than I do. (And again, this is
not an attempt to compensate for my rudeness by sucking up
and saying that you're a genius or super-knowledgeable,
etc. etc. -- just one of those facts of life I feel lucky
to have observed and understood at a relatively young age.)

> kinds of temptations in life (free food... mmm...) but belittling
> someone else is a temptation IMHO worth resisting.

But asking someone if they really know what a hash is is
not really an offense, I think -- lack of knowledge is not
something to feel offended about; we were not born with the
knowledge; at one point in time, you did not know what a
class was, or what a variable was, etc.

That said, I agree that through Usenet, things are a bit
more delicate and volatile, so it is one of those things
that can easily be taken the wrong way and blow out of
proportion.

Ok, all that said:

>>Say, my string is: "First part. Second, third, and fourth."
>>
>>And the string with the delimiters is: " \n\t.,;!?()"
>>
>>How does a hash help?
>
>
> At the end of the day, for each character in your string, there's a
> linear search in the " \n\t.,;!?()" string.

> [...]
>
> All of these lookups are hidden inside find_first_of and
> find_first_not_of.

Ok, I had gotten that part.

> The point of the hash is to accelerate that lookup. If you build a hash
> table from the characters in " \n\t.,;!?()", then things would go like this:

A-haaa!!! This is the key detail -- and it's your fault!!
:-) :-) :-)

Notice that you used the terms "hash" and "hash table"
interchangeably -- I confess that I'm not sure if it is
acceptable that these terms are used like this; but when
you said hash (in your previous post you used the term
"hash" only, and not "hash table"), I thought of a hash
*function* (like MD5, or SHA1, etc.) -- some form of
message digest for the delimiters. *That* didn't make
sense.

BTW, given that the string with delimiters is usually
short, it's not clear to me that the complexity analysis
means much in this case (i.e., O(1) with a possibly
large proportionality constant vs. O(N) with a small N
and a small proportionality constant). I'm also not sure
how efficient a hash function for single characters, as
used by the hash table, could be.

Cheers,

Carlos
--

kanze

unread,
Feb 6, 2006, 5:26:30 PM2/6/06
to
Andrei Alexandrescu (See Website For Email) wrote:
> kanze wrote:
> > Andrei Alexandrescu (See Website For Email) wrote:
> > Boyer-Moore won't help here -- we're looking to match one
> > of, not to match a complete sequence. On the other hand,
> > initializing a bool[UCHAR_MAX+1] according to whether each
> > character is in the target set is trivial, and reduces the
> > complexity to O(n).

> That's the inspiration I was hinting at...

> >> Even without using Boyer-Moore, a simple hash for the
> >> delimiter characters decreases the complexity from
> >> quadratic to linear.

> > Not quite sure how a hash applies here, either.

> Super simple. (1) Create a hash containing the delimiter
> characters. (2) For each character in the text, look it up in
> the hash. If it's there, it's a delimiter. If not, it's part
> of a token. The algo is O(n) assuming a O(1) hash.

You're already thinking Unicode and wchar_t. If you're indexing
with an integral value in the range 0-255 or -128-127, you
really don't need a hash table. Most machines today have enough
memory for you to use a plain ordinary C style table.
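
Concretely, something along these lines (just a sketch; the build loop
is the same one the Setter class wraps):

    #include <climits>
    #include <string>

    bool isDelim[ UCHAR_MAX + 1 ] ;  // zero-initialized at namespace scope

    // One-time setup, O(delim.size()):
    void buildTable( std::string const& delim )
    {
        for ( std::string::size_type i = 0 ; i != delim.size() ; ++ i )
            isDelim[ (unsigned char)delim[ i ] ] = true ;
    }

    // The per-character test is then a single O(1) indexing:
    //     if ( isDelim[ (unsigned char)text[ i ] ] ) ...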

> > Just for the record, my version of John's code would be
> > something like:

> > class Setter
> > {
> > public:
> > Setter( bool* dest ) : tbl( dest ) {}
> > void operator()( char ch ) const
> > {
> > tbl[ (unsigned char)ch ] = true ;
> > }
> > private:
> > bool* tbl ;
> > } ;

> [snip]

> Alas, yet another sacrifice on the altar of the for_each
> god... :o)

Sort of:-).

If I didn't mention it later in my posting, I should have --
this solution is really only applicable if you can leverage some
reuse of the class Setter.

Or, of course, if you're in a job interview where the goal is to
maximize use of the standard library.

> There are things that I guess I'll never understand.

I guess I'm one of them:-).

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

kanze

unread,
Feb 6, 2006, 5:26:08 PM2/6/06
to
John Potter wrote:
> On 3 Feb 2006 09:57:23 -0500, "kanze" <ka...@gabi-soft.fr> wrote:

[...]


> > but in fact, of course, anyone who has to deal with text
> > will already have a set of character class which will allow
> > something like:

> > std::vector< std::string >
> > tokens(
> > std::string const& text,
> > std::string const& delim )
> > {
> > std::vector< std::string > result ;
> > Gabi::SetOfCharacter d( delim ) ;
> > std::string::const_iterator b( std::find_if( text.begin(),
> > text.end(),
> > std::not1( d.contains() ) )
> > ) ;
> > while ( b != text.end() ) {
> > std::string::const_iterator e( std::find_if( b,
> > text.end(),
> > d.contains() ) ) ;
> > result.push_back( std::string( b, e ) ) ;
> > b = std::find_if( e, text.end(), std::not1( d.contains()) )
> > ;
> > }
> > return result ;
> > }

> Which, of course, will never need to be modified by any new
> maintenance programmer.

What will never need to be modified by any new maintenance
programmer: the above function (which is pretty easy to handle),
or SetOfCharacter (which is easy for char:-).)

> My concern was readability, not efficiency.

Quite. In my case, I'd use Gabi::SetOfCharacter because I have
it; it's one of my standard (and most used) classes.

There's also a derived class (which only differs by having
different constructors), Gabi::CharacterClass, which would allow
specifying the separators with things like "[[:blank:],;]".

Part of my point is that programmers who frequently have to deal
with text typically have these sort of things already in their
toolbox.

> For my edification, what would you use for wchar_t?

Gabi::SetOfUnicodeCharacter. Of course, I'd have to write it
first:-). (I've done a little experimentation with text
processing in Unicode, and to date, I've not found an
implementation of SetOfUnicodeCharacter which satisfies me. I
currently use a tri with UTF-8, which is probably the best
solution that I'll find for my applications. But UTF-8 and
find_if aren't going to co-habitate very well -- contains()
requires iterators, and not a character.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

kanze

unread,
Feb 6, 2006, 5:28:33 PM2/6/06
to
Roland Pibinger wrote:
> On 3 Feb 2006 07:06:09 -0500, "Andrei Alexandrescu (See Website For
> Email)" <SeeWebsit...@moderncppdesign.com> wrote:

> Apart from tokenizing, what I try to say is that std::string
> is underspecified. It has opposite performance characteristics
> (cheap vs. expensive) depending on the implementation for the
> two most important member functions, the copy constructor and the
> assignment operator. The string implementation determines the
> string programming style.

> Most text-book and newsgroup examples use a programming style
> that implicitly assumes cheap string copying and assignment.
> For no good reason.

I suspect that most text-book and newsgroup examples use a
programming style that implicitly assumes that string handling
will not be a dominant factor in the application. Which is true
for many applications -- if the only tokenizing you're doing is
parsing a configuration file with, perhaps, maximum 100 lines,
you really don't care.

I've said it before: you should never use a standard class as a
major abstraction in your application. If something is
important in your application, you define it with the interface
which corresponds to what you need to do, and only what you need
to do. If it is text, then you use std::string to implement it.
Until the profiler says otherwise. At which point, since the
use is behind a narrow interface, you have a maximum amount of
room to play around in.
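
A sketch of what that might look like (Symbol is a hypothetical
application type, invented here purely for illustration):

    #include <string>

    // The application-level abstraction exposes only the operations
    // the application needs; std::string is just the implementation.
    class Symbol
    {
    public:
        explicit Symbol( std::string const& s ) : rep( s ) {}
        bool operator<( Symbol const& other ) const
        {
            return rep < other.rep ;
        }
        std::string const& text() const { return rep ; }
        // ... and nothing else until the application needs it
    private:
        std::string rep ;   // free to change if the profiler objects
    } ;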

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Allan W

unread,
Feb 7, 2006, 6:20:15 AM2/7/06
to
John Potter wrote:
> For my edification, what would you use for wchar_t?

How about std::set?

Andrei Alexandrescu (See Website For Email)

unread,
Feb 7, 2006, 6:19:19 AM2/7/06
to
kanze wrote:
>>> Just for the record, my version of John's code would be
>>> something like:
>
>>> class Setter
>>> {
>>> public:
>>> Setter( bool* dest ) : tbl( dest ) {}
>>> void operator()( char ch ) const
>>> {
>>> tbl[ (unsigned char)ch ] = true ;
>>> }
>>> private:
>>> bool* tbl ;
>>> } ;
>
>> [snip]
>
>> Alas, yet another sacrifice on the altar of the for_each
>> god... :o)
>
> Sort of:-).
>
> If I didn't mention it later in my posting, I should have --
> this solution is really only applicable if you can leverage some
> reuse of the class Setter.

I'm distinctly unimpressed by an 11-line class that essentially
"abstracts" one line of uninteresting code, all for the sake of
adaptation to the unsuitable function for_each. Its reuse is more likely
to spawn more awkward code than anything else, so... no, Sir.

Looking at for_each and the useless classlets that it spawns makes me
love Boost::FOREACH. And I'm not kidding - I actually do find that piece
of abstraction eminently usable, although at the end of the day it's a
-- dare I say the "M" word? -- macro.

I'm looking forward to the time when we'll get out of this baroque
stage of writing nonsensical classlets just for feeling good that we're
using for_each.

> Or, of course, if you're in a job interview where the goal is to
> maximize use of the standard library.

Hum. If I were the interviewer and someone wrote on the white board a
class for a one-liner for loop, I'd make them a lavaliere out of the
white board. :o)


Andrei

Andrei Alexandrescu (See Website For Email)

unread,
Feb 8, 2006, 5:00:35 AM2/8/06
to
Allan W wrote:
> John Potter wrote:
>> For my edification, what would you use for wchar_t?
>
> How about std::set?

Yah that's a good idea, uses only std components, scales nicely to O(n
log d), and is not overkill for short delimiters.

Another (better I think) variant is Loki::AssocVector, which shines at
create-once, query-many uses. Also, given that objects in the set are
small, Loki::AssocVector will be even more suitable because it doesn't
use a node-based structure.


Andrei

Carlos Moreno

unread,
Feb 8, 2006, 6:11:37 AM2/8/06
to
Andrei Alexandrescu (See Website For Email) wrote:

> I'm looking forward to the times when we'll get out of this baroque
> stage of writing nonsensical classlets just for feeling good that we're
> using for_each.

Now you confused me... *Where* do I find people that feel good
*because* they're using for_each??

I thought nowadays it was considered good manners to feel
ashamed in those *extremely rare* occasions where we give in
to the temptation to use for_each (I confess to the vicious
crime of having used for_each no less than four times in the
last five years)

I mean, for heaven's sake, TC++PL *nine years ago* already said
quite clearly that one should re-think any use of for_each,
for the reasons that we all know.

So... What am I missing?

Carlos
--

Andrei Alexandrescu (See Website For Email)

unread,
Feb 9, 2006, 5:11:51 AM2/9/06
to
Carlos Moreno wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>
>> I'm looking forward to the times when we'll get out of this baroque
>> stage of writing nonsensical classlets just for feeling good that we're
>> using for_each.
>
> Now you confused me... *Where* do I find people that feel good
> *because* they're using for_each??
>
> I thought nowadays it was considered good manners to feel
> ashamed in those *extremely rare* occasions where we give in
> to the temptation to use for_each (I confess to the vicious
> crime of having used for_each no less than four times in the
> last five years)
>
> I mean, for heavens sake, TC++PL *nine years ago* already said
> quite clearly that one should re-think whatever use of for_each,
> for the reasons that we all know.
>
> So... What am I missing?

The post to which I replied, featuring a burlesque use of for_each
without adding a smiley? :o)

John Potter

unread,
Feb 10, 2006, 5:52:32 AM2/10/06
to
On 8 Feb 2006 05:00:35 -0500, "Andrei Alexandrescu (See Website For
Email)" <SeeWebsit...@moderncppdesign.com> wrote:

> Allan W wrote:

> > John Potter wrote:

> > > For my edification, what would you use for wchar_t?

> > How about std::set?

> Yah that's a good idea, uses only std components, scales nicely to O(n
> log d), and is not overkill for short delimiters.

> Another (better I think) variant is Loki::AssocVector, which shines at
> create-once-query many uses. Also, given that objects in the set are
> small, Loki::AssocVector will be even more suitable because it doesn't
> use a node-based structure.

Just sort the delimiter string and use binary search. No space
overhead, and it will be as fast or faster than either of those.
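
A sketch of the test (only a sketch; the sort happens once, at setup):

    #include <algorithm>
    #include <string>

    // Per-character test: O(log sortedDelim.size()), no storage
    // beyond the sorted copy of the delimiter string itself.
    bool isDelim(std::string const& sortedDelim, char c)
    {
        return std::binary_search(sortedDelim.begin(), sortedDelim.end(), c);
    }

    // setup: std::string d(delim); std::sort(d.begin(), d.end());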

I hate to confuse interesting discussion with data, but here it is
anyway. This is for eleven delimiters and strings of about 80
characters, using James as the reference time.

1.00 J James
1.24 R optimization by Roland
1.12 P string::size_type functions
1.01 I string::const_iterator functions
1.04 B binary search of sorted delimeters
1.28 H ext::hash_set
1.25 S set

To be fair, the two set methods do have a high setup time. We can
create the sets ahead of time, making the functions only slightly less
easy to use. We can also create the array for James.

1.00 J
1.23 R
1.16 P
1.02 I
1.04 B
1.32 H
1.30 S
1.02 HP
1.02 SP
0.98 JP

It is also interesting that all times decreased with the modified test
program. The reference time was 0.82 that of the original. Unrelated
changes to the program can make a large change in performance. The
largest consistent fact is that using iterators (I) and standard
algorithms is faster than using string functions (P) on this
implementation.

Not knowing what applications would be used with wide characters, it may
be that the algorithms would be turned inside out, because almost
everything is a delimiter and the set of useful characters is small. It
may also be the case that the input still contains only a small subset
of the character set which could easily be mapped to the table used by
James.

Any practical thoughts for real applications?

John
