
string tokenizing


David Rubin

Oct 7, 2003, 3:36:13 PM
I looked on google for an answer, but I didn't find anything short of
using boost that sufficiently answers my question: what is a good way
of doing string tokenization? (Note: I cannot use boost.) For example, I
have tried this:

#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
    string delim;
    int c;

    /* fill delim */
    for(c=0; c < CHAR_MAX; c++){ // I tried #include <limits>, but failed...
        if((isspace(c) || ispunct(c)) && !(c == '_' || c == '#'))
            delim += c;
    }

    string buf;
    string::size_type op, np;
    deque<string> tok;

    while(std::getline(cin, buf) && !cin.fail()){
        op = 0;
        while((np=buf.find_first_of(delim, op)) != buf.npos){
            tok.push_back(string(&buf[op], np-op));
            if((op=buf.find_first_not_of(delim, np)) == buf.npos)
                break;
        }
        tok.push_back(string(&buf[op]));

        cout << buf << endl;
        copy(tok.begin(), tok.end(), ostream_iterator<string>(cout, "\n"));
        cout << endl;
        tok.clear();
    }
    return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?

Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
tok.push_back(t);

where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.

/david

--
Andre, a simple peasant, had only one thing on his mind as he crept
along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
-- unknown

Mike Wahler

Oct 7, 2003, 4:27:17 PM
"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F8315AD...@nomail.com...

> I looked on google for an answer, but I didn't find anything short of
> using boost which sufficiently answers my question: what is a good way
> of doing string tokenization (note: I cannot use boost). For example, I
> have tried this:

Remarks below.

>
> #include <algorithm>
> #include <cctype>
> #include <climits>
> #include <deque>
> #include <iostream>
> #include <iterator>
> #include <string>
>
> using namespace std;
>
> int
> main()
> {
> string delim;
> int c;
>
> /* fill delim */
> for(c=0; c < CHAR_MAX; c++)


Is there a particular reason you're excluding the
value 'CHAR_MAX' from the loop? (using < instead of <=)

>{ // I tried #include <limits>, but
> failed...

What happened?

#include <limits>

std::numeric_limits<char>::max();

should work.

More below.

I find your code interesting, so I'll probably play around
with it for a bit, and let you know if I have any ideas.

But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.

template <typename T>
T::size_type tokenize(const std::string& s,
T::iterator beg,
T::iterator end)
{
}

For inserting new elements, you can use an iterator
adapter, e.g. std::insert_iterator. You could even
use an output stream as a 'container' using
ostream_iterator.

>
> Certainly, this is easy to do with a mix of C and C++:
>
> for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
> tok.push_back(t);

This contradicts your parameter type of const reference to string,
since 'strtok()' modifies its argument.

> where buf and delim are essentially char*'s. However, this seems
> unsatisfactory as well.

Yes, 'strtok()' can be problematic, if only for the reason that
it modifies its argument, necessitating creation of a copy if
you want to keep the argument const.

HTH,
-Mike


Mike Wahler

Oct 7, 2003, 4:30:09 PM
"Mike Wahler" <mkwa...@mkwahler.net> wrote in message
news:FkFgb.1741$dn6....@newsread4.news.pas.earthlink.net...

> template <typename T>
> T::size_type tokenize(const std::string& s,
> T::iterator beg,
> T::iterator end)
> {
> }

Oops, I meant to make those iterator parameters const refs
as well

const T::iterator& beg, const T::iterator& end

-Mike


David Rubin

Oct 7, 2003, 6:53:14 PM
Mike Wahler wrote:

[snip]


> >{ // I tried #include <limits>, but
> > failed...
>
> What happened?

foo.cc:9: limits: No such file or directory

For some reason, my compiler can't find the file. Otherwise, I agree
with you...

> #include <limits>
>
> std::numeric_limits<char>::max();
>
> should work.

[snip]


> But here's some food for thought: one way to 'generalize'
> container access is with iterators, as do the functions in
> <algorithm>.

> template <typename T>
> T::size_type tokenize(const std::string& s,
> T::iterator beg,
> T::iterator end)
> {
> }

This is a nice idea! I knew you should be able to do this, but I
couldn't see how. Here is the refactored code:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
    string word;
    string::size_type sp, ep; // start/end position

    sp = 0;
    do{
        sp = buf.find_first_not_of(delim, sp);
        ep = buf.find_first_of(delim, sp);
        if(sp != ep){
            if(ep == buf.npos)
                ep = buf.length();
            word = buf.substr(sp, ep-sp);
            *ii++ = lc(word);
            sp = buf.find_first_not_of(delim, ep+1);
        }
    }while(sp != buf.npos);

    if(sp != buf.npos){
        word = buf.substr(sp, buf.length()-sp);
        *ii++ = lc(word);
    }
}

called as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

The original spec returned the number of tokens parsed. Now I have to
settle for checking

if(tokens.size() > 0){ ... }

David Rubin

Oct 7, 2003, 7:13:01 PM
David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
    string::size_type sp, ep; // start/end position

    ep = -1;
    do{
        sp = buf.find_first_not_of(delim, ep+1);
        ep = buf.find_first_of(delim, sp);
        if(sp != ep){
            if(ep == buf.npos)
                ep = buf.length();
            *ii++ = buf.substr(sp, ep-sp);
        }
    }while(sp != buf.npos);
}

That's better. The 'ep+1' is a small optimization. I'm not sure it makes
any difference, and really, starting with ep=0 makes the code a bit
clearer.

Mike Wahler

Oct 7, 2003, 7:50:20 PM

"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F8343DA...@nomail.com...

> Mike Wahler wrote:
>
> [snip]
> > >{ // I tried #include <limits>, but
> > > failed...
> >
> > What happened?
>
> foo.cc:9: limits: No such file or directory
>
> For some reason, my compiler can't find the file.

Configuration problem, installation problem, or perhaps
simply not provided by your implementation. Which one
are you using? More below.

> Otherwise, I agree
> with you...
>
> > #include <limits>
> >
> > std::numeric_limits<char>::max();
> >
> > should work.

> [snip]

> > But here's some food for thought: one way to 'generalize'
> > container access is with iterators, as do the functions in
> > <algorithm>.
>
> > template <typename T>
> > T::size_type tokenize(const std::string& s,
> > T::iterator beg,
> > T::iterator end)
> > {
> > }
>
> This is a nice idea!


Yes it is. But not my idea. I "stole" it from
the standard library, the design of which is imo
rich with Good Ideas.

If you don't have the Josuttis book, get it.
www.josuttis.com/libbook


>I knew you should be able to do this, but I
> couldn't see how. Here is the refactored code:

[snip code] (I didn't look at it very closely, so if there are
obvious errors, I didn't see them)

> called as
>
> tokenize(buf, delim, insert_iter<deque<string> >(tokens,
> tokens.begin()));
>
> The orignal spec returned the number of tokens parsed. Now I have to
> settle for checking
>
> if(tokens.size() > 0){ ... }

or

if(!tokens.empty())

which *might* improve performance, and imo is
more expressive.

-Mike


Mike Wahler

Oct 7, 2003, 7:58:38 PM
"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F83487D...@nomail.com...

> David Rubin wrote:
>
> template <typename InsertIter>
> void
> tokenize(const string& buf, const string& delim, InsertIter ii)

You might squeeze out some more performance by making
that last parameter a const reference.


> {
> string::size_type sp, ep; // start/end position
>
> ep = -1;
>
> do{
> sp = buf.find_first_not_of(delim, ep+1);
> ep = buf.find_first_of(delim, sp);
> if(sp != ep){
> if(ep == buf.npos)
> ep = buf.length();
> *ii++ = buf.substr(sp, ep-sp);
> }
> }while(sp != buf.npos);
> }
>
> That's better. The 'ep+1' is a small optimization. I'm not sure it makes
> any difference, and really, starting with ep=0 makes the code a bit
> clearer.

"I love it when a plan comes together."
-George Peppard, as "Hannibal" in "The A Team"

:-)

-Mike


SomeDumbGuy

Oct 7, 2003, 9:58:31 PM
Mike Wahler wrote:


>>tokenize(const string& buf, const string& delim, InsertIter ii)
>
>
> You might squeeze out some more performance by making
> that last parameter a const reference.

Sorry to bother you. I was wondering how making it a const reference
would help performance?


Mike Wahler

Oct 7, 2003, 11:20:34 PM
"SomeDumbGuy" <ab...@127.0.0.1> wrote in message
news:bbKgb.29244$3b7....@nwrddc02.gnilink.net...

Note the word "might". Depending upon the implementation,
an iterator might be a simple pointer, but it also could
be an elaborate large structure, in which case passing
by reference could be faster than passing a copy by
value, and would still probably be just as fast as
pass by value if the iterator is represented with a
pointer type.

Since OP's quest was to 'generalize' for any container
type (via a template), we cannot know what the actual
representation of the iterator will be.

Also the same iterator type might be implemented in different
ways among library implementations, some large and complex,
some not. A reference can prevent possible performance
degradation, without having to be concerned whether it's
actually an issue.

The 'const' part of my suggestion doesn't really have anything
to do with performance. I say 'const' reference, since the
function does not need to modify the iterator's value. This
is the recommendation for any parameter of non-built-in type
which a function does not modify, especially when the parameter
type is a template argument, since you can't know how large and
complex it might be.

The "traditional wisdom" concerning parameter types is
essentially:

For non-built-in types or "unknown" types (specified by
a template argument), pass by const reference by default.

If the function needs to modify the caller's argument,
regardless of type, pass by nonconst reference.

If the parameter is always a built-in type and the function
need not modify the caller's argument, pass by value or
const reference.

Or something like that. :-)

Does that help?

-Mike


Jerry Coffin

Oct 7, 2003, 11:41:49 PM
In article <3F8315AD...@nomail.com>, bogus_...@nomail.com
says...

> I looked on google for an answer, but I didn't find anything short of
> using boost which sufficiently answers my question: what is a good way
> of doing string tokenization (note: I cannot use boost). For example, I
> have tried this:

This may look a bit odd at first, but it works quite nicely. Its basic
idea is to hijack the tokenizer built into the standard iostream
classes, and put it to our purposes. The main thing necessary to do
that is to create a facet that classifies our delimiters as "space" and
everything else as not-"space". Once we have a stream using our facet,
we can simply read tokens from the stream and use them as we please.

Here's the code:

#include <iostream>
#include <deque>
#include <sstream>
#include <iterator>

#include "facet"

template <class T>
class delims : public ctype_table<T>
{
public:
    delims(size_t refs = 0)
        : ctype_table<T>(ctype_table<T>::empty)
    {
        for (int i=0; i<table_size; i++)
            if((isspace(i) || ispunct(i)) && !(i == '_' || i == '#'))
                table()[widen(i)] = mask(space);
    }
};

int main() {
    std::string buf;
    std::locale d(std::locale::classic(), new delims<char>);

    while ( std::getline(std::cin, buf)) {
        // deque to hold tokens.
        std::deque<std::string> tok;

        // create istringstream from the input string and have it use our facet.
        std::istringstream is(buf);
        is.imbue(d);

        std::istream_iterator<std::string> in(is), end;

        // copy tokens from our stream into a deque
        std::copy(in, end,
                  std::back_inserter<std::deque<std::string> >(tok));

        // show tokens, one per line.
        std::copy(tok.begin(), tok.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
    }

    return 0;
}

Note that if we knew we were going to use std::cin for this, we could
just imbue std::cin with the locale we created, and read directly from
it into the deque:

// same facet as above.

int main() {
    std::locale d(std::locale::classic(), new delims<char>);
    std::cin.imbue(d);

    std::istream_iterator<std::string> in(std::cin), end;
    std::ostream_iterator<std::string> out(std::cout, "\n");

    std::copy(in, end, out);
    return 0;
}

Of course, even if you're planning to put the tokens into some
container, you can still imbue the input stream with the locale instead
of reading from stream to string, then imbuing a stringstream with the
locale.

The facet header I'm using contains some code I posted a while back --
it looks like this:

#include <locale>
#include <algorithm>

template<class T>
class table {
    typedef typename std::ctype<T>::mask tmask;

    tmask *t;
public:
    table() : t(new std::ctype<T>::mask[std::ctype<T>::table_size]) {}
    ~table() { delete [] t; }
    tmask *the_table() { return t; }
};

template<class T>
class ctype_table : table<T>, public std::ctype<T> {
protected:
    typedef typename std::ctype<T>::mask tmask;

    enum inits { empty, classic };

    ctype_table(size_t refs = 0, inits init=classic)
        : std::ctype<T>(the_table(), false, refs)
    {
        if (classic == init)
            std::copy(classic_table(),
                      classic_table()+table_size,
                      the_table());
        else
            std::fill_n(the_table(), table_size, mask());
    }
public:
    tmask *table() {
        return the_table();
    }
};

This handles most of the dirty work of creating a facet, so about all
you're left with is specifying what characters you want treated as
delimiters, and then telling the stream to use a locale that includes
the facet.

--
Later,
Jerry.

The universe is a figment of its own imagination.

SomeDumbGuy

Oct 8, 2003, 3:29:11 AM
Mike Wahler wrote:

>>>You might squeeze out some more performance by making
>>>that last parameter a const reference.
>>
>>Sorry to bother you. I was wondering how making it a const reference
>>would help performance?
>
>
> Note the word "might". Depending upon the implementation,
> an iterator might be a simple pointer, but it also could
> be an elaborate large structure, in which case passing
> by reference could be faster than passing a copy by
> value, and would still probably be just as fast as
> pass by value if the iterator is represented with a
> pointer type.

> The "traditional wisdom" concerning parameter types is
> essentially:
>
> For non-built-in types or "unknown" types (specified by
> a template argument), pass by const reference by default.
>
> If the function needs to modify the caller's argument,
> regardless of type, pass by nonconst reference.
>
> If the parameter is always a built-in type and the function
> need not modify the caller's argument, pass by value or
> const reference.
>
> Or something like that. :-)
>
> Does that help?

Yes. :)

David Rubin

Oct 8, 2003, 11:41:45 AM
Jerry Coffin wrote:

> This may look a bit odd at first, but it works quite nicely. Its basic
> idea is to hijack the tokenizer built into the standard iostream
> classes, and put it to our purposes. The main thing necessary to do
> that is to create a facet that classifies our delimiters as "space" and
> everything else as not-"space". Once we have a stream using our facet,
> we can simply read tokens from the stream and use them as we please.

Interesting idea. It seems rather complex compared to the code I posted
though. One nice thing about your solution is that you encapsulate
delims in a class. I suppose you could extend the class a bit with
various methods to extend the delimiters (equivalent to isspace,
ispunct, etc), although this is reasonably accomplished via subclassing.
What are the other benefits of this approach compared to mine?

Also, I have a few questions...

[snip]


> int main() {
> std::locale d(std::locale::classic(), new delims<char>);

Can you explain the role of locales in a little more detail? Can't you
just skip this part in most cases?

> std::cin.imbue(d);

> std::istream_iterator<std::string> in(std::cin), end;
> std::ostream_iterator<std::string> out(std::cout, "\n");

> std::copy(in, end, out);

How does end function here? I've only seen copy specified with iterators
associated with a container. In this case, in and end seem to have no
association with each other.

> return 0;

David Rubin

Oct 8, 2003, 11:49:21 AM
Mike Wahler wrote:
>
> "David Rubin" <bogus_...@nomail.com> wrote in message
> news:3F83487D...@nomail.com...
> > David Rubin wrote:
> >
> > template <typename InsertIter>
> > void
> > tokenize(const string& buf, const string& delim, InsertIter ii)
>
> You might squeeze out some more performance by making
> that last parameter a const reference.

Actually, I tried defining this function as

template <typename InsertIter>
void tokenize(const string& buf, const string& delim, InsertIter& ii);

(and const InsertIter&) with little success. I got compiler errors both
times. For example, with the non-const reference, I got:

; g++ tokenize.cc
tokenize.cc: In function `int main()':
tokenize.cc:31: error: could not convert `
insert_iterator<std::deque<std::basic_string<char,
std::char_traits<char>,
std::allocator<char> >, std::allocator<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > >((&tokens),
std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
std::allocator<std::string>]())' to `
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> >
>&'
tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
std::string&, const std::string&, InsertIter&) [with InsertIter =
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> >
>]'

when invoked (12) as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

(g++ 3.3.1, Solaris 2.6).

David B. Held

Oct 8, 2003, 2:06:30 PM
"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F8315AD...@nomail.com...
> [...]

> (note: I cannot use boost).
> [...]

Why can't you use Boost? Legal department?

Dave



Mike Wahler

Oct 8, 2003, 3:38:30 PM

"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F843201...@nomail.com...

If it's really an issue for you, I'll take a look and
see if I can locate the problem.

-Mike


David Rubin

Oct 8, 2003, 3:42:05 PM
Mike Wahler wrote:

> > [...] I got compiler errors both
> > times. For example, with the non-const reference, I got:

[snip]


> If it's really an issue for you, I'll take a look and
> see if I can locate the problem.

I'd appreciate it. I just don't understand what the compiler is doing.
Thanks.

/david

Mike Wahler

Oct 8, 2003, 8:09:06 PM

"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F84688D...@nomail.com...

> Mike Wahler wrote:
>
> > > [...] I got compiler errors both
> > > times. For example, with the non-const reference, I got:
> [snip]
> > If it's really an issue for you, I'll take a look and
> > see if I can locate the problem.
>
> I'd appreciate it. I just don't understand what the compiler is doing.

I didn't get exactly the same diagnostics, but I did get
complaints about 'insert_iterator::operator++()', because
it modifies the iterator, so we cannot make it a const
reference, but it works for me as a nonconst reference:

#include <algorithm>
#include <deque>
#include <iterator>
#include <string>


template <typename InsertIter>
void tokenize(const std::string& buf,
              const std::string& delim,
              InsertIter& ii)
{
    std::string::size_type sp(0); /* start position */
    std::string::size_type ep(-1); /* end position */

    do
    {
        sp = buf.find_first_not_of(delim, ep + 1);
        ep = buf.find_first_of(delim, sp);

        if(sp != ep)
        {
            if(ep == buf.npos)
                ep = buf.length();

            *ii++ = buf.substr(sp, ep-sp);
        }

    } while(sp != buf.npos);
}

int main()
{
    std::string buf("We* are/parsing [a---string");
    std::string delim(" */[-");
    std::deque<std::string> tokens;

    tokenize(buf, delim, std::inserter(tokens, tokens.begin()));

    std::copy(tokens.begin(), tokens.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;
}


Output:

We
are
parsing
a
string


HTH,
-Mike

Mike Wahler

Oct 8, 2003, 8:25:18 PM
"Mike Wahler" <mkwa...@mkwahler.net> wrote in message
news:CG1hb.3532$dn6....@newsread4.news.pas.earthlink.net...

>
> std::copy(tokens.begin(), tokens.end(),
> std::ostream_iterator<std::string>(std::cout, "\n"));

I forgot the obvious #include <iostream> :-)

(I suppose my implementation brought the decls in via
#include <iterator>, so didn't complain about no 'cout')

-Mike


Jerry Coffin

Oct 8, 2003, 11:26:14 PM
In article <3F843039...@nomail.com>, bogus_...@nomail.com
says...

[ ... ]

> Interesting idea. It seems rather complex compared to the code I posted
> though.

Yes and no -- the framework for creating a facet is a bit more complex
than I'd like, but creating a facet is fairly easy and using it is
pretty nearly dead simple.

> One nice thing about your solution is that you encapsulate
> delims in a class. I suppose you could extend the class a bit with
> various methods to extend the delimiters (equivalent to isspace,
> ispunct, etc), although this is reasonably accomplished via subclassing.
> What are the other benefits of this approach compared to mine?

Ease of use -- it supports normal stream extractors. Since the example
at hand was taking the tokens from a stream and putting them into a
deque, I used iterators and std::copy to copy from one to the other. It
could have been done with normal stream extraction though:

std::string token;

while (stream >> token)
    std::cout << token << std::endl;

would have worked just fine for the job at hand as well. For that
matter, to read them and put them into the deque, we could have done
something like:

std::string token;
std::deque<std::string> tok;

while ( stream >> token)
    tok.push_back(token);


To make a long story short, once you've imbued a stream with the facet
that defines the delimiters between tokens, you don't have to learn
anything new at all -- you're just extracting strings from a stream.



> Also, I have a few questions...
>
> [snip]
> > int main() {
> > std::locale d(std::locale::classic(), new delims<char>);
>
> Can you explain the role of locales in a little more detail? Can't you
> just skip this part in most cases?

A locale is basically a collection of facets (one facet each of a number
of different types). A stream, however, only knows about a complete
locale, not individual facets. Therefore, we create an otherwise
normal locale, but have it use our ctype facet. The rest of it probably
doesn't matter for the job at hand, but (at least TTBOMK) there's no
function that tells a stream to use one facet at a time.

> > std::cin.imbue(d);
>
> > std::istream_iterator<std::string> in(std::cin), end;
> > std::ostream_iterator<std::string> out(std::cout, "\n");
>
> > std::copy(in, end, out);
>
> How does end function here? I've only seen copy specified with iterators
> associated with a container. In this case, in and end seem to have no
> association with each other.

An istream iterator that's not associated with a stream is basically the
equivalent of EOF -- another istream_iterator will become equal to it
when the end of the stream is encountered.

That's strictly related to using istream_iterator's though -- it's
completely independent of changing the facet to get the parsing we want.
Just for example, if we had a file of numbers (integers for the moment)
and we wanted to read those numbers into a vector, we could do it
similarly:

std::vector<int> v;

std::istream_iterator<int> in(std::cin), end;

std::copy(in, end, std::back_inserter<std::vector<int> >(v));

Similarly, if we have a file of floating point values that we wanted to
read into a list, we could do:

std::list<double> ld;

std::istream_iterator<double> in(std::cin), end;

std::copy(in, end, std::back_inserter<std::list<double> >(ld));

and so on.

To summarize: an istream_iterator allows us to treat the contents of a
stream like a collection, so we can apply a standard algorithm to its
contents (it's basically an input_iterator, so we can't, for example,
apply an algorithm that requires a random access iterator though).

David Rubin

Oct 9, 2003, 10:51:44 AM
Mike Wahler wrote:

[snip]


> template <typename InsertIter>
> void tokenize(const std::string& buf,
> const std::string& delim,
> InsertIter& ii)

[snip]


> int main()
> {
> std::string buf("We* are/parsing [a---string");
> std::string delim(" */[-");
> std::deque<std::string> tokens;
>
> tokenize(buf, delim, std::inserter(tokens, tokens.begin()));

This only works with my compiler if I do

std::insert_iterator<std::deque<std::string> > ii(tokens, tokens.begin());
tokenize(buf, delim, ii);

Otherwise, I get the same errors as before. I guess this means my
compiler is broken?

; g++ --version
g++ (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

Mike Wahler

Oct 9, 2003, 1:21:45 PM

"David Rubin" <bogus_...@nomail.com> wrote in message
news:3F857600...@nomail.com...

> Mike Wahler wrote:
>
> [snip]
> > template <typename InsertIter>
> > void tokenize(const std::string& buf,
> > const std::string& delim,
> > InsertIter& ii)
>
> [snip]
> > int main()
> > {
> > std::string buf("We* are/parsing [a---string");
> > std::string delim(" */[-");
> > std::deque<std::string> tokens;
> >
> > tokenize(buf, delim, std::inserter(tokens, tokens.begin()));
>
> This only works with my compiler if I do
>
> std::insert_iterator<std::deque<std::string> > ii(tokens,
> tokens.begin());

But you *did* get it to work with the reference parameter
for the iterator, right?

Interesting about having to 'spell it out' like that.
Both ways worked for me, (I also tried 'std::back_inserter' and
std::back_insert_iterator, both of which also worked for me as well).


> tokenize(buf, delim, ii);
>
> Otherwise, I get the same errors as before. I guess this means my
> compiler is broken?

Seems so to me. Have you checked for a newer version
of g++?

-Mike


Frank Schmitt

Oct 10, 2003, 7:21:41 AM
"Mike Wahler" <mkwa...@mkwahler.net> writes:

FYI: both g++ 3.3.1 and the Intel compiler V7.0 reject the original version.
I guess the problem is that you are forming a non-const reference to a
temporary, which isn't allowed by the standard.
Changing the signature of tokenize to

void tokenize(const std::string& buf,
              const std::string& delim,
              const InsertIter& ii)

and copying ii inside tokenize to a local helper variable for the loop works.

HTH & kind regards
frank

--
Frank Schmitt
4SC AG phone: +49 89 700763-0
e-mail: frankNO DOT SPAMschmitt AT 4sc DOT com

Mike Wahler

Oct 10, 2003, 12:56:55 PM
"Frank Schmitt" <inv...@seesignature.info> wrote in message
news:4c1xtlx...@scxw21.4sc...

>
> FYI: both g++ 3.3.1 and the Intel compiler V7.0 reject the original version.
> I guess the problem is that you are forming a non-const reference to a
> temporary, which isn't allowed by the standard.

Thanks. For some reason, that issue always seems to trip
me up. Perhaps I need to write this down a hundred times
on the chalkboard. :-)

> Changing the signature of tokenize to
>
> void tokenize(const std::string& buf,
> const std::string& delim,
> const InsertIter& ii)
>
> and copying ii inside tokenize to a local helper variable for the loop works.

Thanks,
-Mike


David Rubin

Oct 10, 2003, 5:19:11 PM
Frank Schmitt wrote:

[snip]


> FYI: both g++ 3.3.1 and the Intel compiler V7.0 reject the original version.
> I guess the problem is that you are forming a non-const reference to a
> temporary, which isn't allowed by the standard.

This is what I suspected, but I was thrown off because the diagnostic
was so abstruse. I was expecting something more along the lines of
"cannot for a non-const reference to a temporary." Anyway, it's
interesting that MSVC++6.0 compiles the code without warning. Using MS
as a point of reference always begs the question of which is correct :-)

> Changing the signature of tokenize to
>
> void tokenize(const std::string& buf,
> const std::string& delim,
> const InsertIter& ii)
>
> and copying ii inside tokenize to a local helper variable for the loop works.

What is the point of copying ii inside tokenize if you can just remove
the reference argument altogether and use pass-by-value? Isn't that the
same?

Much thanks,

Frank Schmitt

Oct 13, 2003, 5:22:36 AM
David Rubin <bogus_...@nomail.com> writes:

> Frank Schmitt wrote:
>
> > Changing the signature of tokenize to
> >
> > void tokenize(const std::string& buf,
> > const std::string& delim,
> > const InsertIter& ii)
> >
> > and copying ii inside tokenize to a local helper variable
> > for the loop works.
>
> What is the point of copying ii inside tokenize if you can just remove
> the reference argument altogether and use pass-by-value? Isn't that the
> same?

Yes, of course - I guess I've just gotten so used to const references instead
of pass-by-value that it has become a reflex for me not to consider
pass-by-value :-)

David Rubin

Oct 15, 2003, 3:59:28 PM
Jerry Coffin wrote:

> #include <locale>
> #include <algorithm>

> enum inits { empty, classic };

I decided after a while that I liked your approach (using istringstream and
locale) better than my tokenize function. I was able to replace the
above code with

#include <algorithm>
#include <locale>

class ctype_table : public std::ctype<char> {
private:
    mask tab[table_size];

protected:
    enum Init {empty, classic};

    ctype_table(Init type=classic) : std::ctype<char>(tab)
    {
        if (type == classic)
            std::copy(classic_table(), classic_table()+table_size, tab);
        else
            std::fill_n(tab, table_size, space);
    }

public:
    mask *table() { return tab; }
};

You want to derive from std::ctype<char> rather than std::ctype<T> since
only the char specialization contains the functions and constants you
are using. Also, by deriving from std::ctype<T> [T=char], you can use
type mask and constant space freely (I think 'mask()' is a typo in your
code). Additionally, you don't really need the refs argument (at least
for my application). Lastly, I found on my platform that creating a
static table (tab) results in a smaller executable than allocating tab
off the heap (I assume this was the motivation for privately inheriting
from table).

Jerry Coffin

Oct 16, 2003, 8:42:39 PM
David Rubin <bogus_...@nomail.com> wrote in message news:<3F8DA720...@nomail.com>...

[ ... ]

> I decided after a while that I liked your approach (using istringstr and
> locale) better than my tokenize function. I was able to replace the
> above code with

[ code elided ]

> You want to derive from std::ctype<char> rather than std::ctype<T> since
> only the char specialization contains the functions and constants you
> are using.

I'm on my laptop right now, so I don't have the standard handy to
check with, but I don't remember using anything that shouldn't work
with wchar_t, etc., as well.

> Also, by deriving from std::ctype<T> [T=char], you can use
> type mask and constant space freely (I think 'mask()' is a typo in your
> code).

The use of mask() was intentional, and you'll almost certainly get all
sorts of strange errors if you try to substitute just "mask" where I
used mask(). Where I used mask(), it was to create a
default-initialized mask object with which to initialize the objects
in the array. Using mask instead, would result only in compiler
errors because you're specifying a type where it wants an object.

> Additionally, you don't really need the refs argument (at least
> for my application).

For this application, that's probably right. That part of the code
was written with an eye to generality, not specifically for this
application.

> Lastly, I found on my platform that creating a
> static table (tab) results in a smaller executable than allocating tab
> off the heap (I assume this was the motivation for privately inheriting
> from table).

The result can be smaller code, or a _lot_ smaller code -- like none
at all. The header is not required to initialize table_size, and with
an implementation that doesn't initialize it _in the header_, your
code won't compile.

The private inheritance was because table only exists to ensure that
the initialization gets done in the right order. There's no reason to
support casting back to table or anything like that.

David Rubin

Oct 17, 2003, 1:45:22 PM
Jerry Coffin wrote:
>
> David Rubin <bogus_...@nomail.com> wrote in message news:<3F8DA720...@nomail.com>...
>
> [ ... ]
>
> > I decided after a while that I liked your approach (using istringstream and
> > locale) better than my tokenize function. I was able to replace the
> > above code with
>
> [ code elided ]
>
> > You want to derive from std::ctype<char> rather than std::ctype<T> since
> > only the char specialization contains the functions and constants you
> > are using.
>
> I'm on my laptop right now, so I don't have the standard handy to
> check with, but I don't remember using anything that shouldn't work
> with wchar_t, etc., as well.

For example, table_size, classic_table(), and the constructor taking a
const mask* argument are only defined in std::ctype<char> AFAIK.

> > Also, by deriving from std::ctype<T> [T=char], you can use
> > type mask and constant space freely (I think 'mask()' is a typo in your
> > code).
>
> The use of mask() was intentional, and you'll almost certainly get all
> sorts of strange errors if you try to substitute just "mask" where I
> used mask(). Where I used mask(), it was to create a
> default-initialized mask object with which to initialize the objects
> in the array. Using mask instead, would result only in compiler
> errors because you're specifying a type where it wants an object.

I was suggesting that you use 'space' rather than 'mask', which, of
course, will give you a compile error. My understanding is that mask()
will create a mask temporary initialized to zero (since it's an integer
type). My *guess* (although there is no guarantee) is that mask() is
equivalent to space in most implementations. Even if it's not, an
'empty' table would then be full of spaces rather than some
implementation-defined value.

> > Additionally, you don't really need the refs argument (at least
> > for my application).
>
> For this application, that's probably right. That part of the code
> was written with an eye to generality, not specifically for this
> application.

Agreed, but then you don't include a 'delete-when-done' argument, and
you reversed refs and init when you call the std::ctype<T> constructor
in your implementation of delim.

> > Lastly, I found on my platform that creating a
> > static table (tab) results in a smaller executable than allocating tab
> > off the heap (I assume this was the motivation for privately inheriting
> > from table).
>
> The result can be smaller code, or a _lot_ smaller code -- like none
> at all. The header is not required to initialize table_size, and with
> an implementation that doesn't initialize it _in the header_, your
> code won't compile.

This is a subtle point. I don't have the standard in front of me, but
isn't this covered by C++PL3ed, 12.2.2:

Class objects are constructed from the bottom up: first the base,
then the members, and then the derived class itself.

This suggests to me that table_size is initialized (at least) by the
base class constructor, and is therefore available when the tab member
is "constructed"...

> The private inheritance was because table only exists to ensure that
> the initialization gets done in the right order. There's no reason to
> support casting back to table or anything like that.

...Otherwise, I agree.

Jerry Coffin

Oct 22, 2003, 12:24:11 PM
In article <3F902AB2...@nomail.com>, bogus_...@nomail.com
says...

[ ... ]

> > I'm on my laptop right now, so I don't have the standard handy to
> > check with, but I don't remember using anything that shouldn't work
> > with wchar_t, etc., as well.
>
> For example, table_size, classic_table(), and the constructor taking a
> const mask* argument are only defined in std::ctype<char> AFAIK.

Doing some looking, you're right. I may need to re-think the code a
bit.

[ ... ]



> I was suggesting that you use 'space' rather than 'mask', which, of
> course, will give you a compile error. My understanding is that mask()
> will create a mask temporary initialized to zero (since it's an integer
> type). My *guess* (although there is no guarantee) is that mask() is
> equivalent to space in most implementations.

Your guess is wrong, AFAIK. mask() creates a value that basically says
the character doesn't fit _any_ classification. I.e. it's not a space
or a digit or alphabetic, or control, or anything else. mask is
required to be a bitmask type, and if no bits are set, it doesn't
classify the character as anything at all.

> Even if it's not, an
> 'empty' table would then be full of spaces rather than some
> implementation-defined value.

...which would utterly _ruin_ its usefulness. The whole idea is to
produce a table that ONLY classifies a character as a space (for
example) if you say it should be a space. Setting it to fill the table
with a value that said everything was a space would produce utterly
useless results -- when you extract from an istream, it will skip across
anything its locale says is a space character, so doing this would
produce a ctype that always skipped across all input.

[ ... ]



> This is a subtle point. I don't have the standard in front of me, but
> isn't this covered by C++PL3ed, 12.2.2:
>
> Class objects are constructed from the bottom up: first the base,
> then the members, and then the derived class itself.
>
> This suggests to me that table_size is initialized (at least) by the
> base class constructor, and is therefore available when the tab member
> is "constructed"...

Theoretically that might cover it. Practically speaking, a number of
compilers fail when/if you try to use table_size as the size of an
array. Since I don't care to ignore those compilers, my alternative is
to write code that works with them.
