StringTokenizer is giving me a headache

MyndPhlyp

Feb 25, 2002, 3:18:36 PM
I've been trying to find a way around this without having to kluge the data.

Given the following:

String[] myField = new String[6];
String tempString = "A|B|C|D|E|F";
StringTokenizer st = new StringTokenizer(tempString, "|");
int i = 0;
while (st.hasMoreTokens())
{
    myField[i] = st.nextToken();
    i++;
}

Life is wonderful and I get what I expected:

myField[0] = "A"
myField[1] = "B"
myField[2] = "C"
myField[3] = "D"
myField[4] = "E"
myField[5] = "F"


But if one of those "fields" is empty, StringTokenizer skips over it:

String[] myField = new String[6];
String tempString = "A|B||D|E|F";
StringTokenizer st = new StringTokenizer(tempString, "|");
int i = 0;
while (st.hasMoreTokens())
{
    myField[i] = st.nextToken();
    i++;
}

This results in:

myField[0] = "A"
myField[1] = "B"
myField[2] = "D"
myField[3] = "E"
myField[4] = "F"
myField[5] = <uninitialized>

I have found that if I kluge the data inserting a space in the empty slot,
StringTokenizer works just fine, but I don't want to kluge the data.

Is there a way to get StringTokenizer to recognize the non-existence of data
between the two delimiters and perhaps even return a null value or
something?


Matt Schalit

Feb 25, 2002, 3:32:53 PM
On Mon, 25 Feb 2002 15:18:36 -0500, "MyndPhlyp" <nob...@home.com> wrote:


>Is there a way to get StringTokenizer to recognize the non-existence of data
>between the two delimiters and perhaps even return a null value or
>something?


I noticed that yesterday and was wondering the same thing.
I thought it would return a null. Anyone?
Matt

MyndPhlyp

Feb 25, 2002, 3:48:24 PM
Matt:

I think I may have uncovered the reason "why". I dug a little deeper in
JBuilder6's online help and came across this item under JCStringTokenizer.
In short, the StringTokenizer "symptom" is a known bug.

Trying to maintain backward compatibility rules out using JCStringTokenizer
in my case. I hope somebody here has a nice workaround to the problem.

=====
JCStringTokenizer controls simple linear tokenization of a String. The set
of delimiters, which defaults to common whitespace characters, can be
specified either during creation or on a per-token basis.

It is similar to java.util.StringTokenizer, but delimiters can be included
as literals by preceding them with a backslash character (the default). IT
ALSO FIXES A KNOWN PROBLEM: IF ONE DELIMITER IMMEDIATELY FOLLOWS ANOTHER, A
NULL STRING IS RETURNED AS THE TOKEN INSTEAD OF BEING SKIPPED OVER.
=====


"Matt Schalit" <msch...@pacbell.net> wrote in message
news:3c7a9e19....@news.sf.sbcglobal.net...

Richard Reynolds

Feb 25, 2002, 4:05:10 PM
I don't think this is a bug, this is expected behaviour. There was nothing
between two delimiters so nothing was returned.

"MyndPhlyp" <nob...@home.com> wrote in message
news:a5e679$3rs$1...@nntp9.atl.mindspring.net...

Thorsten Seelend

Feb 25, 2002, 4:07:46 PM
On Mon, 25 Feb 2002 15:18:36 -0500, "MyndPhlyp" <nob...@home.com> wrote:

> I've been trying to find a way around this without having to kluge the data.

> ...


> But if one of those "fields" is empty, StringTokenizer skips over it:
>
> String[] myField = new String[6];
> String tempString = "A|B||D|E|F";
> StringTokenizer st = new StringTokenizer(tempString, "|");
> int i = 0;
> while (st.hasMoreTokens())
> {
> myField[i] = st.nextToken();
> i++;
> }
>
> This results in:
>
> myField[0] = "A"
> myField[1] = "B"
> myField[2] = "D"
> myField[3] = "E"
> myField[4] = "F"
> myField[5] = <uninitialized>
>
> I have found that if I kluge the data inserting a space in the empty slot,
> StringTokenizer works just fine, but I don't want to kluge the data.
>
> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

The StringTokenizer won't return "empty tokens".

A fast workaround would be to extend that class, internally use the
constructor that returns the delimiters as tokens, and handle empty tokens
yourself.

import java.util.*;

public class Tokenize extends StringTokenizer {
    protected boolean lastWasDelim = true;
    String delims;

    public Tokenize(String s, String _delims) {
        super(s, _delims, true);
        delims = _delims;
    }

    public String nextToken() {
        String token = super.nextToken();
        boolean isDelim = token.length() > 0 && delims.indexOf(token.charAt(0)) != -1;

        token = isDelim ? (lastWasDelim ? "" : null) : token;
        lastWasDelim = isDelim;
        return token==null ? nextToken() : token;
    }

    public static void main(String[] args) {
        Tokenize st = new Tokenize("A|B||D|E|||F", "|");

        while (st.hasMoreTokens())
            System.out.println('<' + st.nextToken() + '>');
    }
}

prints out:

<A>
<B>
<>
<D>
<E>
<>
<>
<F>


Note that I didn't "adjust" all methods according to the
specification (countTokens(), nextToken(newDelims), ...)

Bye
Thorsten

Greg Faron

Feb 25, 2002, 4:06:53 PM
MyndPhlyp wrote:
> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

Sub-class StringTokenizer to do what you want it to do. I've skimmed
through the class, and I think the key is in the private method
skipDelimiters(int). If you write a different method that simply
returns the value of the argument plus one (all delimiters are assumed
to be of length one), it _should_ result in tokens of empty Strings
being returned when you have consecutive delimiters. This private
method is called in three places, so you'll need to override those
public methods (countTokens(), nextToken(), and hasMoreTokens()) to call
your version instead.

--
Greg Faron
Integre Technical Publishing Co.

MyndPhlyp

Feb 25, 2002, 4:12:54 PM
Richard:

The problem is that nextToken() not only returned nothing, it skipped right
over it. Certainly not expected behavior in my case (and in the case of Matt
Schalit).

nextToken() could have at least returned null thereby acknowledging the fact
that the delimiter existed.


"Richard Reynolds" <richier...@ntlworld.com> wrote in message
news:LExe8.120$Hg1....@news6-win.server.ntlworld.com...


> I don't think this is a bug, this is expected behaviour. There was nothing
> between two delimiters so nothing was returned.
>

<... snip ...>


Richard Reynolds

Feb 25, 2002, 4:55:00 PM
it didn't "skip over" anything, there was no token to skip over, there was
no token so it didn't return anything. A token is a String, the class is
called StringTokenizer, now if it were called StringOrEmptyStringTokenizer,
I'd complain :)
It returns Strings that are delimited, if there's nothing between the
delimiters I wouldn't expect it to return anything.

"MyndPhlyp" <nob...@home.com> wrote in message

news:a5e9df$ero$1...@slb6.atl.mindspring.net...

MyndPhlyp

Feb 25, 2002, 5:37:25 PM
Richard:

I guess one person's "feature" is another person's "bug." <g>


"Richard Reynolds" <richier...@ntlworld.com> wrote in message

news:unye8.370$Hg1....@news6-win.server.ntlworld.com...

MyndPhlyp

Feb 25, 2002, 5:40:48 PM
Greg:

That is certainly a possibility. I'll have to look into this a bit deeper.


"Greg Faron" <gfa...@integretechpub.com> wrote in message
news:3C7AA76D...@integretechpub.com...

MyndPhlyp

Feb 25, 2002, 5:43:35 PM
Thorsten:

Re-engineering seems to be the theme at the moment. Notice that Greg Faron
also mentioned that route. I'll have to look into this.


"Thorsten Seelend" <thor...@gmx.de> wrote in message
news:oc9l7u4n97b8l743i...@4ax.com...

Jon Skeet

Feb 25, 2002, 5:46:22 PM
MyndPhlyp <nob...@home.com> wrote:

> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

Not easily. However, there are other tokenizers out there which do. Have
a look at JlsTokenizer at
http://www.pobox.com/~skeet/java/skeetutil

--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too

Thorsten Seelend

Feb 25, 2002, 6:07:26 PM
On Mon, 25 Feb 2002 21:55:00 -0000, "Richard Reynolds" <richier...@ntlworld.com>
wrote:

> it didn't "skip over" anything, there was no token to skip over, there was
> no token so it didn't return anything. A token is a String, the class is
> called StringTokenizer, now if it were called StringOrEmptyStringTokenizer,
> I'd complain :)

Isn't "" a string??

> ...

Bye
Thorsten

Michiel Konstapel

Feb 25, 2002, 3:49:49 PM
> I have found that if I kluge the data inserting a space in the empty slot,
> StringTokenizer works just fine, but I don't want to kluge the data.
>
> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

Nope, that's just how it works. If you want to "see" empty fields, you have
to use the other StringTokenizer constructor with a boolean parameter
telling it to return the delimiters as well. Then, when you see two
delimiters in a row, you know you just passed an empty field.
HTH,
Michiel
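
For reference, a minimal sketch of the approach described above: ask StringTokenizer to return the delimiters too, and treat two delimiters in a row as an empty field. It assumes a single-character delimiter. The class name DelimiterAwareSplit and its split method are made up for illustration and are not part of the JDK. Unlike the array versions in this thread, it also reports an empty field for a leading or trailing delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper (not a JDK class): splits on one delimiter
// character and keeps empty fields.
class DelimiterAwareSplit {

    static List<String> split(String text, char delimiter) {
        String delim = String.valueOf(delimiter);
        // Third argument true: delimiters are returned as tokens too.
        StringTokenizer st = new StringTokenizer(text, delim, true);
        List<String> fields = new ArrayList<>();
        // The start of the input counts as a field boundary.
        boolean lastWasDelimiter = true;

        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(delim)) {
                if (lastWasDelimiter) {
                    fields.add(""); // nothing between two boundaries
                }
                lastWasDelimiter = true;
            } else {
                fields.add(token);
                lastWasDelimiter = false;
            }
        }
        if (lastWasDelimiter) {
            fields.add(""); // input was empty or ended with a delimiter
        }
        return fields;
    }

    public static void main(String[] args) {
        // Six fields, the third one empty.
        System.out.println(split("A|B||D|E|F", '|'));
    }
}
```

Returning a List avoids having to guess the array size up front, which is what forced the oversized array elsewhere in the thread.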


Michiel Konstapel

Feb 25, 2002, 4:54:24 PM
I might expect it to return exactly what's there: "", and I wish it did.
Fortunately, I read the docs before getting bitten ;-)
Michiel

"Richard Reynolds" <richier...@ntlworld.com> wrote in message

news:unye8.370$Hg1....@news6-win.server.ntlworld.com...

Karl Schmidt

Feb 25, 2002, 7:26:56 PM
Thorsten Seelend schrieb:

In some way, I understand your point. But what would you expect for this?:

// String s contains:
// Content-Type: multipart/mixed; boundary="_=_=_=_X05T_BOUNDARY_STRING_=_=_=_"
StringTokenizer st = new StringTokenizer(s, ": /;\"");

The StringTokenizer only returns tokens between delimiters and skips the delimiters. If
that is not what you want, write your own or use StringTokenizer with the option to
return the delimiters (so, if two delimiters follow immediately, you know some token is
missing).

--

MfG


Karl Schmidt
ICQ #15923569


MyndPhlyp

Feb 25, 2002, 8:10:59 PM
Thanx to all for the suggestions ... and the heated debate on what is, what
should be and what always was. I enjoy the occasional sparring.

Looking at the re-engineering suggested by a couple of individuals, the
solutions proposed are definitely creative. However, I'm electing to go yet
another route keeping StringTokenizer as virgin as it is. There is a "return
delimiter" parameter that can be used when creating the new StringTokenizer.
While it returns more than I really want, I can always parse through the
returns with an "if" statement and handle the delimiters.

=====
Source
=====
// The array is increased in size to accommodate
// the potential return of each "field" plus
// each "field value"

// The "new StringTokenizer" third parameter
// is set to true.

String[] myField = new String[11];
String tempString = "A|B||D|E|F";

StringTokenizer st = new StringTokenizer(tempString, "|", true);
int i = 0;
while (st.hasMoreTokens())
{
    myField[i] = st.nextToken();
    i++;
}

for (i = 0; i < myField.length; i++)
    System.out.println("myField[" + i + "] = " + myField[i]);
=====
Output
=====


myField[0] = A
myField[1] = |
myField[2] = B
myField[3] = |
myField[4] = |
myField[5] = D
myField[6] = |
myField[7] = E
myField[8] = |
myField[9] = F
myField[10] = null

All that is left is to add in the quick "if" statement after st.nextToken():

if ("|".equals(myField[i]))
{
    myField[i] = null;
    i--;
}

It probably looks sloppy to some, strange to others, and far from elegant to
many. But it works.

Thanx again all.


MyndPhlyp

Feb 25, 2002, 8:18:26 PM
Read docs?

BEFORE getting bitten?!?

Now, THERE'S a concept I hadn't thought of. <g>


"Michiel Konstapel" <a...@me.nl> wrote in message
news:koye8.448$HE5....@nlnews00.chello.com...

MyndPhlyp

Feb 25, 2002, 8:31:54 PM
I hate it when that happens. Nothing like defeating the purpose. I wouldn't
wish this kind of premature typulation upon anybody.


"MyndDent" <nob...@home.com> wrote in message
news:a5enm7$bor$1...@slb3.atl.mindspring.net...

Karl Schmidt

Feb 25, 2002, 8:54:20 PM
MyndPhlyp schrieb:

> while (st.hasMoreTokens())
> {
> myField[i] = st.nextToken();
> i++;
> }
> for (i = 0; i < myField.length; i++)
> System.out.println("myField[" + i + "] = " + myField[i]);

Argh!! That hurts...

Why don't you check earlier?

String lastToken = null;
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    if (token.equals("|")) {
        if ("|".equals(lastToken)) {
            myField[i++] = "";
        }
    } else {
        myField[i++] = token;
    }
    lastToken = token;
}

Richard Reynolds

Feb 26, 2002, 7:47:56 AM
It's not a token, if "" is a token how many are in a file? an infinite
number? or are we dividing by zero!

"Thorsten Seelend" <thor...@gmx.de> wrote in message

news:nqgl7uc3a187ht6hv...@4ax.com...

Thorsten Seelend

Feb 26, 2002, 9:01:44 AM
On Tue, 26 Feb 2002 12:47:56 -0000, "Richard Reynolds" <richier...@ntlworld.com>
wrote:

> "Thorsten Seelend" <thor...@gmx.de> wrote in message
> > Isn't "" a string??


> It's not a token, if "" is a token how many are in a file? an infinite
> number? or are we dividing by zero!

OK. That's a worthy argument.

Bye
Thorsten

Dale King

Feb 25, 2002, 8:44:07 PM
"MyndPhlyp" <nob...@home.com> wrote in message
news:a5e9df$ero$1...@slb6.atl.mindspring.net...

> Richard:
>
> The problem is that nextToken() not only returned nothing, it skipped right
> over it. Certainly not expected behavior in my case (and in the case of Matt
> Schalit).
>
> nextToken() could have at least returned null thereby acknowledging the fact
> that the delimiter existed.


It is the specified behavior. It is more designed for the case where tokens
are separated by spaces or whitespace. In that case multiple spaces are not
usually considered significant.

StringTokenizer has many faults, including the fact that from 1.2 to 1.3 they
changed its behavior so that any program that depended on the old behavior
(like mine did) is now broken and there is no good workaround. As far as I'm
concerned StringTokenizer should be marked as deprecated.

For more info on how they broke it consider this code:

String toParse = "foo-bar,baz";
StringTokenizer tok = new StringTokenizer( toParse, "-");
tok.nextToken();
System.out.println( tok.nextToken(",") );

What does this print? Depends on which version of the JDK you are using. 1.2
and before prints bar. 1.3 and later will print -bar. Somehow they don't
think this is a bug.

Starting with 1.4 you are better off using the regular expression package.
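
As a rough illustration of the regular-expression route on 1.4 and later, String.split can stand in for the tokenizer when empty fields matter. This is only a sketch; the class name RegexSplitDemo and the method fields are made up here. Two details are easy to miss: the pipe is a regex metacharacter and must be escaped, and a negative limit is needed if trailing empty fields should be kept.

```java
import java.util.Arrays;

// Hypothetical demo class, not taken from the thread.
class RegexSplitDemo {

    static String[] fields(String line) {
        // "\\|" matches a literal pipe; limit -1 keeps trailing empty fields.
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        // The empty third field is preserved as "".
        System.out.println(Arrays.toString(fields("A|B||D|E|F")));
    }
}
```

This does not help code that still has to run on 1.3 or earlier, where String.split is not available.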
--
Dale King


André Wuttke

Feb 26, 2002, 2:22:49 PM
"Dale King" <Ki...@TCE.com> wrote in news:3c7b...@news.tce.com:

Hello Dale

> For more info on how they broke it consider this code:
>
> String toParse = "foo-bar,baz";
> StringTokenizer tok = new StringTokenizer( toParse, "-");
> tok.nextToken();
> System.out.println( tok.nextToken(",") );
>
> What does this print? Depends on which version of the JDK you are
> using. 1.2 and before prints bar. 1.3 and later will print -bar.
> Somehow they don't think this is a bug.

This seems to be OK in my opinion. I think the old behavior was a bug.
Since you changed the separator, "-" isn't a separator anymore and so
belongs to the next token.
Have you considered changing the separator to ",-"?
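
The disputed call can be reproduced with a few lines. This is a sketch that wraps the four-line example quoted above so it runs on its own; the class name DelimiterChangeDemo and the method secondToken are made up. On JDK 1.3 and later it prints -bar, because the '-' is no longer in the delimiter set when the second token is scanned. According to the thread, 1.2 and earlier printed bar.

```java
import java.util.StringTokenizer;

// Hypothetical demo class wrapping the example from the thread.
class DelimiterChangeDemo {

    // Returns the second token after switching the delimiter set to ",".
    static String secondToken(String toParse) {
        StringTokenizer tok = new StringTokenizer(toParse, "-");
        tok.nextToken();           // consumes "foo", stops in front of '-'
        return tok.nextToken(","); // from here on only ',' is a delimiter
    }

    public static void main(String[] args) {
        System.out.println(secondToken("foo-bar,baz"));
    }
}
```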

André

Greg Faron

Feb 26, 2002, 3:33:54 PM

Not really. There are as many "" as are separated by valid
delimiters. For example, given the delimiter "|" (and delimiters
are guaranteed to be non-empty, single-character strings), and the
original string "a|b|c|||f|g||i", I would simply expect for there
to be 9 valid tokens. There cannot be an infinite number of tokens
without an infinite (minus one :) ) number of delimiters.

Richard's argument more likely can be applied to the delimiter " "
and the original string "please parse this string", for which you
would like 4 tokens, not 8 (four of which are the empty string).
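
The count of 9 can be checked with a splitter that treats every delimiter as a field boundary. A minimal sketch, assuming a single-character delimiter; the class name IndexOfSplit is made up. The technique itself needs only indexOf and substring, so it does not depend on the regex package, although the sketch uses modern collection syntax for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one field per delimiter, plus one.
class IndexOfSplit {

    static List<String> split(String text, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = text.indexOf(delimiter, start)) != -1) {
            fields.add(text.substring(start, end)); // may be ""
            start = end + 1;
        }
        fields.add(text.substring(start)); // last field, possibly ""
        return fields;
    }

    public static void main(String[] args) {
        // Eight delimiters, so nine fields.
        System.out.println(split("a|b|c|||f|g||i", '|').size());
    }
}
```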

Dale King

Feb 26, 2002, 3:59:35 PM
"André Wuttke" <awuttke(remove)@medistar.de> wrote in message
news:a5gna9$75s31$1...@ID-133186.news.dfncis.de...

No it is a separator not part of the next token. Basically it is a separator
for the first token and not for the remaining text. What I want is to get
the first token and then the rest of the text after the separator. There is
no clean way to do that with the current implementation of StringTokenizer.

> Have you considered changing the separator to ",-"?

OK, I looked again at what it was I was trying to do. That example was off
the top of my head. What I want to do is change to no separators. What I was
parsing was something like this:

foo: Arbitrary text which can contain spaces and : characters

What I wanted was to tokenize this to get the initial token "foo" and then
get the argument "Arbitrary text which can contain spaces and : characters".
Note that the argument may in fact have tokens delimited by colons or spaces
that will get tokenized later, it depends on the command. For certain
commands it is any arbitrary text. It may not be the best command format for
parsing, but it is for a program I am porting and the format is defined and
already widely used.

What I did before 1.3 was:

StringTokenizer tok = new StringTokenizer( text, ": " );
String command = tok.nextToken();
String argument = null;
if( tok.hasMoreTokens() )
{
argument = tok.nextToken("");
}

That worked in 1.2, but in 1.3 I now get the colon and spaces at the
beginning of the argument. I find no good way to work around this with
StringTokenizer.

You can argue which makes more sense, but changing the contract of
StringTokenizer with no change in documentation is plain wrong. I would have
no problem with adding an overload with a flag to get the new behavior, but
removing the old behavior is not acceptable, particularly when there is no
good workaround.

Note the old behavior is very familiar to those familiar with C's strtok
function, which is what was used in the code I was porting.

--
Dale King


André Wuttke

Feb 26, 2002, 6:07:27 PM
"Dale King" <Ki...@TCE.com> wrote in news:3c7b...@news.tce.com:

> You can argue which makes more sense, but changing the contract of
> StringTokenizer with no change in documentation is plain wrong. I would

Maybe I can argue, but Sun considered the old behavior a bug (look up
StringTokenizer in the Bug Parade). So they don't have to change the docs
to fix a bug, do they?
What one can argue about is what contract the docs imply. And that is somewhat
easy to misunderstand:

nextToken
public String nextToken(String delim)
Returns the next token in this string tokenizer's string. First, the set of
characters considered to be delimiters by this StringTokenizer object is
changed to be the characters in the string delim. Then the next token in the
string after the current position is returned. The current position is
advanced beyond the recognized token. The new delimiter set remains the
default after this call.

I understand this to mean that the current position is set immediately after
the last token, and that is before the next delimiter. So does Sun, and they
considered the old behavior a bug in "hasMoreElements" erroneously changing
the current position.

> have no problem with adding an overload with a flag to get the new
> behavior, but removing the old behavior is not acceptable, particularly
> when there is no good workaround.

That's right, since many programmers used the old behavior as a feature. They
should have kept the old behavior available, to be activated if desired :-)

> What I did before 1.3 was:
>
> StringTokenizer tok = new StringTokenizer( text, ": " );
> String command = tok.nextToken();
> String argument = null;
> if( tok.hasMoreTokens() )
> {
> argument = tok.nextToken("");
> }

You can do it this way:


StringTokenizer tok = new StringTokenizer( text, ":" );
String command = tok.nextToken();
String argument = null;
if( tok.hasMoreTokens() ) {
    tok.nextToken(" "); //this will remove the colon
}
if( tok.hasMoreTokens() )
{
    argument = tok.nextToken("").trim();
}

Will give You
"foo"
and

"Arbitrary text which can contain spaces and : characters"

from Your "text".

It's not as elegant as before, but fulfills your request.

André

Michiel Konstapel

Feb 26, 2002, 6:33:58 PM
LOL :)

"MyndPhlyp" <nob...@home.com> wrote in message

news:a5enpq$ds$1...@slb0.atl.mindspring.net...

Pat Reaney

Feb 26, 2002, 11:39:01 PM
Gave me a headache too. I took two aspirin and wrote my own; the
source can be found at:
http://forum.java.sun.com/thread.jsp?forum=31&thread=204323

It works just like the perl split() function - consecutive delimiters
return the empty string as a token( so you don't have to test for null
).

Dale King

Feb 27, 2002, 1:08:28 PM
"André Wuttke" <awuttke(remove)@medistar.de> wrote in message
news:a5h4fe$7hf6f$1...@ID-133186.news.dfncis.de...

> "Dale King" <Ki...@TCE.com> wrote in news:3c7b...@news.tce.com:
>
> > You can argue which makes more sense, but changing the contract of
> > StringTokenizer with no change in documentation is plain wrong. I would
> Maybe I can argue, but Sun considered the old behavior a bug (look up
> StringTokenizer in the Bug Parade). So they don't have to change the docs
> to fix a bug, do they?

I am aware of the bug history on this. There was a bug in the hasMoreTokens.
And in fixing that they caused this change. They don't consider it a bug.
They are wrong. See the long list of comments on the bug report 4338282 that
agree with me. Note that this broke JRun.

> What one can argue about is what contract the docs imply. And that is somewhat
> easy to misunderstand:
>
> nextToken
> public String nextToken(String delim)
> Returns the next token in this string tokenizer's string. First, the set of
> characters considered to be delimiters by this StringTokenizer object is
> changed to be the characters in the string delim. Then the next token in the
> string after the current position is returned. The current position is
> advanced beyond the recognized token. The new delimiter set remains the
> default after this call.

The ambiguity comes in terms of "next token". I got one token, then I have
delimiters. My constructor said that delimiters are not to be considered
tokens. But if I change the delimiters suddenly they are. Those initial
characters were already determined in the previous call to be delimiters.
Changing the delimiters should not mean that now they aren't delimiters.

> I understand this to mean that the current position is set immediately after
> the last token, and that is before the next delimiter. So does Sun, and they

And that is necessary if you have the flag turned on to return delimiters as
tokens. You can't just skip past them in that case.

> considered the old behavior a bug in "hasMoreElements" erroneously changing
> the current position.

I'm not debating which is correct or not. There is valid reasoning that
makes the old way correct. The point is that with 4-5 years in existence you
don't suddenly change the way something works. You have introduced an
incompatibility. No matter which behavior is more correct, I can't depend on
it working either way. If I depend on the new behavior my code doesn't work
in 1.2. If I depend on the old behavior, it doesn't work in 1.3. Who cares
which is more correct?

The class is now only usable if you never change the delimiters. Might as
well deprecate it in favor of the regular expression package.

> > have no problem with adding an overload with a flag to get the new
> > behavior, but removing the old behavior is not acceptable, particularly
> > when there is no good workaround.
> That's right, since many programmers used the old behavior as a feature. They
> should have kept the old behavior available, to be activated if desired :-)

But they didn't do that. They just changed the way the existing code works.
The only workaround is to create your own copy of the old StringTokenizer
because you cannot get the old behavior out of the new implementation.

> > What I did before 1.3 was:
> >
> > StringTokenizer tok = new StringTokenizer( text, ": " );
> > String command = tok.nextToken();
> > String argument = null;
> > if( tok.hasMoreTokens() )
> > {
> > argument = tok.nextToken("");
> > }
>
> You can do it this way:
> StringTokenizer tok = new StringTokenizer( text, ":" );
> String command = tok.nextToken();
> String argument = null;
> if( tok.hasMoreTokens() ) {
> tok.nextToken(" "); //this will remove the colon
> }
> if( tok.hasMoreTokens() )
> {
> argument = tok.nextToken("").trim();
> }

Not the same by a long shot. First off, the colon is optional. I could have
a space instead. I could have multiple spaces, multiple colons, multiple
colons and spaces. The following is legal as well:

foo : : : : : Arbitrary text which can contain spaces and : characters

> Will give You
> "foo"
> and
> "Arbitrary text which can contain spaces and : characters"
> from Your "text".

Nope, it will give me:

"foo"


"text which can contain spaces and : characters"

You lost the first word from the remaining text in your loop to remove the
spaces.

> It's not as elegant as before, but fulfills your request.

No it doesn't. You cannot fulfill my requirements using the new
StringTokenizer (short of using reflection to access private members of
StringTokenizer).

These are not that bizarre of requirements. In C it is simply:

command = strtok( text, ": " );
argument = strtok( NULL, "" );

--
Dale King


André Wuttke

Feb 27, 2002, 3:04:06 PM
"Dale King" <Ki...@TCE.com> wrote in news:3c7d...@news.tce.com:

> consider it a bug. They are wrong. See the long list of comments on the
> bug report 4338282 that agree with me. Note that this broke JRun.

OK. Many programmers have seen this as a feature, not a bug. Sun failed to
recognize this. But after all it was a bug in the first place.

> to be delimiters. Changing the delimiters should not mean that now they
> aren't delimiters.

What else? Either they are delimiters or tokens. Some strange previously-
used-delimiter-character-now-to-be-skipped ?

> I'm not debating which is correct or not. There is valid reasoning that
> makes the old way correct. The point is that with 4-5 years in
> existence you don't suddenly change the way something works. You have
> introduced an incompatibility. No matter which behavior is more
> correct, I can't depend on it working either way. If I depend on the
> new behavior my code doesn't work in 1.2. If I depend on the old
> behavior, it doesn't work in 1.3. Who cares which is more correct?

Right.

>
> The class is now only usable if you never change the delimiters. Might
> as well deprecate it in favor of the regular expression package.

Right again.


>> You can do it this way:
>> StringTokenizer tok = new StringTokenizer( text, ":" );
>> String command = tok.nextToken();
>> String argument = null;
>> if( tok.hasMoreTokens() ) {
>> tok.nextToken(" "); //this will remove the colon
>> }
>> if( tok.hasMoreTokens() )
>> {
>> argument = tok.nextToken("").trim();
>> }

>> Will give You
>> "foo"
>> and
>> "Arbitrary text which can contain spaces and : characters" from Your
>> "text".
>
> Nope, it will give me:
>
> "foo"
> "text which can contain spaces and : characters"
>
> You lost the first word from the remaining text in your loop to remove
> the spaces.

Worked for me. No loss of a word.
What loop, anyway?

>
>> It's not as elegant as before, but fulfills your request.
>
> No it doesn't. You cannot fulfill my requirements using the new
> StringTokenizer (short of using reflection to access private members of
> StringTokenizer).

I haven't said it fulfills Your requirements. Only what You requested in
Your previous post :-)
Had the uneasy feeling that Your requirements were not that simple, anyway
:-)

André

Dale King

Feb 27, 2002, 4:04:11 PM
"André Wuttke" <awuttke(remove)@medistar.de> wrote in message
news:a5je3m$7oh17$1...@ID-133186.news.dfncis.de...

> "Dale King" <Ki...@TCE.com> wrote in news:3c7d...@news.tce.com:
>
> > consider it a bug. They are wrong. See the long list of comments on the
> > bug report 4338282 that agree with me. Note that this broke JRun.
> OK. Many programmers have seen this as a feature, not a bug. Sun failed to
> recognize this. But after all it was a bug in the first place.

The fact that it skipped the delimiters from the previous token was not a
bug. That was a perfectly sensible way to work, unless you had the flag that
delimiters are returned as tokens.

> > to be delimiters. Changing the delimiters should not mean that now they
> > aren't delimiters.
> What else? Either they are delimiters or tokens. Some strange previously-
> used-delimiter-character-now-to-be-skipped ?

Well according to the JDK when I am not returning delimiters as tokens, "a
token is a maximal sequence of consecutive characters that are not
delimiters". So in my example : is either a delimiter or it isn't. Since
StringTokenizer now returns two strings that had nothing in between them
that violates that definition. If you say the colon is not a delimiter then
it is not maximal. If you say the colon is a delimiter then it fails the
statement that tokens do not contain delimiters.

Basically I told it that colon and space were delimiters. It is then
ignoring what I told it.

You keep saying that the old way was a bug and the new way is correct. Can
you give any reasonable example where you want the delimiter to be part of
the next token? I can see it if you are returning delimiters as tokens.

> >> You can do it this way:
> >> StringTokenizer tok = new StringTokenizer( text, ":" );
> >> String command = tok.nextToken();
> >> String argument = null;
> >> if( tok.hasMoreTokens() ) {
> >> tok.nextToken(" "); //this will remove the colon
> >> }
> >> if( tok.hasMoreTokens() )
> >> {
> >> argument = tok.nextToken("").trim();
> >> }
> >> Will give You
> >> "foo"
> >> and
> >> "Arbitrary text which can contain spaces and : characters" from Your
> >> "text".
> >
> > Nope, it will give me:
> >
> > "foo"
> > "text which can contain spaces and : characters"
> >
> > You lost the first word from the remaining text in your loop to remove
> > the spaces.
> Worked for me. No loss of a word.
> What loop, anyway?

I thought you were looping removing spaces after the colon. I see now that
you simply skipped the colon that you assumed had to be there (it doesn't)

Yours will only work in the case where there is exactly one colon with no
spaces before the colon and at least one space after the colon. It also
assumes that I don't care about trailing spaces, but that could be fixed. I
don't remember stating that there was exactly one colon and it couldn't have
a space before it.

> >> It's not as elegant as before, but fulfills your request.
> >
> > No it doesn't. You cannot fulfill my requirements using the new
> > StringTokenizer (short of using reflection to access private members of
> > StringTokenizer).
> I haven't said it fulfills Your requirements. Only what You requested in
> Your previous post :-)
> Had the uneasy feeling that Your requirements were not that simple, anyway

My requirements were very simple, the same result that I obtained with the
previous implementation. You got the same result for my one example input,
but not for all. It gives different results for these strings:

foo : : Arbitrary text
foo:Arbitrary text
foo Arbitrary text

which are all acceptable inputs and should provide the same result. It would
be nice to be able to eliminate these as acceptable, but as I said this is
an existing well-known format and the text is input by humans.

The real requirement is to emulate:

command = strtok( text, ": " );

argument = strtok( NULL, "" );

which the old implementation did nicely.

I suppose there is a workaround that is simpler than what you posted:

StringTokenizer tok = new StringTokenizer( text, ": " );
String command = tok.nextToken();
String argument = null;

// Take everything that is left, then skip any leading colons and spaces.
String remaining = tok.nextToken("");
for( int i = 0; i < remaining.length(); i++ )
{
    char c = remaining.charAt( i );
    if( c != ':' && c != ' ' )
    {
        argument = remaining.substring( i );
        break;
    }
}

This at least gives the required behavior under either implementation.
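For comparison, here is a rough sketch that sidesteps nextToken( String ) altogether by scanning the string by index. The class name CommandLineSplit and the helpers splitCommand and isDelimiter are made up for illustration; they are not from any library. It assumes, as above, that colon and space are the only delimiters and that trailing spaces in the argument should be kept.

```java
public class CommandLineSplit {

    // Hypothetical helper: true for the two delimiter characters used in this thread.
    static boolean isDelimiter(char c) {
        return c == ':' || c == ' ';
    }

    // Hypothetical helper: returns { command, argument }.
    // Either element is null when the corresponding part is missing.
    static String[] splitCommand(String text) {
        int n = text.length();

        // Skip delimiters in front of the command.
        int start = 0;
        while (start < n && isDelimiter(text.charAt(start))) {
            start++;
        }

        // The command is the first run of non-delimiter characters.
        int end = start;
        while (end < n && !isDelimiter(text.charAt(end))) {
            end++;
        }
        String command = (end > start) ? text.substring(start, end) : null;

        // Skip the whole run of colons and spaces after the command.
        int argStart = end;
        while (argStart < n && isDelimiter(text.charAt(argStart))) {
            argStart++;
        }

        // Everything else, untrimmed, is the argument.
        String argument = (argStart < n) ? text.substring(argStart) : null;
        return new String[] { command, argument };
    }

    public static void main(String[] args) {
        String[] inputs = {
            "foo : : Arbitrary text",
            "foo:Arbitrary text",
            "foo Arbitrary text"
        };
        for (String input : inputs) {
            String[] parts = splitCommand(input);
            System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
        }
    }
}
```

Since it never touches StringTokenizer, the result should not depend on which JDK version is running it.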
--
Dale King


Greg Faron

Feb 27, 2002, 5:07:54 PM
Dale King wrote:
>
> You keep saying that the old way was a bug and the new way is
> correct. Can you give any reasonable example where you want
> the delimiter to be part of the next token? I can see it if
> you are returning delimiters as tokens.

It seems to me that there is a misunderstanding going on between
the two halves of this discussion. I may be wrong, but I think Andre
is saying that the second delimiter replaces the first, not gets
added to the list of delimiters. If it worked like this, it would
result in the _old_ delimiter being part of the token, as it would
no longer be a valid delimiter character.
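That reading is easy to check with a few lines. The following is only a sketch; the class DelimiterSwapDemo and the method firstAndRest are invented names, and the behaviour described in the comments is what the newer (1.3 and later) implementation discussed in this thread does.

```java
import java.util.StringTokenizer;

public class DelimiterSwapDemo {

    // Hypothetical helper: returns { first token, result of nextToken("") }.
    static String[] firstAndRest(String text) {
        StringTokenizer tok = new StringTokenizer(text, ":");
        String first = tok.nextToken();

        // Passing "" replaces the delimiter set with the empty set, so the
        // colon that ended the first token is no longer a delimiter and is
        // not skipped when the next token is scanned.
        String rest = tok.hasMoreTokens() ? tok.nextToken("") : null;
        return new String[] { first, rest };
    }

    public static void main(String[] args) {
        String[] parts = firstAndRest("foo: bar");
        System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
    }
}
```

With the newer implementation the second element comes back as ": bar", colon included, which is the case where the _old_ delimiter ends up inside the token.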

Chris Smith

Feb 27, 2002, 6:57:19 PM
Greg Faron wrote ...

Yes, the new delimiter does replace the first.

And Dale is saying that since the old delimiter character has already
been recognized and treated as a delimiter, it is inconsistent to go back
now and treat it as a part of the next token. I tend to agree with him,
both logically and in terms of practical use. I can see no useful
applications of StringTokenizer while changing the delimiter.

I still think the class as a whole is simpler and easier than regexps
when tokenizing with whitespace, but I would tend to use other language
features for more complex lexing needs.

At a minimum, if Sun does believe that the current behavior is correct,
then the API specification should be clarified on this point. I actually
think the general StringTokenizer behavior would be infinitely more
understandable were there to be no such thing as nextToken(String).

Chris Smith

Dale King

Feb 28, 2002, 10:32:23 AM
"Chris Smith" <cds...@twu.net> wrote in message
news:MPG.16e7204ca...@news.altopia.com...

> Greg Faron wrote ...
> > Dale King wrote:
> > >
> > > You keep saying that the old way was a bug and the new way is
> > > correct. Can you give any reasonable example where you want
> > > the delimiter to be part of the next token? I can see it if
> > > you are returning delimiters as tokens.
> >
> > It seems to me that there is a misunderstanding going on between
> > the two halves of this discussion. I may be wrong, but I think Andre
> > is saying that the second delimiter replaces the first, not gets
> > added to the list of delimiters. If it worked like this, it would
> > result in the _old_ delimiter being part of the token, as it would
> > no longer be a valid delimiter character.
>
> Yes, the new delimiter does replace the first.
>
> And Dale is saying that since the old delimiter character has already
> been recognized and treated as a delimiter, it is inconsistent to go back
> now and treat it as a part of the next token. I tend to agree with him,
> both logically and in terms of practical use. I can see no useful
> applications of StringTokenizer while changing the delimiter.

I think mine is very logical. Basically I want to extract the first token
that is delimited by a colon or space. Then I want the rest of the text
after the delimiter. It is a format much like the lines in a properties
file. I am saying only the first occurrence has that delimiter. I posted
another example like this:

foo: bar,baz,fubar

That seems to be a common pattern to me.
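One way to handle a line like that without ever changing delimiters on a live tokenizer is to use a separate StringTokenizer for each delimiter set. This is just a sketch under that assumption; KeyListLine and parse are names made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class KeyListLine {

    // Hypothetical helper: returns the key followed by the comma-separated values.
    static List<String> parse(String line) {
        List<String> out = new ArrayList<>();

        // First tokenizer: only used to pull off the key.
        StringTokenizer head = new StringTokenizer(line, ": ");
        if (!head.hasMoreTokens()) {
            return out;
        }
        String key = head.nextToken();
        out.add(key);

        // Find where the key ends, then skip the colons and spaces after it.
        int pos = line.indexOf(key) + key.length();
        while (pos < line.length()
                && (line.charAt(pos) == ':' || line.charAt(pos) == ' ')) {
            pos++;
        }

        // Second tokenizer: a fixed delimiter set for the value list.
        StringTokenizer values = new StringTokenizer(line.substring(pos), ",");
        while (values.hasMoreTokens()) {
            out.add(values.nextToken().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("foo: bar,baz,fubar"));
    }
}
```

Note that the second tokenizer still skips empty fields between consecutive commas, which is the behaviour that started this thread.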

> I still think the class as a whole is simpler and easier than regexps
> when tokenizing with whitespace, but I would tend to use other language
> features for more complex lexing needs.

This is a simple lexing task. As I said, strtok in C does it in two lines.

> At a minimum, if Sun does believe that the current behavior is correct,
> then the API specification should be clarified on this point. I actually
> think the general StringTokenizer behavior would be infinitely more
> understandable were there to be no such thing as nextToken(String).

At this point there is no way to correct nextToken( String ). They have
rendered it useless since it has different behavior across JDK versions. Any
code that uses it will not work correctly on at least one version of the
JDK. The only solution is to not use it, thus it should be deprecated. Or at
least give it a @since 1.3 and note that it worked differently in older
versions.


--
Dale King