
StringTokenizer is giving me a headache


MyndPhlyp

Feb 25, 2002, 3:18:36 PM
I've been trying to find a way around this without having to kluge the data.

Given the following:

String[] myField = new String[6];
String tempString = "A|B|C|D|E|F";
StringTokenizer st = new StringTokenizer(tempString, "|");
int i = 0;
while (st.hasMoreTokens())
{
    myField[i] = st.nextToken();
    i++;
}

Life is wonderful and I get what I expected:

myField[0] = "A"
myField[1] = "B"
myField[2] = "C"
myField[3] = "D"
myField[4] = "E"
myField[5] = "F"


But if one of those "fields" is empty, StringTokenizer skips over it:

String[] myField = new String[6];
String tempString = "A|B||D|E|F";
StringTokenizer st = new StringTokenizer(tempString, "|");
int i = 0;
while (st.hasMoreTokens())
{
    myField[i] = st.nextToken();
    i++;
}

This results in:

myField[0] = "A"
myField[1] = "B"
myField[2] = "D"
myField[3] = "E"
myField[4] = "F"
myField[5] = <uninitialized>

I have found that if I kluge the data inserting a space in the empty slot,
StringTokenizer works just fine, but I don't want to kluge the data.

Is there a way to get StringTokenizer to recognize the non-existence of data
between the two delimiters and perhaps even return a null value or
something?


Matt Schalit

Feb 25, 2002, 3:32:53 PM
On Mon, 25 Feb 2002 15:18:36 -0500, "MyndPhlyp" <nob...@home.com> wrote:


>Is there a way to get StringTokenizer to recognize the non-existence of data
>between the two delimiters and perhaps even return a null value or
>something?


I noticed that yesterday and was wondering the same thing.
I thought it would return a null. Anyone?
Matt

MyndPhlyp

Feb 25, 2002, 3:48:24 PM
Matt:

I think I may have uncovered the reason "why". I dug a little deeper in
JBuilder6's online help and came across this item under JCStringTokenizer.
In short, the StringTokenizer "symptom" is a known bug.

Trying to maintain reverse compatibility rules out using JCStringTokenizer
in my case. I hope somebody here has a nice workaround to the problem.

=====
JCStringTokenizer controls simple linear tokenization of a String. The set
of delimiters, which defaults to common whitespace characters, can be
specified either during creation or on a per-token basis.

It is similar to java.util.StringTokenizer, but delimiters can be included
as literals by preceding them with a backslash character (the default). IT
ALSO FIXES A KNOWN PROBLEM: IF ONE DELIMITER IMMEDIATELY FOLLOWS ANOTHER, A
NULL STRING IS RETURNED AS THE TOKEN INSTEAD OF BEING SKIPPED OVER.
=====


"Matt Schalit" <msch...@pacbell.net> wrote in message
news:3c7a9e19....@news.sf.sbcglobal.net...

Richard Reynolds

Feb 25, 2002, 4:05:10 PM
I don't think this is a bug; this is expected behaviour. There was nothing
between two delimiters, so nothing was returned.

"MyndPhlyp" <nob...@home.com> wrote in message
news:a5e679$3rs$1...@nntp9.atl.mindspring.net...

Thorsten Seelend

Feb 25, 2002, 4:07:46 PM
On Mon, 25 Feb 2002 15:18:36 -0500, "MyndPhlyp" <nob...@home.com> wrote:

> I've been trying to find a way around this without having to kluge the data.

> ...


> But if one of those "fields" is empty, StringTokenizer skips over it:
>
> String[] myField = new String[6];
> String tempString = "A|B||D|E|F";
> StringTokenizer st = new StringTokenizer(tempString, "|");
> int i = 0;
> while (st.hasMoreTokens())
> {
>     myField[i] = st.nextToken();
>     i++;
> }
>
> This results in:
>
> myField[0] = "A"
> myField[1] = "B"
> myField[2] = "D"
> myField[3] = "E"
> myField[4] = "F"
> myField[5] = <uninitialized>
>
> I have found that if I kluge the data inserting a space in the empty slot,
> StringTokenizer works just fine, but I don't want to kluge the data.
>
> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

The StringTokenizer won't return "empty tokens".

A fast workaround would be to extend that class, internally use the
constructor that also delivers the delimiters, and handle empty tokens
yourself.

import java.util.*;

public class Tokenize extends StringTokenizer {
    protected boolean lastWasDelim = true;
    String delims;

    public Tokenize(String s, String _delims) {
        super(s, _delims, true);
        delims = _delims;
    }

    public String nextToken() {
        String token = super.nextToken();
        boolean isDelim = token.length() > 0 && delims.indexOf(token.charAt(0)) != -1;

        token = isDelim ? (lastWasDelim ? "" : null) : token;
        lastWasDelim = isDelim;
        return token == null ? nextToken() : token;
    }

    public static void main(String[] args) {
        Tokenize st = new Tokenize("A|B||D|E|||F", "|");

        while (st.hasMoreTokens())
            System.out.println('<' + st.nextToken() + '>');
    }
}

prints out:

<A>
<B>
<>
<D>
<E>
<>
<>
<F>


Note that I didn't "adjust" all methods according to the
specification (countTokens(), nextToken(newDelims), ...).

Bye
Thorsten

Greg Faron

Feb 25, 2002, 4:06:53 PM
MyndPhlyp wrote:
> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

Sub-class StringTokenizer to do what you want it to do. I've skimmed
through the class, and I think the key is in the private method
skipDelimiters(int). If you write a different method that simply
returns the value of the argument plus one (all delimiters are assumed
to be of length one), it _should_ result in tokens of empty Strings
being returned when you have consecutive delimiters. This private
method is called in three places, so you'll need to override those
public methods (countTokens(), nextToken(), and hasMoreTokens()) to call
your version instead.

--
Greg Faron
Integre Technical Publishing Co.

MyndPhlyp

Feb 25, 2002, 4:12:54 PM
Richard:

The problem is that nextToken() not only returned nothing, it skipped right
over it. Certainly not expected behavior in my case (and in the case of Matt
Schalit).

nextToken() could have at least returned null thereby acknowledging the fact
that the delimiter existed.


"Richard Reynolds" <richier...@ntlworld.com> wrote in message
news:LExe8.120$Hg1....@news6-win.server.ntlworld.com...


> I don't think this is a bug; this is expected behaviour. There was nothing
> between two delimiters, so nothing was returned.
>

<... snip ...>


Richard Reynolds

Feb 25, 2002, 4:55:00 PM
It didn't "skip over" anything; there was no token to skip over, so there was
no token to return. A token is a String; the class is called StringTokenizer.
Now, if it were called StringOrEmptyStringTokenizer, I'd complain :)
It returns Strings that are delimited; if there's nothing between the
delimiters, I wouldn't expect it to return anything.

"MyndPhlyp" <nob...@home.com> wrote in message

news:a5e9df$ero$1...@slb6.atl.mindspring.net...

MyndPhlyp

Feb 25, 2002, 5:37:25 PM
Richard:

I guess one person's "feature" is another person's "bug." <g>


"Richard Reynolds" <richier...@ntlworld.com> wrote in message

news:unye8.370$Hg1....@news6-win.server.ntlworld.com...

MyndPhlyp

Feb 25, 2002, 5:40:48 PM
Greg:

That is certainly a possibility. I'll have to look into this a bit deeper.


"Greg Faron" <gfa...@integretechpub.com> wrote in message
news:3C7AA76D...@integretechpub.com...

MyndPhlyp

Feb 25, 2002, 5:43:35 PM
Thorsten:

Re-engineering seems to be the theme at the moment. Notice that Greg Faron
also mentioned that route. I'll have to look into this.


"Thorsten Seelend" <thor...@gmx.de> wrote in message
news:oc9l7u4n97b8l743i...@4ax.com...

Jon Skeet

Feb 25, 2002, 5:46:22 PM
MyndPhlyp <nob...@home.com> wrote:

> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

Not easily. However, there are other tokenizers out there which do. Have
a look at JlsTokenizer at
http://www.pobox.com/~skeet/java/skeetutil

--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too

Thorsten Seelend

Feb 25, 2002, 6:07:26 PM
On Mon, 25 Feb 2002 21:55:00 -0000, "Richard Reynolds" <richier...@ntlworld.com>
wrote:

> It didn't "skip over" anything; there was no token to skip over, so there was
> no token to return. A token is a String; the class is called StringTokenizer.
> Now, if it were called StringOrEmptyStringTokenizer, I'd complain :)

Isn't "" a string??

> ...

Bye
Thorsten

Michiel Konstapel

Feb 25, 2002, 3:49:49 PM
> I have found that if I kluge the data inserting a space in the empty slot,
> StringTokenizer works just fine, but I don't want to kluge the data.
>
> Is there a way to get StringTokenizer to recognize the non-existence of data
> between the two delimiters and perhaps even return a null value or
> something?

Nope, that's just how it works. If you want to "see" empty fields, you have
to use the other StringTokenizer constructor with a boolean parameter
telling it to return the delimiters as well. Then, when you see two
delimiters in a row, you know you just passed an empty field.
HTH,
Michiel
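Michiel's return-the-delimiters approach can be sketched as follows. This is an illustrative helper, not code from the thread; the class name `ReturnDelims` and method `fields` are made up here, and it assumes a single-character delimiter string so each returned delimiter token equals `delim`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ReturnDelims {
    // Ask the tokenizer to return delimiters too, then emit an empty
    // field whenever two delimiters arrive back to back (or the string
    // starts/ends with one).
    static List<String> fields(String s, String delim) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(s, delim, true);
        boolean lastWasDelim = true; // true so a leading delimiter yields an empty field
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals(delim)) {
                if (lastWasDelim) {
                    out.add(""); // two delimiters in a row: empty field
                }
                lastWasDelim = true;
            } else {
                out.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            out.add(""); // trailing delimiter ends an empty field
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("A|B||D|E|F", "|")); // prints [A, B, , D, E, F]
    }
}
```

The `lastWasDelim` flag is exactly the "saw two delimiters in a row" test Michiel describes, generalized to also cover leading and trailing delimiters.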


Michiel Konstapel

Feb 25, 2002, 4:54:24 PM
I might expect it to return exactly what's there: "", and I wish it did.
Fortunately, I read the docs before getting bitten ;-)
Michiel

"Richard Reynolds" <richier...@ntlworld.com> wrote in message

news:unye8.370$Hg1....@news6-win.server.ntlworld.com...

Karl Schmidt

Feb 25, 2002, 7:26:56 PM
Thorsten Seelend schrieb:

In some way, I understand your point. But what would you expect for this?:

// String s contains: Content-Type: multipart/mixed;
//   boundary="_=_=_=_X05T_BOUNDARY_STRING_=_=_=_"
StringTokenizer st = new StringTokenizer(s, ": /;\"");

The StringTokenizer only returns tokens between delimiters and skips the delimiters. If
that is not what you want, write your own or use StringTokenizer with the option to
return the delimiters (so, if two delimiters follow immediately, you know some token is
missing).

--

MfG


Karl Schmidt
ICQ #15923569


MyndPhlyp

Feb 25, 2002, 8:10:59 PM
Thanx to all for the suggestions ... and the heated debate on what is, what
should be and what always was. I enjoy the occasional sparring.

Looking at the re-engineering suggested by a couple of individuals, the
solutions proposed are definitely creative. However, I'm electing to go yet
another route keeping StringTokenizer as virgin as it is. There is a "return
delimiter" parameter that can be used when creating the new StringTokenizer.
While it returns more than I really want, I can always parse through the
returns with an "if" statement and handle the delimiters.

=====
Source
=====
// The array is increased in size to accommodate
// the potential return of each "field" plus
// each delimiter.

// The "new StringTokenizer" third parameter
// is set to true.

String[] myField = new String[11];
String tempString = "A|B||D|E|F";

StringTokenizer st = new StringTokenizer(tempString, "|", true);
int i = 0;
while (st.hasMoreTokens())
{
    myField[i] = st.nextToken();
    i++;
}

for (i = 0; i < myField.length; i++)
    System.out.println("myField[" + i + "] = " + myField[i]);
=====
Output
=====


myField[0] = A
myField[1] = |

myField[2] = B
myField[3] = |
myField[4] = |
myField[5] = D
myField[6] = |
myField[7] = E
myField[8] = |
myField[9] = F
myField[10] = null

All that is left is to add in the quick "if" statement after st.nextToken()
(before the i++):

if ("|".equals(myField[i]))
{
    myField[i] = null;
    i--;
}

It probably looks sloppy to some, strange to others, and far from elegant to
many. But it works.

Thanx again all.


MyndPhlyp

Feb 25, 2002, 8:18:26 PM
Read docs?

BEFORE getting bitten?!?

Now, THERE'S a concept I hadn't thought of. <g>


"Michiel Konstapel" <a...@me.nl> wrote in message
news:koye8.448$HE5....@nlnews00.chello.com...

MyndPhlyp

Feb 25, 2002, 8:31:54 PM
I hate it when that happens. Nothing like defeating the purpose. I wouldn't
wish this kind of premature typulation upon anybody.


"MyndDent" <nob...@home.com> wrote in message
news:a5enm7$bor$1...@slb3.atl.mindspring.net...

Karl Schmidt

Feb 25, 2002, 8:54:20 PM
MyndPhlyp schrieb:

> while (st.hasMoreTokens())
> {
> myField[i] = st.nextToken();
> i++;
> }
> for (i = 0; i < myField.length; i++)
> System.out.println("myField[" + i + "] = " + myField[i]);

Argh!! That hurts...

Why don't you check earlier?

String lastToken = null;
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    if (token.equals("|")) {
        if ("|".equals(lastToken)) {
            myField[i++] = "";
        }
    } else {
        myField[i++] = token;
    }
    lastToken = token;
}

Richard Reynolds

Feb 26, 2002, 7:47:56 AM
It's not a token. If "" is a token, how many are in a file? An infinite
number? Or are we dividing by zero!

"Thorsten Seelend" <thor...@gmx.de> wrote in message

news:nqgl7uc3a187ht6hv...@4ax.com...

Thorsten Seelend

Feb 26, 2002, 9:01:44 AM
On Tue, 26 Feb 2002 12:47:56 -0000, "Richard Reynolds" <richier...@ntlworld.com>
wrote:

> "Thorsten Seelend" <thor...@gmx.de> wrote in message
> > Isn't "" a string??


> It's not a token. If "" is a token, how many are in a file? An infinite
> number? Or are we dividing by zero!

OK. That's a worthy argument.

Bye
Thorsten

Dale King

Feb 25, 2002, 8:44:07 PM
"MyndPhlyp" <nob...@home.com> wrote in message
news:a5e9df$ero$1...@slb6.atl.mindspring.net...

> Richard:
>
> The problem is that nextToken() not only returned nothing, it skipped right
> over it. Certainly not expected behavior in my case (and in the case of Matt
> Schalit).
>
> nextToken() could have at least returned null thereby acknowledging the fact
> that the delimiter existed.


It is the specified behavior. It is designed more for the case where tokens
are separated by spaces or whitespace. In that case multiple spaces are not
usually considered significant.

StringTokenizer has many faults, including the fact that from 1.2 to 1.3 they
changed its behavior so that any program that depended on the old behavior
(like mine did) is now broken and there is no good workaround. As far as I'm
concerned StringTokenizer should be marked as deprecated.

For more info on how they broke it consider this code:

String toParse = "foo-bar,baz";
StringTokenizer tok = new StringTokenizer( toParse, "-");
tok.nextToken();
System.out.println( tok.nextToken(",") );

What does this print? Depends on which version of the JDK you are using. 1.2
and before prints bar. 1.3 and later will print -bar. Somehow they don't
think this is a bug.

Starting with 1.4 you are better off using the regular expression package.
--
Dale King
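For what it's worth, the JDK 1.4 regular-expression route Dale mentions might look like the sketch below, using String.split. One caveat worth knowing: split keeps interior empty fields but silently drops trailing ones unless you pass a negative limit.

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Interior empty fields survive:
        String[] a = "A|B||D|E|F".split("\\|");
        System.out.println(a.length);                   // prints 6 (a[2] is "")

        // Trailing empty fields are dropped by default...
        String[] b = "A|B||D|E|F||".split("\\|");
        // ...but kept when the limit argument is negative:
        String[] c = "A|B||D|E|F||".split("\\|", -1);
        System.out.println(b.length + " " + c.length);  // prints 6 8
    }
}
```

Note the delimiter must be escaped (`\\|`) because split takes a regex, and `|` is a regex metacharacter.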


André Wuttke

Feb 26, 2002, 2:22:49 PM
"Dale King" <Ki...@TCE.com> wrote in news:3c7b...@news.tce.com:

Hello Dale

> For more info on how they broke it consider this code:
>
> String toParse = "foo-bar,baz";
> StringTokenizer tok = new StringTokenizer( toParse, "-");
> tok.nextToken();
> System.out.println( tok.nextToken(",") );
>
> What does this print? Depends on which version of the JDK you are
> using. 1.2 and before prints bar. 1.3 and later will print -bar.
> Somehow they don't think this is a bug.

This seems to be OK in my opinion. I think the old behavior was a bug.
Since you changed the separator, "-" isn't a separator anymore and so
belongs to the next token.
Have you considered changing the separator to ",-"?

André

Greg Faron

Feb 26, 2002, 3:33:54 PM

Not really. There are only as many "" tokens as there are pairs of
adjacent delimiters. For example, given the delimiter "|" (and delimiters
are guaranteed to be non-empty, single-character strings), and the
original string "a|b|c|||f|g||i", I would simply expect there
to be 9 valid tokens. There cannot be an infinite number of tokens
without an infinite (minus one :) ) number of delimiters.

Richard's argument more likely can be applied to the delimiter " "
and the original string "please parse this string", for which you
would like 4 tokens, not 8 (four of which are the empty string).

Dale King

Feb 26, 2002, 3:59:35 PM
"André Wuttke" <awuttke(remove)@medistar.de> wrote in message
news:a5gna9$75s31$1...@ID-133186.news.dfncis.de...

No it is a separator not part of the next token. Basically it is a separator
for the first token and not for the remaining text. What I want is to get
the first token and then the rest of the text after the separator. There is
no clean way to do that with the current implementation of StringTokenizer.

> Have you considered changing the seperator to ",-"?

OK, I looked again at what it was I was trying to do. That example was off
the top of my head. What I want to do is change to no separators. What I was
parsing was something like this:

foo: Arbitrary text which can contain spaces and : characters

What I wanted was to tokenize this to get the initial token "foo" and then
get the argument "Arbitrary text which can contain spaces and : characters".
Note that the argument may in fact have tokens delimited by colons or spaces
that will get tokenized later, it depends on the command. For certain
commands it is any arbitrary text. It may not be the best command format for
parsing, but it is for a program I am porting and the format is defined and
already widely used.

What I did before 1.3 was:

StringTokenizer tok = new StringTokenizer( text, ": " );
String command = tok.nextToken();
String argument = null;
if( tok.hasMoreTokens() )
{
    argument = tok.nextToken("");
}

That worked in 1.2 and in 1.3 I now get the colon and spaces at the
beginning of the argument. I find no good way to workaround this with
StringTokenizer.

You can argue which makes more sense, but changing the contract of
StringTokenizer with no change in documentation is plain wrong. I would have
no problem with adding an overload with a flag to get the new behavior, but
removing the old behavior is not acceptable, particularly when there is no
good workaround.

Note the old behavior is very familiar to anyone who has used C's strtok
function, which is what was used in the code I was porting.

--
Dale King
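The strtok behavior Dale describes can be emulated without StringTokenizer at all, which sidesteps the 1.2/1.3 differences entirely. The helper below is a hypothetical sketch (the name `splitCommand` is invented here), not Dale's actual code:

```java
public class CommandSplit {
    // Hypothetical helper emulating C's strtok(text, ": ") followed by
    // strtok(NULL, ""): the command is the first run of non-delimiter
    // characters; the argument is everything after the following run of
    // ':' and ' ' characters (null if nothing is left).
    static String[] splitCommand(String text) {
        final String delims = ": ";
        int i = 0;
        while (i < text.length() && delims.indexOf(text.charAt(i)) < 0) {
            i++; // scan past the command
        }
        String command = text.substring(0, i);
        while (i < text.length() && delims.indexOf(text.charAt(i)) >= 0) {
            i++; // skip the run of separators, however many there are
        }
        String argument = (i < text.length()) ? text.substring(i) : null;
        return new String[] { command, argument };
    }

    public static void main(String[] args) {
        String[] r = splitCommand("foo : : Arbitrary text with : characters");
        System.out.println(r[0]);  // prints foo
        System.out.println(r[1]);  // prints Arbitrary text with : characters
    }
}
```

Because it scans with indexOf rather than tokenizing, it handles "foo:arg", "foo arg", and "foo : : : arg" identically, which is the property Dale's parser needs.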


André Wuttke

Feb 26, 2002, 6:07:27 PM
"Dale King" <Ki...@TCE.com> wrote in news:3c7b...@news.tce.com:

> You can argue which makes more sense, but changing the contract of


> StringTokenizer with no change in documentation is plain wrong. I would

Maybe I can argue, but Sun considered the old behavior a bug (look up the
Bug Parade for StringTokenizer). So they don't have to change the docs when
fixing a bug, do they?
What one can argue is what contract the docs are implying. And it's somewhat
open to misunderstanding:

nextToken
public String nextToken(String delim)
Returns the next token in this string tokenizer's string. First, the set of
characters considered to be delimiters by this StringTokenizer object is
changed to be the characters in the string delim. Then the next token in the
string after the current position is returned. The current position is
advanced beyond the recognized token. The new delimiter set remains the
default after this call.

This I'm understanding as such: the current position is set immediately after
the last token. And that's before the next delimiter. So does Sun, and
considered the old behavior a bug in "hasMoreElements" erroneously changing
the current position.

> have no problem with adding an overload with a flag to get the new
> behavior, but removing the old behavior is not acceptable, particularly
> when there is no good workaround.

That's right, since many programmers used the old behavior as a feature. They
would have had to keep the old behavior available in some way, activatable if
desired :-)

> What I did before 1.3 was:
>
> StringTokenizer tok = new StringTokenizer( text, ": " );
> String command = tok.nextToken();
> String argument = null;
> if( tok.hasMoreTokens() )
> {
> argument = tok.nextToken("");
> }

You can do it this way:


StringTokenizer tok = new StringTokenizer( text, ":" );
String command = tok.nextToken();
String argument = null;
if( tok.hasMoreTokens() ) {
    tok.nextToken(" "); // this will remove the colon
}
if( tok.hasMoreTokens() )
{
    argument = tok.nextToken("").trim();
}

Will give you
"foo"
and
"Arbitrary text which can contain spaces and : characters"
from your "text".

It's not as elegant as before, but fulfills your requirements.

André

Michiel Konstapel

Feb 26, 2002, 6:33:58 PM
LOL :)

"MyndPhlyp" <nob...@home.com> wrote in message

news:a5enpq$ds$1...@slb0.atl.mindspring.net...

Pat Reaney

Feb 26, 2002, 11:39:01 PM
Gave me a headache too. I took two aspirin and wrote my own; the
source can be found at:
http://forum.java.sun.com/thread.jsp?forum=31&thread=204323

It works just like the perl split() function: consecutive delimiters
return the empty string as a token (so you don't have to test for null).
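Pat's perl-split semantics can also be hand-rolled in a few lines of indexOf scanning. This is a sketch (not the code from the linked thread), assuming a single-character delimiter:

```java
import java.util.ArrayList;
import java.util.List;

public class PerlSplit {
    // Split on a single-character delimiter, keeping every empty field,
    // much like perl's split with a -1 limit.
    static List<String> split(String s, char delim) {
        List<String> out = new ArrayList<String>();
        int start = 0;
        int idx;
        while ((idx = s.indexOf(delim, start)) >= 0) {
            out.add(s.substring(start, idx));
            start = idx + 1;
        }
        out.add(s.substring(start)); // remainder after the last delimiter
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("A|B||D|E|F", '|'));  // prints [A, B, , D, E, F]
    }
}
```

Since it never treats runs of delimiters specially, consecutive delimiters naturally produce empty strings, which is exactly the behavior the original poster wanted.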

Dale King

Feb 27, 2002, 1:08:28 PM
"André Wuttke" <awuttke(remove)@medistar.de> wrote in message
news:a5h4fe$7hf6f$1...@ID-133186.news.dfncis.de...

> "Dale King" <Ki...@TCE.com> wrote in news:3c7b...@news.tce.com:
>
> > You can argue which makes more sense, but changing the contract of
> > StringTokenizer with no change in documentation is plain wrong. I would
> Maybe I can argue, but Sun considered the old behavior a bug (look up the
> Bug Parade for StringTokenizer). So they don't have to change the docs when
> fixing a bug, do they?

I am aware of the bug history on this. There was a bug in the hasMoreTokens.
And in fixing that they caused this change. They don't consider it a bug.
They are wrong. See the long list of comments on the bug report 4338282 that
agree with me. Note that this broke JRun.

> What one can argue is what contract the docs are implying. And it's somewhat
> open to misunderstanding:
>
> nextToken
> public String nextToken(String delim)
> Returns the next token in this string tokenizer's string. First, the set of
> characters considered to be delimiters by this StringTokenizer object is
> changed to be the characters in the string delim. Then the next token in the
> string after the current position is returned. The current position is
> advanced beyond the recognized token. The new delimiter set remains the
> default after this call.

The ambiguity comes in terms of "next token". I got one token, then I have
delimiters. My constructor said that delimiters are not to be considered
tokens. But if I change the delimiters suddenly they are. Those initial
characters were already determined in the previous call to be delimiters.
Changing the delimiters should not mean that now they aren't delimiters.

> This I'm understanding as such: the current position is set immediately after
> the last token. And that's before the next delimiter. So does Sun, and

And that is necessary if you have the flag turned on to return delimiters as
tokens. You can't just skip past them in that case.

> considered the old behavior a bug in "hasMoreElements" erroneously changing
> the current position.

I'm not debating which is correct or not. There is valid reasoning that
makes the old way correct. The point is that with 4-5 years in existence you
don't suddenly change the way something works. You have introduced an
incompatibility. No matter which behavior is more correct, I can't depend on
it working either way. If I depend on the new behavior my code doesn't work
in 1.2. If I depend on the old behavior, it doesn't work in 1.3. Who cares
which is more correct?

The class is now only usable if you never change the delimiters. Might as
well deprecate it in favor of the regular expression package.

> > have no problem with adding an overload with a flag to get the new
> > behavior, but removing the old behavior is not acceptable, particularly
> > when there is no good workaround.
> That's right, since many programmers used the old behavior as a feature. They
> would have had to keep the old behavior available in some way, activatable if
> desired :-)

But they didn't do that. They just changed the way the existing code works.
The only workaround is to create your own copy of the old StringTokenizer
because you cannot get the old behavior out of the new implementation.

> > What I did before 1.3 was:
> >
> > StringTokenizer tok = new StringTokenizer( text, ": " );
> > String command = tok.nextToken();
> > String argument = null;
> > if( tok.hasMoreTokens() )
> > {
> > argument = tok.nextToken("");
> > }
>
> You can do it this way:
> StringTokenizer tok = new StringTokenizer( text, ":" );
> String command = tok.nextToken();
> String argument = null;
> if( tok.hasMoreTokens() ) {
> tok.nextToken(" "); //this will remove the colon
> }
> if( tok.hasMoreTokens() )
> {
> argument = tok.nextToken("").trim();
> }

Not the same by a long shot. First off, the colon is optional. I could have
a space instead. I could have multiple spaces, multiple colons, multiple
colons and spaces. The following is legal as well:

foo : : : : : Arbitrary text which can contain spaces and : characters

> Will give You
> "foo"
> and
> "Arbitrary text which can contain spaces and : characters"
> from Your "text".

Nope, it will give me:

"foo"


"text which can contain spaces and : characters"

You lost the first word from the remaining text in your loop to remove the
spaces.

> It's not as elegant as before, but fulfills your requirements.

No it doesn't. You cannot fulfill my requirements using the new
StringTokenizer (short of using reflection to access private members of
StringTokenizer).

These are not that bizarre requirements. In C it is simply:

command = strtok( text, ": " );
argument = strtok( NULL, "" );

--
Dale King


André Wuttke

Feb 27, 2002, 3:04:06 PM
"Dale King" <Ki...@TCE.com> wrote in news:3c7d...@news.tce.com:

> consider it a bug. They are wrong. See the long list of comments on the
> bug report 4338282 that agree with me. Note that this broke JRun.

OK. Many programmers have seen this as a feature, not a bug. Sun failed in
recognizing this. But after all it was a bug in the first place.

> to be delimiters. Changing the delimiters should not mean that now they
> aren't delimiters.

What else? Either they are delimiters or tokens. Some strange previously-
used-delimiter-character-now-to-be-skipped?

> I'm not debating which is correct or not. There is valid reasoning that
> makes the old way correct. The point is that with 4-5 years in
> existence you don't suddenly change the way something works. You have
> introduced an incompatibility. No matter which behavior is more
> correct, I can't depend on it working either way. If I depend on the
> new behavior my code doesn't work in 1.2. If I depend on the old
> behavior, it doesn't work in 1.3. Who cares which is more correct?

Right.

>
> The class is now only usable if you never change the delimiters. Might
> as well deprecate it in favor of the regular expression package.

Right again.


>> You can do it this way:
>> StringTokenizer tok = new StringTokenizer( text, ":" );
>> String command = tok.nextToken();
>> String argument = null;
>> if( tok.hasMoreTokens() ) {
>>     tok.nextToken(" "); // this will remove the colon
>> }
>> if( tok.hasMoreTokens() )
>> {
>>     argument = tok.nextToken("").trim();
>> }

>> Will give You
>> "foo"
>> and
>> "Arbitrary text which can contain spaces and : characters" from Your
>> "text".
>
> Nope, it will give me:
>
> "foo"
> "text which can contain spaces and : characters"
>
> You lost the first word from the remaining text in your loop to remove
> the spaces.

Worked for me. No loss of a word.
What loop, anyway?

>
>> It's not so elegant as before, but fullfills your requestments.
>
> No it doesn't. You cannot fulfill my requirements using the new
> StringTokenizer (short of using reflection to access private members of
> StringTokenizer).

I haven't said it fulfills your requirements, only what you requested in
your previous post :-)
Had the uneasy feeling that your requirements were not that simple, anyway
:-)

André

Dale King

Feb 27, 2002, 4:04:11 PM
"André Wuttke" <awuttke(remove)@medistar.de> wrote in message
news:a5je3m$7oh17$1...@ID-133186.news.dfncis.de...

> "Dale King" <Ki...@TCE.com> wrote in news:3c7d...@news.tce.com:
>
> > consider it a bug. They are wrong. See the long list of comments on the
> > bug report 4338282 that agree with me. Note that this broke JRun.
> OK. Many programers have seen this a feature not a bug. Sun faild in
> recognizing this. But after all it was a bug in the first place.

The fact that it skipped the delimiters from the previous token was not a
bug. That was a perfectly sensible way to work, unless you had the flag that
delimiters are returned as tokens.

> > to be delimiters. Changing the delimiters should not mean that now they
> > aren't delimiters.
> What else? Either they are delimiters or tokens. Some strange previously-
> used-delimiter-character-now-to-be-skipped ?

Well, according to the JDK, when I am not returning delimiters as tokens, "a
token is a maximal sequence of consecutive characters that are not
delimiters". So in my example : is either a delimiter or it isn't. Since
StringTokenizer now returns two strings that had nothing in between them,
that violates that definition. If you say the colon is not a delimiter, then
the token is not maximal. If you say the colon is a delimiter, then it fails
the statement that tokens do not contain delimiters.

Basically I told it that colon and space were delimiters. It is then
ignoring what I told it.

You keep saying that the old way was a bug and the new way is correct. Can
you give any reasonable example where you want the delimiter to be part of
the next token? I can see it if you are returning delimiters as tokens.

> >> You can do it this way:
> >> StringTokenizer tok = new StringTokenizer( text, ":" );
> >> String command = tok.nextToken();
> >> String argument = null;
> >> if( tok.hasMoreTokens() ) {
> >>     tok.nextToken(" "); // this will remove the colon
> >> }
> >> if( tok.hasMoreTokens() )
> >> {
> >>     argument = tok.nextToken("").trim();
> >> }
> >> Will give You
> >> "foo"
> >> and
> >> "Arbitrary text which can contain spaces and : characters" from Your
> >> "text".
> >
> > Nope, it will give me:
> >
> > "foo"
> > "text which can contain spaces and : characters"
> >
> > You lost the first word from the remaining text in your loop to remove
> > the spaces.
> Worked for me. No los of a word.
> What loop, anyway?

I thought you were looping removing spaces after the colon. I see now that
you simply skipped the colon that you assumed had to be there (it doesn't).

Yours will only work in the case where there is exactly one colon with no
spaces before the colon and at least one space after the colon. It also
assumes that I don't care about trailing spaces, but that could be fixed. I
don't remember stating that there was exactly one colon and it couldn't have
a space before it.

> >> It's not as elegant as before, but fulfills your requirements.
> >
> > No it doesn't. You cannot fulfill my requirements using the new
> > StringTokenizer (short of using reflection to access private members of
> > StringTokenizer).
> I haven't said it fulfills your requirements, only what you requested in
> your previous post :-)
> Had the uneasy feeling that your requirements were not that simple, anyway

My requirements were very simple, the same result that I obtained with the
previous implementation. You got the same result for my one example input,
but not for all. It gives different results for these strings:

foo : : Arbitrary text
foo:Arbitrary text
foo Arbitrary text

which are all acceptable inputs and should provide the same result. It would
be nice to be able to eliminate these as acceptable, but as I said this is
an existing well-known format and the text is input by humans.

The real requirement is to emulate:

command = strtok( text, ": " );

argument = strtok( NULL, "" );

which the old implementation did nicely.

I suppose there is a workaround that is simpler than what you posted:

StringTokenizer tok = new StringTokenizer( text, ": " );
String command = tok.nextToken();
String argument = null;

String remaining = tok.nextToken("");
for( int i = 0; i < remaining.length(); i++ )
{
    char c = remaining.charAt( i );
    if( c != ':' && c != ' ' )
    {
        argument = remaining.substring( i );
        break;
    }
}

This at least gives the required behavior under either implementation.
--
Dale King


Greg Faron

unread,
Feb 27, 2002, 5:07:54 PM2/27/02
to
Dale King wrote:
>
> You keep saying that the old way was a bug and the new way is
> correct. Can you give any reasonable example where you want
> the delimiter to be part of the next token? I can see it if
> you are returning delimiters as tokens.

It seems to me that there is a misunderstanding going on between
the two halves of this discussion. I may be wrong, but I think Andre
is saying that the second delimiter replaces the first, not gets
added to the list of delimiters. If it worked like this, it would
result in the _old_ delimiter being part of the token, as it would
no longer be a valid delimiter character.

Chris Smith

unread,
Feb 27, 2002, 6:57:19 PM2/27/02
to
Greg Faron wrote ...

Yes, the new delimiter does replace the first.

And Dale is saying that since the old delimiter character has already
been recognized and treated as a delimiter, it is inconsistent to go back
now and treat it as a part of the next token. I tend to agree with him,
both logically and in terms of practical use. I can see no useful
application of changing the delimiter part-way through a StringTokenizer run.

I still think the class as a whole is simpler and easier than regexps
when tokenizing with whitespace, but I would tend to use other language
features for more complex lexing needs.

At a minimum, if Sun does believe that the current behavior is correct,
then the API specification should be clarified on this point. I actually
think the general StringTokenizer behavior would be infinitely more
understandable were there to be no such thing as nextToken(String).

Chris Smith

Dale King

unread,
Feb 28, 2002, 10:32:23 AM2/28/02
to
"Chris Smith" <cds...@twu.net> wrote in message
news:MPG.16e7204ca...@news.altopia.com...

> Greg Faron wrote ...
> > Dale King wrote:
> > >
> > > You keep saying that the old way was a bug and the new way is
> > > correct. Can you give any reasonable example where you want
> > > the delimiter to be part of the next token? I can see it if
> > > you are returning delimiters as tokens.
> >
> > It seems to me that there is a misunderstanding going on between
> > the two halfs of this discussion. I may be wrong, but I think Andre
> > is saying that the second delimiter replaces the first, not gets
> > added to the list of delimiters. If it worked like this, it would
> > result in the _old_ delimiter being part of the token, as it would
> > no longer be a valid delimiter character.
>
> Yes, the new delimiter does replace the first.
>
> And Dale is saying that since the old delimiter character has already
> been recognized and treated as a delimiter, it is inconsistent to go back
> now and treat it as a part of the next token. I tend to agree with him,
> both logically and in terms of practical use. I can see no useful
> application of changing the delimiter part-way through a StringTokenizer run.

I think mine is very logical. Basically I want to extract the first token
that is delimited by a colon or space. Then I want the rest of the text
after the delimiter. It is a format much like the lines in a properties
file. I am saying only the first occurrence has that delimiter. I posted
another example like this:

foo: bar,baz,fubar

That seems to be a common pattern to me.

> I still think the class as a whole is simpler and easier than regexps
> when tokenizing with whitespace, but I would tend to use other language
> features for more complex lexing needs.

This is a simple lexing task. As I said strtok in C does it in two lines.

> At a minimum, if Sun does believe that the current behavior is correct,
> then the API specification should be clarified on this point. I actually
> think the general StringTokenizer behavior would be infinitely more
> understandable were there to be no such thing as nextToken(String).

At this point there is no way to correct nextToken( String ). They have
rendered it useless since it has different behavior across JDK versions. Any
code that uses it will not work correctly on at least one version of the
JDK. The only solution is to not use it, thus it should be deprecated. Or at
least give it a @since 1.3 and note that it worked differently in older
versions.


--
Dale King


André Wuttke

unread,
Feb 28, 2002, 3:21:20 PM2/28/02
to
"Dale King" <Ki...@TCE.com> wrote in news:3c7d...@news.tce.com:

> Basically I told it that colon and space were delimiters. It is then


> ignoring what I told it.

You told it that colon and space aren't delimiters anymore. So they belong
to tokens now.

>
> You keep saying that the old way was a bug and the new way is correct.
> Can you give any reasonable example where you want the delimiter to be
> part of the next token? I can see it if you are returning delimiters as
> tokens.

Consider splitting up file names like "somename12345" into alpha and
numerical part.
Since 1.3 you can do this with:

    StringTokenizer tok = new StringTokenizer( filename, "1234567890" );
    String alpha = tok.nextToken();
    String numerical = tok.nextToken("");

You won't lose the first digit from the numerical part, will you?
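Expanded into a self-contained sketch (the class name is made up, and the output shown assumes the post-1.3 nextToken(String) semantics discussed in this thread):

```java
import java.util.StringTokenizer;

public class SplitName {
    public static void main(String[] args) {
        String filename = "somename12345";
        // Digits are delimiters only while extracting the first token.
        StringTokenizer tok = new StringTokenizer(filename, "1234567890");
        String alpha = tok.nextToken();       // "somename"
        // Empty delimiter set: the remainder of the string is one token,
        // keeping the leading digit under 1.3-and-later semantics.
        String numerical = tok.nextToken("");
        System.out.println(alpha + "|" + numerical);
    }
}
```

On a 1.3-or-later JDK this prints somename|12345; on 1.2, as discussed above, the leading digit would be dropped.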

André

André Wuttke

unread,
Feb 28, 2002, 3:41:21 PM2/28/02
to
"Dale King" <Ki...@TCE.com> wrote in news:3c7d...@news.tce.com:

> Yours will only work in the case where there is exactly one colon with


> no spaces before the colon and at least one space after the colon. It
> also assumes that I don't care about trailing spaces, but that could be
> fixed. I don't remember stating that there was exactly one colon and it
> couldn't have a space before it.

The example You gave in the first place was this:


>>> the top of my head. What I want to do is change to no separators. What
>>> I was parsing was something like this:
>>>

>>> foo: Arbitrary text which can contain spaces and : characters


>>> What I did before 1.3 was:

[skiped code]


>>> That worked in 1.2 and in 1.3 I now get the colon and spaces at the
>>> beginning of the argument. I find no good way to workaround this with
>>> StringTokenizer.

I gave You a piece of code which was able to handle exactly what You
requested, not what Your true requirements were. I thought to myself that
they must not be that simple, so I intended it as kind of a joke.
Let's stop arguing about Your example and my code snippet not fulfilling
requirements You hadn't posted in the first place.

André

Dale King

unread,
Feb 28, 2002, 5:56:36 PM2/28/02
to
"Dale King" <Ki...@TCE.com> wrote in message news:3c7e...@news.tce.com...

> "Chris Smith" <cds...@twu.net> wrote in message
>
> > At a minimum, if Sun does believe that the current behavior is correct,
> > then the API specification should be clarified on this point. I actually
> > think the general StringTokenizer behavior would be infinitely more
> > understandable were there to be no such thing as nextToken(String).
>
> At this point there is no way to correct nextToken( String ). They have
> rendered it useless since it has different behavior across JDK versions. Any
> code that uses it will not work correctly on at least one version of the
> JDK. The only solution is to not use it, thus it should be deprecated. Or at
> least give it a @since 1.3 and note that it worked differently in older
> versions.


I have submitted a bug report on this and it has been assigned the ID
4645089.
--
Dale King


Stephen Ostermiller

unread,
Mar 1, 2002, 7:38:16 AM3/1/02
to
I have written a replacement StringTokenizer that tries to address
many of the concerns that folks in this discussion have had. It
provides a way to get the old behavior back, and it has an option for
returning null tokens:
http://ostermiller.org/utils/StringTokenizer.html

Dave Ryan

unread,
Mar 1, 2002, 11:18:46 AM3/1/02
to

Thanks Stephen. I like the StringTokenizer changes, and it also
appears you have some other nice utilities.
-dave

Werner Purrer

unread,
Mar 1, 2002, 11:30:17 AM3/1/02
to
From the depths of Usenet there came a cry:

>
>>Is there a way to get StringTokenizer to recognize the non-existence of
>>data between the two delimiters and perhaps even return a null value or
>>something?
>
>
> I noticed that yesterday and was wondering the same thing.
> I thought it would return a null.  Anyone?
> Matt
Use Regular Expressions...

Werner Purrer

unread,
Mar 1, 2002, 11:31:20 AM3/1/02
to
From the depths of Usenet there came a cry:

> I think I may have uncovered the reason "why". I dug a little deeper in
> JBuilder6's online help and came across this item under JCStringTokenizer.
> In short, the StringTokenizer "symptom" is a known bug.
>
> Trying to maintain reverse compatibility rules out using JCStringTokenizer
> in my case. I hope somebody here has a nice workaround to the problem.
>
Look at one of the RegExp packages, like Jakarta Oro, they have nice regexp
based splitters, which work far better than the basic tokenizer!


Dale King

unread,
Mar 1, 2002, 1:00:51 PM3/1/02
to
"Stephen Ostermiller" <1010...@ostermiller.com> wrote in message
news:6f77396a.0203...@posting.google.com...

Looks good. I have been suggesting for a long time that someone needs to
write a full-featured replacement.

I have one possible enhancement though. The ability to declare certain
characters as ignorable whitespace when appearing around a delimiter.

Consider the case of parsing a comma delimited string like "George
Washington, Abraham Lincoln". If I parse that using comma as a delimiter I
will get the tokens: "George Washington", " Abraham Lincoln". Note the
extra space in front of the second token. I don't really care if there are
extra spaces before or after a token.

One way to get rid of the spaces is to change your delimiter string to be ",
" but that means the space will be a delimiter which would then give me the
tokens: "George", "Washington", "Abraham", "Lincoln", which is not what was
wanted.

The solution is to allow a third string to specify characters that are to be
treated as whitespace that is to be removed when at the beginning or end of
a token, but do not themselves constitute a delimiter.

In my example you could have just used String.trim(), but in general the
characters may not necessarily be whitespace.
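A minimal sketch of that third-string idea (the class and method names here are made up for illustration, not a proposal of any existing API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TrimTokenizer {
    // Hypothetical helper: split on delims, then strip "ignorable"
    // characters from both ends of each token. The ignorable characters
    // are trimmed but do not themselves act as delimiters.
    static List<String> tokenize(String text, String delims, String ignorable) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            int start = 0, end = t.length();
            while (start < end && ignorable.indexOf(t.charAt(start)) >= 0) start++;
            while (end > start && ignorable.indexOf(t.charAt(end - 1)) >= 0) end--;
            tokens.add(t.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Comma is the delimiter; space is merely ignorable at token edges.
        System.out.println(tokenize("George Washington, Abraham Lincoln", ",", " "));
    }
}
```

This prints [George Washington, Abraham Lincoln], keeping the multi-word names intact.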
--
Dale King


skeptic

unread,
Mar 2, 2002, 12:54:05 PM3/2/02
to
>
> I have one possible enhancement though. The ability to declare certain
> characters as ignorable whitespace when appearing around a delimiter.
>
> Consider the case of parsing a comma delimited string like "George
> Washington, Abraham Lincoln". If I parse that using comma as a delimiter I
> will get the tokens: "George Washington", " Abraham Lincoln". Note the
> extra space in front of the second token. I don't really care if there are
> extra spaces before or after a token.
>
> One way to get rid of the spaces is to change your delimiter string to be ",
> " but that means the space will be a delimiter which would then give me the
> tokens: "George", "Washington", "Abraham", "Lincoln", which is not what was
> wanted.
>
> The solution is to allow a third string to specify characters that are to be
> treated as whitespace that is to be removed when at the beginning or end of
> a token, but do not themselves constitute a delimiter.

You actually need a regex-based tokenizer.
For example this:

import jregex.*;
...
RETokenizer tok = new Pattern("\\s*,\\s*").tokenizer(theString);
while( tok.hasMoreTokens() ) {
    ...
}

solves your task.
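For comparison, on JDK 1.4 and later the built-in java.util.regex can do the same whitespace-around-comma split without a third-party package (a sketch, not from the original post):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class RegexSplit {
    public static void main(String[] args) {
        String theString = "George Washington, Abraham Lincoln";
        // Split on a comma with any surrounding whitespace.
        String[] tokens = Pattern.compile("\\s*,\\s*").split(theString);
        System.out.println(Arrays.toString(tokens));
    }
}
```

This prints [George Washington, Abraham Lincoln].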

Dale King

unread,
Mar 4, 2002, 11:20:17 AM3/4/02
to
"skeptic" <se...@zolshar.ru> wrote in message
news:97f0f876.02030...@posting.google.com...

Certainly a regex package will work, but is overkill unless you can rely on
JDK1.4 where it is there anyway.

I am actually disappointed with one aspect of the regex package in JDK1.4.
It doesn't let you do searches on an InputStream. I have an application that
searches the contents of certain webpages looking for certain URL patterns.
It won't work with the JDK1.4 regex package unless I load the entire webpage
to memory.

If you're curious the application is so I can keep up on my Dilbert comics
from UnitedMedia's site.
--
Dale King


skeptic

unread,
Mar 5, 2002, 12:03:57 PM3/5/02
to
> Certainly a regex package will work, but is overkill unless you can rely on
> JDK1.4 where it is there anyway.

Regular expressions aren't overkill for this task, precisely because this
is what they are for. Otherwise you'll end up writing your own mini-regex
engine.

> I am actually disappointed with one aspect of the regex package in JDK1.4.
> It doesn't let you do searches on an InputStream.

It's an inherent feature of NFA-based (also known as Perl-type) regex
matchers (as opposed to POSIX-type) that they rely heavily on buffered
input.
That's because they sometimes need to backtrack.

> I have an application that
> searches the contents of certain webpages looking for certain URL patterns.
> It won't work with the JDK1.4 regex package unless I load the entire webpage
> to memory.
>
> If you're curious the application is so I can keep up on my Dilbert comics
> from UnitedMedia's site.

Look at AwkMatcher in the jakarta-oro package; it's a member of the POSIX
family and so is capable of I/O stream matching.

Regards

Chris Smith

unread,
Mar 5, 2002, 2:18:30 PM3/5/02
to
skeptic wrote ...

> > Certainly a regex package will work, but is overkill unless you can rely on
> > JDK1.4 where it is there anyway.
>
> Regular expressions aren't an overkill for this task, just because
> they are all
> about that. Otherwise you'll end up with writing your own mini-regex
> engine.

Most text parsing can be done without regular expressions, and it's often
much simpler than using a regular expression. For this case, I'd rather
be reading code that uses StringTokenizer in conjunction with
String.trim(), rather than a regular expression package. Of course, I
would have no problem with a utility class like StringTokenizer being
implemented using a regular expression approach. I don't care how any
utility classes are implemented.

Regular expressions, like most general-purpose tools, are inherently more
complex than specific tools that handle specific needs.

> > I am actually disappointed with one aspect of the regex package in JDK1.4.
> > It doesn't let you do searches on an InputStream.
>
> It's an inherent feature of NFA-based (also known as Perl-type) regex
> matchers (as opposed to POSIX-type) that they rely heavily on buffered
> input.
> That's because they sometimes need to backtrack.

Are you saying that regular expressions in Java are evaluated directly
from a NFA representation? I haven't read the code, but it doesn't make
any sense to me. Any NFA can be deterministically converted to a DFA
that will run faster with simpler logic. The DFA representation will be
larger, but given the typical huge disparities between the sizes of
parsable text versus the matching pattern, this seems like the only way
to go.

Chris Smith

Dale King

unread,
Mar 5, 2002, 3:53:47 PM3/5/02
to
"skeptic" <se...@zolshar.ru> wrote in message
news:97f0f876.0203...@posting.google.com...

> > Certainly a regex package will work, but is overkill unless you can rely
on
> > JDK1.4 where it is there anyway.
>
> Regular expressions aren't an overkill for this task, just because
> they are all
> about that. Otherwise you'll end up with writing your own mini-regex
> engine.

If you aren't using 1.4, is it worth it to add a regex package for a very
simple job? No. If you have one available, it is not worth doing it
yourself.

> > I am actually disappointed with one aspect of the regex package in
JDK1.4.
> > It doesn't let you do searches on an InputStream.
>
> It's an inherent feature of NFA-based (also known as Perl-type) regex
> matchers (as opposed to POSIX-type) that they rely heavily on buffered
> input.
> That's because they sometimes need to backtrack.

Yes, I know.

> > I have an application that
> > searches the contents of certain webpages looking for certain URL
patterns.
> > It won't work with the JDK1.4 regex package unless I load the entire
webpage
> > to memory.
> >
> > If you're curious the application is so I can keep up on my Dilbert
comics
> > from UnitedMedia's site.
>
> Look at AwkMatcher in jakarta-oro package, it's a member of POSIX
> family and so is capable of i/o stream matching.

I'm actually using an older version of ORO Matcher that still has
Perl5StreamInput, so I am aware of ORO. It would be nice if this
functionality were included for matches that can be done without
backtracking.

I also submitted a suggestion about changing something like this to work
better with non-blocking I/O introduced in 1.4.
--
Dale King


skeptic

unread,
Mar 6, 2002, 8:31:01 AM3/6/02
to
> Most text parsing can be done without regular expressions, and it's often
> much simpler than using a regular expression.
> For this case, I'd rather
> be reading code that uses StringTokenizer in conjunction with
> String.trim(), rather than a regular expression package.

Are you joking?
Would you really prefer those mountains of spaghetti code?
Did you really do it?

<skipped>

> Are you saying that regular expressions in Java are evaluated directly
> from a NFA representation?

Ja.

> I haven't read the code, but it doesn't make
> any sense to me.

It does for me.
Some reasons (in order of coming to mind):
1. a huge alphabet (64k); it's all active: the "\p{L}" takes a
significant part of it;
2. it seems to me (IMHO^2) that perl5.6 regexes represent a Type 1
(context-sensitive) grammar, see for example '(\w+)(\w+)\2\1' or
'(?(N)aaa|bbb)', which (IMHO^3) cannot be written down as a DFA (or
even as a pure NFA);
3. seems enough...

> Any NFA can be deterministically converted to a DFA
> that will run faster with simpler logic.

Not necessarily. Good NFA matchers apply a lot of heuristic
optimization, yielding much better performance in many cases.

> The DFA representation will be
> larger, but given the typical huge disparities between the sizes of
> parsable text versus the matching pattern, this seems like the only way
> to go.

IMHO the only place where the DFAs certainly beat the NFAs is stream
matching.


Best regards

John Smith

unread,
Mar 7, 2002, 3:20:28 AM3/7/02
to

Sorry to inform you, but it is actually impossible to search any input
stream without loading it into memory.

chris.
[cj...@hotmail.com] - put 'real mail' in the subject or it gets killed
instantly.

"Dale King" <Ki...@TCE.com> wrote in message news:3c83...@news.tce.com...

Chris Smith

unread,
Mar 6, 2002, 9:42:25 AM3/6/02
to
skeptic wrote ...

> > Most text parsing can be done without regular expressions, and it's often
> > much simpler than using a regular expression.
> > For this case, I'd rather
> > be reading code that uses StringTokenizer in conjunction with
> > String.trim(), rather than a regular expression package.
>
> Are you joking?
> Would you really prefer those mountains of spaghetti code?
> Did you really do it?

String str = "a, b, c, d, e";

StringTokenizer st = new StringTokenizer(str, ",");
while (st.hasMoreTokens())
{
    System.out.println(st.nextToken().trim());
}

Mountains of code? Looks like five lines to me. Moreover, it's five
lines that are clearly understandable and say what they are doing.

> 2. it seems to me (IMHO^2) that perl5.6 regexes represent a Type1
> (&#1089;ontext sensitive) grammar, see for example '(\w+)(\w+)\2\1' or
> '(?(N)aaa|bbb)', which (IMHO^3) cannot be written down as a DFA (or
> even as pure NFA).

Ah... if you're parsing patterns that aren't representable in an NFA
either, then this makes more sense.

Chris Smith

skeptic

unread,
Mar 6, 2002, 10:41:58 AM3/6/02
to
> If you aren't using 1.4, is it worth it to add a regex package for a very
> simple job? No. If you have one available, it is not worth doing it
> yourself.

Moreover, even using 1.4, it's worth doing (IMHO until the 1.5 release).
This is a subtle yet significant issue with versions of the package.

Let's consider some cool scenario.

At first, java.util.regex is still buggy (as of 1.4.1) and will be at
least until 1.5.
That's all OK, given the complexity of the thing. Let me here call "a
bug" any wrong behaviour that may change in a following version of the
package.

Then, someone implements a "replace in all files" function for an IDE
that he works on (or a plugin to it), using the java.util.regex. He
tests it and finds that it works ok.

Then, some unlucky customer purchases the IDE, imports his project in
and runs "replace in all files", which turns his project into a little
heap of garbage, just because in his (newer) version of jre the
regexes work *slightly* differently. No repository was used.
Is that possible? Is that worth avoiding?

And what is the way to go?
The only one I know is *to ship with the SAME code that you have tested
with*, which is only possible with a small free third-party library.

Note that this doesn't apply to non-I/O features.
A bug in regexes (as well as in other I/O stuff) always costs much more
than a bug in anything not related to I/O.

> I also submitted a suggestion about changing something like this to work
> better with non-blocking I/O introduced in 1.4.

As far as I understand, java.util.regex can work with NIO through the
CharSequence interface.

Best regards

Dale King

unread,
Mar 6, 2002, 11:12:25 AM3/6/02
to
"John Smith" <cj...@hotmail.com> wrote in message
news:%Coh8.432$uR5....@newsfeeds.bigpond.com...

>
> Sorry to inform you, but it is actually impossible to search any input
> stream without loading it into memory.

In the worst case, with all the features of Perl regular expressions, yes.
But for many simple regular expressions it is quite possible. A basic
regular expression can be matched using a DFA, which requires looking only
at the current symbol. For example, ORO has an Awk matcher that can work
with streams of data.


--
Dale King


Dale King

unread,
Mar 6, 2002, 11:21:20 AM3/6/02
to
"skeptic" <se...@zolshar.ru> wrote in message
news:97f0f876.0203...@posting.google.com...
>
> > I also submitted a suggestion about changing something like this to work
> > better with non-blocking I/O introduced in 1.4.
>
> As far as i understand java.util.regex can work with nio through
> CharSequence interface.


But it doesn't support reading from a continuing stream of data. You can
search a CharBuffer to see if the pattern is in that CharBuffer, but that
CharBuffer may only be part of the entire stream of data read from a socket.

The 1.4 regex package only works when all the data is in memory. The reason
is that it supports a very complex set of features with the power of Perl
regular expressions. These can require backtracking and basically only work
if you can access all of the data.

Simple regular expressions do not however need backtracking. ORO has this in
their support for AWK regular expressions which are not as powerful as Perl
regular expressions but work on streams of data.

The AWK stuff for streams in ORO however only works with blocking IO and
using a thread pre matcher. To efficiently scan a number of streams with the
most scalability you would like to use the non-blocking IO in 1.4 and not
assign a thread per matcher. In this case you woul rather push data to the
matcher rather than have it pull from a blocking I/O stream. It would then
use an event listener mechanism to notify of matches.
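A rough sketch of what such a push-style API might look like (all names here are hypothetical, not any real library's; the toy matcher buffers everything it has seen for brevity, whereas a real implementation would advance a DFA incrementally so memory stays bounded):

```java
import java.util.ArrayList;
import java.util.List;

public class PushMatcherDemo {
    // Hypothetical listener notified whenever a match is found.
    interface MatchListener {
        void matchFound(int startOffset);
    }

    // Toy push-style matcher for a literal string.
    static class LiteralStreamMatcher {
        private final String needle;
        private final StringBuilder seen = new StringBuilder();
        private final List<MatchListener> listeners = new ArrayList<MatchListener>();
        private int searchFrom = 0; // offset from which to look for matches

        LiteralStreamMatcher(String needle) { this.needle = needle; }

        void addMatchListener(MatchListener l) { listeners.add(l); }

        // The caller pushes chunks as they arrive from non-blocking I/O.
        void feed(CharSequence chunk) {
            seen.append(chunk);
            int i;
            while ((i = seen.indexOf(needle, searchFrom)) >= 0) {
                for (MatchListener l : listeners) l.matchFound(i);
                searchFrom = i + 1; // allow overlapping matches
            }
        }
    }

    public static void main(String[] args) {
        final List<Integer> hits = new ArrayList<Integer>();
        LiteralStreamMatcher m = new LiteralStreamMatcher("dilbert");
        m.addMatchListener(new MatchListener() {
            public void matchFound(int startOffset) { hits.add(startOffset); }
        });
        m.feed("see dil");    // needle split across a chunk boundary
        m.feed("bert strip"); // match reported once completed
        System.out.println(hits); // match found at offset 4
    }
}
```

Note how the match is still found when the pattern straddles two pushed chunks, which is the case a pull-based blocking matcher handles by blocking on the stream.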
--
Dale King


Dale King

unread,
Mar 6, 2002, 11:51:27 AM3/6/02
to
"Chris Smith" <cds...@twu.net> wrote in message
news:MPG.16eec7ef...@news.altopia.com...

>
> > > I am actually disappointed with one aspect of the regex package in
JDK1.4.
> > > It doesn't let you do searches on an InputStream.
> >
> > It's an inherent feature of NFA-based (also known as perl-type) regex
> > matchers (as opposing to POSIX-type) that they are hardly rely upon a
> > buffered input.
> > That's so because they sometimes need to backtrack.
>
> Are you saying that regular expressions in Java are evaluated directly
> from a NFA representation? I haven't read the code, but it doesn't make
> any sense to me. Any NFA can be deterministically converted to a DFA

Perl regular expressions are much more complex than simple regular
expressions and they go beyond simple recognizers with a true/false result
which is what a DFA does. They also support capture groups and many other
complex things that go beyond simply recognizing whether the pattern is
present.

While any NFA can indeed be turned into a DFA, the algorithm to do so and
the resulting size of the DFA are exponential in the worst case. An NFA with
n nodes may require O(2^n) DFA nodes and O(2^n) time to construct it. It is
often better to execute the NFA directly.

For an example, consider the regular language over an alphabet with m
symbols where the language consists of all strings that do not contain all m
symbols. An NFA to recognize this language requires only m + 1 states. A DFA
for this requires 2^m states.
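That blow-up is easy to check mechanically: the minimal DFA state is exactly the subset of symbols seen so far, and a breadth-first search over those subsets confirms that all 2^m of them are reachable (an illustrative program, not from the post):

```java
import java.util.ArrayDeque;

public class SubsetBlowup {
    public static void main(String[] args) {
        int m = 4;
        // Each DFA state is a bitmask: the subset of symbols seen so far.
        boolean[] reached = new boolean[1 << m];
        ArrayDeque<Integer> queue = new ArrayDeque<Integer>();
        reached[0] = true;
        queue.add(0);
        int count = 0;
        while (!queue.isEmpty()) {
            int s = queue.poll();
            count++;
            for (int sym = 0; sym < m; sym++) {
                int t = s | (1 << sym); // reading symbol sym marks it seen
                if (!reached[t]) { reached[t] = true; queue.add(t); }
            }
        }
        System.out.println(count + " reachable DFA states for m=" + m);
    }
}
```

For m = 4 this reports 16 reachable states, i.e. 2^4, matching the bound above.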

> that will run faster with simpler logic. The DFA representation will be
> larger, but given the typical huge disparities between the sizes of
> parsable text versus the matching pattern, this seems like the only way
> to go.

For the cases where the conversion is slow it is faster to simply run the
NFA. For the simpler cases though it is only slightly slower to run the NFA
instead of converting it to a DFA.

--
Dale King


Chris Smith

unread,
Mar 6, 2002, 11:43:43 AM3/6/02
to
Chris Smith wrote ...

> > 2. it seems to me (IMHO^2) that perl5.6 regexes represent a Type1
> > (&#1089;ontext sensitive) grammar, see for example '(\w+)(\w+)\2\1' or
> > '(?(N)aaa|bbb)', which (IMHO^3) cannot be written down as a DFA (or
> > even as pure NFA).
>
> Ah... if you're parsing patterns that aren't representable in an NFA
> either, then this makes more sense.

Incidentally (I'm not familiar with Perl here), do Perl people really
call these things regular expressions? Or is that just sloppy
terminology that's limited to this thread? It seems silly to call
something a regular expression when it's not regular...

Chris Smith

Jon Skeet

unread,
Mar 6, 2002, 11:56:13 AM3/6/02
to
skeptic <se...@zolshar.ru> wrote:
> At first, java.util.regex is still buggy(as of 1.4.1)

As of 1.4.1? Which 1.4.1 would that be?

As far as I can tell, only 1.4.0 is available. There is likely to be a
1.4.1 some time within the next 6 months, at a guess.

--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too

John W. Kennedy

unread,
Mar 6, 2002, 1:43:02 PM3/6/02
to

A) Yes, it's standard terminology.

B) Regular expressions, by that name, have existed at least since the
70's, and are used in many Unix programs, including file-search
programs, editors, compiler-construction aids, and more.

C) They're called regular expressions because they constitute a
precisely-defined language for expressing the complex concepts that they
deal with.

--
John W. Kennedy
Read the remains of Shakespeare's lost play, now annotated!
http://pws.prserv.net/jwkennedy/Double%20Falshood.html

skeptic

unread,
Mar 6, 2002, 2:21:19 PM3/6/02
to
> String str = "a, b, c, d, e";
> StringTokenizer st = new StringTokenizer(str, ",");
> while (st.hasMoreTokens())
> {
> System.out.println(st.nextToken().trim());
> }
> Mountains of code? Looks like five lines to me. Moreover, it's five
> lines that are clearly understandable and say what they are doing.


I didn't mean something *that* simple.
Look into Properties.load(...). It seems barely readable to me.
And the grammar is still simple, merely name=value pairs...

BestRegards

Carl Howells

unread,
Mar 6, 2002, 3:53:01 PM3/6/02
to
"John W. Kennedy" <jwk...@attglobal.net> wrote...

> Chris Smith wrote:
> > Incidentally (I'm not familiar with Perl here), do Perl people really
> > call these things regular expressions? Or is that just sloppy
> > terminology that's limited to this thread? It seems silly to call
> > something a regular expression when it's not regular...
>
> C) They're called regular expressions because they constitute a
> precisely-defined language for expressing the complex concepts that they
> deal with.

Actually, the term "regular" originates in linguistics. It comes from the
formal study of grammar, as a description of a very simple type of grammar.
Using BNF notation, a regular grammar is one that can be expressed entirely
as rules of two forms:

A ::= lambda (the empty string, which I can't actually type)
A ::= bC

Where A and C are non-terminals, and b is a terminal.

It just happens that regular grammars can be recognized by DFAs, which is a
very efficient way of doing such pattern-matching. Regular expressions are
just a simpler way of representing the grammar.
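As a tiny illustration of that correspondence (a sketch with made-up names): the regular grammar S ::= "a" B | lambda, B ::= "b" S generates (ab)*, and is recognized by a two-state DFA where the states are the two non-terminals:

```java
public class RegularGrammarDfa {
    // DFA for the regular grammar S ::= "a" B | lambda, B ::= "b" S.
    // State 0 is S (accepting), state 1 is B.
    static boolean matches(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (state == 0 && c == 'a') state = 1;
            else if (state == 1 && c == 'b') state = 0;
            else return false; // no transition: reject
        }
        return state == 0; // accept only in S, via S ::= lambda
    }

    public static void main(String[] args) {
        System.out.println(matches("abab")); // accepted
        System.out.println(matches("aba"));  // rejected
    }
}
```

Each grammar rule A ::= bC becomes one DFA transition, which is why recognition needs no backtracking.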

At least until the \1 style of thing showed up... Backreferences
technically make the language non-regular (in fact, a copy language like
the one matched by (\w+)\1 isn't even context-free), though in a very
limited manner.

So to Chris: Yes, perl people call those "regular expressions". It is a
sloppy and slightly misleading term, but it is the accepted one these days.


Chris Smith

unread,
Mar 6, 2002, 4:18:23 PM3/6/02
to
skeptic wrote ...

> > String str = "a, b, c, d, e";
> > StringTokenizer st = new StringTokenizer(str, ",");
> > while (st.hasMoreTokens())
> > {
> > System.out.println(st.nextToken().trim());
> > }
> > Mountains of code? Looks like five lines to me. Moreover, it's five
> > lines that are clearly understandable and say what they are doing.
>
>
> I didn't mean something *that* simple.

Ah, well that is exactly the case I was referring to, and the one that
initiated this discussion, and for which someone said that regular
expressions are the way to do it. I responded that, for this task, I
find StringTokenizer code to be far superior.

Obviously, I think regular expression parsers are a superior technique
for many kinds of more complex parsing applications... especially those
involving escapable characters as property files do.

Chris Smith

Chris Smith

unread,
Mar 7, 2002, 12:02:14 AM3/7/02
to
luther wrote ...

> > Ah, well that is exactly the case I was referring to, and the one that
> > initiated this discussion, and for which someone said that regular
> > expressions are the way to do it. I responded that, for this task, I
> > find StringTokenizer code to be far superior.
>
> Not taking sides but I thought that you mentioned earlier in the thread
> that you do not have much experience with regular expressions. If that is
> the case on what do you base the superiority of StringTokenizer?

This thread contains code to do this task both ways. I certainly find
the code based on StringTokenizer to be far easier to read. And you?

Chris Smith

Jon Skeet

unread,
Mar 7, 2002, 3:01:44 AM3/7/02
to
luther <iam_mo...@hotmail.com> wrote:
> > Ah, well that is exactly the case I was referring to, and the one that
> > initiated this discussion, and for which someone said that regular
> > expressions are the way to do it. I responded that, for this task, I
> > find StringTokenizer code to be far superior.
>
> Not taking sides but I thought that you mentioned earlier in the thread
> that you do not have much experience with regular expressions. If that is
> the case on what do you base the superiority of StringTokenizer?

Well, the fact that you don't *need* much experience with regular
expressions in order to read the StringTokenizer code would suggest that
its readability is superior. You only need a couple of sentences of
explanation of what StringTokenizer does in order to understand that
code - the same is certainly *not* true of regular expressions.
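(For readers following the thread: a minimal, self-contained illustration of the behaviour being argued over - my sketch, not code posted in the thread. It contrasts StringTokenizer with JDK 1.4's String.split on a string containing an embedded empty field.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitComparison {

    // Collect every token a StringTokenizer produces into a List.
    static List tokenize(String s, String delims) {
        List out = new ArrayList();
        StringTokenizer st = new StringTokenizer(s, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String data = "A|B||D|E|F";
        // StringTokenizer treats the "||" as one delimiter run: 5 tokens.
        System.out.println(tokenize(data, "|"));               // [A, B, D, E, F]
        // String.split (JDK 1.4+) keeps the embedded empty field: 6 tokens.
        System.out.println(Arrays.asList(data.split("\\|")));  // [A, B, , D, E, F]
    }
}
```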

Jon Skeet

Mar 7, 2002, 3:02:46 AM
luther <iam_mo...@hotmail.com> wrote:
> > This thread contains code to do this task both ways. I certainly find
> > the code based on StringTokenizer to be far easier to read. And you?
>
> My experience also includes Perl and Regex, I tend to like the RegEx's
> better. To me they are like reading poetry. :-)

And what about the other people who might have to read your code?

Much as I love the sound of Shakespeare, if someone's giving me
instructions I'd prefer them in plain English.

Jon Skeet

Mar 7, 2002, 4:32:04 AM
luther <iam_mo...@hotmail.com> wrote:
> Jon Skeet <sk...@pobox.com> wrote in
> news:MPG.16f12ec77...@dnews.peramon.com:

>
> > Well, the fact that you don't *need* much experience with regular
> > expressions in order to read the StringTokenizer code would suggest that
> > its readability is superior. You only need a couple of sentences of
> > explanation of what StringTokenizer does in order to understand that
> > code - the same is certainly *not* true of regular expressions.
>
> I would disagree, that is why java has both.

No, I would suggest that Java has both so that when you *need* something
more than a simple StringTokenizer, the features are there.

> It is clear that regex is not
> your cup o' tea, why are you trying so hard to admonish those of us who
> prefer the power/flexibility/brevity of regex?

Because using very powerful features when a simple one will do just as
well and would be easier for *most* people to read isn't a good idea. I
encourage simplicity wherever possible.

Jon Skeet

Mar 7, 2002, 4:34:14 AM
luther <iam_mo...@hotmail.com> wrote:
> Jon Skeet <sk...@pobox.com> wrote in
> news:MPG.16f12f046...@dnews.peramon.com:

>
> > And what about the other people who might have to read your code?
>
> I would hope that they are capable of figuring out a simple regex.


> And, who is to say that the person who reads my code is not a regex expert?

Oh, they may well

> Oh well, it does not really matter. One look at the subject line says
> it all; any way you solve the problem is bound to confuse someone....

Yes, there are certainly problems in StringTokenizer, which is why I've
written my own very nice and simple version which gives back empty tokens.
That's a problem with StringTokenizer, but *not* an argument for adding
complexity for no good reason.
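(Jon's replacement class isn't shown in the thread; the following is a hypothetical sketch of the idea he describes - a StringTokenizer-style API, same method names, that returns empty tokens between consecutive delimiters instead of skipping them.)

```java
// Hypothetical sketch only - not Jon's actual class. Same method names as
// StringTokenizer, but consecutive delimiters yield an empty token instead
// of being skipped.
public class EmptyAwareTokenizer {
    private final String text;
    private final char delimiter;
    private int position = 0;
    private boolean done = false;

    public EmptyAwareTokenizer(String text, char delimiter) {
        this.text = text;
        this.delimiter = delimiter;
    }

    public boolean hasMoreTokens() {
        return !done;
    }

    public String nextToken() {
        int next = text.indexOf(delimiter, position);
        if (next == -1) {
            done = true;                      // last token: rest of the string
            return text.substring(position);
        }
        String token = text.substring(position, next);
        position = next + 1;
        return token;
    }

    public static void main(String[] args) {
        EmptyAwareTokenizer t = new EmptyAwareTokenizer("A|B||D|E|F", '|');
        while (t.hasMoreTokens()) {
            System.out.println("[" + t.nextToken() + "]");  // [] for the empty field
        }
    }
}
```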

Jon Skeet

Mar 7, 2002, 4:37:30 AM
[Sorry, didn't finish this off before accidentally sending it.]

Jon Skeet <sk...@pobox.com> wrote:
> luther <iam_mo...@hotmail.com> wrote:
> > Jon Skeet <sk...@pobox.com> wrote in
> > news:MPG.16f12f046...@dnews.peramon.com:
> >
> > > And what about the other people who might have to read your code?
> >
> > I would hope that they are capable of figuring out a simple regex.

I suspect it would take longer to figure that out than to figure out a
very simple tokenizer though. That's the point - readability. If you
don't need to make things more complicated, don't.

> > And, who is to say that the person who reads my code is not a regex expert?

Oh, they may well be. Can you guarantee it though?

Chris Smith

Mar 7, 2002, 9:38:08 AM
luther wrote ...

> > This thread contains code to do this task both ways. I certainly find
> > the code based on StringTokenizer to be far easier to read. And you?
>
> My experience also includes Perl and Regex, I tend to like the RegEx's
> better. To me they are like reading poetry. :-)

Look, this isn't that difficult. Use regular expressions when you have
a problem that they solve. Throwing in regular expressions just because
you like them, though, is really dumb. The particular problem being
discussed in this branch of this thread is very easily solved without
using regular expressions, in a way that's many times more readable.

Writing code is not like writing poetry. I find poetry very beautiful,
but I'd hate to have to maintain some of it. ;)

Chris Smith

Jon Skeet

Mar 7, 2002, 10:46:11 AM
luther <iam_mo...@hotmail.com> wrote:
> Jon Skeet <sk...@pobox.com> wrote in
> news:MPG.16f143f29...@dnews.peramon.com:

>
> > Because using very powerful features when a simple one will do just as
> > well and would be easier for *most* people to read isn't a good idea. I
> > encourage simplicity wherever possible.
>
> You my friend have made several assumptions to arrive at your conclusions.
> Your experience, or lack of, with regex has led you to your opinions.

Nope, I actually *do* understand regexes reasonably well - I'm far from
an expert, but I understand them to a fair extent. Now who's been making
assumptions?

> Do not be surprised when others arrive at different conclusions. Many people
> have backgrounds that include a fair amount of regex, in their case it is
> possible that regex, in their opinion, meets your above criteria.

But that's not everyone. That's the point. I'm pretty sure that if you
take the set of Java programmers, a much higher proportion of them will
have seen StringTokenizers before and understand them than have seen
regexes before. If you took the subset of Java programmers who are also
Perl programmers, that will clearly change.

> IMHO, it is worth the time to learn regex. However unlike you I would not
> impose my opinions on others, nor would I consider their conclusions
> "incorrect" because they do not arrive at the same ones as I.

I'm interested in code being as readable as possible. I don't consider
using regexes where they're not needed (and where there's simpler code
to achieve the same result) as encouraging readability - therefore I
will discourage using regexes in those situations. I don't see that as a
bad thing.

Note that I also encourage the use of Sun's naming conventions pretty
often too - are you going to castigate me for that as well?

Jon Skeet

Mar 7, 2002, 11:38:16 AM
luther <iam_mo...@hotmail.com> wrote:
> Jon Skeet <sk...@pobox.com> wrote in news:MPG.16f1446d68f88cdd98ace0
> @dnews.peramon.com:

>
> > Yes, there are certainly problems in StringTokenizer, which is why I've
> > written my own very nice and simple version which gives back empty tokens.
> > That's a problem with StringTokenizer, but *not* an argument for adding
> > complexity for no good reason.
>
> Writing your own custom class would then force the person supporting the
> class to learn your API

They can read the differences between my class and the standard
StringTokenizer in a sentence or two. The method calls are all the same,
it just handles empty tokens differently.

> if they already know regex than I would say that
> your solution added complexity where none was needed.

*IF*. What if they *don't* already know regex? It takes longer to learn
about regular expressions than to learn about StringTokenizer if all you
need is StringTokenizer.

> IMHO, using regex
> would be a better solution rather than forcing someone to learn a
> proprietary API.

Nothing proprietary about it - it's available to anyone, and is actually
the same API in terms of method calls as the normal StringTokenizer.

> Once regex is learned it is something that can be reused
> throughout ones career where as your API would only be useful when working
> on your project.

I don't doubt it - that doesn't make it appropriate for all situations.

Jon Skeet

Mar 7, 2002, 11:41:57 AM
luther <iam_mo...@hotmail.com> wrote:
> No, but then again you cannot guarantee the reverse either. Complicated
> is a relative term isn't it? I do take offense to the repeated offerings
> of KISS. I have done this for a long time and understand that premise very
> well, the only difference between you and I is what we consider to be the
> less complex solution.

Indeed - it all really boils down to what assumptions you make about the
people reading the code. My assumptions are that they know common and
core Java classes which have been in Java since version 1.0, or can
learn about simple things without taking much time. Your assumptions are
that they know about things which weren't part of the core libraries
until 1.4, and which there are various slightly incompatible versions of
floating around.

I still think my assumption is likely to be right in more cases than
yours - *especially* when talking on a forum like this one.

> Bottom line is this, if you were working on my project I would not demand
> that you use one or the other. I would suggest regex for the reasons
> outlined in previous posts, however I would not dismiss StringTokenizer if
> that solution were more suitable to the experience level of the developer.

What is worked on in a private project is very different from advice
given on newsgroups. There are all kinds of things I'd leave be in a
private situation if the individual in question really wanted to do it
that way, but that doesn't mean I won't say it's bad advice when it's
given in a public forum.

Chris Smith

Mar 7, 2002, 12:30:37 PM
Jon Skeet wrote ...

> Indeed - it all really boils down to what assumptions you make about the
> people reading the code. My assumptions are that they know common and
> core Java classes which have been in Java since version 1.0, or can
> learn about simple things without taking much time. Your assumptions are
> that they know about things which weren't part of the core libraries
> until 1.4, and which there are various slightly incompatible versions of
> floating around.

And I think that "complexity" can be defined and used in a sense that's
largely independent of "familiarity". No buts about it, I can write code
using regular expressions (in fact, I did just that a couple days ago
when I needed to parse a text file format that involved escaped and
quoted characters). However, I *recognize* that regular expressions are
a low-level and inherently complex way of dealing with a problem. As
such, I encapsulated the parsing into a set of static utility methods
that I can call from the rest of my code. The higher level code is far
easier to read as a result.
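(Chris's actual parser isn't shown; the following is a hypothetical sketch of the encapsulation style he describes, with the regular expression hidden behind a static utility method. The field format here - bare or double-quoted, comma-separated - is an assumption for illustration, not the format from his project.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the encapsulation being described: the
// regular expression is hidden behind a static utility method, so
// higher-level code never touches java.util.regex directly. The field
// format (bare or double-quoted, comma-separated) is an assumption.
public final class FieldParser {

    // One field: either "quoted text" (group 1) or a run of non-commas (group 2).
    private static final Pattern FIELD = Pattern.compile("\"([^\"]*)\"|([^,]+)");

    public static List parseFields(String line) {
        List fields = new ArrayList();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseFields("a,\"b,c\",d"));  // [a, b,c, d]
    }
}
```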

In the same way, I would have no problem seeing StringTokenizer
implemented using regular expressions. However, StringTokenizer provides
a specific higher level of abstraction above that given by regular
expressions, and removing StringTokenizer and telling everyone to use
regular expressions directly would be dumb. I firmly hold that I'd think
the same thing even if I lived and breathed writing regular expression
code. I still wouldn't want it directly used by application logic.

I *do*, on the other hand, wonder about some of the code that I've seen
that uses StringTokenizer with the "return delimiters" option, maintains
lots of internal state to tell the difference between lexical parts of
what's being parsed, etc. This could probably be written far easier by
using regular expressions, and then encapsulating it into its own
higher-level abstraction that represents the semantic meaning of the kind of
text that's being parsed.

You see, no one is saying "regular expressions are hard" or "I don't
understand regular expressions, so don't use them". What I am saying is
that regular expressions are a way of dealing with problems at a very low
level of abstraction. Good OO design suggests that you should
encapsulate that into a higher level of abstraction that's more suited to
the algorithm being written... and if the algorithm being written wants a
sequence of Strings, and you want to get them by splitting a longer
String based on whitespace, or one of a few other commonly recurring
abstractions that StringTokenizer was specifically designed to deal with,
it's kinda dumb not to use it.

It's equally dumb to *use* StringTokenizer in cases where it's not
appropriate. In those cases, it's generally worth defining your own
abstraction that does fit, rather than sprinkling your parsing code
around higher-level algorithms.

> > Bottom line is this, if you were working on my project I would not demand
> > that you use one or the other. I would suggest regex for the reasons
> > outlined in previous posts, however I would not dismiss StringTokenizer if
> > that solution were more suitable to the experience level of the developer.

How about a mention of the task at hand? I would not react favorably to
an employee of mine writing their own regular expression code to split a
String based on whitespace, *regardless* of the skills of my developers.

Chris Smith

Chris Smith

Mar 7, 2002, 5:12:16 PM
luther wrote ...

> > Look, this isn't that difficult. Use regular expressions when you have
> > a problem that they solve. Throwing in regular expressions just because
> > you like them, though, is really dumb. The particular problem being
> > discussed in this branch of this thread is very easily solved without
> > using regular expressions, in a way that's many times more readable.
> >
> > Writing code is not like writing poetry. I find poetry very beautiful,
> > but I'd hate to have to maintain some of it. ;)
>
> Uhh, this was said tongue in cheek. I know how to do my job, thank you
> very much. And BTW, don't tell me I am dumb. You do not know me, I have
> given you the benefit of the doubt and would expect the same from you. This
> is getting out of hand, at this point you are disagreeing just to disagree
> me thinks.

I'm afraid I had a very different tone in mind when I wrote that than
you've picked up on. I had no intention of "telling you that you're
dumb". Instead, I was simply bringing to attention that you use regular
expressions because of some perceived advantage, which varies based on
the situation in which you're applying them. I know you realize this,
but it's been lost in the discussion.

Chris Smith

Marshall Spight

Mar 8, 2002, 2:18:20 AM
"Chris Smith" <cds...@twu.net> wrote in message news:MPG.16f151a45...@news.altopia.com...

> However, I *recognize* that regular expressions are
> a low-level and inherently complex way of dealing with a problem. As
> such, I encapsulated the parsing into a set of static utility methods
> that I can call from the rest of my code. The higher level code is far
> easier to read as a result.

Uh, I don't see how that follows. In fact, I don't think I agree with the
idea that StringTokenizer "provides a higher level of abstraction."
StringTokenizer is, well, severely underpowered. True, it's just the
thing for splitting a String into tokens based on whitespace, but that's
about it. I wouldn't say that means it's using a higher level of abstraction,
though; just that it's more suited to the specific, narrow task at hand.

I am quite happy about jdk 1.4's java.util.regex. There have been a lot of
problems I've come across in the past that required some decent string
processing in Java, and the several occasions I've tried wedging StringTokenizer
into the problem space have given me a headache, too. I've been planning on
dumping StringTokenizer forever.

I think I'd prefer to use a powerful, general facility in a wide array of
places than try to occasionally insert StringTokenizer into the few
narrow places it's useful.


Marshall

Jon Skeet

Mar 8, 2002, 2:50:47 AM
Marshall Spight <msp...@dnai.com> wrote:
> "Chris Smith" <cds...@twu.net>

> > However, I *recognize* that regular expressions are
> > a low-level and inherently complex way of dealing with a problem. As
> > such, I encapsulated the parsing into a set of static utility methods
> > that I can call from the rest of my code. The higher level code is far
> > easier to read as a result.
>
> Uh, I don't see how that follows. In fact, I don't think I agree with the
> idea that StringTokenizer "provides a higher level of abstraction."
> StringTokenizer is, well, severely underpowered. True, it's just the
> thing for splitting a String into tokens based on whitespace, but that's
> about it. I wouldn't say that means it's using a higher level of abstraction,
> though; just that it's more suited to the specific, narrow task at hand.

I'm with Chris on the abstraction thing (unsurprisingly) - using
StringTokenizer (or preferably a version of it that *does* give empty
tokens) means you don't need to get into any of the "details"
(relatively simple though they are) about escaping some characters etc.



> I am quite happy about jdk 1.4's java.util.regex. There have been a lot of
> problems I've come across in the past that required some decent string
> processing in Java, and the several occasions I've tried wedging StringTokenizer
> into the problem space have given me a headache, too. I've been planning on
> dumping StringTokenizer forever.

Certainly (as Chris said) using StringTokenizer for complicated stuff is
a bit of a nightmare - and there *is* a good place to use regexes.

> I think I'd prefer to use a powerful, general facility in a wide array of
> places than try to occasionally insert StringTokenizer into the few
> narrow places it's useful.

Of course it depends on the problem space, but I've found that although
StringTokenizer only really solves one problem, that's actually a
problem I run across more often than most other problems. I'd say that a
good half of the places I'd make a decision between using a regex or
using StringTokenizer, it's precisely to split a string up with one
particular delimiter (usually ',' rather than whitespace, admittedly).

Chris Smith

Mar 8, 2002, 9:22:02 AM
Jon Skeet wrote ...

> Of course it depends on the problem space, but I've found that although
> StringTokenizer only really solves one problem, that's actually a
> problem I run across more often than most other problems. I'd say that a
> good half of the places I'd make a decision between using a regex or
> using StringTokenizer, it's precisely to split a string up with one
> > particular delimiter (usually ',' rather than whitespace, admittedly).

Right, and after noting that StringTokenizer (in and of itself) only
works in such situations when there is no possibility of an empty field,
it fits the requirements fine. Of course, you've got your utility class
to handle the latter situation.

Incidentally, anyone raised an RFE about returning empty tokens? I'll
check.

Chris Smith

Dale King

Mar 8, 2002, 11:23:03 AM
"Chris Smith" <cds...@twu.net> wrote in message
news:MPG.16f276f8a...@news.altopia.com...

> Jon Skeet wrote ...
> > Of course it depends on the problem space, but I've found that although
> > StringTokenizer only really solves one problem, that's actually a
> > problem I run across more often than most other problems. I'd say that a
> > good half of the places I'd make a decision between using a regex or
> > using StringTokenizer, it's precisely to split a string up with one
> > particular delimiter (usually ',' rather than whitespace, admittedly).
>
> Right, and after noting that StringTokenizer (in and of itself) only
> works in such situations when there is no possibility of an empty field,
> it fits the requirements fine. Of course, you've got your utility class
> to handle the latter situation.

And based on the way they changed it, it is also only suitable if you use the
same delimiter throughout.

> Incidentally, anyone raised an RFE about returning empty tokens? I'll
> check.

It's been raised repeatedly, and always marked as closed, will not fix.
--
Dale King


Jon Skeet

Mar 8, 2002, 12:57:49 PM
luther <iam_mo...@hotmail.com> wrote:
> I completely agree with you. I am starting to get the feeling that
> some in this group have never had to parse much beyond a token delimited
> string.

I presume you're talking about me and Chris. I find it amusing that you
say this after accusing *me* of making assumptions about *your*
competence. We've both said we know how to use regexes - we just choose
not to when there's a simpler solution. There's no need to get a
wrecking ball in to bash a tiny nail into a bit of wood, when there's a
small hammer sitting in the tool box.

Could you show *anything* either of us have posted which indicates that
we've never had to parse much beyond token delimited strings?

I'll draw an analogy: this is like someone asking how to square a
number. Using a regex is like using Math.pow, whereas using
StringTokenizer is like using x*x. Sure, x*x will *only* work when you
want to square the number rather than cubing it or whatever, but is
simpler than using Math.pow. I don't see any reason why saying that
would lead anyone to conclude that the person involved doesn't *know*
about Math.pow() or is quite capable of using it when appropriate.

> Unfortunately they are a couple of the most vocal and would lead
> most to believe that the reverse is true, that being that regex is only
> useful in narrow places when that is far from true.

No, regex is useful in many, many conditions - but those conditions come
up more rarely than splitting token delimited strings, IME.

Chris Smith

Mar 8, 2002, 12:59:25 PM
luther wrote ...

> I completely agree with you. I am starting to get the feeling that
> some in this group have never had to parse much beyond a token delimited
> string. Unfortunately they are a couple of the most vocal and would lead
> most to believe that the reverse is true, that being that regex is only
> useful in narrow places when that is far from true.

Since that's a thinly-veiled comment aimed at myself (and to a lesser
extent Jon), I will reply: I don't believe that regular expressions are
useless... I do, on the other hand, find that the specific patterns
handled by StringTokenizer (and Jon's extension) are very common,
especially in application fields that are not heavily into text
processing.

You've implied, many times now, that I'm somehow "against" regular
expressions parsing. That's not true. I'm simply in favor of
encapsulation and abstraction.

StringTokenizer implements one possible abstraction (and, yes, as Dale
has pointed out, it helpfully implements somewhat less than its promise).
If StringTokenizer is not suitable, I'd encourage people to use other
abstractions, and *quite* possibly to represent them using regular
expressions (just as, like I've said, it's fine with me if
StringTokenizer is implemented with regular expressions... the only
reason it's not, I'd guess, is that it has existed for longer than the
regexp stuff).

However, *exposing* regular expressions to parts of the application that
are concerned with correctly interpreting the semantic meaning of a
document describing tax codes is a serious breach of design principle.

Chris Smith

Jon Skeet

Mar 8, 2002, 1:10:08 PM
Dale King <Ki...@TCE.com> wrote:
> > Right, and after noting that StringTokenizer (in and of itself) only
> > works in such situations when there is no possibility of an empty field,
> > it fits the requirements fine. Of course, you've got your utility class
> > to handle the latter situation.
>
> And based on the way they changed it, it is also only suitable if you use the
> same delimiter throughout.

Yup. Fortunately I've never had to deal with things changing delimiter
half way through - something like that would make me question the design
anyway, to be honest.



> > Incidentally, anyone raised an RFE about returning empty tokens? I'll
> > check.
>
> It's been raised repeatedly, and always marked as closed, will not fix.

I think it's too late to do that now anyway, to be honest. What they
*could* do would be to introduce a pair of "split" methods in String,
one of which took an array of characters and one of which took a single
character, and return an array of strings. I prefer the array of
characters approach to the String approach used in StringTokenizer as
it's introduced confusion in StringTokenizer (with people assuming that
it would tokenize on the *whole* string).

The split methods in 1.4 are okay if you understand regexps, of course,
but mean you have to escape various characters first (which may involve
further regexp work if the delimiter is passed into your routine as a
parameter).

If those extra methods were introduced, it would make it very simple to
write simple code using the single character version first, and then
update it to use regexes later if they became necessary - just change
the type of the argument.

I find it interesting to note that the examples given in the
String.split(String) documentation would work just as well with a single
character delimiter - it would be worth putting some more "challenging"
stuff in there if my suggestion above were taken up. I don't expect it
to be, admittedly, but I might just raise it on the bug parade as a
suggestion...
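(No such methods were ever added to String in that form - JDK 1.4 got only the regex-based split - but a sketch of what Jon proposes, written here as a free-standing helper rather than a method on String, might look like this.)

```java
import java.util.ArrayList;
import java.util.List;

public final class Split {

    // Split on a single literal character: no regex involved, so no need to
    // escape metacharacters like '|' or '.', and empty tokens are preserved.
    public static String[] split(String s, char delimiter) {
        List parts = new ArrayList();
        int start = 0;
        int next;
        while ((next = s.indexOf(delimiter, start)) != -1) {
            parts.add(s.substring(start, next));
            start = next + 1;
        }
        parts.add(s.substring(start));   // remainder after the last delimiter
        return (String[]) parts.toArray(new String[parts.size()]);
    }

    public static void main(String[] args) {
        String[] parts = split("A|B||D|E|F", '|');
        for (int i = 0; i < parts.length; i++) {
            System.out.println(i + ": [" + parts[i] + "]");
        }
    }
}
```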

Dale King

Mar 8, 2002, 1:44:09 PM
"Jon Skeet" <sk...@pobox.com> wrote in message
news:MPG.16f30dc08...@news.ntlworld.com...

> Dale King <Ki...@TCE.com> wrote:
> > > Right, and after noting that StringTokenizer (in and of itself) only
> > > works in such situations when there is no possibility of an empty field,
> > > it fits the requirements fine. Of course, you've got your utility class
> > > to handle the latter situation.
> >
> > And based on the way they changed it, it is also only suitable if you use
> > the same delimiter throughout.
>
> Yup. Fortunately I've never had to deal with things changing delimiter
> half way through - something like that would make me question the design
> anyway, to be honest.

I think it is actually quite common. Mine wasn't so much that I wanted to
tokenize with multiple delimiters, but the fact that only the first
delimited token mattered. What I wanted was to get the first token and then
get the remainder of the string. That is common with things like key-value
pairs such as in properties files. The first delimiter marks the separation
between the key and the value, but we don't want to break the value into
tokens. That sure doesn't seem like a questionable design to me.
Unfortunately, the only way to do that with StringTokenizer is to switch the
delimiter to no delimiter.

--
Dale King


Dale King

Mar 8, 2002, 1:50:13 PM
"Jon Skeet" <sk...@pobox.com> wrote in message
news:MPG.16f30dc08...@news.ntlworld.com...


What I would like to see is for us as a group to come up with a replacement
to StringTokenizer and to just make it public domain. Many people have made
their own versions and each has different licensing and packaging. I just
want some code I can add to my list of Utils as is. I don't want to have to
worry about GPL licenses and trying to include your jar file.

I tried to create something like this a while back with the comp.lang.java
utility classes. Unfortunately, the person that stepped up to administer
that suddenly got too busy and it fell flat on its face.
--
Dale King


Jon Skeet

Mar 8, 2002, 2:15:37 PM
Dale King <Ki...@TCE.com> wrote:
> What I would like to see is for us as a group to come up with a replacement
> to StringTokenizer and to just make it public domain. Many people have made
> their own versions and each has different licensing and packaging. I just
> wan't some code I can add to my list of Utils as is. I don't want to have to
> worry about GPL licenses and trying to include your jar file.

Sounds like a good plan, yes.

> I tried to create something like this a while back with the comp.lang.java
> utility classes. Unfortunately, the person that stepped up to administer
> that suddenly got too busy and it fell flat on its face.

Ah yes, I remember. I can't remember who it was that was meant to
administer it - I hope it wasn't me :)

Jon Skeet

Mar 8, 2002, 2:16:18 PM
Dale King <Ki...@TCE.com> wrote:

> I think it is actually quite common. Mine wasn't so much that I wanted to
> tokenize with mulitple delimiters, but the fact that only the first
> delimited token mattered. What I wanted was to get the first token and then
> get the remainder of the string. That is common with things like key-value
> pairs such as in properties files. The first delimiter marks the separation
> between the key and the value, but we don't want to break the value into
> tokens. That sure doesn't seem like a questionable design to me.
> Unfortunately, the only way to do that with StringTokenizer is to switch the
> delimiter to no delimiter.

Ah - in that situation I just use String.indexOf and String.substring
usually. And yes, there the regex would make sense and String.split with
a limit does what you want. Of course, there's no reason why the
split(char) and split(char[]) methods I proposed in the other post
shouldn't have optional limits as well, other than that by that stage
you're getting into a reasonable amount of complexity anyway, and you
might as well go for the full regex.
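(A quick sketch of both techniques Jon mentions for the key/value case; the example data is invented.)

```java
public class KeyValueDemo {
    public static void main(String[] args) {
        String line = "color=red=ish";   // invented example data

        // indexOf/substring: split at the first '=' only.
        int eq = line.indexOf('=');
        String key = line.substring(0, eq);
        String value = line.substring(eq + 1);
        System.out.println(key + " -> " + value);        // color -> red=ish

        // JDK 1.4's String.split with a limit of 2 gives the same pair:
        // the value is never broken up on its later '=' characters.
        String[] pair = line.split("=", 2);
        System.out.println(pair[0] + " -> " + pair[1]);  // color -> red=ish
    }
}
```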

Dale King

Mar 8, 2002, 3:49:49 PM
"Jon Skeet" <sk...@pobox.com> wrote in message
news:MPG.16f31e61e...@news.ntlworld.com...


It seems to me that the StringTokenizer model is better in that instead of
returning an array of Strings it is an iterator. An iterator design pattern
seems like a better design to me as usually you will want to iterate over
those parts, and it is simply more OO. It should also have a method to return
the remaining substring or at least an index so I can do the substring.

I would also like to have a way to have it so that I can strip characters
from the beginning and the end of the tokens like whitespace, but I can
always subclass to do that as long as it is not all private the way
StringTokenizer is.
--
Dale King


Jon Skeet

Mar 8, 2002, 3:56:18 PM
Dale King <Ki...@TCE.com> wrote:
> It seems to me that the StringTokenizer model is better in that instead of
> returning an array of Strings it is an iterator. An iterator design pattern
> seems like a better design to me as usually you will want to iterate over
> those parts and is simply more OO.

That's ironic - I originally had "or an iterator", but then changed it
after checking what 1.4's String.split methods gave back :)

> It should also have a method to return
> the remaining substring or at least an index so I can do the substring.

Yup, that sounds like a good idea.

> I would also like to have a way to have it so that I can strip characters
> from the beginning and the end of the tokens like whitespace, but I can
> always subclass to do that as long as it is not all private the way
> StringTokenizer is.

It's certainly a common thing to want to do, yes.
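(A hypothetical sketch of the trimming behaviour Dale asks for: since StringTokenizer keeps its state private, one option is to wrap it rather than subclass it, stripping whitespace from each token as it is returned.)

```java
import java.util.StringTokenizer;

// Hypothetical sketch of the trimming behaviour being asked for. Since
// StringTokenizer keeps its state private, this wraps it (rather than
// subclassing) and strips whitespace from each token as it is returned.
public class TrimmingTokenizer {
    private final StringTokenizer inner;

    public TrimmingTokenizer(String text, String delimiters) {
        inner = new StringTokenizer(text, delimiters);
    }

    public boolean hasMoreTokens() {
        return inner.hasMoreTokens();
    }

    public String nextToken() {
        return inner.nextToken().trim();
    }

    public static void main(String[] args) {
        TrimmingTokenizer t = new TrimmingTokenizer(" a , b ,c", ",");
        while (t.hasMoreTokens()) {
            System.out.println("[" + t.nextToken() + "]");  // [a] [b] [c]
        }
    }
}
```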

Chris Smith

Mar 8, 2002, 4:18:42 PM
Chris Smith wrote ...

> Since that's a thinly-veiled comment aimed at myself (and to a lesser
> extent Jon), I will reply:

I take that back. After reviewing the thread, it's become apparent that
you're being just as rude to Jon and I just hadn't noticed. Sorry for
misrepresenting your comments.

Chris Smith

Jon Skeet

Mar 8, 2002, 9:18:48 PM
luther <iam_mo...@hotmail.com> wrote:
> I guess the difference between you and I is that I do not equate
> competence with experience.

Again, you're making an assumption. Please show me where I've equated
them. You keep "guessing" differences between us with very little
evidence of why you're making these assumptions.

> Not having done something before does not
> make one incompetent, just inexperienced. I will also say that just
> because you may have used regex does not make you experienced either. To
> make it clear, I have never said nor inferred that you are incompetent.

You've implied it though. Your "some members of the group" bit was
clearly patronising, and whilst it only *talked* about experience, the
implication of incompetence was obvious to me. Chris seems to have
inferred it too.

> I have said that I believe that your experience with regex, or lack of it,
> is what is driving your decision. I have said this all along.

And several times, Chris and I have said that we're both perfectly
capable of using regular expressions ourselves, but don't believe that
every single member of our code's audience is likely to be able to do
so.

> If it is your decision that someone is incompetent because they have limited
> experience with regex then that is up to you.

When exactly did I imply that that was my viewpoint?

> However do not attribute your beliefs as mine.

When exactly did I do that?

> Again I will say this, you are now just getting ridiculous in your
> attempts to be "right".

I see you've snipped the only bit of my post which was actually relevant
to the original topic, preferring to go straight to assumptions. Care to
comment on the analogy to using Math.pow()? I believe it goes to the
heart of the matter.

Marshall Spight

unread,
Mar 8, 2002, 10:38:51 PM3/8/02
to
"luther" <iam_mo...@hotmail.com> wrote in message news:Xns91CBA7C3BD0A...@206.10.149.66...
> Chris Smith <cds...@twu.net> wrote in news:MPG.16f2a9e568c3a9fe989dd9
> @news.altopia.com:

>
> > You've implied, many times now, that I'm somehow "against" regular
> > expressions parsing. That's not true. I'm simply in favor of
> > encapsulation and abstraction.
>
> Sigh, if I had, it was not my intention. I would say that I am also in
> favor of encapsulation and abstraction, as a matter of fact I am a huge fan
> of both. The use of regex does not in anyway prevent either.

Perhaps we could all sit back in our chairs and take a few deep breaths.
Let's all remember that it's very easy to misunderstand each other. Also,
the longer these kinds of conversations go on, the less they become about
the topic itself and the more they become about ... something else.

I just want to express my support for all of {Chris, regular expressions,
Luther, StringTokenizer, and Jon.}


Marshall

Jon Skeet

unread,
Mar 9, 2002, 3:33:32 AM3/9/02
to
luther <iam_mo...@hotmail.com> wrote:

> If you have anything relevant to add that has to do with java and not your
> perceived notion that I am somehow out to get you, then please do.

Okay, here's the question again then, with no personal
suggestions/questions whatsoever:


When wishing to square a double, I would use:

square = x*x;

rather than

square = Math.pow (x, 2);

Which would you use, and why? If you would use the former due to its
simplicity and readability, what is the difference (if any) between this
case and the case of using a simple construct which is limited in scope
but requires less knowledge for parsing a token-delimited line rather
than using a regular expression?
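To make the comparison concrete, here's a minimal illustrative sketch of
the two ways of squaring (nothing here beyond what the question states):

```java
public class SquareDemo {
    // Multiplying a number by itself: no method call, and the
    // intent (squaring) is immediate to any reader.
    static double squareByMultiply(double x) {
        return x * x;
    }

    // Math.pow supports arbitrary real exponents - negative and
    // fractional included - more machinery than squaring needs,
    // and its result is only specified to within 1 ulp.
    static double squareByPow(double x) {
        return Math.pow(x, 2);
    }

    public static void main(String[] args) {
        System.out.println(squareByMultiply(3.0));
        System.out.println(squareByPow(3.0));
    }
}
```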

Chris Smith

unread,
Mar 9, 2002, 9:32:14 AM3/9/02
to
luther wrote ...
> Again, I am sorry. In the future I will try and exercise caution when
> replying to threads in this group. Clearly some in here cannot handle
> disagreement and alternative ideas.

Sorry you feel that way. I will admit that there was a time when I would
have been more patient with a lot of this discussion. I still hold to my
viewpoints, but there have been a couple of occasions where I've gotten
fed up and said things that perhaps shouldn't be said in public.

On the other hand, I don't think you've been treated rudely "from the
start". I think you have been treated rudely in response to your own
rude and degrading comments. I certainly could have reacted better,
though, and I apologize for that.

Chris Smith

Patricia Shanahan

unread,
Mar 9, 2002, 10:22:44 AM3/9/02
to

luther wrote:
>
> "Marshall Spight" <msp...@dnai.com> wrote in news:0BZh8.2093$af7.2045
> @rwcrnsc53:


>
> > I think I'd prefer to use a powerful, general facility in a wide array of
> > places than try to occasionally insert StringTokenizer into the few
> > narrow places it's useful.
> >
>

> I completely agree with you. I am starting to get the feeling that
> some in this group have never had to parse much beyond a token delimited
> string. Unfortunately they are a couple of the most vocal and would lead
> most to believe that the reverse is true, that being that regex is only
> useful in narrow places when that is far from true.

While I hesitate to enter a thread of this extent, I should perhaps
point out that I'm a counter example to any suggestion that people who
use StringTokenizer do so because of limited parsing experience.

I use StringTokenizer, when appropriate, though I don't believe in
trying to stretch it beyond the jobs it does well and simply.

I worked as a compiler writer for several years. I've ported, corrected,
and extended lexical analyzers and parsers for both C and Fortran. I've
used regular expressions for about 19 years.

Patricia

John W. Kennedy

unread,
Mar 9, 2002, 6:13:14 PM3/9/02
to
Jon Skeet wrote:
> But that's not everyone. That's the point. I'm pretty sure that if you
> take the set of Java programmers, a much higher proportion of them will
> have seen StringTokenizers before and understand them than have seen
> regexes before. If you took the subset of Java programmers who are also
> Perl programmers, that will clearly change.

Or who have used TextPad, or Emacs, or vi, or StarOffice/OpenOffice, or
awk, or sed, or Visual C++, or....

--
John W. Kennedy
Read the remains of Shakespeare's lost play, now annotated!
http://pws.prserv.net/jwkennedy/Double%20Falshood.html


Jon Skeet

unread,
Mar 9, 2002, 7:06:40 PM3/9/02
to
Chris Smith <cds...@twu.net> wrote:
> On the other hand, I don't think you've been treated rudely "from the
> start". I think you have been treated rudely in response to your own
> rude and degrading comments. I certainly could have reacted better,
> though, and I apologize for that.

Likewise, and well put Chris. Part of the problem (with me) is that I
simultaneously hate and love being part of argumentative debate. Short
post, I know, but I just wanted to agree wholeheartedly with all of the
above for my part.

Jon Skeet

unread,
Mar 9, 2002, 7:06:39 PM3/9/02
to
luther <iam_mo...@hotmail.com> wrote:
> Jon Skeet <sk...@pobox.com> wrote in
> news:MPG.16f3d932...@news.ntlworld.com:

>
> > square = Math.pow (x, 2);
> >
> > Which would you use, and why? If you would use the former due to its
> > simplicity and readability, what is the difference (if any) between this
> > case and the case of using a simple construct which is limited in scope
> > but requires less knowledge for parsing a token-delimited line rather
> > than using a regular expression?
> >
>
> I would use
>
> square = Math.pow (x, 2);

Then you're certainly consistent :)

If I were maintaining code containing that, then changing it to x*x is
probably the first thing I'd do. I find the idea of multiplying
something by itself a much simpler concept than raising a number to a
power (especially when the invocation of Math.pow obviously allows
raising to negative and fractional powers, though not complex ones). I
have no difficulty with raising numbers to powers (otherwise my maths
degree would truly be a sham) but I still find the idea of multiplying a
number by itself rather simpler.

> I fail to see how this is analogous to regex vs. StringTokenizer?

It's another case of using a more powerful tool when a simpler one which
is perfectly adequate for the actual problem being tackled is available.

> I like
> regex and am comfortable with using it, so I would tend to use that in
> place of StringTokenizer. As I said before, and I say again, I would tend
> to use and recommend regex. If someone on my project wanted to use the
> latter that would be fine by me too.

Here's another hypothetical: suppose someone on your project *didn't*
know regular expressions, and wanted to split a string in a way that a
very simple use of StringTokenizer would allow. Would you then recommend
StringTokenizer, or recommend the learning of regular expressions?

(Just to answer a similar question you may wish to pose back to me, if
someone wanted to do some parsing which would be simple with regular
expressions but involve complicated work with StringTokenizer, I would
certainly recommend they learn regexes in that situation. As I've said
before, I have nothing against regexes themselves at all, and recognise
them as a powerful tool - I just consider using them to split strings
using a single character token to be akin to using a sledgehammer to
crack a nut.)
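For reference, a small sketch of the same single-character split done both
ways (the input string is just an example; String.split is the 1.4 regex
API mentioned earlier in the thread). Note the two also differ on empty
fields, which is what started this whole thread:

```java
import java.util.StringTokenizer;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "A|B||D";

        // StringTokenizer treats runs of delimiters as one: the
        // empty field between B and D is silently skipped.
        StringTokenizer st = new StringTokenizer(line, "|");
        int count = 0;
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
            count++;
        }
        System.out.println(count); // 3 tokens: A, B, D

        // String.split keeps the empty field. The pipe must be
        // escaped, since the argument is a regular expression.
        String[] fields = line.split("\\|");
        System.out.println(fields.length); // 4 fields: A, B, "", D
    }
}
```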

Jon Skeet

unread,
Mar 9, 2002, 7:20:41 PM3/9/02
to
John W. Kennedy <jwk...@attglobal.net> wrote:
> Jon Skeet wrote:
> > But that's not everyone. That's the point. I'm pretty sure that if you
> > take the set of Java programmers, a much higher proportion of them will
> > have seen StringTokenizers before and understand them than have seen
> > regexes before. If you took the subset of Java programmers who are also
> > Perl programmers, that will clearly change.
>
> Or who have used TextPad, or Emacs, or vi, or StarOffice/OpenOffice, or
> awk, or sed, or Visual C++, or....

Are you suggesting that the use of regexes is fundamental to the use of
all of the above? awk and sed I could understand, and maybe vi, but I'm
sure all the rest can be used perfectly easily without knowing the first
thing about regular expressions.

Apologies if I've misunderstood you.

ghl

unread,
Mar 10, 2002, 12:31:10 PM3/10/02
to
"Jon Skeet" <sk...@pobox.com> wrote in message
news:MPG.16f4b731...@news.ntlworld.com...

> John W. Kennedy <jwk...@attglobal.net> wrote:
> > Jon Skeet wrote:
> > > But that's not everyone. That's the point. I'm pretty sure that if you
> > > take the set of Java programmers, a much higher proportion of them
will
> > > have seen StringTokenizers before and understand them than have seen
> > > regexes before. If you took the subset of Java programmers who are
also
> > > Perl programmers, that will clearly change.
> >
> > Or who have used TextPad, or Emacs, or vi, or StarOffice/OpenOffice, or
> > awk, or sed, or Visual C++, or....
>
> Are you suggesting that the use of regexes is fundamental to the use of
> all of the above? awk and sed I could understand, and maybe vi, but I'm
> sure all the rest can be used perfectly easily without knowing the first
> thing about regular expressions.
Certainly not fundamental.
Here is the function pattern from EditPlus2 for Java: ^[ \t]*[ps].*\([^;]*$
What is that?
--
Gary
Note new e-mail address:
glab...@comcast.net

Jon Skeet

unread,
Mar 11, 2002, 3:12:20 AM3/11/02
to
ghl <glab...@comcast.net> wrote:
> Certainly not fundamental.
> Here is the function pattern from EditPlus2 for Java: ^[ \t]*[ps].*\([^;]*$
> What is that?

Dodgy, is what it is. It matches far too many things, eg

props.setProperty ("foo",
"bar");

which isn't exactly an impossible pair of lines of code.

Is it meant to be used for highlighting method declarations?
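For anyone wanting to check, the false positive is easy to reproduce with
the 1.4 java.util.regex package (the test strings below are just the
examples from this exchange, escaped for Java string literals):

```java
import java.util.regex.Pattern;

public class FunctionPatternDemo {
    public static void main(String[] args) {
        // The EditPlus2 "function pattern" quoted above, with
        // backslashes doubled for the Java string literal.
        Pattern p = Pattern.compile("^[ \\t]*[ps].*\\([^;]*$");

        // Matches a method declaration, as presumably intended...
        System.out.println(p.matcher("public void foo(int x)").matches()); // true

        // ...but also the first line of a wrapped method call,
        // since it starts with 'p', contains '(' and has no ';'.
        System.out.println(p.matcher("props.setProperty (\"foo\",").matches()); // true
    }
}
```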

Dave Ryan

unread,
Mar 11, 2002, 3:18:22 PM3/11/02
to

While Chris and Jon are busy patting themselves on the back about not
being rude to luther's opposing opinion, I'd like to say that I
definitely interpreted their responses as unnecessarily hostile. That
being said, I fall into the camp of using StringTokenizer when it can
handle the job. I also fall into the camp of using Math.pow instead
of x*x. Probably due to my strong math background. I also use re
frequently, but would in no way consider myself an expert.

In my opinion it's more of a style issue than anything else, although
previous experience with only 1 side would lead to that side being
chosen. I certainly wouldn't change someone's code merely because
they used the one I don't prefer. If, however, it added obvious and
unnecessary complexity and made maintenance difficult, I would.

my 2c.
-dave

Dave Ryan

unread,
Mar 11, 2002, 5:06:11 PM3/11/02
to

Such is the way of me and Usenet. After reviewing my post and
reading the thread again, I'd like to take special care to point
out that I've mistakenly thrown Jon into the 'hostile' category.

I would say that luther's and Jon's posts followed each other
in tone very similarly. Each taking care to not jab the other
unnecessarily. All the jabs thrown were small and intentional,
but not ill mannered. All these opinions are only my own. My
miscategorization was caused by my inability to differentiate
Jon's posts from other posters over the last week. Sorry Jon.
-dave
