java String split() does not work for delimiter "|" ?

chun...@gmail.com

Oct 12, 2007, 4:39:06 PM
Hi all,

I have such data in a flat text file,
"
106083|1791||7|73755|48|96|3||01/07/2005 13:04:48.979215 PST|||||t|f||
t|f|t|"
"

And such java code to read this line and split it by "|",

"
while ((rd = in.readLine()) != null) {
    String delimiter = new String("|");
    String[] t1 = rd.split(delimiter);
    String[] t2 = rd.split("|");
}
"

Either way, the split does not work! It splits the string per each
char. Does someone know why ?

Here is my jdk information on the linux box.
"
java version "1.6.0"
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
"


Thanks a lot for any tips.


Chun

Joshua Cranmer

Oct 12, 2007, 4:47:31 PM
chun...@gmail.com wrote:
> Hi all,
>
> I have such data in a flat text file,
> "
> 106083|1791||7|73755|48|96|3||01/07/2005 13:04:48.979215 PST|||||t|f||
> t|f|t|"
> "
>
> And such java code to read this line and split it by "|",

`split' takes a regular expression, and '|' happens to be a special
operator in regex. Instead of "|", you want "\\|".

> Either way, the split does not work! It splits the string per each
> char. Does someone know why ?

Your regex specifies either the empty string or the empty string. Since
there is an empty string between each character, the string is split
between each character. It's what you told it to do.

For more information:
<http://java.sun.com/javase/6/docs/api/java/lang/String.html> and
<http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html>
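
For illustration, a minimal sketch of the difference on a shortened version of the sample line. The class and method names are made up for this example, and the -1 limit argument is an addition (not in the original code) that keeps trailing empty fields:

```java
import java.util.Arrays;

class PipeSplitDemo {
    // Unescaped: the regex "|" is "empty OR empty", so it matches between
    // every pair of characters and the line falls apart into single chars.
    static String[] splitUnescaped(String line) {
        return line.split("|");
    }

    // Escaped: "\\|" in Java source is the regex \| , a literal vertical bar.
    // The limit -1 keeps empty fields at the end of the line as well.
    static String[] splitEscaped(String line) {
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String line = "106083|1791||7";
        System.out.println(Arrays.toString(splitUnescaped(line)));
        System.out.println(Arrays.toString(splitEscaped(line)));
    }
}
```

The second call yields the four fields 106083, 1791, an empty string, and 7; the first yields one element per character.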


--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

chun...@gmail.com

Oct 12, 2007, 4:48:19 PM

Please ignore that, "\\|" works for me, I guess I use perl too much
-:)


-cji

RedGrittyBrick

Oct 12, 2007, 5:06:42 PM
chun...@gmail.com wrote:
> Hi all,
>
> I have such data in a flat text file,
> "
> 106083|1791||7|73755|48|96|3||01/07/2005 13:04:48.979215 PST|||||t|f||
> t|f|t|"
> "
>
> And such java code to read this line and split it by "|",
>
> "
> while ((rd = in.readLine()) != null) {
>     String delimiter = new String("|");
>     String[] t1 = rd.split(delimiter);
>     String[] t2 = rd.split("|");
> }
> "
>
> Either way, the split does not work! It splits the string per each
> char. Does someone know why ?
>

Because the argument to split() is a regex not a string.

In regexes, certain characters (metacharacters) have special meanings.
The vertical bar is such a metacharacter, representing alternation.

public class MetaChar {
    public static void main(String[] args) {
        String s = "oneXtwoYthreeXfour";
        // "X|Y" is a regex alternation: split on either X or Y.
        String[] a = s.split("X|Y");
        for (String w : a)
            System.out.println(w);
    }
}

You have to "escape" the vertical bar if you want to treat it as an
ordinary character and not as a metacharacter.

http://www.regular-expressions.info/alternation.html
http://www.regular-expressions.info/characters.html
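
As a rough sketch, here are three common ways to make the vertical bar literal: a backslash escape, a character class, and Pattern.quote. The class name, method names, and sample string are invented for the example:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EscapeOptions {
    // Backslash escape: the Java literal "\\|" is the regex \|
    static String[] withBackslash(String s) {
        return s.split("\\|");
    }

    // Inside a character class the bar has no special meaning.
    static String[] withCharClass(String s) {
        return s.split("[|]");
    }

    // Pattern.quote wraps the text so the whole of it is taken literally.
    static String[] withQuote(String s) {
        return s.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        String s = "one|two|three";
        System.out.println(Arrays.toString(withBackslash(s)));
        System.out.println(Arrays.toString(withCharClass(s)));
        System.out.println(Arrays.toString(withQuote(s)));
    }
}
```

All three print [one, two, three]. Pattern.quote is the handiest when the delimiter comes from a variable rather than a literal.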

Roedy Green

Oct 13, 2007, 1:10:35 AM
On Fri, 12 Oct 2007 20:39:06 -0000, chun...@gmail.com wrote, quoted
or indirectly quoted someone who said :

>Either way, the split does not work! It splits the string per each
>char. Does someone know why ?

You mean a literal |, not the regex operator |. See
http://mindprod.com/jgloss/regex.html
on quoting.
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Roedy Green

Oct 13, 2007, 2:31:03 AM
On Fri, 12 Oct 2007 20:39:06 -0000, chun...@gmail.com wrote, quoted
or indirectly quoted someone who said :

> String delimiter = new String("|");

There is no need for new String.

See http://mindprod.com/jgloss/newbie.html

You can write that as:

String delimiter = "|";

but of course, as others pointed out, you meant:

String delimiter = "\\|";

smart...@gmail.com

Aug 8, 2013, 8:07:42 AM
You can also do it like this:
StringTokenizer tokenizer = new StringTokenizer(content, "|");
while (tokenizer.hasMoreTokens()) {
    _log.info("tokenizer.nextToken() : " + tokenizer.nextToken());
}

Lew

Aug 8, 2013, 4:16:44 PM
smart...@gmail.com wrote:
> You can also do like this :
> StringTokenizer tokenizer = new StringTokenizer(content, "|");
> while(tokenizer.hasMoreTokens()){
> _log.info("tokenizer.nextToken() : "+tokenizer.nextToken());
> }

"StringTokenizer is a legacy class that is retained for compatibility reasons although
its use is discouraged in new code. It is recommended that anyone seeking this
functionality use the split method of String or the java.util.regex package instead."
http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html
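
One practical difference is worth checking before swapping one for the other: StringTokenizer skips empty fields, while split keeps them (including trailing ones when given a negative limit). A small sketch, with a made-up class name and input:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFieldDemo {
    // StringTokenizer treats runs of delimiters as one, so empty fields vanish.
    static List<String> viaTokenizer(String line) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(line, "|");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // split with limit -1 returns every field, empty or not.
    static List<String> viaSplit(String line) {
        return Arrays.asList(line.split("\\|", -1));
    }

    public static void main(String[] args) {
        String line = "a||b|";
        System.out.println(viaTokenizer(line)); // [a, b]
        System.out.println(viaSplit(line));     // [a, , b, ]
    }
}
```

For data like the original poster's, where empty columns are meaningful, the tokenizer version would shift the columns.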

"Variable names should not start with underscore _ or dollar sign $ characters,
even though both are allowed."
http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-135099.html#367

--
Lew

Kevin McMurtrie

Aug 9, 2013, 2:46:31 AM
In article <4c416073-9f99-425e...@googlegroups.com>,
Lew <lewb...@gmail.com> wrote:

> smart...@gmail.com wrote:
> > You can also do like this :
> > StringTokenizer tokenizer = new StringTokenizer(content, "|");
> > while(tokenizer.hasMoreTokens()){
> > _log.info("tokenizer.nextToken() : "+tokenizer.nextToken());
> > }
>
> "StringTokenizer is a legacy class that is retained for compatibility reasons
> although
> its use is discouraged in new code. It is recommended that anyone seeking
> this
> functionality use the split method of String or the java.util.regex package
> instead."
> http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html

Last time I checked, the performance of String.split() sucked. The
JavaDoc up to 1.6 even says it sucks. Hopefully they've fixed that
before calling a simple and effective tool like StringTokenizer "legacy."

Now if there was only a way to revert String.substring()'s performance
in Java 1.7, I might try Oracle's version of Java.

Arved Sandstrom

Aug 9, 2013, 3:45:10 AM
I had to check that because I didn't remember ever seeing the Javadoc
for String.split say that the performance sucked. Lo and behold, I
don't see that language.

What's the basis for assessing the suckage of Java String.split? Doing
millions of splits? And if the situation calls for industrial text
processing, why use Java anyway? It's not the first language I'd think
of for that purpose, it's cumbersome. And you can't ramp up your RAM?

I don't mind your comments about Java implementation performance; they
are useful to follow up on. I just wonder what kind of Java programs you
write where you find this kind of detail to be that important. Can't say
I've ever in 15+ years seen a Java SE or EE project be significantly
impacted by these considerations.

AHS
--
When a true genius appears, you can know him by this sign:
that all the dunces are in a confederacy against him.
-- Jonathan Swift

Eric Sosman

Aug 9, 2013, 8:47:34 AM
On 8/8/2013 8:06 AM, smart...@gmail.com wrote:
> On Saturday, 13 October 2007 02:09:06 UTC+5:30, chun...@gmail.com wrote:

Couldn't you have waited for its sixth birthday?

--
Eric Sosman
eso...@comcast-dot-net.invalid

Kevin McMurtrie

Aug 10, 2013, 2:25:52 AM
In article <i61Nt.55783$Su6....@fx16.iad>,
String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.

Let me test...

Java 1.6.0_51 on an old Mac gives me these relative times:
splitNanos= 5341045000
tokenizerNanos= 1934390000

I hacked in a copy of 1.7.0_40-ea and got:
splitNanos= 3299753000
tokenizerNanos= 1675745000


It's not HUGE, but I don't think you should deprecate a class that's 2
times faster than the replacement. String.split() is great for utility
use but the core code should use pre-compiled patterns or
StringTokenizer.

Last time I checked, Oracle was still targeting big business. Asking to
double the datacenter could get a whole Engineering team fired.



import java.util.Random;
import java.util.StringTokenizer;

public class Str
{
    // Delimiters (tab, newline, semicolon) mixed in with alphanumerics.
    final char testChars[]=
        "\t\n;0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        .toCharArray();
    final Random rnd= new Random();

    public static void main(String[] args)
    {
        final Str str= new Str();

        long splitNanos= 0;
        long tokenizerNanos= 0;

        for (int i= 0; i < 100; ++i)
        {
            final String line= str.randomAlphaNumerics();
            String formatBySplit= null, formatByTokenize= null;

            // Time 10000 runs of each approach on the same random line.
            final long startTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatBySplit= str.formatSplit(line);
            final long midTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatByTokenize= str.formatTokenized(line);
            final long endTime= System.nanoTime();

            splitNanos+= midTime - startTime;
            tokenizerNanos+= endTime - midTime;

            // Sanity check: both approaches must produce the same output.
            if (!formatBySplit.equals(formatByTokenize))
                throw new RuntimeException("formatBySplit=" + formatBySplit +
                    " formatByTokenize=" + formatByTokenize);
        }

        System.out.println ("splitNanos= " + splitNanos);
        System.out.println ("tokenizerNanos= " + tokenizerNanos);
    }

    private String formatSplit (String input)
    {
        final String toks[]= input.split("[ \t\n;]+");
        final StringBuilder buf= new StringBuilder (input.length());

        for (String tok : toks)
        {
            // A leading delimiter yields an empty first token; skip it.
            if (tok.length() > 0)
            {
                if (buf.length() > 0)
                    buf.append('\n');
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    private String formatTokenized (String input)
    {
        final StringTokenizer tok= new StringTokenizer(input, " \t\n;", false);
        final StringBuilder buf= new StringBuilder (input.length());

        if (tok.hasMoreElements())
            buf.append(tok.nextElement());

        while (tok.hasMoreElements())
            buf.append('\n').append(tok.nextElement());

        return buf.toString();
    }

    private String randomAlphaNumerics ()
    {
        final char buf[]= new char[rnd.nextInt(200)];
        for (int i= 0; i < buf.length; ++i)
            buf[i]= testChars[rnd.nextInt(testChars.length)];
        return new String (buf);
    }
}
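
The pre-compiled alternative mentioned above was not part of the test. A sketch of what such a variant of formatSplit might look like follows; the class name, the DELIMS constant, and the method name are assumptions, and no timings are claimed for it:

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once and reused; Pattern instances are immutable and
    // safe to share between threads.
    private static final Pattern DELIMS = Pattern.compile("[ \t\n;]+");

    static String formatPrecompiled(String input) {
        final StringBuilder buf = new StringBuilder(input.length());
        for (String tok : DELIMS.split(input)) {
            // A leading delimiter yields an empty first token; skip it.
            if (tok.length() > 0) {
                if (buf.length() > 0) {
                    buf.append('\n');
                }
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Prints a, b, c, d on separate lines.
        System.out.println(formatPrecompiled(";a;b\tc  d;"));
    }
}
```

Dropping a method like this into the benchmark as a third timed loop would show how much of the gap comes from pattern compilation.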

Michael Jung

Aug 10, 2013, 6:37:55 AM
Kevin McMurtrie <mcmu...@pixelmemory.us> writes:
> In article <i61Nt.55783$Su6....@fx16.iad>,
> Arved Sandstrom <asand...@eastlink.ca> wrote:
>> On 08/09/2013 03:46 AM, Kevin McMurtrie wrote:
>> > In article <4c416073-9f99-425e...@googlegroups.com>,
>> > Lew <lewb...@gmail.com> wrote:
>> >
>> >> smart...@gmail.com wrote:
>> >>> StringTokenizer tokenizer = new StringTokenizer(content, "|");
>> >>> while(tokenizer.hasMoreTokens()){
>> >>> _log.info("tokenizer.nextToken() : "+tokenizer.nextToken());
>> >>> }
>> >> "StringTokenizer is a legacy class that is retained for compatibility
>> >> reasons although
>> >> its use is discouraged in new code. It is recommended that anyone seeking
>> >> this
>> >> functionality use the split method of String or the java.util.regex
>> >> package instead."
>> >> http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html
>> > Last time I checked, the performance of String.split() sucked. The
>> > JavaDoc up to 1.6 even says it sucks. Hopefully they've fixed that
>> > before calling a simple and effective tool like StringTokenizer "legacy."
>> > Now if there was only a way to revert String.substring()'s performance
>> > in Java 1.7, I might try Oracle's version of Java.
>> I had to check that because I didn't remember ever seeing that the
>> Javadoc for String.split saying that the performance sucked. Lo and
>> behold, I don't see that language.
>> What's the basis for assessing the suckage of Java String.split? Doing
>> millions of splits? And if the situation calls for industrial text
>> processing, why use Java anyway? It's not the first language I'd think
>> of for that purpose, it's cumbersome. And you can't ramp up your RAM?
>> I don't mind your comments about Java implementation performance, they
>> are useful to followup. I just wonder what kind of Java programs you
>> write where you find this kind of detail to be that important. Can't say
>> I've ever in 15+ years seen a Java SE or EE project be significantly
>> impacted by these considerations.
> String.split() delegates to the Pattern class. The Pattern class
> mentions that the form used in String is not efficient because it must
> compile the regular expression on each use.
> Let me test...
> Java 1.6.0_51 on an old Mac gives me these relative times:
> splitNanos= 5341045000
> tokenizerNanos= 1934390000
> I hacked in a copy of 1.7.0_40-ea and got:
> splitNanos= 3299753000
> tokenizerNanos= 1675745000
> It's not HUGE, but don't think you should deprecate a class that's 2
> times faster than the replacement. String.split() is great for utility
> use but the core code should use pre-compiled patterns or
> StringTokenizer.
> Last time I checked, Oracle was still targeting big business. Asking to
> double the datacenter could get a whole Engineering team fired.

I can confirm that this does matter in business code. We got a 10%-20%
performance boost by avoiding split for certain use cases that used it a
lot, not just in micro-optimizing tests. The numbers from Kevin are
about what we had (although I personally wouldn't show that many decimal
places that suggest a higher degree of accuracy than is actually
reasonable).

Michael

Joerg Meier

Aug 10, 2013, 9:34:32 AM
On Fri, 09 Aug 2013 23:25:52 -0700, Kevin McMurtrie wrote:

> String.split() delegates to the Pattern class. The Pattern class
> mentions that the form used in String is not efficient because it must
> compile the regular expression on each use.

There is really no way around that with .split(), short of some convoluted
internal caching system where the last x patterns compiled by .split() are
stored for y time. If you call a method with a String as a parameter twice,
how are you going to avoid having to compile the String to a Pattern, other
than through that?

The .split syntax is convenient, but slow. There is really no sensible way
to speed it up while keeping the convenient method signature. Of course,
simply using Pattern is not terribly hard at all.

With all that being said: StringTokenizer obviously can only handle very
simple splitting due to the lack of regex support, and thus is naturally
faster, but if your splitting is simple enough not to need regex, it might
be simple enough to use indexOf, which is almost an order of magnitude
faster than even Tokenizer.
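
A minimal sketch of an indexOf-based split for a single-character delimiter follows. The class and method names are invented for illustration, and no speed figures are claimed for it:

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplit {
    // Walks the line once, cutting a field at each occurrence of delim.
    // Empty fields, including a trailing one, are kept.
    static List<String> split(String line, char delim) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        int pos;
        while ((pos = line.indexOf(delim, start)) >= 0) {
            fields.add(line.substring(start, pos));
            start = pos + 1;
        }
        fields.add(line.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a||b|", '|')); // [a, , b, ]
    }
}
```

Unlike StringTokenizer, this keeps empty fields, which matters for column-oriented data such as the line in the original question.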

Liebe Gruesse,
Joerg

--
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.

Arved Sandstrom

Aug 10, 2013, 9:33:02 PM
I don't doubt that use of String.split is not always the optimal
approach. From the sounds of it it's not often the optimal approach. But
I'll bet that the large majority of the time using it is a "good enough"
approach, because very often that extra 10-20 percent speed bump isn't
actually needed.

Funny thing is, I can think of one ESB application of mine right now
that needs to process a high volume of messages, and each message is
composed of 10-20 lines each one of which may have multiple fields
delimited by slashes...and I've been using String.split without
problems. Having said that, this is a 24/7 "don't fail or shit rains
down from the heavens" application, so I might try swapping out
.split(), since it's not complicated logic and I know exactly what the
delimiter is.

But I wouldn't eschew String.split as a rule. I doubt most apps care.

Michael Jung

Aug 11, 2013, 5:12:38 AM
Arved Sandstrom <asand...@eastlink.ca> writes:
> On 08/10/2013 07:37 AM, Michael Jung wrote:
[...]
>> I can confirm that this does matter in business code. We got a 10%-20%
>> performance boost by avoiding split for certain use cases that used it a
>> lot, not just in micro-optimizing tests. The numbers from Kevin are
>> about what we had (although I personally wouldn't show that many decimal
>> places that suggest a higher degree of accuracy than is actually
>> reasonable).
> I don't doubt that use of String.split is not always the optimal
> approach. From the sounds of it it's not often the optimal
> approach. But I'll bet that the large majority of the time using it is
> a "good enough" approach, because very often that extra 10-20 percent
> speed bump isn't actually needed.
[...]
> But I wouldn't eschew String.split as a rule. I doubt most apps care.

I use split myself often enough. You can read my response as a case for
optimization surprises. The micro benchmark shows around a 200% boost
(3:10), the overall gain was 15%, but the code in question as to the
amount of (user-level) code run through was far less than 1% (big "fat"
EE application).

Michael

Joerg Meier

Aug 11, 2013, 7:32:26 AM
On Sun, 11 Aug 2013 11:12:38 +0200, Michael Jung wrote:

> I use split myself often enough. You can read my response as a case for
> optimization surprises. The micro benchmark shows around a 200% boost
> (3:10), the overall gain was 15%, but the code in question as to the
> amount of (user-level) code run through was far less than 1% (big "fat"
> EE application).

Well, odds are, not many applications spend 25% of their CPU time doing
.split(), so I would say that your application speeding up that much is an
extreme edge case. What on Earth do you do that requires millions of
.split() calls per second, and why did you think that would even remotely
be a representative example ?

Michael Jung

Aug 11, 2013, 11:42:27 AM
Joerg Meier <joerg...@arcor.de> writes:
> On Sun, 11 Aug 2013 11:12:38 +0200, Michael Jung wrote:
>> I use split myself often enough. You can read my response as a case for
>> optimization surprises. The micro benchmark shows around a 200% boost
>> (3:10), the overall gain was 15%, but the code in question as to the
>> amount of (user-level) code run through was far less than 1% (big "fat"
>> EE application).
> Well, odds are, not many applications spend 25% of their CPU time doing
> .split(), so I would say that your application speeding up that much is an
> extreme edge case. What on Earth do you do that requires millions of
> .split() calls per second, and why did you think that would even remotely
> be a representative example ?

Odds are that the rest of the application was already highly
optimized. (I already said this was for certain use cases.) Whether this
is representative of something, I don't know, everybody has to judge for
himself what to do with split. But string manipulation is omnipresent in
many applications these days. This was just some light.

Michael
