Splitting on White-space

Cecil Westerhof

Jun 7, 2018, 12:44:05 AM

Often I want to split a string on repeating white-space. The normal
split function does not do what I want. That is why I created the
proc splitOnWhiteSpace:
http://wiki.tcl.tk/55362

Besides splitting on repeating white-space it can also check the number
of elements of the split.
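
For readers following along, a minimal sketch of the problem being described: with its default separators, split breaks at every single white-space character, so a run of white-space produces empty elements. The sample string is made up for illustration.

```tcl
# Hypothetical sample: two spaces, then two tabs, between the words.
set s "a  b\t\tc"

# Each extra separator yields an empty element.
set parts [split $s]
puts [llength $parts]    ;# 5 elements: a, "", b, "", c

# Dropping the empty elements leaves only the words.
set words {}
foreach p $parts {
    if {$p ne ""} {lappend words $p}
}
puts $words              ;# a b c
```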

--
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof

Lawrence Woodman

Jun 7, 2018, 1:52:07 AM

On Thu, 07 Jun 2018 06:37:38 +0200, Cecil Westerhof wrote:

> Often I want to split a string on repeating white-space. The normal
> split function does not do what I want. That is why I created the
> proc splitOnWhiteSpace:
> http://wiki.tcl.tk/55362
>
> Besides splitting on repeating white-space it can also check the number
> of elements of the split.

What you are describing is a list. This can be turned into a cleaner list,
if you want, using:
lrange $str 0 end

However, even without that you can use the normal iterative functions such
as:
foreach, llength, lindex, etc.
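
A short sketch of what that looks like in practice, assuming the input contains no braces, quotes or backslashes (the sample string is hypothetical):

```tcl
# Hypothetical input with runs of spaces and tabs between the words.
set str "one   two\t\tthree"

puts [llength $str]     ;# 3
puts [lindex $str 1]    ;# two

# foreach walks the words directly, with no explicit split step.
foreach word $str {
    puts "<$word>"
}
```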

Best wishes

Lorry

--
http://lawrencewoodman.github.io/

heinrichmartin

Jun 7, 2018, 2:31:49 AM

On Thursday, June 7, 2018 at 7:52:07 AM UTC+2, Lawrence Woodman wrote:
> On Thu, 07 Jun 2018 06:37:38 +0200, Cecil Westerhof wrote:
> > Often I want to split a string on repeating white-space.
>

> What you are describing is a list.

Just be aware of edge cases: not all strings are lists.

expect [~]set foo "this is not {a\tlist"
this is not {a list
expect [~]llength $foo
unmatched open brace in list
while evaluating {llength $foo}
expect [~]regexp -all -inline {\S+} $foo
this is not \{a list
expect [~]llength [regexp -all -inline {\S+} $foo]
5

Maybe there are faster implementations...
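
As a rough illustration of those edge cases, the sketch below shows two ways the list reading and the white-space reading can disagree. Both inputs are made up; the second one is an extra case beyond the session above, included on the assumption that brace grouping is not wanted when splitting plain text.

```tcl
# Hypothetical inputs chosen to show where the two readings differ.
set broken  "this is not \{a\tlist"   ;# unmatched open brace
set grouped "a {b c} d"               ;# well-formed list, but braces group

# catch returns 1 because llength raises an error on the broken string.
puts [catch {llength $broken}]

# As a list this has 3 elements; as white-space separated words it has 4.
puts [llength $grouped]
puts [llength [regexp -all -inline {\S+} $grouped]]
```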

skuh...@web.de

Jun 7, 2018, 2:47:21 AM

About four times faster compared to the regexp-line:

list {*}[string map {\{ \\\{} $value]

The string map is needed to avoid unmatched open braces in lists. If you know that there will never be an opening brace in your inputs, you can make it even faster by dropping the map.

Cecil Westerhof

Jun 7, 2018, 3:28:04 AM

Implemented it in my library. Keeping the map, because with a library
function you never know what it will receive.

Did not change the function itself, so people see the progress. ;-)

heinrichmartin

Jun 7, 2018, 4:04:29 AM

On Thursday, June 7, 2018 at 8:47:21 AM UTC+2, skuh...@web.de wrote:
> About four times faster compared to the regexp-line:
>
> list {*}[string map {\{ \\\{} $value]

Thanks for measuring. After adding the quote character to the map, it's two times faster here. Similar timings after adding the backslash, too.

I wondered whether we missed other cases, but could not quickly find where list parsing is defined. The Dodekalogue no longer seems self-contained since the {*} operator was introduced: "backslash substitutions are performed as is normal for a list" relies on a definition of "list" that is not exactly given.

It looks like this should be rephrased to "backslash substitutions are performed according to rule [9]" and we should be fine with the string map {\{ \\\{ \" \\\" \\ \\\\}.
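
One way to gain some confidence in that map is to compare it against the regexp on inputs containing each of the three characters. The sketch below does that; the proc names and the sample strings are hypothetical, and agreement on these few samples is not a proof that no other case exists.

```tcl
# Escape the characters that are special to the list parser, then let
# {*} expansion split the result (requires Tcl 8.5 or later).
proc listsplit {value} {
    list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $value]
}
proc regsplit {value} {
    regexp -all -inline {\S+} $value
}

# Hypothetical test inputs: open brace, quote, backslashes, plain words.
set samples [list \
    "this is\tnot \{a list" \
    "say \"hello there" \
    "back\\slash and \\n" \
    "  plain   words  "]

foreach s $samples {
    set a [listsplit $s]
    set b [regsplit $s]
    puts "[expr {$a eq $b ? "same" : "DIFFERENT"}]: $b"
}
```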

skuh...@web.de

Jun 7, 2018, 4:08:50 AM

> I wondered whether we missed other cases, but didn't find quickly how list parsing is defined.

True. The map must ensure a well-formed list. My ad hoc implementation did not consider the other ways a string can fail to be a well-formed list.

Cecil Westerhof

Jun 7, 2018, 4:14:04 AM

I changed the function. Instead of count I now work with min and max.
(When max is not defined it is the same as min, retaining the old
functionality.) I also took your optimisation into the code, and the
page explains what the code was that you commented on.
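
For readers without the wiki page at hand, a minimal sketch of what a min/max check along those lines could look like. This is not the code from the wiki: the proc name, the argument order, the use of a negative bound to disable a check, and the error text are all assumptions made for illustration.

```tcl
# Hypothetical proc, not the wiki's splitOnWhiteSpace.
proc splitWords {str {min -1} {max ""}} {
    # When max is omitted it takes the value of min (exact count).
    if {$max eq ""} {set max $min}
    set words [regexp -all -inline {\S+} $str]
    set n [llength $words]
    # Assumption: a negative bound disables that check.
    if {($min >= 0 && $n < $min) || ($max >= 0 && $n > $max)} {
        error "expected between $min and $max elements, got $n"
    }
    return $words
}

puts [splitWords " To show the problem. "]   ;# no check on the count
puts [splitWords "a b c" 3]                  ;# exactly 3 elements required
puts [catch {splitWords "a b c" 1 2} msg]    ;# 1, because 3 > max
puts $msg
```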

Lawrence Woodman

Jun 7, 2018, 4:22:28 AM

That's really interesting, I hadn't thought of that.

--
https://lawrencewoodman.github.io

heinrichmartin

Jun 7, 2018, 4:57:46 AM

On Thursday, June 7, 2018 at 6:44:05 AM UTC+2, Cecil Westerhof wrote:
> Often I want to split a string on repeating white-space. The normal
> split function does not do what I want.

After re-reading this, there is actually another option: split and remove empty entries. I am quite surprised that this keeps up with the timing.

The -squeeze option to split could be implemented in C ...

#!/usr/bin/env tclsh

rename split split_native
proc split {squeeze args} {
    if {"-squeeze" eq $squeeze} {
        return [lsearch -inline -not -exact -all [split_native {*}$args] ""]
    }
    # $squeeze is actually the string
    split_native $squeeze {*}$args
}
proc split_squeeze {args} {
    lsearch -inline -not -exact -all [split_native {*}$args] ""
}
proc regsplit {value} {
    regexp -all -inline {\S+} $value
}
proc listsplit {value} {
    list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $value]
}

set value "this is\tnot \{a list"

puts stderr "Squeeze option:"
puts stderr [split -squeeze $value]
puts stderr [time {split -squeeze $value} 10000]

puts stderr "Squeeze direct:"
puts stderr [split_squeeze $value]
puts stderr [time {split_squeeze $value} 10000]

puts stderr "Regular:"
puts stderr [regsplit $value]
puts stderr [time {regsplit $value} 10000]

puts stderr "List:"
puts stderr [listsplit $value]
puts stderr [time {listsplit $value} 10000]

Cecil Westerhof

Jun 7, 2018, 5:59:04 AM

I had done some checking, but not enough. :'-(

The quote is necessary (I should have thought about that), but the
backslash is not.

When executing:
set value { Just ($a) test. [ " \\}
list {*}[string map {\{ \\\{ \" \\\"} ${value}]

I get:
Just {($a)} test. {[} {"} \\

How fast can a simple proc grow, and how much time can it absorb? ;-)

heinrichmartin

Jun 7, 2018, 6:18:29 AM

On Thursday, June 7, 2018 at 11:59:04 AM UTC+2, Cecil Westerhof wrote:
> The quote is necessary (I should have thought about that), but the
> backslash is not.

The {*} operator introduces another round of backslash substitution; therefore we must protect backslashes, too.

set value \\n ;# the string is backslash-n
lindex [string map {\{ \\\{ \" \\\"} ${value}] end ;# returns a newline character
lindex [string map {\{ \\\{ \" \\\" \\ \\\\} ${value}] end ;# returns backslash-n

Rich

Jun 7, 2018, 6:19:15 AM

Cecil Westerhof <Ce...@decebal.nl> wrote:
> Often I want to split a string on repeating white-space. The normal
> split function does not do what I want. That is why I created the
> proc splitOnWhiteSpace:
> http://wiki.tcl.tk/55362
>
> Besides splitting on repeating white-space it can also check the number
> of elements of the split.

That is a lot of Tcl (to write and to execute) when the following performs
the function you describe (and 100% avoids *all* of the "is a string but
not a list" edge cases). The regexp does all the work:

$ rlwrap tclsh

% set string_but_not_list "a b c d { \\ \" x y z"
a b c d { \ " x y z

% set now_a_list [regexp -inline -all {\S+} $string_but_not_list]
a b c d \{ \\ {"} x y z

% llength $now_a_list
10

% join $now_a_list \n
a
b
c
d
{
\
"
x
y
z

Cecil Westerhof

Jun 7, 2018, 7:28:04 AM

You are correct. I am going to change it.

Cecil Westerhof

Jun 7, 2018, 7:28:04 AM

That was the way I did it originally, but I was told that the
implemented variant is about four times as fast.

I hope that the edge cases are now taken care of. ;-)

Rich

Jun 7, 2018, 11:07:29 AM

Did you perform any timing tests yourself?

I ask because this (perf-test.tcl):

#!/usr/bin/tclsh

# "splitonwhitespace.tcl" is the code from revision 11 on the wiki

source splitonwhitespace.tcl

proc code {str} {
    puts [time {splitOnWhiteSpace $str} 100]
}

proc reexp {str} {
    puts [time {regexp regexp -inline -all {\S+} $str} 100]
}

set str " To show the problem. "

puts "Tcl code:"
code $str

puts "Regex engine:"
reexp $str

returns this for me on this laptop (Tcl 8.5):

Tcl code:
43.05 microseconds per iteration
Regex engine:
12.94 microseconds per iteration

Which puts the regex engine at about 3.3x faster than your code, for
the sample string here in the newsgroup.

> I hope that the edge case are now taken care of. ;-)

Well, that is one advantage of the regex method, *all* the edge conditions
are taken care of already.

Brad Lanam

Jun 7, 2018, 12:59:58 PM

On Thursday, June 7, 2018 at 8:07:29 AM UTC-7, Rich wrote:

Did you time it with this typo? Or without?
regexp regexp
Just checking as the typo does not return an error.

Rich

Jun 7, 2018, 1:07:19 PM

Sadly, with the typo.

Removing the typo, I get (for several runs):

$ ./perf-test.tcl
Tcl code:
108.02 microseconds per iteration
Regex engine:
88.36 microseconds per iteration
$ ./perf-test.tcl
Tcl code:
78.57 microseconds per iteration
Regex engine:
71.62 microseconds per iteration
$ ./perf-test.tcl
Tcl code:
81.22 microseconds per iteration
Regex engine:
77.81 microseconds per iteration
$ ./perf-test.tcl
Tcl code:
102.88 microseconds per iteration
Regex engine:
81.14 microseconds per iteration
$ ./perf-test.tcl
Tcl code:
75.61 microseconds per iteration
Regex engine:
68.07 microseconds per iteration

I don't understand the variance, but the regex version is faster, or at
least equal, to the code version over these runs.

Brad Lanam

Jun 7, 2018, 1:24:47 PM

On Thursday, June 7, 2018 at 10:07:19 AM UTC-7, Rich wrote:
> I don't understand the variance, but the regex version is faster, or at
> least equal, to the code version over these runs.

This is *without* the error checking in splitOnWhiteSpace, and with the
timing loop increased from 100 to 1000.
I am also getting a lot of variance. I was getting more variance
for the regex test when the timing loop was set to 100. Strange.

Hard to tell which is better with that variance in timing.
My system is fairly quiescent, but I didn't shut down the browser, etc.

Tcl code:
3.189 microseconds per iteration
Regex engine:
4.7 microseconds per iteration
Tcl code:
4.114 microseconds per iteration
Regex engine:
4.634 microseconds per iteration
Tcl code:
2.889 microseconds per iteration
Regex engine:
4.989 microseconds per iteration
Tcl code:
5.118 microseconds per iteration
Regex engine:
5.448 microseconds per iteration
Tcl code:
5.019 microseconds per iteration
Regex engine:
4.753 microseconds per iteration
Tcl code:
3.096 microseconds per iteration
Regex engine:
5.149 microseconds per iteration
Tcl code:
2.646 microseconds per iteration
Regex engine:
5.367 microseconds per iteration
Tcl code:
5.123 microseconds per iteration
Regex engine:
5.102 microseconds per iteration
Tcl code:
4.286 microseconds per iteration
Regex engine:
4.755 microseconds per iteration
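
One common way to damp this kind of run-to-run variance is to repeat the measurement and keep the fastest run. The sketch below does that with a hypothetical helper (bestTime is not from the thread), on the assumption that background noise only ever makes a run slower. It times the regexp directly, since the wiki code is not reproduced here.

```tcl
# Hypothetical helper: time a script several times, keep the fastest run.
proc bestTime {script {iterations 1000} {runs 5}} {
    set best ""
    for {set i 0} {$i < $runs} {incr i} {
        # time returns a string such as "4.7 microseconds per iteration";
        # the first word is the number. uplevel runs the script in the
        # caller's scope so that its variables resolve.
        set t [lindex [uplevel 1 [list time $script $iterations]] 0]
        if {$best eq "" || $t < $best} {set best $t}
    }
    return $best
}

set str " To show the problem. "
puts "regexp: [bestTime {regexp -all -inline {\S+} $str}] microseconds per iteration"
```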

Cecil Westerhof

Jun 7, 2018, 1:28:04 PM

No, it was mentioned here. I should try it out myself also. Maybe I
should go back to my original code. I do not think that this function
will be on the critical path for resource usage. And readability is
more important than efficiency, unless there is a good reason to
choose performance.

Cecil Westerhof

Jun 7, 2018, 1:44:05 PM

Well, the non-regex version is not significantly faster than the regex
version. So the regex version seems better to me.

Cecil Westerhof

Jun 7, 2018, 2:28:04 PM

It was not an honest comparison. I changed the proc so that a
parameter decides which version is going to be used. With this the Tcl
version takes about 55% of the time that the regexp version takes.
That is nowhere near the 25% that was claimed.

The following code:
#!/usr/bin/env tclsh

# "splitOnWhiteSpace.tcl" is the code from revision 12 on the wiki
package require dcblUtilities

namespace import dcblUtilities::*

proc code {str} {
    global iterations
    global tries

    for {set i 0} {${i} < ${tries}} {incr i} {
        puts [time {splitOnWhiteSpace ${str} -1 -1 True} ${iterations}]
    }
}

proc reexp {str} {
    global iterations
    global tries

    for {set i 0} {${i} < ${tries}} {incr i} {
        puts [time {splitOnWhiteSpace ${str}} ${iterations}]
    }
}

set iterations [expr {10 ** 6}]

set str " To show the problem. "

set tries 5

puts "Tcl code:"
code ${str}

puts "\n\nRegex engine:"
reexp ${str}

Results in:
Tcl code:
3.339411 microseconds per iteration
3.302142 microseconds per iteration
3.332466 microseconds per iteration
3.294206 microseconds per iteration
3.339484 microseconds per iteration

Regex engine:
5.910905 microseconds per iteration
5.893764 microseconds per iteration
5.97237 microseconds per iteration
5.846914 microseconds per iteration
5.869534 microseconds per iteration

Cecil Westerhof

Jun 8, 2018, 3:14:05 AM

It looks like there is a case for the Tcl code variant. When the
strings become longer, the time needed for the regexp version grows
faster than that of the Tcl code version. I now have, for example:
Tcl code:
45.632089 microseconds per iteration
45.695309 microseconds per iteration
45.972859 microseconds per iteration
46.516739 microseconds per iteration
46.451871 microseconds per iteration

Regex engine:
161.118906 microseconds per iteration
162.945644 microseconds per iteration
162.676964 microseconds per iteration
163.008758 microseconds per iteration
162.228666 microseconds per iteration

That is almost the four times that was mentioned in the beginning.
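
A rough way to look at how the two approaches scale is to time them on the same sample repeated a growing number of times. The sketch below uses the listsplit and regsplit procs posted earlier in this thread as stand-ins, since the wiki code is not reproduced here; the repeat factors and iteration count are arbitrary assumptions, and the numbers will differ per machine.

```tcl
proc regsplit {value} { regexp -all -inline {\S+} $value }
proc listsplit {value} { list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $value] }

set base " To show the problem. "
foreach factor {1 10 100} {
    # Build a longer input by repeating the sample string.
    set str [string repeat $base $factor]
    # The first word of the result of time is microseconds per iteration.
    set tList [lindex [time {listsplit $str} 1000] 0]
    set tRe   [lindex [time {regsplit $str} 1000] 0]
    puts [format "%6d chars: list %.2f us, regexp %.2f us" \
        [string length $str] $tList $tRe]
}
```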

I have a 'few' other things to do, so I will work it out over the weekend.