What's the Tcl way to get the first token from a string?
--
Internet service
http://www.isp2dial.com/
What is a token" The first character?
See:
http://en.wikipedia.org/wiki/Tokenizing
Okay. So you want the first set of sequential non-space characters. Using
the Wiki example:
set tokens "The quick brown fox jumps over the lazy dog."
set firstToken [lindex [split $tokens] 0]
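Which gives, to be explicit:
puts $firstToken   ;# prints: The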
>> >"John Kelly" <j...@isp2dial.com> wrote in message
>> >> What's the Tcl way to get the first token from a string?
>Okay. So you want the first set of sequential non-space characters. Using
>the Wiki example:
>set tokens "The quick brown fox jumps over the lazy dog."
>set firstToken [lindex [split $tokens] 0]
OK, but then, what if I want the second token, and there are 4 spaces
between the first two?
set tokens "The    lazy dog."
% set tokens "The    lazy dog."
The    lazy dog.
% foreach token $tokens {puts $token}
The
lazy
dog.
% set secondToken [lindex $tokens 1]
lazy
Do you really want to parse strings into tokens in the broad sense, or
will your needs be limited to words separated by whitespace?
set secondToken [lindex [split [string trim [regsub -all { +} $tokens " "]]] 1]
However, strings in the wild are not always well-formed Tcl lists...
When you can't guarantee the format of the string in question, you can
use a regular expression to get the tokens:
set tokens [regexp -all -inline {\S+} $string]
Slightly different, but generally more useful to me when doing
linguistics, is:
set words [regexp -all -inline {\w+} $string]
Of course, your regular expression can get as complex as you need it
to be. If you need a full-fledged finite-state automaton of any
complexity, you might instead opt for the [grammar::] module in Tcllib
(see also fickle and taccle on the wiki).
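To make the difference between the two patterns concrete, here is a quick sketch (results as comments) using the doggie example from earlier in the thread:
set s "The    lazy dog."
regexp -all -inline {\S+} $s   ;# ==> The lazy dog.   (keeps the punctuation)
regexp -all -inline {\w+} $s   ;# ==> The lazy dog    (drops the punctuation)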
>Do you really want to parse strings into tokens in the broad sense, or
>will your needs be limited to words separated by whitespace?
For the task at hand, simple white space. But later, I may need the
real thing.
My fast fingers hit the wrong keys... continuing:
For the former, take a look at fickle (and perhaps its cousin taccle),
http://wiki.tcl.tk/3555, whereas for the latter I believe you already got
the hang of things using lindex.
>For the former, take a look at fickle (and perhaps its cousin taccle),
>http://wiki.tcl.tk/3555, whereas for the latter I believe you already got
>the hang of things using lindex.
GPL license?
I hope to avoid GPL code or libraries for use with Tcl. I prefer the
Tcl license.
> "John Kelly" <j...@isp2dial.com> wrote in message
> news:n6jje3plci2m8f0lv...@4ax.com...
>>
>> OK, but then, what if I want the second token, and there are 4 spaces
>> between the first two?
>>
>> set tokens "The    lazy dog."
>>
>
> set tokens "The    lazy dog."
> set secondToken [lindex [split [string trim [regsub -all { +} $tokens " "]]] 1]
Now it's open season!
There are at least 4 different(ish; actually they are all just lindex in
fancy clothes) ways to do (simplish) tokenization of a string. And I will
deal with them all.
First one is.. lindex! Yes, it is fragile if input data isn't conforming,
but it is very, very fast. Naturally, since there is no overhead.
[code]
interp alias {} twlindex {} lindex
[/code]
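To illustrate the fragility (a hypothetical interactive session):
[code]
% twlindex {The    lazy dog.} 1
lazy
% twlindex {an unbalanced "quote breaks it} 1
unmatched open quote in list
[/code]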
Second is.. lindex with lsearch and split! This isn't fragile at all and
it has an acceptable performance with acceptably small datasets.
[code]
proc twlsearch {tokens i} {
    return [lindex [lsearch -all -not -exact -inline [split $tokens { }] {}] $i]
}
[/code]
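For example (hypothetical session, using the sparse doggie string):
[code]
% twlsearch {The    lazy dog.} 1
lazy
[/code]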
Third approach is.. lindex with regexp! This is the one to use when there
is a need for more sophisticated splitting. "More sophisticated" here means
regular expressions; constant separator strings can be reduced to dust by
[string map]ping them away, after which one can utilize the speed of
lindex, or split and lindex.
[code]
proc twregexp {tokens i} {
    return [lindex [regexp -all -inline {\S+} $tokens] $i]
}
[/code]
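To sketch the [string map] remark with a made-up constant separator:
[code]
% set s {foo::bar::baz}
foo::bar::baz
% lindex [split [string map {:: { }} $s]] 1
bar
[/code]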
The final stab at this tokenizing thing is the mother of all overkills,
regsub. Naturally it comes with a split flavour and a hint of lindex.
[code]
proc twregsub {tokens i} {
    return [lindex [split [string trim [regsub -all { +} $tokens " "]]] $i]
}
[/code]
These are the first four. Of course one may build one's own engine for
splitting things into little pieces, but that is something one has to
write in C, because typical attempts such as
[code]
proc twscan {t i} {
    set t [string trim $t]
    while {[string length $t]} {
        scan $t %s token
        if {[incr i -1] < 0} {return $token}
        set t [string trimleft [string range $t [string length $token] end]]
    }
    return {}
}

proc twscan2 {t i} {
    set fstr "[string repeat {%*s} $i] %s"
    set token {}
    catch {scan $t $fstr token}
    return $token
}
[/code]
don't perform well at all.
So, which one is the best? Since "best" is subjective I just timed these
and collected the results in a table. Here is the test:
[code]
proc test {t i} {
    set runcount 25
    foreach case {regsub regexp lsearch lindex scan scan2} {
        puts "$case:"
        set d [tw$case $t $i]
        if {[string length $d] > 25} {set d "[string range $d 0 21] .."}
        puts "test: tw$case: $d"
        set t1 [time {tw$case $t $i} $runcount]
        puts "time: tw$case: [lrange $t1 0 1]"
    }
}

set tokens "The    lazy dog."
foreach {t i} [list \
        $tokens 1 \
        [string repeat "$tokens " 700] 1619 \
        "a[string repeat { } 2000]b" 1 \
        [string repeat a 2000] 0] {
    test $t $i
}
[/code]
Runcount is rather low since my machine is low-end. You are free to crank
it up by orders of magnitude. The test data set is created from the doggie
example in previous posts, plus "a<2000 spaces>b" and "2000 times a". The
test was run with 8.4.12 and 8.5a6. Here are the results (this would
actually take a 3D table, but I doubt you have the needed plugins in your
news-readers; times are in milliseconds):
Tcl 8.4:
========
proc    |  set 1  |   set 2    |  set 3   |  set 4
--------+---------+------------+----------+---------
regsub  | 1058.12 |  361873.8  |  1787.84 | 4240.16
regexp  |  873.72 |  352221.44 |  1672.32 | 1516.44
lsearch |  589.04 |   80514.24 | 20152.56 |  578.48
lindex  |  216.24 |     216.92 |   213.72 |  222.76
scan    |  952.72 | 3383727.48 |   2306.6 | 4159.04
scan2   |  510.84 |   37118.28 |  2783.96 | 3878.08

Tcl 8.5:
========
proc    |  set 1  |   set 2    |  set 3   |  set 4
--------+---------+------------+----------+---------
regsub  | 1001.88 |  321359.4  |  1737.32 |  4098.2
regexp  |  901.12 |  362268.16 |  1653.88 | 1504.92
lsearch |  553.0  |   64546.92 | 13417.32 |  610.48
lindex  |  246.92 |     257.28 |   250.36 |  254.24
scan    |  890.08 | 3365021.84 |  2350.64 |  6201.2
scan2   |  515.08 |   50211.6  |  4622.72 | 4783.04
So what is the conclusion? Simple: 8.5 must get out of alpha stage, so I
only waste half the time doing these charts. As for the timings? Oh, these
aren't worth reading; performance depends on the dataset. The numbers
fluctuate too much to point out a non-fragile winner, but obviously
[lindex] is the choice when you can trust your data.
Unfortunately, most times you can't. But hey, put a catch around it and
fall back to something else when it fails. And hope that your data is not
going to be "valid list with sublists" :)
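A minimal sketch of that catch-and-fall-back idea (name and details mine, untested):
[code]
proc safetoken {s i} {
    # fast path: treat $s as a list; only valid for well-formed data
    if {[catch {lindex $s $i} token]} {
        # $s was not a valid Tcl list; fall back to the robust regexp splitter
        set token [lindex [regexp -all -inline {\S+} $s] $i]
    }
    return $token
}
[/code]
(And per the caveat above: a string that happens to be a valid list with
braces or quotes will not error, it will just quietly "tokenize" as a list.)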
--
-Kaitzschu
s="TCL ";while true;do echo -en "\r$s";s=${s:1:${#s}}${s:0:1};sleep .1;done
>> "John Kelly" <j...@isp2dial.com> wrote in message
>>> OK, but then, what if I want the second token, and there are 4 spaces
>>> between the first two?
>Now it's open season!
Thanks all, for the responses. I will study them ...
You've probably seen this result:
set s "The lazy dog."
set tokens [split $s] ;# ==> 6 elements: The {} {} {} lazy dog.
However, tcllib to the rescue
package require textutil
set tokens [textutil::splitx $s] ;# ==> 3 elements: The lazy dog.
http://tcllib.sourceforge.net/doc/textutil.html
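By the way, splitx also accepts an optional regexp to split on, so you are not limited to whitespace; a quick sketch:
set tokens [textutil::splitx "some.long.host.name.com" {\.}] ;# ==> 5 elements: some long host name com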
--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry
It does depend on what you mean by "token", but this is simple:
set firstToken [lindex [split [string trim $string]] 0]
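The [string trim] is there because [split] yields empty elements for leading and trailing whitespace; a quick sketch of the difference:
set firstToken [lindex [split {  hello world}] 0]                ;# ==> {} (empty)
set firstToken [lindex [split [string trim {  hello world}]] 0]  ;# ==> hello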
Donal.
I appreciate your mention of the (entirely appropriate; please
don't let Wikipedia's own disclaimers discourage you) Wikipedia
article on tokenization. Tokenization is, of course, always
relative to a specific grammar. What's the grammar of "the
real thing" for you? Did you explain it in some part of this
thread I missed?
I see that it might involve the notion of white space ...
>>>Do you really want to parse strings into tokens in the broad sense, or
>>>will your needs be limited to words separated by whitespace?
>>For the task at hand, simple white space. But later, I may need the
>>real thing.
>I appreciate your mention of the (entirely appropriate; please
>don't let Wikipedia's own disclaimers discourage you) Wikipedia
>article on tokenization. Tokenization is, of course, always
>relative to a specific grammar. What's the grammar of "the
>real thing" for you? Did you explain it in some part of this
>thread I missed?
lindex's simple whitespace handling is a good start. But what if I want
to split some.long.host.name.com into tokens? lindex can't handle
that, can it? And split leaves you with unwanted empty tokens if the
string is somehow irregular.
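(To be fair, [split] does take an explicit split-character set:
split some.long.host.name.com .   ;# ==> some long host name com
but a doubled separator, e.g. some..name, would still leave an empty token
in the result.)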
Looks like Glenn pointed me to what I had in mind ...
>However, tcllib to the rescue
>
> package require textutil
> set tokens [textutil::splitx $s] ;# ==> 3 elements: The lazy dog.
>
>http://tcllib.sourceforge.net/doc/textutil.html
And Aric blew my mind ...
So I think I'm all set, once I recover my mind ...
>> OK, but then, what if I want the second token, and there are 4 spaces
>> between the first two?
>>
>> set tokens "The    lazy dog."
>
>You've probably seen this result:
>
> set s "The    lazy dog."
> set tokens [split $s] ;# ==> 6 elements: The {} {} {} lazy dog.
>
>However, tcllib to the rescue
>
> package require textutil
> set tokens [textutil::splitx $s] ;# ==> 3 elements: The lazy dog.
Ah, that's more like it.
I looked in the docs, but missed it. It helps to know where to look,
and what to look for ...
> tcllib to the rescue
> package require textutil
> set tokens [textutil::splitx $s] ;# ==> 3 elements: The lazy dog.
I wondered how this would perform vs. Aric's regexp ...
package require Tcl 8.4
package require textutil
proc do1 {limit text} {
    set tx 0
    while {$tx < $limit} {
        incr tx
        set tokens [regexp -all -inline {\S+} $text]
    }
}

proc do2 {limit text} {
    set tx 0
    while {$tx < $limit} {
        incr tx
        set tokens [::textutil::splitx $text {\s+}]
    }
}
set limit 100000
set text "this is some text"
puts [time {do1 $limit $text}]
puts [time {do2 $limit $text}]
4191671 microseconds per iteration
9536334 microseconds per iteration
... and I see textutil::splitx takes twice as long.
Is there a performance penalty for using an external library, or is
splitx just slower?
> I wondered how this would perform vs. Aric's regexp ...
...
> ... and I see textutil::splitx takes twice as long.
>
> Is there a performance penalty for using an external library, or is
> splitx just slower?
Splitx is just slower. Of course there is also the overhead of a library
call ([regexp] is a core command), but that is negligible. Splitx simply
does a lot of stuff your other example doesn't. Take a look at
http://tcllib.cvs.sourceforge.net/tcllib/tcllib/modules/textutil/split.tcl?view=markup
around line 67 and you'll see why splitx is slow: multiple calls to
[regexp] versus a single call to [regexp]. That just has to show up
somewhere.
>> Is there a performance penalty for using an external library, or is
>> splitx just slower?
>Splitx is just slower. Of course there is also the overhead of a library
>call ([regexp] is a core command), but that is negligible. Splitx simply
>does a lot of stuff your other example doesn't. Take a look at
>http://tcllib.cvs.sourceforge.net/tcllib/tcllib/modules/textutil/split.tcl?view=markup
>around line 67 and you'll see why splitx is slow: multiple calls to
>[regexp] versus a single call to [regexp]. That just has to show up
>somewhere.
I see they wrap some extra code around regexp.
Got it, thanks.
scan returns a list and automatically collapses white space. Not
an option?
>% scan "Okay. So you want the first set of sequential non-space
>characters. To" [string repeat %s 100]
>Okay. So you want the first set of sequential non-space characters. To
>{} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {}
>{} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {}
>{} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {}
>{} {} {} {} {} {} {} {} {} {} {} {} {} {} {} {}
>
>scan returns a list and automatically collapses white space. Not
>an option?
It's good if you only want the first token. That was my original
question.
But now I want more tokens. And I don't have prior knowledge of how
many tokens the string holds.
>set tokens "The    lazy dog."
>set secondToken [lindex [split [string trim [regsub -all { +} $tokens " "]]] 1]
At first, this looked too complicated to perform well. And Kaitzschu
reinforced my expectation, saying:
>the mother of all overkills, regsub.
But today, I looked again, to see if I missed anything. As it turns
out, you can improve the performance of regsub by thinking about the
data.
We don't need to substitute all whitespace. Why replace one for one?
We only need to replace two or more. That change alone cuts the time
in half for typical data. And by using trim before regsub, instead of
after, regsub won't waste time handling useless whitespace on either
end.
I was surprised by the result:
proc do1 {limit text} {
    set tx 0
    while {$tx < $limit} {
        incr tx
        set tokens [regexp -all -inline {\S+} $text]
    }
}

proc do2 {limit text} {
    set tx 0
    while {$tx < $limit} {
        incr tx
        set tokens [split [regsub -all {\s{2,}} [string trim $text] { }]]
    }
}
set limit 100000
set text " this is some text "
puts [time {do1 $limit $text}]
puts [time {do2 $limit $text}]
puts ""
set text " this is some text "
puts [time {do1 $limit $text}]
puts [time {do2 $limit $text}]
4678182 microseconds per iteration
3799362 microseconds per iteration
4655958 microseconds per iteration
2064295 microseconds per iteration
If you are so concerned about performance, why are you still using
regsub/regexp instead of the method that uses lsearch (my personal
favourite) that Kaitzschu also showed?
proc do1 {text} {
    set tokens [regexp -all -inline {\S+} $text]
}

proc do2 {text} {
    set tokens [split [regsub -all {\s{2,}} [string trim $text] { }]]
}

proc do3 {text} {
    set tokens [lsearch -all -inline -exact -not [split [string trim $text]] {}]
}
% set limit 100000
100000
% set text " this is some text "
 this is some text 
% do1 $text
this is some text
% do2 $text
this is some text
% do3 $text
this is some text
% time {do1 $text} $limit
18.23532 microseconds per iteration
% time {do2 $text} $limit
17.61229 microseconds per iteration
% time {do3 $text} $limit
7.65453 microseconds per iteration
% set text " this is some text "
 this is some text 
% time {do1 $text} $limit
18.00468 microseconds per iteration
% time {do2 $text} $limit
8.9039 microseconds per iteration
% time {do3 $text} $limit
7.11857 microseconds per iteration
Note: The [string trim] in do3 isn't functionally necessary,
but I needed it to beat your do2 times on the last string ;-)
Depending on your real input you may actually get better times
if you remove it.
Schelte.
--
set Reply-To [string map {nospam schelte} $header(From)]
>If you are so concerned about performance, why are you still using
>regsub/regexp instead of the method that uses lsearch (my personal
>favourite) that Kaitzschu also showed?
I don't always know why I do what I do.
>time {do1 $text} $limit
But it looks like your test also times the proc call. I only make one
proc call per test. I need to chew on this ...
> If you are so concerned about performance, why are you still using
> regsub/regexp instead of the method that uses lsearch (my personal
> favourite) that Kaitzschu also showed?
>John Kelly <j...@isp2dial.com> wrote:
> I need to chew on this ...
OK, done chewing.
>set tokens [lsearch -all -inline -exact -not [split [string trim $text]] {}]
That method is disastrous when two tokens have thousands of spaces
between them, because split produces a huge list and hands it off to
lsearch. The lsearch method runs 10 times longer under that load.
Yes, 10 times longer! Try it.
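A quick way to reproduce it, reusing do1 and do3 from Schelte's post (a sketch; numbers will vary by machine):
set text "a[string repeat { } 100000]b"
time {do3 $text} 100   ;# split hands lsearch a list of ~100000 empty elements
time {do1 $text} 100   ;# the regexp version never builds that list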
>Kaitzschu wrote:
> which one is the best? ... "best" is subjective
"Best" never collapses under load, and has competitive performance
under typical loads.
Using regsub with {2,} is looking pretty good ...
Good thing programmers don't build bridges. Better leave that to the
"real" engineers. ;-)
>> set tokens [lsearch -all -inline -exact -not [split [string trim $text]] {}]
>
> That method is disastrous when two tokens have thousands of spaces
> between them, because split produces a huge list and hands it off to
> lsearch. The lsearch method runs 10 times longer under that load.
>
> Yes, 10 times longer! Try it.
>
>
>> Kaitzschu wrote:
>
>> which one is the best? ... "best" is subjective
>
> "Best" never collapses under load, and has competitive performance under
> typical loads.
My tables did show (set 3) that lsearch does collapse under ridiculously
sparse input data. That is what I meant by subjective: the "best"
tokenizing solution depends on the format of the input data. Maybe my
wording was a bit off, but that was the original meaning.
"Real engineers" do indeed deserve to be proud of their professional
successes. I entirely agree that graceful failure is desirable.
It's unfair to Schelte, Kaitzschu, et al., though, to imply that they
are mere programmers whose advice is likely to lead to breakdowns.
What I've seen in this thread is repeated emphasis that several
techniques satisfy underspecified requirements, and that rational
choice between them demands collection of more details.
Actually John Kelly seems to misunderstand what "real engineers" do. I
am an Engineer, not a programmer (yup, a BEng qualification, not a BSc,
and I am accredited, so there).
Real engineers do indeed build bridges the way Kaitzschu wrote that
code. Yes, real bridges DO collapse under load. Take for example the
well-designed Millennium Bridge in London. Drive an 18-wheeler over it
and it WILL collapse! That's because it is a pedestrian bridge, not
designed to handle cars and trucks, much less 18-wheelers. Of course
you CAN design a pedestrian bridge that can handle trucks, but you
don't have to. Besides, it would then be overkill.
This is similar to the choice of which algorithm to use. Given a
packed input data format the [lsearch [split]] code will scream past
[regexp] based code. There are real-world examples of such data
format: tab separated values files. On the other hand given a more
free form input format the [regexp] solution may be better.
Engineering is knowing which one to use given a specific problem.
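To make the packed-format point concrete (an illustrative TSV line, not from the thread):
set line "alpha\tbeta\t\tgamma"
lsearch -all -inline -exact -not [split $line \t] {}   ;# ==> alpha beta gamma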
> Using regsub with {2,} is looking pretty good ...
After digging in the docs, I found ctoken from TclX. No one mentioned
it; I guess it's not widely known.
For getting one token at a time, it wins the race. And when getting
all the tokens at once, it runs a close second to regsub {2,}.
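For the one-token-at-a-time case, a minimal sketch (note that ctoken consumes the token from the variable it is given):
package require Tclx
set s {  The    lazy dog.  }
ctoken s " \f\n\r\t\v"   ;# ==> The   ($s now holds the remainder)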
package require Tcl 8.4
package require Tclx
proc do1 {limit text} {
    set tokens [split [regsub -all {\s{2,}} [string trim $text] { }]]
}

proc do2 {limit text} {
    set tokens {}
    while {[string length [set token [ctoken text " \f\n\r\t\v"]]]} {
        lappend tokens $token
    }
}

proc do3 {limit text} {
    set tokens [regexp -all -inline {\S+} $text]
}
set limit 100000
set text { this a\aa bbb ccccc is oops"som{ e} wacky " text}
puts [time {do1 $limit $text} $limit]
puts [time {do2 $limit $text} $limit]
puts [time {do3 $limit $text} $limit]
33.81905 microseconds per iteration
47.48279 microseconds per iteration
56.11683 microseconds per iteration