set str " hello tcl world "
set lst [split [ regsub -all {[\s]+} [string trim $str { }] { } ]]
Is there any efficient/clever way to do this?
I usually use the following code:
package require Itcl ;# Itcl adds object-oriented facilities to Tcl.
itcl::class ELEP::DocumentUtilities::NaiveTokenizer {
    constructor {} {
    };# constructor

    proc tokenize {text} {
        set tcl_symbols_re_pattern {[;,.'\"?]}
        set tokens {}
        set length [string length $text]
        ## We have to find the first word...
        set start 0
        set end [tcl_endOfWord $text $start]
        if {[tcl_startOfNextWord $text $start] < $end} {
            set start [tcl_startOfNextWord $text $start]
        }
        set textStart $start
        set token [string trim [string range $text $start $end]]
        regsub -all -- $tcl_symbols_re_pattern $token {} token
        if {[string length $token]} {lappend tokens $token}
        ## And now iterate over all other words...
        while {$start >= 0} {
            set start [tcl_startOfNextWord $text $start]
            set end [tcl_endOfWord $text $start]
            if {$start != -1 && $end != -1} {
                set token [string trim [string range $text $start $end]]
                regsub -all -- $tcl_symbols_re_pattern $token {} token
                if {[string length $token]} {lappend tokens $token}
            }
        }
        return $tokens
    };# tokenize
};# class ELEP::DocumentUtilities::NaiveTokenizer
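For illustration (this call is mine, not from the original application), the class proc can be invoked through its namespace; the exact tokens depend on the platform's tcl_wordchars rules, but typically:

% ELEP::DocumentUtilities::NaiveTokenizer::tokenize "Hello, tcl world."
Hello tcl world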
George
What about: set lst [lsearch -inline -not -all [split $str] {}]
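With the string exactly as quoted at the top of the thread, this works because [split] produces empty elements where the separators touch the edges, and the [lsearch] filter then drops them:

% split $str
{} hello tcl world {}
% lsearch -inline -not -all [split $str] {}
hello tcl world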
--
Zbigniew
My favourite technique is this:
set lst [regexp -all -inline {\S+} $str]
That works by returning ("inline") a list of all the non-empty sequences
of non-whitespace characters. That's probably what you wanted, and it's
also wonderfully short...
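Applied to the string exactly as quoted in the original post, for instance:

% regexp -all -inline {\S+} " hello tcl world "
hello tcl world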
Donal.
For my purposes this is the best; you won.
Thanks.
pmarin.
Dear Cameron,
Regarding the heavy Itcl requirement: it's just code I took from an
existing application. I got the feeling that the author of the original
post was in a hurry (3 posts in about 6 minutes), so I pasted a quick
answer, leaving customization to the poster.
And from the description I also got the impression that something more
"complex" was required than a simple split on spaces, so sending code
that imitates how the text widget selects words sounded like a good idea...
George
Well, my 3 posts in about 6 minutes were because the internet connection
I steal from my neighbor is sometimes a little weak ;)
Sorry about that.
Pmarin.
Assuming the "words" are strictly alphanumeric (or rather, that they don't
contain any characters reserved for Tcl syntax such as {, ", or [ ), then
$str can be interpreted as a list with no further effort:
set str " hello tcl world "
string is list $str
-> 1
If you want to force it into a list representation internally, you
could use lrange:
set lst [lrange $str 0 end]
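For instance, on the original string this yields a normalized list:

% lrange " hello tcl world " 0 end
hello tcl world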
Regards,
Twylite
I think Twylite is correct. In fact, if the string isn't in the form
of a Tcl list, no amount of magic could possibly turn it into a list.
If it is in the form of a Tcl list (even though it might be stored as a
string, or look like a non-normalized list), then any way of turning it
into a list must agree with what the Tcl parser produces when told that
the string is a list, e.g. [lindex " hello tcl world " 2] == world.
However, DKF's code produces a normalized list. Here is the same thing
using [lsearch]:
% set lst [lsearch -inline -all " hello tcl world " *]
hello tcl world
% llength $lst
3
The [lsearch] version is about 6x faster than the regexp one. The only
thing either does is normalize a string which could be a list.
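If you want to reproduce that comparison, a minimal sketch using [time]
(the iteration count is arbitrary, and the absolute figures will vary
with machine and Tcl version):

set str " hello tcl world "
puts [time {regexp -all -inline {\S+} $str} 100000]
puts [time {lsearch -inline -all $str *} 100000]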
All these are very interesting tricks, but maybe what we are missing is
an option for split that does not return empty elements?
The obvious solution to this task should have been split, and not regexp
or lsearch...
George
Is it really a trick to use [lsearch]? The whole point of the -all
option is to return a list (even if it contains only one element), and
-inline returns the actual element and not the index of the match. The
effect is a filter which produces a new list from the current list
(like a where clause in SQL).
But [split] only returns empty elements if the split chars are
directly next to each other, regardless of the char. It seems
confusing to make an exception in the case of whitespace chars.
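For instance:

% split "a,,b" ,
a {} b
% split ",a,b," ,
{} a b {}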
Why an exception for whitespace chars? I am arguing for a new split
switch that will not return empty list elements, something like:
split -noemptyelements text split_chars
Empty list elements would be removed, no matter what split chars are in use...
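Until something like that exists, a small helper along these lines gives
the proposed behaviour (the name splitne is made up here; needs 8.5 for {*}):

proc splitne {text args} {
    # Split as usual (default whitespace rules if no splitChars given),
    # then drop every empty element, whatever the split chars were.
    return [lsearch -inline -not -all [split $text {*}$args] {}]
}

% splitne " hello tcl world "
hello tcl world
% splitne "a,,b" ,
a b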
For such a trivial and frequent task Tcl should not need efficient
alternatives and tricks. It should be obvious to everybody how the task
can be done.
George
No, there is an important difference between the [regexp] solution and
the [lsearch] solution. [regexp -all -inline] expects a string and
returns a list. [lsearch -all -inline] expects a list and returns a
list. So, [regexp] will succeed for strings like "bizarre { string"
while [lsearch] will fail.
I like this suggestion.
Sorry, but:
% string is list "not { a { list"
0
% set is_a_list [regexp -inline -all {\S+} "not { a { list"]
not \{ a \{ list
% lindex $is_a_list 4
list
Any string can be converted into a list. It just depends on how you
parse it.
You are absolutely right: any string can be converted into a list.
Then what? My assumption is that the string represents something more
than a string. Is this representation more apparent with whitespace
removed? Is a list of strings (each list element is a string) easier
to work with than the original string?
> The obvious solution to this task should have been split, and not regexp
> or lsearch...
Command splitx from tcllib.
Particularly, from the textutil package.
--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous
That is not true. Donal's regexp answer produces lists from non-list
strings, as [split] or [splitx] also do.
However, I suspect that most times this issue comes up, people would
actually like the effect of simply treating the string as a list. Then
the list operation ([lrange]) is important as a *test* of whether the
string can indeed be handled as a list, to be used in [catch].
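A sketch of that idea (my illustration of the approach): wrap the list
operation in [catch] and fall back to plain word-splitting when the
string is not a well-formed list:

if {[catch {lrange $str 0 end} lst]} {
    # $str is not a well-formed list; fall back to splitting on whitespace.
    set lst [regexp -all -inline {\S+} $str]
}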
> However, DKF's code produces a normalized list. Here is the same thing
> using [lsearch]:
>
> % set lst [lsearch -inline -all " hello tcl world " *]
This lsearch answer is the same as [lrange], as both require a valid
list for input. The regexp answer, while slower, operates on all
strings.
> The [lsearch] version is about 6x faster than regexp. The only thing
> either do is to normalize a string which could be a list.
Not true.
Donald Arseneau
It's also easy to adapt to other interpretations of "word", whereas
anything that directly uses lists ([lrange], [lsearch], etc.) is going
to come unstuck for some inputs.
% set str "this  \{  is \{not a\} list"
this  {  is {not a} list
% split $str
this {} \{ {} is \{not a\} list
% regexp -all -inline {\S+} $str
this \{ is \{not a\} list
% lrange $str 0 end
unmatched open brace in list
% package require textutil
0.7.1
% textutil::splitx $str
this \{ is \{not a\} list
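And to illustrate the "other interpretations of word" point above, the
pattern is easy to change; for instance, counting only alphanumeric runs
as words:

% regexp -all -inline {\w+} $str
this is not a list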
Which is best? It all depends on what you want to do and whether what
you have is really a Tcl list in the first place or a string or record
that needs splitting up. (The [split] command is really intended for
records like /etc/passwd entries.)
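For instance, a passwd-style record splits cleanly on its field separator:

% split "root:x:0:0:root:/root:/bin/bash" :
root x 0 0 root /root /bin/bash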
Donal.
> However, I suspect that most times this issue comes up, people would
> actually like the effect of simply treating the string as a list. Then
> the list operation ([lrange]) is important as a *test* of whether the
> string can indeed be handled as a list, to be used in [catch].
As of Tcl 8.5, the best way to know if a string can be treated as a
list is [string is list $string].
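Combining that with the earlier suggestions, one possible idiom (a sketch;
pick whatever fallback suits your data):

if {[string is list $str]} {
    set lst [lrange $str 0 end]               ;# already a valid list; just normalize it
} else {
    set lst [regexp -all -inline {\S+} $str]  ;# not a list; split on whitespace instead
}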
Aric