set str " hello tcl world "
set lst [split [ regsub -all {[\s]+} [string trim $str { }] { } ]]
Is there any efficient/clever way to do this?
I usually use the following code:
package require Itcl ;# Itcl adds object-oriented facilities to Tcl.
itcl::class ELEP::DocumentUtilities::NaiveTokenizer {
    constructor {} {
    };# constructor

    proc tokenize {text} {
        set tcl_symbols_re_pattern {[;,.'\"?]}
        set tokens {}
        set length [string length $text]
        ## We have to find the first word...
        set start 0
        set end [tcl_endOfWord $text $start]
        if {[tcl_startOfNextWord $text $start] < $end} {
            set start [tcl_startOfNextWord $text $start]
        }
        set textStart $start
        set token [string trim [string range $text $start $end]]
        regsub -all -- $tcl_symbols_re_pattern $token {} token
        if {[string length $token]} {lappend tokens $token}
        ## And now iterate over all other words...
        while {$start >= 0} {
            set start [tcl_startOfNextWord $text $start]
            set end [tcl_endOfWord $text $start]
            if {$start != -1 && $end != -1} {
                set token [string trim [string range $text $start $end]]
                regsub -all -- $tcl_symbols_re_pattern $token {} token
                if {[string length $token]} {lappend tokens $token}
            }
        }
        return $tokens
    };# tokenize
};# class ELEP::DocumentUtilities::NaiveTokenizer
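For illustration (this call is mine, not from the original application), the class proc can be invoked through its namespace; the exact tokens depend on the platform's tcl_wordchars rules, but typically:

% ELEP::DocumentUtilities::NaiveTokenizer::tokenize "Hello, tcl world."
Hello tcl world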
George
What about: set lst [lsearch -inline -not -all [split $str] {}]
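With the string exactly as quoted at the top of the thread, this works because [split] produces empty elements where the separators touch the edges, and the [lsearch] filter then drops them:

% split $str
{} hello tcl world {}
% lsearch -inline -not -all [split $str] {}
hello tcl world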
--
Zbigniew
My favourite technique is this:
set lst [regexp -all -inline {\S+} $str]
That works by returning ("inline") a list of all the non-empty sequences
of non-whitespace characters. That's probably what you wanted, and it's
also wonderfully short...
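Applied to the string exactly as quoted in the original post, for instance:

% regexp -all -inline {\S+} " hello tcl world "
hello tcl world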
Donal.
For my purposes this is the best; you won.
Thanks.
pmarin.
Dear Cameron,
Regarding the heavy Itcl requirement: it's just code I took from an
existing application. I got the feeling that the author of the original
post was in a hurry (3 posts in about 6 minutes), so I pasted a quick
answer, leaving customization to the poster.
And from the description I also got the impression that something more
"complex" was required than a simple split on spaces, so sending code
that imitates how the text widget selects words sounded like a good idea...
George
Well, my 3 posts in about 6 minutes were because the internet connection
I steal from my neighbor is sometimes a little weak ;)
Sorry about that.
Pmarin.
Assuming the "words" are strictly alphanumeric (or rather, that they don't
contain any characters reserved for Tcl syntax such as {, ", or [ ), then
$str can be interpreted as a list with no further effort:
set str " hello tcl world "
string is list $str
-> 1
If you want to force it into a list representation internally, you
could use lrange:
set lst [lrange $str 0 end]
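For instance, on the original string this yields a normalized list:

% lrange " hello tcl world " 0 end
hello tcl world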
Regards,
Twylite
I think Twylite is correct. In fact, if the string isn't in the form
of a Tcl list, no amount of magic could possibly turn it into a list.
If it is in the form of a Tcl list (even though it might be stored as a
string, or look like a non-normalized list), then any way of turning it
into a list must agree with what the Tcl parser produces when told that
the string is a list, e.g. [lindex " hello tcl world " 2] == world.
However, DKF's code produces a normalized list. Here is the same thing
using [lsearch]:
% set lst [lsearch -inline -all " hello tcl world " *]
hello tcl world
% llength $lst
3
The [lsearch] version is about 6x faster than the regexp one. The only
thing either does is normalize a string which could be a list.
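If you want to reproduce that comparison, a minimal sketch using [time]
(the iteration count is arbitrary, and the absolute figures will vary
with machine and Tcl version):

set str " hello tcl world "
puts [time {regexp -all -inline {\S+} $str} 100000]
puts [time {lsearch -inline -all $str *} 100000]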
All these are very interesting tricks, but maybe what we are missing is
an option for split that does not return empty elements?
The obvious solution to this task should have been split, and not regexp
or lsearch...
George
Is it really a trick to use [lsearch]? The whole point of the -all
option is to return a list (even if it contains only one element), and
-inline returns the actual element and not the index of the match. The
effect is a filter which produces a new list from the current list
(like a where clause in SQL).
But [split] only returns empty elements if the split chars are
directly next to each other, regardless of the char. It seems
confusing to make an exception in the case of whitespace chars.
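For instance:

% split "a,,b" ,
a {} b
% split ",a,b," ,
{} a b {}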
Why an exception for whitespace chars? I am arguing for a new split
switch that will not return empty list elements, something like:
split -noemptyelements text split_chars
Empty list elements would be removed, no matter what split chars are in use...
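Until something like that exists, a small helper along these lines gives
the proposed behaviour (the name splitne is made up here; needs 8.5 for {*}):

proc splitne {text args} {
    # Split as usual (default whitespace rules if no splitChars given),
    # then drop every empty element, whatever the split chars were.
    return [lsearch -inline -not -all [split $text {*}$args] {}]
}

% splitne " hello tcl world "
hello tcl world
% splitne "a,,b" ,
a b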
For such a trivial and frequent task Tcl should not need efficient
alternatives and tricks. It should be obvious to everybody how the task
can be done.
George
No, there is an important difference between the [regexp] solution and
the [lsearch] solution. [regexp -all -inline] expects a string and
returns a list. [lsearch -all -inline] expects a list and returns a
list. So, [regexp] will succeed for strings like "bizarre { string"
while [lsearch] will fail.
I like this suggestion.
Sorry, but:
% string is list "not { a { list"
0
% set is_a_list [regexp -inline -all {\S+} "not { a { list"]
not \{ a \{ list
% lindex $is_a_list 4
list
Any string can be converted into a list. It just depends on how you
parse it.
You are absolutely right: any string can be converted into a list.
Then what? My assumption is that the string represents something more
than a string. Is this representation more apparent with whitespace
removed? Is a list of strings (each list element is a string) easier
to work with than the original string?
> The obvious solution to this task should have been split, and not regexp
> or lsearch...
Command splitx from tcllib.
Particularly, from the textutil package.
--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous
That is not true. Donal's regexp answer produces lists from non-list
strings, as [split] or [splitx] also do.
However, I suspect that most times this issue comes up, people would
actually like the effect of simply treating the string as a list. Then
the list operation ([lrange]) is important as a *test* of whether the
string can indeed be handled as a list, to be used in [catch].
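A sketch of that idea (my illustration of the approach): wrap the list
operation in [catch] and fall back to plain word-splitting when the
string is not a well-formed list:

if {[catch {lrange $str 0 end} lst]} {
    # $str is not a well-formed list; fall back to splitting on whitespace.
    set lst [regexp -all -inline {\S+} $str]
}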
> However, DKF's code produces a normalized list. Here is the same thing
> using [lsearch]:
>
> % set lst [lsearch -inline -all " hello tcl world " *]
This lsearch answer is the same as [lrange], as both require a valid
list for input. The regexp answer, while slower, operates on all
strings.
> The [lsearch] version is about 6x faster than regexp. The only thing
> either do is to normalize a string which could be a list.
Not true.
Donald Arseneau
It's also easy to adapt to other interpretations of "word", whereas
anything that directly uses lists ([lrange], [lsearch], etc.) is going
to come unstuck for some inputs.
% set str "this  \{  is \{not a\} list"
this  {  is {not a} list
% split $str
this {} \{ {} is \{not a\} list
% regexp -all -inline {\S+} $str
this \{ is \{not a\} list
% lrange $str 0 end
unmatched open brace in list
% package require textutil
0.7.1
% textutil::splitx $str
this \{ is \{not a\} list
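And to illustrate the "other interpretations of word" point above, the
pattern is easy to change; for instance, counting only alphanumeric runs
as words:

% regexp -all -inline {\w+} $str
this is not a list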
Which is best? It all depends on what you want to do and whether what
you have is really a Tcl list in the first place or a string or record
that needs splitting up. (The [split] command is really intended for
records like /etc/passwd entries.)
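For instance, a passwd-style record splits cleanly on its field separator:

% split "root:x:0:0:root:/root:/bin/bash" :
root x 0 0 root /root /bin/bash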
Donal.
> However, I suspect that most times this issue comes up, people would
> actually like the effect of simply treating the string as a list. Then
> the list operation ([lrange]) is important as a *test* of whether the
> string can indeed be handled as a list, to be used in [catch].
As of Tcl 8.5, the best way to know if a string can be treated as a
list is [string is list $string].
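Combining that with the earlier suggestions, one possible idiom (a sketch;
pick whatever fallback suits your data):

if {[string is list $str]} {
    set lst [lrange $str 0 end]               ;# already a valid list; just normalize it
} else {
    set lst [regexp -all -inline {\S+} $str]  ;# not a list; split on whitespace instead
}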
Aric