Table entries extraction

hssig

Jan 7, 2011, 2:57:56 PM

Hi,

I am trying to extract the numbers 100, 200 and 300 with the following
regular expression:

set line "1 | 100 | 200 | 300 | none"
regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
puts "$first $second $third "

But instead of
100 200 300

I do get
| 300 300

Can someone explain why the outer ()+ does not repeat the inner
regular expression?

The following code works for a single value:

set line "1 | 100 | 200 | 300 | none"
regexp {\|\s+(\w+)\s+} $line dummy first
puts "$first "

The result is:
100

Cheers, hssig

Gerald W. Lester

Jan 7, 2011, 3:03:02 PM

Any reason you are not using [split] and then [string trim [lindex ...]] on
the result to get each element?

Or even [::csv::split] from the csv package of Tcllib, again with [string
trim [lindex ...]] on the result to get each element?
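
A minimal sketch of the [split] route, assuming the fields are separated by a literal | and that the wanted values are fields 1 to 3 (the variable names are only illustrative):

```tcl
set line "1 | 100 | 200 | 300 | none"

# Split on the pipe character, then strip the padding around each field.
set fields {}
foreach item [split $line |] {
    lappend fields [string trim $item]
}

# Fields are indexed from 0, so the three numbers sit at positions 1..3.
set first  [lindex $fields 1]
set second [lindex $fields 2]
set third  [lindex $fields 3]
puts "$first $second $third"   ;# 100 200 300
```

This avoids regular expressions altogether, which is usually the simpler choice when the input is plainly delimiter-separated.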

--
+------------------------------------------------------------------------+
| Gerald W. Lester, President, KNG Consulting LLC |
| Email: Gerald...@kng-consulting.net |
+------------------------------------------------------------------------+

hssig

Jan 8, 2011, 6:45:08 AM

Hi Gerald,

Your split approach does work; I had already tried it.

But out of curiosity, and to learn regular expressions, I wonder why the
following two variants do not behave the same way (variant 2 gives the
desired result):

1.

regexp {(\|\s+(\w+)\s+)+} $line dummy first second third

2.

regexp {\|\s+(\w+)\s+} $line dummy first

regexp {\|\s+\w+\s+\|\s+(\w+)\s+} $line dummy second
regexp {\|\s+\w+\s+\|\s+\w+\s+\|\s+(\w+)\s+} $line dummy third

Variant 1 is supposed to repeat the steps that variant 2 spells out
one after another.

Can you shed some light on it?

Cheers, hssig

tomas

Jan 8, 2011, 7:11:32 AM

hssig <hs...@gmx.net> writes:

> Hi,
>
> I am trying to extract the numbers 100 200 300
>
> with following regular expression:
>
> set line "1 | 100 | 200 | 300 | none"
> regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
> puts "$first $second $third "
>
> But instead of
> 100 200 300
>
> I do get
> | 300 300

Had you said

% puts "first='$first' second='$second' third='$third'"

you might have guessed what's going on:

=> first='| 300 ' second='300' third=''

So the first matchvar corresponds to the outer paren and the second
corresponds to the inner (of the last possible match):

regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
        ^1    ^2

Seems that (implicitly) the regexp is prepended with .* The manpage is
silent about that, but option -all gives a hint:

    -all    Causes the regular expression to be matched as many
            times as possible in the string, returning the total
            number of matches found. If this is specified with
            match variables, they will contain information for
            the last match only.

As Gerald says downthread, this seems a job for split. If you still want
to do it the Regexp Way, the -inline option seems to be for you. But
beware: if the regexp matches in one go, the subexpression variables will
still capture just the last occurrence:

% join [regexp -all -inline {(\|\s+(\w+)\s+)+} $line] ";"
=> | 100 | 200 | 300 ;| 300 ;300

Better is:

% join [regexp -all -inline {\|\s+(\w+)\s+} $line] ";"
=> | 100 ;100;| 200 ;200;| 300 ;300

Note that you'll have to "filter out" the whole matches, i.e. the odd
list entries.
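
To make the point about captures concrete, here is a small sketch (variable names are illustrative). The number of submatches is fixed by the number of parenthesised groups in the pattern, not by how often a quantifier repeats them, so a pattern with two groups never fills a third variable. One way to get three values from a single call is to write three groups explicitly:

```tcl
set line "1 | 100 | 200 | 300 | none"

# Two groups in the pattern: the result holds the whole match plus exactly
# two submatches, each with what it captured on the last repetition.
set caps [regexp -inline {(\|\s+(\w+)\s+)+} $line]
puts [llength $caps]    ;# 3: whole match, outer group, inner group
puts [lindex $caps 2]   ;# 300

# Three explicit groups: three submatches.
regexp {\|\s+(\w+)\s+\|\s+(\w+)\s+\|\s+(\w+)\s+} $line -> first second third
puts "$first $second $third"   ;# 100 200 300
```

The explicit version only works when the number of columns is known in advance; for a variable number of columns, -all -inline with a single group is the more flexible form.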

The problem with your approach seems to be that the regexp just "matches
once" and the match subvars are bound to the last occurrence of this
"one match".

Regards
-- tomás

Uwe Klein

Jan 8, 2011, 8:04:56 AM

tomas wrote:
> Seems that (implicitly) the regexp is prepended with .*

Emphatically No.
Any RE "pounces" on the part of a given string that matches.

You can adapt your RE by anchoring ( ^ $ ) to start/end of string
to force a match to the bounds of the string given as input.
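
A short sketch of the difference, using a made-up input string: without anchors the RE matches wherever it first can, while with ^ and $ it has to account for the whole string.

```tcl
set s "abc 123 def"

# Unanchored: finds the digits wherever they are in the string.
set found [regexp {\d+} $s match]
puts "$found $match"              ;# 1 123

# Anchored at both ends: the whole string must consist of digits.
set whole1 [regexp {^\d+$} $s]
set whole2 [regexp {^\d+$} "123"]
puts "$whole1 $whole2"            ;# 0 1
```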

Alexandre Ferrieux

Jan 8, 2011, 8:47:06 AM

On Jan 8, 2:04 pm, Uwe Klein <uwe_klein_habertw...@t-online.de> wrote:
> tomas wrote:
> > Seems that (implicitly) the regexp is prepended with .*
>
> Emphatically No.
> Any RE "pounces" on the part of a given string that matches.

Indeed. If one insists on emulating unanchored REs with an anchored
one, then ^.*?FOO is a better approximation to FOO:

regexp FOO(.) FOOAFOOB -> x;puts $x

A

regexp ^.*FOO(.) FOOAFOOB -> x;puts $x

B

regexp ^.*?FOO(.) FOOAFOOB -> x;puts $x

A

Meaning: the non-greedy quantifier lets the match land on the leftmost
occurrence (as in the unanchored case), while the greedy one insists on
swallowing as much prefix as possible and so yields only the rightmost
one.

One difference to keep in mind, though: a greedy unanchored RE is fine,
while an RE that explicitly mixes greedy and non-greedy quantifiers is
known to have problems.

-Alex

tomas

Jan 8, 2011, 9:20:31 AM

Uwe Klein <uwe_klein_...@t-online.de> writes:

> tomas wrote:
>> Seems that (implicitly) the regexp is prepended with .*
>
> Emphatically No.
> Any RE "pounces" on the part of a given string that matches.

You are of course right, as can be readily seen in the examples
upthread: the "whole match" starts at the beginning.

The "explanation" for the effect is that the sub-matches seem to forget
their earlier instances and just "keep" their last one (other than
e.g. in Perl, which keeps *all* sub-matches).

Thanks
-- tomás

tomas

Jan 8, 2011, 9:21:50 AM

Alexandre Ferrieux <alexandre...@gmail.com> writes:

> On Jan 8, 2:04 pm, Uwe Klein <uwe_klein_habertw...@t-online.de> wrote:
>> tomas wrote:
>> > Seems that (implicitly) the regexp is prepended with .*
>>
>> Emphatically No.
>> Any RE "pounces" on the part of a given string that matches.
>
> Indeed. If one insists on emulating unanchored REs with an anchored
> one, then ^.*?FOO is a better approximation to FOO:

Very much so

Regards
-- tomás

Alexandre Ferrieux

Jan 8, 2011, 9:46:47 AM

On Jan 8, 3:20 pm, tomas <to...@floh.bas23> wrote:

That's another story entirely. To get several matches, use the -all
flag to [regexp] or [regsub].
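
For illustration, a sketch of both commands with -all on the line from the original post; the replacement string passed to [regsub] is an arbitrary choice:

```tcl
set line "1 | 100 | 200 | 300 | none"

# Without -inline, regexp -all returns the number of matches.
set count [regexp -all {\|\s+(\w+)\s+} $line]
puts $count   ;# 3

# regsub -all rewrites every match; \1 refers to the captured word.
set out [regsub -all {\|\s+(\w+)\s+} $line {<\1>}]
puts $out     ;# 1 <100><200><300>| none
```

The trailing "| none" is left alone because the pattern requires whitespace after the word and the string ends there.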

And please stop using "seem" when describing fully documented and
deterministic behavior. This term is reserved for bugs and other areas
of experimental science ;-)

-Alex

tomas

Jan 8, 2011, 12:08:49 PM

Alexandre Ferrieux <alexandre...@gmail.com> writes:

> On Jan 8, 3:20 pm, tomas <to...@floh.bas23> wrote:
>> Uwe Klein <uwe_klein_habertw...@t-online.de> writes:
>> > tomas wrote:
>> >> Seems that (implicitly) the regexp is prepended with .*

> That's another story entirely. To get several matches, use the -all
> flag to [regexp] or [regsub].

As I put in my original post, right. The confusion stemmed from another
detail (only keeping last match for sub-matches).

> And please stop using "seem" when describing fully documented and
> deterministic behavior. This term is reserved for bugs and other areas
> of experimental science ;-)

:-)

...or for limited knowledge of the poster (actually was the case here).

Thanks
-- tomás

hssig

Jan 10, 2011, 3:55:16 AM

>% join [regexp -all -inline {\|\s+(\w+)\s+} $line] ";"
> => | 100 ;100;| 200 ;200;| 300 ;300
>
>Note that you'll have to "filter out" the whole matches, i.e. the odd
>list entries.

How would you do that? With a split operation? :-)

Cheers, hssig

tomas

Jan 10, 2011, 5:32:55 AM

hssig <hs...@gmx.net> writes:

As far as I know, split just splits a string into a list. What we have
here is already a list. You'd want:

% set line "1 | 100 | 200 | 300 | none"
% set result {}
% foreach {x y} [regexp -all -inline {\|\s+(\w+)\s+} $line] {
      lappend result $y ;# x is the whole match, discard it
  }
% puts $result
==> 100 200 300

See <http://wiki.tcl.tk/12574> or <http://wiki.tcl.tk/12850> if you are
in search of something less "pedestrian".

Note that I'm just (re-)learning tcl, so I hope someone more experienced
will chime in.

Regards
-- tomás

Uwe Klein

Jan 10, 2011, 6:19:25 AM

tomas wrote:
> hssig <hs...@gmx.net> writes:
>
>
>>>% join [regexp -all -inline {\|\s+(\w+)\s+} $line] ";"
>>> => | 100 ;100;| 200 ;200;| 300 ;300
>>>
>>>Note that you'll have to "filter out" the whole matches, i.e. the odd
>>>list entries.

There is a lot of potential for damage to be done by drilling holes in kneecaps ;-)

>>
>>How would you do that ? With split operation ? :-)
>
>
> As far as I know, split just splits a string into a list. What we have
> here is already a list. You'd want:
>
> % set line "1 | 100 | 200 | 300 | none"

list? yeah, but the wrong list.

set line "1 | 100 | 200 | 300 | none"

set list [ split $line | ]

# optional: get rid of surrounding spaces, tabs ..
set list2 $list ; unset list
foreach item $list2 {
    lappend list [ string trim $item ]
}
# endoptional

set start 1
set end 3
set result [ lrange $list $start $end ]

uwe

tomas

Jan 10, 2011, 9:56:16 AM

Uwe Klein <uwe_klein_...@t-online.de> writes:

> tomas wrote:
>> hssig <hs...@gmx.net> writes:
>>
>>
>>>>% join [regexp -all -inline {\|\s+(\w+)\s+} $line] ";"
>>>> => | 100 ;100;| 200 ;200;| 300 ;300
>>>>
>>>>Note that you'll have to "filter out" the whole matches, i.e. the odd
>>>>list entries.
> There is a lot of potential for damage to be done by drilling holes in
> kneecaps ;-)

My doctor did, and since then my knee is much better ;-)

But seriously: I thought the OP wanted to know how to "filter out" the
odd entries of the result.

Further upthread consensus was already there that starting off with
split might be the best option for this given string structure.

Regards
-- tomás