I am trying to extract the numbers 100, 200 and 300
with the following regular expression:
set line "1 | 100 | 200 | 300 | none"
regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
puts "$first $second $third "
But instead of
100 200 300
I do get
| 300 300
Can someone explain to me why the outer ()+ does not repeat the inner
regular expression?
The following code works for a single value:
set line "1 | 100 | 200 | 300 | none"
regexp {\|\s+(\w+)\s+} $line dummy first
puts "$first "
The result is:
100
Cheers, hssig
Or even [::csv::split] from the csv package of TclLib -- again with [string
trim [lindex ...]] on the result to get the element?
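A minimal sketch of the split-based idea, using the built-in [split] rather than [::csv::split] so that it runs without TclLib. The variable names (fields, first, ...) are only illustrative, and the field positions 1 to 3 are an assumption based on the sample line in this thread.

```tcl
set line "1 | 100 | 200 | 300 | none"

# Split on the pipe character; every field keeps its surrounding blanks.
set fields [split $line |]

# Pick a field by position and trim the blanks off it.
set first  [string trim [lindex $fields 1]]
set second [string trim [lindex $fields 2]]
set third  [string trim [lindex $fields 3]]

puts "$first $second $third"
```

Because no regular expression is involved, there is no question of which capture ends up in which variable.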
--
+------------------------------------------------------------------------+
| Gerald W. Lester, President, KNG Consulting LLC |
| Email: Gerald...@kng-consulting.net |
+------------------------------------------------------------------------+
Your split approach does work; I had already tried it before.
But out of curiosity, and to learn regular expressions, I wonder why the
following expressions do not do the same thing (2. implements the
desired behavior):
1.
regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
2.
regexp {\|\s+(\w+)\s+} $line dummy first
regexp {\|\s+\w+\s+\|\s+(\w+)\s+} $line dummy second
regexp {\|\s+\w+\s+\|\s+\w+\s+\|\s+(\w+)\s+} $line dummy third
Variant 1. is supposed to repeat the steps which are coded in a
successive manner in 2.
Can you shed some light on it?
Cheers, hssig
> Hi,
>
> I am trying to extract the numbers 100 200 300
>
> with following regular expression:
>
> set line "1 | 100 | 200 | 300 | none"
> regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
> puts "$first $second $third "
>
> But instead of
> 100 200 300
>
> I do get
> | 300 300
Had you said
% puts "first='$first' second='$second' third='$third'"
you might have guessed what's going on:
=> first='| 300 ' second='300' third=''
So the first matchvar corresponds to the outer paren and the second
corresponds to the inner (of the last possible match):
regexp {(\|\s+(\w+)\s+)+} $line dummy first second third
        ^1    ^2
Seems that (implicitly) the regexp is prepended with .* The manpage is
silent about that, but option -all gives a hint:
-all Causes the regular expression to be matched as many
times as possible in the string, returning the total
number of matches found. If this is specified with
match variables, they will contain information for
the last match only.
As Gerald says downthread, this seems a job for split. If you still want
to do it the Regexp Way, -inline seems to be for you. But beware -- if
the regexp matches in one go, the subexp vars will still be capturing
just the last occurrence:
% join [regexp -all -inline {(\|\s+(\w+)\s+)+} $line] ";"
=> | 100 | 200 | 300 ;| 300 ;300
Better is:
% join [regexp -all -inline {\|\s+(\w+)\s+} $line] ";"
=> | 100 ;100;| 200 ;200;| 300 ;300
Note that you'll have to "filter out" the whole matches, i.e. the odd
list entries.
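If the filtering step is unwelcome, another option is to apply the single-field pattern repeatedly and restart the search after each match. The following is only a sketch: it relies on the -start and -indices options of [regexp] and on {*} expansion (Tcl 8.5 or later), and the names nums, start, whole and num are made up for illustration.

```tcl
set line "1 | 100 | 200 | 300 | none"

set nums  {}
set start 0
# Each pass finds the next "| word " after $start. With -indices the match
# variables hold {first last} character positions instead of the text.
while {[regexp -start $start -indices {\|\s+(\w+)\s+} $line whole num]} {
    lappend nums [string range $line {*}$num]
    # Resume just past the end of this match.
    set start [expr {[lindex $whole 1] + 1}]
}

puts $nums
```

The loop stops at "| none" because the pattern requires whitespace after the word and the string ends there.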
The problem with your approach seems to be that the regexp just "matches
once" and the match subvars are bound to the last occurrence of this
"one match".
Regards
-- tomás
Emphatically No.
Any RE "pounces" on the part of a given string that matches.
You can adapt your RE by anchoring ( ^ $ ) to start/end of string
to force a match to the bounds of the string given as input.
Indeed. If one insists on emulating unanchored REs with an anchored
one, then ^.*?FOO is a better approximation to FOO:
regexp FOO(.) FOOAFOOB -> x;puts $x
A
regexp ^.*FOO(.) FOOAFOOB -> x;puts $x
B
regexp ^.*?FOO(.) FOOAFOOB -> x;puts $x
A
Meaning: the non-greedy quantifier lets the match land on the leftmost
occurrence (as in the unanchored case), while the greedy one will insist
on swallowing as much prefix as possible, and only yield the rightmost
one.
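To see where each variant actually lands, one can ask for positions instead of text. This is just a sketch using the -inline and -indices options of [regexp]; the variable names s, spans and idx are illustrative.

```tcl
set s FOOAFOOB
set spans {}

foreach re {{FOO(.)} {^.*FOO(.)} {^.*?FOO(.)}} {
    # idx is a list of {first last} pairs: the whole match, then the capture.
    set idx [regexp -inline -indices $re $s]
    puts [format "%-12s whole=%s capture=%s" $re [lindex $idx 0] [lindex $idx 1]]
    lappend spans [lindex $idx 0]
}
```

The whole match covers characters 0 to 3 for the first and third patterns and 0 to 7 for the greedy one, which is why the capture is A, B and A respectively.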
One difference to keep in mind though, is that a greedy unanchored RE
is okay, while an explicitly mixed-greediness RE is known to have
problems.
-Alex
> tomas wrote:
>> Seems that (implicitly) the regexp is prepended with .*
>
> Emphatically No.
> Any RE "pounces" on the part of a given string that matches.
You are of course right, as can be readily seen in the examples
upthread: the "whole match" starts at the beginning.
The "explanation" for the effect is that the sub-matches seem to forget
their earlier instances and just "keep" their last one (other than
e.g. in Perl, which keeps *all* sub-matches).
Thanks
-- tomás
> On Jan 8, 2:04 pm, Uwe Klein <uwe_klein_habertw...@t-online.de> wrote:
>> tomas wrote:
>> > Seems that (implicitly) the regexp is prepended with .*
>>
>> Emphatically No.
>> Any RE "pounces" on the part of a given string that matches.
>
> Indeed. If one insists on emulating unanchored REs with an anchored
> one, then ^.*?FOO is a better approximation to FOO:
Very much so
Regards
-- tomás
That's another story entirely. To get several matches, use the -all
flag to [regexp] or [regsub].
And please stop using "seem" when describing fully documented and
deterministic behavior. This term is reserved for bugs and other areas
of experimental science ;-)
-Alex
> On Jan 8, 3:20 pm, tomas <to...@floh.bas23> wrote:
>> Uwe Klein <uwe_klein_habertw...@t-online.de> writes:
>> > tomas wrote:
>> >> Seems that (implicitly) the regexp is prepended with .*
> That's another story entirely. To get several matches, use the -all
> flag to [regexp] or [regsub].
As I put in my original post, right. The confusion stemmed from another
detail (only keeping last match for sub-matches).
> And please stop using "seem" when describing fully documented and
> deterministic behavior. This term is reserved for bugs and other areas
> of experimental science ;-)
:-)
...or for limited knowledge of the poster (actually was the case here).
Thanks
-- tomás
How would you do that ? With split operation ? :-)
Cheers, hssig
As far as I know, split just splits a string into a list. What we have
here is already a list. You'd want:
% set line "1 | 100 | 200 | 300 | none"
% set result {}
% foreach {x y} [regexp -all -inline {\|\s+(\w+)\s+} $line] {
      lappend result $y ;# discard x
  }
% puts $result
==> 100 200 300
See <http://wiki.tcl.tk/12574> or <http://wiki.tcl.tk/12850> if you are
looking for something less "pedestrian".
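One less pedestrian variant, as a sketch only: it assumes Tcl 8.6 or later, where [lmap] is built in. It is not taken from the wiki pages above. The loop variables whole and num are illustrative names.

```tcl
set line "1 | 100 | 200 | 300 | none"

# lmap walks the flat {whole capture whole capture ...} list two items at a
# time and collects the value of the body, i.e. only the captures.
set result [lmap {whole num} [regexp -all -inline {\|\s+(\w+)\s+} $line] {set num}]

puts $result
```

This does the same filtering as the foreach loop, just without the explicit accumulator.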
Note that I'm just (re-)learning tcl, so I hope someone more experienced
will chime in.
Regards
-- tomás
list? yeah, but the wrong list.
set line "1 | 100 | 200 | 300 | none"
set list [ split $line | ]
# optional for getting rid of spaces, tabs ..
set list2 $list ; unset list
foreach item $list2 {
    lappend list [ string trim $item ]
}
# endoptional
# elements 1..3 are the three numbers
set start 1
set end 3
set result [ lrange $list $start $end ]
uwe
> tomas wrote:
>> hssig <hs...@gmx.net> writes:
>>
>>
>>>>% join [regexp -all -inline {\|\s+(\w+)\s+} $line] ";"
>>>> => | 100 ;100;| 200 ;200;| 300 ;300
>>>>
>>>>Note that you'll have to "filter out" the whole matches, i.e. the odd
>>>>list entries.
> There is a lot of potential for damage to be done by drilling holes in
> kneecaps ;-)
My doctor did, and since then my knee is much better ;-)
But seriously: I thought the OP wanted to know how to "filter out" the
odd entries of the result.
Further upthread there was already consensus that starting off with
split might be the best option for this given string structure.
Regards
-- tomás