Need help with regexp formulation

Juge

Jul 20, 2015, 6:35:40 AM

I have some equations and I want to read out the variables from them.
The format can get quite complex, since I am linking to another program which has keywords for integration and the like.

I have curvenames that have a format like:
curve_1.x
anothercurve.x
yetanother.y

.x and .y are always at the end of the curvename indicating x or y vector.
Now my equation may look like:
(curve.y*curve_2.y)/curve.x

or interp(curve.x,curve.y)

So I am actually interested in the strings ending with .x or .y.
The string itself can contain alphanumeric characters or underscores.
Before the string there may be an operator (+ - / *),
a comma, or an open parenthesis (.

In the end I would like to have all matches. For example, from
(curve.y*curve_2.y)/curve.x the output should be the list:
curve
curve_2
curve

What I came up with after experimenting is:

set equation_x {(curve.y*curve_2.y)/curve.x}
set text "Some arbitrary text which might include \$ or {"
set wordList [regexp -all -inline -- {([a-zA-Z0-9_]*\.y?|[a-zA-Z0-9_]*\.x?)} $equation_x]
foreach word $wordList {
    puts $word
}

It works, except that I get everything twice and I would like to strip the .x or .y out of the output.

Arjen Markus

Jul 20, 2015, 6:59:35 AM

On Monday, July 20, 2015 at 12:35:40 UTC+2, Juge wrote:

I am not sure how you would do that in one go, but the regular expression below is slightly more compact and by using [lsearch] you get only the ones you are interested in:

set wordList [lsearch -not -all -inline [regexp -all -inline -- {([A-Za-z_0-9]+)(\.x|\.y)} $equation_x] {*.*}]

Regards,

Arjen

heinrichmartin

Jul 21, 2015, 4:21:16 AM

May I suggest {(\w+)\.[xy]}. You used \w* which makes .x and .y valid names ...

-inline returns both \0 (the whole match) and \1 (the capture group) for every match. As alternatives to [lsearch], you could consider(*):

lsearch -not -all -inline [regexp -all -inline -- {(\w+)\.[xy]} $equation_x] *.*
dict values [regexp -all -inline -- {(\w+)\.[xy]} $equation_x] ;# gives unique results!
lmap {0 1} [regexp -all -inline -- {(\w+)\.[xy]} $equation_x] {set 1}

Performance is the same for your example. I checked it for 10000 entries, too:

# init random list
set x [list]
for {set i 0} {$i < 10000} {incr i} {
    set c [string replace [::tcl::mathfunc::rand] 0 1]
    lappend x $c.x $c
}

time {lsearch -not -all -inline $x *.*} 1000
1388.135 microseconds per iteration
time {lsearch -not -all -inline $x *.*} 1000
1348.472 microseconds per iteration

time {dict values $x} 1000
275.311 microseconds per iteration
time {dict values $x} 1000
265.618 microseconds per iteration

time {lmap {0 1} $x {set 1}} 1000
5018.793 microseconds per iteration
time {lmap {0 1} $x {set 1}} 1000
5272.107 microseconds per iteration

It is interesting that
* [lsearch] (another search after regexp) outperforms [lmap] which is the semantically "correct" way to extract every other element from a list.
* the list-to-dict conversion is even faster.

raks...@gmail.com

Jul 21, 2015, 6:36:36 AM

I am not able to reproduce the results that you show:

dict values [regexp -all -inline -- {(\w+)\.[xy]} $equation_x];

When I run it, I get the following result:
curve curve_2 curve

so basically the uniquification did not happen.

-Anirudh

Schelte Bron

Jul 21, 2015, 11:21:01 AM

heinrichmartin wrote:
> May I suggest {(\w+)\.[xy]}. You used \w* which makes .x and .y
> valid names ...
>
> --inline gives \0 and \1. As alternatives to [lsearch], you could
> consider(*):
>
> lsearch -not -all -inline [regexp -all -inline -- {(\w+)\.[xy]} $equation_x] *.*
> dict values [regexp -all -inline -- {(\w+)\.[xy]} $equation_x] ;# gives unique results!
> lmap {0 1} [regexp -all -inline -- {(\w+)\.[xy]} $equation_x] {set 1}
>

Or make regexp return just the part you're interested in (I added
the \M so it doesn't match things like "fooled.you"):
regexp -all -inline {\w+(?=\.[xy]\M)} $equation_x

Schelte
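
To illustrate the lookahead approach above, here is a minimal sketch. The second input string is a made-up example (not from the thread) chosen to show the effect of \M; the expected results in the comments assume Tcl's standard ARE engine.

```tcl
# Sample equation from the original question.
set equation_x {(curve.y*curve_2.y)/curve.x}

# The lookahead (?=...) checks for ".x" or ".y" without consuming it,
# so only the curve name itself ends up in the result list.
set names [regexp -all -inline {\w+(?=\.[xy]\M)} $equation_x]
puts $names   ;# curve curve_2 curve

# Hypothetical input: "fooled.you" must not match, because the "y"
# is followed by more word characters and \M (end of word) fails.
set tricky {fooled.you+real.x}
set trickyNames [regexp -all -inline {\w+(?=\.[xy]\M)} $tricky]
puts $trickyNames   ;# real
```

Because nothing is captured in a group, -inline returns exactly one element per match and no filtering step is needed afterwards.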

heinrichmartin

Jul 22, 2015, 2:52:35 AM

On Tuesday, July 21, 2015 at 12:36:36 PM UTC+2, raks...@gmail.com wrote:
> I am not able to reproduce the results that you show:
>
> dict values [regexp -all -inline -- {(\w+)\.[xy]} $equation_x];
>
> when I run gives me the following result:
> curve curve_2 curve
>
> so basically the uniquification did not happen.
>
> -Anirudh

That is actually true. Note, however, that the dict keys are the full matches, and those are unique: curve.x and curve.y are different keys.

And Schelte provided an even better regexp using lookahead :-)
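
A small sketch of why [dict values] does not deduplicate the names here: the list returned by regexp is read as key/value pairs, with the full match as key and the captured name as value. The second equation is a made-up input (an assumption for illustration) in which the same full match occurs twice.

```tcl
set re {(\w+)\.[xy]}

# Equation from the question: all full matches differ, so no pair is dropped.
set pairs [regexp -all -inline -- $re {(curve.y*curve_2.y)/curve.x}]
puts [dict keys $pairs]     ;# curve.y curve_2.y curve.x
puts [dict values $pairs]   ;# curve curve_2 curve

# Hypothetical equation with "curve.y" twice: the duplicate KEY collapses,
# but the values "curve" (from curve.y) and "curve" (from curve.x) both stay.
set pairs2 [regexp -all -inline -- $re {curve.y+curve.y*curve.x}]
puts [dict values $pairs2]  ;# curve curve
```

So the dict trick removes repeated full matches, which is neither a reliable "every other element" filter nor a deduplication of the names.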

raks...@gmail.com

Jul 22, 2015, 4:16:36 AM

When we reverse the result of the regexp match,
the (\w+) captures become the keys and are hence unique.

puts [dict keys [lreverse [regexp -all -inline -- {(\w+)\.[xy]} $equation_x]]]

gives me "curve" and "curve_2"

Andreas Leitgeb

Jul 22, 2015, 6:51:51 AM

heinrichmartin <martin....@frequentis.com> wrote:
> time {dict values $x} 1000
> 275.311 microseconds per iteration
> time {dict values $x} 1000
> 265.618 microseconds per iteration

I think this result is misleading: it really only measures
the extraction of the values from a dict, not the creation of
the dict. OK, it does measure the creation, but only in the first
of the 1000 runs, so its impact gets divided by 1000. Once the
original list $x has been converted to a dict, Tcl remembers its
dict representation.
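
One way to avoid the cached dict representation in such a measurement is to build the list inside the timed script, so every iteration starts from a fresh value. The following is only a sketch under assumptions: the input size, the synthetic names curve_0 ... curve_4999, and the iteration count are made up, and [lmap] needs Tcl 8.6. No timings are shown because they depend on the machine and Tcl version.

```tcl
# Hypothetical test input: 5000 distinct names joined into one long equation.
set terms {}
for {set i 0} {$i < 5000} {incr i} {
    lappend terms curve_$i.x
}
set big [join $terms +]
set re {(\w+)\.[xy]}

# Each iteration runs regexp again, so the list (and any dict built
# from it) is created anew and nothing is reused between runs.
puts [time {lsearch -not -all -inline [regexp -all -inline -- $re $big] *.*} 100]
puts [time {dict values [regexp -all -inline -- $re $big]} 100]
puts [time {lmap {all name} [regexp -all -inline -- $re $big] {set name}} 100]
```

The regexp cost is now part of all three numbers, so only the differences between them say anything about the filtering step.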

heinrichmartin

Jul 23, 2015, 5:47:28 AM

On Wednesday, July 22, 2015 at 12:51:51 PM UTC+2, Andreas Leitgeb wrote:
> I think, this result is misleading, as it only really measures
> the extraction of the values of a dict, but not the creation of
> the dict. Ok, it does measure it, but only once of the 1000 runs,
> so its impact gets divided by 1000. Once the original list $x is
> converted to a dict, tcl will remember its dict-rep.

And Tcl's internals are still fooling me(*) ... I should have included the regexp part in the timing, too. Thanks for pointing that out!

(*) Assuming that the dict representation has its keys and values at hand as a list, and considering the copy-on-write pattern, the timed command did next to nothing at all. Am I right?

Donal K. Fellows

Jul 23, 2015, 8:33:01 AM

On 20/07/2015 11:35, Juge wrote:
> It works, except that I get everything twice and I would like to
> strip the .x or .y out of the output.

You're getting things twice because you're putting the overall RE inside
a capturing parenthesis. With [regexp -inline] that means you get the
value twice, once as the overall match and the other time as the
contents of the parenthesis.
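
A minimal sketch of that behaviour, using the shorter \w+ pattern suggested earlier in the thread rather than the original one:

```tcl
# Sample equation from the original question.
set equation_x {(curve.y*curve_2.y)/curve.x}

# No capturing group: -inline yields one element per match.
set plain [regexp -all -inline -- {\w+\.[xy]} $equation_x]
puts $plain    ;# curve.y curve_2.y curve.x

# One capturing group: -inline yields two elements per match,
# the whole match followed by the captured name.
set grouped [regexp -all -inline -- {(\w+)\.[xy]} $equation_x]
puts $grouped  ;# curve.y curve curve_2.y curve_2 curve.x curve
```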

The simplest way to handle this is actually to use a separate filtering
step. There are many ways to do that, but here's one with Tcl 8.6's
[lmap] command:

# A slightly simpler RE that captures just the interesting bit
set matched [regexp -all -inline -- {([a-zA-Z0-9_]*)(?:\.y?|\.x?)} $equation_x]
set wordList [lmap {all part} $matched {set part}]
# ==> curve curve_2 curve

If you want to remove further duplicates (which may be an order changing
operation) you would instead do this as the filtering operation (which
will work in 8.5):

foreach {all part} $matched {
    dict set wordList $part "dummy value"
}
set wordList [dict keys $wordList]
# ==> curve curve_2

Donal.
--
Donal Fellows — Tcl user, Tcl maintainer, TIP editor.

raks...@gmail.com

Jul 23, 2015, 10:35:45 AM

On Thursday, July 23, 2015 at 6:03:01 PM UTC+5:30, Donal K. Fellows wrote:
[snipped]

>
> # A slightly simpler RE that captures just the interesting bit
> set matched [regexp -all -inline -- {([a-zA-Z0-9_]*)(?:\.y?|\.x?)}
> $equation_x]
>

Actually, the above regex can also capture false matches like
curve_2. or curve_2.xyz, since the y? and x? are optional.
To make it watertight we just need this:

regexp {([a-z_A-Z][a-z_A-Z0-9]*)\.(?:y|x)(?:\W|$)}
or,
regexp {(\w+)\.[xy](?:\W|$)}
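
A quick check of the two patterns against those false matches, as a sketch. The variable names loose and strict are made up for this example. Note that with the stricter pattern the whole match also contains the trailing delimiter, so the name has to be taken from the capture group.

```tcl
set loose  {([a-zA-Z0-9_]*)(?:\.y?|\.x?)}
set strict {(\w+)\.[xy](?:\W|$)}

puts [regexp -- $loose  {curve_2.xyz}]   ;# 1 (false match)
puts [regexp -- $loose  {curve_2.}]      ;# 1 (false match)
puts [regexp -- $strict {curve_2.xyz}]   ;# 0
puts [regexp -- $strict {curve_2.}]      ;# 0
puts [regexp -- $strict {curve_2.x}]     ;# 1

# Extracting the names: each match yields the pair {whole-match name}.
set names {}
foreach {all name} [regexp -all -inline -- $strict {(curve.y*curve_2.y)/curve.x}] {
    lappend names $name
}
puts $names   ;# curve curve_2 curve
```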

>
> set wordList [lmap {all part} $matched {set part}]
> # ==> curve curve_2 curve
>

Can you please explain how the above works? When I run it, I get the
error message: invalid command name "lmap"
Do we need to include some specific package for it to run?

-Anirudh.

Andreas Leitgeb

Jul 23, 2015, 12:13:38 PM

lmap was introduced in Tcl 8.6, so you apparently tried it
with an older version.

It is possible to create lmap as a procedure, or to change its
use to a foreach loop and some extra commands, e.g.:

set wordList {}
foreach {all part} $matched { lappend wordList $part }

would be mostly equivalent to the suggested lmap-use.
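
For the "lmap as a procedure" route, a rough sketch could look like the following. The name lmap_compat is hypothetical, and this is not the real 8.6 implementation: it handles a single variable list and one input list only, and does not support break or continue in the body.

```tcl
# Hypothetical stand-in for lmap on Tcl 8.5 (single varlist/list pair only).
proc lmap_compat {varNames list body} {
    set result {}
    set step [llength $varNames]
    for {set i 0} {$i < [llength $list]} {incr i $step} {
        # Assign the next group of elements to the caller's variables;
        # lindex returns "" past the end, like foreach does.
        set j $i
        foreach name $varNames {
            uplevel 1 [list set $name [lindex $list $j]]
            incr j
        }
        # Evaluate the body in the caller's scope and collect its result.
        lappend result [uplevel 1 $body]
    }
    return $result
}

set matched {curve.y curve curve_2.y curve_2 curve.x curve}
set wordList [lmap_compat {all part} $matched {set part}]
puts $wordList   ;# curve curve_2 curve
```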

Andreas Leitgeb

Jul 23, 2015, 12:37:29 PM

In my previous post I just "reproduced" from memory what I had gathered
from reading various posts here in the past.

Now, I tried to verify it, using tcl8.6's ::tcl::unsupported::representation:

% # create a pure list. ("llength $a" is to avoid the print out)
% set a [lrepeat 10000 foo] ; llength $a
10000
% ::tcl::unsupported::representation $a
value is a list with a refcount of 2, object pointer at 0x8158be8, internal representation 0x81a2c28:(nil), no string representation

% # it's a pure list: let's turn it into a dict:
% llength [dict values $a]
1
% ::tcl::unsupported::representation $a
value is a dict with a refcount of 2, object pointer at 0x8158be8, internal representation 0x819d198:(nil), string representation "foo foo foo f..."
% # yes it is a dict, but the previous list is now preserved as a string rep !

% # now turn it back into a list:
% llength $a
10000
% ::tcl::unsupported::representation $a
value is a list with a refcount of 2, object pointer at 0x8158be8, internal representation 0x81a2c28:(nil), string representation "foo foo foo f..."
% # string rep is kept, dict-rep is lost.

I don't know whether the dict is created directly from the list or
whether everything goes through the string rep. Neither do I know
whether the dict maintains a list of its values, but I rather doubt
it, so I believe that at least the extraction of the values was
actually redone 1000 times during your measurement.

heinrichmartin

Jul 24, 2015, 5:11:02 AM

Thanks for giving these insights!

BTW: I am quite sure that the internal rep of the list is rebuilt during [llength $a]; using the same address as before is just coincidence (or at most a logical, reproducible choice for memory allocation).

heinrichmartin

Jul 24, 2015, 5:12:52 AM

As I hijacked this thread about regexp, let me point out Schelte's elegant solution once more:

Andreas Leitgeb

Jul 24, 2015, 7:16:57 AM

heinrichmartin <martin....@frequentis.com> wrote:
> On Thursday, July 23, 2015 at 6:37:29 PM UTC+2, Andreas Leitgeb wrote:
>> In my previous post I just "reproduced" from memory what I had gathered
>> from reading various posts here in the past.
>> Now, I tried to verify it, using tcl8.6's
>> ::tcl::unsupported::representation:

>> % set a [lrepeat 10000 foo] ; llength $a; ...
>> value is a list with a refcount of 2, object pointer at 0x8158be8,
>> int rep 0x81a2c28:(nil), no string representation
>> % llength [dict values $a]; ...
>> value is a dict with a refcount of 2, object pointer at 0x8158be8,
>> int rep 0x819d198:(nil), string representation "foo foo foo f..."
>> % llength $a; ...
>> value is a list with a refcount of 2, object pointer at 0x8158be8,
>> int rep 0x81a2c28:(nil), string representation "foo foo foo f..."

> BTW: I am quite sure that the int rep of the list is re-built during
> [llength $a]; using the same address as before is just coincidence
> (or at most a logical, re-producible choice for memory allocation).

I actually made another test right after posting my experiments and
noticing the same int-rep address: between the steps of turning $a
into a dict and turning it back into a list, I inserted a creation
of another list. It turned out, that $a back as a list still got the
old address back. I'd now speculate that the dict internally caches
the list-rep, and that doesn't show up in output of ...::representation.

Rich

Jul 24, 2015, 8:39:49 AM

In the best possible case, the actual dict/list elements remain
unchanged, and what changes is creating/destroying the dict hash table
and list index tables.

Both of which require scanning the elements at least once for each
rebuild, and for the dict case also require hashing the keys to build
the hash table. There is at least some measurable time consumed by the
rebuild of the hash table or list index table when shimmering back and
forth.

And both of the above have to be redone whether the conversion happens
'directly' or via passing through the string representation.