Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Regexp vs string match

57 views
Skip to first unread message

relaxmike

unread,
Aug 5, 2008, 9:26:14 AM8/5/08
to
Hi all,

I am currently reading the "dragon book" and
it is for me the occasion to experiment a little
more with Tcl regular expressions.
So I compared the performances for a Tcl script
which uses only "string match" commands with
additionnal "string range" and "string first", etc... commands
against the regexp command.
Unfortunately, I am unable to take advantage of the
power of the regexp command on my example.
On my simple example, the "string matching"-based script
is 10x faster than the "regexp"-based script.

Suppose I have to make a script which process a data
file which contains the headers of emails :

from First1 Last1 <first1...@gmail.com>
[other unwanted lines]

From this data file, I would like to get (parse) the
list of names and associated email addresses.

With a "string match" command, I can see if the current line
is matching a pattern. Then I can extract the various
parts of the line with a search of the "<" character.
This leads to the following command :

proc parseFromSM {line} {
set ismatching [string match "from*" $line]
if {$ismatching==1} then {
log "parseFrom : $line > $ismatching"
# Get user name
set firstInf [string first "<" $line]
incr firstInf -1
set lineName [string range $line 0 $firstInf]
set username [string range $lineName 5 end]
set username [string trim $username]
# Get user mail
incr firstInf +2
set useremail [string range $line $firstInf end-1]
# Store data
log "Name : $username"
log "Mail : $useremail"
}
return $ismatching
}

With the regexp command instead, all the processing
is done in one line only, since regexp allows to
return the various parts of the RE which matches the
string. After some tuning (with VisualRegexp btw),
I ended with :

proc parseFromRE {line} {
set ismatching [regexp {(from)(\ *)(.*) <(.*)>} $line matched from
blanks username useremail]
if {$ismatching==1} then {
log "parseFrom : $line > $ismatching"
log "Name : $username"
log "Mail : $useremail"
}
return $ismatching
}

I was surprised that it was so simple.
I did several timings and I was surprised to see
that the SM (string matching) command was faster
than the RE (regular expression) method.
The timing is based on the following script,
which uses the "time" command with 10 calls to the
command :

#
# Measure timings
#
set timecount 10
#
# Measure timings for String Matching
#
set line "from First1 Last1 <first1...@gmail.com>"
set imax 100
set tmin [expr {pow(10,3)}]
set tmax 0
set ttotal 0
for {set i 0} {$i<$imax} {incr i} {
set t1 [time [list parseFromSM $line] $timecount]
set currenttime [lindex $t1 0]
set ttotal [expr {$ttotal + $currenttime}]
if {$currenttime<$tmin} then {
set tmin $currenttime
}
if {$currenttime>$tmax} then {
set tmax $currenttime
}
}
set taverage [expr {$ttotal/$imax}]
puts "###################"
puts "parseFromSM"
puts "Total:$ttotal (ms)"
puts "Average:$taverage (ms)"
puts "Min:$tmin (ms)"
puts "Max:$tmax (ms)"

Here are the (surprising) results :

###################
parseFromSM
Total:686.0 (ms)
Average:6.86 (ms)
Min:6.6 (ms)
Max:13.2 (ms)
###################
parseFromRE
Total:7320.0 (ms)
Average:73.2 (ms)
Min:72.3 (ms)
Max:113.9 (ms)

What am I doing wrong ?

Regards,

Michaël

PS
Below is the complete script.
It contains a second version of the RE-based command,
with a global variable containing the regular expression.

#
# reperf.tcl --
# A small script to show the performance improvment by using
# REs.
#
proc parseFromSM {line} {
set ismatching [string match "from*" $line]
if {$ismatching==1} then {
log "parseFrom : $line > $ismatching"
# Get user name
set firstInf [string first "<" $line]
incr firstInf -1
set lineName [string range $line 0 $firstInf]
set username [string range $lineName 5 end]
set username [string trim $username]
# Get user mail
incr firstInf +2
set useremail [string range $line $firstInf end-1]
# Store data
log "Name : $username"
log "Mail : $useremail"
}
return $ismatching
}
proc parseFromRE {line} {
set ismatching [regexp {(from)(\ *)(.*) <(.*)>} $line matched from
blanks username useremail]
if {$ismatching==1} then {
log "parseFrom : $line > $ismatching"
log "Name : $username"
log "Mail : $useremail"
}
return $ismatching
}
proc parseFromRE2 {line} {
set ismatching [regexp $::RE $line matched from blanks username
useremail]
if {$ismatching==1} then {
log "parseFrom : $line > $ismatching"
log "Name : $username"
log "Mail : $useremail"
}
return $ismatching
}
proc log {msg} {
#puts "$msg"
}
#
# Measure timings
#
set timecount 10
#
# Measure timings for String Matching
#
set line "from First1 Last1 <first1...@gmail.com>"
set imax 100
set tmin [expr {pow(10,3)}]
set tmax 0
set ttotal 0
for {set i 0} {$i<$imax} {incr i} {
set t1 [time [list parseFromSM $line] $timecount]
set currenttime [lindex $t1 0]
set ttotal [expr {$ttotal + $currenttime}]
if {$currenttime<$tmin} then {
set tmin $currenttime
}
if {$currenttime>$tmax} then {
set tmax $currenttime
}
}
set taverage [expr {$ttotal/$imax}]
puts "###################"
puts "parseFromSM"
puts "Total:$ttotal (ms)"
puts "Average:$taverage (ms)"
puts "Min:$tmin (ms)"
puts "Max:$tmax (ms)"
#
# Measure timings for Regular Expressions
#
set tmin [expr {pow(10,3)}]
set tmax 0
set ttotal 0
for {set i 0} {$i<$imax} {incr i} {
set t1 [time [list parseFromRE $line] $timecount]
set currenttime [lindex $t1 0]
set ttotal [expr {$ttotal + $currenttime}]
if {$currenttime<$tmin} then {
set tmin $currenttime
}
if {$currenttime>$tmax} then {
set tmax $currenttime
}
}
set taverage [expr {$ttotal/$imax}]
puts "###################"
puts "parseFromRE"
puts "Total:$ttotal (ms)"
puts "Average:$taverage (ms)"
puts "Min:$tmin (ms)"
puts "Max:$tmax (ms)"
#
# Measure timings for Regular Expressions v2
#
set RE {(from)(\ *)(.*) <(.*)>}
set tmin [expr {pow(10,3)}]
set tmax 0
set ttotal 0
for {set i 0} {$i<$imax} {incr i} {
set t1 [time [list parseFromRE2 $line] $timecount]
set currenttime [lindex $t1 0]
set ttotal [expr {$ttotal + $currenttime}]
if {$currenttime<$tmin} then {
set tmin $currenttime
}
if {$currenttime>$tmax} then {
set tmax $currenttime
}
}
set taverage [expr {$ttotal/$imax}]
puts "###################"
puts "parseFromRE2"
puts "Total:$ttotal (ms)"
puts "Average:$taverage (ms)"
puts "Min:$tmin (ms)"
puts "Max:$tmax (ms)"

schlenk

unread,
Aug 5, 2008, 10:02:52 AM8/5/08
to
On Aug 5, 3:26 pm, relaxmike <michael.bau...@gmail.com> wrote:
> Hi all,
>
> I am currently reading the "dragon book" and
> it is for me the occasion to experiment a little
> more with Tcl regular expressions.
> So I compared the performances for a Tcl script
> which uses only "string match" commands with
> additionnal "string range" and "string first", etc... commands
> against the regexp command.
> Unfortunately, I am unable to take advantage of the
> power of the regexp command on my example.
> On my simple example, the "string matching"-based script
> is 10x faster than the "regexp"-based script.
>
[snip]

> Here are the (surprising) results :
>
> ###################
> parseFromSM
> Total:686.0 (ms)
> Average:6.86 (ms)
> Min:6.6 (ms)
> Max:13.2 (ms)
> ###################
> parseFromRE
> Total:7320.0 (ms)
> Average:73.2 (ms)
> Min:72.3 (ms)
> Max:113.9 (ms)
>
> What am I doing wrong ?

Probably nothing. The string commands are known to be way faster than
regexp, which should be no real surprise if you compare what string
match needs to do with what regexp needs to do. (not sure which of the
various string matching algorithms 'string match' uses, but its not
the totally naive version, i think its Boyer-Moore or Knuth-Morris-
Pratt, see the Tcl source for more insight).

So, regexp is powerful but for trivial stuff its slower as it needs to
build and drive a complex DFA. There are even some optimization in
Tcl's regexp parser to convert trivial RE's to string match bytecodes
when possible, because its faster.

Michael

relaxmike

unread,
Aug 5, 2008, 10:39:38 AM8/5/08
to
Thank you for your answer.

The "string match" command is implemented, as far as I
understand the Tcl source code, in Tcl_StringCaseMatch
tclUtil.c, see here :

http://tcl.cvs.sourceforge.net/tcl/tcl/generic/tclUtil.c?view=markup

As you said, the implementation is "naive", compared to
the construction of a DFA/NFA is.
But I thought there would be a performance improvement because
the automaton is computed once for all at the first launch of the
regexp command and that that all subexpressions
are captured in one iteration through the automaton.
Obviously, this is not the case, as I can see in the detailed
timings for the RE-based script.

Ok, so I cannot get an improvement for my example, because
it is too simple.
But, back to the real example, my data file does not
only have "from" statements to parse.
It also has date and subject statements :

from First1 Last1 <bla...@gmail.com>
date Sun, Aug 3, 2008 at 1:20 AM
subject This is the subject

Suppose that I am able to write a complicated RE
which can take care of the 3 cases in one.
I don't know if I am able to do so, but, in that
case, would the RE-based algorithm be faster ?

Regards,

Michael (the other one...)

Cameron Laird

unread,
Aug 5, 2008, 10:47:29 AM8/5/08
to
In article <8312e4ff-1808-4f05...@l42g2000hsc.googlegroups.com>,
.
.
.
Michael already responded to your performance observations.
I'll add a couple other remarks that I suspect you already
know, but might be unfamiliar to other readers:
A. Serious RFC 822 [2822] parsing can easily
be left to experts, for [mime:getheader]
and [mime::parseaddress], for example,
know how to handle several syntaxes that
your parsers above don't.
B. What was the "it" that surprised you with
its simplicity? Maybe we can help explain.

schlenk

unread,
Aug 5, 2008, 11:12:58 AM8/5/08
to
On Aug 5, 4:39 pm, relaxmike <michael.bau...@gmail.com> wrote:
> But I thought there would be a performance improvement because
> the automaton is computed once for all at the first launch of the
> regexp command and that that all subexpressions
> are captured in one iteration through the automaton.
> Obviously, this is not the case, as I can see in the detailed
> timings for the RE-based script.

No. The compiled RE is actually cached in byte compiled code.
But for the detailed workings of the innards of the re engine i can
just
defer to the actual sourcecode (read -> i don't understand it), its
a pretty complex piece of code...

>
> Ok, so I cannot get an improvement for my example, because
> it is too simple.
> But, back to the real example, my data file does not
> only have "from" statements to parse.
> It also has date and subject statements :
>
> from    First1 Last1 <bla...@gmail.com>
> date    Sun, Aug 3, 2008 at 1:20 AM
> subject    This is the subject
>
> Suppose that I am able to write a complicated RE
> which can take care of the 3 cases in one.
> I don't know if I am able to do so, but, in that
> case, would the RE-based algorithm be faster ?

Without measurements its guesswork, so the best answer
to get is 'try and see'.

Michael

Arnold Snarb

unread,
Aug 5, 2008, 11:56:03 AM8/5/08
to
schlenk wrote:
> [...] The string commands are known to be way faster than

> regexp, which should be no real surprise if you compare what string
> match needs to do with what regexp needs to do. (not sure which of the
> various string matching algorithms 'string match' uses, but its not
> the totally naive version, i think its Boyer-Moore or Knuth-Morris-
> Pratt, see the Tcl source for more insight).


Actually it is the naive algorithm.

Boyer-Moore and Knuth-Morris-Pratt are only a win if
the strings involved are really long.


> So, regexp is powerful but for trivial stuff its slower as it needs to
> build and drive a complex DFA. There are even some optimization in
> Tcl's regexp parser to convert trivial RE's to string match bytecodes
> when possible, because its faster.

This is true.

Jeff Hobbs

unread,
Aug 5, 2008, 1:38:54 PM8/5/08
to
On Aug 5, 8:56 am, Arnold Snarb <sn...@fdip.bad-monkeys.net> wrote:
> schlenk wrote:
> > [...] The string commands are known to be way faster than
> > regexp, which should be no real surprise if you compare what string
> > match needs to do with what regexp needs to do. (not sure which of the
> > various string matching algorithms 'string match' uses, but its not
> > the totally naive version, i think its Boyer-Moore or Knuth-Morris-
> > Pratt, see the Tcl source for more insight).
>
> Actually it is the naive algorithm.

Actually it's an advanced version of the naive algorithm.

> Boyer-Moore and Knuth-Morris-Pratt are only a win if
> the strings involved are really long.

and yes, this is why this isn't considered, because most string
matches are short.

Jeff

relaxmike

unread,
Aug 6, 2008, 3:12:50 AM8/6/08
to
Thank you for all your answers.

> Michael already responded to your performance observations.
> I'll add a couple other remarks that I suspect you already
> know, but might be unfamiliar to other readers:
> A.  Serious RFC 822 [2822] parsing can easily
>     be left to experts, for [mime:getheader]
>     and [mime::parseaddress], for example,
>     know how to handle several syntaxes that
>     your parsers above don't.

I did not know that package.
I looked at the details, in order to have an idea
of how experts handles this.
I noticed that, indeed, most of the parsing is done
with string match | string first | string last | string index
commands, while regexp is reserved for the most complex cases,
for example :

regexp {(.*?)(=\?(?:[^?]+)\?(?:.)\?(?:[^?]*)\?=)(.*)$} $field [...]

or

regexp {^(.*) ([+-])([0-9][0-9])([0-9][0-9])$} $value [...]

> B.  What was the "it" that surprised you with
>     its simplicity?  Maybe we can help explain.

I know the basics of REs since several years, but it appears
that I was "shy" to use them.
From now, I think that I won't be shy anymore, just careful,
since it appears that REs are to reserve for complex cases, which
cannot be handled by simple "string *" commands.

I have a remaining question about RE performances.
Is it possible to "optimize" my RE ?
That is, find another RE which does the same job,
but which have properties which makes the whole
process of building and running the DFA faster ?

Regards,

Michaël

PS
In the excellent book "Practical Programming in Tcl/Tk",
at the chapter about regexps :
http://www.beedub.com/book/3rd/regexp.pdf
one can find the paragraph :

"Regular expressions can seem overly complex at first.
They introduce their own syntax and their own rules, and
you may be tempted to use simpler commands like
string first, string range, or string match to process
your strings. How-ever, often a single regular expression
command can replace a sequence of severalstring
commands. Any time you can replace several Tcl commands with
one, you get a performance improvement. Furthermore, the regular
expressionmatcher is implemented in optimized C code, so
pattern matching is fast."

Isn'it a little too enthousiast, considering the current discussion ?

relaxmike

unread,
Aug 6, 2008, 3:25:04 AM8/6/08
to
Sorry to be lengthy, but the more I dig, the more I find...

There is this paper :

"Tcl bytecode optimization: some experiences"

http://aspn.activestate.com/ASPN/Tcl/TclConferencePapers2002/Tcl2002papers/kenny-bytecode/paperKBK.html

where I found the following paragraph, which is very clear
about regexp performances against "string *" commands :

"3.7 Prefer [string] to regular expressions
Regular expressions are amazingly powerful, and for complex string-
matching tasks, they are the right tool for the job. For simpler
operations, though, the [string] command is much faster because much
of it can be compiled to in-line code. In particular, you should use
[string map] where possible in preference to [regsub], and [string
match] or [string first] in preference to [regexp]. "

Regards,

Michaël

Cameron Laird

unread,
Aug 6, 2008, 8:35:48 AM8/6/08
to
In article <036acfe0-4014-469f...@34g2000hsh.googlegroups.com>,
relaxmike <michael...@gmail.com> wrote:
.
.
.

>In the excellent book "Practical Programming in Tcl/Tk",
>at the chapter about regexps :
>http://www.beedub.com/book/3rd/regexp.pdf
>one can find the paragraph :
>
>"Regular expressions can seem overly complex at first.
>They introduce their own syntax and their own rules, and
>you may be tempted to use simpler commands like
>string first, string range, or string match to process
>your strings. How-ever, often a single regular expression
>command can replace a sequence of severalstring
>commands. Any time you can replace several Tcl commands with
>one, you get a performance improvement. Furthermore, the regular
>expressionmatcher is implemented in optimized C code, so
>pattern matching is fast."
>
>Isn'it a little too enthousiast, considering the current discussion ?
>

Yes.

Brent's attention is probably elsewhere now, and I doubt he
answers. I'm willing to speculate a little, though. Casual
research hints that, "Any time you can replace ...", might
have originated quite a few versions ago. Also, I suspect
that if we look at your timings more deeply, we might even
find an interpretation that makes Brent's generalization more
plausible.

I don't feel motivated enough now to pursue the comparison,
though.

Don Porter

unread,
Aug 6, 2008, 11:19:20 AM8/6/08
to

relaxmike <michael...@gmail.com> wrote: .

>> In the excellent book "Practical Programming in Tcl/Tk",
>> "...Any time you can replace several Tcl commands with

>> one, you get a performance improvement.

Like much of the Welch wisdom that's recorded on the web and in
print, the principles are completely right, but the application
to particular examples is only accurate assuming you're still
programming in Tcl 7. Most of the world isn't doing that any more.

Since we got Tcl_Obj's for efficient alternative rep storage, and
bytecode to represent whole scripts, the cost of command invocation
machinery is no longer as predictable, and is much much smaller, at
least averaged over practical scripts.

As another example, anyone still using Tcl_GetCommandInfo probing in
their C code to bypass Tcl command procedure dispatch these days really
needs to re-examine their justifications.

The kernel of wisdom in the Welch quote is as true as ever. Commands
that loop in C are superior performers compared with those that loop in
Tcl. [string map] loops in C. So does [string first].
Also, if your matching task can be done with one [regexp], but is
beyond the reach of one [string], the regexp answer may perform better.

One additional wrinkle is that Tcl notices many cases where [regexp] and
friends are used when their full power is not needed, and when compiling
to bytecode simply rewrites them on the fly to their [string] equivalents.

--
| Don Porter Mathematical and Computational Sciences Division |
| donald...@nist.gov Information Technology Laboratory |
| http://math.nist.gov/~DPorter/ NIST |
|______________________________________________________________________|

tom.rmadilo

unread,
Aug 6, 2008, 12:43:38 PM8/6/08
to
On Aug 6, 8:19 am, Don Porter <d...@nist.gov> wrote:

> The kernel of wisdom in the Welch quote is as true as ever.  Commands
> that loop in C are superior performers compared with those that loop in
> Tcl.  [string map] loops in C.  So does [string first].
> Also, if your matching task can be done with one [regexp], but is
> beyond the reach of one [string], the regexp answer may perform better.

This is a perfect example of a straw man argument, even if
unintentional. The basic "issue" is that regular expressions are a
slower, harder to use version of string matching.

But even with the most simple regular expression you can get three
results:

0. eliminates, or handles per/post parsing (trimming, skip markup)
1. yes/no validation (did the string match the regular expression)
2. what was the match (pulls matched string into a variable)

The big additional benefit of using [regexp] is code regularity and
simplicity. Regular expressions are best used on data which is well
defined in structure (I know this should be obvious, but there are
some suggestions above that string match is better for small strings,
in fact string match is best to use when you are looking for an exact
match, or where the matching is extremely vague (starts with ab). If
you're actually looking for data which matches a pattern, you should
use a regular expression. String match type code is also useful where
you want to create a more complex parser (where the resultant code is
probably hard to understand, but packaged up and left untouched once
it works).

But regular expressions should be used whenever it will simplify code,
that is the real measure you should look at. If you find that a
regular expression is too slow for a particular use, think about
writing a special proc to package up your series of string match
commands.

The email/header reader is an example of this distinction. Headers are
complex structures that contain different syntaxes based upon the
header and even parts of the header value. If regexp was used, it
would only be used on small sub-parts. Much of the parsing is too
context specific to boil down into a useful regular expression. But
when the complexity is wrapped up into a series of procs and packaged
up, any client code which uses the package is easier to write and
understand.

Don Porter

unread,
Aug 6, 2008, 1:57:27 PM8/6/08
to

Don Porter <d...@nist.gov> wrote:
>> The kernel of wisdom in the Welch quote is as true as ever. Commands
>> that loop in C are superior performers compared with those that loop in
>> Tcl. [string map] loops in C. So does [string first].
>> Also, if your matching task can be done with one [regexp], but is
>> beyond the reach of one [string], the regexp answer may perform better.

> This is a perfect example of a straw man argument, even if
> unintentional. The basic "issue" is that regular expressions are a
> slower, harder to use version of string matching.

That's true. My comment above was not responding solely to the [string]
vs. [regexp] question, but to the assertion in the Welch quote that
commands with more implementation in C are always superior to commands
with more implementation in Tcl, and that [regexp] should be preferred
*FOR THAT REASON*.

That reasoning is no longer so clear, since the middle ground of
bytecode execution now exists, since command dispatch is now cheaper
in significant ways, and since C-coded commands like [string map] now
exist, that did not when that quote was likely originally written.

Use the right tool for the job. If you've got a pattern matching task,
[regexp] is at the ready. For simple substring searches and
replacements, there's simpler tools available.

bigfaceworm

unread,
Aug 7, 2008, 1:07:46 PM8/7/08
to
On Aug 6, 8:19 am, Don Porter <d...@nist.gov> wrote:
> The kernel of wisdom in the Welch quote is as true as ever. Commands
> that loop in C are superior performers compared with those that loop in
> Tcl. [string map] loops in C. So does [string first].
> Also, if your matching task can be done with one [regexp], but is
> beyond the reach of one [string], the regexp answer may perform better.

[Note: I refer to Tcl 8.4 - I've not updated to 8.5 (and respective
docs) yet]

This is one thing that's confused me - the various performance
implications of the language's commands (or libraries). I'm not
meaning to harp on Tcl, b/c it's true for other languages as well.

While I appreciate not having to know the internal implementation
details of Tcl's data structures (lists/arrays/dictionaries...), I
would find it useful when thinking about design considerations.

For example, list operations versus arrays. Standard data structure
implementations would lead me to believe

list: constant insertion/deletion given a node, linear lookup based on
index
standard trade-offs between singly and doubly linked lists (no
idea which Tcl has)

array: linear insertion time (copying/moving of data), constant lookup
based on index

But looking at the wiki: http://wiki.tcl.tk/348
[lindex] is O(1)!
but two items later the wiki says [foreach] is faster than [for/
lindex]

Now I don't know which is true. If lindex is truly O(1), then the
[foreach] and [for/lindex] should be pretty much the same (modulo a
small constant factor). (I'm ignoring the elegance of the [foreach]
solution which I'd prefer just b/c of that)


Again, not trying to pick on Tcl, just showing an example of what I
find confusing.


> One additional wrinkle is that Tcl notices many cases where [regexp] and
> friends are used when their full power is not needed, and when compiling
> to bytecode simply rewrites them on the fly to their [string] equivalents.

Apropos to this discussion (string versus regexp), I actually had
example where I was doing some parsing and I tried out all the various
incantations, and to my surprise, regexp turned out to be
significantly faster for this simple matching. Here are my notes from
the test:


# this example just loops over a file 5000 times (opened once)
# the file contained 167 lines of text, only 28 of the lines matched
either of the two checks.
# using regular Tcl 'time' I did this to every line


# 2.5 sec
# if {[string first Extraction $line] == 0} { incr es }
# if {[string first WARNING: $line] == 0} { incr ws }

# 2.4 sec
# if {[string equal -length 1 Extraction $line]} { incr es }
# if {[string equal -length 1 WARNING: $line]} { incr ws }

# 2.4 sec
# if {[string equal -length 10 Extraction $line]} { incr es }
# if {[string equal -length 8 WARNING: $line]} { incr ws }

# 4 sec
# ere == ^Extraction wre == ^WARNING:
# if {[regexp $ere $line]} { incr es }
# if {[regexp $wre $line]} { incr ws }

# 2.1 sec!!!
# if {[regexp ^Extraction $line]} { incr es }
# if {[regexp ^WARNING: $line]} { incr ws }

# 1.6 seconds to do no matches

I would have expected the string equal to be the fastest, yet the
regexp w/inlined RE turned out to be noticeably faster.
Now that I have read Don's comment above, this kind of makes sense,
though I would have still expected it to be the same
as the string equal with the length specified...


My real question is, what's a good way to learn this kind of
information?

Trial and error seems fraught with problems. Tcl's documentation
doesn't appear to have this information. And the Tcl wiki, while full
of information - reads more like a collection of tricks. What I'd
hope for is, "these are the implications of using [regexp], [string
equal], [lindex], ...."


thanks for any pointers,

TJ

ps. Perhaps the best resource is a brain dump from Don Porter, but
that doesn't scale very well and would probably annoy him to no end.

--
Trey Jackson
bigfa...@gmail.com
"Like any truly useful program, `hello' contains a built-in mail
reader."
-- GNU's Bulletin, July 1996

Gerald W. Lester

unread,
Aug 7, 2008, 1:33:56 PM8/7/08
to
bigfaceworm wrote:
> ...
> My real question is, what's a good way to learn this kind of
> information?
>
> Trial and error seems fraught with problems. Tcl's documentation
> doesn't appear to have this information. And the Tcl wiki, while full
> of information - reads more like a collection of tricks. What I'd
> hope for is, "these are the implications of using [regexp], [string
> equal], [lindex], ...."
>
>
> thanks for any pointers,
>
> TJ
>
> ps. Perhaps the best resource is a brain dump from Don Porter, but
> that doesn't scale very well and would probably annoy him to no end.

First off the information you want changes over versions of Tcl, so...

The brain dump and trail are really the only ways to get the true picture
for a given version of Tcl.

Mostly though, don't be concerned about speed until it becomes a problem --
then instrument to see where the hot spot is and fix it.

--
+--------------------------------+---------------------------------------+
| Gerald W. Lester |
|"The man who fights for his ideals is the man who is alive." - Cervantes|
+------------------------------------------------------------------------+

Jeff Hobbs

unread,
Aug 7, 2008, 1:58:48 PM8/7/08
to
On Aug 7, 10:07 am, bigfaceworm <bigfacew...@gmail.com> wrote:
> On Aug 6, 8:19 am, Don Porter <d...@nist.gov> wrote:
> > The kernel of wisdom in the Welch quote is as true as ever.  Commands
> > that loop in C are superior performers compared with those that loop in
> > Tcl.  [string map] loops in C.  So does [string first].
> > Also, if your matching task can be done with one [regexp], but is
> > beyond the reach of one [string], the regexp answer may perform better.

First off, the original quote from Brent's book was actually changed
(by me) in the 4th edition for the exact reason that 8.4 had numerous
[string] improvements in the bytecode. The wording is similar, but
not so one-sided for using regexp.

It is still true that a single regexp can perform better than multiple
string commands, but it is more case-by-case now.

> This is one thing that's confused me - the various performance
> implications of the language's commands (or libraries).  I'm not
> meaning to harp on Tcl, b/c it's true for other languages as well.

Yes, very true of all languages, and as with many others, the rules
change from release to release.

> While I appreciate not having to know the internal implementation
> details of Tcl's data structures (lists/arrays/dictionaries...), I
> would find it useful when thinking about design considerations.

If it becomes important to design for performance reasons, then you do
want to be up-to-date on the latest characteristic behavior for all
releases. One place that can tell you a lot about this wrt Tcl is
http://wiki.tcl.tk/1611. It covers a large range of operations and
commands performance over versions.

> # this example just loops over a file 5000 times (opened once)

First off, I hope that all the examples below are pulled out of proc
bodies. procs will get the full advantage of byte compilation.

>       # 2.4 sec
> #       if {[string equal -length 1 Extraction $line]} { incr es }
> #       if {[string equal -length 1 WARNING: $line]} { incr ws }

Only the simple [string equal] case is byte compiled. The -length
will cause it to take a slower path.

>       # 4 sec
>       # ere == ^Extraction wre == ^WARNING:
> #       if {[regexp $ere $line]} { incr es }
> #       if {[regexp $wre $line]} { incr ws }
>
>       # 2.1 sec!!!
> #       if {[regexp ^Extraction $line]} { incr es }
> #       if {[regexp ^WARNING: $line]} { incr ws }

The last regexp is actually turned into
[string match "Extraction*" $line]
in 8.4 (why didn't you test string match as well??).

> My real question is, what's a good way to learn this kind of
> information?

See the above link. Reading the sources is the best. To know what is
fully bytecompiled is the key.

> Trial and error seems fraught with problems.  Tcl's documentation
> doesn't appear to have this information.  And the Tcl wiki, while full

Of course it doesn't, and definitely shouldn't.

Jeff

Neil Madden

unread,
Aug 7, 2008, 2:03:44 PM8/7/08
to
bigfaceworm wrote:
[...]

> While I appreciate not having to know the internal implementation
> details of Tcl's data structures (lists/arrays/dictionaries...), I
> would find it useful when thinking about design considerations.
>
> For example, list operations versus arrays. Standard data structure
> implementations would lead me to believe
>
> list: constant insertion/deletion given a node, linear lookup based on
> index
> standard trade-offs between singly and doubly linked lists (no
> idea which Tcl has)
>
> array: linear insertion time (copying/moving of data), constant lookup
> based on index
>
> But looking at the wiki: http://wiki.tcl.tk/348
> [lindex] is O(1)!
> but two items later the wiki says [foreach] is faster than [for/
> lindex]

Yes. This is a terminology thing. Tcl's lists are really arrays/vectors
-- i.e., provide constant-time indexing, but inserts can involve
copying. Tcl's arrays on the other hand are "associative arrays" to give
them their full name -- i.e., maps. They are implemented as hashtables.
As are dicts. Tcl has no built-in linked-list type.

>
> Now I don't know which is true. If lindex is truly O(1), then the
> [foreach] and [for/lindex] should be pretty much the same (modulo a
> small constant factor). (I'm ignoring the elegance of the [foreach]
> solution which I'd prefer just b/c of that)

I believe the speed difference referred to is that "small constant
factor". Certainly, there shouldn't be a big-Oh difference (both loops
should be O(N)).

[...]


> My real question is, what's a good way to learn this kind of
> information?
>
> Trial and error seems fraught with problems.

Is this a general questioning of the scientific method, or just in
regard to Tcl? :-) More seriously, have you encountered particular
problems with probing the performance characteristics of Tcl code,
compared to other languages?

Tcl's documentation
> doesn't appear to have this information. And the Tcl wiki, while full
> of information - reads more like a collection of tricks. What I'd
> hope for is, "these are the implications of using [regexp], [string
> equal], [lindex], ...."

I agree that the Tcl documentation could say more. I think some
statement to the effect that [lindex] is expected to be O(1) etc, would
be a useful addition, without overconstraining the implementation too much.

-- Neil

tom.rmadilo

unread,
Aug 7, 2008, 2:48:08 PM8/7/08
to
On Aug 7, 10:07 am, bigfaceworm <bigfacew...@gmail.com> wrote:


> My real question is, what's a good way to learn this kind of
> information?
>
> Trial and error seems fraught with problems.  Tcl's documentation
> doesn't appear to have this information.  And the Tcl wiki, while full
> of information - reads more like a collection of tricks.  What I'd
> hope for is, "these are the implications of using [regexp], [string
> equal], [lindex], ...."

I'll try to repeat what I said earlier with no success: speed should
not be the determining factor here (as a first step). Instead use
[regexp] or [string] commands based upon the problem at hand. The fact
that there is some overlap in functionality between the two is a
distraction.

For instance, in the time it takes to try every possible combination
of code for speed, you could create a proc:

proc incr_linetype {lineVar} {
upvar 1 $lineVar linev
switch -glob -nocase $linev {
Extraction* { incr incrv


if {[regexp ^Extraction $line]} { incr es }
# if {[regexp ^WARNING: $line]} {

upvar ws ws incr ws }

tom.rmadilo

unread,
Aug 7, 2008, 2:52:02 PM8/7/08
to
On Aug 7, 11:48 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:

> For instance, in the time it takes to try every possible combination
> of code for speed, you could create a proc:
>
> proc incr_linetype {lineVar} {
>     upvar 1 $lineVar linev
>     switch -glob -nocase $linev {
>         Extraction* { incr incrv
> if {[regexp ^Extraction $line]} { incr es }
> #       if {[regexp ^WARNING: $line]} {
> upvar ws ws incr ws }

Oops, sorry I somehow I posted mid-sentence:

proc incr_linetype {lineVar} {
upvar 1 $lineVar linev

switch -glob $linev {
Extraction* {
upvar 1 es es
incr es
}
WARNING* {
upvar 1 ws ws
incr ws
}
}
}

Or use something similar with string commands.

bigfaceworm

unread,
Aug 7, 2008, 5:20:50 PM8/7/08
to
On Aug 7, 10:58 am, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
> Yes, very true of all languages, and as with many others, the rules
> change from release to release.

Sure, but basic information like "Tcl lists are really arrays" would
have significant
impact on my design of a sorting algorithm (for example). And w/out
some basic information like
that, I'd be just shooting in the dark when trying to speed things up.

For comparison, C++'s STL has performance characteristics in the spec,
so I know what to
expect when using vector/list.


> If it becomes important to design for performance reasons, then you do
> want to be up-to-date on the latest characteristic behavior for all

> releases. One place that can tell you a lot about this wrt Tcl ishttp://wiki.tcl.tk/1611. It covers a large range of operations and
> commands performance over versions.

I've looked at this, and while it does show a wide variety of
characteristics, I personally find it difficult to extract higher
level information.

For example - list insertion/deletion operations.

insertions all take ~17 ms (at begin, middle, end)
i find this at odds w/the notion that lists are really arrays
i'd expect arrays would imply end insertions faster than middle/
begin insertions

same goes for removals.

The point being, for me, it's more difficult to build up an idea of
*how* I should use various data structures and algorithms when I can
only get access to performance metrics. As opposed to building a
representative model of what the data structures and algorithms.

And I agree with Gerald's point that I shouldn't worry about
performance until it's an issue. However
given Tcl's "high level" of abstraction (being removed from the
hardware), it's useful to have some basic understanding of what is
under the hood.


> First off, I hope that all the examples below are pulled out of proc
> bodies. procs will get the full advantage of byte compilation.

yup, it was all inside a proc.

> > # if {[string equal -length 1 Extraction $line]} { incr es }
> > # if {[string equal -length 1 WARNING: $line]} { incr ws }
>
> Only the simple [string equal] case is byte compiled. The -length
> will cause it to take a slower path.

Interesting.

> > # 2.1 sec!!!
> > # if {[regexp ^Extraction $line]} { incr es }
> > # if {[regexp ^WARNING: $line]} { incr ws }
>
> The last regexp is actually turned into
> [string match "Extraction*" $line]
> in 8.4 (why didn't you test string match as well??).

No idea, I should have.

Maybe it was oversight, or perhaps I assumed [string equal] was going
to be faster than [string match]

> See the above link. Reading the sources is the best. To know what is
> fully bytecompiled is the key.

I will have to try out looking at the bytecompiled code.

Regarding reading sources being best, while true, it's got a bit of a
smell factor: I don't recall needing to read the gcc compiler sources
to get a decent understanding of the complexity of various operations
in C or C++. Granted, they are lower level languages. I also didn't
have to read the interpreter/compiler to understand complexity in
Lisp...

Maybe the take-away from this is that Tcl byte compilation
dramatically affects execution speed, and therefore it is very
important to understand when code is/can be byte compiled.

Using what Jeff said above, [string equal -length 8 some_str $line]
cannot be byte-compiled and will be slower than [string match
some_str* $line].

Would that be a fair assessment?


thanks,

TJ

ps. And maybe the answer is, "Suck it up Trey, Tcl has a bunch of
history w/language semantics and a newer byte compiler, and therefore
it isn't straight forward. So dig in and learn the byte compilation
rules and Tcl internals."

--
Trey Jackson
bigfacew...@gmail.com

bigfaceworm

unread,
Aug 7, 2008, 5:36:39 PM8/7/08
to
On Aug 7, 11:48 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
> I'll try to repeat what I said earlier with no success: speed should
> not be the determining factor here (as a first step). Instead use
> [regexp] or [string] commands based upon the problem at hand. The fact
> that there is some overlap in functionality between the two is a
> distraction.

Agreed.

I was hoping to move to the more meta-discussion of understanding why
there are such large differences in semantically similar Tcl code.


If speed were the real issue here, I'd either move to lex/yacc, or a
custom parser in C, or *gasp* bite the bullet and generate an
auxiliary file with the data already processed for me so I wouldn't
have to parse the text file to extract this tiny bit of data I'm
getting.


TJ

--
Trey Jackson
bigfacew...@gmail.com

Jeff Hobbs

unread,
Aug 7, 2008, 5:37:25 PM8/7/08
to
On Aug 7, 2:20 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
> On Aug 7, 10:58 am, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
> And I agree with Gerald's point that I shouldn't worry about
> performance until it's an issue.  However
> given Tcl's "high level" of abstraction (being removed from the
> hardware), it's useful to have some basic understanding of what is
> under the hood.

If you just mean big-O characteristics, then that would apply. Of
course, those have themselves changed (improved) over time.

> > > #       if {[string equal -length 1 Extraction $line]} { incr es }
> > > #       if {[string equal -length 1 WARNING: $line]} { incr ws }
>
> > Only the simple [string equal] case is byte compiled.  The -length
> > will cause it to take a slower path.
>
> Interesting.

And the reason is that [string equal -length ...] is an uncommon case,
so no bytecode was created to support it. It is technically feasible
(rather easy from the bytecode perspective).

> > See the above link.  Reading the sources is the best.  To know what is
> > fully bytecompiled is the key.
>
> I will have to try out looking at the bytecompiled code.
>
> Regarding reading sources being best, while true, it's got a bit of a
> smell factor: I don't recall needing to read the gcc compiler sources
> to get a decent understanding of the complexity of various operations
> in C or C++.  Granted, they are lower level languages.  I also didn't
> have to read the interpreter/compiler to understand complexity in
> Lisp...

but you are asking for more than complexity - you want very much
implementation specific information on nitty-gritty operation details.

> Maybe the take-away from this is that Tcl byte compilation
> dramatically affects execution speed, and therefore it is very
> important to understand when code is/can be byte compiled.

Yes, getting something bytecompiled is a big win. I could point out
instances of this in http://wiki.tcl.tk/1611 if you are curious.

> Using what Jeff said above, [string equal -length 8 some_str $line]
> cannot be byte-compiled and will be slower than [string match
> some_str* $line].

s/cannot be/is not currently/.

> ps. And maybe the answer is, "Suck it up Trey, Tcl has a bunch of
> history w/language semantics and a newer byte compiler, and therefore
> it isn't straight forward.  So dig in and learn the byte compilation
> rules and Tcl internals."

... if you really want to extract the last ounce of speed out.

Jeff

bigfaceworm

unread,
Aug 7, 2008, 7:05:06 PM8/7/08
to
On Aug 7, 2:37 pm, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
> On Aug 7, 2:20 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
>
> > On Aug 7, 10:58 am, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
> > And I agree with Gerald's point that I shouldn't worry about
> > performance until it's an issue. However
> > given Tcl's "high level" of abstraction (being removed from the
> > hardware), it's useful to have some basic understanding of what is
> > under the hood.
>
> If you just mean big-O characteristics, then that would apply. Of
> course, those have themselves changed (improved) over time.

Yeah, so it'd be nice to have some kind of documentation of "lists are
arrays"
"arrays are trees" comparison b/w dicts & arrays. That kind of
thing. I understand
the implementation has changed over time, and that is fair.

Building a working knowledge based on benchmarks (http://wiki.tcl.tk/
1611) just leaves
everything at the black-magic level.

> but you are asking for more than complexity - you want very much
> implementation specific information on nitty-gritty operation details.

As a start, I'd like the complexity (which isn't readily available).

After that, yes, it's nitty-gritty, and it sounds like looking at the
byte-compiler results is where to start learning.

I'll try to give the decompilation a shot to see how easy it is to
understand.

> Yes, getting something bytecompiled is a big win. I could point out
> instances of this inhttp://wiki.tcl.tk/1611if you are curious.

I would be. That might be an interesting annotation to the table.

> > Using what Jeff said above, [string equal -length 8 some_str $line]
> > cannot be byte-compiled and will be slower than [string match
> > some_str* $line].
>
> s/cannot be/is not currently/.

True.


thanks for the info,


TJ
--
Trey Jackson
bigfa...@gmail.com

tom.rmadilo

unread,
Aug 7, 2008, 8:47:52 PM8/7/08
to
On Aug 7, 2:36 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
> On Aug 7, 11:48 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
>
> > I'll try to repeat what I said earlier with no success: speed should
> > not be the determining factor here (as a first step). Instead use
> > [regexp] or [string] commands based upon the problem at hand. The fact
> > that there is some overlap in functionality between the two is a
> > distraction.
>
> Agreed.
>
> I was hoping to move to the more meta-discussion of understanding why
> there are such large differences in semantically similar Tcl code.

Right, my point is that the [regexp] and [string] commands are not
semantically similar. You could argue that the easiest examples of
regular expressions could be handled by the the various string
commands. In general, string commands give a simple answer to a simple
question, you can abuse [regexp] to return the same information as
[string match], which is just a boolean yes/no.

My argument is that regular expressions are semantically distinct from
the string commands. But my analysis would lean toward using string
commands unless you have an actual pattern, even if the [regexp] is
faster! Regular expressions are at least as exact as any combination
of string commands, but you don't use regular expressions to exactly
match two strings. On the flip side, you don't use a string command
unless it can exclude everything that should not match.

> If speed were the real issue here, I'd either move to lex/yacc, or a
> custom parser in C, or *gasp* bite the bullet and generate an
> auxiliary file with the data already processed for me so I wouldn't
> have to parse the text file to extract this tiny bit of data I'm
> getting.

You are making a huge mistake here. You cannot compare C code to Tcl
commands to very high level applications like lex/yacc. Like I said,
this is a huge distraction. We have a community of very helpful
programmers. A lot of them have given advice in this thread.

The bottom line is that if you are talking (or thinking) about the
implementation of Tcl commands as opposed to either why the commands
exist or when you should use them; you are not talking (or thinking)
about Tcl programming, but C programming, but even then you are
comparing apples and oranges.

Alan Anderson

unread,
Aug 7, 2008, 10:43:52 PM8/7/08
to
In article <TkGmk.8978$Bt6....@newsfe04.iad>,

"Gerald W. Lester" <Gerald...@cox.net> wrote:

> Mostly though, don't be concerned about speed until it becomes a problem --
> then instrument to see where the hot spot is and fix it.

Hear, hear.

Neil Madden

unread,
Aug 8, 2008, 6:37:44 AM8/8/08
to
bigfaceworm wrote:
> On Aug 7, 2:37 pm, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
>> On Aug 7, 2:20 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
>>
>>> On Aug 7, 10:58 am, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
>>> And I agree with Gerald's point that I shouldn't worry about
>>> performance until it's an issue. However
>>> given Tcl's "high level" of abstraction (being removed from the
>>> hardware), it's useful to have some basic understanding of what is
>>> under the hood.
>> If you just mean big-O characteristics, then that would apply. Of
>> course, those have themselves changed (improved) over time.
>
> Yeah, so it'd be nice to have some kind of documentation of "lists are
> arrays"
> "arrays are trees" comparison b/w dicts & arrays. That kind of
> thing. I understand
> the implementation has changed over time, and that is fair.

Lists happen to be implemented as arrays. I don't think that needs to be
documented, as it could conceivably change. Much more useful would be to
document expected/worst-case big-Oh characteristics of various common
operations such as [lindex], [lreplace], etc. Of course, those can
change as well, but the docs can change too.

Tcl arrays are hashtables, not trees. The difference between dicts and
arrays is that the former contain values (and are themselves values),
whereas the latter contain variables (and are a type of variable).
Performance wise they should be very similar.

-- Neil

bigfaceworm

unread,
Aug 8, 2008, 3:04:46 PM8/8/08
to
On Aug 7, 5:47 pm, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
> Right, my point is that the [regexp] and [string] commands are not
> semantically similar. <<snip..>>

> My argument is that regular expressions are semantically distinct from
> the string commands. But my analysis would lean toward using string
> commands unless you have an actual pattern, even if the [regexp] is
> faster!

Yes, they *are* different.

I hear your point that 'string match' and 'regexp' are different and
generally should be used for different reasons.

I believe that point is orthogonal to the question I raised. Can we
move on?

FWIW, I find the regexp more readable than the string match (in the
example I raised). To each his own.


> You are making a huge mistake here. You cannot compare C code to Tcl
> commands to very high level applications like lex/yacc. Like I said,
> this is a huge distraction. We have a community of very helpful
> programmers. A lot of them have given advice in this thread.

Yes, there has been a lot of great advice, and for that I'm grateful.

I believe 'distraction' is in the eye of the beholder, I'm getting
very useful information from this discussion.

My question wasn't, "how do I best write this in Tcl?" or even "what's
the fastest way to do this in Tcl?"
But instead, "how do I examine Tcl code to gain understanding about
the runtime performance?" I want the understanding/tools so that
*when* it is appropriate, I can speed the code up. Heck, I just want
to *know* for the sake of knowledge.

Is a thirst for knowledge a mistake?

> The bottom line is that if you are talking (or thinking) about the
> implementation of Tcl commands as opposed to either why the commands
> exist or when you should use them; you are not talking (or thinking)
> about Tcl programming, but C programming, but even then you are
> comparing apples and oranges.

Are you saying I should just code things the "Tcl way" and not bother
my pretty little head with details of runtime? What if I'm just
curious? And why should I be restrained to only think in Tcl? I'm
just thinking about "programming" - Tcl is but a small part of that
thought process.

Maybe I came across as a newbie to you in my posts. I'm not looking
for advice in Tcl style (the zen of Tcl). And even if I were a
newbie, and I were starting down the wrong path while trying to learn
Tcl, let me wander...

If comp.lang.tcl isn't the proper forum in which to ask questions
like, "what's really going on with [linsert]?" What is?


TJ

--
Trey Jackson
bigfa...@gmail.com

Larry W. Virden

unread,
Aug 9, 2008, 5:53:48 AM8/9/08
to
On Aug 7, 1:07 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
>
> This is one thing that's confused me - the various performance
> implications of the language's commands (or libraries). I'm not
> meaning to harp on Tcl, b/c it's true for other languages as well.

This is a great discussion - just be patient as you seek answers. This
is the right place to discuss the topic.

Tracking performance of particular parts of Tcl, whether library
function, or fundamental operation, is a topic that arises here
occasionally. And I've seen waves of pages arise at http://wiki.tcl.tk/
(I'd give you a page reference, but I don't seem to be able to access
the wiki this morning).

>
> While I appreciate not having to know the internal implementation
> details of Tcl's data structures (lists/arrays/dictionaries...), I
> would find it useful when thinking about design considerations.

That's certainly one way to go. Over 30+ years of programming, I
remember taking classes (and reading books) where incorporating the
hardware architecture, language architecture, etc. was part of the
strategy. My own preference was those who went the other direction -
"get the best design for solving the problem, then do only the
essential optimisations after the design is correct". The reason is
that hardware changes, language underpinnings change, and having a
program designed around those things can end up costing you more, in
the long run. I remember people spending days trying to optimize a
database design... then the new release of the database coming out and
shattering that design because the old strategies were now wrong for
the new design. Same for OS considerations... particularly on some
platforms where new OS releases are not always "improvements".

>
> But looking at the wiki:http://wiki.tcl.tk/348
> [lindex] is O(1)!
> but two items later the wiki says [foreach] is faster than [for/
> lindex]

Remember that the wiki is a community collaboration effort. And
information added there in, say, 2000, is not necessarily still true
in 2008. Same goes, of course, for any web site, book, mailing list,
or usenet group. What one generally needs to do is to develop testing
tools that allows one to easily profile the essential operations that
one's design is performing.

> I would have expected the string equal to be the fastest, yet the
> regexp w/inlined RE turned out to be noticeably faster.

It's hard to tell without seeing the literal entire code being used to
measure. The reason is that certain tcl coding styles can make
significant differences in timings. So _sometimes_ the style of the
mearsuing harness can result in poorer overall measurement times.
Granted, using the same harness for two techniques should give one
some comparative feeling.


> My real question is, what's a good way to learn this kind of
> information?
>
> Trial and error seems fraught with problems. Tcl's documentation
> doesn't appear to have this information. And the Tcl wiki, while full
> of information - reads more like a collection of tricks. What I'd
> hope for is, "these are the implications of using [regexp], [string
> equal], [lindex], ...."

There isn't a document of that type. However, the source code is
available for reading, and there's always the option of working
together with others on forums like this to develop a performance
testing script to measure what you want to measure.

Larry W. Virden

unread,
Aug 9, 2008, 6:08:48 AM8/9/08
to
On Aug 7, 5:20 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
> On Aug 7, 10:58 am, Jeff Hobbs <jeff.ho...@gmail.com> wrote:
>
> > Yes, very true of all languages, and as with many others, the rules
> > change from release to release.
>
> Sure, but basic information like "Tcl lists are really arrays" would
> have significant
> impact on my design of a sorting algorithm (for example). And w/out
> some basic information like
> that, I'd be just shooting in the dark when trying to speed things up.

Of course, making use of basic information like this is only useful if
you know the performance characteristics of the specific design. And
right now, other that the various attempts at discussing performance
on the wiki, I don't know there is another document of that type.
However, given that the wiki is a community supported collaboration
effort, perhaps that's a project you could start. Brainstorm with
others about the types of measurements you want to know, how best to
determine those measurements, how best to document those measurements,
the types of structures that should be measured, then as a group,
start developing a common testbed of software for measuring these
things, and then encourage the community to begin measuring across the
various platforms, compilers, compiler settings.

Sounds like a useful contribution to engineers using Tcl!


>> http://wiki.tcl.tk/1611. It covers a large range of operations and
> > commands performance over versions.
>
> I've looked at this, and while it does show a wide variety of
> characteristics, I personally find it difficult to extract higher
> level information.
>
> For example - list insertion/deletion operations.
>
> insertions all take ~17 ms (at begin, middle, end)
> i find this at odds w/the notion that lists are really arrays
> i'd expect arrays would imply end insertions faster than middle/
> begin insertions

Various possibilites occur to me - the measurements are not measuring
what you expect, your expectation on data structure performance
doesn't take into account the design used in Tcl, the compiler or
hardware used to measure compensated (or penalized) the data
structure...

>
> And I agree with Gerald's point that I shouldn't worry about
> performance until it's an issue. However
> given Tcl's "high level" of abstraction (being removed from the
> hardware), it's useful to have some basic understanding of what is
> under the hood.

Sounds like the basis of a new set of pages on the wiki, along with
some source diving!

>
> ps. And maybe the answer is, "Suck it up Trey, Tcl has a bunch of
> history w/language semantics and a newer byte compiler, and therefore
> it isn't straight forward. So dig in and learn the byte compilation
> rules and Tcl internals."

I would add to this answer "then document it somewhere (my preference
is the wiki) so that others benefit from your experience".

These types of technical discussions are useful when shedding new
light on topics that sometimes evolve over time and releases.

P.S. before doing much source diving, if I might suggest, upgrade to
Tcl 8.5.3 (or even 8.6 alpha) so that you aren't too far behind the
curve...

Larry W. Virden

unread,
Aug 9, 2008, 6:20:52 AM8/9/08
to
On Aug 7, 8:47 pm, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
> On Aug 7, 2:36 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
>

>
> > If speed were the real issue here, I'd either move to lex/yacc, or a
> > custom parser in C, or *gasp* bite the bullet and generate an
> > auxiliary file with the data already processed for me so I wouldn't
> > have to parse the text file to extract this tiny bit of data I'm
> > getting.
>
> You are making a huge mistake here. You cannot compare C code to Tcl
> commands to very high level applications like lex/yacc. Like I said,
> this is a huge distraction. We have a community of very helpful
> programmers. A lot of them have given advice in this thread.

I think his point was a fine one. He was asking whether anyone had
written engineering documentation for Tcl's internal structures,
providing information regarding complexity, performance, etc. Most
people here advise that, if one needs to improve performance in a Tcl
(or any other language) application, after getting the algorithms
right, one needs to move to C code.

>
> The bottom line is that if you are talking (or thinking) about the
> implementation of Tcl commands as opposed to either why the commands
> exist or when you should use them; you are not talking (or thinking)
> about Tcl programming, but C programming, but even then you are
> comparing apples and oranges.

His question seemed to me to be "how can I find the internals
information to allow me to compare different implementation strategies
for applications". I don't recall him ever asking "why does string
match exist" or "why do regular expressions exist" etc.

Since, Tcl is primarily a "fan supported" language, the documentation
that I've found, to date, has been less technically oriented than
documentation that one might find in an academic environment, or from
a company making "the big bucks" in some sort of engineering
environment, or in a commercial or governmental setting where
performance specs, etc. are required for contract purposes. Tcl is
quite a bit more casual an environment, and tends to expect that if
someone wants/needs some specific information, they will either be
willing to dig into the sources to get it, or pay someone else to
generate it.

Look at the wiki's pages for previous Tcl workshop/conferences. There
have been papers presented at times from work done within the
community around performance issues, etc. There isn't, however, a set
of engineering documents that one might see in some other languages.

Larry W. Virden

unread,
Aug 9, 2008, 6:23:57 AM8/9/08
to
On Aug 8, 6:37 am, Neil Madden <n...@cs.nott.ac.uk> wrote:

> Lists happen to be implemented as arrays.

I am confused by this statement.

Are you saying that, under the Tcl script language itself, the
interpreter uses Tcl arrays as the data structure for a Tcl list?

Larry W. Virden

unread,
Aug 9, 2008, 6:37:28 AM8/9/08
to
On Aug 8, 3:04 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
> On Aug 7, 5:47 pm, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:

>
> My question wasn't, "how do I best write this in Tcl?" or even "what's
> the fastest way to do this in Tcl?"
> But instead, "how do I examine Tcl code to gain understanding about
> the runtime performance?" I want the understanding/tools so that
> *when* it is appropriate, I can speed the code up. Heck, I just want
> to *know* for the sake of knowledge.
>
> Is a thirst for knowledge a mistake?

Certainly not. And looking for common tools, that one can use to gain
the information one would like, is a great idea. Having a set of tools
to gather specific performance information can not only help someone
in design decisions, but can help pin point places in Tcl where
improvements could be made. That's the gotcha, however, in designing
your program around specific performance behaviors of commands. When
the next version of Tcl comes out, the performance numbers may be
exactly the opposite of what they were at the time that you measured
and created your design. Of course, nothing forces you to move to the
next version of Tcl. However, if you have designed your application
around the specific functionality you need performed, rather than the
performance behavior of one specific version of Tcl, compiled on one
specific hardware/operating system/compiler/compiler flags platform,
then as the need arises, you can try to vary compiler flags, types or
versions of compilers, operating systems and hardware to see if an
improvement is available.


>
> Are you saying I should just code things the "Tcl way" and not bother
> my pretty little head with details of runtime? What if I'm just
> curious? And why should I be restrained to only think in Tcl? I'm
> just thinking about "programming" - Tcl is but a small part of that
> thought process.

I think what i am saying is that after you have gotten to the point
where your application is functioning exactly correct, then there are
a variety of options you have to improve performance. Your specific
application may warrant you freezing at a specific release of Tcl,
then digging down into that version to determine data structures, etc.
to try and squeeze the most out of that version. At the same time, you
would of course be freezing at a particular compiler verison, OS
version, hardware, etc. and trying to squeeze out the most of those
platforms.

In other cases, spending that time would be a matter of "diminishing
returns" - costing you more than benefiting you. If you optimize
around those things too much, then you limit yourself to where you go
in the future, because a change in compiler flags, compiler version,
OS version, hardware, or tcl version will alter the basis of the
performance changes you made, potentially making things worse.

When Tcl moved from standard ASCII to Unicode, for instance, there was
for a while a penalty that was incurred because the processing was a
bit more costly. Over time, a lot of that cost has been mitigated.
There continue to be edges and corners where things in certain
configurations can be costly - where you install Tcl libraries, for
instance, can cause a performance difference during startup (install
things on a flash drive vs hard drive, or into a directory with a lot
of other libraries from python, perl, ruby, etc. could cause tcl to
take longer to start up than installing it in a directory tree by
itself, etc.)

There are lots of factors. Having tools to investigate, and then
documenting what you find, is a very useful strategy.


>
> If comp.lang.tcl isn't the proper forum in which to ask questions
> like, "what's really going on with [linsert]?" What is?

This is the place.

Neil Madden

unread,
Aug 9, 2008, 7:50:54 AM8/9/08
to

No - I mean C-style arrays here, i.e. contiguous blocks of memory that
can be indexed in constant time (rather than a linked list). Of course,
the picture is more complicated than that -- closer to something like an
ArrayList in Java, which resizes as needed.

-- Neil

tom.rmadilo

unread,
Aug 11, 2008, 12:10:33 AM8/11/08
to
On Aug 8, 12:04 pm, bigfaceworm <bigfacew...@gmail.com> wrote:
> On Aug 7, 5:47 pm, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:

> > The bottom line is that if you are talking (or thinking) about the
> > implementation of Tcl commands as opposed to either why the commands
> > exist or when you should use them; you are not talking (or thinking)
> > about Tcl programming, but C programming, but even then you are
> > comparing apples and oranges.
>
> Are you saying I should just code things the "Tcl way" and not bother
> my pretty little head with details of runtime? What if I'm just
> curious? And why should I be restrained to only think in Tcl? I'm
> just thinking about "programming" - Tcl is but a small part of that
> thought process.

No I'm saying that Tcl commands are written/implemented in C. You can
look at Tcl code all day long and learn nothing about the details of
this implementation. Comparing the runtime of two completely different
commands may be interesting as an academic exercise, but not as part
of writing good code.

> Maybe I came across as a newbie to you in my posts. I'm not looking
> for advice in Tcl style (the zen of Tcl). And even if I were a
> newbie, and I were starting down the wrong path while trying to learn
> Tcl, let me wander...

I don't think I gave any advice on Tcl style, there is a style guide
already written for newbies or oldbies:

http://www.tcl.tk/doc/styleGuide.pdf


> If comp.lang.tcl isn't the proper forum in which to ask questions
> like, "what's really going on with [linsert]?" What is?

I'm not part of the forum police, but keep in mind that I don't always
post responses for the benefit of those asking questions. Maybe you
think certain questions are interesting from the standpoint of what
knowledge you can gain for yourself. But I'm not tech support. If you
think it is important to profile small chunks of code, even single
commands, great. In general, Tcl programmers should not waste their
time doing this for the reasons I stated. It is possible that there
are newbies who read this forum. I would rather them not waste their
time going down an obvious dead end. Where you go is up to you, I'm
just tacking up a few warning signs for those interested in getting
stuff done.

tom.rmadilo

unread,
Aug 11, 2008, 12:32:57 AM8/11/08
to
On Aug 9, 3:37 am, "Larry W. Virden" <lvir...@gmail.com> wrote:
> I think what i am saying is that after you have gotten to the point
> where your application is functioning exactly correct, then there are
> a variety of options you have to improve performance. Your specific
> application may warrant you freezing at a specific release of Tcl,
> then digging down into that version to determine data structures, etc.
> to try and squeeze the most out of that version. At the same time, you
> would of course be freezing at a particular compiler verison, OS
> version, hardware, etc. and trying to squeeze out the most of those
> platforms.

This is crazy talk. Of course if you are a developer and someone is
paying you for all this optimization, great. But if you are the one
doing the paying, run away from this reasoning.

Larry W. Virden

unread,
Aug 11, 2008, 7:49:40 AM8/11/08
to

I agree that _I_ wouldn't be interested in doing this. But I've seen
people who were "stuck" working in this mode, because of code freezes
on products, etc.

I myself probably would code the best algorithm I could find, see if
others had advice to improve the algorithm, then turn into C code
anything that was still a major bottle neck...

relaxmike

unread,
Aug 19, 2008, 4:10:05 AM8/19/08
to
Hi all again,

Thank you for all the good advices on this thread.
Since I agree that timings should be taken into
account after design, I continue to dig in the RE.
While I am at learning REs more deeply, I read the book
"Mastering Regular Expression" by Jeffrey E. F. Friedl.
The book gives many practical advices as well
as some details about Tcl. It is written :

"Tcl's regex engine was custom built for Tcl by regular-expression
legen Henry
Spencer who has done a fantastic job blending the best of both DFA and
NFA worlds. Henry noted himself in an April 2000 Usenet posting :

In general, the Tcl RE-matching
engine is much less sensitive to the exact form of the RE than
traditional
matching engines. Things that it does quickly will be fast no matter
how you
write them; things that it does slowly will be slow no matter how you
write them.
The old folklore about hand-optimizing your REs simply does not apply"

But if Spencer's engine is less sensitive than other engines to the
RE,
efficiency still matters. The main advice of the book to optimize
the RE is written on the backfront : "Matching just what you
want, but not what you don't want".
I tried to follow the advices of the book, using "time" to
check that I really improve efficiency of the RE.

The following is the list of tips I used to reduce the time
by a factor 2 :
- capture only the sub-string I need,
- replace greedy ".*" with more specific [^<] to avoid backtracking,
- add line anchors "^" and "$" at appropriate locations

This was my starting point :
time {regexp {(from)(\ *)(.*) <(.*)>} "from First1 Last1
<first...@gmail.com>" matched from blanks username useremail} 1000
71.617 microseconds per iteration

Then remove the unwanted captured variables :
(bin) 48 % time {regexp {from\ *(.*) <(.*)>} "from First1 Last1
<first...@gmail.com>" matched username useremail} 1000
48.078 microseconds per iteration

Add anchors to the begin and to the end to be more specific :
(bin) 51 % time {regexp {^from\ *(.*) <(.*)>$} "from First1 Last1
<first...@gmail.com>" matched username useremail} 1000
44.25 microseconds per iteration

Avoid greedy (.*) for the useremail :
(bin) 71 % time {regexp {^from\ *(.*) <([^>]*)>$} "from First1
Last1 <first...@gmail.com>" matched username useremail} 1000
43.199 microseconds per iteration

Avoid greedy (.*) for the username by using [^<] (I cannot use [^ ]
since there is a space between the first and the last name):
(bin) 72 % time {regexp {^from\ *([^<]*) <([^>]*)>$} "from First1
Last1 <first...@gmail.com>" matched username useremail} 1000
30.884 microseconds per iteration

Did I get all right ? Check it :
(bin) 73 % regexp {^from\ *([^<]*) <([^>]*)>$} "from First1 Last1
<first...@gmail.com>" matched username useremail
1
(bin) 74 % set username
First1 Last1
(bin) 75 % set useremail
first...@gmail.com

Use the "about" option for the starting RE :
(bin) 77 % regexp -about {(from)(\ *)(.*) <(.*)>} "from First1
Last1 <first...@gmail.com>"
4 {}

Since I have 4 capturing parenthesis, the result seems correct, but
does not tell me much.

Same for the last "optimized" RE does not give interesting
information :
(bin) 78 % regexp -about {^from\ *([^<]*) <([^>]*)>$}
2 {}

These experiments lead me to several questions :
- where can I find details on the "-about" option documented (only in
the source code ?)
- is there a Tcl-based tool to count the number of tests, of
backtracking, etc... ?

Regards,

Michaël

relaxmike

unread,
Aug 21, 2008, 10:18:14 AM8/21/08
to
I found a documentation of the values returned by the -about option of
the
regexp command in the wiki :

http://wiki.tcl.tk/986

Regards,

Michaël

0 new messages