
TclX loop slow in the default case


Evil Son

Aug 26, 2008, 5:02:41 PM
Hello

I wanted to see how much faster the Tclx loop command was compared to
the ordinary for loop.

When [loop] is used without an explicit increment, it defaults to 1.
I noticed that this default case runs very slowly.

Here is my script:
ed@aleph> cat loop.tcl

# timing "for" and the Tclx "loop"
package require Tclx

proc loop_default {lim} {
    loop i 0 $lim {}
}

proc loop_explicit {lim} {
    loop i 0 1 $lim {}
}

proc tcl_for {lim} {
    for {set i 0} {$i<$lim} {incr i} {}
}

proc run {lim} {
    puts "time (us) taken for $lim iterations"
    puts "[lindex [time {loop_default $lim}] 0]\t\
        <-- TclX 'loop' with default increment"
    puts "[lindex [time {loop_explicit $lim}] 0]\t\
        <-- TclX 'loop' with explicit increment"
    puts "[lindex [time {tcl_for $lim}] 0]\t\ <-- tcl 'for'"
}

if {[info script] eq $argv0} {
    puts "tclsh: [info patchlevel]"
    puts "Tclx:  [package require Tclx]"
    set lim [lindex $argv 0]
    run $lim
}
# end ----------------------------------------------------

I tested the script with different values for the iteration limit,
running it 9 times per limit and taking the median value. I tested with
Tcl 8.4 and 8.5.

ed@aleph>
ed@aleph> tclsh8.5 loop.tcl 10 > out.txt
ed@aleph> tclsh8.5 run-tests.tcl 100 >> out.txt
ed@aleph> tclsh8.5 run-tests.tcl 1000 >> out.txt
ed@aleph> tclsh8.5 run-tests.tcl 10000 >> out.txt
ed@aleph> cat out.txt
tclsh: 8.5.3
Tclx: 8.4
time (us) taken for 10 iterations
66   <-- TclX 'loop' with default increment
34   <-- TclX 'loop' with explicit increment
47   <-- tcl 'for'

8.4.18   8.5.3   %chng   proc
    23      21   -0.02   tcl_for
    62     172    1.10   loop_default
     7       9    0.02   loop_explicit

8.4.18   8.5.3   %chng   proc
   200     174   -0.26   tcl_for
   561    1605   10.44   loop_default
     7       9    0.02   loop_explicit

8.4.18   8.5.3   %chng   proc
  1961    1700   -2.61   tcl_for
  5525   16539  110.14   loop_default
     7      10    0.03   loop_explicit
ed@aleph>

Is this a bug for the default increment case?

Regards
Evilson

Alexandre Ferrieux

Aug 27, 2008, 3:23:49 AM
On Aug 26, 11:02 pm, Evil Son <ewilsonm...@gmail.com> wrote:
>
> proc loop_explicit {lim} {
> loop i 0 1 $lim {}

Afraid you swapped two arguments here:

loop var start end ?increment? command

so you're iterating from 0 to 1 with a $lim increment, explaining the
constant and extremely short time for loop_explicit:

> 7 9 0.02 loop_explicit
> 7 9 0.02 loop_explicit
> 7 10 0.03 loop_explicit

In general, an O(N) algorithm suddenly switching to O(1) in constant
space is suspect ;-) Another, slightly more interesting case would be
a one-page proof that P=NP...
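
(For the record, given that syntax, the corrected benchmark proc should
presumably just be

    proc loop_explicit {lim} {
        loop i 0 $lim 1 {}
    }

i.e. with the increment after the limit.)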

-Alex

Evil Son

Aug 27, 2008, 5:02:09 AM
On Aug 27, 5:23 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> On Aug 26, 11:02 pm, Evil Son <ewilsonm...@gmail.com> wrote:

> > proc loop_explicit {lim} {
> >   loop i 0 1 $lim {}
>
> Afraid you swapped two arguments here:

Aargh! Thanks! I had fun writing the test harness though, albeit all
screwed up.

>    loop var start end ?increment? command

> In general, an O(N) algorithm suddenly switching to O(1) in constant
> space is suspect ;-)

Indeed. Here are the new results:

ed@aleph> tclsh8.5 run-tests.tcl 10
  8.4.18   8.5.3   ratio   proc
       6       6    1.00   tcl_for
      12      25    2.08   loop_default
      12      25    2.08   loop_explicit
ed@aleph> tclsh8.5 run-tests.tcl 100
  8.4.18   8.5.3   ratio   proc
      23      21    0.91   tcl_for
      58     176    3.03   loop_default
      59     171    2.90   loop_explicit
ed@aleph> tclsh8.5 run-tests.tcl 1000
  8.4.18   8.5.3   ratio   proc
     199     174    0.87   tcl_for
     522    1703    3.26   loop_default
     531    1718    3.24   loop_explicit
ed@aleph> tclsh8.5 run-tests.tcl 10000
  8.4.18   8.5.3   ratio   proc
    1966    1699    0.86   tcl_for
    5466   16827    3.08   loop_default
    5481   16433    3.00   loop_explicit

As always, I remain obliged.

Alexandre Ferrieux

Aug 27, 2008, 5:41:21 AM
On Aug 27, 11:02 am, Evil Son <ewilsonm...@gmail.com> wrote:
> Here are the new results:
>
> ed@aleph> tclsh8.5 run-tests.tcl 10
> 8.4.18 8.5.3 ratio proc
> 6 6 1.00 tcl_for
> 12 25 2.08 loop_default
> 12 25 2.08 loop_explicit
> ed@aleph> tclsh8.5 run-tests.tcl 100
> 8.4.18 8.5.3 ratio proc
> 23 21 0.91 tcl_for
> 58 176 3.03 loop_default
> 59 171 2.90 loop_explicit
> ed@aleph> tclsh8.5 run-tests.tcl 1000
> 8.4.18 8.5.3 ratio proc
> 199 174 0.87 tcl_for
> 522 1703 3.26 loop_default
> 531 1718 3.24 loop_explicit
> ed@aleph> tclsh8.5 run-tests.tcl 10000
> 8.4.18 8.5.3 ratio proc
> 1966 1699 0.86 tcl_for
> 5466 16827 3.08 loop_default
> 5481 16433 3.00 loop_explicit

As you know of course, the near 3x speed gain for [for] is due to
bytecode compilation of its body, while TclX's [loop] command calls
Tcl_EvalObj at each turn.

This experiment IMO fuels the idea that preprocessing could help in
Tcl too from time to time: it would allow to define countless such
variants of Tcl's optimized control primitives without losing the
performance.

Notice that if you're bold enough to ignore fears of string collision,
you can define a variant of proc that does dangerous-yet-useful regexp
parsing. Here you can use [regexp] just to locate the [loop] word, and
then [info complete] to properly find the boundaries of the statement.
Then it's just a matter of substituting the equivalent [for]
incantation...
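
As a rough sketch of the idea (illustrative only: it handles a single,
non-nested [loop] statement, makes no attempt at robustness, and the
name defloop is made up):

    proc defloop {name arglist body} {
        # locate the start of a "loop var first limit body" statement
        if {[regexp -indices -line {^[ \t]*loop\s} $body idx]} {
            set start [lindex $idx 0]
            # grow the candidate until [info complete] says the statement is whole
            set end [string first \n $body $start]
            while {$end >= 0 &&
                   ![info complete [string range $body $start $end]]} {
                set end [string first \n $body [expr {$end + 1}]]
            }
            if {$end < 0} {set end [string length $body]}
            set stmt [string trim [string range $body $start $end]]
            lassign $stmt cmd var first limit loopbody
            # substitute the equivalent [for] incantation
            set forCmd [list for [list set $var $first] "\$$var < $limit" \
                                 [list incr $var] $loopbody]
            set body [string replace $body $start $end "$forCmd\n"]
        }
        uplevel 1 [list proc $name $arglist $body]
    }

    defloop count_squares {n} {
        loop i 0 $n {
            puts [expr {$i * $i}]
        }
    }
    count_squares 5      ;# prints 0 1 4 9 16, with a fully bytecompiled [for]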

-Alex

Evil Son

Aug 27, 2008, 9:15:52 AM
On Aug 27, 7:41 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> ...

> As you know of course, the near 3x speed gain for [for] is due to
> bytecode compilation of its body, while TclX's [loop] command calls
> Tcl_EvalObj at each turn.
>
> This experiment IMO fuels the idea that preprocessing could help in
> Tcl too from time to time: it would allow to define countless such
> variants of Tcl's optimized control primitives without losing the
> performance.

I couldn't agree more.

> Notice that if you're bold enough to ignore fears of string collision,
> you can define a variant of proc that does dangerous-yet-useful regexp
> parsing. Here you can use [regexp] just to locate the [loop] word, and
> then [info complete] to properly find the boundaries of the statement.
> Then it's just a matter of substituting the equivalent [for]
> incantation...

I'm going to be looking at this method as well as some of the 'macro'
techniques available on the wiki. Baby steps just now ... but this
language is addictive.

Cheers.
Evil Son.

Donal K. Fellows

Aug 27, 2008, 10:48:15 AM
Alexandre Ferrieux wrote:
> As you know of course, the near 3x speed gain for [for] is due to
> bytecode compilation of its body, while TclX's [loop] command calls
> Tcl_EvalObj at each turn.

That's not it. Tcl will store the bytecodes for the loop body in that
object and reuse them. (The recompilation hit is actually about 10x when
it strikes, BTW, assuming a trivial script body.) The real hit is
probably due to the fact that TclX still uses the string-based variable
API; if they used Tcl_ObjSetVar2 instead of Tcl_SetVar2Ex, they'd go
quite a bit faster as variable names would not need to be reparsed every
time through the loop. (And the reparsing involves a fair bit of memory
management too...)

I know this because I've just checked the relevant code (tclBasic.c and
tclXgeneral.c FYI).

> This experiment IMO fuels the idea that preprocessing could help in
> Tcl too from time to time: it would allow to define countless such
> variants of Tcl's optimized control primitives without losing the
> performance.

That's the "LISP solution". And it turns out it's a wrong analysis in
this case. :-)

Donal.

Alexandre Ferrieux

Aug 27, 2008, 11:14:13 AM
On Aug 27, 4:48 pm, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:
>
> [overhead due to setting var rather than Eval]

I stand corrected !

> > This experiment IMO fuels the idea that preprocessing could help in
> > Tcl too from time to time: it would allow to define countless such
> > variants of Tcl's optimized control primitives without losing the
> > performance.
>
> That's the "LISP solution". And it turns out it's a wrong analysis in
> this case. :-)

Why a wrong analysis ? How would you define [loop] in pure Tcl so that
it has the same performance as a [for] ?

-Alex

Donal K. Fellows

Aug 27, 2008, 9:03:00 PM
Alexandre Ferrieux wrote:
> Why a wrong analysis ? How would you define [loop] in pure Tcl so that
> it has the same performance as a [for] ?

In 8.5 and before, you can't because the body script will not be
compiled to have efficient access to local variables. (This might
change in 8.6 AIUI.) But apart from that you can get close:

proc loop {var from to body} {
    upvar 1 $var v
    for {set v $from} {$v <= $to} {incr v} {
        uplevel 1 $body
    }
}

Yes, that's probably the sort of thing you'd write naïvely. That will
use maximally efficient variable accesses (except in the body at the
moment because we don't currently reconnect the compilation of the
body to the local variable table of the calling context) and will
compile the body. By contrast, the TclX loop command is actually less
efficient than that as it currently uses code that isn't very
efficient to assign the loop variables. (The API it uses is optimized
for ease of use with literal variable names, not for execution speed.)
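
For instance, a quick (hypothetical) way to time it against a plain
inlined [for], assuming the pure-Tcl [loop] above is the one in scope:

    proc t_loop {} { loop j 0 10000 {} }
    proc t_for  {} { for {set j 0} {$j <= 10000} {incr j} {} }
    puts "loop: [time t_loop 100]"
    puts "for:  [time t_for 100]"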

Donal.

Evil Son

Aug 27, 2008, 11:02:58 PM
On Aug 28, 11:03 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

Funny you mention it, I had indeed written the naive implementation as
an exercise :-) I had gotten the following results for counting 0 --> 10,000-1:

ed@aleph> tclsh8.5 run-tests.tcl 10000
  8.4.18   8.5.3   ratio   proc
    1960    1699    0.87   tcl_for
    5481   16175    2.95   loop_default
    9020   11777    1.31   loop_eval  <-- naive implementation

Evil Son

Aug 27, 2008, 11:11:50 PM
> ed@aleph> tclsh8.5 run-tests.tcl 10000
> 8.4.18  8.5.3   ratio   proc
> 1960    1699      0.87  tcl_for
> 5481    16175     2.95  loop_default
> 9020    11777     1.31  loop_eval  <-- naive implemenation

And here is the result if I use a fixed increment (which is the default
case), i.e.

proc my_loop {index init bound inc script} {
    upvar 1 $index i
    for {set i $init} {$i < $bound} {incr i} {   ;# we don't use "incr i $inc"
        uplevel 1 $script
    }
}

ed@aleph> tclsh8.5 run-tests.tcl 10000
  8.4.18   8.5.3   ratio   proc
    1965    1705    0.87   tcl_for
    5400   17278    3.20   loop_default
    8620   10818    1.25   loop_eval_fixed_increment

The last line removes the cost of a single variable access
i.e. 8620 vs 9020

miguel

Aug 27, 2008, 11:28:15 PM
I seem to have missed this thread ...

In 8.5 the variable access code has been optimised for access via Tcl_Obj names,
with some cost for access via the string interfaces. If TclX is using the string
interfaces, this could explain things.

Alexandre Ferrieux

Aug 28, 2008, 3:35:14 AM
On Aug 28, 3:03 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

> Alexandre Ferrieux wrote:
> > Why a wrong analysis ? How would you define [loop] in pure Tcl so that
> > it has the same performance as a [for] ?
>
> In 8.5 and before, you can't because the body script will not be
> compiled to have efficient access to local variables. (This might
> change in 8.6 AIUI.) But apart from that you can get close:
>
> proc loop {var from to body} {
> upvar 1 $var v
> for {set v $from} {$v <= $to} {incr v} {
> uplevel 1 $body
> }
> }

What about [uplevel {for ...}] instead of [for ... uplevel] ?
(Miguel, what versions received your patch bringing compilation of
[uplevel] ?)

proc loop {var from to body} {
    uplevel 1 [list for [list set $var $from] "\$$var<$to" [list incr $var] $body]
}
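
(Quick sanity check of the rewrite, assuming I got the quoting right:
[loop j 0 3 {puts $j}] should uplevel the script

    for {set j 0} {$j<3} {incr j} {puts $j}

and print 0, 1 and 2.)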

-Alex

Koen Danckaert

Aug 28, 2008, 5:32:22 AM
Alexandre Ferrieux wrote:
> On Aug 28, 3:03 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
> wrote:
>> Alexandre Ferrieux wrote:
>>> Why a wrong analysis ? How would you define [loop] in pure Tcl so that
>>> it has the same performance as a [for] ?
>> In 8.5 and before, you can't because the body script will not be
>> compiled to have efficient access to local variables. (This might
>> change in 8.6 AIUI.) But apart from that you can get close:
>>
>> proc loop {var from to body} {
>> upvar 1 $var v
>> for {set v $from} {$v <= $to} {incr v} {
>> uplevel 1 $body
>> }
>> }
>
> (Miguel, what versions received your patch bringing compilation of
> [uplevel] ?)

I think this patch is in 8.5.3, but it doesn't show up in Evilson's examples because he uses an empty loop body. Using just a minimal loop body (for example {expr {$i+5}}) gives largely different results:

tclsh: 8.4.12
time (us) taken for 10000 iterations
1157 <-- tcl 'for'
33650 <-- loop_eval

tclsh: 8.5.4
time (us) taken for 10000 iterations
1151 <-- tcl 'for'
5271 <-- loop_eval
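
(For reference, the kind of test I mean is roughly the following sketch,
not my exact script:

    proc loop_eval {var from to body} {
        upvar 1 $var v
        for {set v $from} {$v < $to} {incr v} {
            uplevel 1 $body
        }
    }
    proc t_for  {lim} { for {set i 0} {$i<$lim} {incr i} {expr {$i+5}} }
    proc t_loop {lim} { loop_eval i 0 $lim {expr {$i+5}} }
    puts "[lindex [time {t_for  10000}] 0]\t<-- tcl 'for'"
    puts "[lindex [time {t_loop 10000}] 0]\t<-- loop_eval"
)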

--Koen

Evil Son

Aug 28, 2008, 3:14:13 PM
On Aug 28, 7:32 pm, Koen Danckaert <k...@usenet.cnntp.org> wrote:
> I think this patch is in 8.5.3, but it doesn't show up in Evilson's examples because he uses an empty loop body. Using just a minimal loop body (for example {expr {$i+5}}) gives largely different results:
>
> tclsh: 8.4.12
> time (us) taken for 10000 iterations
> 1157     <-- tcl 'for'
> 33650    <-- loop_eval
>
> tclsh: 8.5.4
> time (us) taken for 10000 iterations
> 1151     <-- tcl 'for'
> 5271     <-- loop_eval
>
> --Koen

Indeed, I was wondering about the iteration speed. So then I put an
[expr] in the body and got:

tclsh 8.4.18
2944 naked_for
14215 tclx_loop
81158 naive_loop
23162 clever_loop <-- from Alex earlier in thread

tclsh 8.5.3
3949 naked_for
23765 tclx_loop <-- as dkf/msf expected
105259 naive_loop <-- what the!
25769 clever_loop

tclsh 8.5.4
4040 naked_for
24557 tclx_loop
17135 naive_loop <-- optimisation work kicks in as Koen/msf said.
35612 clever_loop <-- not in the 8.5.4 possible world

And just to make a fool of myself, I'll include my bug-ridden procs
#-------------------------------------------------------------
# ignore $body, we want a baseline
proc naked_for { var from to {body {}} } {
    for {set var $from} {$var<$to} {incr var} {expr {$var*$var}}
}

# e.g. from dkf which responds to all the optimisation work
proc naive_loop { var from to {body {}} } {
    upvar $var i
    for {set i $from} {$i<$to} {incr i} {
        uplevel 1 $body
    }
}

# ignore body, 'loop' already has some evaluating to do
proc tclx_loop { var from to {body {}} } {
    upvar $var i
    loop i $from $to 1 { expr {$i*$i} }
}

# from Alex
proc clever_loop { var from to {body {}} } {
    uplevel 1 [list for [list set $var $from] \
        "\$$var<$to" [list incr $var] $body]
}

The procs above are called from:
#-------------------------------------------------------------
proc main {n} {
    global aProcs

    foreach {i} [lsort -integer [array names aProcs]] {
        set proc $aProcs($i)
        if {[info procs $proc] eq $proc} {
            puts -nonewline [timeus [list \
                $proc var 0 $n {expr {$var*$var}}]]
            puts \t$proc
        } else {
            puts -nonewline "proc $proc currently undefined"
        }
    }
}

# timeus is a helper proc that does the obvious thing

Cheers
Evilson.


Alexandre Ferrieux

Aug 28, 2008, 5:42:16 PM
On Aug 28, 9:14 pm, Evil Son <ewilsonm...@gmail.com> wrote:
> tclsh 8.5.4

> 17135   naive_loop <-- optimisation work kicks in as Koen/msf said.
> 35612   clever_loop <-- not in the 8.5.4 possible world

Argh. Donal, can you explain why, despite [uplevel] compiling its
argument, uplevelling a computed [for] doesn't approach the speed of
the naked for (nor even that of for..uplevel) ?

-Alex

Donal K. Fellows

Aug 29, 2008, 4:31:12 AM
Alexandre Ferrieux wrote:

> Evil Son wrote:
>> tclsh 8.5.4
>> 17135 naive_loop <-- optimisation work kicks in as Koen/msf said.
>> 35612 clever_loop <-- not in the 8.5.4 possible world
>
> Argh. Donal, can you explain why, despite [uplevel] compiling its
> argument, uplevelling a computed [for] doesn't approach the speed of
> the naked for (nor even that of for..uplevel) ?

Yes.

Variable accesses are only efficient when they can be tied to an index
into a local variable table. This process is (probably) the #1 source of
speed in Tcl's bytecode engine (either that or the dealing with the fact
that no parsing needs to be done, but that's out of scope here). When a
script is processed with [uplevel] in 8.5 and before[*] then it cannot
be bound to local variables, resulting in each variable access for
reading or writing being done through a hash table lookup. For a [for]
loop, this is quite a significant cost!
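
One way to see the difference for yourself, assuming your build exposes
the (unsupported) disassembler that 8.5 ships with:

    proc by_index {} { set a 1; incr a }
    puts [::tcl::unsupported::disassemble proc by_index]
    #   -> the accesses to "a" appear as storeScalar1/incrScalar1Imm (LVT slot)

    puts [::tcl::unsupported::disassemble script {set a 1; incr a}]
    #   -> the same code outside a proc uses storeScalarStk/incrScalarStkImm
    #      (lookup by name every time)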

OK, perhaps the "cannot" was a bit strong. "Has not yet been possible
to" is perhaps closer to the truth.

In the "naïve" version, the [upvar] binds the variable in the parent
scope to a local variable in the [naive_loop]'s own scope, and makes the
actual [for] loop part of it efficient (though accesses from inside the
body still suck). BTW, if you want performance out of [uplevel], always
remember to include the level argument in 8.5.
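
A quick way to convince yourself of that last point (hypothetical
micro-benchmark, nothing scientific):

    proc with_level    {} { uplevel 1 {set ::x 1} }
    proc without_level {} { uplevel   {set ::x 1} }
    puts [time with_level  100000]
    puts [time without_level 100000]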

Donal.
[* This might change in 8.6. ]

Alexandre Ferrieux

Aug 29, 2008, 6:24:41 AM
On Aug 29, 10:31 am, "Donal K. Fellows"

<donal.k.fell...@manchester.ac.uk> wrote:
>
> Variable accesses are only efficient when they can be tied to an index
> into a local variable table. This process is (probably) the #1 source of
> speed in Tcl's bytecode engine (either that or the dealing with the fact
> that no parsing needs to be done, but that's out of scope here).

OK, so the compiling of uplevel's body bridged "only" half the canyon,
right ?
(That's not quite out of scope. It is interesting for performance-
concerned people to know such properties !)

> When a script is processed with [uplevel] in 8.5 and before[*] then it

> cannot be bound to local variables ...


> OK, perhaps the "cannot" was a bit strong. "Has not yet been possible
> to" is perhaps closer to the truth.

> [* This might change in 8.6. ]

Very interesting !

Random idea: what about extending the bytecode compiler API so that it
can be passed a non-empty "context" (compiledLocals, possibly
literals, etc) instead of starting afresh ? This way, in the dominant
case where the upper context is also a compiled proc, the body-
bytecode would "land" among the upper locals and access them at full
speed (not even a linkvar). Or is such sharing of lists of compiled
locals impossible in the current setup ?

-Alex

miguel

Aug 29, 2008, 6:42:16 AM
Alexandre Ferrieux wrote:
> Random idea: what about extending the bytecode compiler API so that it
> can be passed a non-empty "context" (compiledLocals, possibly
> literals, etc) instead of starting afresh ? This way, in the dominant
> case where the upper context is also a compiled proc, the body-
> bytecode would "land" among the upper locals and access them at full
> speed (not even a linkvar). Or is such sharing of lists of compiled
> locals impossible in the current setup ?

Done in 8.6 HEAD since 2008-06-08 [Patch 1973096]

Donal K. Fellows

Aug 29, 2008, 9:07:03 AM
miguel wrote:
> Done in 8.6 HEAD since 2008-06-08 [Patch 1973096]

I wasn't sure if that had been done or not. I knew you'd been talking
about it, but I'd been forced to focus on work at that time... :-(

Donal.

Glenn Jackman

Aug 29, 2008, 10:06:12 AM
At 2008-08-29 06:24AM, "Alexandre Ferrieux" wrote:
> Random idea: what about extending the bytecode compiler API so that it
[...]

Your random idea generator sucks ;)

A truly random idea would involve whales or petunias.


--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous

Glenn Jackman

Aug 29, 2008, 10:15:44 AM
At 2008-08-29 06:24AM, "Alexandre Ferrieux" wrote:
> Random idea: what about extending the bytecode compiler API so that it

Uwe Klein

Aug 29, 2008, 10:17:06 AM
Glenn Jackman wrote:
> At 2008-08-29 06:24AM, "Alexandre Ferrieux" wrote:
>
>> Random idea: what about extending the bytecode compiler API so that it
>
> [...]
>
> Your random idea generator sucks ;)
>
> A truly random idea would involve whales or petunias.
>
>
OK, could we extend the bytecode compiler to
have scripted whale action towards plucking petunias?

uwe

Alexandre Ferrieux

Aug 29, 2008, 11:48:58 AM

Guys, what's nice with Miguel is that not only does he make my dreams
come true, but sometimes he does it in advance just to be sure 8^P

Uwe Klein

Aug 29, 2008, 12:12:33 PM
Alexandre Ferrieux wrote:

> Guys, what's nice with Miguel is that not only does he make my drems
> come true, but sometimes he does it in advance just to be sure 8^P

A Deja Gnue?

uwe

Evil Son

Aug 29, 2008, 2:46:04 PM

Previous scores favoured 8.4 since they used a non-thread-enabled
build. I've also updated the test with 8.6a2 (which I presume is
unoptimised)

n = 10,000
tclsh 8.4.19
4287 naked_for
18046 tclx_loop
101747 naive_loop
31775 clever_loop

tclsh 8.5.3
3820 naked_for
22843 tclx_loop
103781 naive_loop
26565 clever_loop

tclsh 8.5.4
4048 naked_for
24751 tclx_loop
16829 naive_loop
35350 clever_loop

tclsh 8.6a2
4590 naked_for
34790 tclx_loop
26192 naive_loop
48954 clever_loop

Donal K. Fellows

Aug 29, 2008, 6:00:33 PM
Evil Son wrote:
> I've also updated the test with 8.6a2 (which I presume is unoptimised)

We know there are some significant issues with 8.6a2 (my testing
indicates that 8.5 goes around 50% faster, at least for some kinds of
tight loop) as it is using a substantially different execution model
internally. On the other hand, the new model brings quite a lot of
benefits, notably including coroutines. Guess we need to do a lot more
optimization. :-\

Donal (no, the location of the problem isn't very obvious when I
profile things).

Alexandre Ferrieux

Aug 29, 2008, 6:19:57 PM
On Aug 29, 5:48 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

Update: while the above two lines certainly hold regardless, I'm
disappointed to report that in HEAD (8.6a3), the uplevelled-for still
crawls at hardly more than half the speed of the naive for-uplevel:

$ ./tclsh86.exe
% proc tfor {} {for {set x 0} {$x<10000} {incr x} {}}
% proc tup {} {uploop x 0 10000 {}}
% proc uploop {var from to body} {
      upvar 1 $var v
      for {set v $from} {$v <= $to} {incr v} {
          uplevel 1 $body
      }
  }
% proc loop {var from to body} {
      uplevel 1 [list for [list set $var $from] "\$$var<$to" [list incr $var] $body]
  }
% proc tloop {} {set x 0;incr x;loop x 0 10000 {}}
% time tfor
751 microseconds per iteration
% time tup
5731 microseconds per iteration
% time tloop
10593 microseconds per iteration

Any idea why ? (notice that in tloop I took steps so that the upper
context has a compiled local for the iteration variable).

-Alex

miguel

Aug 29, 2008, 7:00:47 PM

I can talk about this ... you decide if any of this makes sense (and time it if
you want).

(1) tup is calling [uplevel 1] once per iteration, which implies a bit of
activity; tloop is calling it just once: ADVANTAGE tloop

(2) tup and tloop are both calling TEBC to run $body once per iteration, it's a tie

(3) tup is accessing $v by index; tloop by name: ADVANTAGE tup

And the winner is one of (or should be) ...

proc cheatloop {var from to body} {
    set nbody "for {set $var $from} {incr var} {\$var < $to} {$body}"
    uplevel 1 $nbody
}
proc cloop {} {cheatloop x 0 10000 {}}

(Note that this is very similar to your tloop, but carefully bypassing the list
optimisation - we do want the [for] to be bytecompiled, not just $body)

Another contender for the championship would be

proc cheatloop2 {var from to body} {
    set nbody "set $var $from\nwhile 1 {$body\n if {\[incr v\] >= $to} break}"
    uplevel 1 $nbody
}

Yes! A single [uplevel] call, the evaluation of $body's bytecodes occurs
without exiting and re-entering TEBC, and the loop test accesses the var
by index!

What I seem to be missing is the point of this exercise ...

miguel

Aug 29, 2008, 7:32:44 PM

One cheap way to remove some unnecessary work is to change
#define NRE_ENABLE_ASSERTS 1
to
#define NRE_ENABLE_ASSERTS 0
in generic/tclInt.h, although that should not make too big a difference (most
assertions are for the coroutine stuff).

There's also quite a bit of streamlining within TEBC to be done. And possibly
some inlining of frequent callbacks in TclNRRunCallbacks. Merging some frequent
callbacks may be possible too, so that we end up doing one indirect call where
we are doing several now.

There is also the possibility of making TEBC be a bit cleverer and avoid some
calls to TEOV.

And who knows what else we'll find when we start looking. I imagine that the
penalty should be around 10% with a reasonable optimizing compiler.

Donal K. Fellows

Aug 30, 2008, 4:13:08 AM
Miguel Sofer wrote:
> And the winner is one of (or should be) ...
>
>   proc cheatloop {var from to body} {
>        set nbody "for {set $var $from} {incr var} {\$var < $to} {$body}"
>        uplevel 1 $nbody
>   }
>   proc cloop {cheatloop x 0 10000 {}}

I wish to enter the following correction:

proc cheatloop {var from to body} {
    set var [list $var]
    set nbody "for {set $var [list $from]} {\$$var < [list $to]} \
                   {incr $var} [list $body]"
    uplevel 1 $nbody
}

That should have the bonus of being both correct and giving reasonable
errors when bogus input values are used. And still be just about as
efficient too. (Timing with the HEAD and this, I find that it is twice
as fast as the naive code and 5 times slower than a real inlined [for]
when doing either 10k or 100k iterations. YMMV.)

Donal.

Evil Son

Aug 30, 2008, 7:46:50 AM
On Aug 30, 6:13 pm, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

Taking Donal's proc as cheat3, my results n=10,000:

         forL   tclx   naive  clever cheat1 cheat2 cheat3
8.4.19   4211  17915  102339   31445  31626  20648  31157
8.5.3    3985  23580  105079   26259  26738  23510  26913
8.5.4    4099  24534   17188   37187  15107  13141  15525
8.6a2    4509  34646   26216   48983  16910  14573  16909

(normalised on 8.4.19 => compare down)
         forL   tclx   naive  clever cheat1 cheat2 cheat3
8.4.19   1.00   1.00    1.00    1.00   1.00   1.00   1.00
8.5.3    0.95   1.32    1.03    0.84   0.85   1.14   0.86
8.5.4    0.97   1.37    0.17    1.18   0.48   0.64   0.50
8.6a2    1.07   1.93    0.26    1.56   0.53   0.71   0.54

(normalised on [for] => compare across)
         forL   tclx   naive  clever cheat1 cheat2 cheat3
8.4.19   1.00   4.25   24.30    7.47   7.51   4.90   7.40
8.5.3    1.00   5.92   26.37    6.59   6.71   5.90   6.75
8.5.4    1.00   5.99    4.19    9.07   3.69   3.21   3.79
8.6a2    1.00   7.68    5.81   10.86   3.75   3.23   3.75

Normalised results almost identical with n=100,000

=> the cheats have it.

Alexandre Ferrieux

Aug 31, 2008, 6:18:31 AM
On Aug 30, 1:00 am, miguel <mso...@users.sf.net> wrote:
>
> What I seem to be missing is the point of this exercise ...

The point was merely my own education, and it worked ;-)
To summarise:

(1) It was not really obvious (to me) why uplevel..for was so slow

(2) Donal gave the explanation, namely the var lookup by name

(3) So I suggested to arrange for the uplevelled bytecode to use the
compiledLocals of its homeland

(4) And you reminded us that you had already done and committed it to
the 8.6 branch

(5) Then I timed it in 8.6 and found it still slow.

(6) You found the second non-obvious (again, to me) gotcha: code which
already has a List intrep is compiled in a "superficial" manner (ie
invokeStk, regardless of the existence of a compileProc like [for]'s).

(7) You demonstrated the fact by winning the contest with the
associated cheat
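
To make (6) concrete for other readers, the difference is roughly
between handing [uplevel] a script built with [list] (which therefore
carries a list intrep) and the same text built as a plain string:

    set var x; set from 0; set to 10; set body {}
    # pure-list form: gets the "superficial" compilation (invokeStk on [for])
    set asList   [list for [list set $var $from] "\$$var < $to" [list incr $var] $body]
    # plain-string form: lets the compiler inline [for] as usual
    set asString "for {set $var $from} {\$$var < $to} {incr $var} {$body}"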

Please correct me if any of the above is inaccurate (especially my
interpretation in (6)).
Also, can you explain why the cheats are still 3.75 times behind naked
[for] (see Evil Son's post) ?

-Alex


Donal K. Fellows

Aug 31, 2008, 7:47:28 AM
Alexandre Ferrieux wrote:
> Also, can you explain why the cheats are still 3.75 times behind naked
> [for] (see Evil Son's post) ?

At an (informed) guess, because the variables still aren't being bound
directly to local variable table indices (though the LVT indices may
be being stored in the internal representation of the variable names)
and that forces time to be spent parsing variables and dealing with
hash tables. This is because the loop variable in the test harness
isn't actually being assigned a slot in the LVT at all.

Using (a corrected version of) Miguel's cheatloop2, I get this:

% proc c2 {} {cheatloop2 x 0 10000 {}}
% time c2 1000
2078.150919 microseconds per iteration
% proc c2 {} {set x {}; cheatloop2 x 0 10000 {}}
% time c2 1000
712.200715 microseconds per iteration

For comparison:

% proc floop {} {for {set x 0} {$x<10000} {incr x} {}}
% time floop 1000
636.676754 microseconds per iteration

Not bad given that you've got the overhead of dynamic programming!

(Also note that the disassembly of the thing generated inside
[cheatloop2] is different in the two cases. Remember to put the probe
after the [uplevel] call if you're testing this yourself.)

Donal.

Evil Son

Aug 31, 2008, 10:35:14 AM
On Aug 31, 9:47 pm, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

> Alexandre Ferrieux wrote:
> > Also, can you explain why the cheats are still 3.75 times behind naked
> > [for] (see Evil Son's post) ?
> Using (a corrected version of) Miguel's cheatloop2, I get this:
> [snip]

Yes, I used a /fixed/ version of Miguel's cheat2 to get the results
in:
http://groups.google.com/group/comp.lang.tcl/msg/d4fdbff8cacf3dca
I omitted to mention that cheat3 is Donal's version of the corrected
cheat2.

> Not bad given that you've got the overhead of dynamic programming!

Not bad at all.

I also noticed that calling an empty proc in the /same/ test has 8.6a2
taking an extra 25-30% compared to earlier versions. (In the test, I
call procs through a $proc variable)

But no doubt this is covered by the tcl tests and isn't news.

Thanks
Evil Son

Donal K. Fellows

Aug 31, 2008, 5:00:23 PM
Evil Son wrote:
> Yes, I used a /fixed/ version of Miguel's cheat2 to get the results
> in:
> http://groups.google.com/group/comp.lang.tcl/msg/d4fdbff8cacf3dca
> I omitted to mention that cheat3 is Donal's version of the corrected
> cheat2.

Sure, but the important thing is that you get *real* speed by using a
variable that's in the LVT in all cases, and you can't do that from the
loop-code generator; the LVT is set up once when the procedure that
calls the loop generator is compiled. Putting a dummy [set] in was
enough to force the allocation of an entry and make the code fast (with
no change to the loop generator at all).

All of which goes to show this is a complicated business.

> I also noticed that calling an empty proc in the /same/ test has 8.6a2
> taking an extra 25-30% compared to earlier versions. (In the test, I
> call procs through a $proc variable)
>
> But no doubt this is covered by the tcl tests and isn't news.

We don't cover performance bugs in the test suite because there's just
no way to reliably test for them without adding lots of extra side
conditions (e.g. machine has to be quiescent apart from the test suite,
speed has to be reliably measurable, etc.) It's horribly difficult to do
that, so we don't. Instead, we have a benchmark suite (tclbench) that
allows people to do manual figuring out of what is significant. The
benchmarks probably need some maintenance; they're probably also in the
wrong project right now (tcllib...)

Donal.

Alexandre Ferrieux

Aug 31, 2008, 6:29:41 PM
On Aug 31, 1:47 pm, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:
>

>   % proc c2 {} {set x {}; cheatloop2 x 0 10000 {}}
>   712.200715 microseconds per iteration

Gosh ! Of course !
(What's funny is that higher up in this thread I had also thought
about this, prepending not only "set x 0" but also "incr x", as a
superstitious hint to a (yet nonexistent) type inference engine ;-)

So, summing up the recipe, to make [uplevel..for] race, one needs:
(1) 8.6 with Miguel's patch compiling [uplevel] within the LVT of the
up context.
(2) an uplevel body which is carefully not listified
(3) a caller context "declaring" the loop variable (ie allocating a
slot in the LVT)

Is this accurate ?
(Anyhow I'm glad I ventured here -- amazing how far a seemingly boring
comparison with Tclx leads into the roots of Tcl's performance)
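
Putting the three together, the kind of caller I have in mind is
something like this (reusing the fixed cheatloop2 from earlier in the
thread):

    proc racer {} {
        set x 0                    ;# (3) pre-declare x so it gets an LVT slot
        cheatloop2 x 0 10000 {}    ;# (2) cheatloop2 builds its script as a string
    }
    time racer 1000                ;# (1) on 8.6 HEAD with the [uplevel] patch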

-Alex

Fredderic

Sep 2, 2008, 7:42:45 AM
On 29 Aug 2008 14:15:44 GMT,
Glenn Jackman <gle...@ncf.ca> wrote:

> At 2008-08-29 06:24AM, "Alexandre Ferrieux" wrote:
>> Random idea: what about extending the bytecode compiler API so
>> that it
> [...]
> Your random idea generator sucks ;)
> A truly random idea would involve whales or petunias.

Oh no, not again.


--
Fredderic

You would if you could, but you can't so you won't.

Debian/unstable (LC#384816) on i686 2.6.23-z2 2007 (up 61 days, 7:00)

Evil Son

Sep 7, 2008, 1:27:07 PM
On Sep 1, 8:29 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

Re-ran the tests with cheat4 having properties (1)-(3) above and
included head in the list of tclsh versions. Results are consistent
with Donal's comment about the [eval] in the body slowing things
down. Don't compare this table with previous ones in this thread - I
called [time] directly to avoid the unnecessary uplevel.

[n = 10000, time in us]
            forL   tclx   naive  clever cheat1 cheat2 cheat3 cheat4
  8.4.19    4259  17779  101808   31885  32571  21429  33023  21426
   8.5.3    3964  23793  105397   27748  27565  24436  28608  24230
   8.5.4    4026  24457   17763   36668  16457  13939  16296  13909
   8.6a2    4520  34394   27073   49271  17494  15116  17560  15260
20080908    4305  33696   30913   53230  18124  15668  18160  15536

[Compare down only]
            forL   tclx   naive  clever cheat1 cheat2 cheat3 cheat4
  8.4.19    1.00   1.00    1.00    1.00   1.00   1.00   1.00   1.00
   8.5.3    0.93   1.34    1.04    0.87   0.85   1.14   0.87   1.13
   8.5.4    0.95   1.38    0.17    1.15   0.51   0.65   0.49   0.65
   8.6a2    1.06   1.93    0.27    1.55   0.54   0.71   0.53   0.71
20080908    1.01   1.90    0.30    1.67   0.56   0.73   0.55   0.73

[Compare across only]
            forL   tclx   naive  clever cheat1 cheat2 cheat3 cheat4
  8.4.19    1.00   4.17   23.90    7.49   7.65   5.03   7.75   5.03
   8.5.3    1.00   6.00   26.59    7.00   6.95   6.16   7.22   6.11
   8.5.4    1.00   6.07    4.41    9.11   4.09   3.46   4.05   3.45
   8.6a2    1.00   7.61    5.99   10.90   3.87   3.34   3.88   3.38
20080908    1.00   7.83    7.18   12.36   4.21   3.64   4.22   3.61

The tests are rather blunt - I should look at the ones in tcllib that
Donal mentioned.

Regards

Alexandre Ferrieux

Sep 7, 2008, 4:10:04 PM
On Sep 7, 7:27 pm, Evil Son <ewilsonm...@gmail.com> wrote:
>
> [Compare down only]
>            forL   tclx  naive clever cheat1 cheat2 cheat3 cheat4
>   8.4.19   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
>    8.5.3   0.93   1.34   1.04   0.87   0.85   1.14   0.87   1.13
>    8.5.4   0.95   1.38   0.17   1.15   0.51   0.65   0.49   0.65
>    8.6a2   1.06   1.93   0.27   1.55   0.54   0.71   0.53   0.71

The last two rows for cheat4 worry me: as if Miguel's patch didn't help ?

> [Compare across only]
>            forL   tclx  naive clever cheat1 cheat2 cheat3 cheat4
>   8.4.19   1.00   4.17  23.90   7.49   7.65   5.03   7.75   5.03
>    8.5.3   1.00   6.00  26.59   7.00   6.95   6.16   7.22   6.11
>    8.5.4   1.00   6.07   4.41   9.11   4.09   3.46   4.05   3.45
>    8.6a2   1.00   7.61   5.99  10.90   3.87   3.34   3.88   3.38
> 20080908   1.00   7.83   7.18  12.36   4.21   3.64   4.22   3.61

The 3+ speed ratio between the best (cheat2/4) and forL also worries
me.
Considering what we've just learnt about effects 1-3, they should be
nearly indistinguishable (basically the same bytecodes, no var lookup
by name; anything else ?)...

Can you retry with larger N (a total of 15000 us is too small to
eliminate various overheads)?
100 times larger should be OK (1.5s).

-Alex

miguel

Sep 8, 2008, 8:00:45 AM
Alexandre Ferrieux wrote:
> Can you retry with larger N (a total of 15000 us is too small to
> eliminate various overheads)?
> 100 times larger should be OK (1.5s).

Alex found a bug that was causing this, now fixed in HEAD.

Evil Son: could you run your tests again please?

Evil Son

Sep 8, 2008, 10:11:51 AM

Just got back from work.
Updating now.

Evil Son

Sep 8, 2008, 11:13:30 AM
On Sep 8, 10:00 pm, miguel <mso...@users.sf.net> wrote:

n=1,000,000 was taking too long so I'll run these later. Yes there is
a definite improvement - we are within ~10% of naked for. But to get
this result, I introduced cheat5 after studying Donal's king-cheat
more closely. I don't know if (1)-(3) captures the king-cheat
sufficiently, but perhaps I misunderstand them.

Anyway, I implemented cheat4 (the one that I thought captured (1)-(3))
in my test as follows (removing the bookkeeping of scores):

if {$proc eq "cheat4"} {
    time {set var {}; cheat2 var 0 $n the-body} $nIter
}

So I simply redirected to cheat2.

#But should it have been this:

proc cheat5 {var from to body} {
    set var {}
    cheat2 var $from $to $body
}

And *this* performs within 10% of the naked for.

What of Alex's and Miguel's patches? They improve cheat4 by about 2%
and improve cheat5 by about 22% !

Here are the results:
-----------------------------------------------------------------------
[n = 10000, time in us] <---- Just to get the results out quickly
            forL   tclx   naive  clever cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    4225  18008  101106   32436  32165  21078  32508  21698  13388
   8.5.3    3972  23169  105209   27897  28256  24472  28258  24856  16745
   8.5.4    4107  24542   17427   36370  15730  13848  16268  13700   7260
   8.6a2    4638  34986   26260   49080  17552  15182  17547  15104   6133
20080908    4337  33868   29356   50387  17266  15044  17276  15133   6114
20080909    4340  35242   29650   52326  17350  15100  17414  14979   4765

[Compare down only]
            forL   tclx   naive  clever cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    1.00   1.00    1.00    1.00   1.00   1.00   1.00   1.00   1.00
   8.5.3    0.94   1.29    1.04    0.86   0.88   1.16   0.87   1.15   1.25
   8.5.4    0.97   1.36    0.17    1.12   0.49   0.66   0.50   0.63   0.54
   8.6a2    1.10   1.94    0.26    1.51   0.55   0.72   0.54   0.70   0.46
20080908    1.03   1.88    0.29    1.55   0.54   0.71   0.53   0.70   0.46
20080909    1.03   1.96    0.29    1.61   0.54   0.72   0.54   0.69   0.36

[Compare across only]
            forL   tclx   naive  clever cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    1.00   4.26   23.93    7.68   7.61   4.99   7.69   5.14   3.17
   8.5.3    1.00   5.83   26.49    7.02   7.11   6.16   7.11   6.26   4.22
   8.5.4    1.00   5.98    4.24    8.86   3.83   3.37   3.96   3.34   1.77
   8.6a2    1.00   7.54    5.66   10.58   3.78   3.27   3.78   3.26   1.32
20080908    1.00   7.81    6.77   11.62   3.98   3.47   3.98   3.49   1.41
20080909    1.00   8.12    6.83   12.06   4.00   3.48   4.01   3.45   1.10

-----------------------------------------------------------------------

I will be modifying my tests and running them again so the next batch
will have scores that will not be comparable to these ones.

Question: does (1)-(3) capture the optimisation?

A BIG WELL DONE!

Cheers
Evil Son

Evil Son

Sep 8, 2008, 11:18:15 AM

Sorry about that - I got tricked by gmail. Here is a more readable
table with fewer columns.

[n = 10000, time in us]
            forL   naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    4308  101559  32763  21247  32246  21074  13423
   8.5.3    3959  104413  28336  24052  27908  24460  16940
   8.5.4    4028   18342  16813  14459  17303  14682   7303
   8.6a2    4491   26665  17499  15138  17507  15130   6161
20080908    4307   29055  17083  14779  17414  14989   6103
20080909    4317   29059  17317  15089  17233  15165   4779

[Compare down only]
            forL   naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    1.00    1.00   1.00   1.00   1.00   1.00   1.00
   8.5.3    0.92    1.03   0.86   1.13   0.87   1.16   1.26
   8.5.4    0.94    0.18   0.51   0.68   0.54   0.70   0.54
   8.6a2    1.04    0.26   0.53   0.71   0.54   0.72   0.46
20080908    1.00    0.29   0.52   0.70   0.54   0.71   0.45
20080909    1.00    0.29   0.53   0.71   0.53   0.72   0.36

[Compare across only]
            forL   naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    1.00   23.57   7.61   4.93   7.49   4.89   3.12
   8.5.3    1.00   26.37   7.16   6.08   7.05   6.18   4.28
   8.5.4    1.00    4.55   4.17   3.59   4.30   3.64   1.81
   8.6a2    1.00    5.94   3.90   3.37   3.90   3.37   1.37
20080908    1.00    6.75   3.97   3.43   4.04   3.48   1.42
20080909    1.00    6.73   4.01   3.50   3.99   3.51   1.11

Alexandre Ferrieux

Sep 8, 2008, 3:12:14 PM

Thanks for all these timings !
Indeed they are more or less consistent with what I observed after
Miguel did the decisive fix.
(I say "more or less" because I get 1.0 not 1.11 -- again I don't
understand why you're doing these tests on so small a scale: 4.7 ms is
ridiculously small and exposes one-shot overheads)

For those interested I'll post a summary on the wiki; but basically
there was one last unoptimized variable access in the bytecode
generated for the [for], namely in the test ($x<$n) part. As Miguel
reminds me, before claiming victory we still have that nasty
constraint of pre-declaring the var in the caller... Stay tuned !

-Alex

Evil Son

Sep 8, 2008, 9:55:25 PM
On Sep 9, 5:12 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
[snip]

> Thanks for all these timings !
> Indeed they are more or less consistent with what I observed after
> Miguel did the decisive fix.
> (I say "more or less" because I get 1.0 not 1.11 -- again I don't
> understand why you're doing these tests on so small a scale: 4.7 ms is
> ridiculously small and exposes one-shot overheads)
[snip]

Agreed. I only ran the small test again to get the results to you
ASAP.

Now my longer test has completed: 6 hours later :-)
It was run in single-user mode.
---------------------------------------------------------
Each proc loops 0 --> 5000
[time] $proc 10000
[time] called 5 times, sorted and index 2 taken
Time taken by each proc in milliseconds (ms)

            forL  naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19     2.1   50.3   16.5   10.9   16.6   10.9    6.7
   8.5.3     2.0   52.7   14.4   12.3   14.4   12.3    8.4
   8.5.4     2.0    8.6    8.1    7.0    8.1    7.1    3.7
   8.6a2     2.3   13.2    9.2    7.9    9.2    7.9    3.1
20080908     2.1   15.5    9.4    8.1    9.4    8.2    3.1
20080909     2.1   14.6    9.1    7.8    9.1    7.8    2.4

[Compare down only]
            forL  naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19     1.0    1.0    1.0    1.0    1.0    1.0    1.0
   8.5.3     0.9    1.0    0.9    1.1    0.9    1.1    1.3
   8.5.4     0.9    0.2    0.5    0.6    0.5    0.7    0.5
   8.6a2     1.1    0.3    0.6    0.7    0.6    0.7    0.5
20080908     1.0    0.3    0.6    0.7    0.6    0.8    0.5
20080909     1.0    0.3    0.5    0.7    0.5    0.7    0.4

[Compare across only]
            forL  naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19     1.0   23.7    7.8    5.1    7.8    5.1    3.2
   8.5.3     1.0   26.8    7.3    6.3    7.3    6.2    4.3
   8.5.4     1.0    4.3    4.0    3.5    4.0    3.5    1.8
   8.6a2     1.0    5.8    4.0    3.4    4.0    3.5    1.4
20080908     1.0    7.2    4.4    3.7    4.4    3.8    1.4
20080909     1.0    6.8    4.2    3.6    4.2    3.6    1.1
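
For reference, the "[time] called 5 times, sorted and index 2 taken"
part of the harness boils down to something like this (a sketch, names
assumed):

    proc median_time {script nIter {nRuns 5}} {
        set runs {}
        for {set r 0} {$r < $nRuns} {incr r} {
            lappend runs [lindex [time $script $nIter] 0]
        }
        lindex [lsort -real $runs] [expr {$nRuns / 2}]
    }

    # e.g. median of 5 runs, 10000 iterations each:
    # median_time {naked_for var 0 5000 {}} 10000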

Regards
Evil Son

Evil Son

Sep 9, 2008, 9:14:11 AM

Ha Ha - still ridiculously small!

Evil Son

Sep 9, 2008, 11:37:43 AM
On Sep 9, 5:12 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

Sorry about the useless results sent earlier. Here are my timings at
a more reasonable scale. The times are in *seconds* and tests were
run in single-user mode. All Tcl builds are multithreaded.

---------------------------------------------------------
Each proc loops 0 --> 10000000
[time] $proc 3
[time] called 1 times, sorted and index 0 taken
Time taken by each proc in secs

            forL   naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    4.28  102.25  33.10  21.90  33.20  21.83  13.40
   8.5.3    4.03  104.69  27.40  24.32  27.91  24.27  17.05
   8.5.4    4.06   17.57  16.66  13.90  16.20  14.24   7.22
   8.6a2    4.66   26.18  18.48  15.89  18.44  15.86   6.12
20080908    4.32   29.70  18.32  16.01  18.27  15.86   6.23
20080909    4.29   30.37  18.67  15.88  18.67  15.92   4.73

[Compare down only]
            forL   naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    1.00    1.00   1.00   1.00   1.00   1.00   1.00
   8.5.3    0.94    1.02   0.83   1.11   0.84   1.11   1.27
   8.5.4    0.95    0.17   0.50   0.63   0.49   0.65   0.54
   8.6a2    1.09    0.26   0.56   0.73   0.56   0.73   0.46
20080908    1.01    0.29   0.55   0.73   0.55   0.73   0.47
20080909    1.00    0.30   0.56   0.73   0.56   0.73   0.35

[Compare across only]
            forL   naive cheat1 cheat2 cheat3 cheat4 cheat5
  8.4.19    1.00   23.90   7.74   5.12   7.76   5.10   3.13
   8.5.3    1.00   25.97   6.80   6.03   6.92   6.02   4.23
   8.5.4    1.00    4.33   4.11   3.43   3.99   3.51   1.78
   8.6a2    1.00    5.62   3.97   3.41   3.96   3.40   1.31
20080908    1.00    6.87   4.24   3.70   4.22   3.67   1.44
20080909    1.00    7.08   4.35   3.70   4.35   3.71   1.10
---------------------------------------------------------

Evil Son
- with t-shirt "my cpu survived 6 hours of pointless testing"

Alexandre Ferrieux

Sep 9, 2008, 5:54:30 PM
On Sep 9, 5:37 pm, Evil Son <ewilsonm...@gmail.com> wrote:
>
> [Compare across only]
>            forL  naive cheat1 cheat2 cheat3 cheat4 cheat5
>   8.4.19   1.00  23.90   7.74   5.12   7.76   5.10   3.13
>    8.5.3   1.00  25.97   6.80   6.03   6.92   6.02   4.23
>    8.5.4   1.00   4.33   4.11   3.43   3.99   3.51   1.78
>    8.6a2   1.00   5.62   3.97   3.41   3.96   3.40   1.31
> 20080908   1.00   6.87   4.24   3.70   4.22   3.67   1.44
> 20080909   1.00   7.08   4.35   3.70   4.35   3.71   1.10

Hmm. Can you post the exact code in the "cheat2" called by cheat5 ?
(Unless I'm mistaken you mentioned it being a fixed version of
Miguel's, but different from Donal's one which is cheat3).

The reason I'm nitpicking is (1) I didn't observe the 10% gap in my
simpler tests, and (2) I would like to correlate precisely the wins
and losses of cheat5 across versions to what's generated as a
bytecode.
I did it once to find the little bug between 20080908 and 20080909,
I'd like to repeat the exercise on earlier versions. The technique is
straightforward:

- instrument the code calling [uplevel] to store its script
argument in a global:

set ::glo "for ..."
uplevel 1 $::glo

- after the run, do a ::tcl::unsupported::disassemble script $::glo

- diff the bytecode from version to version.

-Alex

Evil Son

Sep 10, 2008, 11:46:13 AM
On Sep 10, 7:54 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
[snip]

> Hmm. Can you post the exact code in the "cheat2" called by cheat5 ?
> (Unless I'm mistaken you mentioned it being a fixed version of
> Miguel's, but different from Donal's one which is cheat3).
>
> The reason I'm nitpicking is (1) I didn't observe the 10% gap in my
> simpler tests, and (2) I would like to correlate precisely the wins
> and losses of cheat5 across versions to what's generated as a
> bytecode.
> I did it once to find the little bug between 20080908 and 20080909,
> I'd like to repeat the exercise on earlier versions. The technique is
> straightforward:
>
>    - instrument the code calling [uplevel] to store its script
> argument in a global:
>
>            set ::glo "for ..."
>            uplevel 1 $::glo
>
>    - after the run, do a ::tcl::unsupported::disassemble script $::glo
>
>    - diff the bytecode from version to version.
>
> -Alex

Please feel free to nit pick - hopefully there is something wrong with
my test and the team (that don't seem to sleep) have indeed closed the
gap. Here is the code:

proc cheat2 {var from to body} {
    set nbody "set $var $from
while 1 {
$body\n if {\[incr var\] >= $to} break }"
    uplevel 1 $nbody
}

proc cheat3 {var from to body} {
    set var [list $var]
    set nbody "for {set $var [list $from]} {\$$var < [list $to]} \
                   {incr $var} [list $body]"
    uplevel 1 $nbody
}

# king cheat
proc cheat5 {var from to body} {
    set var {}; cheat2 var $from $to $body
}

Regards

Evil Son

Sep 10, 2008, 12:14:33 PM
On Sep 10, 7:54 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
[...]

> I did it once to find the little bug between 20080908 and 20080909,
> I'd like to repeat the exercise on earlier versions. The technique is
> straightforward:
>
>    - instrument the code calling [uplevel] to store its script
> argument in a global:
>
>            set ::glo "for ..."
>            uplevel 1 $::glo
>
>    - after the run, do a ::tcl::unsupported::disassemble script $::glo
>
>    - diff the bytecode from version to version.
>
> -Alex

impressive ... and educational.
Cheers
Evil Son

Alexandre Ferrieux

Sep 10, 2008, 12:40:57 PM

Say, I'm just realizing I don't have all those older versions handy.
Would you mind doing the disassembling and posting back ? I volunteer
for the diff and analysis ;-)

-Alex

Evil Son

Sep 10, 2008, 12:58:34 PM
On Sep 11, 2:40 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>

OK. Will have a crack in the next 48 hours if it's alright.

Evil Son

Sep 13, 2008, 1:10:53 AM
On Sep 11, 2:40 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> [...]

> Say, I'm just realizing I don't have all those older versions handy.
> Would you mind doing the disassembling and posting back ? I volunteer
> for the diff and analysis ;-)
>
> -Alex

Ran both cheat2 and cheat5. Output's going to wrap I'm afraid.

=========================================
tclsh08sep08
cheat2
ByteCode 0x0x8169910, refCt 1, epoch 4, interp 0x0x8151d68 (epoch 4)
Source "\n set nbody \"set $var $from\n while 1 {\n $body\"
Cmds 2, src 110, inst 32, litObjs 7, aux 0, stkDepth 9, code/src
0.00
Proc 0x0x816acd0, refCt 1, args 4, compiled locals 5
slot 0, scalar, arg, "var"
slot 1, scalar, arg, "from"
slot 2, scalar, arg, "to"
slot 3, scalar, arg, "body"
slot 4, scalar, "nbody"
Commands 2:
1: pc 0-22, src 3-89 2: pc 23-30, src 93-108
Command 1: "set nbody \"set $var $from\n while 1 {\n $body\n
i"
(0) push1 0 # "set "
(2) loadScalar1 %v0 # var "var"
(4) push1 1 # " "
(6) loadScalar1 %v1 # var "from"
(8) push1 2 # "\n while 1 {\n "
(10) loadScalar1 %v3 # var "body"
(12) push1 3 # "\n if {[incr var] >= "
(14) loadScalar1 %v2 # var "to"
(16) push1 4 # "} break }"
(18) concat1 9
(20) storeScalar1 %v4 # var "nbody"
(22) pop
Command 2: "uplevel 1 $nbody"
(23) push1 5 # "uplevel"
(25) push1 6 # "1"
(27) loadScalar1 %v4 # var "nbody"
(29) invokeStk1 3
(31) done

-----------------------------------------
tclsh08sep08
cheat5
ByteCode 0x0x8169710, refCt 1, epoch 4, interp 0x0x8151d68 (epoch 4)
Source "\n set var {}; cheat2 var $from $to $body \n"
Cmds 2, src 44, inst 18, litObjs 3, aux 0, stkDepth 5, code/src 0.00
Proc 0x0x817b168, refCt 1, args 4, compiled locals 4
slot 0, scalar, arg, "var"
slot 1, scalar, arg, "from"
slot 2, scalar, arg, "to"
slot 3, scalar, arg, "body"
Commands 2:
1: pc 0-4, src 3-12 2: pc 5-16, src 15-42
Command 1: "set var {}"
(0) push1 0 # ""
(2) storeScalar1 %v0 # var "var"
(4) pop
Command 2: "cheat2 var $from $to $body "
(5) push1 1 # "cheat2"
(7) push1 2 # "var"
(9) loadScalar1 %v1 # var "from"
(11) loadScalar1 %v2 # var "to"
(13) loadScalar1 %v3 # var "body"
(15) invokeStk1 5
(17) done

=========================================
tclsh09sep08
cheat2
ByteCode 0x0x8169510, refCt 1, epoch 4, interp 0x0x8151d68 (epoch 4)
Source "\n set nbody \"set $var $from\n while 1 {\n $body\"
Cmds 2, src 110, inst 32, litObjs 7, aux 0, stkDepth 9, code/src
0.00
Proc 0x0x8174060, refCt 1, args 4, compiled locals 5
slot 0, scalar, arg, "var"
slot 1, scalar, arg, "from"
slot 2, scalar, arg, "to"
slot 3, scalar, arg, "body"
slot 4, scalar, "nbody"
Commands 2:
1: pc 0-22, src 3-89 2: pc 23-30, src 93-108
Command 1: "set nbody \"set $var $from\n while 1 {\n $body\n
i"
(0) push1 0 # "set "
(2) loadScalar1 %v0 # var "var"
(4) push1 1 # " "
(6) loadScalar1 %v1 # var "from"
(8) push1 2 # "\n while 1 {\n "
(10) loadScalar1 %v3 # var "body"
(12) push1 3 # "\n if {[incr var] >= "
(14) loadScalar1 %v2 # var "to"
(16) push1 4 # "} break }"
(18) concat1 9
(20) storeScalar1 %v4 # var "nbody"
(22) pop
Command 2: "uplevel 1 $nbody"
(23) push1 5 # "uplevel"
(25) push1 6 # "1"
(27) loadScalar1 %v4 # var "nbody"
(29) invokeStk1 3
(31) done

-----------------------------------------
tclsh09sep08
cheat5
ByteCode 0x0x8169310, refCt 1, epoch 4, interp 0x0x8151d68 (epoch 4)
Source "\n set var {}; cheat2 var $from $to $body \n"
Cmds 2, src 44, inst 18, litObjs 3, aux 0, stkDepth 5, code/src 0.00
Proc 0x0x8174260, refCt 1, args 4, compiled locals 4
slot 0, scalar, arg, "var"
slot 1, scalar, arg, "from"
slot 2, scalar, arg, "to"
slot 3, scalar, arg, "body"
Commands 2:
1: pc 0-4, src 3-12 2: pc 5-16, src 15-42
Command 1: "set var {}"
(0) push1 0 # ""
(2) storeScalar1 %v0 # var "var"
(4) pop
Command 2: "cheat2 var $from $to $body "
(5) push1 1 # "cheat2"
(7) push1 2 # "var"
(9) loadScalar1 %v1 # var "from"
(11) loadScalar1 %v2 # var "to"
(13) loadScalar1 %v3 # var "body"
(15) invokeStk1 5
(17) done

Over to you Maestro.

Evil Son

Sep 13, 2008, 2:19:41 AM
Assemblies are identical?


Alexandre Ferrieux

Sep 13, 2008, 6:10:49 AM

Sorry I've been unclear. The one we're interested in is the bytecode
generated for $nbody in cheat2, not the bytecode of any of the procs.
To do this, just add

set ::glo $nbody

before or after the [uplevel 1 $nbody] in cheat2.
Then, run the test, and afterwards do a

::tcl::unsupported::disassemble script $::glo

(The order is of course necessary because ::glo is not set unset the
test is run. But beyond that, a much less obvious gotcha is that
calling disassemble on an equivalent $nbody *before* uplevel has had a
chance to dig its teeth in it, will result in a different bytecode,
because [disassemble script] will itself request a compilation, but in
an environment different from uplevel's (especially the compiled
locals of the caller)).

Apologies again for explaining this improperly, and thanks for your
patience for repeating this all over your installed versions ;-)

-Alex

Alexandre Ferrieux

Sep 13, 2008, 6:15:56 AM
On Sep 13, 12:10 pm, I wrote:
>
> ... because ::glo is not set unset the test is run.

I meant "until" instead of "unset".

In addition to my usual rate of grammatical errors in English, I'm now
inserting Tclisms. Oh my.

-Alex

Evil Son

Sep 13, 2008, 11:29:33 AM
On Sep 13, 8:10 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
> [...]

> Apologies again for explaining this improperly, and thanks for your
> patience for repeating this all over your installed versions ;-)
>
> -Alex

You had it right the first time, I was just being dense.

cheat2 by itself doesn't show any difference ... recall that I had to
introduce cheat5 that got me to the right place. Here are cheat2 and
cheat5.

proc cheat2 {var from to body} {
    set nbody "set $var $from
while 1 {
$body\n if {\[incr var\] >= $to} break }"
    uplevel 1 $nbody
    set ::glo $nbody
}

# king cheat
proc cheat5 {var from to body} {
    set var {}; cheat2 var $from $to $body
}

In this test I called cheat5 which then calls cheat2 after setting
"var" to get the effect you were after. And thankfully, the diffs are
non-empty!
-----------------------------------------------------------------
n = 1 nCalls = 1 nRuns = 1
tclsh 8.5.3
0.0001 cheat5
ByteCode 0x0x808b080, refCt 1, epoch 4, interp 0x0x8054da0 (epoch 4)
Source "set var 0\n while 1 {\n expr {$var*$var}\n if {[in"
Cmds 6, src 76, inst 70, litObjs 4, aux 0, stkDepth 2, code/src 0.00
Exception ranges 1, depth 1:
0: level 0, loop, pc 15-63, continue 15, break 67
Commands 6:
1: pc 0-5, src 0-8 2: pc 6-68, src 14-75
3: pc 15-31, src 30-45 4: pc 32-63, src 48-74
5: pc 41-44, src 53-60 6: pc 50-59, src 69-73
Command 1: "set var 0"
(0) push1 0 # "var"
(2) push1 1 # "0"
(4) storeScalarStk
(5) pop
Command 2: "while 1 {\n expr {$var*$var}\n if {[incr var] >= 1}
"
(6) startCommand +63 1 # next cmd at pc 69
Command 3: "expr {$var*$var}"
(15) startCommand +16 1 # next cmd at pc 31
(24) push1 0 # "var"
(26) loadScalarStk
(27) push1 0 # "var"
(29) loadScalarStk
(30) mult
(31) pop
Command 4: "if {[incr var] >= 1} break "
(32) startCommand +32 2 # next cmd at pc 64, 2 cmds start here
Command 5: "incr var"
(41) push1 0 # "var"
(43) incrScalarStkImm +1
(45) push1 2 # "1"
(47) ge
(48) jumpFalse1 +14 # pc 62
Command 6: "break"
(50) startCommand +10 1 # next cmd at pc 60
(59) break
(60) jump1 +4 # pc 64
(62) push1 3 # ""
(64) pop
(65) jump1 -50 # pc 15
(67) push1 3 # ""
(69) done

tclsh 8.5.4
0.0001 cheat5
ByteCode 0x0x8066a58, refCt 1, epoch 4, interp 0x0x8054da0 (epoch 4)
Source "set var 0\n while 1 {\n expr {$var*$var}\n if {[in"
Cmds 6, src 76, inst 70, litObjs 4, aux 0, stkDepth 2, code/src 0.00
Exception ranges 1, depth 1:
0: level 0, loop, pc 15-63, continue 15, break 67
Commands 6:
1: pc 0-5, src 0-8 2: pc 6-68, src 14-75
3: pc 15-31, src 30-45 4: pc 32-63, src 48-74
5: pc 41-44, src 53-60 6: pc 50-59, src 69-73
Command 1: "set var 0"
(0) push1 0 # "var"
(2) push1 1 # "0"
(4) storeScalarStk
(5) pop
Command 2: "while 1 {\n expr {$var*$var}\n if {[incr var] >= 1}
"
(6) startCommand +63 1 # next cmd at pc 69
Command 3: "expr {$var*$var}"
(15) startCommand +16 1 # next cmd at pc 31
(24) push1 0 # "var"
(26) loadScalarStk
(27) push1 0 # "var"
(29) loadScalarStk
(30) mult
(31) pop
Command 4: "if {[incr var] >= 1} break "
(32) startCommand +32 2 # next cmd at pc 64, 2 cmds start here
Command 5: "incr var"
(41) push1 0 # "var"
(43) incrScalarStkImm +1
(45) push1 2 # "1"
(47) ge
(48) jumpFalse1 +14 # pc 62
Command 6: "break"
(50) startCommand +10 1 # next cmd at pc 60
(59) break
(60) jump1 +4 # pc 64
(62) push1 3 # ""
(64) pop
(65) jump1 -50 # pc 15
(67) push1 3 # ""
(69) done

tclsh 8.6a2
0.0001 cheat5
ByteCode 0x0x80bfef8, refCt 1, epoch 5, interp 0x0x8054d68 (epoch 5)
Source "set var 0\n while 1 {\n expr {$var*$var}\n if {[in"
Cmds 6, src 76, inst 68, litObjs 4, aux 0, stkDepth 2, code/src 0.00
Exception ranges 1, depth 1:
0: level 0, loop, pc 14-61, continue 14, break 65
Commands 6:
1: pc 0-4, src 0-8 2: pc 5-66, src 14-75
3: pc 14-30, src 30-45 4: pc 31-61, src 48-74
5: pc 40-42, src 53-60 6: pc 48-57, src 69-73
Command 1: "set var 0"
(0) push1 0 # "0"
(2) storeScalar1 %v0
(4) pop
Command 2: "while 1 {\n expr {$var*$var}\n if {[incr var] >= 1}
"
(5) startCommand +62 1 # next cmd at pc 67
Command 3: "expr {$var*$var}"
(14) startCommand +16 1 # next cmd at pc 30
(23) push1 1 # "var"
(25) loadScalarStk
(26) push1 1 # "var"
(28) loadScalarStk
(29) mult
(30) pop
Command 4: "if {[incr var] >= 1} break "
(31) startCommand +31 2 # next cmd at pc 62, 2 cmds start here
Command 5: "incr var"
(40) incrScalar1Imm %v0 +1
(43) push1 2 # "1"
(45) ge
(46) jumpFalse1 +14 # pc 60
Command 6: "break"
(48) startCommand +10 1 # next cmd at pc 58
(57) break
(58) jump1 +4 # pc 62
(60) push1 3 # ""
(62) pop
(63) jump1 -49 # pc 14
(65) push1 3 # ""
(67) done

tclsh 20080908
0.0001 cheat5
ByteCode 0x0x81da738, refCt 1, epoch 5, interp 0x0x8151d68 (epoch 5)
Source "set var 0\n while 1 {\n expr {$var*$var}\n if {[in"
Cmds 6, src 76, inst 68, litObjs 4, aux 0, stkDepth 2, code/src 0.00
Exception ranges 1, depth 1:
0: level 0, loop, pc 14-61, continue 14, break 65
Commands 6:
1: pc 0-4, src 0-8 2: pc 5-66, src 14-75
3: pc 14-30, src 30-45 4: pc 31-61, src 48-74
5: pc 40-42, src 53-60 6: pc 48-57, src 69-73
Command 1: "set var 0"
(0) push1 0 # "0"
(2) storeScalar1 %v0
(4) pop
Command 2: "while 1 {\n expr {$var*$var}\n if {[incr var] >= 1}
"
(5) startCommand +62 1 # next cmd at pc 67
Command 3: "expr {$var*$var}"
(14) startCommand +16 1 # next cmd at pc 30
(23) push1 1 # "var"
(25) loadScalarStk
(26) push1 1 # "var"
(28) loadScalarStk
(29) mult
(30) pop
Command 4: "if {[incr var] >= 1} break "
(31) startCommand +31 2 # next cmd at pc 62, 2 cmds start here
Command 5: "incr var"
(40) incrScalar1Imm %v0 +1
(43) push1 2 # "1"
(45) ge
(46) jumpFalse1 +14 # pc 60
Command 6: "break"
(48) startCommand +10 1 # next cmd at pc 58
(57) break
(58) jump1 +4 # pc 62
(60) push1 3 # ""
(62) pop
(63) jump1 -49 # pc 14
(65) push1 3 # ""
(67) done

tclsh 20080909
0.0001 cheat5
ByteCode 0x0x81cc520, refCt 1, epoch 5, interp 0x0x8151d68 (epoch 5)
Source "set var 0\n while 1 {\n expr {$var*$var}\n if {[in"
Cmds 6, src 76, inst 66, litObjs 3, aux 0, stkDepth 2, code/src 0.00
Exception ranges 1, depth 1:
0: level 0, loop, pc 14-59, continue 14, break 63
Commands 6:
1: pc 0-4, src 0-8 2: pc 5-64, src 14-75
3: pc 14-28, src 30-45 4: pc 29-59, src 48-74
5: pc 38-40, src 53-60 6: pc 46-55, src 69-73
Command 1: "set var 0"
(0) push1 0 # "0"
(2) storeScalar1 %v0
(4) pop
Command 2: "while 1 {\n expr {$var*$var}\n if {[incr var] >= 1}
"
(5) startCommand +60 1 # next cmd at pc 65
Command 3: "expr {$var*$var}"
(14) startCommand +14 1 # next cmd at pc 28
(23) loadScalar1 %v0
(25) loadScalar1 %v0
(27) mult
(28) pop
Command 4: "if {[incr var] >= 1} break "
(29) startCommand +31 2 # next cmd at pc 60, 2 cmds start here
Command 5: "incr var"
(38) incrScalar1Imm %v0 +1
(41) push1 1 # "1"
(43) ge
(44) jumpFalse1 +14 # pc 58
Command 6: "break"
(46) startCommand +10 1 # next cmd at pc 56
(55) break
(56) jump1 +4 # pc 60
(58) push1 2 # ""
(60) pop
(61) jump1 -47 # pc 14
(63) push1 2 # ""
(65) done

-----------------------------------------------------------------

If you will ... maestro

Alexandre Ferrieux

Sep 14, 2008, 5:07:38 AM

OK, excellent !
Looking at these bytecodes, here is what we can observe:

(1) No difference between 8.5.3 and 8.5.4.
(2) 8.5.4 to 8.6a2: switch from name access to indexed access for the
[incr]: incrScalarStkImm is now replaced by an incrScalar1Imm. But the
[expr] is untouched (loadScalarStk).
(3) No difference between 8.6a2 and 20080908
(4) 20080908 to 20080909: this time the [expr] is cured too
(loadScalar1)

So basically we're witnessing the two known interventions by Miguel in
the [uplevel]-generated bytecode.
Now let's correlate with your "compare down only" column for cheat5:

(1)
8.5.3 1.26
8.5.4 0.54

Here I suspect the huge effect is because in 8.5.3 [uplevel] did not
compile anything. So the bytecodes are identical, but in the first
case they are created only by our disassembling, not executed...
Miguel do you confirm ? (the changes file lacks release lines between
8.5.2 and 8.6)

(2)
8.6a2 0.46

Here we get the effect of the first tweak.

(3)
20080908 0.45

No change in time nor bytecode.

(4)
20080909 0.36

Second tweak.

Now a comment on your last "compare across" result: we end up with a
ratio of 1.1 between cheat5 and the naked [for]. As I told you, this
worried me. But re-reading the source code you posted, I slapped my
forehead and noticed, a bit late, that since Miguel proposed his
optimized variants we've been comparing apples and oranges: indeed,
$nbody is actually a [while] loop !

If we now go back to the regular [for] syntax:

proc naked {} {
    set x {}
    for {set x 0} {$x<10000000} {incr x} {}
}
proc cheat2 {} {
    set nbody {for {set x 0} {$x<10000000} {incr x} {}}
    set ::glo $nbody
    uplevel 1 $nbody
}
proc cheat5 {} {
    set x {}
    cheat2
}

% time naked
767876 microseconds per iteration
% time cheat5
774791 microseconds per iteration

So, as a grand finale, we can now sit back and enjoy the 1.0 between
naked and uplevelled [for] !

And the bytecode microscope confirms that it is no accident:

% disassemble script $::glo


(0) push1 0 # "0"
(2) storeScalar1 %v0
(4) pop

(5) jump1 +18 # pc 23
(7) push1 1 # ""
(9) pop
(10) startCommand +12 1 # next cmd at pc 22
(19) incrScalar1Imm %v0 +1
(22) pop
(23) loadScalar1 %v0
(25) push1 2 # "10000000"
(27) lt
(28) jumpTrue1 -21 # pc 7
(30) push1 3 # ""
(32) done

% disassemble proc naked
(0) push1 0 # ""
(2) storeScalar1 %v0 # var "x"
(4) pop
(5) startCommand +41 2 # next cmd at pc 46, 2 cmds start here
(14) push1 1 # "0"
(16) storeScalar1 %v0 # var "x"
(18) pop
(19) jump1 +18 # pc 37
(21) push1 2 # ""
(23) pop
(24) startCommand +12 1 # next cmd at pc 36
(33) incrScalar1Imm %v0 +1 # var "x"
(36) pop
(37) loadScalar1 %v0 # var "x"
(39) push1 3 # "10000000"
(41) lt
(42) jumpTrue1 -21 # pc 21
(44) push1 0 # ""
(46) done

As you can see, insts 14-46 of the naked proc are identical to the
uplevel-generated chunk.

-Alex

Evil Son

Sep 14, 2008, 11:51:53 AM
On Sep 14, 7:07 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> OK, excellent !
> Looking at these bytecodes, here is what we can observe:
> [...]

Bravo!

My timings with your modification confirm the result.

In summary, to get "byte-compiled" performance for a command that
takes a script,
we currently need 2 procs working together like cheat2 and cheat5
above.

We don't seem to be too far from getting a clean *performant* macro
facility? Clean in the sense that we can easily read, write and debug
macros. But I'd imagine taking that final step (where we do away with
the need for a wrapper: our cheat5) is decidedly non-trivial.

I'm impressed at any rate!

Kevin Kenny

Sep 14, 2008, 12:22:22 PM
Alexandre Ferrieux wrote:
[in discussing the performance of the following code]:

> proc naked {} {
> set x {}
> for {set x 0} {$x<10000000} {incr x} {}
> }
> proc cheat2 {} {
> set nbody {for {set x 0} {$x<10000000} {incr x} {}}
> set ::glo $nbody
> uplevel 1 $nbody
> }
> proc cheat5 {} {
> set x {}
> cheat2
> }

> % disassemble script $::glo


> (0) push1 0 # "0"
> (2) storeScalar1 %v0
> (4) pop
> (5) jump1 +18 # pc 23
> (7) push1 1 # ""
> (9) pop
> (10) startCommand +12 1 # next cmd at pc 22
> (19) incrScalar1Imm %v0 +1
> (22) pop
> (23) loadScalar1 %v0
> (25) push1 2 # "10000000"
> (27) lt
> (28) jumpTrue1 -21 # pc 7
> (30) push1 3 # ""
> (32) done

Now for another minor tweak: Empty scripts are surprisingly common,
and we don't detect when we're simply discarding the result of a
script that's free of side effects. The instructions [7..9] could
have been omitted. But that's an entirely separate issue, and
I'd actually almost prefer to avoid it until we're ready to tackle
still more aggressive optimisations.

--
73 de ke9tv/2, Kevin

Kevin Kenny

Sep 14, 2008, 12:33:30 PM
Alexandre Ferrieux wrote:
> So, as a grand finale, we can now sit back and enjoy the 1.0 between
> naked and uplevelled [for] !
>
> And the bytecode microscope confirms that it is no accident:

Thanks for persevering with the testing long enough that we now have
a recipe for making [uplevel]ed code as fast as inline code! I had
been seriously worried that the

$resultset foreach row { ... script ... }

syntax in TDBC would represent a horrible performance drain, because
the bytecode compiler wouldn't be able to compile the script properly.
Between Miguel's improvements and your testing, my worries are laid
to rest.

Onward to getting 308 into 8.6b1!

Thanks again.

--
tnx es 73 de ke9tv/2, Kevin

Alexandre Ferrieux

Sep 14, 2008, 4:25:20 PM
On Sep 14, 5:51 pm, Evil Son <ewilsonm...@gmail.com> wrote:
>
> In summary, to get "byte-compiled" performance for a command that
> takes a script,
> we currently need 2 procs working together like cheat2 and cheat5
> above.
>
> We don't seem to be too far from getting a clean *performant* macro
> facility?  Clean in the sense that we can easily read, write and debug
> macros.  But I'd imagine taking that final step (where we do away with
> the need for a wrapper: our cheat5) is decidedly non-trivial.

Indeed those improvements (brought to us by Miguel, in case somebody
forgets) bring a first step towards "truly first class" user-extended
control structures. By the way, I wouldn't call that "macros": instead
they are regular functions, putting Tcl's introspective powers to good
effect, and approaching the performance of macros ... without any
preprocessing.

Now, though I initially shared your optimism, in private Miguel
brought me back to reality :-}
Indeed, there are two nasty snags:

- The first one, already mentioned, is the necessity of pre-declaring
all speed-critical vars in the caller. This is due to the caller's
compiledLocals not being extensible. Various half-solutions can be
imagined, like over-allocating 5 to 10 "unused slots" in that array so
that you can use "uplevelled-tricks" with up to that many internal
vars. Even full solutions, like making compiledLocals reallocatable.
But a strong motivation is needed for a rather nontrivial redesign.

- The second one is much more worrisome. I even initially failed to
understand it when Miguel first mentioned it :-(, and it's like cold
water on the neck now:

set x {}
myloop x 0 1000 {
    set y {}
    myloop y 0 1000 {
        body
    }
}

The problem is that while we now know how to write "myloop" so that
its own execution gets asymptotically fast, it still reconstructs from
scratch the code to be uplevelled. This means that in the code above,
each turn of the outer loop calls a new "instance" of the inner one,
which builds a fresh string (even though always the same) and hence
has to call the bytecode compiler on it. So we'll compile 1000 times
the y loop !

Of course a half-solution would be to use a hash table (an array or
dict) so that the built strings are "interned" and keep object
identity whenever possible. But this brings a whole lot of
subproblems, like entry lifecycle (think about worst cases like
"myloop y 0 $x" in the code above)...

<sigh>
While I'm glad we've achieved identical speed on our toy example, I'm
afraid turning this into a general tool is another story entirely...
Maybe there's room for preprocessing (true macros) after all.

-Alex

Alexandre Ferrieux

Sep 14, 2008, 4:32:24 PM
On Sep 14, 6:33 pm, Kevin Kenny <kenn...@acm.org> wrote:
>
> [...] the bytecode compiler wouldn't be able to compile the script properly.

> Between Miguel's improvements and your testing, my worries are laid
> to rest.

Hold on :-}
See my above answer to Evil Son: the killer is the repeated
compilation in a nested iteration.
I'm not sure about the TDBC case, because there the compilation
overhead might be dominated by heavier stuff like database access, but
in principle it seems hard to avoid it without a true macro system.

You cannot imagine how sorry I am to write this.

-Alex

Alexandre Ferrieux

Sep 14, 2008, 4:35:56 PM
On Sep 14, 6:22 pm, Kevin Kenny <kenn...@acm.org> wrote:
>
> Now for another minor tweak:  Empty scripts are surprisingly common,
> and we don't detect when we're simply discarding the result of a
> script that's free of side effects.  The instructions [7..9] could
> have been omitted.  But that's an entirely separate issue, and
> I'd actually almost prefer to avoid it until we're ready to tackle
> still more aggressive optimisations.

Agreed on all points. A bytecode-level optimizer is a really
entertaining idea. I wish we were teleported in that distant future
where all our current "emergencies" have been dealt with and we can
focus on that jewel ;-)

-Alex

Donal K. Fellows

Sep 14, 2008, 9:56:01 PM
Alexandre Ferrieux wrote:
> Hold on :-}
> See my above answer to Evil Son: the killer is the repeated
> compilation in a nested iteration.

The problem only really comes when you're generating a nested loop.

> I'm not sure about the TDBC case, because there the compilation
> overhead might be dominated by heavier stuff like database access, but
> in principle it seems hard to avoid it without a true macro system.

For myself, I'd much rather figure out how to make [uplevel 1] more
performant, since then we can avoid having to generate code in all of
these cases.

Donal.

Alexandre Ferrieux

Sep 15, 2008, 2:44:23 AM
On Sep 15, 3:56 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

> Alexandre Ferrieux wrote:
> > Hold on :-}
> > See my above answer to Evil Son: the killer is the repeated
> > compilation in a nested iteration.
>
> The problem only really comes when you're generating a nested loop.

Oh yes Donal, a rephrasing contest. I love them :^P

-Alex

Evil Son

Sep 15, 2008, 10:01:27 AM
On Sep 15, 11:56 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

> For myself, I'd much rather figure out how to make [uplevel 1] more
> performant, since then we can avoid having to generate code in all of
> these cases.
>
> Donal.

This makes very good sense.

miguel

Sep 15, 2008, 2:35:53 PM
Alexandre Ferrieux wrote:
> - The second one is much more worrisome. I even initially failed to
> understand it when Miguel first mentioned it :-(, and it's like cold
> water on the neck now:
>
> set x {}
> myloop x 0 1000 {
> set y {}
> myloop y 0 1000 {
> body
> }
> }
>
> The problem is that while we now know how to write "myloop" so that
> its own execution gets asymptotically fast, it still reconstructs from
> scratch the code to be uplevelled. This means that in the code above,
> each turn of the outer loop calls a new "instance" of the inner one,
> which builds a fresh string (even though always the same) and hence
> has to call the bytecode compiler on it. So we'll compile 1000 times
> the y loop !

I am not sure that this is true, did you check that? I did not myself, but you
can do it: first configure with --enable-symbols=compile, then
proc history args {} ;# limit unneeded output
set tcl_traceCompile 3 ;# trace compilations
...

What I expect is that both loop bodies are compiled just once: they are literals
(or held in a var, or whatever: point is: always the same Tcl_Obj) and will not
be recompiled as long as the running env does not change.


Alexandre Ferrieux

Sep 15, 2008, 4:09:01 PM
On Sep 15, 8:35 pm, miguel <mso...@users.sf.net> wrote:
> Alexandre Ferrieux wrote:
> > - The second one is much more worrisome. I even initially failed to
> > understand it when Miguel first mentioned it :-(, and it's like cold
> > water on the neck now:
>
> >        set x {}
> >        myloop x 0 1000 {
> >            set y {}
> >            myloop y 0 1000 {
> >               body
> >            }
> >        }
>
> > The problem is that while we now know how to write "myloop" so that
> > its own execution gets asymptotically fast, it still reconstructs from
> > scratch the code to be uplevelled. This means that in the code above,
> > each turn of the outer loop calls a new "instance" of the inner one,
> > which builds a fresh string (even though always the same) and hence
> > has to call the bytecode compiler on it. So we'll compile 1000 times
> > the y loop !
> >
> I am not sure that this is true, did you check that?

I had little doubt but I did try to please you ;-)

% namespace path ::tcl::unsupported
% proc myloop {v a b code} {
    set x "for {set $v $a} {\$$v<$b} {incr $v} [list $code]"
    debugobj $x
    uplevel 1 $x
}
% myloop y 0 5 {myloop z 0 2 {puts $y:$z}}
Obj:0x9E7A28(2) (null) <<<for {set y 0} {$y<5} {incr y} {myloop z 0 2 {puts $y:$z}}>>>
Obj:0x9E7A58(2) (null) <<<for {set z 0} {$z<2} {incr z} {puts $y:$z}>>>
0:0
0:1
Obj:0x9E7A58(2) (null) <<<for {set z 0} {$z<2} {incr z} {puts $y:$z}>>>
1:0
1:1
Obj:0x9C51E0(2) (null) <<<for {set z 0} {$z<2} {incr z} {puts $y:$z}>>>
2:0
2:1
Obj:0x9E7AD0(2) (null) <<<for {set z 0} {$z<2} {incr z} {puts $y:$z}>>>
3:0
3:1
Obj:0x9E78C0(2) (null) <<<for {set z 0} {$z<2} {incr z} {puts $y:$z}>>>
4:0
4:1

In this display the (null) is the Tcl_ObjType of the built string :-(

> What I expect is that both loop bodies are compiled just once: they are literals
> (or held in a var, or whatever: point is: always the same Tcl_Obj) and will not
> be recompiled as long as the running env does not change.

The problem is that what's passed to uplevel is not the naked "body",
which is indeed a literal, but a dynamically concatenated string like
"for {set $v $a} {\$$v<$b} {incr $v} [list $code]". For the outer
loop it is no problem because this value is rather long-lived; but for
the inner one, that string will be reconstructed at each turn, thus
zapping any intrep, bytecode or not, unless we use a hashtable to keep
it between invocations of the inner "macro". But again, the hashtable
trick is instantly broken by something like "myloop y 0 $x ..." since
it will populate the table with one-shot snippets...
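
A quick way to see those recompilations directly is the
tcl_traceCompile trick Miguel suggested above. This needs a build
configured with --enable-symbols=compile; the proc names below are
made up, and the outcome is what I'd expect rather than a transcript
I've actually run:

    proc run_literal {body} { uplevel 1 $body }
    proc run_concat  {body} { uplevel 1 "if 1 [list $body]" }
    proc demo {} {
        # the braced body is a literal of demo, hence the same
        # Tcl_Obj on every call
        for {set i 0} {$i < 3} {incr i} { run_literal {expr {$i*$i}} }
        # here the script handed to uplevel is rebuilt on every call
        for {set i 0} {$i < 3} {incr i} { run_concat {expr {$i*$i}} }
    }
    set tcl_traceCompile 1   ;# one summary line per compilation
    demo

run_literal's script should appear in the trace once (compiled on the
first uplevel, then reused), while run_concat's should appear once per
call, which is exactly the fate of the inner myloop.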

This all boils down to the bytecode compiler working in a "shallow"
manner for user-provided functions, i.e. only as a side-effect of
[uplevel] or [eval], while an arbitrarily nested [for] compiles all
the way down to the innermost loop. Changing this implies a kind of
"scripted compileProc", which is roughly equivalent to a macro
anyway :-(

-Alex

miguel

Sep 15, 2008, 4:37:43 PM
Alexandre Ferrieux wrote:
> On Sep 15, 8:35 pm, miguel <mso...@users.sf.net> wrote:
>> I am not sure that this is true, did you check that?
>
> I had little doubt but I did try to please you ;-)

Much obliged :P

> % namespace path ::tcl::unsupported
> % proc myloop {v a b code} {
> set x "for {set $v $a} {\$$v<$b} {incr $v} [list $code]"
> debugobj $x
> uplevel 1 $x
> }
> % myloop y 0 5 {myloop z 0 2 {puts $y:$z}}

Indeed, the inner loop body gets recompiled at each go :(

As you said, a "memoizing" myloop would solve this problem ... and introduce
memory management problems of its own. That is, IIUC what you said, something like

proc memomyloop {v a b code} {
    if {![info exists ::MEMO($code)]} {
        set ::MEMO($code) "for {set $v $a} {\$$v<$b} {incr $v} [list $code]"
    }
    uplevel 1 $::MEMO($code)
}


Another possibility (just as ugly and hacky) might be (now that eval
bytecompiles its scripts too):

proc myloopbody {v a b code} {
    return "for {set $v $a} {\$$v<$b} {incr $v} [list $code]"
}

eval [myloopbody y 0 5 [myloopbody z 0 2 {puts $y:$z}]]

(or something like that ... obviously not tested)

Alexandre Ferrieux

Sep 15, 2008, 5:15:51 PM
On Sep 15, 10:37 pm, miguel <mso...@users.sf.net> wrote:
>
>      uplevel 1 $::MEMO($code)

Exactly. Wouldn't do that in my script... let alone in the core !

> eval [myloopbody y 0 5 [myloopbody z 0 2 {puts $y:$z}]]

Yup. Removing the [eval] above is exactly what would permit a "parse
hook", or "macro system".
Personally I prefer the first naming (parse hook) because it sounds
more introspective: it is just an extra window into Tcl's guts,
allowing one last intervention, from AST to AST, only at compile time,
just before the bytecodes are generated. Of course we'd need to
duplicate the same hook for pure-eval paths... until we get rid of
them ;-)

-Alex

Donal K. Fellows

Sep 16, 2008, 5:46:17 AM
Alexandre Ferrieux wrote:

> Donal K. Fellows wrote:
> > The problem only really comes when you're generating a nested loop.
>
> Oh yes Donal, a rephrasing contest. I love them :^P

The reality is that I think it's incredibly difficult to do. As I
noted, making [uplevel 1] faster is the bigger win for existing code.
Even if that leaves things unable to go *quite* as fast, it keeps the
code much simpler.

Donal.

Alexandre Ferrieux

Sep 16, 2008, 8:01:14 AM
On Sep 16, 11:46 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

Maybe we're talking out of phase here ?
What exactly are you saying is incredibly difficult: inserting a parse
hook at the C level, or allowing nested loops to avoid the pitfall
we've just described ?

In any case "making [uplevel 1] faster" is exactly what we've been
busy exploring in the past few days, and we have reached a near-
optimum:

(a) we can reach 1.0 times the speed of the naked code
(b) currently we need extra var declarations in the caller
(c) removing this constraint involves making compiledLocals
extensible

Are you advocating to undertake (c) asap, or instead to postpone it ?
(but then, what's left to be done to uplevel ?)

-Alex

Donal K. Fellows

Sep 17, 2008, 12:08:22 AM
Alexandre Ferrieux wrote:
> Maybe we're talking out of phase here ?

I think you're misunderstanding me.

> What exactly are you saying is incredibly difficult: inserting a parse
> hook at the C level, or allowing nested loops to avoid the pitfall
> we've just described ?

What I'm saying is that I'm not keen on parse hooks (they're really
easy to misuse badly; I know programmers too well to trust them not
to!) and the approach people have used so far with making [uplevel 1]
fast won't work with nested loops since it involves generating the
bytecodes on call to the looping command. Doing better will require a
different approach.

> In any case "making [uplevel 1] faster" is exactly what we've been
> busy exploring in the past few days, and we have reached a near-
> optimum:
>
>   (a) we can reach 1.0 times the speed of the naked code

Only in the non-nested case.

>   (b) currently we need extra var declarations in the caller
>   (c) removing this constraint involves making compiledLocals
> extensible
>
> Are you advocating to undertake (c) asap, or instead to postpone it ?
> (but then, what's left to be done to uplevel ?)

I want to make this faster:
proc loop {var from to body} {
    upvar 1 $var v
    for {set v $from} {$v < $to} {incr v} {uplevel 1 $body}
}

Why that? Because it handles the nested case well, since it does no
code generation (code generation forces recompilation). It is also
pretty much how a lot of existing code is written. The primary
roadblock here is the slowness of the [uplevel 1] overhead...
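
For instance (an untested sketch, names invented), nesting it needs
no code generation at all, since both bodies stay literals of the
calling proc:

    proc demo {} {
        set total 0
        loop i 0 100 {
            loop j 0 100 {
                incr total
            }
        }
        return $total    ;# expect 10000
    }

Every [uplevel 1] lands in demo's frame, so $i, $j and $total all live
in one place; the remaining cost is just the per-iteration [uplevel 1]
and command dispatch.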

Donal.

Alexandre Ferrieux

Sep 17, 2008, 3:25:41 AM
On Sep 17, 6:08 am, "Donal K. Fellows" <donal.k.fell...@man.ac.uk>
wrote:

> Alexandre Ferrieux wrote:
> > Maybe we're talking out of phase here ?
>
> I think you're misunderstanding me.

You're right. Either I'm being dense or you're being sybilline. Or
both :)

> What I'm saying is that I'm not keen on parse hooks (they're really
> easy to misuse badly; I know programmers too well to trust them not
> to!)

Sigh. Let's not have tools, we might cut ourselves:

proc set {x {y {}}} {puts Eeeek}

Let's remove [proc] then !
Seriously, is this opinion shared among the TCT ?

> and the approach people have used so far with making [uplevel 1]
> fast won't work with nested loops since it involves generating the
> bytecodes on call to the looping command. Doing better will require a
> different approach.

Agreed.

> I want to make this faster:
>   proc loop {var from to body} {
>      upvar 1 $var v
>      for {set v $from} {$v < $to} {incr v} {uplevel 1 $body}
>   }

I like it when you remove the -sibylline flag ;-)
This indeed avoids dynamic code generation and endless recompilations,
but surely you realize that regardless of how much we reduce the
overhead of [uplevel], a nested loop will still involve two levels of
invokeStk, and hence will never come close to the speed of a fully-
inlined, two-level [for]. So the Tcl leitmotiv that control flow
commands are first-class and user-extensible is still an exaggeration
when performance is at stake.
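
A rough way to put a number on that remaining gap, using your [loop]
above (just a sketch, I haven't run it, and the iteration counts are
arbitrary):

    proc inline {} {
        for {set i 0} {$i < 1000} {incr i} {
            for {set j 0} {$j < 1000} {incr j} {}
        }
    }
    proc viaLoop {} {
        loop i 0 1000 {
            loop j 0 1000 {}
        }
    }
    puts [time inline 5]
    puts [time viaLoop 5]

Whatever the exact figures, the difference should essentially be the
extra [uplevel 1] and command invocations that the proc-based version
pays on every iteration.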

-Alex

Alexandre Ferrieux

Sep 17, 2008, 3:35:02 AM
On Sep 17, 9:25 am, I wrote:
>
> ... I'm being dense or you're being sybilline.

Or, I'm being dyslexic and you're being sibylline :)

-Alex
