
Tcl faster than Perl/Python...but only with tricks...


Stephan Kuhagen

Dec 30, 2006, 7:05:16 AM
Hello

Currently there is a thread in c.l.python
(http://groups.google.de/group/comp.lang.python/browse_thread/thread/923e34e8466ac920/233f1310151e19f6)
about whether it is possible for Python to beat Perl in a small
text-matching task. Bear with me: a really fast Tcl solution comes at the
end, but I would like to describe the Perl/Python versions first and then
present Tcl, along with some questions about the performance of certain
things in Tcl.

The text to match was the case-insensitive word "destroy" in a text from
gutenberg.org (the King James Bible). The text used for the test was
generated this way:

$ wget http://www.gutenberg.org/files/7999/7999-h.zip
$ unzip 7999-h.zip
$ cd 7999-h
$ cat *.htm > bigfile
$ du -h bigfile
8.2M bigfile

The code there for Perl was:
---
open(F, 'bigfile') or die;
while(<F>) {
    s/[\n\r]+$//;
    print "$_\n" if m/destroy/oi;
}
---

This very quickly finds and prints all lines containing "destroy",
case-insensitively. On my computer (Linux 2.6.18, 2.6 GHz Pentium 4) this
took 0.273s for Perl (for all measurements I used the average of the last
three runs out of four, throwing away the first one because of caching).
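
For reproducibility, such a measurement can be taken with a shell loop
roughly like this (a sketch; the script name grep.pl is a placeholder):

$ for i in 1 2 3 4; do time perl grep.pl > /dev/null; done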

The Python version was:
---
import re
r = re.compile(r'destroy', re.IGNORECASE)
for s in file('bigfile'):
    if r.search(s): print s.rstrip("\r\n")
---

Also fast, I think: 0.622s. After some iterations, the Pythonians came up
with this faster solution at 0.526s:
---
import re
r = re.compile(r'destroy', re.IGNORECASE)
def stripit(x):
    return x.rstrip("\r\n")
print "\n".join( map(stripit, filter(r.search, file('bigfile'))) )
---

I asked myself how this would perform in Tcl, so I first wrote the
straightforward version, which resembles the others:

---
set f [open bigfile r]
while { [gets $f line] >= 0} {
    if {[string match -nocase "*destroy*" $line]} {
        puts $line
    }
}
---

0.937s. Ouch... (Tcl 8.4.13; with 8.5a4 I got an even worse 1.2s.)

I asked myself what makes Tcl so damn slow here. I commented out the
if...puts part, which made the thing twice as fast (and useless, of
course...). But that shows that the matching only took half of the time,
which surprised me. I thought reading the file and running through the
while loop should take nearly no time...

So my question is: why are [gets] and/or [while] so slow, and is there a
chance to improve that? For text processing these are two very central
commands...

I think about all the Usenet threads and preconceptions about Tcl's
slowness (just have a look at the current thread in c.l.tcl: "Is Tcl work
for large programs?"). Tcl CAN be really fast, but you need some tricks
and knowledge, which are far from obvious... After some thinking, I came
up with this:

---
set f [open bigfile r]
puts [join [regexp -all -inline -linestop -nocase {.*destroy.*\n} \
    [read $f]] {}]
---

0.223s (8.5a4: 0.241s). Wow! Faster than Perl and at least as unreadable
as Perl; the Perl guys would love it! ;-)

But I don't. It doesn't look good, and it uses an unfair trick by reading
the whole file into memory. That does not work if the file is too large
for memory, while this would be no problem for the Perl/Python versions.
The only good thing about this version is that it shows that Tcl's regexp
is nearly as fast as Perl's, which is really good, I think.

So I could beat Perl's/Python's performance with Tcl, but it does not
really make me happy...

Regards
Stephan

Mark Janssen

Dec 30, 2006, 8:46:27 AM
>
> I asked myself, what make Tcl so damn slow here. I commented out the
> if...puts...-part what made the thing twice as fast (and useless of
> course...). But that shows, that matching only took half of the time, which
> surprised me. I thought, reading the file and running through the
> while-loop should take nearly no time...
>
> So my question is, why are [gets] and or [while] so slow, and is there a
> change to improve that? For text processing these are two very central
> commands...
>

Are you sure gets and while are slow? At least a part of the result the
shell's time command gives is spent in initialization of the Tcl
interpreter. A better way to test the performance of the Tcl code would
be to use the [time] command to determine the time.
In my experience, starting Tcl can be relatively slow if you have, for
instance, a lot of packages installed.
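
For reference, a minimal sketch of this approach, assuming the search loop
from the original post is wrapped in a proc named grep:

---
proc grep {} {
    set f [open bigfile r]
    while { [gets $f line] >= 0} {
        if {[string match -nocase "*destroy*" $line]} {
            puts $line
        }
    }
    close $f
}
# [time] reports average microseconds per iteration, excluding
# interpreter startup:
puts [time {grep} 3]
---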

Mark

George Petasis

Dec 30, 2006, 8:52:47 AM
to Stephan Kuhagen
Dear Stephan,

Yes, there are tricks you can do to make things fast in any language,
but you can apply simple things (like placing your code in a proc) that
will make your code run faster.

I have tried some small variations by mimicking your first Python code
(as I don't know enough Perl to understand the Perl code). For me,
the Tcl code that does something similar to the Python code is:

proc do {} {
    set f [open bigfile r]
    foreach { s } [split [read $f] \n] {
        if { [regexp -nocase {destroy} $s] } {
            puts $s
        }
    }
}
do

I have enclosed the code in a proc, and I am treating the file as a
list, as python does :-)

The above code is faster than the Python. Perl is of course unbeatable,
and the second Python code uses a special feature of Python (filters on
files?), which to me is equivalent to the trick you also did with
"regexp -all". So, my ranking is: :-)

1st: "tricky" tcl (regexp -all on all data)
2nd: perl
3rd: "tricky" python (using filter)
4th: tcl (code in proc, split on [read $f], string match & regexp)
5th: python
6th: tcl with gets (in proc)
7th: tcl with gets (no proc)

It would be interesting to have a version of Python that reads line by
line (like gets in Tcl) to compare (I don't know Python, so perhaps I am
asking something silly?). Also, it is unclear to me whether Perl (5.8.8)
is working in Unicode mode or in 8-bit (on my Fedora Core 6, Tcl uses
utf-8, and I suppose Python also supports Unicode). Anyway, the numbers
on my system are (measured with Tcl's time for 15 iterations):

perl perl.pl: 244442 microseconds per iteration
python python.py: 582543 microseconds per iteration
python python2.py: 527532 microseconds per iteration
tclsh tcl.tcl: 1020712 microseconds per iteration
tclsh tcl2.tcl: 816473 microseconds per iteration
tclsh tcl3.tcl: 550781 microseconds per iteration
tclsh tcl4.tcl: 568477 microseconds per iteration
========================================================
perl.pl (244442 microseconds per iteration)
========================================================


open(F, 'bigfile') or die;
while(<F>) {
    s/[\n\r]+$//;
    print "$_\n" if m/destroy/oi;
}

========================================================
python.py (582543 microseconds per iteration)
========================================================


import re
r = re.compile(r'destroy', re.IGNORECASE)
for s in file('bigfile'):
    if r.search(s): print s.rstrip("\r\n")

========================================================
python2.py (527532 microseconds per iteration)
========================================================


import re
r = re.compile(r'destroy', re.IGNORECASE)
def stripit(x):
    return x.rstrip("\r\n")
print "\n".join( map(stripit, filter(r.search, file('bigfile'))) )

========================================================
tcl.tcl (1020712 microseconds per iteration)
========================================================


set f [open bigfile r]
while { [gets $f line] >= 0} {
    if {[string match -nocase {*destroy*} $line]} {
        puts $line
    }
}

========================================================
tcl2.tcl (816473 microseconds per iteration)
========================================================
proc do {} {
    set f [open bigfile r]
    while { [gets $f line] >= 0} {
        if {[string match -nocase {*destroy*} $line]} {
            puts $line
        }
    }
}

do
========================================================
tcl3.tcl (550781 microseconds per iteration)
========================================================
proc do {} {
    set f [open bigfile r]
    foreach { s } [split [read $f] \n] {
        if { [regexp -nocase {destroy} $s] } {
            puts $s
        }
    }
}
do
========================================================
tcl4.tcl (568477 microseconds per iteration)
========================================================
proc do {} {
    set f [open bigfile r]
    foreach s [split [read $f] \n] {
        if {[string match -nocase {*destroy*} $s]} {
            puts $s
        }
    }
}
do

The code of the running script is:
set list {perl perl.pl
          python python.py python python2.py
          tclsh tcl.tcl tclsh tcl2.tcl
          tclsh tcl3.tcl tclsh tcl4.tcl}
foreach {exe code} $list {
    set time($code) [time [list exec $exe $code] 15]
    puts "$exe $code: $time($code)"
}
foreach {exe code} $list {
    puts "========================================================"
    puts " $code ($time($code))"
    puts "========================================================"
    set f [open $code]; puts [string trim [read $f]]; close $f
}

Regards,
George

Stephan Kuhagen wrote:

George Petasis

Dec 30, 2006, 8:55:54 AM
to George Petasis, Stephan Kuhagen
I forgot to mention that I used ActiveTcl 8.4.14 for Linux, with only two
additional packages, just to have a vague measure of the startup time.
The way I tested, the startup time of all languages is measured as well.

Regards,
George

Uwe Klein

Dec 30, 2006, 9:02:24 AM
Hi Stephan,

Stephan Kuhagen wrote:
> Hello
>
> Currently there is a thread in c.l.python
> (http://groups.google.de/group/comp.lang.python/browse_thread/thread/923e34e8466ac920/233f1310151e19f6)

grep 0.430 for comparison: grep -i destroy bigfile
perl 0.365 your original
python 0.720 your original
tcl1 0.990 your original while ...
tcl2 0.750 same but while in proc
tcl3 0.500 same, but read and split into var, foreach line $var .....
tcl4 0.300 your regexp original
tcl5 0.330 your reg orig, but in proc ( no gain )

uwe

Stephan Kuhagen

Dec 30, 2006, 9:47:38 AM
Mark Janssen wrote:

> Are you sure gets and while are slow? At least a part of the result the
> time command is giving is spent in intialization of the Tcl
> interpreter. A better way to test the performance of the Tcl code would
> be to use the [time] command to determine the time.

This would be the Tcl way to test performance, of course. But since the
Perl/Python versions also have to load their interpreters, I think it
would not be fair to skip this step for Tcl. But to check it, I just
measured an empty Tcl script with the same method (four runs, average of
the last three), which shows that skipping the tclsh loading saves only
0.015s.

I think the biggest "mistake" I made was not putting the while/gets into
a proc, as mentioned by Uwe Klein and George Petasis. That improves
performance because of bytecode compilation, as the sketch below
illustrates.
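
A minimal sketch of the effect (numbers will vary; the point is that a
proc body is byte-compiled with local variable slots, while the same code
at the top level resolves variables by name):

---
proc loop {} {
    for {set i 0} {$i < 1000000} {incr i} {}
}
puts [time {loop}]                             ;# byte-compiled proc body
puts [time {
    for {set i 0} {$i < 1000000} {incr i} {}
}]                                             ;# top-level script
---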

Regards
Stephan

Stephan Kuhagen

Dec 30, 2006, 9:58:22 AM
Hello

Uwe Klein wrote:

> grep 0.430 for comparison: grep -i destroy bigfile
> perl 0.365 your original
> python 0.720 your original
> tcl1 0.990 your original while ...
> tcl2 0.750 same but while in proc
> tcl3 0.500 same, but read and split into var, foreach line $var .....
> tcl4 0.300 your regexp original
> tcl5 0.330 your reg orig, but in proc ( no gain )

I also tried a solution with read and foreach. But that has the same
drawback as mine with regexp, so it would not work for really big files.

Putting the while in a proc is a good solution; I should have done that
myself. But that is still far away from Perl. When I comment out the
if...puts part again, it still needs 0.455s, so [gets] and/or [while]
remain slow even in a proc. This is what surprises me.

Regards
Stephan

Stephan Kuhagen

Dec 30, 2006, 10:41:37 AM
Hello

> I have tried some small variations by mimicking your first python code
> (as I don't know enough perl to understand the perl code

he, he, me too, but somehow it did the thing... ;-)

> ). For me,
> the tcl code that does similar to the python code is:
>
> proc do {} {
>     set f [open bigfile r]
>     foreach { s } [split [read $f] \n] {
>         if { [regexp -nocase {destroy} $s] } {
>             puts $s
>         }
>     }
> }
> do
>
> I have enclosed the code in a proc, and I am treating the file as a
> list, as python does :-)

The "for x in file" in Python give only list semantics, but it reads the
file line by line. I also tried a solution in a proc with the whole file
used as a list. This was a little bit faster, but not as fast as Perl, and
it also suffers from the fact, that it would not work for really big files.

> The above code is faster than python. Perl is of course unbeatable and
> the second python code uses a special feature of python (filters on
> files?), which to me is equivalent to the trick you also did with
> "regexp -all". So, my ranking is: :-)

Yes, the filter is a neat trick, and it also has the advantage of not
reading the whole file into memory.

> It would be interesting to have a version of python that reads line by
> line (like gets in tcl) to compare (I don't know python, so perhaps I am
> asking something silly?).

---
import re

def do():
    f = file('bigfile')
    r = re.compile(r'destroy', re.IGNORECASE)
    line = f.readline()
    while line != "":
        if r.search(line):
            print line.rstrip("\r\n")
        line = f.readline()

do()
---
Surprisingly for me, this is indeed slower than the Tcl version with
gets/while in a proc:
0.822s (Python)
0.659s (Tcl)

OTOH, nobody would write something like this in Python, since the
tutorial shows you for...in or filter() (unfair trick! ;-) as soon as you
start learning Python. In Tcl I think it is natural to use while/gets and
not some weird read/regexp combination.

But aside from that, I am wondering whether while and gets (or one of
them) have such a big performance impact, and why. Remember, even if I
remove the if, regexp and puts in the loop, it only gets twice as fast. I
would have expected that if, regexp and puts together would use much more
time than while and gets.

Regards
Stephan

Alexandre Ferrieux

Dec 30, 2006, 1:31:49 PM

Stephan Kuhagen wrote:
>
> But aside from that, I wondering, if while and gets (or one of them) have
> such a big performance impact, and why. Remember, even if I remove the if,
> regex and puts in the loop, it just gets twice as fast. I would expect,
> that if, regex and puts together should use much more time than while and
> gets.

A simple way of removing all the "first time" overheads (interpreter
init, bytecode compilation) would be to do the comparison on a file ten
times larger (just cat it over and over again).

But *if* [gets] is truly slower than it should be (I'm not expecting
[while] to be the culprit!), then I'd look around the -translation
(the crlf) and -encoding (unicode conversion) aspects of the channel
(remember there's a mandatory unicode conversion for regexp; I don't
know whether this could be a realistic cause of the slowdown though...).
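
For anyone wanting to experiment along these lines, the channel settings
can be inspected and changed with fconfigure (a sketch; the defaults
depend on platform and locale):

---
set f [open bigfile r]
puts [fconfigure $f -translation]   ;# usually "auto" for reading
puts [fconfigure $f -encoding]      ;# the system encoding, e.g. utf-8
fconfigure $f -translation lf -encoding iso8859-1
---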

-Alex

Cameron Laird

Dec 30, 2006, 2:21:23 PM
In article <en61bi$a5k$1...@kohl.informatik.uni-bremen.de>,
Stephan Kuhagen <s...@nospam.tld> wrote:
.
.
.

>def do():
> f=file('bigfile')
> r = re.compile(r'destroy', re.IGNORECASE)
> line=f.readline()
> while line!="":
> if r.search(line):
> print line.rstrip("\r\n")
> line=f.readline()
.
.
.

>OTOH, nobody would write something like this in Python, since in the
>tutorial shows you to use for...in or filter() (unfair trick! ;-) as soon
>as you start learning Python. In Tcl I thinks it is natural to use
.
.
.
Plenty of people--including book authors--*do* write
things like this in Python.

Stephan Kuhagen

Dec 30, 2006, 4:08:26 PM
Alexandre Ferrieux wrote:

> A simple way of removing all the "first time" overheads (interpreter
> init, bytecode compilation) would be to do the comparison on a file ten
> times larger (just cat it over and over again).

Good point. I made the file 100 times larger, so it does not fit into
memory and in-memory processing has no advantage (which destroys my first
and fastest solution of reading the whole file and then applying regexp
to that large string). I used the fastest Tcl version mentioned here so
far (Uwe Klein's: while and gets in a proc).

Perl: 35.960s
Python: 1m 31.896s
Tcl: 1m 19.639s

> But *if* [gets] is truly slower than it should (I'm not expecting
> [while] to be the culprit !), then I'd look around the -translation
> (the crlf) and -encoding (unicode conversion) aspects of the channel
> (remember there's a mandatory unicode conversion for regexp; I don't
> know if this could be a realistic cause for the slowdown though...).

I'm not sure how to eliminate that with fconfigure, but I doubt that it
would improve things much. When I commented out the regexp, I only got a
speedup of a factor of 2, which means half of the runtime is spent
reading the file with [gets] in a [while].

But your comment made me dig a little deeper, so I made this test: the
new very big file has 11,715,400 lines (819MB). The file can be read with
the following script:

---
set f [open bigfile r]
puts [time {
    gets $f line
    gets $f line
    gets $f line
    gets $f line
    gets $f line
    gets $f line
    gets $f line
    gets $f line
    gets $f line
    gets $f line
} 1171540]
---

So only the [gets] counts. The result is:
59.259s total runtime and 50.3546485822 microseconds per iteration.

Doing the same with ten times "while {1} break" runs in only 2.338s (see
the sketch below), so obviously [gets] is the slow part. So when
processing files line by line, one should try to get around [gets] if it
has to be really fast... The regexp is not the bottleneck, which at first
surprised me a little but also delights me; it fits my experience from
once writing the same parser, heavily using regular expressions, in
Python and Tcl, where the Tcl version was, depending on the input, up to
ten times faster.
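
For completeness, the control measurement is the same script with the ten
[gets] calls replaced by no-op [while] loops (a sketch):

---
puts [time {
    while {1} break
    while {1} break
    while {1} break
    while {1} break
    while {1} break
    while {1} break
    while {1} break
    while {1} break
    while {1} break
    while {1} break
} 1171540]
---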

Regards
Stephan

Stephan Kuhagen

Dec 30, 2006, 4:08:30 PM
Cameron Laird wrote:

> Plenty of people--including book authors--*do* write
> things like this in Python.

Okay, maybe; I did not check that. I started using Python just one year
ago and would never have come to this solution, because I learned Python
by doing and by looking into its tutorial, where the "for line in file"
trick is shown as one of the first things.

Tcl, OTOH, I learned in the early '90s and have used ever since as my
favourite language, but I really had to think about my really fast
solution.

But I do not want to discuss Tcl vs. anything. I just compared the
execution time of some languages for the same task, and how people seem
to do it naturally in each language. What I wondered about was the time
that either [gets] or [while] used.

Regards
Stephan

Uwe Klein

Dec 30, 2006, 4:51:12 PM
Stephan Kuhagen wrote:
> Alexandre Ferrieux wrote:
>
>
>>A simple way of removing all the "first time" overheads (interpreter
>>init, bytecode compilation) would be to do the comparison on a file ten
>>times larger (just cat it over and over again).
>
>
> Good point. I made the file 100 times larger now, so they do not fit into
> memory and no in-memory-processing has an advantage (which destroys my
> first and fastest solution with reading the whole file and the using
> regexpt to that large string). I used the fastest Tcl version mentioned
> here until now (Uwe Klein: while and gets in a proc).

Just for the fun of it:
I changed the [gets $fd line] to [set line [read $fd 65536]].
The difference is negligible.

uwe

Ian Bell

Dec 30, 2006, 6:20:56 PM
Uwe Klein wrote:

I am no expert, but presumably, no matter what the language, the gets or
read ends up as an fgets C library call. Is it possible that the
differences are due in part to the way each language sets up the channel,
e.g. buffer size, blocking, number of chars read in fgets, etc.?

Ian

Alexandre Ferrieux

Dec 30, 2006, 6:25:15 PM

Uwe Klein wrote:

>
> Just for the fun of it:
> I changed the [gets $fd line] to [set line [read $fd 65536]].
> the difference is negligible.

Ha! This one seems to point to my suggestion about some post-processing
done in the channel code (encoding or crlf). Can you try with
-translation binary, and also play around with various -encoding values?

Also, another interesting comparison point would be stdio's
fread()/fgets(). I'm not sure 'grep' uses it though (does anybody know
whether Perl and Python do?). You may need to compile a 3-line C program
to test...

-Alex

Earl Greida

Dec 30, 2006, 7:12:37 PM

"Stephan Kuhagen" <s...@nospam.tld> wrote in message
news:en5klt$7fa$1...@kohl.informatik.uni-bremen.de...
>
> On my computer (Linux 2.6.18, 2.6 GHz Pentium 4)..
> 0.273s for Perl
> The Python-Version was....0.622s.
> Tcl...0.937s

To me, the question is how often the program needs to read the 8.2MB
file. If every second, then the time difference matters. If once a
minute, then the few-hundred-millisecond difference becomes less
important. If only a few times, then the time difference is essentially
irrelevant. The question becomes: which language is easier to use, and
produces a program that is easier to read, easier to maintain, easier to
enhance, and more reliable, over the life of the program?


Stephan Kuhagen

Dec 30, 2006, 7:42:42 PM
Earl Greida wrote:

This is quite right. The Perl version (for me) is not as readable as the
Tcl or the Python version. But Python's "for line in file" is really
simple and good-looking. Additionally, one doesn't build such a script to
run this task only once; for that you'd simply use a one-liner or grep
and not care whether it runs half a second or three seconds. So this is
of course an academic example. But if you really do such tasks on really
big files often, then it makes a difference. I had such a case, and
luckily Tcl won because of its faster regexp compared to Python. But if
the checks I had to do on the data had been simpler and could have been
done without regexp, with simple string matching, Python would have been
several times faster. And since this specific task runs in a loop some
hundred times every ten minutes, the runtime of the script has some
serious impact. For me, the slowness of [gets] doesn't do any harm to my
preference for Tcl. But I was really wondering why reading strings from a
file is so slow compared to the others, because I have the feeling that
for many tasks this is a very central bottleneck.

Regards
Stephan

Stephan Kuhagen

Dec 30, 2006, 7:52:14 PM
Ian Bell wrote:

> I am no expert, but presumably, no matter what the language, the gets or
> read ends up as an fgets C library call. Is it possible the differences
> are due in part to the way each language sets up the channel, e.g. buffer
> size, blocking, number of chars read in fgets, etc.?

I'm starting to suspect this too, along with what Alexandre Ferrieux
wrote about translation and encoding. Maybe there are some settings for
buffering and encoding/translation that can make [gets] as fast as the
readlines of Perl/Python. Knowing those settings (or some good heuristics
depending on the task and environment) would be a great improvement, and
they could be mentioned in some tutorial, the man pages or somewhere.

If it's not that, maybe a look at the implementation of [gets], comparing
it to the same things in Python and Perl (no way!), would reveal some
interesting differences.

Regards
Stephan

sleb...@gmail.com

Dec 31, 2006, 5:50:43 AM
Stephan Kuhagen wrote:
> Earl Greida wrote:
>
> > "Stephan Kuhagen" <s...@nospam.tld> wrote in message
> > news:en5klt$7fa$1...@kohl.informatik.uni-bremen.de...
> >>
> >> On my computer (Linux 2.6.18, 2.6 GHz Pentium 4)..
> >> 0.273s for Perl
> >> The Python-Version was....0.622s.
> >> Tcl...0.937s
> >
> > To me, the question is how often does the program need to read the 8.2MB
> > file. If every second then the time difference matters. If once a minute
> > then the few hundred milli-second difference becomes less important. If
> > only a few times then the time difference is essentially irrelevant. The
> > question becomes, which language is easier to use, and produces a program
> > that is easier to read, easier to maintain, easier to enhance, and more
> > reliable, over the life of the program.
>
> This is quite right. The Perl-Version (for me) is not as readable as the Tcl
> or the Python version. But Pythons "for line in file" is really simple and
> good looking.

Slightly off-topic. If you like that Python "trick" then you'd love
tcllib's fileutil::foreachLine. If you don't have tcllib then copy and
paste this to your file:

# Slightly different name than tcllib but same semantics:
proc eachline {var fname script} {
    upvar 1 $var v
    set f [open $fname r]
    while 1 {
        set v [gets $f]
        uplevel 1 $script
        if {[eof $f]} break
    }
    close $f
}

Or if you like Python's version:

# With python semantics:
proc eachline {var args} {
    upvar 1 $var v
    if {[llength $args] == 3} {
        if {[lindex $args 0] == "in"} {
            set args [lrange $args 1 end]
        } else {
            error {invalid syntax}
        }
    } elseif {[llength $args] > 3} {
        error {wrong # args: should be "eachline var ?in? filename script"}
    }

    set fname [lindex $args 0]
    set script [lindex $args 1]

    set f [open $fname r]
    while 1 {
        set v [gets $f]
        uplevel 1 $script
        if {[eof $f]} break
    }
    close $f
}
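
Hypothetical usage on the example task from this thread (both call forms
work with the proc above):

---
eachline line in bigfile {
    if {[string match -nocase "*destroy*" $line]} {
        puts $line
    }
}
---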

Stephan Kuhagen

Dec 31, 2006, 6:30:25 AM
Hello

> Slightly off-topic. If you like that Python "trick" then you'd love
> tcllib's fileutil::foreachLine. If you don't have tcllib then copy and
> paste this to your file:

Thanks. I know tcllib, of course, but I did not use it because it makes
things much more comfortable but also much slower. foreachLine is nice,
but when using tcllib anyway, I would use fileutil::grep, which is also
much faster than foreachLine (see the sketch below).
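
For reference, a sketch of that tcllib call; as far as I can see, the
pattern is a regular expression, so (?i) gives case insensitivity, and
each result element has the form file:lineNumber:line:

---
package require fileutil
foreach hit [fileutil::grep {(?i)destroy} [list bigfile]] {
    puts $hit
}
---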

Nevertheless, your hint gave me the idea to try this one:
---
proc grep_for {} {
    for {set f [open bigfile r]} {[gets $f line] >= 0} {} {
        if {[string match -nocase "*destroy*" $line]} {
            puts $line
        }
    }
}
---

Performance is nearly the same as the [while] version (while: 0.692s, for:
0.680s), but a little bit shorter.

Regards
Stephan

Donal K. Fellows

Dec 31, 2006, 6:39:06 AM
Ian Bell wrote:
> I am no expert but presumably, not matter what the language, the gets or
> read ends up as an fgets C library call.

Certainly not! fgets() is a part of stdio, and Tcl doesn't use that
because its buffering interacts badly with the use of select() in Tcl's
notifier. Instead, Tcl manages everything above the OS's syscall level
(i.e. read() or recv() on Unix, depending on the channel type) so that it
can get the interaction between the various pieces right.

If you want a speed boost, make sure that the channel is configured to
use -translation binary and use [read]. That turns off a lot of
processing that would otherwise slow things down. Also, if your file is
only a few percent of your available physical address space in size,
then slurping it all into memory is best anyway. :-)
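
A minimal sketch of that advice, assuming the data is plain ASCII (as the
Gutenberg file is):

---
set f [open bigfile r]
fconfigure $f -translation binary
set data [read $f]
close $f
puts [join [regexp -all -inline -linestop -nocase {.*destroy.*\n} $data] {}]
---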

> Is it possible the differences are
> due in part to the way each language sets up the channel e.g buffer size,
> blocking, no chars read in fgets etc?

The differences quite possibly also relate to how languages handle their
internal string representations. Processing as byte arrays is definitely
faster (though wrong with some rare system encodings, but I bet Tcl does
better with those than other languages do). As noted above, put the
channel in binary mode (8.5 will make this easier to do BTW) and then
use [read] for a speed boost.

Donal.

Uwe Klein

Dec 31, 2006, 7:25:34 AM
sleb...@yahoo.com wrote:
................

Then I would like to have "pushable" stream modules.

set fd [ open InputFile r ]
chan $fd popall
chan $fd push regexp_filter

OR
Would a variable with "lazy content" be possible in tcl?

set fd [ open InputFile r ]
chan $fd mapto content_of_file_var

puts [join [regexp -all -inline -linestop -nocase {.*destroy.*\n} \
    $content_of_file_var] {}]

uwe

Uwe Klein

Dec 31, 2006, 7:18:01 AM
He he:
(This is actually broken, because the search term will not be found if it
spans a block boundary.)

#!/usr/bin/tclsh

proc matchdestroy {fd} {
    while { [set line [read $fd 65536]] != "" } {
        if {[string match -nocase "*destroy*" $line]} {
            puts $line
        }
    }
}

set f [open bigfile r]
fconfigure $f -encoding binary -translation binary
matchdestroy $f
#end

real 0m0.207s
user 0m0.162s
sys 0m0.045s

the same applied to the initial procified [gets] example
runs for:
real 0m0.668s
user 0m0.654s
sys 0m0.014s

Stephan Kuhagen

Dec 31, 2006, 7:32:12 AM
Hello

> If you want a speed boost, make sure that the channel is configured to
> use -translation binary and use [read].

But [read] has the disadvantage that you cannot read line by line, or am
I missing something here?

And I just tried to set translation to binary, and it slowed things down. Am
I doing this wrong or is there something strange happening?

---
proc grep {} {
    set f [open bigfile r]
    fconfigure $f -translation binary
    while { [gets $f line] >= 0} {
        if {[string match -nocase "*destroy*" $line]} {
            puts $line
        }
    }
}
---

With the fconfigure the script runs 0.769s and when removing the fconfigure,
the script runs 0.703s...

> That turns off a lot of
> processing that would otherwise slow things down. Also, if your file is
> only a few percent of your available physical address space in size,
> then slurping it all into memory is best anyway. :-)

Yes, that was of course the solution I finally came up with, but it
doesn't work when files get really big.

Regards
Stephan

George Petasis

Dec 31, 2006, 8:21:25 AM
to Stephan Kuhagen
In my view, it would be nice if something came out of all this, so is it
a good idea to try to improve things?

How about starting a discussion on how to improve reading a file line by
line, which is a quite frequent action (at least for me)?

How about creating a new subcommand (in the 8.5 chan command?) that
performs an iteration over a channel's content, similar to the way Python
does? For example, I can think of a command like:

chan foreach channelId {var1 ?var2 ...?} {split_chars-like-split} code

The idea is that Tcl_ReadObj is used to read pieces of the file in
memory, and the split_chars are used to cut this into pieces, which are
assigned into the variables. So,

chan foreach stdin line \n {
    puts $line
}

will print a file line by line. The loop ends when eof/error occurs or
when break is executed. We could even provide a variation for lines, like:

chan foreachline channelId {var-list} code

And finally, completely mimic foreach by allowing iteration over multiple
files:

chan foreach channelId-1 {var-list-1} {split-chars-1} \
?channelId-2 {var-list-2} {split-chars-2} ...? \
{code}

How about this idea?

Another idea that came to me while reading this post was to implement a
special list-type object that is bound to a file. For example:

chan split split-chars

This would return a "special" list object whose internal representation
is bound to the channel. The file can be searched (preferably in a "lazy"
way, i.e. only when it is needed) while keeping internal offsets of the
split characters. This way the object can return specific elements by
seeking in the file. However, this is by far more complex than "chan
foreach", and I doubt it would be more useful than "chan foreach".

Any opinions? :-)

Regards,

George

Stephan Kuhagen

Dec 31, 2006, 8:38:27 AM
Uwe Klein wrote:

> (This is actually broken because the searchterm will not be found
> if it spans a block boundary, )

...


> fconfigure $f -encoding binary -translation binary

I found that setting encoding and translation to binary slows things
down. This really surprises me. What are your numbers with and without
binary?

And it is of course also broken in another way: it prints lines without
the pattern. For example, if the whole file contains 10 lines but is only
1k in size and only one line contains the pattern, your proc prints the
whole file...

But using [read] in a loop with smaller chunks, so they fit into memory,
is a good solution. I combined it with gets to get around the problem of
finding line endings myself, and got a real speedup compared to gets
alone. This is even faster than the pure read/regexp version, which was
the fastest until now. Really cool, thanks for the inspiration... ;-)
Here is the fastest version so far:
---
proc grep {} {
    set f [open bigfile r]
    while { [set chunk [read $f 65536]] != "" } {
        if {[string index $chunk end] != "\n" } {
            append chunk [gets $f]
        }
        puts -nonewline [join \
            [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
    }
}
---

This runs in 0.229s, compared to 0.672s using only gets, or 0.232s
reading the whole file and then using regexp. And of course this can run
even a little bit faster by tuning the chunk size. This one really beats
Perl... Fun. ;-)

Regards
Stephan


Donal K. Fellows

Dec 31, 2006, 9:22:38 AM
Stephan Kuhagen wrote:
> But [read] has the disadvantage that you can not read line by line, or do I
> miss something here?

You have to write your code differently so that you can process multiple
lines at once, and then process the file in chunks (e.g. 1MB at a time).
The code to do this is a bit trickier, but it can be a big win if the
search term is expected to be uncommon.

> And I just tried to set translation to binary, and it slowed things down. Am
> I doing this wrong or is there something strange happening?

You're using [gets] with a binary channel, which isn't recommended at all.

> Yes, that was of course the solution I finally came up with, but this
> doesn't work of course, when files get really big.

If the file is really big, you'll find that the disk's seek times
dominate, whatever language you use.

Donal.

Stephan Kuhagen

Dec 31, 2006, 9:30:06 AM
Hello

George Petasis wrote:

> In my view, it would be nice if something came out of all this,
> so is it a good idea to try to improve things?
>
> How about starting a discussion on how to improve reading a file line by
> line, which is a quite frequent action (at least for me)?

This is of course a good idea. For normal line-by-line file reading I
would simply prefer a faster [gets], because that would also improve old
code. But maybe this is hard to do, or even impossible.

> And finally, mimic completely foreach, by allowing searches on multiple
> files:
>
> chan foreach channelId-1 {var-list-1} {split-chars-1} \
> ?channelId-2 {var-list-2} {split-chars-2} ...? \
> {code}
>
> How about this idea?

The last one sounds a little bit too complex to me; I would be afraid
that such big monsters tend to be slow. The first, simpler one
nevertheless sounds reasonable.

> Another idea that came to me while reading this post, was to implement a
> special list type objects, that will be bounded to files. For example:
>
> chan split split-chars
>
> this will return a "special" list object, whose internal representation
> will be bound to the channel. The file can be searched (preferably in a
> "lazy" way, i.e. only when it is needed) and keep internally offsets of
> the split characters. This way the object can return specific elements
> by seeking the file. However, this is by far more complex than "chan
> foreach" and I doupt it will be more useful than "chan foreach".

That would be very handy, but it sounds hard to implement. Maybe
something similar could be had with much less effort: memory-mapped files
come to my mind. I don't know whether they are available on all platforms
Tcl runs on, but they would provide array-like access to the whole file,
with the real mapping done transparently by the OS. Access would normally
be very fast, since all the tricks the OS knows are automatically
available. You could then work with them like strings, or split a range
of the file into a list, and so on. This would be a very natural way to
handle large files fast and flexibly.

Regards
Stephan


Stephan Kuhagen

Dec 31, 2006, 9:35:49 AM
Donal K. Fellows wrote:

> Stephan Kuhagen wrote:
>> But [read] has the disadvantage that you can not read line by line, or do
>> I miss something here?
>
> You have to write your code differently so that you can process multiple
> lines at once, and then process the file in chunks (e.g. 1MB at a time).
> The code to do this is a bit trickier, but it can be a big win if the
> search term is expected to be uncommon.

Ta-da (not tricky, very simple actually, but I needed Uwe Klein to bring
me to the idea):

---
proc grep {} {
    set f [open bigfile r]
    while { [set chunk [read $f 65536]] != "" } {
        if {[string index $chunk end] != "\n" } {
            append chunk [gets $f]
        }
        puts -nonewline [join \
            [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
    }
}
---

I just wrote this in response to Uwe Klein, before your answer. And it is
indeed faster than even the fastest solution I had come up with so far.
The funny thing is the combination of [read] and [gets]: the first one
makes reading chunks fast, the second one makes finding the next EOL
fast...

Regards
Stephan


Stephan Kuhagen

Dec 31, 2006, 9:39:50 AM
Stephan Kuhagen wrote:
> ---
> proc grep {} {
>     set f [open bigfile r]
>     while { [set chunk [read $f 65536]] != "" } {
>         if {[string index $chunk end] != "\n" } {
>             append chunk [gets $f]
>         }
>         puts -nonewline [join \
>             [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
>     }
> }
> ---

Hm, I just noticed that this can miss some matches. Changing

append chunk [gets $f]

to

append chunk [gets $f] "\n"

fixes this, with no measurable effect on the runtime. The complete proc
is shown below.
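
For reference, the corrected proc in full (a sketch; a [close] is added
for tidiness):

---
proc grep {} {
    set f [open bigfile r]
    while { [set chunk [read $f 65536]] != "" } {
        if {[string index $chunk end] != "\n" } {
            append chunk [gets $f] "\n"
        }
        puts -nonewline [join \
            [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
    }
    close $f
}
---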

Regards
Stephan

Uwe Klein

Dec 31, 2006, 10:03:08 AM
Stephan Kuhagen wrote:
> Uwe Klein wrote:
>
>
>>(This is actually broken because the searchterm will not be found
>>if it spans a block boundary, )
>
> ...
>
>>fconfigure $f -encoding binary -translation binary
>
>
> I found that setting binary for encoding and translation slows things down.
> This really surprises me. What are your differences, when using binary and
> not?
gets + noencoding : no gain ~770ms
because you still have to parse the input for line separators.

read + noencoding : good gain ~220ms

uwe

Alexandre Ferrieux

Dec 31, 2006, 12:32:25 PM

George Petasis wrote:
>
> How about starting a discussion on how to improve reading a file line by
> line, which is a quite frequent action (at least for me)?

Good idea (though it has already started :-)

> How about creating a new subcommand (in 8.5 chan command?), that will
> perform an iteration over a channel content, similar to the way python
> does? For example, I can think of a command like:

Why create a new command instead of fixing [gets]?

Indeed, at the C level nothing prevents Tcl from being just as fast as
Perl and Python and stdio. It's just that we have to concentrate a bit
on the code and look at that strange bottleneck. Stay tuned!

-Alex

Alexandre Ferrieux

Dec 31, 2006, 12:34:49 PM

Uwe Klein wrote:
> gets + noencoding : no gain ~770ms
> because you still have to parse the input for line separators.

Not a sufficient reason. Perl and Python and stdio also look for the
separator.

-Alex

Donal K. Fellows

Dec 31, 2006, 5:41:38 PM
Stephan Kuhagen wrote:
> Memory mapped files come to my mind.

Note that they'll always have to be binary-mode only; you can't apply
encoding handling, end-of-line translation or end-of-file character
detection without scanning, which would completely negate all the
performance benefits.

> I don't know, if this is available on all platforms for Tcl,

Yes. It's part of how *everyone* implements dynamic executable loading
because it is the fastest way to do it. :-)

> You can then work with them like strings or split a range of the file into
> a list and so on. This would be a very natural way to handle large files
> fast and flexible.

I think it would be quite difficult to get it working right. Lots of
details to worry about (mainly so that you don't end up doing lots of
unnecessary copying...) so I think it's probably a feature not for even
the 8.6 release; we hope to get that out "to market" in 2007. But maybe
someone will contribute the code and prove me wrong. :-)

Donal.

Donal K. Fellows

Dec 31, 2006, 5:45:17 PM
Alexandre Ferrieux wrote:
> Not a sufficient reason. Perl and Python and stdio also look for the
> separator.

It's a known issue that [gets] isn't very efficient with binary files
(it's tuned for text files, which go through a somewhat different path
in the I/O guts). Since that's an uncommon combination (unlike with
[read]), nobody's really put that much effort into optimizing it. (For
"optimizing" read "preventing shimmering between bytearrays and unicode".)

Donal.

Uwe Klein

Dec 31, 2006, 5:22:34 PM
Wrong: Tcl circumvents the stdio package (as has been stated previously
in this thread).
[read] on Unix will probably use the read(int fd, void *buffer, int count)
call; [gets] will use Tcl channel buffering and interpretation.

I assume Python and Perl go with char *fgets(char *s, int size, FILE *stream),
a completely different scenario.

>
> -Alex
>
uwe

Ian Bell

Dec 31, 2006, 9:12:53 PM
Donal K. Fellows wrote:

> Ian Bell wrote:
>> I am no expert but presumably, not matter what the language, the gets or
>> read ends up as an fgets C library call.


> Certainly not! fgets() is a part of stdio and Tcl doesn't use that
> because it's buffering interacts badly with use of select() in Tcl's
> notifier. Instead, Tcl manages everything over the OS's syscall level
> (i.e. read() or recv() on Unix, depending on channel type) so that it
> can get the interaction between the various pieces right.
>

What happens with, say, Windows?

Ian

Stephan Kuhagen

Jan 1, 2007, 5:57:39 AM
Hello, happy new year to all.

Donal K. Fellows wrote:

> Note that they'll have to always be binary-mode only; can't apply
> encoding handling, end-of-line translation or end-of-file character
> detection without scanning which will completely negate all performance
> benefits.

I don't think so. Bare-bones bits without any translation are exactly
what is expected from memory-mapped files. So if you work with them in
that state, it's clear that there is no translation. If translation is
wanted, there could be a kind of transfer function that gives back pieces
of the file as strings, or scans for line endings and such. I would not
handle them as normal files, but as a new concept for handling file data.

For example, I think of something like this:

New command, e.g. [memfile] with following subcommands:

Basics:
- memfile open "name"
Opens an existing file memory mapped.
Returns a handle for subsequent access.
- memfile create "name" ?size?
Create a new memory mapped file, with size 0 or given size.
Returns a handle for subsequent access.
- memfile close "handle"
Closes a memory mapped file.

Simple access (no translation or anything)
- memfile index "handle" ?index?
Returns the raw and untranslated char at index.
- memfile set "handle" value index ?last?
Changes the char at index or the range until last to the new value.
- memfile range "handle" first last
Return a range of the memory map.

Higher level access for fast loops and such (no translation or anything)
- memfile split "handle" ?splitChars?
Return a list from the file's data, split at splitChars.
- memfile chunk "handle" varName ?splitChars?
Set varName to a chunk of memory-mapped file data, starting from the
current memory map pointer (see below) until the next char from splitChars
(included or not) or end-of-data, and set the pointer to the new position.
Return the number of chars read, where 0 indicates that we are at the
end of the file data.
With that, the following would be possible. I think it would be very
fast...:
---
set mm [memfile open "/tmp/bigdata"]
while {[memfile chunk $mm data "\n\x00\xFF"] != 0} {
    doSomething $data
}
memfile close $mm
---
- memfile map ?options? "handle" mapping
Like [string map]
- memfile find "handle" ?options? pattern ?startIndex?
Find one or all occurrences of a pattern/glob/regexp and return -1, the
index, or a list of indices. The search can also run backward, which
gives results in descending order and helps find the last match.
- memfile size "handle"
Return the size of the memory mapped file.

Complex file like access with translation and all the common features:
- memfile tell "handle"
Return the position of an internal memory map pointer.
The pointer is used for file-like access with translation.
- memfile seek "handle" offset ?origin?
Set the internal memory map pointer.
- memfile gets "handle"
Like normal [gets]
Return translated string starting at current memory map pointer position.
Increment memory map pointer.
- memfile string "handle" first last
Return a translated/encoded string from a range of the memory mapped file.
- memfile puts "handle" string index
Put the given string into the memory mapped file starting at index.
Truncate if the data does not fit, no automatic resizing
Return size of written data.
Increment memory map pointer.
- memfile resize "handle" newSize
Sets the size of the file, expand or truncate.

I'm sure I missed something. But the more I think about such a command,
the more I like it. The underlying concept seems simple to me, but what
could be done with it would be very fast and powerful for directly
manipulating data.

>> I don't know, if this is available on all platforms for Tcl,
>
> Yes. It's part of how *everyone* implements dynamic executable loading
> because it is the fastest way to do it. :-)

Okay, I had in mind some small devices, possibly without an MMU, where no
memory mapping can be done but to which someone has ported Tcl.

> I think it would be quite difficult to get it working right. Lots of
> details to worry about (mainly so that you don't end up doing lots of
> unnecessary copying...) so I think it's probably a feature not for even
> the 8.6 release; we hope to get that out "to market" in 2007. But maybe
> someone will contribute the code and prove me wrong. :-)

I admit that it would be a new feature that should be carefully
implemented, but I also think it would be very useful.

Regards
Stephan

Donal K. Fellows

Jan 1, 2007, 6:38:18 AM
Ian Bell wrote:
> What happens with,say, Windows?

The system call is called ReadChan(). :-)

Donal.

Ian Bell

Jan 1, 2007, 8:26:20 AM
Donal K. Fellows wrote:

So in both cases the C lib is circumvented, or rather completely ignored;
presumably fgets itself uses these system calls?

Ian

Alexandre Ferrieux

Jan 1, 2007, 9:49:10 AM

Uwe Klein wrote:
> > Not a sufficient reason. Perl and Python and stdio also look for the
> > separator.
> wrong, tcl circumvents the stdio package ( as has been stated in this thread
> previously).

??? afraid you misread me :^}

I know Tcl doesn't use stdio. I'm just saying that in order to read
line-by-line, *something* has to look for \n. For stdio-based schemes
it is done by fgets(); otherwise it's done by the obvious tight C loop.
In any case it cannot account for the bottleneck we are talking about.

> [read] on unix will probably use the read(int fd, void *buffer, int count) call
> [gets] will use tcl chan buffering and interpretation.
> i assume python and perl go with char *fgets(char *s, int size, FILE *stream)
> a completely different scenario.

Look at Jeff's answer: shimmering between binary and unicode is a
strong candidate.

-Alex

Donal K. Fellows

Jan 1, 2007, 10:33:23 AM
Ian Bell wrote:
> So in both cases the C lib is circumvented -or rather completely ignored -
> presumably fgets itself uses these system calls??

I'd assume so, somewhere in the depths. I'm not really all that
interested in the implementation of the C library. :-)

Donal.

Alexandre Ferrieux

Jan 1, 2007, 1:04:45 PM

I wrote:
> Look at Jeff's answer: shimmering between binary and unicode is a
> strong candidate.

Of course I meant "Look at Donal's answer". Sorry to both :-}

Ian Bell

Jan 1, 2007, 1:44:24 PM
Donal K. Fellows wrote:

Me neither, I was just curious and you seemed to know what you were talking
about ;-)

Ian

Alexandre Ferrieux

Jan 1, 2007, 5:19:27 PM

Donal K. Fellows wrote:

>
> It's a known issue that [gets] isn't very efficient with binary files
> (it's tuned for text files, which go through a somewhat different path
> in the I/O guts). Since that's an uncommon combination (unlike with
> [read]), nobody's really put that much effort into optimizing it. (For
> "optimizing" read "preventing shimmering between bytearrays and unicode".)

OK, thanks for the rationale.
Still, two questions:

1) Tcl went this way in order to polish its i18n side. OK. But apparently
neither Perl nor Python suffer from this... Is it because they neglect
i18n? Or simply that they have done a better job of optimizing this
specific case?

2) Would it be possible to configure a channel (or even an interpreter)
to depart from the "UTF/Unicode everywhere" approach and equate string
reps with byte arrays as much as possible? (Of course not in pure Tcl,
but a hint at the place to patch would be welcome...)

-Alex

Neil Madden

Jan 1, 2007, 11:04:27 PM
George Petasis wrote:
> In my view, it would be nice if something came out of all this,
> so is it a good idea to try to improve things?
>
> How about starting a discussion on how to improve reading a file line by
> line, which is a quite frequent action (at least for me)?
>
> How about creating a new subcommand (in 8.5 chan command?), that will
> perform an iteration over a channel content, similar to the way python
> does? For example, I can think of a command like:
>
> chan foreach channelId {var1 ?var2 ...?} {split_chars-like-split} code

See http://wiki.tcl.tk/17012 for a similar idea - although without the
split-chars (just blocksize or "line"). At present it only works for
async (fileevent-based) reading, but it should be fairly easy to do a
synchronous version. My motivation was to encapsulate the common pattern
of async channel handling, but it's a generally good abstraction.
Feedback very welcome -- I intend to submit this to tcllib at some point.

-- Neil

Paddy

Jan 2, 2007, 3:06:28 AM

Stephan Kuhagen wrote:
> After some Iterations, the Pythonians came up
> with this solution and faster 0.526s:
> ---
> import re
> r = re.compile(r'destroy', re.IGNORECASE)
> def stripit(x):
> return x.rstrip("\r\n")
> print "\n".join( map(stripit, filter(r.search, file('bigfile'))) )
> ---

Just a note: the above Python does *not* read all lines of the file into
memory at once. It iterates over each line in the file, applying the
regexp, and drops the lines that do not match.

- Paddy.

Stephan Kuhagen

Jan 2, 2007, 3:43:28 AM
Paddy wrote:

> Just a note: The above Python does *not* read all lines of the file
> into memory at once.
> it iterates over each line in the file applying the regexp and drops
> those lines that do not match.

I know; who said that it does? That is the whole point of my surprise:
Python and Perl are that fast *without* reading the file into memory at
once, reading instead line by line. [gets] is so slow that Tcl cannot
compete with this scheme, so the trick of first reading the whole file
into memory was necessary (or reading it in chunks with [read], as my
latest and currently fastest solution in this thread showed).

Regards
Stephan

Paddy

Jan 2, 2007, 3:54:17 AM

Stephan Kuhagen wrote:

Ahh, I thought others without much Python might have thought otherwise,
so I just added a clarification :-)

- Pad.

sleb...@gmail.com

Jan 2, 2007, 5:29:29 AM
Stephan Kuhagen wrote:
> Hello
>
> > Slightly off-topic. If you like that Python "trick" then you'd love
> > tcllib's fileutil::foreachLine. If you don't have tcllib then copy and
> > paste this to your file:
>
> Thanks. I know Tcllib of course, but I did not use it, because it makes
> things much more comfortable but much slower also. foreachLine is nice, but
> when using Tcllib anyway, I would use fileutil::grep, which is also much
> faster than foreachLine.

>From the lessons learned in this thread I believe both foreachLine and
grep can be made even faster than Perl without mucking about with the C
core. And the code is much clearer to boot!

Here's a bufferized version of "eachline" (Python sugaring included):

proc eachline {args} {
    set buffersize 524288

    # argument processing:
    set ops $args
    set args [list]
    while {[llength $ops]} {
        set op [lindex $ops 0]
        set ops [lrange $ops 1 end]
        switch -- $op {
            -buffer {
                set buffersize [lindex $ops 0]
                set ops [lrange $ops 1 end]
            }
            default {
                lappend args $op
            }
        }
    }

    # process whatever is left in args:
    set var [lindex $args 0]
    set args [lrange $args 1 end]
    upvar 1 $var v
    if {[llength $args] == 3} {
        if {[lindex $args 0] == "in"} {
            set args [lrange $args 1 end]
        } else {
            error {invalid syntax}
        }
    } elseif {[llength $args] > 3} {
        error {wrong # args}
    }

    set fname [lindex $args 0]
    set code [lindex $args 1]

    set f [open $fname r]
    while 1 {
        set chunk [read $f $buffersize]
        append chunk [gets $f]
        foreach v [split $chunk \n] {
            uplevel 1 $code
        }
        if {[eof $f]} break
    }
    close $f
}

The default buffer size is 512k but can be configured using the -buffer
option. Try this out on your data set and see how fast it is:

eachline x in bigfile {
    if {[string match -nocase "*destroy*" $x]} {
        puts $x
    }
}

see if a bigger buffer is faster:

# 4MB buffer:
eachline x in bigfile -buffer [expr 4*1024*1024] {
    if {[string match -nocase "*destroy*" $x]} {
        puts $x
    }
}

sleb...@gmail.com

Jan 2, 2007, 5:32:04 AM
> >From the lessons learned in this thread I believe both foreachLine and
> grep can be made even faster than Perl without mucking about with the C
> core. And the code is much clearer to boot!

#*&$^%! Google quoting bug!

That should have read:

sleb...@gmail.com

Jan 2, 2007, 5:33:53 AM

This is really strange. It appears that Google likes to quote lines
beginning with "From .."

Donal K. Fellows

Jan 2, 2007, 5:37:38 AM
sleb...@yahoo.com wrote:
> This is really strange. It appears that Google likes to quote lines
> beginning with "From .."

That's because USENET messages are really all stored in unix mailbox
format, where lines beginning with "From " (including the space) must
be quoted (they mark the beginning of a message otherwise). The problem
is that they're not hiding the fact that they're doing this, which is
what USENET (and email) clients are supposed to do.

Donal.

Stephan Kuhagen

Jan 2, 2007, 5:42:09 AM
Hi

sleb...@yahoo.com wrote:

>>From the lessons learned in this thread I believe both foreachLine and
> grep can be made even faster than Perl without mucking about with the C
> core. And the code is much clearer to boot!
>
> Here's a bufferized version of "eachline" (Python sugaring included):

Thanks. Did you see this one? I posted it on Sunday, 31 December 2006,
14:38:27; it uses the same trick as yours, reading in chunks and finding
the line endings with gets:

---
proc grep {} {
    set f [open bigfile r]
    while { [set chunk [read $f 65536]] != "" } {
        if {[string index $chunk end] != "\n" } {
            append chunk [gets $f]
        }
        puts -nonewline [join \
            [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
    }
}
---

It seems that searching for optimal solutions leads down very similar
paths...

Regards
Stephan

sleb...@gmail.com

Jan 2, 2007, 6:36:21 AM
Stephan Kuhagen wrote:
> Hi
>
> sleb...@yahoo.com wrote:
>
> >>From the lessons learned in this thread I believe both foreachLine and
> > grep can be made even faster than Perl without mucking about with the C
> > core. And the code is much clearer to boot!
> >
> > Here's a bufferized version of "eachline" (Python sugaring included):
>
> Thanks. Did you see this one, I posted Sunday 31 December 2006 14:38:27, it
> uses the same trick as you with reading in chunks and finding line endings
> with gets:

Yes, which is why I said "From the lessons learned in this thread".

> ---
> proc grep {} {
>     set f [open bigfile r]
>     while { [set chunk [read $f 65536]] != "" } {
>         if {[string index $chunk end] != "\n" } {
>             append chunk [gets $f]
>         }
>         puts -nonewline [join \
>             [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
>     }
> }
> ---

Actually, I copied it from there (if it were my own code I would probably
not have used the word "chunk"). I only made it reusable in a more
general form, and since the semantics are compatible with the tcllib
version, I was just giving an example of how this method might be merged
with tcllib's fileutil::foreachLine.

> It seems that searching for optimal solutions leads to very similar
> paths...

It's more like trying not to re-invent the wheel ;-) It's one of the
old Unix mantras: make it general, make it re-usable.

Stephan Kuhagen

Jan 2, 2007, 7:37:14 AM
sleb...@yahoo.com wrote:

> Yes, which is why I said "From the lessons learned in this thread".

Ah, sorry, I thought we had the same idea.

> It's more like trying not to re-invent the wheel ;-) It's one of the
> old Unix mantras: make it general, make it re-usable.

True, maybe it finds its way into next version of tcllib. My short version
was just meant to be another version of the short snippets in
Perl/Python/Tcl from the beginning of this thread.

Regards
Stephan

Michael Schlenker

Jan 2, 2007, 2:52:26 PM
Alexandre Ferrieux wrote:

> Donal K. Fellows wrote:
>> It's a known issue that [gets] isn't very efficient with binary files
>> (it's tuned for text files, which go through a somewhat different path
>> in the I/O guts). Since that's an uncommon combination (unlike with
>> [read]), nobody's really put that much effort into optimizing it. (For
>> "optimizing" read "preventing shimmering between bytearrays and unicode".)
>
> OK, thanks for the rationale.
> Still, two questions:
>
> 1) Tcl went this way in order to polish its i18n side. OK. But
> apparently neither Perl nor Python suffers from this... Is it because
> they neglect i18n? Or simply that they have done a better job of
> optimization in this specific case?
Python's unicode stuff sucks: they basically have two string types, and
you have to know exactly what you're doing not to shoot yourself in the
foot. Their optimization is probably simply that everything is a
bytearray, and they throw nasty exceptions if you mix strings, unicode
strings and binary data.

>
> 2) Would it be possible to configure a channel (or even interpreter)
> to depart from the "UTF/Unicode everywhere" approach and equate string
> reps with byte arrays as much as possible? (of course not in pure Tcl,
> but a hint at the place to patch would be welcome...)

The problem is probably shimmering. As long as you don't use a function
that forces the bytearray to shimmer, everything should be fine
(fcopying between two binary channels, for example).
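
For illustration, a minimal sketch of that fcopy case (the file names
are made up): with both channels configured binary, the data stays a
bytearray end to end and never shimmers to a unicode string.

set in  [open infile r]
set out [open outfile w]
fconfigure $in  -translation binary
fconfigure $out -translation binary
fcopy $in $out   ;# moves bytes without creating string reps
close $in
close $out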

If you want to rewrite many functions you could probably create a whole
set of special bytearray-handling functions, but I think that's a waste
of time.

Michael

Michael Schlenker

Jan 2, 2007, 2:54:22 PM
Stephan Kuhagen wrote:

Submit a feature request with the code attached and it will probably
find its way into tcllib. Module fileutil in the bugtracker...

Michael

alle...@mail.northgrum.com

Jan 2, 2007, 3:37:08 PM
Stephan Kuhagen wrote:
> Hello
>
> Currently there is a thread in c.l.python
> (http://groups.google.de/group/comp.lang.python/browse_thread/thread/923e34e8466ac920/233f1310151e19f6)
> about if it is possible for Python to beat Perl in a small text matching
> task. Have some patience, there comes a really fast Tcl solution at the
> end, but I like to describe the other Versions of Perl/Python first and
> then present Tcl and some questions about performance of some things in
> Tcl.
>
> The text to match was the case insensitive word "destroy" in a text from
> gutenberg.org (King James Bible). The text used for the test was generated
> this way:
>
> $ wget http://www.gutenberg.org/files/7999/7999-h.zip
> $ unzip 7999-h.zip
> $ cd 7999-h
> $ cat *.htm > bigfile
> $ du -h bigfile
> du -h bigfile
> 8.2M bigfile
>
> The code there for Perl was:
> ---
> open(F, 'bigfile') or die;
> while(<F>) {
> s/[\n\r]+$//;
> print "$_\n" if m/destroy/oi;
> }

I read most of this long thread, but didn't see any improvements to
this perl code posted. FWIW, with Perl 5.8.8 on AIX 5, I got almost a 2x
speedup by only doing the s/// when the line matches, like this:

open(F, 'bigfile') or die;
while(<F>) {
    do {
        s/[\n\r]+$//;
        print "$_\n";
    } if m/destroy/oi;
}

You can also use s///o like with the m//oi, to have the regex compiled
only (o)nce, but since only 465 lines out of 117000+ total matched
/destroy/, any improvement was negligible.

John.

Alexandre Ferrieux

Jan 2, 2007, 5:44:44 PM

Michael Schlenker wrote:

>
> Python's unicode stuff sucks: they basically have two string types, and
> you have to know exactly what you're doing not to shoot yourself in the
> foot. Their optimization is probably simply that everything is a
> bytearray, and they throw nasty exceptions if you mix strings, unicode
> strings and binary data.

OK. Tcl is so much better than this. Too bad its beauty costs a factor
of 2 in time for gets on ASCII lines...

IOW: does everybody in this community

- realize that the i18n fashion keeps Tcl away from Perl/Python's
performance
- agree with the design decision

-Alex

Alan Anderson

Jan 2, 2007, 7:01:11 PM
Ian Bell <ruffr...@yahoo.co.uk> wrote:

> Donal K. Fellows wrote:
>
> > ...I'm not really all that
> > interested in the implementation of the C library. :-)
>

> Me neither, I was just curious and you seemed to know what you were talking
> about ;-)

Here's a useful tip: if you want people to think you know what you're
talking about, be careful to talk about only what you know. :-) Many of
the most helpful people here on c.l.t follow this strategy. It has the
desirable effect of letting you know that when they talk, you can trust
them.

sleb...@gmail.com

Jan 2, 2007, 8:07:58 PM
Alexandre Ferrieux wrote:
> Michael Schlenker wrote:
> >
> > Python's unicode stuff sucks: they basically have two string types, and
> > you have to know exactly what you're doing not to shoot yourself in the
> > foot. Their optimization is probably simply that everything is a
> > bytearray, and they throw nasty exceptions if you mix strings, unicode
> > strings and binary data.
>
> OK. Tcl is so much better than this. Too bad its beauty costs a factor
> of 2 in time for gets on ASCII lines...

It's not beauty that matters. In fact, that "beauty" is invisible, as
it should be. It's doing things right as much as possible that
matters.

> IOW: does everybody in this community
>
> - realize that the i18n fashion keeps Tcl away from Perl/Python's
> performance
> - agree with the design decision

I agree with the design decision. From an engineer's perspective,
making my life easier is much more important than being "fast". IMHO,
when it comes to unicode, Tcl got it right. A string is a string no
matter what (human) language, charset, or encoding it's in, and I
don't want to have to care. My time is more expensive than CPU time
(and we're talking about a difference of milliseconds in 99% of
cases). There are still some corner cases where Tcl fails, but Tcl is
still better than anything else out there.

To paraphrase Dijkstra: "doing it fast is not the same as doing it
right".

Stephan Kuhagen

Jan 3, 2007, 12:12:57 AM
alle...@mail.northgrum.com wrote:

> I read most of this long thread, but didn't see any improvements to
> this perl code posted. FWIW, with perl588 on AIX 5, I got almost a 2x
> speed up by only doing the s/// when the line matches, like this:

I posted two solutions, that are faster than the Perl version (at least on
my computer)

Regards
Stephan

David N. Welton

Jan 3, 2007, 5:34:50 AM
Alexandre Ferrieux wrote:

> - realize that the i18n fashion keeps Tcl away from Perl/Python's
> performance
> - agree with the design decision

It also keeps you safe from some serious i18n-related problems that
simply do not occur with Tcl (or at least occur far less frequently
than in languages like Python, Ruby and PHP).

After having started using Ruby heavily, I think that this is actually
one of the aspects I miss most about Tcl.

--
David N. Welton
- http://www.dedasys.com/davidw/

Linux, Open Source Consulting
- http://www.dedasys.com/

Alexandre Ferrieux

Jan 3, 2007, 8:02:39 AM

sleb...@yahoo.com wrote:

> > IOW: does everybody in this community
> >
> > - realize that the i18n fashion keeps Tcl away from Perl/Python's
> > performance
> > - agree with the design decision
>
> I agree with the design decision. From an engineer's perspective making
> my life easier is much more important than being "fast".

Sure -- but it's not necessarily making everybody's life easier. I
mean, crunching international text is part of Tcl's everyday life, but
it is not the alpha and omega.
For example, when you're parsing logfiles or doing protocol analysis on
hex dumps, ASCII is all you care about, and being as fast as sed or
grep when doing the [gets] is highly desirable, since the rest of Tcl
gives a strong plus over sed or grep.

> IMHO, when it
> comes to unicode, Tcl got it right. A string is a string no matter what
> language(human), charset, or encoding it's in and I don't want to have
> to care.

The problem is that at the string rep level, Everything Is A String.
Hence the UTF conversion potentially hits parts of a Tcl program that
are very remote from i18n considerations.

> My time is more expensive than CPU time

Mine too. But I was wondering about how to have the best of both
worlds, namely by making UTF/Unicode conversion a tool (among dozens of
others) in Tcl, usable when really needed, instead of a pervasive
normalization scheme.

> (and we're talking
> about a difference of milliseconds in 99% of cases).

"milliseconds" is an absolute value, while here the truth is a ratio.
And it is close to 2 for [gets]. When crunching a 600-meg "tcpdump -x
-X" output, it doesn't mean "milliseconds"...

> To paraphrase Dijkstra: "doing it fast is not the same as doing it
> right".

Yeah, but doing it right doesn't necessarily hinder performance. It may
happen, but the conclusion must come after thorough analysis of all
paths.

-Alex

Ian Bell

Jan 3, 2007, 8:36:48 AM
Alan Anderson wrote:
>
> Here's a useful tip: if you want people to think you know what you're
> talking about, be careful to talk about only what you know. :-)

I think that qualifies for quote of the week.

Ian

Joe English

Jan 3, 2007, 10:48:48 AM

I absolutely agree with the design decision. Tcl is
more than fast enough for many string processing tasks.
A factor-of-two speedup on something that takes less
than a second saves less than half of a second, but
a clumsy API costs hours, days, or weeks of
developer time.


--Joe English

sleb...@gmail.com

Jan 3, 2007, 12:13:03 PM
Alexandre Ferrieux wrote:
> sleb...@yahoo.com wrote:
>
> > > IOW: does everybody in this community
> > >
> > > - realize that the i18n fashion keeps Tcl away from Perl/Python's
> > > performance
> > > - agree with the design decision
> >
> > I agree with the design decision. From an engineer's perspective making
> > my life easier is much more important than being "fast".
>
> Sure -- but it's not necessarily making everybody's life easier. I
> mean, crunching international text is part of Tcl's everyday life, it
> is not the alpha and omega.

From experience, the few cases where Tcl makes life slightly hard are
when people insist on making it hard on themselves, expecting
programming languages to require work-arounds to support unicode. Not
requiring a work-around, being able to just "type it" directly into
the script, feels weird to those people. The one case where it affects
me is posting code in the wiki, because even if Tcl is unicode native,
the web isn't as unicode native as Tcl, hence code posted in the wiki
must often be plain ASCII. Of course I haven't quite found a decent
unicode-native code editor yet, since most editors are written with
the assumption of editing plain ASCII, but that is code editors making
my life hard, not Tcl.

> For example, when you're parsing logfiles or doing protocol analysis on
> hex dumps,

Neither of which is normally a time-sensitive operation. Optimising
for speed in these cases doesn't make much sense.

> ASCII is all you care about,

If it's a hex dump I wouldn't even do ASCII, I'd simply read in binary.
And Tcl's support for binary streams is very fast indeed since I quite
often saturate my I/O channels (except Gigabit ethernet, haven't
managed to saturate that yet).

> and being as fast as sed or
> grep when doing the [gets] is highly desirable, since the rest of Tcl
> gives a strong plus over sed or grep.

And I've posted a control structure for reading files line by line,
written in pure Tcl, which has been benchmarked to be faster than
straightforward (unoptimised) Perl code. As a bonus, the control
structure is both very readable and makes the intent clear. If it gets
merged into tcllib then it may turn out to be the de-facto standard way
of parsing files, since it is very fast and very readable. Note that in
all this we still maintain full i18n support; we haven't turned it off.
We've just implemented a different algorithm for reading the file (I
wonder how much Python and Perl would benefit from the same algorithm).
So we can have both speed and unicode.

> > IMHO, when it
> > comes to unicode, Tcl got it right. A string is a string no matter what
> > language(human), charset, or encoding it's in and I don't want to have
> > to care.
>
> The problem is that at the string rep level, Everything Is A String.
> Hence the UTF conversion potentially hits parts of a Tcl program that
> are very remote from i18n considerations.
>
> > My time is more expensive than CPU time
>
> Mine too. But I was wondering about how to have the best of both
> worlds, namely by making UTF/Unicode conversion a tool (among dozens of
> others) in Tcl,

Conversion tools are crutches to work around an environment that isn't
unicode native. Conversion tools are necessary in such an environment,
but Tcl shouldn't be limited to those environments only. I can't tell
you how nice it is to be able to plug Chinese text directly into my
code without having to wrangle with conversion tools.

> > (and we're talking
> > about a difference of milliseconds in 99% of cases).
>
> "milliseconds" is an absolute value, while here the truth is a ratio.

The difference is milliseconds in 99% of cases. That's my point.
Sacrificing something good for a use case that affects 1% of all cases
is bad.

> And it is close to 2 for [gets]. When crunching a 600-meg "tcpdump -x
> -X" output, it doesn't mean "milliseconds"...

"tcpdump -x -X" doesn't generate ASCII text. It's binary. Use
[fconfigure -translation binary]. As I mentioned earlier, experience
shows that Tcl's binary string handling is quite fast.

> > To paraphrase Dijkstra: "doing it fast is not the same as doing it
> > right".
>
> Yeah, but doing it right doesn't necessarily hinder performance. It may
> happen, but the conclusion must come after thorough analysis of all
> paths.

The point is that concentrating on performance instead of doing things
right is wrong. Do it right first, then do it fast. If "the right
thing" hinders performance then optimise it. Don't give up on it for
the sake of performance alone. Another mantra, this time from
old-school programming: "The first rule of optimisation is: don't. The
second rule is: don't (yet)".

Stephan Kuhagen

Jan 3, 2007, 3:19:38 PM
sleb...@yahoo.com wrote:

> Alexandre Ferrieux wrote:
...


>> For example, when you're parsing logfiles or doing protocol analysis on
>> hex dumps,
>
> Neither of which are normally time sensitive operations. Optimising for
> speed in these cases doesn't make much sense.

Well, sometimes it does. I agree with Alexandre that these tasks can be
important. I have no statistics to prove that this matters most of the
time, but you cannot ignore that it is often an argument when looking
at a scripting language and its alternatives: how well does it do the
job, and how fast? You are right, this does not count everywhere, and
sometimes it should not count, but it does anyway. And saying that the
people who decide for speed over elegance have no clue what they are
talking about doesn't help, because they decide that anyway.

I have written a parser for compiler output in a big build system that
runs as a multi-platform continuous build and generates HTML reports
from over 400 software projects every 10-15 minutes. The roughly 50
developers who check in their code and look at the report pages to see
whether their code compiles/runs on all platforms have a turnaround
time of at most 15 minutes until they can see the results. One major
part of that time is the parsing of the compiler output (which also
generates links to the repository, who checked in the last changes,
what the failed dependencies are, and so on); it takes about a third of
the time (compiling itself is very fast, since we use distributed
compilers and compiler caches). With a 15-minute round time, this gives
about 50 rounds on a normal work day (not all programmers start work at
the same time). If the compiler-output parser were half as fast as it
is (even though parsing *one* output is done in a few milliseconds),
this would reduce the available rounds of compiling/testing enormously,
and with that, productivity. Now say speed doesn't matter... - The
funny thing is, though, I wrote that parser twice, once in Tcl, once in
Python, both highly optimized. Tcl is 5-10 times faster than Python
(because of its faster regexp, not because of its string reading...).
But in the future, when someone else re-implements parts of that build
system, they will use the Python version, since people *think* that
Python is better/faster. *argh!*

> And I've posted a control structure for reading files line by line
> written in pure Tcl, which has been benchmarked to be faster than
> straightforward (unoptimised) Perl code. As a bonus, the control
> structure is both very readable and makes the intent clear. If it gets
> merged into tcllib then it may turn out to be the de-facto standard way
> of parsing files since it is very fast and very readable. Note that in
> all this we still maintain full i18n support, we haven't turned it off.
> We've just implemented a different algorithm of reading the file
> (wonder how much Python and Perl benefit from the same algorithm). So
> we can have both speed and unicode.

Yes, that mixing of [read] and [gets] is a nice trick, but maybe there
are some situations where it can be even faster than that, where one
decides not to need any i18n services or anything else, but instead
just bare bits as fast as possible. This doesn't mean that [gets] has
to be reimplemented or changed. Maybe there is room for another
mechanism, which allows you to get things as fast as possible without
any high-level string services, but with the most basic (and most
common) features, such as splitting at record boundaries (which is \n
most of the time).
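
To make that concrete, here is a hypothetical sketch of such a
record-oriented reader in pure Tcl (the proc name and the
single-character separator restriction of [split] are my own choices,
not a proposal for the core):

# Read fixed chunks from a binary channel and split on a caller-chosen
# separator character, carrying the trailing partial record over into
# the next chunk. Usage: foreachrecord rec bigfile \n 65536 {puts $rec}
proc foreachrecord {varname fname sep chunksize body} {
    upvar 1 $varname rec
    set f [open $fname r]
    fconfigure $f -translation binary
    set rest ""
    while {![eof $f]} {
        append rest [read $f $chunksize]
        set records [split $rest $sep]
        set rest [lindex $records end]
        foreach rec [lrange $records 0 end-1] {
            uplevel 1 $body
        }
    }
    if {$rest ne ""} {
        set rec $rest
        uplevel 1 $body
    }
    close $f
}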

>> > (and we're talking
>> > about a difference of milliseconds in 99% of cases).
>>
>> "milliseconds" is an absolute value, while here the truth is a ratio.
>
> The ratio is milliseconds in 99% of cases. That's my point. Sacrificing
> something good for a use case that affects 1% of all cases is bad.

That is only a very special point... If you say "You can't gain more
than 3 milliseconds in 99% of the time", that sounds very unimportant.
But when you add "This means a runtime of 7 instead of 10 milliseconds,
and the script runs 20,000 times a day", that really changes the
meaning of the 99%...

Regards
Stephan

alle...@mail.northgrum.com

Jan 3, 2007, 4:32:32 PM

Faster than my modified perl? Could you please post them again so I
don't have to wade through dozens of posts? Thanks.

John.

Alan Anderson

Jan 3, 2007, 8:06:05 PM
Stephan Kuhagen <s...@nospam.tld> wrote:

> If you say "You can't gain more than 3
> milliseconds in 99% of the time", then this sounds very unimportant. But
> when you add "This means runtime is 7 instead of 10 milliseconds, and the
> script runs 20,000 times a day" then this really changes the meaning of the
> 99%...

I have a hard time imagining that one minute of extra runtime per day is
a real issue.

sleb...@gmail.com

Jan 3, 2007, 9:16:36 PM
Stephan Kuhagen wrote:

> sleb...@yahoo.com wrote:
> > And I've posted a control structure for reading files line by line
> > written in pure Tcl, which has been benchmarked to be faster than
> > straightforward (unoptimised) Perl code. As a bonus, the control
>
> <snip>

>
> Yes, that mixing of [read] and [gets] is a nice trick, but maybe there are
> some situations, where it can be even faster than that and where one
> decides not to need any i18n-services or anything else, but instead just
> bare bits as fast as possible. This doesn't mean, that [gets] has to be
> reimplemented, changes or something.

Hmm.. This is Tcl remember. Everything (almost) can be re-implemented
;-) Let's see if we can apply that trick more transparently than using
the [eachline] function. Let's see if we can accelerate [gets] itself.

A word of warning though: when writing this beast I found I had
to also modify things like [close] and [eof] because of how things are
interrelated. And there are lots of other cases I haven't handled. For
example, using my modified [gets] with an unmodified [read] results in
data loss since the data is now in a buffer and is no longer on the
channel proper. But anyway, this is just a proof of concept for how we
can improve the speed of [gets].

I'd be interested to see how much faster this version of [gets] is when
benchmarked on your machine. The [info exists] might slow it down a
little but I'm hoping the [read] trick will more than compensate for
that. We can avoid [info exists] by re-defining [open] but that's a
little too much work for me.

# Accelerated gets. Warning, untested code.
# Here be dragons!
array set __getbuffer {}
rename gets __gets
proc gets {channelId {varName {}}} {
    global __getbuffer

    if {[info exists __getbuffer($channelId)] == 0 ||
        [llength $__getbuffer($channelId)] <= 0} {
        set tmp [read $channelId 524288]
        append tmp [__gets $channelId]

        set __getbuffer($channelId) [split $tmp \n]
    }

    set ret [lindex $__getbuffer($channelId) 0]
    set __getbuffer($channelId) [lrange $__getbuffer($channelId) 1 end]

    if {$varName != ""} {
        upvar 1 $varName v
        set v $ret

        if {[__eof $channelId]} {
            set ret -1
        } elseif {[fblocked $channelId] &&
                  [llength $__getbuffer($channelId)] <= 0} {
            set ret -1
        } else {
            set ret [string length $v]
        }
    }

    return $ret
}
rename close __close
proc close {channelId} {
    global __getbuffer
    __close $channelId
    unset __getbuffer($channelId)
}
rename eof __eof
proc eof {channelId} {
    global __getbuffer

    if {[info exists __getbuffer($channelId)] &&
        [llength $__getbuffer($channelId)] > 0} {
        return 0
    }

    return [__eof $channelId]
}
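
For completeness, a usage sketch (mine, and untested like the code
above): source the overrides first, and the ordinary line-by-line
idiom from the start of the thread picks up the buffered [gets]
transparently.

# The standard loop, now running on the redefined [gets].
set f [open bigfile r]
while {[gets $f line] >= 0} {
    if {[string match -nocase "*destroy*" $line]} {
        puts $line
    }
}
close $f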

Stephan Kuhagen

Jan 4, 2007, 12:26:10 AM
alle...@mail.northgrum.com wrote:

> Faster than my modified perl? Could you please post them again so I
> don't have to wade through dozens of posts? Thanks.

Oh, sorry, you are right. I did not see that this was a modified
version of the Perl code (Perl code always looks the same to me...
;-). This is indeed about one third faster than the fastest Tcl
version I posted. So this is one more argument for some IO speed
improvements in Tcl, I think.

Regards
Stephan

Stephan Kuhagen

Jan 4, 2007, 12:31:09 AM
Alan Anderson wrote:

> I have a hard time imagining that one minute of extra runtime per day is
> a real issue.

I made up the '3' and '10' milliseconds, but if you read my real-world
example about the compiler output parser in the same posting above, you
can see that such a slowdown really makes a difference.

Regards
Stephan

Stephan Kuhagen

Jan 4, 2007, 12:55:56 AM
sleb...@yahoo.com wrote:

> Hmm.. This is Tcl remember. Everything (almost) can be re-implemented
> ;-) Let's see if we can apply that trick more transparently than using
> the [eachline] function. Let's see if we can accelerate [gets] itself.

Right, we can of course redefine [gets]. My statement was not to say
that we cannot, but that there may be room for a new, different IO
thingy in Tcl, which does not have any special services included,
except for record separation, like matching of '\n'.

> I'd be interested to see how much faster this version of [gets] is
> benchmarked on your machine. The [info exists] might slow it down a
> little but I'm hoping the [read] trick will more than compensate for
> that. We can avoid [info exists] by re-defining [open] but that's a
> little too much work for me.

Hm... From what you write, I understand that I can't use the last
fastest version I posted with mixed [read] and [gets], so I had to use
this version:

---
proc grep {} {
    set f [open bigfile r]

    fconfigure $f -encoding binary -translation binary
    while { [gets $f line] >= 0} {
        if {[string match -nocase "*destroy*" $line]} {
            puts $line
        }
    }
}

grep
---

With an unmodified version of [gets] this runs in 0.740s. I don't know
what went wrong, but your version runs in 6.959s, which is nearly *10*
*times* *slower* than the unmodified simple [gets] version. The fastest
solution I posted runs in 0.208s, which brings another factor-of-three
improvement:

---
proc grep {} {
    set f [open bigfile r]
    while { [set chunk [read $f 65536]] != "" } {
        if {[string index $chunk end] != "\n" } {
            append chunk [gets $f]
        }
        puts -nonewline [join \
            [regexp -all -inline -linestop -nocase {.*destroy.*\n} $chunk] {}]
    }
}

grep
---

And I think it does not look that much more complicated...
Additionally, using your [gets] changes the output (the last six lines
are missing, line endings changed)...

Never mind. I think the point is not to change [gets], except maybe for
some normal optimizations, if possible without any change in semantics.

Regards
Stephan

Alexandre Ferrieux

Jan 4, 2007, 6:27:34 AM

sleb...@yahoo.com wrote:
> Let's see if we can apply that trick more transparently than using
> the [eachline] function. Let's see if we can accelerate [gets] itself.

Interesting. But please notice that to fully emulate [gets], we also
need to properly handle cases where lines come one by one over a socket
or pipe. In this case,

[read $ch $somefixedlength]

is a show-stopper. I'm not saying it is impossible based on your code.
A few [fblocked] checks should do the job, but it's getting tricky
;-)
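
For illustration, the usual event-driven idiom for that case (a
sketch; host, port and handler name are made up): [fblocked] is what
tells an incomplete line apart from end-of-file.

proc onReadable {chan} {
    if {[gets $chan line] >= 0} {
        puts "got: $line"
    } elseif {[eof $chan]} {
        close $chan
        set ::done 1
    }
    # otherwise the line is incomplete ([fblocked] returns 1);
    # just wait for the next readable event
}
set sock [socket localhost 12345]
fconfigure $sock -blocking 0
fileevent $sock readable [list onReadable $sock]
vwait ::done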

-Alex

Alexandre Ferrieux

Jan 4, 2007, 6:37:18 AM

Joe English wrote:
>
> I absolutely agree with the design decision. Tcl is
> more than fast enough for many string processing tasks.

OK. I'll assume you're voicing the position of the whole TCT -- you win
:-)
Sorry for realizing so late what is obvious to you all. I've been away
from the core long enough to miss that move...

> A factor-of-two speedup on something that takes less
> than a second saves less than half of a second, but

The factor-of-two also applies to something that takes a day...

Anyway, I stand corrected: good ol' ASCII is dead, and 99% of Tcl
activity handles localized text; I'll face that truth now.

-Alex

Uwe Klein

Jan 4, 2007, 7:32:35 AM

This is a virginity issue!

uwe

Stephan Kuhagen

Jan 4, 2007, 8:41:01 AM
Uwe Klein wrote:

>> I have a hard time imagining that one minute of extra runtime per day is
>> a real issue.
>
> This is a virginity issue!

Hey, no jokes! Let this run several centuries and you need a Tcl-ian
calendar reform.

Regards
Stephan

alle...@mail.northgrum.com

Jan 4, 2007, 10:17:06 AM

Thanks for checking. My confidence in Perl has been restored :-)

John.

Darren New

Jan 4, 2007, 12:04:10 PM
Alexandre Ferrieux wrote:
> Interesting. But please notice that to fully emulate [gets], we also
> need to handle properly cases where lines come one by one over a socket
> or pipe.

Also, you need to handle the different line endings and EOF characters
if you really want to handle it properly. It really isn't difficult to
parse lines out of a [read], which one needs to do if one is getting
mixed text and binary as well.
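
For instance, a small sketch of that line-splitting step (my own
illustration, tolerating LF, CRLF and lone-CR endings):

# Normalize CRLF and bare CR to LF, then split the buffer into lines.
proc splitAnyEol {data} {
    regsub -all {\r\n?} $data "\n" data
    return [split $data "\n"]
}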

--
Darren New / San Diego, CA, USA (PST)
Scruffitarianism - Where T-shirt, jeans,
and a three-day beard are "Sunday Best."

sleb...@gmail.com

Jan 4, 2007, 10:41:58 PM
Stephan Kuhagen wrote:
> sleb...@yahoo.com wrote:
> > I'd be interested to see how much faster this version of [gets] is when
> > benchmarked on your machine. The [info exists] might slow it down a
> > little but I'm hoping the [read] trick will more than compensate for
> > that. We can avoid [info exists] by re-defining [open] but that's a
> > little too much work for me.
>
> <snip>

>
> With an unmodified Version of [gets] this runs 0.740s. I don't know, what
> went wrong, but your versions runs 6.959s, which is nearly *10* *times*
> *slower*, than the unmodified simple [gets] version.

Yikes! It looks like if we ever decide to implement input buffering for
[gets] it would have to be implemented in C.

Donal K. Fellows

Jan 5, 2007, 7:53:25 AM
sleb...@yahoo.com wrote:
> Yikes! It looks like if we ever decide to implement input buffering for
> [gets] it would have to be implemented in C.

Tcl's I/O layer does input buffering for [gets] (and [read] too) for
you anyway. See [fconfigure]'s -buffersize option.
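
For example, one might enlarge the channel's buffer before a
read-heavy loop (the value is arbitrary; the default is 4096 bytes):

set f [open bigfile r]
fconfigure $f -buffersize 65536   ;# bigger channel-internal buffer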

Donal.

Alexandre Ferrieux

Jan 5, 2007, 10:49:59 AM

sleb...@yahoo.com wrote:

>
> Yikes! It looks like if we ever decide to implement input buffering for
> [gets] it would have to be implemented in C.

[gets] already does input buffering. The problem is in the
post-processing of the line once it has been read. This post-processing
is due to the i18n handling discussed before.
Since you lectured me pretty dryly saying Tcl shouldn't be improved on
that line, I wonder why you're suddenly worrying about improving
[gets]...

-Alex


sleb...@gmail.com

Jan 5, 2007, 3:51:36 PM
Alexandre Ferrieux wrote:
> sleb...@yahoo.com wrote:
> >
> > Yikes! It looks like if we ever decide to implement input buffering for
> > [gets] it would have to be implemented in C.
>
> [gets] already does input buffering. The problem is in the
> post-processing of the line once it has been read. This post-processing
> is due to the i18n handling discussed before.

Hmm.. that's strange. Why is the [read] + [gets] trick faster? Is
[split] actually faster than scanning for newlines in C? If so what's
the difference? The string is still i18n normalised in both cases.

> Since you lectured me pretty dryly saying Tcl shouldn't be improved on
> that line, I wonder why you're suddenly worrying about improving
> [gets]...

That's the whole point of this thread. We now have a workaround in pure
Tcl using [read] and [gets] then [split]ting the result which performs
faster than [gets] alone. At first sight all the trick does is to
buffer the read to take advantage of faster overall transfer for large
I/O operations. Is something else going on here?

Alexandre Ferrieux

Jan 5, 2007, 5:39:18 PM

sleb...@yahoo.com wrote:

> Alexandre Ferrieux wrote:
> > sleb...@yahoo.com wrote:
> > >
> > > Yikes! It looks like if we ever decide to implement input buffering for
> > > [gets] it would have to be implemented in C.
> >
> > [gets] already does input buffering. The problem is in the
> > post-processing of the line once it has been read. This post-processing
> > is due to the i18n handling discussed before.
>
> Hmm.. that's strange. Why is the [read] + [gets] trick faster? Is
> [split] actually faster than scanning for newlines in C? If so what's
> the difference? The string is still i18n normalised in both cases.

Well, I shouldn't be telling you that since you wrote the code, but
look again:
In this code, [read]+[gets] is actually 0.99*[read]+0.01*[gets]. So
whatever cripples [gets] is negligible, since it's here only to tidy
the corners. And only this tiny [gets] gets the unicode/byte-array
shimmering.

> We now have a workaround in pure
> Tcl using [read] and [gets] then [split]ting the result which performs
> faster than [gets] alone.

But it cannot work in 'synchronous' mode (one line at a time):

fconfigure stdout -buffering line
while {[gets stdin line] >= 0} {
    if {[some_predicate]} {puts [some_function $line]}
}

> At first sight all the trick does is to
> buffer the read to take advantage of faster overall transfer for large
> I/O operations. Is something else going on here?

Yes. You're doing *double* input buffering, since [read] is already
buffered internally (analogous to stdio's fread()).
But that's not a problem. It's still an interesting workaround in the
meantime, until a true fix of [gets], but, I hope, not the recommended
idiom in Tcl in 2017...

-Alex

Donal K. Fellows

Jan 5, 2007, 7:06:49 PM
sleb...@yahoo.com wrote:
> Hmm.. that's strange. Why is the [read] + [gets] trick much faster? Is
> [split] actually faster than scanning for newlines in C?

Because [gets] forces the string to have a utf-8 representation
immediately, and [read] is happy to work (efficiently) with byte-arrays.

> If so what's
> the difference? The string is still i18n normalised in both cases.

Hah! Wrong. :-) What's actually going on in the guts of Tcl's I/O layer
is more subtle than that. (Unfortunately, [gets] is not currently tuned
in any way for binary input - it's not been traditionally viewed as an
important use case - so it handles the binary case in a comparatively
slow string-based manner. A patch that makes the case better would be
welcomed, but when I tried to hack one in earlier this week, I just made
the test suite hang. Obviously more thought and time is needed than the
small amount I had available...)

Donal.
