
Speeding up recursive directory traversal


ewils...@gmail.com

unread,
Dec 8, 2007, 11:22:18 PM12/8/07
to
We found that my Tcl script was significantly slower than equivalent
PHP and Python scripts, and traced the problem to the part that handles
recursive directory traversal. I wrote test scripts; here are
sample results indicative of the tests:

[edwil@fedora devel]$ time php rec.php /usr/share

PHP SUMMARY
dirs 7367
files 95280

real 0m6.836s
user 0m3.801s
sys 0m2.750s
[edwil@fedora devel]$ time tclsh rec.tcl /usr/share

Tcl SUMMARY
dirs 7367
files 95280

real 0m18.240s
user 0m10.705s
sys 0m6.760s

The relevant part of the Tcl code is:
------------------------------------------------------------
proc process {e} {
    global nDir nFile

    if {[file isfile $e]} {
        incr nFile
        return
    }
    if {![file isdirectory $e] || ![file readable $e]} {return}
    incr nDir
    catch {
        set entries [glob -directory $e * \.*]
        foreach de $entries {
            set tail [file tail $de]
            if {[string equal $tail .] || [string equal $tail ..]} {
                continue
            }
            process $de
        }
    }
}
------------------------------------------------------------
I suspect that the problem is in the way I use [glob]. Any ideas?

(I'm not worried about matching "non-standard" filenames.)

Regards
E Wilson

Michael Schlenker

unread,
Dec 9, 2007, 8:13:58 AM12/9/07
to
ewils...@gmail.com schrieb:

> We found that my Tcl script was significantly slower than equivalent
> PHP and Python scripts, and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts; here are
> sample results indicative of the tests:
>
> [edwil@fedora devel]$ time php rec.php /usr/share
>
> PHP SUMMARY
> dirs 7367
> files 95280
>
> real 0m6.836s
> user 0m3.801s
> sys 0m2.750s
> [edwil@fedora devel]$ time tclsh rec.tcl /usr/share
>
> Tcl SUMMARY
> dirs 7367
> files 95280
>
> real 0m18.240s
> user 0m10.705s
> sys 0m6.760s

It could be that Tcl does a lot more under the hood to cater for its vfs
layer, which has some overhead.

Try to take a look at an strace output when running both scripts,
probably Tcl does a lot more syscalls. There probably is no easy fix.
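For example, a side-by-side syscall summary (illustrative invocation; the
-c flag is per the strace(1) man page, and the script paths are the ones
from your post):

```shell
# Summarize syscall counts and per-syscall time for each interpreter;
# redirect stdout so only the strace summary remains visible.
strace -c tclsh rec.tcl /usr/share > /dev/null
strace -c php rec.php /usr/share > /dev/null
```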

Michael

Wojciech Kocjan

unread,
Dec 9, 2007, 11:16:07 AM12/9/07
to ewils...@gmail.com
Dnia 09-12-2007 o 05:22:18 <ewils...@gmail.com> napisał(a):

> We found that my Tcl script was significantly slower than equivalent
> PHP and Python scripts, and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts; here are
> sample results indicative of the tests:
>
> [edwil@fedora devel]$ time php rec.php /usr/share
>
> PHP SUMMARY
> dirs 7367
> files 95280
>
> real 0m6.836s
> user 0m3.801s
> sys 0m2.750s
> [edwil@fedora devel]$ time tclsh rec.tcl /usr/share

Is it possible for you to switch your code to do a file stat in the
beginning instead of a lot of isfile/isdirectory calls?

What are the results if you also skip [file readable]? I expect all of
these calls to do a stat().

Also, stracing the code should show where the main overhead is.

I think you should be able to get it down to 12s, but that's still slower
than PHP - which could be the overhead related to the VFS code in Tcl.
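A minimal sketch of the single-stat idea (untested, 8.4-compatible Tcl;
the catch is an assumption to cover unreadable or vanished entries):

```tcl
# Sketch: one stat() per entry via [file stat], replacing the
# separate isfile/isdirectory/readable checks.
proc process {e} {
    global nDir nFile
    if {[catch {file stat $e st}]} { return }   ;# unreadable or gone
    if {[string equal $st(type) file]} {
        incr nFile
        return
    }
    if {![string equal $st(type) directory]} { return }
    incr nDir
    foreach de [glob -nocomplain -directory $e * .*] {
        set tail [file tail $de]
        if {![string equal $tail .] && ![string equal $tail ..]} {
            process $de
        }
    }
}
```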

--
Wojciech Kocjan

Alexandre Ferrieux

unread,
Dec 9, 2007, 7:19:10 PM12/9/07
to
On Dec 9, 5:22 am, ewilsonm...@gmail.com wrote:
> We found that my Tcl script was significantly slower than equivalent
> PHP and Python scripts, and traced the problem to the part that handles
> recursive directory traversal. I wrote test scripts; here are
> sample results indicative of the tests:

It is possible to reduce all filesystem activity to a single [glob]
call per directory/file.
The key is that [glob -nocomplain -dir $e * .*] returns:
- an empty list for non-dirs
- a list with at least . and .. for dirs
thus avoiding [file isdirectory].

With this I gain a 2.6x speedup over your code, which is close to the
2.66 you get with PHP:

proc process2 e {
    global nDir nFile

    set l [glob -nocomplain -directory $e * .*]
    if {![llength $l]} {incr nFile; return}
    incr nDir
    foreach de $l {
        set tail [file tail $de]
        if {[string equal $tail .] || [string equal $tail ..]} continue
        process2 $de
    }
}

Granted, I omitted the rights-checking part. If you really want them,
you can [catch]. This may have a cost which I didn't bother to
measure ;-)

-Alex

ewils...@gmail.com

unread,
Dec 10, 2007, 1:33:41 PM12/10/07
to
Thank you for your suggestions. I reimplemented the Tcl script using
a single file stat but found no improvement. So I removed the stat
call altogether by catching the [glob] ... still no discernible
improvement.

Here is the Tcl script:
-------------------------------------------------------------


proc process {e} {
    global nDir nFile

    file stat $e stat
    set type $stat(type)
    if {$type eq "file"} {
        incr nFile
        return
    }
    if {$type ne "directory"} {return}

    catch {
        set entries [glob -directory $e * .*]
        incr nDir
        foreach de $entries {
            set tail [file tail $de]
            if {[string equal $tail .] || [string equal $tail ..]} {
                continue
            }
            process $de
        }
    }
}

proc main {} {
    global argc argv argv0 nFile nDir

    set nFile 0
    set nDir 0

    if { $argc < 1 } { puts "$argv0: usage tclsh $argv0 files|dirs"; exit 1 }

    foreach a $argv { process $a }
    puts ""
    puts "Tcl SUMMARY"
    puts "dirs $nDir"
    puts "files $nFile"
}
main
-------------------------------------------------------------
And my simple php script
-------------------------------------------------------------
<?php
function process($d)
{
    global $nDir, $nFile;

    if (is_file($d)) ++$nFile;
    if (!is_dir($d)) return;
    try {
        $dir = new DirectoryIterator($d);
        ++$nDir;
        foreach ($dir as $file) {
            if ($file->isDot()) continue;
            process($file->getPathname());
        }
    }
    catch (Exception $ex) { echo "Exception with $d\n"; }
}

$nDir = $nFile = 0;

$n = count($argv);

for ($i = 1; $i < $n; ++$i) {
    $e = $argv[$i];
    process($e);
}
echo "\nPHP SUMMARY\ndirs $nDir\nfiles $nFile\n";
-------------------------------------------------------------

The straces are interesting in how they differ over two runs: the first
run where the directory tree is 2 deep, and the second where we use
/usr/share. Could this be related to setting up the interpreters?

[ed@fedora devel]$ strace -c tclsh rec.tcl . > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
83.25 0.000999 17 58 1 lstat64
9.67 0.000116 7 16 read
7.08 0.000085 5 18 mmap2
0.00 0.000000 0 5 write
0.00 0.000000 0 14 open
0.00 0.000000 0 15 close
0.00 0.000000 0 1 execve
0.00 0.000000 0 8 5 access
0.00 0.000000 0 4 brk
0.00 0.000000 0 9 4 ioctl
0.00 0.000000 0 1 readlink
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 clone
0.00 0.000000 0 2 uname
0.00 0.000000 0 6 mprotect
0.00 0.000000 0 6 4 _llseek
0.00 0.000000 0 8 getdents
0.00 0.000000 0 3 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 6 getcwd
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 13 stat64
0.00 0.000000 0 11 fstat64
0.00 0.000000 0 13 fcntl64
0.00 0.000000 0 9 2 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 1 getsockname
------ ----------- ----------- --------- --------- ----------------
100.00 0.001200 234 17 total

[ed@fedora devel]$ strace -c php rec.php . > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
39.77 0.000727 5 141 mmap2
32.82 0.000600 8 74 3 open
19.04 0.000348 4 91 read
4.27 0.000078 9 9 brk
4.10 0.000075 1 75 close
0.00 0.000000 0 1 write
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 time
0.00 0.000000 0 2 lseek
0.00 0.000000 0 2 1 access
0.00 0.000000 0 10 10 ioctl
0.00 0.000000 0 31 munmap
0.00 0.000000 0 1 setitimer
0.00 0.000000 0 1 uname
0.00 0.000000 0 11 mprotect
0.00 0.000000 0 5 2 _llseek
0.00 0.000000 0 8 getdents
0.00 0.000000 0 1 sched_getparam
0.00 0.000000 0 1 sched_setscheduler
0.00 0.000000 0 1 sched_getscheduler
0.00 0.000000 0 1 sched_get_priority_max
0.00 0.000000 0 1 sched_get_priority_min
0.00 0.000000 0 4 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 2 getcwd
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 13 stat64
0.00 0.000000 0 17 2 lstat64
0.00 0.000000 0 77 fstat64
0.00 0.000000 0 2 fcntl64
0.00 0.000000 0 4 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 statfs64
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 socket
------ ----------- ----------- --------- --------- ----------------
100.00 0.001828 596 18 total

And now for /usr/share ...
[ed@fedora devel]$ strace -c tclsh rec.tcl /usr/share > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
64.83 0.866678 1 628969 1 lstat64
20.10 0.268769 2 117927 stat64
9.74 0.130187 4 30024 getdents
2.90 0.038823 3 14790 open
1.68 0.022425 2 14791 close
0.62 0.008303 1 14787 fstat64
0.10 0.001272 1 1179 readlink
0.03 0.000399 399 1 execve
0.01 0.000071 12 6 4 _llseek
0.00 0.000000 0 16 read
0.00 0.000000 0 5 write
0.00 0.000000 0 8 5 access
0.00 0.000000 0 15 brk
0.00 0.000000 0 9 4 ioctl
0.00 0.000000 0 1 munmap
0.00 0.000000 0 1 clone
0.00 0.000000 0 2 uname
0.00 0.000000 0 6 mprotect
0.00 0.000000 0 3 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 1 getcwd
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 18 mmap2
0.00 0.000000 0 13 fcntl64
0.00 0.000000 0 9 2 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 1 getsockname
------ ----------- ----------- --------- --------- ----------------
100.00 1.336927 822587 17 total

[ed@fedora devel]$ strace -c php rec.php /usr/share > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
68.48 0.364644 4 103151 stat64
21.13 0.112513 5 22404 getdents
5.72 0.030447 4 7462 3 open
2.54 0.013514 2 7463 close
1.10 0.005839 1 7390 lseek
0.93 0.004970 1 7465 fstat64
0.05 0.000275 3 91 read
0.04 0.000238 2 141 mmap2
0.01 0.000066 6 11 mprotect
0.00 0.000000 0 1 write
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 time
0.00 0.000000 0 2 1 access
0.00 0.000000 0 9 brk
0.00 0.000000 0 10 10 ioctl
0.00 0.000000 0 31 munmap
0.00 0.000000 0 1 setitimer
0.00 0.000000 0 1 uname
0.00 0.000000 0 5 2 _llseek
0.00 0.000000 0 1 sched_getparam
0.00 0.000000 0 1 sched_setscheduler
0.00 0.000000 0 1 sched_getscheduler
0.00 0.000000 0 1 sched_get_priority_max
0.00 0.000000 0 1 sched_get_priority_min
0.00 0.000000 0 4 rt_sigaction
0.00 0.000000 0 1 rt_sigprocmask
0.00 0.000000 0 2 getcwd
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 17 2 lstat64
0.00 0.000000 0 2 fcntl64
0.00 0.000000 0 4 futex
0.00 0.000000 0 1 set_thread_area
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 statfs64
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 socket
------ ----------- ----------- --------- --------- ----------------
100.00 0.532506 155682 18 total

Thoughts?

Regards
E Wilson

Alexandre Ferrieux

unread,
Dec 10, 2007, 5:27:18 PM12/10/07
to
On Dec 10, 7:33 pm, ewilsonm...@gmail.com wrote:
> Thank you for your suggestions. I reimplemented the Tcl script using
> a single file stat but found no improvement. So I removed the stat
> call altogether by catching the [glob] ... still no discernible
> improvement.

Still, the above posted code with just [glob] and no [file stat] is
2.4 times faster than your last code.
You're not saying how important the weeding out of non-regular
files and dirs is (but [glob -types] may help).

> The straces are interesting in how they differ over two runs: the first
> run where the directory tree is 2 deep, and the second where we use
> /usr/share. Could this be related to setting up the interpreters?
>

> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 83.25 0.000999 17 58 1 lstat64
>

> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 64.83 0.866678 1 628969 1 lstat64

More likely, the filesystem cache: 17 usecs/lstat when hitting the
disk, 1 or less when running from cache.
Always do your tests twice or thrice to cancel out such first-time
effects in comparisons.

Moreover, you have already shown that syscalls accounted for roughly a
third of the total time.
So the userland part (non-syscall operations) is twice as important
as the [strace -c] measurements suggest (not to mention the
ptrace()-SIGCONT overhead)... Hence the dramatic effect of removing the
extra [file stat] and assorted hash lookups and variable manipulations.

-Alex


ewils...@gmail.com

unread,
Dec 11, 2007, 11:26:07 AM12/11/07
to
On Dec 11, 8:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> On Dec 10, 7:33 pm, ewilsonm...@gmail.com wrote:
>
> > Thank you for your suggestions. I reimplemented the Tcl script using
> > a single file stat but found no improvement. So I removed the stat
> > call altogether by catching the [glob] ... still no discernible
> > improvement.
>
> Still, the above posted code with just [glob] and no [file stat] is
> 2.4 times faster than your last code.
> You're not saying how important the weeding out of non-regular
> files and dirs is (but [glob -types] may help).

Regarding the optimisation, the new script runs slower than the
original:
(I just need regular files by the way but that's ok)

[ed@fedora devel]$ time tclsh rec.tcl /usr/share
Tcl SUMMARY
dirs 7390
files 95754

real 0m21.118s
user 0m14.572s
sys 0m6.441s

[ed@fedora devel]$ time tclsh rec2.tcl /usr/share
Tcl SUMMARY
dirs 7390
files 95754

real 0m30.725s
user 0m20.992s
sys 0m9.605s

[ed@fedora devel]$ time php rec.php /usr/share
PHP SUMMARY
dirs 7390
files 95754

real 0m5.835s
user 0m3.422s
sys 0m2.387s

[ed@fedora devel]$ cat rec2.tcl

proc process2 e {
    global nDir nFile

    set l [glob -nocomplain -directory $e * .*]
    if {![llength $l]} {incr nFile; return}
    incr nDir
    foreach de $l {
        set tail [file tail $de]
        if {[string equal $tail .] || [string equal $tail ..]} continue
        process2 $de
    }
}

proc main {} {
    global argc argv argv0 nFile nDir

    set nFile 0
    set nDir 0

    if { $argc < 1 } { puts "$argv0: usage tclsh $argv0 files|dirs"; exit 1 }

    foreach a $argv { process2 $a }

    puts "Tcl SUMMARY"
    puts "dirs $nDir"
    puts "files $nFile"
}
main

Tcl version is 8.4.15.

> > The straces are interesting in how they differ over two runs: first

> > ...


> More likely, the filesystem cache: 17 usecs/lstat when hitting the
> disk, 1 or less when running from cache.
> Always do your tests twice or thrice to cancel out such first-time
> effects in comparisons.

Of course, you're quite right. The 2-dir-deep run was the last thing
I did. Let's ignore it.

Regards
E Wilson

ewils...@gmail.com

unread,
Dec 13, 2007, 8:04:09 AM12/13/07
to
On Dec 11, 8:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:
<snip>

> Still, the above posted code with just [glob] and no [file stat] is
> 2.4 times faster than your last code.
> You're not saying how important the weeding out of non-regular
> files and dirs is (but [glob -types] may help).
<snip>

I found it slower: http://groups.google.com/group/comp.lang.tcl/msg/137e234ac5b743ac

Could you show your test scripts?

Regards
E Wilson.

Alexandre Ferrieux

unread,
Dec 13, 2007, 9:55:08 AM12/13/07
to

Nothing more than the posted code, invoked in the tclsh console as

time {process2 c:/}

on a Windows XP Dell D620 laptop.
Again, I did every test several times to get rid of first-hit effects.

-Alex

ewils...@gmail.com

unread,
Dec 13, 2007, 10:24:32 AM12/13/07
to
On Dec 14, 12:55 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>

wrote:
> On Dec 13, 2:04 pm, ewilsonm...@gmail.com wrote:
>
> > On Dec 11, 8:27 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
> > wrote:
> > <snip>
> > > Still, the above posted code with just [glob] and no [file stat] is
> > > 2.4 times faster than your last code.
> > > You're not saying how important the weeding out of non-regular
> > > files and dirs is (but [glob -types] may help).
>
> > <snip>
>
> > I found it slower: http://groups.google.com/group/comp.lang.tcl/msg/137e234ac5b743ac
>
> > Could you show your test scripts?
>
> Nothing more than the posted code, invoked in the tclsh console as
>
> time {process2 c:/}
>
> on a WindowsXP Dell D620 laptop...

Ah, that might explain why we get different results.

Thank you.
E Wilson.

Bezoar

unread,
Dec 13, 2007, 10:56:28 PM12/13/07
to


The reason for the time difference seems obvious to me ... You will
notice that [glob] in Tcl is optimized to find particular patterns in a
directory, so it must do a match on every directory entry before returning
its list. In your Tcl script two compares must be made on each
entry. In the PHP case, none of these compares are made except
the ones you program yourself. Perhaps [glob] should be made to return
everything in a directory if no pattern is provided, rather than give
an error. On the other hand, [glob] (at least in 8.5) has some very
nice filtering capabilities using -types that can filter on filesystem
types and permissions as well. You can use [glob] itself to separate
the files from the directories, as I have done below, eliminating the
comparisons in the script. Without this redundancy you save 50%. I
demonstrate that in the script below.

In the script below I also add an additional counter, to see how many
hidden files are in the file system and to show the filtering
action. Surprisingly, omitting it (and adding .* to the glob for all
files) only saved about 0.05 seconds in runtime. I also recursively
create a directory list first, then glob for the files by iterating
over the resulting directory list. I did this to eliminate the
isdirectory and isfile comparisons in my script. It appears that
glob filters by type first, then applies the more expensive
matching tests second. The good news is that I was able to speed up
the run time from 7.8-8 seconds to about 3.5-3.7 seconds: more than
a 50% speedup.

I wonder if you could provide the timing for the PHP script with the
hidden files filtered out, as I have done, to see if that brings
the scripts more in line.

Original script when run across my home directory.

Tcl SUMMARY PHP SUMMARY
dirs 5885 dirs 5888
files 143373 files 143373

real 0m7.383s real 0m2.035s
user 0m3.359s user 0m0.869s
sys 0m3.922s sys 0m1.167s

Timing for script provided below over my home directory

Tcl SUMMARY
dirs :5888
Regular files :143143
hidden files :229
Total Files :143372
other 0

real 0m3.649s
user 0m0.829s
sys 0m2.820s


-- script ---

proc allDirs { e retlist } {
    upvar $retlist retval
    set count 0
    foreach dir [glob -nocomplain -directory $e -types {d} -- * .*] {
        if { $count < 2 } {
            if { [regexp {(\.|\.\.)$} $dir] } {
                incr count
                continue
            }
        }
        allDirs $dir retval
    }
    lappend retval $e
}

proc process { e } {
    global nDir nFile nFileHidden
    set dirlist {}
    allDirs $e dirlist
    foreach dir $dirlist {
        incr nFile [llength [glob -nocomplain -directory $dir -types {f} -- *]]
        incr nFileHidden [llength [glob -nocomplain -directory $dir -types {f hidden} -- .*]]
    }
    set nDir [llength $dirlist]
}

proc main {} {
    global argc argv argv0 nFile nDir nFileHidden

    set nFile 0
    set nFileHidden 0
    set nDir 0
    set nOther 0

    if { $argc < 1 } {
        puts "$argv0: usage tclsh $argv0 files|dirs"
        exit 1
    }

    foreach a $argv { process $a }
    puts ""

    puts "Tcl SUMMARY"
    puts "dirs          :$nDir"
    puts "Regular files :$nFile"
    puts "hidden files  :$nFileHidden"
    puts "Total Files   :[expr { $nFile + $nFileHidden }]"
    puts "other $nOther"
}

main

---- script end

Carl


Message has been deleted

Donal K. Fellows

unread,
Dec 14, 2007, 5:34:04 PM12/14/07
to
ewils...@gmail.com wrote:
> Perhaps the [glob] command needs an option (or maybe the default
> option) that returns all directory entries.

What exactly do you mean by this? All subdirectories in a directory? All
entries in a directory, whether subdirs or what? (The [glob] command can
do both of those.)
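For reference, both forms are already expressible today (hedged sketch,
8.4-era syntax; $d is a placeholder directory):

```tcl
# Subdirectories only (the .* pattern still picks up . and ..):
set subdirs [glob -nocomplain -directory $d -types d -- * .*]
# Every entry, files and directories alike:
set entries [glob -nocomplain -directory $d -- * .*]
```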

Donal.

ewils...@gmail.com

unread,
Dec 15, 2007, 1:45:50 AM12/15/07
to
On Dec 15, 8:34 am, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:

The 2nd: all entries. But without having to match explicitly. So
instead of

[glob * .*]

we can just go

[glob] or [glob -all].

I'm no API designer so there are undoubtedly better ways to do it
including leaving things as they are.

Regards
E Wilson

(The message of mine that Donal replied to hasn't shown up in Google
Groups yet - for those who are using this interface and are wondering.)

Alexandre Ferrieux

unread,
Dec 15, 2007, 6:46:41 AM12/15/07
to
On Dec 15, 7:45 am, ewilsonm...@gmail.com wrote:
>
> instead of
> [glob * .*]
> we can just go
> [glob] or [glob -all].

Yes, I'd like this too.
And while we're at it, I'd also suggest that the same option (or an
extra one) skip the "." and ".." entries, because most of the time we
need extra code not to recurse on them.
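Until such an option exists, the boilerplate can live in a tiny wrapper
(untested sketch; the name [glob-all] is hypothetical):

```tcl
# Hypothetical [glob-all]: every entry of $dir except . and ..
proc glob-all {dir} {
    set out {}
    foreach e [glob -nocomplain -directory $dir * .*] {
        set t [file tail $e]
        if {![string equal $t .] && ![string equal $t ..]} {
            lappend out $e
        }
    }
    return $out
}
```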

-Alex

ewils...@gmail.com

unread,
Dec 15, 2007, 11:31:27 AM12/15/07
to
On Dec 15, 9:46 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

+1

Regards
E Wilson

Donal K. Fellows

unread,
Dec 15, 2007, 5:45:22 PM12/15/07
to
ewils...@gmail.com wrote:
> The 2nd: all entries. But without having to match explicitly. So
> instead of
> [glob * .*]
> we can just go
> [glob] or [glob -all].

Have you done a FRQ for this yet?

Donal.

ewils...@gmail.com

unread,
Dec 18, 2007, 12:18:13 PM12/18/07
to
[Contents]

* where I'm at
* notes about the scripts
* the scripts
* timing results
* strace output
* comments and analysis welcomed

... so,
I tested each script in single-user mode taking the median of 5 runs
with the disk cache warmed up.

I tested 4 different scripts:
1. My original Tcl script (rec.tcl)
2. The latest Tcl script (rec4.tcl) from this thread with tweaks
3. My original php script (rec.php)
4. My quick python script (rec.py)

Then I ran strace on all scripts except the python one ( ...long day)

[Some Notes about the scripts]

I've removed some of the micro-optimisations from rec3.tcl ... it
runs a little slower than the previous version. I've also made it
more realistic: rather than counting files by [llength] of the file
list, I walk the list as I do in the script that does real work. It
uses code contributed by others in this thread.

My original php can be improved by utilising methods of the
'DirectoryIterator' instead of passing strings to 'is_file' and
'is_directory'. But the script is already the fastest.

The python script does not follow links. I leave this as an exercise
for the interested reader ;-) Apart from the timing, the interesting
thing with the python script is that it is embarrassingly
short (across & down) as it uses the os module of the python library.
In fairness to the PHP and Tcl scripts, they also check that their
arguments are files or directories and act appropriately, while the
Python os.walk() was written to ignore exceptions.
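As an aside, following links in the Python version is a one-flag change
(assuming Python >= 2.6, where os.walk grew the followlinks parameter):

```python
import os

def rec(top, follow=False):
    """Count directories and files under top, mirroring rec.py's tallies."""
    ndirs = nfiles = 0
    # os.walk skips symlinked directories by default;
    # followlinks=True makes it descend into them as well.
    for root, dirs, files in os.walk(top, followlinks=follow):
        ndirs += 1
        nfiles += len(files)
    return ndirs, nfiles
```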

[The Scripts]

# rec.tcl: the original -----------------------------------------

proc err-exit {} {
    global argv0

    puts "$argv0: usage <directory-name | file-name> ..."
    exit
}

proc process {e} {
    global nDir nFile

    if {[file isfile $e]} {
        incr nFile
        return
    }

    if {![file isdirectory $e]} {return}

    incr nDir
    catch {
        set entries [glob -directory $e * \.*]
        foreach de $entries {
            set tail [file tail $de]
            if {[string equal $tail .] || [string equal $tail ..]} {
                continue
            }
            process $de
        }
    }
}

proc main {} {
    global argc argv nFile nDir

    set nFile 0
    set nDir 0

    if { $argc < 1 } { err-exit }

    foreach a $argv { process $a }

    puts SUMMARY
    puts "directories $nDir"
    puts "files $nFile"
}
main

# rec4.tcl: current -------------------------------------------------
# with thanks to everyone in this thread
proc is-dot {dir} { regexp {(\.|\.\.)$} $dir }

proc process-file {f} {
    global nFile
    incr nFile
}

proc process {e} {
    global nDir nFile

    incr nDir

    foreach f [glob -nocomplain -directory $e -types {f} -- * .*] {
        process-file $f
    }

    foreach dir [glob -nocomplain -directory $e -types {d} -- * .*] {
        if {![is-dot $dir]} { process $dir }
    }
}

proc main {} {
    global argc argv argv0 nFile nDir

    set nDir [set nFile 0]

    if {$argc < 1} {
        puts "$argv0: usage tclsh $argv0 dirs or files"
        exit 1
    }

    foreach a $argv {
        if {[file isfile $a]} { process-file $a; continue }
        process $a
    }

    puts "Tcl SUMMARY"
    puts "dirs \t $nDir"
    puts "files \t $nFile"
}

main

# rec.py ------------------------------------------------------------

import os, sys
from os.path import join

def rec(d):
    global numDirs, numFiles
    for root, dirs, files in os.walk(d):
        numDirs += 1
        for file in files:
            numFiles += 1

numDirs = numFiles = 0
args = sys.argv[1:]

for arg in args: rec(arg)

print "PYTHON SUMMARY"
print "dirs %d" % numDirs
print "files %d" % numFiles

# rec.php -----------------------------------------------------------

<?php

function process_file ($f)
{
    global $nFile;
    ++$nFile;
}

function process ($d)
{
    global $nDir;

    if (is_file ($d)) { process_file ($d); return; }
    if (!is_dir ($d)) return;

    try {
        $dir = new DirectoryIterator ($d);
        ++$nDir;
        foreach ($dir as $file) {
            if ($file->isDot()) continue;
            process($file->getPathname());
        }
    }
    catch (Exception $ex) {
        echo "Cannot iterate over $d\n";
    }
}

$nDir = $nFile = 0;

$n = count ($argv);

for ($i = 1; $i < $n; ++$i) {
    $e = $argv[$i];
    process($e);
}
echo "\nPHP SUMMARY\ndirs $nDir\nfiles $nFile\n";

---------------------------------------------------------------------

[Timing Results] on /usr/share

-- Median of 5 runs --------------------------------
       rec4.tcl   rec.tcl   rec.php    rec.py
real      8.792    14.014     4.536    10.453
user      4.777     9.270     2.648     7.838
sys       4.013     4.742     1.888     2.614
----------------------------------------------------

[Truncated strace output]

rec.tcl (the original script)


% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------

62.76 0.582095 1 628971 1 lstat64
20.47 0.189866 2 125317 stat64
11.52 0.106819 4 30024 getdents
2.99 0.027711 2 14790 open
1.60 0.014857 1 14791 close
0.51 0.004709 0 14787 fstat64
0.12 0.001157 1 1179 readlink
0.01 0.000075 4 18 mmap2
0.01 0.000068 4 16 read
0.01 0.000066 4 16 brk

rec4.tcl


% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------

44.69 0.355098 1 265410 stat64
25.81 0.205072 3 60048 getdents
16.37 0.130055 1 131921 1 lstat64
7.51 0.059689 2 29570 open
3.98 0.031633 1 29571 close
1.51 0.012012 0 29567 fstat64
0.09 0.000727 1 601 readlink
0.03 0.000223 56 4 write
0.01 0.000062 6 10 7 access
0.01 0.000046 46 1 set_robust_list

rec.php


% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------

64.28 0.256811 2 103151 stat64
26.16 0.104530 5 22404 getdents
5.19 0.020739 3 7462 3 open
2.45 0.009807 1 7463 close
0.97 0.003856 1 7390 lseek
0.84 0.003353 0 7465 fstat64
0.07 0.000279 2 141 mmap2
0.02 0.000096 3 31 munmap
0.01 0.000057 1 91 read

Comments and analysis welcomed.

Regards
E Wilson

ewils...@gmail.com

unread,
Dec 18, 2007, 12:24:04 PM12/18/07
to
On Dec 16, 8:45 am, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:

Will do :-)

Regards
E Wilson

sch...@uni-oldenburg.de

unread,
Dec 18, 2007, 12:40:13 PM12/18/07
to
> The python script does not follow links. I leave this as an exercise
> for the interested reader ;-) Apart from the timing, the interesting
> thing with the python script is that it is embarrassingly
> short (across & down) as it uses the os module of the python library.
> In fairness to php and tcl scripts, they also check that their
> arguments are files or directories and act appropriately while the
> python os.walk() was written to ignore exceptions.

Maybe you should take a look at:
http://tcllib.sourceforge.net/doc/multiop.html

It's also very cool, not widely known, and short (but it might not help
with the speed problem).

Michael

Alexandre Ferrieux

unread,
Dec 18, 2007, 12:52:13 PM12/18/07
to
On Dec 18, 6:40 pm, schl...@uni-oldenburg.de wrote:
>
> Maybe you should take a look at: http://tcllib.sourceforge.net/doc/multiop.html
>
> Its also very cool, not widely known and short. (but might not help
> with the speed problem).

It sits on top of fileutil::find, which is itself a variant of the
usual recursive-glob at the beginning of this thread. The only
difference I can see is that it insists on handling file entries at
each given level before diving into subdirs...

-Alex

ewils...@gmail.com

unread,
Dec 18, 2007, 1:04:46 PM12/18/07
to
I should add that I'm happy with the speed of my current Tcl script.
I must admit that I'm surprised with the speed of the php and the
brevity of the python one.
(multiop looks cool too)

I sometimes wish that glob (and Tcl commands in general) didn't use
such long words for options.

Compare


glob -nocomplain -directory $e -types {f} -- * .*

glob -q -d $e -t {f} -- * .*

The first may indicate the meaning better but I still have to look at
the manpage.

I find it easier to read down than across and often use [set] in my
scripts to make up for long command invocations. Tcl isn't perfect
but I can't help liking it (and I haven't got into the fun stuff yet).

E Wilson

Evil Son

unread,
Jan 28, 2008, 2:26:54 AM1/28/08
to
On Dec 16 2007, 8:45 am, "Donal K. Fellows"
<donal.k.fell...@manchester.ac.uk> wrote:

Yes, finally. I have made 2 FRQs for consideration:
1880967 A convenience option to [glob] returning all entries
http://sourceforge.net/tracker/index.php?func=detail&aid=1880967&group_id=10894&atid=360894
1880980 Option to [glob] to ease recursive traversal
http://sourceforge.net/tracker/index.php?func=detail&aid=1880980&group_id=10894&atid=360894

Kind regards
Ed
