
Can Tcl scan faster than find?


Luc

Dec 8, 2022, 5:01:23 PM
I have this application that is divided between a shell script
and a Tcl script.

The shell script uses `find' to scan the entire hard disk and output
the full path of every single file into a catalog file. It has to be
run from time to time to update the catalog.

The Tcl script has a very quick'n'dirty GUI that accepts a string
as input, finds matches in the catalog and shows all the matches,
with the matched string highlighted.

It's a very old application of mine that I want to improve.

The first version of it did everything in one Tcl script, but I
remember replacing the Tcl proc with a shell script to scan the hard
disk because `find' was a lot faster than my Tcl code.

Of course, maybe my code was bad, but it was just a matter of going
into every directory found and globbing it. There wasn't a lot of
opportunity for screwing up.
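For the record, the kind of code I mean was roughly this (the proc name
is just illustrative, not my original code):

```tcl
# Recurse with glob: collect files first, then descend into
# subdirectories. Note that * skips dotfiles, which is one of the
# easy ways code like this silently differs from find.
proc walk {dir} {
    set result [glob -nocomplain -directory $dir -types f *]
    foreach sub [glob -nocomplain -directory $dir -types d *] {
        lappend result {*}[walk $sub]
    }
    return $result
}
```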

Anyway, my question is, do you think it's possible to write Tcl code
that can rival `find' in speed?

--
Luc

Ralf Fassel

Dec 9, 2022, 5:54:42 AM
* Luc <n...@no.no>
| Anyway, my question is, do you think it's possible to write Tcl code
| that can rival `find' in speed?

There is the fileutil package in tcllib:

https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/fileutil/fileutil.md

which contains

::fileutil::find ?basedir ?filtercmd??

An implementation of the unix command find. Adapted from the Tcler's
Wiki. Takes at most two arguments, the path to the directory to start
searching from and a command to use to evaluate interest in each
file. [...]

Maybe give it a try? Note that the command returns only after all files
have been found, so for a 'live' application you would start it in a
separate thread and communicate the files via the filtercmd to the main
thread (or play around with 'update' in the filtercmd).
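A minimal call, assuming tcllib is installed, would look something
like this:

```tcl
package require fileutil

# The filter command is run with each filename appended; here it
# keeps only regular files (directories are traversed but not listed).
set files [fileutil::find . {file isfile}]
puts "[llength $files] files found"
```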

Somehow I doubt that a script based solution will be faster than one
in C (though the disk IO should be the limiting factor here).

R'

Rich

Dec 9, 2022, 11:04:34 AM
Ralf Fassel <ral...@gmx.de> wrote:
> * Luc <n...@no.no>
> | Anyway, my question is, do you think it's possible to write Tcl code
> | that can rival `find' in speed?
>
> Somehow I doubt that a script based solution will be faster than one
> in C (though the disk IO should be the limiting factor here).

Agreed. I also doubt a TCL variant will be faster than the
/usr/bin/find utility for identical scans.

And disk IO, esp. if using mechanical disks where seek times dominate
for "scan a directory hierarchy" runs, is going to be the ultimate
limiting factor. This fact will likely be what would make it appear
that a TCL and a /usr/bin/find scan were close in time -- both spent a
majority (as in 98%+) of their runtime waiting for disk head seeks to
complete.

Running on an SSD would remove the seek time overhead, and likely
result in /usr/bin/find surpassing a TCL solution by a substantial
margin.

Luc

Dec 9, 2022, 3:36:57 PM
On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:

> Ralf Fassel <ral...@gmx.de> wrote:

> And disk IO, esp. if using mechanical disks where seek times dominate
> for "scan a directory hierarchy" runs, is going to be the ultimate
> limiting factor. This fact will likely be what would make it appear
> that a TCL and a /usr/bin/find scan were close in time -- both spent a
> majority (as in 98%+) of their runtime waiting for disk head seeks to
> complete.
>
> Running on an SSD would remove the seek time overhead, and likely
> result in /usr/bin/find surpassing a TCL solution by a substantial
> margin.


The disk I/O bottleneck is not very relevant because I am not as concerned
with how long it's going to take as I am with how much LONGER than `find'
it's going to take.

I intend to release the end product as an application so it's not just for
me, and people are expected to understand that scanning the entire HD is
going to take some time. The core of the issue here is whether it's still
worth trying to do everything in Tcl or whether I should just accept
the facts of life and do some [exec find] thing.
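By "[exec find] thing" I mean something like this (untested sketch;
I'm scanning "." here because scanning / would need a catch around
exec, since find exits non-zero on unreadable directories):

```tcl
# Let find do the walking, then split its output into a Tcl list.
set catalog [split [exec find . -type f] \n]
puts "[llength $catalog] entries"
```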

I'm also considering the option of collecting additional data on every
file such as size, date and permissions, up to the user. For that I would
feel a lot more comfortable using pure Tcl. The current code has none of
that but it occurs to me that some people may want it.
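Collecting that extra data in pure Tcl is at least straightforward,
something like this (helper name is just for illustration):

```tcl
# file stat fills an array with size, mtime, mode etc., and works
# the same on every platform Tcl runs on.
proc fileinfo {path} {
    file stat $path st
    dict create size $st(size) mtime $st(mtime) \
        perms [format %04o [expr {$st(mode) & 0o7777}]]
}
```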

So yeah, I guess I have to run some tests on that ::fileutil:: command
and see how well it performs against my Tcl code and `find'.

Thank you all.


--
Luc

Rich

Dec 9, 2022, 3:55:57 PM
Luc <n...@no.no> wrote:
> On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
>
>> Ralf Fassel <ral...@gmx.de> wrote:
>
>> And disk IO, esp. if using mechanical disks where seek times dominate
>> for "scan a directory hierarchy" runs, is going to be the ultimate
>> limiting factor. This fact will likely be what would make it appear
>> that a TCL and a /usr/bin/find scan were close in time -- both spent a
>> majority (as in 98%+) of their runtime waiting for disk head seeks to
>> complete.
>>
>> Running on an SSD would remove the seek time overhead, and likely
>> result in /usr/bin/find surpassing a TCL solution by a substantial
>> margin.
>
>
> The disk I/O bottleneck is not very relevant because I am not as concerned
> with how long it's going to take as I am with how much LONGER than `find'
> it's going to take.

If you want to quantify "how much longer" then your only option may be
to run tests. About all any of us can say without actually testing is
"TCL is likely to be slower".

> I intend to release the end product as an application so it's not just for
> me, and people are expected to understand that scanning the entire HD is
> going to take some time. The core of the issue here is whether it's still
> worth trying to do everything in Tcl or I should just accept the facts of
> life and do some [exec find] thing.

Do you plan to make the end product be cross platform (i.e., run on
Linux, Windows, and Mac)? If yes, then you'd want to write it all in
Tcl, even if slower, because there is no equivalent to 'find' on win
(at least not in the default MS install) and while there is one on Mac,
the BSD vs. GNU differences might make for the need for two different
process loops.

> I'm also considering the option of collecting additional data on every
> file such as size, date and permissions, up to the user. For that I would
> feel a lot more comfortable using pure Tcl. The current code has none of
> that but it occurs to me that some people may want it.

GNU find can output much of this with its "-printf" option, which
might make find even faster than TCL -- but then you /do/ still have
to parse the output in TCL, possibly negating the difference. But
"-printf" is a GNU extension that may not exist on Mac, and there is
no 'find' on windows by default.
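For GNU find, the metadata variant would look something like this on
the Tcl side (assuming GNU find is on PATH; the format directives are
%p path, %s size, %T@ mtime):

```tcl
# GNU find emits one tab-separated record per file; Tcl just splits.
# Not portable to BSD/Mac find, which lacks -printf.
set out [exec find . -type f -printf {%p\t%s\t%T@\n}]
foreach line [split $out \n] {
    if {$line eq ""} continue
    lassign [split $line \t] path size mtime
    # ... store $path, $size, $mtime in the catalog ...
}
```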

Robert Heller

Dec 9, 2022, 4:51:21 PM
At Fri, 9 Dec 2022 17:36:52 -0300 Luc <n...@no.no> wrote:

>
> On Fri, 9 Dec 2022 16:04:29 -0000 (UTC), Rich wrote:
>
> > Ralf Fassel <ral...@gmx.de> wrote:
>
> > And disk IO, esp. if using mechanical disks where seek times dominate
> > for "scan a directory hierarchy" runs, is going to be the ultimate
> > limiting factor. This fact will likely be what would make it appear
> > that a TCL and a /usr/bin/find scan were close in time -- both spent a
> > majority (as in 98%+) of their runtime waiting for disk head seeks to
> > complete.
> >
> > Running on an SSD would remove the seek time overhead, and likely
> > result in /usr/bin/find surpassing a TCL solution by a substantial
> > margin.
>
>
> The disk I/O bottleneck is not very relevant because I am not as concerned
> with how long it's going to take as I am with how much LONGER than `find'
> it's going to take.
>
> I intend to release the end product as an application so it's not just for
> me, and people are expected to understand that scanning the entire HD is
> going to take some time. The core of the issue here is whether it's still
> worth trying to do everything in Tcl or I should just accept the facts of
> life and do some [exec find] thing.

More likely:

set fp [open "|find ..." r]    ;# replace '...' with find's params and opts
fconfigure $fp -blocking 0     ;# so gets never stalls the event loop
fileevent $fp readable [list processfile $fp]

proc processfile {fp} {
    if {[gets $fp pathname] >= 0} {
        # process pathname (eg using "file <command> $pathname ..." as desired)
    } elseif {[eof $fp]} {
        catch {close $fp}
        exit                   ;# or whatever
    }
}

vwait forever    ;# don't forget this at the end (if Tk is not in play).

>
> I'm also considering the option of collecting additional data on every
> file such as size, date and permissions, up to the user. For that I would
> feel a lot more comfortable using pure Tcl. The current code has none of
> that but it occurs to me that some people may want it.
>
> So yeah, I guess I have to run some tests on that ::fileutil:: command
> and see how well it performs against my Tcl code and `find'.
>
> Thank you all.
>
>

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
hel...@deepsoft.com -- Webhosting Services

briang

Dec 10, 2022, 6:05:57 PM
I doubt you'll be able to best the speed of find. I have written a utility in Tcl that scans the entire hard drive. I used a threaded model to try and take advantage of I/O latency, since it also gathers file size info. My assumption is that the OS will optimize its operations and suspend the thread(s) until the data is ready. I have not timed it or compared to "find", but it is able to scan ~0.5TB fast enough for me. It's not quick, nor does it take "forever." It also runs on all platforms.

It scans the starting dir for files and subdirectories, and farms the subdirectories out to another thread from a pool. The thread jobs get queued as worker threads become available. This is done recursively. The results in the worker thread are queued back to the main thread via a non-blocking callback command, making the worker thread quickly available for the next job.
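The shape of that, sketched with tcllib's tpool (pool size and proc
name are mine, not Brian's actual code; tpool also collects results by
waiting on job ids rather than Brian's async callbacks, but the
division of labor is the same):

```tcl
package require Thread

# Each worker lists one directory; the main thread queues every
# subdirectory a worker reports as a new job for the pool.
set pool [tpool::create -maxworkers 4 -initcmd {
    proc scandir {dir} {
        list [glob -nocomplain -directory $dir -types f *] \
             [glob -nocomplain -directory $dir -types d *]
    }
}]

set jobs [list [tpool::post $pool [list scandir .]]]
while {[llength $jobs]} {
    foreach done [tpool::wait $pool $jobs jobs] {
        lassign [tpool::get $pool $done] files subdirs
        foreach f $files { puts $f }
        foreach d $subdirs {
            lappend jobs [tpool::post $pool [list scandir $d]]
        }
    }
}
tpool::release $pool
```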

-Brian

Luc

Dec 10, 2022, 6:30:08 PM
On Sat, 10 Dec 2022 15:05:54 -0800 (PST), briang wrote:

> I doubt you'll be able to best the speed of find. I have written a
> utility in Tcl that scans the entire hard drive. I used a threaded model
> to try and take advantage of I/O latency, since it also gathers file size
> info. My assumption is that the OS will optimize its operations and
> suspend the thread(s) until the data is ready. I have not timed it or
> compared to "find", but it is able to scan ~0.5TB fast enough for me.
> It's not quick, nor does it take "forever." It also runs on all
> platforms.
>
> It scans the starting dir for files and subdirectories, and farms the
> subdirectories out to another thread from a pool. The thread jobs get
> queued as worker threads become available. This is done recursively. The
> results in the worker thread are queued back to the main thread via a
> non-blocking callback command, making the worker thread quickly available
> for the next job.
>
> -Brian

Interesting, but I wonder how effective that concept of threads really is.
The CPU may support multiple threads, but does the hard disk?

--
Luc

briang

Dec 10, 2022, 7:00:34 PM
Yes, they do.

-Brian

Rich

Dec 10, 2022, 9:47:25 PM
Yes. Look up Native Command Queuing:
https://en.wikipedia.org/wiki/NCQ

For a mechanical drive, there is only one head arm, so ultimately the
"threads" serialize on that fact, but the drive can readjust ordering
to minimize head arm seeks.

For a SSD drive, since there is no head arm, there is no head arm seek
time, and depending upon the internal flash memory design, the
'threads' could possibly perform parallel reads from different areas of
the flash.

Robert Heller

Dec 11, 2022, 12:10:28 AM
I would expect that at the application level, disk I/O might not be tied
*directly* to physical "disk" I/O, but rather be accessing the RAM-based disk
cache buffers. That is the *kernel* might be reading large parts of the disk
(whole tracks) into RAM buffers. Depending on how the data is on the "disk",
it *might* be possible to effectively access multiple parts of the disk
"concurrently" with different threads.

blacksqr

Dec 19, 2022, 12:04:44 PM
On Thursday, December 8, 2022 at 4:01:23 PM UTC-6, Luc wrote:
> Anyway, my question is, do you think it's possible to write Tcl code
> that can rival `find' in speed?
>
> --
> Luc
> >>

I wrote a Tcl program called globfind a while back (https://wiki.tcl-lang.org/page/globfind) which I tried to optimize for speed in searches of large filesystem spaces. I got a performance improvement of about three times over Tcllib's fileutil::find, but it's still slower than GNU find. A large pattern-match search using globfind requires about 150% of the time GNU find takes.