
How to check for filetype existence quickly


fidokomik

Aug 6, 2008, 4:32:09 AM
I have a directory, say "c:\documents" on Windows or
"/home/petr/documents" on Linux. Many files with many filetypes
(extensions) are stored in this directory, say *.doc, *.txt, *.zip. I
need to find the fastest way to check whether some filetype exists.
Currently I use a routine that calls readdir() and returns true as soon
as the passed filetype is first found, but when the directory holds a
huge number of files and the passed filetype is not among them, the
routine is very slow before it returns false.
Any ideas?
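
For reference, the routine described above might look roughly like the following. This is only a sketch under assumptions: the name has_extension and the case-insensitive comparison are illustrative choices, not taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: true as soon as one entry in $dir ends in ".$ext".
sub has_extension {
    my ($dir, $ext) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my $found = 0;
    while (defined(my $entry = readdir $dh)) {
        # Assumption: extensions are compared case-insensitively.
        if ($entry =~ /\.\Q$ext\E\z/i) {
            $found = 1;
            last;    # stop at the first hit
        }
    }
    closedir $dh;
    return $found;
}

print has_extension('.', 'txt') ? "found\n" : "not found\n";
```

A hit returns early; a miss still has to read every directory entry, which is the slow case described above.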

Lars Eighner

Aug 6, 2008, 6:02:15 AM
In our last episode,
<6d9d54ba-0b32-4868...@y21g2000hsf.googlegroups.com>,
the lovely and talented fidokomik
broadcast on comp.lang.perl.misc:

Obviously no routine can say no such file exists until, in one way or
another, it has examined all of the files.

Did you try globbing to see if it is any faster:

$found = 0;
if (</usr/home/lars/saves/*.cgi>){
    $found = 1;
}

--
Lars Eighner <http://larseighner.com/> use...@larseighner.com
"Fascism should more properly be called corporatism, since it is the
merger of state and corporate power."-Benito Mussolini * When you write the
check to pay your taxes, remember there are two l's in "Halliburton."

RedGrittyBrick

Aug 6, 2008, 6:08:46 AM
fidokomik wrote:
> I have directory, say "c:\documents" on Windows or "/home/petr/
> documents" on Linux. In this directory many files are stored with many
> filetypes (extensions), say *.doc, *.txt, *.zip.

On Linux, file name extensions are not required and if present, may not
be a reliable guide to file type. `man file`.

For Windows, consider ADT, AFM, ALL etc in
http://en.wikipedia.org/wiki/List_of_file_formats_(alphabetical)

I guess you have control of these files and so the above isn't an issue
in this case.

> I need to find fastes
> way how to check if some filetype exist. Now I use routine where I use
> readdir() and return true if passed filetype is first time found but
> when I have huge number of files but passed filetype is not found then
> routine is very slow before return false.
> Any idea?

Create, maintain and use an index or cache? I'd use a hash, maybe backed
by DBM or suchlike. I'd schedule index updates or use the OS'
directory-change notification mechanism to ensure file additions,
deletions and renames get indexed.

--
RGB
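
The index idea above could be sketched as follows, assuming SDBM_File (a DBM implementation bundled with Perl) as the backing store. The helper names rebuild_index and index_has_extension and the lowercasing of extensions are assumptions made for illustration; keeping the index fresh via a scheduler or change notification is left out.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;    # DBM implementation bundled with Perl

# Hypothetical helper: scan $dir once and record each extension in a DBM file.
sub rebuild_index {
    my ($dir, $dbfile) = @_;
    tie my %index, 'SDBM_File', $dbfile, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $dbfile: $!";
    %index = ();    # drop stale entries from a previous scan
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    while (defined(my $entry = readdir $dh)) {
        # Count files per (lowercased) extension.
        $index{lc $1}++ if $entry =~ /\.(\w+)\z/;
    }
    closedir $dh;
    untie %index;
}

# Hypothetical helper: answer from the index without reading the directory.
sub index_has_extension {
    my ($dbfile, $ext) = @_;
    tie my %index, 'SDBM_File', $dbfile, O_RDONLY, 0666
        or die "Cannot tie $dbfile: $!";
    my $found = exists $index{lc $ext} ? 1 : 0;
    untie %index;
    return $found;
}
```

SDBM stores the index in two files ($dbfile.pag and $dbfile.dir), so it is probably best kept outside the directory being indexed. Lookups are only as current as the last rebuild_index run.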

Leon Timmermans

Aug 6, 2008, 6:19:37 AM

To find a negative match, you will have to loop through the whole list;
that is unavoidable. However, if you find yourself testing the same
directory a number of times for different extensions, you could loop
through it once and save a list of the extensions you've found.

opendir my $dh, $basedir or die "Cannot open $basedir: $!";
my %is_found;
while (my $name = readdir $dh) {
    $name =~ / \. (\w+) \z /x or next;   # capture the extension
    $is_found{$1}++;
}
closedir $dh;

for my $extension (qw/exe doc txt zip mp3/) {
    my $found = $is_found{$extension} ? "Found" : "Didn't find";
    print "$found $extension\n";
}

Regards,

Leon Timmermans

Justin C

Aug 6, 2008, 6:18:32 AM

ls | egrep doc\|txt\|zip

(you need to escape the 'or' operator, and use egrep instead of grep)

I know you want to use perl, but perl won't be as fast as this... unless
there are a very, very large[1] number of files.

Justin.

1. Depending on your concept of large.
--
Justin C, by the sea.

xho...@gmail.com

Aug 6, 2008, 11:56:23 AM
fidokomik <fido...@gmail.com> wrote:
> I have directory, say "c:\documents" on Windows or "/home/petr/
> documents" on Linux. In this directory many files are stored with many
> filetypes (extensions), say *.doc, *.txt, *.zip. I need to find fastes
> way how to check if some filetype exist.

You will have to write a file system that is optimized for this (for
example, it stores directory information in some kind of tree based on the
reversed file name, so that extensions group together.) Then you have to
hack the operating system so that it can take advantage of the FS features.
Then you would have to hack perl so that it can take advantage of the OS
features.

Personally, I think I'd settle for something other than the fastest, and
just aim for good enough.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

cc96ai

Aug 6, 2008, 6:31:20 PM
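If you have subdirectories, you could use find() from File::Find:

use File::Find;
use File::Basename;

find(\&listfile, $dir);

sub listfile {
    if ( -f ) {
        # fileparse() needs a suffix pattern, otherwise $suffix is always empty
        my($filename, $directories, $suffix) = fileparse($File::Find::name, qr/\.[^.]*/);
        #check extension
        .....
    }
}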
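
A fuller sketch of the recursive check, with an early exit once a match turns up, might look like this. The helper name tree_has_extension is hypothetical, and leaving find() via die/eval is one possible technique assumed here, not something the post prescribes.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: true if any regular file under $dir ends in ".$ext".
sub tree_has_extension {
    my ($dir, $ext) = @_;
    my $found = 0;
    eval {
        find({
            no_chdir => 1,    # cwd stays put, so dying mid-walk is safe
            wanted   => sub {
                # With no_chdir, $_ holds the full path of the entry.
                return unless -f $_ && /\.\Q$ext\E\z/;
                $found = 1;
                die "FOUND\n";    # assumption: die/eval used to leave find() early
            },
        }, $dir);
        1;
    } or do {
        die $@ unless $@ eq "FOUND\n";    # rethrow anything that is a real error
    };
    return $found;
}
```

As with the single-directory case, a miss still walks the whole tree.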

On Aug 6, 8:56 am, xhos...@gmail.com wrote:

Peter J. Holzer

Aug 7, 2008, 2:20:12 PM
On 2008-08-06 10:18, Justin C <justi...@purestblue.com> wrote:
> On 2008-08-06, fidokomik <fido...@gmail.com> wrote:
>> I have directory, say "c:\documents" on Windows or "/home/petr/
>> documents" on Linux. In this directory many files are stored with many
>> filetypes (extensions), say *.doc, *.txt, *.zip. I need to find fastes
>> way how to check if some filetype exist. Now I use routine where I use
>> readdir() and return true if passed filetype is first time found but
>> when I have huge number of files but passed filetype is not found then
>> routine is very slow before return false.
>> Any idea?
>
> ls | egrep doc\|txt\|zip

Please note that the OP passes *one* filetype. So that would be

ls | egrep '\.doc$'

or

ls | egrep '\.txt$'

or

ls | egrep '\.zip$'

instead, which can be simplified to "ls *.doc", etc., which in turn can
be simplified to a call to glob().


> (you need to escape the 'or' operator, and use egrep instead of grep)
>
> I know you want to use perl, but perl won't be as fast as this... unless
> there are a very, very large[1] number of files.

Perl is also likely to be faster for a small number of files - spawning
a shell which then spawns two other programs is not exactly a cheap
operation.

hp

fidokomik

Aug 7, 2008, 8:23:49 PM
On Aug 6, 12:02 pm, Lars Eighner <use...@larseighner.com> wrote:
> Did you try globbing to see if it is any faster:
>
> $found = 0;
> if (</usr/home/lars/saves/*.cgi>){
> $found = 1;
>
> }
>
Hmm, easy and quick. Thank you, Lars. But how can I pass a variable to
this while avoiding eval()? Is it possible?

The only thing I came up with is this:

my $searchfor = 'c:/images/*.jpg';
if (checkit($searchfor)) { do_something() }
else { do_other() }

sub checkit {
    return 1 if eval('<' . shift . '>');
    return 0;
}

Ben Morrow

Aug 7, 2008, 8:46:06 PM

Quoth fidokomik <fido...@gmail.com>:

> On Aug 6, 12:02 pm, Lars Eighner <use...@larseighner.com> wrote:
> > Did you try globbing to see if it is any faster:
> >
> > $found = 0;
> > if (</usr/home/lars/saves/*.cgi>){
> > $found = 1;
> >
> > }
> >
> Hmm, easy and quick. Thank you Lars. But how to pass variable to this
> but avoid eval()? Is it possible?

perldoc -f glob

glob is the function that underlies this meaning of <>, and it's
probably cleanest to simply call it directly. If you carefully read the
section on the <> operator in perldoc perlop, you will see that it is
possible to use variables in the glob form of <>, but rather tricky.

You may also want to look at the File::Glob or (as you appear to be on
Win32) the File::DosGlob extension, which implement other forms of
globbing.

Ben

--
'Deserve [death]? I daresay he did. Many live that deserve death. And some die
that deserve life. Can you give it to them? Then do not be too eager to deal
out death in judgement. For even the very wise cannot see all ends.'
b...@morrow.me.uk
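
Calling a glob function directly with a variable pattern, as suggested above, could look like this sketch. It assumes bsd_glob from File::Glob; the helper name pattern_matches is hypothetical.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical helper: true if at least one file matches $pattern.
# bsd_glob() treats the whole string as one pattern, so a path containing
# spaces is not split into several patterns as it is with plain glob().
sub pattern_matches {
    my ($pattern) = @_;
    my @matches = bsd_glob($pattern);    # list context: no iterator state is kept
    return scalar @matches;
}

my $searchfor = 'c:/images/*.jpg';       # path from the question; adjust as needed
print pattern_matches($searchfor) ? "found\n" : "not found\n";
```

One caveat: with bsd_glob's default flags, a pattern containing no wildcard characters is returned as-is even if no such file exists, so this check is only meaningful for patterns that contain *, ? or [.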

Jim Gibson

Aug 7, 2008, 8:56:42 PM
In article
<491ff6d8-af23-432f...@34g2000hsh.googlegroups.com>,
fidokomik <fido...@gmail.com> wrote:

Use the 'glob' function (untested):

return 1 if glob shift;
return 0;

Or possibly

return scalar glob shift;

See 'perldoc -f glob'

--
Jim Gibson

Tad J McClellan

Aug 7, 2008, 9:00:53 PM
fidokomik <fido...@gmail.com> wrote:
> On Aug 6, 12:02 pm, Lars Eighner <use...@larseighner.com> wrote:
>> Did you try globbing to see if it is any faster:
>>
>> $found = 0;
>> if (</usr/home/lars/saves/*.cgi>){
>> $found = 1;
>>
>> }
>>
> Hmm, easy and quick. Thank you Lars. But how to pass variable to this
> but avoid eval()? Is it possible?


$extension = 'cgi';
if (</usr/home/lars/saves/*.$extension>){

Though that would be the bad kind of Lazy, IMO.

It makes it easier for the 1 programmer at the expense of making it
harder for the many readers/maintainers.

So I would instead write it for others rather than for myself:

if ( glob "/usr/home/lars/saves/*.$extension" ) {

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

xho...@gmail.com

Aug 8, 2008, 12:28:35 PM
Tad J McClellan <ta...@seesig.invalid> wrote:
> fidokomik <fido...@gmail.com> wrote:
> > On Aug 6, 12:02 pm, Lars Eighner <use...@larseighner.com> wrote:
> >> Did you try globbing to see if it is any faster:

Globbing will go through all the files in the directory with no possibility
of stopping early. It won't be more than slightly faster on failure, and
will be substantially slower on success.

> >>
> >> $found = 0;
> >> if (</usr/home/lars/saves/*.cgi>){
> >> $found = 1;
> >>
> >> }
> >>
> > Hmm, easy and quick. Thank you Lars. But how to pass variable to this
> > but avoid eval()? Is it possible?
>
> $extension = 'cgi';
> if (</usr/home/lars/saves/*.$extension>){
>
> Though that would be the bad kind of Lazy, IMO.
>
> It makes it easier for the 1 programmer at the expense of making it
> harder for the many readers/maintainers.
>
> So I would instead write it for others rather than for myself:
>
> if ( glob "/usr/home/lars/saves/*.$extension" ) {

Even this is bad. The glob is executed in scalar context, so it acts as
an iterator and does not reset itself. The next time it is invoked
(assuming the if is in a loop, or in a subroutine that gets called from
a loop), the new value of $extension is not even inspected unless the
old iterator has already been exhausted. Forcing list context avoids
this:


if ( () = glob "/usr/home/lars/saves/*.$extension" ) {


Xho

--
-------------------- http://NewsReader.Com/ --------------------
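
The scalar-context pitfall described above can be reproduced with a small demonstration. This is a sketch under assumptions: the scratch directory, the file names and the helper names check_scalar and check_list are all made up for illustration, and the temporary path is assumed to contain no whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Demo directory with one .txt and one .doc file (names are illustrative).
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.doc)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}

# Scalar context: this glob keeps an iterator between calls.
sub check_scalar {
    my ($ext) = @_;
    return glob("$dir/*.$ext") ? 1 : 0;
}

# List assignment: glob runs in list context and starts fresh on every call.
sub check_list {
    my ($ext) = @_;
    return (() = glob("$dir/*.$ext")) ? 1 : 0;
}

for my $ext (qw(txt doc)) {
    printf "%-3s scalar=%d list=%d\n", $ext, check_scalar($ext), check_list($ext);
}
```

This prints "txt scalar=1 list=1" followed by "doc scalar=0 list=1": the second scalar call merely finishes iterating over the old *.txt list and never looks at *.doc.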

Justin C

Aug 11, 2008, 10:11:28 AM
On 2008-08-07, Peter J. Holzer <hjp-u...@hjp.at> wrote:
> On 2008-08-06 10:18, Justin C <justi...@purestblue.com> wrote:
>> On 2008-08-06, fidokomik <fido...@gmail.com> wrote:
>>> I have directory, say "c:\documents" on Windows or "/home/petr/
>>> documents" on Linux. In this directory many files are stored with many
>>> filetypes (extensions), say *.doc, *.txt, *.zip. I need to find fastes
>>> way how to check if some filetype exist. Now I use routine where I use
>>> readdir() and return true if passed filetype is first time found but
>>> when I have huge number of files but passed filetype is not found then
>>> routine is very slow before return false.
>>> Any idea?
>>
>> ls | egrep doc\|txt\|zip
>
> Please note that the OP passes *one* filetype. So that would be
>
Well spotted. Thanks for pointing it out.

[snip]

>> I know you want to use perl, but perl won't be as fast as this... unless
>> there are a very, very large[1] number of files.
>
> Perl is also likely to be faster for a small number of files - spawning
> a shell which then spawns two other programs is not exactly a cheap
> operation.

Gasp! You mean, you don't *always* have a TERM to hand?! :) I see
what you mean, I was just thinking quick and dirty, and not "write once,
use many"... which is a habit I'm trying to cultivate.

Justin.

Jürgen Exner

Aug 14, 2008, 10:37:08 AM
fidokomik <fido...@gmail.com> wrote:
>I have directory, say "c:\documents" on Windows or "/home/petr/
>documents" on Linux. In this directory many files are stored with many
>filetypes (extensions), say *.doc, *.txt, *.zip. I need to find fastes
>way how to check if some filetype exist.

I would simply use a glob and check if it returns any results:

if (<*.doc>) {
    print ".doc file(s) found\n";
}

The only downside: it will read the directory in full even if the first
file is already a match.
The only way I know to avoid that is to do what you are doing already:
loop through the directory using readdir() and abort as soon as the
first match is found.
Which of these is faster in your environment is something you will have
to benchmark.

jue
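
One way to run such a benchmark is with the core Benchmark module. The following is a sketch under assumptions: a scratch directory of 500 .txt files plus one .doc stands in for the real data, and the helper names by_readdir and by_glob are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempdir);

# Assumption: this scratch directory stands in for the real one;
# adjust the counts to resemble your own data.
my $dir = tempdir(CLEANUP => 1);
for my $name ((map { "file$_.txt" } 1 .. 500), 'report.doc') {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}

# readdir with an early exit on the first match.
sub by_readdir {
    my ($ext) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my $found = 0;
    while (defined(my $entry = readdir $dh)) {
        if ($entry =~ /\.\Q$ext\E\z/) { $found = 1; last }
    }
    closedir $dh;
    return $found;
}

# glob in list context: always reads the whole directory.
sub by_glob {
    my ($ext) = @_;
    return (() = glob("$dir/*.$ext")) ? 1 : 0;
}

# Compare a hit (.doc) and a miss (.zip) for both approaches.
cmpthese(1000, {
    readdir_hit  => sub { by_readdir('doc') },
    readdir_miss => sub { by_readdir('zip') },
    glob_hit     => sub { by_glob('doc') },
    glob_miss    => sub { by_glob('zip') },
});
```

The numbers depend heavily on the file system, the directory size and where the matching file happens to sit in readdir order, so results from a scratch directory are only a rough guide.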
