Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Optimizing tips for os.listdir

155 views
Skip to first unread message

Thomas

unread,
Sep 27, 2004, 6:44:24 AM9/27/04
to
I'm doing this :

[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]

to get a list of folders in a given directory, skipping all plain
files. When used on folders with lots of files, it takes rather long
time to finish. Just doing a listdir, filtering out all plain files
and a couple of joins, I didn't think this would take so long.

Is there a faster way of doing stuff like this?

Best regards,
Thomas

Nick Craig-Wood

unread,
Sep 27, 2004, 10:30:18 AM9/27/04
to
Thomas <20...@weholt.org> wrote:
> I'm doing this :
>
> [os.path.join(path, p) for p in os.listdir(path) if \
> os.path.isdir(os.path.join(path, p))]
>
> to get a list of folders in a given directory, skipping all plain
> files. When used on folders with lots of files, it takes rather long
> time to finish. Just doing a listdir, filtering out all plain files
> and a couple of joins, I didn't think this would take so long.

How many files, what OS and what filing system?

Under a unix based OS the above will translate to 1
opendir()/readdir()/closedir() and 1 stat() for each file. There
isn't a quicker way in terms of system calls AFAIK.

However some filing systems handle lots of files in a directory better
than others. Eg reiserfs is much better than ext2/3 for this purpose.
(ext3 has a dirhash module to fix this in the works though).

Eg on my linux box, running ext3, with various numbers of files in a
directory :-

/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]'

Files Time

10 3.01e+02 usec per loop
100 2.74e+03 usec per loop
1000 2.73e+04 usec per loop
10000 2.76e+05 usec per loop
100000 2.81e+06 usec per loop

Which is pretty linear... much more so than I expected!

The above timings ignore the effect of caching - will the directory
you are enumerating be hot in the cache?

Something similar may apply under Windows but I don't know ;-)

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Peter Otten

unread,
Sep 27, 2004, 11:01:18 AM9/27/04
to
Thomas wrote:

If you are on windows, Michael Peuser reported a significant speed up
obtained by using win32api.FindFiles() instead of the portable stuff:

http://groups.google.com/groups?selm=bhalk1%24qlv%2406%241%40news.t-online.com

Peter

Bengt Richter

unread,
Sep 27, 2004, 2:24:31 PM9/27/04
to
On 27 Sep 2004 14:30:18 GMT, Nick Craig-Wood <ni...@craig-wood.com> wrote:

>Thomas <20...@weholt.org> wrote:
>> I'm doing this :
>>
>> [os.path.join(path, p) for p in os.listdir(path) if \
>> os.path.isdir(os.path.join(path, p))]

You ought to be able to gain a little by hoisting the os.path.xxx
attribute lookups for join and isdir out of the loop. E.g, (not tested)

opj=os.path.join; oisd=os.path.isdir
[opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]

But it seems like you are asking the os to chase through full paths at
every isdir operation, rather than just telling it to make its current working
directory the directory you are interested in and doing it there. E.g., (untested)

savedir = os.getcwd()
os.chdir(path)
dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
os.chdir(savedir)


>>
>> to get a list of folders in a given directory, skipping all plain
>> files. When used on folders with lots of files, it takes rather long
>> time to finish. Just doing a listdir, filtering out all plain files
>> and a couple of joins, I didn't think this would take so long.
>

I'd be curious to know how much difference the above would make.

>How many files, what OS and what filing system?
>
>Under a unix based OS the above will translate to 1
>opendir()/readdir()/closedir() and 1 stat() for each file. There
>isn't a quicker way in terms of system calls AFAIK.

Except IWT chdir could help there too?


>
>However some filing systems handle lots of files in a directory better
>than others. Eg reiserfs is much better than ext2/3 for this purpose.
>(ext3 has a dirhash module to fix this in the works though).
>
>Eg on my linux box, running ext3, with various numbers of files in a
>directory :-
>
>/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
>'[os.path.join(path, p) for p in os.listdir(path) if \
>os.path.isdir(os.path.join(path, p))]'
>

path="." might be a special case in some of that though. I would
try a long absolute path for comparison. (What did the OP have as
actual use case?)

>Files Time
>
>10 3.01e+02 usec per loop
>100 2.74e+03 usec per loop
>1000 2.73e+04 usec per loop
>10000 2.76e+05 usec per loop
>100000 2.81e+06 usec per loop
>
>Which is pretty linear... much more so than I expected!
>
>The above timings ignore the effect of caching - will the directory
>you are enumerating be hot in the cache?

Even if so, I doubt the os finds it via a hash of the full path instead
of checking that every element of the path exists and is a subdirectory.
IWT that could be a dangerous short cut, whereas chdir and using the cwd
should be fast and safe and most likely guarantee cached content availability.

Just guessing, though ;-)

>
>Something similar may apply under Windows but I don't know ;-)
>
>--
>Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Regards,
Bengt Richter

Brian Beck

unread,
Sep 27, 2004, 3:09:36 PM9/27/04
to
Thomas wrote:
> I'm doing this :
>
> [os.path.join(path, p) for p in os.listdir(path) if \
> os.path.isdir(os.path.join(path, p))]
>

I don't have much time to experiment, but I came up with a method using
os.walk:

import os
def childdirs(path):
for root, dirs, files in os.walk(path):
return [os.path.join(root, name) for name in dirs]

This appears to be about the same speed as your code -- for all I know,
os.walk could be doing the same exact thing under the hood. But perhaps
someone could use this and come up with a one-liner -- as of now I can't
think of a way to do it properly without containing it in a function.

--
Brian Beck
Adventurer of the First Order

Peter Hansen

unread,
Sep 27, 2004, 3:36:42 PM9/27/04
to
Bengt Richter wrote:
> But it seems like you are asking the os to chase through full paths at
> every isdir operation, rather than just telling it to make its current working
> directory the directory you are interested in and doing it there. E.g., (untested)
>
> savedir = os.getcwd()
> os.chdir(path)
> dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
> os.chdir(savedir)

One aspect of this which needs mentioning is that it could
cause incorrect behaviour of other parts of the application
if there are multiple threads relying on using the current
working directory... the CWD is global to the application, AFAIK.

-Peter

G. S. Hayes

unread,
Sep 27, 2004, 6:32:10 PM9/27/04
to
Nick Craig-Wood <ni...@craig-wood.com> wrote in message news:<slrnclg6v...@irishsea.home.craig-wood.com>...

> Thomas <20...@weholt.org> wrote:
> > [os.path.join(path, p) for p in os.listdir(path) if \
> > os.path.isdir(os.path.join(path, p))]
> >
> > to get a list of folders in a given directory, skipping all plain
> > files. When used on folders with lots of files, it takes rather long
> > time to finish. Just doing a listdir, filtering out all plain files
> > and a couple of joins, I didn't think this would take so long.
> How many files, what OS and what filing system?
>
> Under a unix based OS the above will translate to 1
> opendir()/readdir()/closedir() and 1 stat() for each file. There
> isn't a quicker way in terms of system calls AFAIK.

Under Linux, readdir() returns a struct dirent that has a d_type
member indicating the file type (DT_DIR for directories) so you can
avoid calling stat() on each file. I thought some BSD systems did
this as well.

I don't see how to get at this information from Python without making
the extra syscalls.

Kjetil Torgrim Homme

unread,
Sep 27, 2004, 8:53:05 PM9/27/04
to
[Nick Craig-Wood]:
>
> [Thomas]:

> >
> > I'm doing this :
> >
> > [os.path.join(path, p) for p in os.listdir(path) \
> > if os.path.isdir(os.path.join(path, p))]
>
> Under a unix based OS the above will translate to 1
> opendir()/readdir()/closedir() and 1 stat() for each file. There
> isn't a quicker way in terms of system calls AFAIK.

there's a well known trick for Unix file systems[1]:

if a directory has a link count of exactly 2, there are no
subdirectories. if there was a subdirectory, it would contain a ".."
refering to its parent, thereby raising the parent's link count to 3.

(it's not clear to me whether this is helpful for the OP, though.)


[1] the behaviour is not mandated by POSIX. e.g., a mounted ISO
9660-file system will not necessarily obey this "rule". GNU find
relies on it unless you explicitly specify "-noleaf", so it works
pretty well in practice.
--
Kjetil T.

Nick Craig-Wood

unread,
Sep 28, 2004, 5:30:00 AM9/28/04
to
Bengt Richter <bo...@oz.net> wrote:
> On 27 Sep 2004 14:30:18 GMT, Nick Craig-Wood <ni...@craig-wood.com> wrote:
>
> You ought to be able to gain a little by hoisting the os.path.xxx
> attribute lookups for join and isdir out of the loop. E.g, (not tested)
>
> opj=os.path.join; oisd=os.path.isdir
> [opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]
>
> But it seems like you are asking the os to chase through full paths at
> every isdir operation, rather than just telling it to make its current working
> directory the directory you are interested in and doing it there. E.g., (untested)
>
> savedir = os.getcwd()
> os.chdir(path)
> dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
> os.chdir(savedir)

with 1000 files in the directory

# 1) Original using '.'


/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if os.path.isdir(os.path.join(path, p))]'

10 loops, best of 3: 2.69e+04 usec per loop

# 2) Original with long path
/usr/lib/python2.3/timeit.py -s 'import os;
path="/tmp/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z"' \
'[os.path.join(path, p) for p in os.listdir(path) if os.path.isdir(os.path.join(path, p))]'
10 loops, best of 3: 4.16e+04 usec per loop

# 3) Using cwd
/usr/lib/python2.3/timeit.py -s 'import os' \
'[os.path.join(path, p) for p in os.listdir(".") if os.path.isdir(p)]'
100 loops, best of 3: 1.85e+04 usec per loop

> > The above timings ignore the effect of caching - will the directory
> > you are enumerating be hot in the cache?
> Even if so, I doubt the os finds it via a hash of the full path instead
> of checking that every element of the path exists and is a subdirectory.
> IWT that could be a dangerous short cut

It is. Linux will look through each path entry. However they will be
hot in the dcache. It doesn't take much time hence the relatively
small difference between 1) and 2).

I expect the main difference between 1) and 3) is the fact it contains
one less os.path.join()

/usr/lib/python2.3/timeit.py -s 'import os;' 'os.path.join("a", "b")'
100000 loops, best of 3: 7.34 usec per loop

Its executed 1000 times above which is 7340 usec. The difference
between 1) and 3) is 8400 usec - pretty close!

G. S. Hayes

unread,
Sep 29, 2004, 4:49:47 PM9/29/04
to
sjde...@yahoo.com (G. S. Hayes) wrote in message news:<96c2e938.04092...@posting.google.com>...

> Nick Craig-Wood <ni...@craig-wood.com> wrote in message news:<slrnclg6v...@irishsea.home.craig-wood.com>...
> > Under a unix based OS the above will translate to 1
> > opendir()/readdir()/closedir() and 1 stat() for each file. There
> > isn't a quicker way in terms of system calls AFAIK.
>
> Under Linux, readdir() returns a struct dirent that has a d_type
> member indicating the file type (DT_DIR for directories) so you can
> avoid calling stat() on each file. I thought some BSD systems did
> this as well.

Offtopic since it's really not Python related, (though I guess Python
might want to consider exposing this functionality in a portable way
eventually):

As a quick followup, with 10000 files on my machine it takes about
twice as long to use stat to get this information as to access the
d_type field. And it costs an extra 10000 syscalls (the d_type one is
about 93 syscalls total, mostly standard program startup/shutdown
costs like mapping in shared libs, flushing output on exit, etc).

On the other hand, they both execute in under a second. So for most
programs the difference in speed is probably negligible, and the
programming cost of portably choosing which method you want to use
probably isn't worth it in general (I could maybe see it for
specialized applications).

0 new messages