[Python-ideas] os.listdir iteration support

Giampaolo Rodola'

Nov 22, 2007, 4:04:05 PM
to python...@python.org
Hi to all,
I would find it very useful to have a version of os.listdir that returns a
generator.
If a directory has many files, say 20,000, it could take a long time
to get all of them with os.listdir, and this could be a problem in
asynchronous environments (e.g. asynchronous servers).

The only solution that comes to my mind in such a case is using a
thread/fork or having a non-blocking version of listdir() returning an
iterator.

What do you think about that?
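
For readers on present-day Python 3, the thread workaround mentioned in this post might look roughly like the following. This is only a minimal sketch, assuming asyncio (which postdates this thread) and Python 3.7 or later; the names listdir_in_thread and count_entries are hypothetical.

```python
import asyncio
import os


async def listdir_in_thread(path):
    """Run the blocking os.listdir() call in the default thread pool.

    The event loop stays free to serve other tasks while the worker
    thread waits on the filesystem. The function name is hypothetical.
    """
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, os.listdir, path)


async def count_entries(path):
    """Example caller: await the listing, then use it as a normal list."""
    names = await listdir_in_thread(path)
    return len(names)
```

Calling asyncio.run(count_entries(".")) would return the number of entries in the current directory. The whole list is still built in one go; the thread only keeps the event loop from blocking while that happens.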
_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

Terry Reedy

Nov 22, 2007, 6:25:06 PM
to python...@python.org

"Giampaolo Rodola'" <gne...@gmail.com> wrote
in message
news:d827975f-7c1e-471e...@d27g2000prf.googlegroups.com...

| I would find it very useful to have a version of os.listdir that returns a
| generator.

If there are no technical issues in the way, such a replacement (rather
than addition) would be in line with other list -> iterator replacements in
3.0 (range, dict.items, etc). A list could then be obtained with
list(os.listdir(path)).

tjr

Guido van Rossum

Nov 22, 2007, 8:40:45 PM
to Terry Reedy, python...@python.org
On Nov 22, 2007 3:25 PM, Terry Reedy <tjr...@udel.edu> wrote:
> "Giampaolo Rodola'" <gne...@gmail.com> wrote

> > I would find it very useful to have a version of os.listdir that returns a
> > generator.
>
> If there are no technical issues in the way, such a replacement (rather
> than addition) would be in line with other list -> iterator replacements in
> 3.0 (range, dict.items, etc). A list could then be obtained with
> list(os.listdir(path)).

But how common is this use case really?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Aahz

Nov 22, 2007, 11:59:02 PM
to python...@python.org
On Thu, Nov 22, 2007, Giampaolo Rodola' wrote:
>
> I would find it very useful to have a version of os.listdir that returns
> a generator. If a directory has many files, say 20,000, it could take
> a long time to get all of them with os.listdir, and this could be a
> problem in asynchronous environments (e.g. asynchronous servers).
>
> The only solution that comes to my mind in such a case is using a
> thread/fork or having a non-blocking version of listdir() returning an
> iterator.
>
> What do you think about that?

-1

The problem is that reading a directory requires an open file handle;
given a generator context, there's no clear mechanism for determining
when to close the handle. Because the list needs to be created in the
first place, why bother with a generator?
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"Typing is cheap. Thinking is expensive." --Roy Smith

Adam Atlas

Nov 23, 2007, 12:54:48 AM
to python...@python.org

On 22 Nov 2007, at 23:59, Aahz wrote:
> The problem is that reading a directory requires an open file handle;
> given a generator context, there's no clear mechanism for determining
> when to close the handle.

Whenever the generator is __del__ed, or whenever the iteration
completes, whichever comes first?

> Because the list needs to be created in the first place

How so?

Greg Ewing

Nov 23, 2007, 2:01:43 AM
to python...@python.org
Adam Atlas wrote:
> On 22 Nov 2007, at 23:59, Aahz wrote:
>
>>The problem is that reading a directory requires an open file handle;
>>given a generator context, there's no clear mechanism for determining
>>when to close the handle.
>
> Whenever the generator is __del__ed, or whenever the iteration
> completes, whichever comes first?

Maybe what we really want is the functionality of
the C opendir and readdir functions exposed in the os
module. Then we could have an explicit method for
closing the file handle.

--
Greg

Neil Toronto

Nov 23, 2007, 2:18:37 AM
to python...@python.org
Adam Atlas wrote:
> On 22 Nov 2007, at 23:59, Aahz wrote:
>> Because the list needs to be created in the first place
>
> How so?

It doesn't, actually. On Windows, os.listdir uses FindFirstFile and
FindNextFile, on OS2 it's DosFindFirst and DosFindNext, and on
everything else it's Posix opendir and readdir. All of these are
incremental, so a generator is the most natural way to expose the
underlying API.

That's just a set of facts and a single opinion. Past that I personally
have no preference.

Neil

Georg Brandl

Nov 23, 2007, 3:06:43 AM
to python...@python.org
Greg Ewing wrote:

> Adam Atlas wrote:
>> On 22 Nov 2007, at 23:59, Aahz wrote:
>>
>>>The problem is that reading a directory requires an open file handle;
>>>given a generator context, there's no clear mechanism for determining
>>>when to close the handle.
>>
>> Whenever the generator is __del__ed, or whenever the iteration
>> completes, whichever comes first?
>
> Maybe what we really want is the functionality of
> the C opendir and readdir functions exposed in the os
> module. Then we could have an explicit method for
> closing the file handle.

What about an os.iterdir() generator which uses opendir/readdir as proposed?
The generator's close() could also call closedir(), and you could have a
warning in the docs about making sure to have it closed at some point.
One could even use an enclosing with closing(os.iterdir()) as d: block.
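
os.iterdir() was only a proposal at this point in the thread. To illustrate the generator-plus-close() idea, here is a minimal sketch for present-day Python 3 (3.6 or later). It is built on os.scandir(), which is not mentioned in the thread; the names iterdir and first_names are hypothetical.

```python
import os
from contextlib import closing
from itertools import islice


def iterdir(path):
    """Yield entry names from `path` one at a time (hypothetical helper)."""
    scanner = os.scandir(path)  # opens the underlying directory handle
    try:
        for entry in scanner:
            yield entry.name
    finally:
        # Runs when iteration completes, or when close() is called on
        # the suspended generator (GeneratorExit is raised at the yield).
        scanner.close()


def first_names(path, limit):
    """Read at most `limit` names, then release the handle explicitly."""
    with closing(iterdir(path)) as d:
        return list(islice(d, limit))
```

Here closing() calls the generator's close() on exit from the with block, so the directory handle is released even though iteration stopped early.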

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

Greg Ewing

Nov 23, 2007, 6:11:35 AM
to python...@python.org
Georg Brandl wrote:
> What about an os.iterdir() generator which uses opendir/readdir as proposed?

I was feeling in the mood for a diversion, so I whipped up
a Pyrex prototype of an opendir() object that can be used
either as a file-like object or an iterator.

Here's the docstring:

"""opendir(pathname) --> an open directory object

Opens a directory and provides incremental access to
the filenames it contains. May be used as a file-like
object or as an iterator.

When used as a file-like object, each call to read()
returns one filename, or an empty string when the end
of the directory is reached. The close() method should
be called when finished with the directory.

The close() method should also be called when used as
an iterator and iteration is stopped prematurely. If
iteration proceeds to completion, the directory is
closed automatically."""

Source, setup.py and a brief test attached.
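
The attached Pyrex source is not reproduced in this archive. As a rough illustration of the behaviour the docstring describes, a pure-Python approximation for present-day Python 3 (3.6 or later) might look like this. It is a sketch under stated assumptions, not Greg's implementation: the class name OpenDir is hypothetical, it is built on os.scandir(), and it omits tell() and seek().

```python
import os


class OpenDir:
    """Hypothetical approximation of the opendir() object described above."""

    def __init__(self, pathname):
        self._scanner = os.scandir(pathname)  # holds the directory handle

    def read(self):
        """Return the next filename, or '' once the directory is exhausted."""
        if self._scanner is None:
            # Simplification: a closed object just reports end of directory.
            return ""
        try:
            return next(self._scanner).name
        except StopIteration:
            self.close()  # nothing left to read, so release the handle
            return ""

    def close(self):
        """Release the directory handle; safe to call more than once."""
        if self._scanner is not None:
            self._scanner.close()
            self._scanner = None

    def __iter__(self):
        return self

    def __next__(self):
        name = self.read()
        if not name:
            raise StopIteration
        return name
```

One design choice differs slightly from the docstring: in this sketch read() also closes the handle when it reaches the end, so both styles of use clean up after a complete pass, and close() is only needed when reading stops early.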

--
Greg

opendir.pyx
setup.py
test.py

Giampaolo Rodola'

Nov 23, 2007, 9:06:01 AM
to python...@python.org
IMHO, not so unusual.
The first examples that come to my mind are HTTP and FTP servers, which
commonly have to list the contents of local directories.
FTP servers, in particular, have to do that VERY often.

On 23 Nov, 02:40, "Guido van Rossum" <gu...@python.org> wrote:


> On Nov 22, 2007 3:25 PM, Terry Reedy <tjre...@udel.edu> wrote:
>
> > "Giampaolo Rodola'" <gne...@gmail.com> wrote
> > > I would find it very useful to have a version of os.listdir that returns a
> > > generator.
>
> > If there are no technical issues in the way, such a replacement (rather
> > than addition) would be in line with other list -> iterator replacements in
> > 3.0 (range, dict.items, etc). A list could then be obtained with
> > list(os.listdir(path)).
>
> But how common is this use case really?
>
> --
> --Guido van Rossum (home page: http://www.python.org/~guido/)

Giampaolo Rodola'

Nov 23, 2007, 9:12:30 AM
to python...@python.org
On 23 Nov, 12:11, Greg Ewing <greg.ew...@canterbury.ac.nz> wrote:
> Georg Brandl wrote:

> from opendir import opendir
>
> print "READ"
> d = opendir(".")
> while 1:
>     name = d.read()
>     if not name:
>         break
>     print " ", name
> print "EOF"
>
> print "ITERATE"
> d = opendir(".")
> for name in d:
>     print " ", name
> print "STOP"
>
> print "TELL/SEEK"
> d = opendir(".")
> for i in range(3):
>     name = d.read()
>     print " ", name
> pos = d.tell()
> for i in range(3):
>     name = d.read()
>     print " ", name
> d.seek(pos)
> while 1:
>     name = d.read()
>     if not name:
>         break
>     print " ", name
> print "EOF"

This is exactly the usage I was talking about.

Aahz

Nov 23, 2007, 9:39:39 AM
to python...@python.org
On Fri, Nov 23, 2007, Adam Atlas wrote:
> On 22 Nov 2007, at 23:59, Aahz wrote:
>>
>> The problem is that reading a directory requires an open file handle;
>> given a generator context, there's no clear mechanism for determining
>> when to close the handle.
>
> Whenever the generator is __del__ed, or whenever the iteration
> completes, whichever comes first?

Enh. That is not reliable without work, and getting it reliable is a
waste of work. The proposed idea for adding an opendir() function is
workable, but it still doesn't solve the need for closing the handle
within listdir().

No matter what, changing the semantics of listdir() to leave a handle
lying around is going to cause problems for some people.

>> Because the list needs to be created in the first place
>
> How so?

If you're going to ask a question, it would be nice to leave the entire
original context in place, especially given that it's not a particularly
long chunk of text.

Anyway, the Windows case aside, if you don't have a reliable close()
mechanism, you need to slurp the whole thing into a list in one swell
foop so that you can just close the handle. Even in the Windows case,
you need a handle, and I don't know what the consequences are of leaving
it lying around.

"Typing is cheap. Thinking is expensive." --Roy Smith

Guido van Rossum

Nov 23, 2007, 3:23:37 PM
to Giampaolo Rodola', python...@python.org
But how many FTP servers are written in Python *and* have directories
with 20,000 files in them?

--Guido

Giampaolo Rodola'

Nov 23, 2007, 4:26:40 PM
to python...@python.org
On 23 Nov, 21:23, "Guido van Rossum" <gu...@python.org> wrote:
> But how many FTP servers are written in Python *and* have directories
> with 20,000 files in them?
>
> --Guido

I sincerely don't know.
Surely it's a rather specific use case, but it is one of the tasks
which takes the longest amount of time on an FTP server. 20,000 is
probably an exaggerated hypothetical situation, so I did a simple test
with a more realistic scenario.
On Windows a very crowded directory is C:\windows\system32. Currently
the C:\windows\system32 of my Windows XP workstation contains 2201
files.
I tried to run the code below, which is how an FTP server should
properly respond to a "LIST" command issued by a client.
It took 1.70300006866 seconds to complete the first time and
0.266000032425 seconds the second time.
I don't know whether such a specific use case could justify adding
generator support for listdir to the stdlib, but having something like
Greg Ewing's opendir module could have saved a lot of time in this
specific case.


-- Giampaolo


import os, stat, time
from tarfile import filemode
try:
    import pwd, grp
except ImportError:
    pwd = grp = None


def format_list(directory):
    """Return a directory listing emulating "/bin/ls -lA" UNIX
    command output.

    This is how output appears to client:
    -rw-rw-rw-   1 owner    group     7045120 Sep 02  3:47 music.mp3
    drwxrwxrwx   1 owner    group           0 Aug 31 18:50 e-books
    -rw-rw-rw-   1 owner    group         380 Sep 02  3:40 module.py
    """
    listing = os.listdir(directory)

    result = []
    for basename in listing:
        file = os.path.join(directory, basename)

        # if the file is a broken symlink, use lstat to get stat for
        # the link
        try:
            stat_result = os.stat(file)
        except (OSError,AttributeError):
            stat_result = os.lstat(file)

        perms = filemode(stat_result.st_mode)  # permissions

        nlinks = stat_result.st_nlink  # number of links to inode
        if not nlinks:  # non-posix system, let's use a bogus value
            nlinks = 1

        if pwd and grp:
            # get user and group name, else just use the raw uid/gid
            try:
                uname = pwd.getpwuid(stat_result.st_uid).pw_name
            except KeyError:
                uname = stat_result.st_uid
            try:
                gname = grp.getgrgid(stat_result.st_gid).gr_name
            except KeyError:
                gname = stat_result.st_gid
        else:
            # on non-posix systems the only chance we use default
            # bogus values for owner and group
            uname = "owner"
            gname = "group"

        size = stat_result.st_size  # file size

        # stat.st_mtime could fail (-1) if file's last modification
        # time is too old, in that case we return local time as last
        # modification time.
        try:
            mtime = time.strftime("%b %d %H:%M",
                                  time.localtime(stat_result.st_mtime))
        except ValueError:
            mtime = time.strftime("%b %d %H:%M")

        # if the file is a symlink, resolve it, e.g. "symlink ->
        # real_file"
        if stat.S_ISLNK(stat_result.st_mode):
            basename = basename + " -> " + os.readlink(file)

        # formatting is matched with proftpd ls output
        result.append("%s %3s %-8s %-8s %8s %s %s\r\n" %(
            perms, nlinks, uname, gname, size, mtime, basename))

    return ''.join(result)

if __name__ == '__main__':
    before = time.time()
    format_list(r'C:\windows\system32')
    print time.time() - before

Aahz

Nov 24, 2007, 7:29:17 PM
to python...@python.org
On Fri, Nov 23, 2007, Giampaolo Rodola' wrote:
>
> Surely it's a rather specific use case, but it is one of the tasks
> which takes the longest amount of time on an FTP server. 20,000 is
> probably an exaggerated hypothetical situation, so I did a simple test
> with a more realistic scenario.
> On Windows a very crowded directory is C:\windows\system32. Currently
> the C:\windows\system32 of my Windows XP workstation contains 2201
> files.
> I tried to run the code below, which is how an FTP server should
> properly respond to a "LIST" command issued by a client.
> It took 1.70300006866 seconds to complete the first time and
> 0.266000032425 seconds the second time.

Your code calls os.stat() on each file. I know from past experience
that os.stat() is *extremely* expensive. Because os.listdir() runs at C
speed, it only gets slow when run against hundreds of thousands of
entries.

(One directory on a work server has over 200K entries, and it takes
os.listdir() about twenty seconds. I believe that if we switched from
ext3 to something more appropriate that would get reduced.)
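
One way to check where the time goes is to time the directory read and the per-file stat calls separately. The following is a minimal sketch for present-day Python 3, not something posted in the thread; the function name time_listing is hypothetical, and the numbers it reports will depend heavily on the filesystem and on caching.

```python
import os
import time


def time_listing(directory):
    """Return (entry_count, listdir_seconds, stat_seconds) for `directory`."""
    # Phase 1: read the directory names only.
    start = time.perf_counter()
    names = os.listdir(directory)
    listdir_seconds = time.perf_counter() - start

    # Phase 2: one os.stat() call per entry, as a LIST-style listing needs.
    start = time.perf_counter()
    for name in names:
        try:
            os.stat(os.path.join(directory, name))
        except OSError:
            pass  # broken symlink or entry removed meanwhile; skip it
    stat_seconds = time.perf_counter() - start

    return len(names), listdir_seconds, stat_seconds
```

Running time_listing() twice on the same directory also gives a rough view of the caching effect behind the difference between the first and second runs quoted above.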

> I don't know whether such a specific use case could justify adding
> generator support for listdir to the stdlib, but having something like
> Greg Ewing's opendir module could have saved a lot of time in this
> specific case.

Doubtful.

"Typing is cheap. Thinking is expensive." --Roy Smith
