walking a directory with very many files


tom

Jun 14, 2009, 12:35:24 PM
i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

in other languages one can avoid generating such an object by walking
a directory as a liked list. for example, in c, perl or php one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. it seems this could be more efficient in some
applications.

is there a way to do this in python? i'm relatively new to the
language. i looked through the documentation and tried googling but
came up empty.

Tim Golden

Jun 14, 2009, 1:35:23 PM
to pytho...@python.org

If you're on Windows, you can use the win32file.FindFilesIterator
function from the pywin32 package. (Which wraps the Win32 API
FindFirstFile / FindNextFile pattern).

TJG

tom

Jun 14, 2009, 2:32:12 PM
On Jun 14, 1:35 pm, Tim Golden <m...@timgolden.me.uk> wrote:
>
> If you're on Windows, you can use the win32file.FindFilesIterator
> function from the pywin32 package. (Which wraps the Win32 API
> FindFirstFile / FindNextFile pattern).

thanks, tim.

however, i'm not using windows. freebsd and os x.

Tim Golden

Jun 14, 2009, 3:21:51 PM
to pytho...@python.org

Presumably, if Perl etc. can do it then it should be simple
enough to drop into ctypes and call the same library code, no?
(I'm not a BSD / OS X person, I'm afraid, so perhaps this isn't
so easy...)

TJG

Andre Engels

Jun 14, 2009, 4:35:50 PM
to tom, pytho...@python.org

What kind of directories are those that just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...


--
André Engels, andre...@gmail.com

Christian Heimes

Jun 14, 2009, 7:47:03 PM
to pytho...@python.org
tom schrieb:

Some time ago we had a discussion about turning os.listdir() into a
generator. No conclusion was agreed on. We also thought about exposing
the functions opendir(), readdir(), closedir() and friends but as far as
I know and as far as I've checked the C code in Modules/posixmodule.c
none of the functions has been added.

For now you are on your own to implement wrappers for the system calls.
For the distant future you may see the appropriate functions in the os
module. A mail to the python ideas list may increase your chances. ;)

Christian

Terry Reedy

Jun 14, 2009, 7:48:15 PM
to pytho...@python.org

You did not specify version. In Python3, os.walk has become a generator
function. So, to answer your question, use 3.1.

tjr

Christian Heimes

Jun 14, 2009, 7:50:02 PM
to pytho...@python.org
Andre Engels wrote:
> What kind of directories are those that just a list of files would
> result in a "very large" object? I don't think I have ever seen
> directories with more than a few thousand files...

I've seen directories with several hundred thousand files. Depending
on the file system and IO capacity it can take from one to several
minutes until 'ls' even starts to print out file names. It's no fun on
directories on a CIFS storage or ext3 w/o a btree dir index.

Christian

MRAB

Jun 14, 2009, 8:06:20 PM
to pytho...@python.org
Christian Heimes wrote:
> tom schrieb:
> Some time ago we had a discussion about turning os.listdir() into a
> generator. No conclusion was agreed on. We also thought about exposing
> the functions opendir(), readdir(), closedir() and friends but as far as
> I know and as far as I've checked the C code in Modules/posixmodule.c
> none of the functions has been added.
>
Perhaps if there's a generator it should be called iterdir(). Or would
it be unPythonic to have listdir() and iterdir()? Probably.
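
For later readers: a method with exactly this name did eventually appear, as pathlib.Path.iterdir() (Python 3.4 and later, so not available when this thread was written). What follows is only a minimal sketch using a throwaway temporary directory; the helper name names_via_iterdir is made up for illustration, and the example does not rely on whether iterdir() reads the directory lazily under the hood, since that is an implementation detail that may differ between versions.

```python
import tempfile
from pathlib import Path

def names_via_iterdir(directory):
    """Collect entry names by looping over Path.iterdir() one entry at a time."""
    found = []
    for entry in Path(directory).iterdir():  # yields Path objects, never "." or ".."
        found.append(entry.name)
    return sorted(found)

# Build a small scratch directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for n in ("a.txt", "b.txt", "c.txt"):
        (Path(tmp) / n).write_text("x")
    demo_names = names_via_iterdir(tmp)

print(demo_names)
```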

Christian Heimes

Jun 14, 2009, 8:06:18 PM
to pytho...@python.org
Terry Reedy wrote:
> You did not specify version. In Python3, os.walk has become a generator
> function. So, to answer your question, use 3.1.

I'm sorry to inform you that Python 3.x still returns a list, not a
generator.


Python 3.1rc1+ (py3k:73396, Jun 12 2009, 22:45:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> type(os.listdir("."))
<class 'list'>

Christian
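
A note for anyone reading this much later: os.listdir() still returns a list, but Python 3.5 added os.scandir(), which returns an iterator of directory entries instead (its use as a context manager needs 3.6). That function did not exist when this thread was written, so treat the following as a sketch for newer interpreters only; the temporary directory and file name are just scaffolding for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "one.txt"), "w").close()

    listing = os.listdir(tmp)            # eager: a fully built list of names
    listing_type = type(listing)

    with os.scandir(tmp) as entries:     # an iterator of DirEntry objects
        is_iterator = iter(entries) is entries
        first_name = next(entries).name  # pull a single entry on demand

print(listing_type, is_iterator, first_name)
```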

Lawrence D'Oliveiro

Jun 14, 2009, 8:45:43 PM
In message <mailman.1560.1245011...@python.org>, Andre
Engels wrote:

> On Sun, Jun 14, 2009 at 6:35 PM, tom<f...@thefsb.org> wrote:
>
>> in other languages one can avoid generating such an object by walking
>> a directory as a liked list.

I suppose it depends how well-liked it is. Nerdy lists may work better, but
they tend not to be liked.

> What kind of directories are those that just a list of files would
> result in a "very large" object? I don't think I have ever seen
> directories with more than a few thousand files...

I worked on an application system which, at one point, routinely dealt with
directories containing hundreds of thousands of files. But even that kind of
directory contents only adds up to a few megabytes.

Tim Chase

Jun 14, 2009, 10:07:25 PM
to Terry Reedy, pytho...@python.org
> You did not specify version. In Python3, os.walk has become a generator
> function. So, to answer your question, use 3.1.

Since at least 2.4, os.walk has itself been a generator.
However, the contents of the directory (the 3rd element of the
yielded tuple) is a list produced by listdir() instead of a
generator. Unless listdir() has been changed to a generator
instead of a list (which other respondents seem to indicate has
not been implemented), this doesn't address the OP's issue of
"lots of files in a single directory".

-tkc
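
The distinction is easy to check for yourself. This is only an illustrative sketch (written for Python 3, with a scratch directory created just for the demo): the object returned by os.walk() is a generator, but each tuple it yields carries the file names of one directory as a complete list.

```python
import os
import tempfile
import types

with tempfile.TemporaryDirectory() as top:
    for i in range(5):
        open(os.path.join(top, "file%d" % i), "w").close()

    walker = os.walk(top)
    walker_is_generator = isinstance(walker, types.GeneratorType)

    # The first yielded tuple describes the top directory itself.
    dirpath, dirnames, filenames = next(walker)
    filenames_type = type(filenames)     # a list, however many files there are
    file_count = len(filenames)
    walker.close()

print(walker_is_generator, filenames_type, file_count)
```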


Steven D'Aprano

Jun 15, 2009, 12:56:06 AM


You haven't looked very hard :)

$ pwd
/home/steve/.thumbnails/normal
$ ls | wc -l
33956

And I periodically delete thumbnails, to prevent the number of files
growing to hundreds of thousands.

--
Steven

Nick Craig-Wood

Jun 15, 2009, 4:29:34 AM

Here is a ctypes generator listdir for unix-like OSes. I tested it
under linux.

#!/usr/bin/python
"""
An equivalent of os.listdir, but as a generator using ctypes
"""

from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    _fields_ = (
        ('d_ino', c_long), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_byte), # type of file; not supported by all file system types
        ('d_name', c_char * 4096) # filename
        )
c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# FIXME Should probably use readdir_r here
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    if not isinstance(path, bytes):
        path = path.encode()  # c_char_p needs a byte string (matters on Python 3)
    dir_p = opendir(path)  # open the directory that was asked for
    if not dir_p:
        raise OSError("opendir failed for %r" % (path,))
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name  # a byte string
            if name not in (b".", b".."):
                yield name
    finally:
        closedir(dir_p)  # also runs if the caller stops iterating early

if __name__ == "__main__":
    for name in listdir("."):
        print(name)

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Hrvoje Niksic

Jun 15, 2009, 7:47:08 AM
Terry Reedy <tjr...@udel.edu> writes:

> You did not specify version. In Python3, os.walk has become a
> generator function. So, to answer your question, use 3.1.

os.walk has been a generator function all along, but that doesn't help
OP because it still uses os.listdir internally. This means that it
both creates huge lists for huge directories, and holds on to those
lists until the iteration over the directory (and all subdirectories)
is finished.

In fact, os.walk is not suited for this kind of memory optimization
because yielding a *list* of files (and a separate list of
subdirectories) is specified in its interface. This hasn't changed in
Python 3.1:

    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs
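
For contrast, here is a rough sketch of a walk that never builds a per-directory list of files. It assumes Python 3.6 or later, because it leans on os.scandir(), which was added long after this thread; the function name iter_files is invented for this example and is not part of any library. Only the directories still waiting to be visited are held in memory.

```python
import os
import tempfile

def iter_files(top):
    """Yield file paths under top one at a time, without a list of files per directory."""
    pending = [top]                      # directories still to visit
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                else:
                    yield entry.path     # handed to the caller immediately

# Small self-contained demo tree: top/a.txt and top/sub/b.txt
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(top, rel), "w").close()
    found = sorted(os.path.relpath(p, top) for p in iter_files(top))

print(found)
```

Unlike os.walk, this gives up the (dirpath, dirnames, filenames) interface, which is exactly the part that forces lists.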

Hrvoje Niksic

Jun 15, 2009, 8:00:02 AM
Nick Craig-Wood <ni...@craig-wood.com> writes:

> Here is a ctypes generator listdir for unix-like OSes.

ctypes code scares me with its duplication of the contents of system
headers. I understand its use as a proof of concept, or for hacks one
needs right now, but can anyone seriously propose using this kind of
code in a Python program? For example, this seems much more
"Linux-only", or possibly even "32-bit-Linux-only", than "unix-like":

Diez B. Roggisch

Jun 15, 2009, 8:15:25 AM
tom wrote:

> i can traverse a directory using os.listdir() or os.walk(). but if a
> directory has a very large number of files, these methods produce very
> large objects taking a lot of memory.

if we assume the number of files to be a million (which certainly qualifies
as one of the larger directory sizes one encounters...), and the average
filename length to be 20, you'd end up with 20 megs of data.

Is that really a problem on today's multi-gigabyte machines? And we are
talking about a rather freakish case here.

Diez
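
The arithmetic above can be spelled out, along with a rough correction for Python's per-object overhead. This is a back-of-the-envelope sketch only: the file count and name length are the figures assumed in the post, and the per-name overhead is whatever sys.getsizeof reports on the interpreter running it, so the second number will vary.

```python
import sys

file_count = 1000000      # assumed number of files
avg_name_len = 20         # assumed average filename length

# Raw character data only: 1,000,000 * 20 bytes.
raw_bytes = file_count * avg_name_len
print(raw_bytes / 1e6, "MB of raw name data")

# In a Python list each name is its own str object, and the list
# stores one pointer per element on top of that.
sample = "x" * avg_name_len
pointer_size = sys.getsizeof([None]) - sys.getsizeof([])
per_name = sys.getsizeof(sample) + pointer_size
estimated_bytes = file_count * per_name
print(estimated_bytes / 1e6, "MB as a Python list (rough estimate)")
```

So the list is likely a few times larger than the raw 20 megs, which still supports the point that memory alone is rarely the bottleneck.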

Nick Craig-Wood

Jun 15, 2009, 10:29:33 AM
Hrvoje Niksic <hni...@xemacs.org> wrote:
> Nick Craig-Wood <ni...@craig-wood.com> writes:
>
> > Here is a ctypes generator listdir for unix-like OSes.
>
> ctypes code scares me with its duplication of the contents of system
> headers. I understand its use as a proof of concept, or for hacks one
> needs right now, but can anyone seriously propose using this kind of
> code in a Python program? For example, this seems much more
> "Linux-only", or possibly even "32-bit-Linux-only", than
> "unix-like":

It was a proof of concept certainly..

It can be done properly with gccxml though which converts structures
into ctypes definitions.

That said the dirent struct is specified by POSIX so if you get the
correct types for all the individual members then it should be correct
everywhere. Maybe ;-)

Jean-Paul Calderone

Jun 15, 2009, 10:38:44 AM
to pytho...@python.org, Nick Craig-Wood

The problem is that POSIX specifies the fields with types like off_t and
ino_t. Since ctypes doesn't know anything about these types, application
code has to specify their size and other attributes. As these vary from
platform to platform, you can't get it correct without asking a real C
compiler.

In other words, POSIX talks about APIs and ctypes deals with ABIs.

http://pypi.python.org/pypi/ctypes_configure/0.1 helps with the problem,
and is a bit more accessible than gccxml.

It is basically correct to say that using ctypes without using something
like gccxml or ctypes_configure will give you non-portable code.

Jean-Paul
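
To make the API-versus-ABI point concrete, here is a small illustrative check. It only prints sizes that ctypes itself reports on the machine it runs on; it does not attempt to determine off_t or ino_t, which is precisely the part that needs a C compiler.

```python
import ctypes

# Fixed-width types are the same size everywhere.
fixed = {
    "c_uint32": ctypes.sizeof(ctypes.c_uint32),
    "c_int64": ctypes.sizeof(ctypes.c_int64),
}

# C's "long" and pointers are not: typically 4 bytes on 32-bit
# systems and 8 bytes on 64-bit Unix-like systems.
platform_dependent = {
    "c_long": ctypes.sizeof(ctypes.c_long),
    "c_void_p": ctypes.sizeof(ctypes.c_void_p),
}

print(fixed)
print(platform_dependent)
```

A struct declared with c_long fields therefore has a different layout from one machine to the next, and nothing in ctypes confirms that this layout matches what the C library actually uses.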

Terry Reedy

Jun 15, 2009, 3:35:04 PM
to pytho...@python.org
Christian Heimes wrote:

> Terry Reedy wrote:
>> You did not specify version. In Python3, os.walk has become a generator
>> function. So, to answer your question, use 3.1.
>
> I'm sorry to inform you that Python 3.x still returns a list, not a
> generator.

>>> type(os.walk('.'))
<class 'generator'>

However, it is a generator of directory tuples that include a filename
list produced by listdir, rather than a generator of filenames
themselves, as I was thinking. I wish listdir had been changed in 3.0
along with map, filter, and range, but I made no effort and hence cannot
complain.

tjr

Nick Craig-Wood

Jun 15, 2009, 5:29:33 PM
Jean-Paul Calderone <exa...@divmod.com> wrote:
> On Mon, 15 Jun 2009 09:29:33 -0500, Nick Craig-Wood <ni...@craig-wood.com> wrote:
> >Hrvoje Niksic <hni...@xemacs.org> wrote:
> >> Nick Craig-Wood <ni...@craig-wood.com> writes:
> >>
> >> > Here is a ctypes generator listdir for unix-like OSes.
> >>
> >> ctypes code scares me with its duplication of the contents of system
> >> headers. I understand its use as a proof of concept, or for hacks one
> >> needs right now, but can anyone seriously propose using this kind of
> >> code in a Python program? For example, this seems much more
> >> "Linux-only", or possibly even "32-bit-Linux-only", than
> >> "unix-like":
> >
> >It was a proof of concept certainly..
> >
> >It can be done properly with gccxml though which converts structures
> >into ctypes definitions.
> >
> >That said the dirent struct is specified by POSIX so if you get the
> >correct types for all the individual members then it should be correct
> >everywhere. Maybe ;-)
>
> The problem is that POSIX specifies the fields with types like off_t and
> ino_t. Since ctypes doesn't know anything about these types, application
> code has to specify their size and other attributes. As these vary from
> platform to platform, you can't get it correct without asking a real C
> compiler.

These types could be part of ctypes. After all ctypes knows how big a
long is on all platforms, and it knows that a uint32_t is the same on
all platforms, it could conceivably know how big an off_t or an ino_t
is too.

> In other words, POSIX talks about APIs and ctypes deals with ABIs.
>
> http://pypi.python.org/pypi/ctypes_configure/0.1 helps with the problem,
> and is a bit more accessible than gccxml.

I haven't seen that before - looks interesting.

> It is basically correct to say that using ctypes without using something
> like gccxml or ctypes_configure will give you non-portable code.

Well it depends on if the API is specified in types that ctypes
understands. Eg, short, int, long, int32_t, uint64_t etc. A lot of
interfaces are specified exactly like that and work just fine with
ctypes in a portable way. I agree with you that struct dirent
probably isn't one of those though!

I think it would be relatively easy to implement the code I demonstrated
in a portable way though... I'd do it by defining dirent as a block
of memory and then for the first run, find a known filename in the
block, establishing the offset of the name field since that is all we
are interested in for the OP's problem.

Mike Kazantsev

Jun 15, 2009, 7:31:53 PM

Why? We have itertools.imap, itertools.ifilter and xrange already.

--
Mike Kazantsev // fraggod.net


Hrvoje Niksic

Jun 16, 2009, 3:03:43 AM
Nick Craig-Wood <ni...@craig-wood.com> writes:

> It can be done properly with gccxml though which converts structures
> into ctypes definitions.

That sounds interesting.

> That said the dirent struct is specified by POSIX so if you get the
> correct types for all the individual members then it should be
> correct everywhere. Maybe ;-)

AFAIK POSIX specifies the names and types of the members, but not
their order in the structure, nor alignment.

thebjorn

Jun 16, 2009, 1:39:44 PM
On Jun 15, 6:56 am, Steven D'Aprano
> Steven

Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a database
field... *sigh*

Oddly enough, I'm relieved that others have had similar folder sizes
(I've been waiting for this to burst to the top of my list for a while
now).

Bjorn

Nick Craig-Wood

Jun 16, 2009, 4:29:33 PM
Nick Craig-Wood <ni...@craig-wood.com> wrote:
> Jean-Paul Calderone <exa...@divmod.com> wrote:
> > On Mon, 15 Jun 2009 09:29:33 -0500, Nick Craig-Wood <ni...@craig-wood.com> wrote:
> > >Hrvoje Niksic <hni...@xemacs.org> wrote:
> > >> Nick Craig-Wood <ni...@craig-wood.com> writes:
> > >>
> > >> > Here is a ctypes generator listdir for unix-like OSes.
> > >>
> > >> ctypes code scares me with its duplication of the contents of system
> > >> headers. I understand its use as a proof of concept, or for hacks one
> > >> needs right now, but can anyone seriously propose using this kind of
> > >> code in a Python program? For example, this seems much more
> > >> "Linux-only", or possibly even "32-bit-Linux-only", than
> > >> "unix-like":
> > >
> > >It was a proof of concept certainly..

Just in case anyone is interested here is an implementation using cython.

Compile with python setup.py build_ext --inplace

And run listdir.py

This would have been much easier if cython supported yield, but
unfortunately it doesn't (yet - I think it is in the works).

This really should work on any platform!

--setup.py----------------------------------------------------------
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("directory", ["directory.pyx"])]
)

--directory.pyx----------------------------------------------------------
# Cython interface for listdir
# python setup.py build_ext --inplace

import cython

cdef extern from "dirent.h":
    struct dirent:
        char d_name[0]
    struct dir_handle:
        pass
    ctypedef dir_handle DIR "DIR"
    DIR *opendir(char *name)
    int closedir(DIR *dirp)
    dirent *readdir(DIR *dirp)

cdef class Directory:
    """Represents an open directory"""

    cdef DIR *handle

    def __init__(self, path):
        self.handle = opendir(path)

    def readdir(self):
        cdef dirent *p
        p = readdir(self.handle)
        if p is NULL:
            return None
        return p.d_name

    def close(self):
        closedir(self.handle)

--listdir.py----------------------------------------------------------
from directory import Directory

def listdir(path):
    """
    A generator to return the names of files in the directory passed
    in
    """
    d = Directory(path)  # open the directory that was asked for
    try:
        while True:
            name = d.readdir()
            if not name:
                break
            if name not in (".", ".."):
                yield name
    finally:
        d.close()  # also runs if the caller stops iterating early

if __name__ == "__main__":
    for name in listdir("."):
        print(name)

------------------------------------------------------------

Lawrence D'Oliveiro

Jun 16, 2009, 10:52:28 PM
In message
<234b19ac-7baf-4356...@z9g2000yqi.googlegroups.com>, thebjorn
wrote:

> Not proud of this, but...:
>
> [django] www4:~/datakortet/media$ ls bfpbilder|wc -l
> 174197
>
> all .jpg files between 40 and 250KB with the path stored in a database
> field... *sigh*

Why not put the images themselves into database fields?

> Oddly enough, I'm a relieved that others have had similar folder sizes ...

One of my past projects had 400000-odd files in a single folder. They were
movie frames, to allow assembly of movie sequences on demand.

Mike Kazantsev

Jun 16, 2009, 11:18:58 PM

For both scenarios:
Why not use hex representation of md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4 ?

That way, you won't have to deal with many-files-in-path problem, and,
since there's thousands of them anyway, name readability shouldn't
matter.

In fact, on modern filesystems it doesn't matter whether you are accessing
/path/f9e95ea4926a4 with a million files in /path or /path/f/9/e/95ea
with only a hundred of them in each path. The former case (all-in-one-path)
would even outperform the latter with ext3 or reiserfs by a small
margin.
Sadly, that's not the case with filesystems like FreeBSD ufs2 (at least
in the sixth branch), so if the app might be run on different machines
it's better to play safe and create subdirs than to keep everything in
one path.
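
A minimal sketch of the layout described above, assuming SHA-1 of a string id and three single-character directory levels. The function name sharded_path, the "/path" root and the sample id are all made up for illustration.

```python
import hashlib

def sharded_path(root, file_id, levels=3):
    """Map an id to root/f/9/e/95ea... using the hex SHA-1 digest of the id."""
    digest = hashlib.sha1(file_id.encode("utf-8")).hexdigest()
    # First `levels` hex characters become one directory each;
    # the rest of the digest is the file name.
    parts = list(digest[:levels]) + [digest[levels:]]
    return "/".join([root] + parts)

example = sharded_path("/path", "photo-0001.jpg")
print(example)
```

Because the digest is spread evenly over the hex alphabet, files end up distributed roughly uniformly across the 16 * 16 * 16 leaf directories.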


Lie Ryan

Jun 16, 2009, 11:42:02 PM

It might not matter for the filesystem, but the file explorer (and ls)
would still suffer. Subfolder structure would be much better, and much
easier to navigate manually when you need to.

Mike Kazantsev

Jun 17, 2009, 1:07:05 AM
On Wed, 17 Jun 2009 03:42:02 GMT
Lie Ryan <lie....@gmail.com> wrote:

> Mike Kazantsev wrote:
> > In fact, on modern filesystems it doesn't matter whether you
> > accessing /path/f9e95ea4926a4 with million files in /path
> > or /path/f/9/e/95ea with only hundred of them in each path. Former
> > case (all-in-one-path) would even outperform the latter with ext3
> > or reiserfs by a small margin.
> > Sadly, that's not the case with filesystems like FreeBSD ufs2 (at
> > least in sixth branch), so it's better to play safe and create
> > subdirs if the app might be run on different machines than keeping
> > everything in one path.
>
> It might not matter for the filesystem, but the file explorer (and ls)
> would still suffer. Subfolder structure would be much better, and much
> easier to navigate manually when you need to.

It's an insane idea to navigate any structure with hash-based names
and hundreds of thousands of files *manually*: "What do we have here?
Hashies?" ;)


Lawrence D'Oliveiro

Jun 17, 2009, 1:45:23 AM
In message <mailman.1588.1245076...@python.org>, Jean-Paul
Calderone wrote:

> The problem is that POSIX specifies the fields with types like off_t and
> ino_t. Since ctypes doesn't know anything about these types, application
> code has to specify their size and other attributes. As these vary from
> platform to platform, you can't get it correct without asking a real C
> compiler.

Just to add to the complications, on 32-bit platforms, off_t can be either
64 bits or 32 bits, depending on whether a C program is compiled with
-D_FILE_OFFSET_BITS=64 or not. This causes all kinds of aliasing of POSIX
routines to the appropriate variants of the underlying libc routine names.

With ctypes, you have to directly access the underlying routine names,
unless you implement some kind of equivalent aliasing scheme on top.
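
One way to see this aliasing from Python is to ask the loaded C library which symbol names it actually exports. This is only a probe, assuming a Unix-like system; "readdir64" is the name glibc uses for the large-file variant, and other C libraries may simply not have it. The helper has_symbol is invented for this example.

```python
import ctypes
import ctypes.util

# find_library may return None on some systems; CDLL(None) then falls
# back to the symbols already loaded into the running process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

def has_symbol(name):
    """Return True if the loaded C library exports a symbol called name."""
    return hasattr(libc, name)

for symbol in ("readdir", "readdir64"):
    print(symbol, has_symbol(symbol))
```

A C program gets the right variant chosen by its headers at compile time; ctypes code has to make that choice itself, along with the matching struct layout.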

Lawrence D'Oliveiro

Jun 17, 2009, 1:53:33 AM
In message <20090617091858.432f89ca@malediction>, Mike Kazantsev wrote:

> On Wed, 17 Jun 2009 14:52:28 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> In message
>> <234b19ac-7baf-4356...@z9g2000yqi.googlegroups.com>,
>> thebjorn wrote:
>>
>> > Not proud of this, but...:
>> >
>> > [django] www4:~/datakortet/media$ ls bfpbilder|wc -l
>> > 174197
>> >
>> > all .jpg files between 40 and 250KB with the path stored in a
>> > database field... *sigh*
>>
>> Why not put the images themselves into database fields?
>>
>> > Oddly enough, I'm a relieved that others have had similar folder
>> > sizes ...
>>
>> One of my past projects had 400000-odd files in a single folder. They
>> were movie frames, to allow assembly of movie sequences on demand.
>
> For both scenarios:
> Why not use hex representation of md5/sha1-hashed id as a path,
> arranging them like /path/f/9/e/95ea4926a4 ?
>

> That way, you won't have to deal with many-files-in-path problem ...

Why is that a problem?

Mike Kazantsev

Jun 17, 2009, 4:24:31 AM
On Wed, 17 Jun 2009 17:53:33 +1200
Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

> > Why not use hex representation of md5/sha1-hashed id as a path,
> > arranging them like /path/f/9/e/95ea4926a4 ?
> >
> > That way, you won't have to deal with many-files-in-path problem ...
>
> Why is that a problem?

So you can os.listdir them?
Don't ask me what for, however, since that's the original question.
Also not every fs still in use handles this situation effectively, see
my original post.


Lawrence D'Oliveiro

Jun 17, 2009, 7:04:37 AM
In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev wrote:

> On Wed, 17 Jun 2009 17:53:33 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> > Why not use hex representation of md5/sha1-hashed id as a path,
>> > arranging them like /path/f/9/e/95ea4926a4 ?
>> >
>> > That way, you won't have to deal with many-files-in-path problem ...
>>
>> Why is that a problem?
>
> So you can os.listdir them?

Why should you have a problem os.listdir'ing lots of files?

Scott David Daniels

Jun 17, 2009, 9:05:12 AM
Mike Kazantsev wrote:
> On Wed, 17 Jun 2009 14:52:28 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> In message
>> <234b19ac-7baf-4356...@z9g2000yqi.googlegroups.com>,
>> thebjorn wrote:
>>
>>> Not proud of this, but...:
>>>
>>> [django] www4:~/datakortet/media$ ls bfpbilder|wc -l
>>> 174197
>>>
>>> all .jpg files between 40 and 250KB with the path stored in a
>>> database field... *sigh*
>> Why not put the images themselves into database fields?
>>
>>> Oddly enough, I'm a relieved that others have had similar folder
>>> sizes ...
>> One of my past projects had 400000-odd files in a single folder. They
>> were movie frames, to allow assembly of movie sequences on demand.
>
> For both scenarios:
> Why not use hex representation of md5/sha1-hashed id as a path,
> arranging them like /path/f/9/e/95ea4926a4 ?
> ...

> In fact, on modern filesystems it doesn't matter whether you accessing
> /path/f9e95ea4926a4 with million files in /path or /path/f/9/e/95ea
> with only hundred of them in each path.
Probably better to use:
/path/f9/e9/5ea4926a4
If you want to talk hundreds per layer. Branching 16 ways seems silly.

--Scott David Daniels
Scott....@Acm.Org
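
The difference between the two layouts is simple arithmetic: one hex character per level gives 16 entries per directory, two give 256. A quick sketch, using the 400000-file figure mentioned earlier in the thread; the function name files_per_leaf is invented here, and the code assumes Python 3 division.

```python
def files_per_leaf(total_files, chars_per_level, levels):
    """Average files per leaf directory when each level uses that many hex characters."""
    fanout = 16 ** chars_per_level       # 16 for one hex char, 256 for two
    leaves = fanout ** levels            # number of bottom-level directories
    return total_files / leaves

total = 400000

one_char = files_per_leaf(total, chars_per_level=1, levels=3)   # /path/f/9/e/...
two_char = files_per_leaf(total, chars_per_level=2, levels=2)   # /path/f9/e9/...
print(round(one_char, 1), round(two_char, 1))
```

With two characters per level, two levels already provide 65536 leaf directories, so fewer levels are needed for the same spread.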

Mike Kazantsev

Jun 17, 2009, 11:45:35 AM
On Wed, 17 Jun 2009 23:04:37 +1200

Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

> In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev wrote:
>
> > On Wed, 17 Jun 2009 17:53:33 +1200
> > Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
> >
> >> > Why not use hex representation of md5/sha1-hashed id as a path,
> >> > arranging them like /path/f/9/e/95ea4926a4 ?
> >> >
> >> > That way, you won't have to deal with many-files-in-path problem ...
> >>
> >> Why is that a problem?
> >
> > So you can os.listdir them?
>
> Why should you have a problem os.listdir'ing lots of files?

I shouldn't, and I don't ;)


Lawrence D'Oliveiro

Jun 17, 2009, 6:33:49 PM

Then why did you suggest that there was a problem being able to os.listdir
them?

Mike Kazantsev

Jun 17, 2009, 10:14:23 PM
On Thu, 18 Jun 2009 10:33:49 +1200

Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:

> In message <20090617214535.108667ca@coercion>, Mike Kazantsev wrote:
>
> > On Wed, 17 Jun 2009 23:04:37 +1200
> > Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
> >
> >> In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev wrote:
> >>
> >>> On Wed, 17 Jun 2009 17:53:33 +1200
> >>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
> >>>
> >>>>> Why not use hex representation of md5/sha1-hashed id as a path,
> >>>>> arranging them like /path/f/9/e/95ea4926a4 ?
> >>>>>
> >>>>> That way, you won't have to deal with many-files-in-path problem ...
> >>>>
> >>>> Why is that a problem?
> >>>
> >>> So you can os.listdir them?
> >>
> >> Why should you have a problem os.listdir'ing lots of files?
> >
> > I shouldn't, and I don't ;)
>
> Then why did you suggest that there was a problem being able to os.listdir
> them?

I didn't, OP did, and that's what the topic "walking directory with
many files" is about.
I wonder whether you're unable to read past the first line, trying to
make some point or just some kind of alternatively-gifted (i.e.
brain-handicapped) person to keep interpreting posts w/o context like
that.


Ethan Furman

Jun 17, 2009, 7:56:03 PM
to pytho...@python.org

He didn't, the OP did.

Asun Friere

Jun 18, 2009, 1:42:40 AM
On Jun 15, 6:35 am, Andre Engels <andreeng...@gmail.com> wrote:
> What kind of directories are those that just a list of files would
> result in a "very large" object? I don't think I have ever seen
> directories with more than a few thousand files...


(asun@lucrezia:~/pit/lsa/act:5)$ ls -1 | wc -l
142607

There, you've seen one with 142 thousand now! :P

Lie Ryan

Jun 18, 2009, 2:48:27 PM

Like... when you're trying to debug code that generates an error with
a specific file...

Yeah, it might be possible to just mv the file from outside, but not
being able to enter a directory just because you've got too many files
in it is kind of silly.

Lawrence D'Oliveiro

Jun 19, 2009, 1:53:40 AM
In message <20090618081423.2e0356b9@coercion>, Mike Kazantsev wrote:

> On Thu, 18 Jun 2009 10:33:49 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> In message <20090617214535.108667ca@coercion>, Mike Kazantsev wrote:
>>
>>> On Wed, 17 Jun 2009 23:04:37 +1200
>>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>>>
>>>> In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev
>>>> wrote:
>>>>
>>>>> On Wed, 17 Jun 2009 17:53:33 +1200
>>>>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>>>>>
>>>>>>> Why not use hex representation of md5/sha1-hashed id as a path,
>>>>>>> arranging them like /path/f/9/e/95ea4926a4 ?
>>>>>>>
>>>>>>> That way, you won't have to deal with many-files-in-path problem
>>>>>>> ...
>>>>>>
>>>>>> Why is that a problem?
>>>>>
>>>>> So you can os.listdir them?
>>>>
>>>> Why should you have a problem os.listdir'ing lots of files?
>>>
>>> I shouldn't, and I don't ;)
>>
>> Then why did you suggest that there was a problem being able to
>> os.listdir them?
>

> I didn't, OP did ...

Then why did you reply to my question "Why is that a problem?" with "So that
you can os.listdir them?", if you didn't think there was a problem (see
above)?

Lawrence D'Oliveiro

Jun 19, 2009, 1:54:31 AM
In message <mailman.1734.1245302...@python.org>, Ethan
Furman wrote:

> He didn't ...

He replied to my question "Why is that a problem?" with "So you can
os.listdir them?". Why reply with an explanation of why it's a problem if
you don't think it's a problem?


Lawrence D'Oliveiro

Jun 19, 2009, 1:55:18 AM
In message <%Zv_l.19493$y61....@news-server.bigpond.net.au>, Lie Ryan
wrote:

> Yeah, it might be possible to just mv the file from outside, but not
> being able to enter a directory just because you've got too many files
> in it is kind of silly.

Sounds like a problem with your file/directory-manipulation tools.

Mike Kazantsev

Jun 19, 2009, 3:40:15 AM
On Fri, 19 Jun 2009 17:53:40 +1200

Why do you think that if I didn't suggest there is a problem, I think
there is no problem?

I do think there might be such a problem, and I may even have to face it
someday. So, out of sheer curiosity about how much more ridiculous this
topic can get, I'll try to rephrase and extend what I wrote in the first place:


Why would you want to listdir them?
I can imagine at least one simple scenario: you had some nasty crash
and you want to check that every file has corresponding, valid db
record.

What's the problem with listdir if there's 10^x of them?
Well, imagine that the db record also holds the file modification time
(say, the files are some kind of cache), so not only do you need to
compare listdir results with the db, you also have to os.stat every file,
and some filesystems will do that very slowly with so many of them in one
place.
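
As a rough illustration of that scenario: the sketch below walks one directory entry by entry and compares each file against recorded modification times. It assumes Python 3.6 or later for os.scandir() (which postdates this thread), and the dict recorded_mtimes is only a stand-in for the database; the function name find_mismatches and the file names are made up.

```python
import os
import tempfile

def find_mismatches(directory, recorded_mtimes):
    """Yield (name, reason) for files with no record or a different mtime."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.name not in recorded_mtimes:
                yield entry.name, "no db record"
            elif entry.stat(follow_symlinks=False).st_mtime != recorded_mtimes[entry.name]:
                yield entry.name, "mtime differs"

with tempfile.TemporaryDirectory() as tmp:
    for name in ("known.jpg", "orphan.jpg"):
        path = os.path.join(tmp, name)
        open(path, "w").close()
        os.utime(path, (1000000000, 1000000000))  # fixed atime/mtime for the demo
    recorded_mtimes = {"known.jpg": 1000000000}   # stand-in for the database
    mismatches = sorted(find_mismatches(tmp, recorded_mtimes))

print(mismatches)
```

The per-file stat call is still there, so this does nothing for a filesystem that is slow at stat in a huge directory; it only avoids holding every name in memory at once.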


Now, I think I made this point in the first answer, no?

Of course you can make it more ridiculous by your
I-can-talk-away-any-problem-I-can't-see-or-solve approach by asking "why
would you want to use such filesystems?", "why do you have to use
FreeBSD?", "why do you have to work for such employer?", "why do you
have to eat?" etc, but you know, sometimes it's easier and better for
the project/work just to solve it, than talk everyone else away from it
just because you don't like otherwise acceptable solution.


Lie Ryan

Jun 19, 2009, 2:42:49 PM

try an `ls` on a folder with 10000+ files.

See how long is needed to print all the files.

Ok, now pipe ls to less, take three days to browse through all the
filenames to locate the file you want to see.

The file manipulation tool may not have problems with it; it's the user
that would have a hard time sorting through the huge amount of files.

Even with glob and grep, some types of queries are just too difficult, or
it is plain silly to write a full-fledged one-time-use program just to
locate a few files.

Lawrence D'Oliveiro

unread,
Jun 20, 2009, 4:51:07 AM6/20/09
to

It wasn't that you didn't suggest there was a problem, but that you
suggested a "solution" as though there was a problem.

> Why would you want to listdir them?

It's a common need, to find out what's in a directory.

> I can imagine at least one simple scenario: you had some nasty crash
> and you want to check that every file has corresponding, valid db
> record.

But why would that be relevant to this case?


Lawrence D'Oliveiro

unread,
Jun 20, 2009, 4:54:03 AM6/20/09
to
In message <J_Q_l.19655$y61....@news-server.bigpond.net.au>, Lie Ryan
wrote:

> Lawrence D'Oliveiro wrote:
>
>> In message <%Zv_l.19493$y61....@news-server.bigpond.net.au>, Lie Ryan
>> wrote:
>>
>>> Yeah, it might be possible to just mv the file from outside, but not
>>> being able to enter a directory just because you've got too many files
>>> in it is kind of silly.
>>
>> Sounds like a problem with your file/directory-manipulation tools.
>
> try an `ls` on a folder with 10000+ files.
>
> See how long is needed to print all the files.

As I've mentioned elsewhere, I had scripts routinely dealing with
directories containing around 400,000 files.

> Ok, now pipe ls to less, take three days to browse through all the
> filenames to locate the file you want to see.

Sounds like you're approaching the issue with a GUI-centric mentality, which
is completely hopeless at dealing with this sort of situation.

Steven D'Aprano

unread,
Jun 20, 2009, 10:56:47 AM6/20/09
to
Lawrence D'Oliveiro wrote:

>> Ok, now pipe ls to less, take three days to browse through all the
>> filenames to locate the file you want to see.
>
> Sounds like you're approaching the issue with a GUI-centric mentality,
> which is completely hopeless at dealing with this sort of situation.

Piping the output of ls to less is a GUI-centric mentality?


--
Steven

rkl

unread,
Jun 21, 2009, 5:57:33 AM6/21/09
to
On Jun 15, 2:35 am, tom <f...@thefsb.org> wrote:
> i can traverse a directory using os.listdir() or os.walk(). but if a
> directory has a very large number of files, these methods produce very
> large objects talking a lot of memory.
>
> in other languages one can avoid generating such an object by walking
> a directory as a liked list. for example, in c, perl or php one can
> use opendir() and then repeatedly readdir() until getting to the end
> of the file list. it seems this could be more efficient in some
> applications.
>
> is there a way to do this in python? i'm relatively new to the
> language. i looked through the documentation and tried googling but
> came up empty.

I might be a little late with my comment here.

David Beazley in his PyCon'2008 presentation "Generator Tricks
For Systems Programmers" had this very elegant example of handling an
unlimited numbers of files:


import os, fnmatch

def gen_find(filepat, top):
    """gen_find(filepat,top) - find matching files in directory tree,
    start searching from top

    expects: a file pattern as string, and a directory path as string
    yields: a sequence of filenames (including paths)
    """
    # os.walk yields one (path, dirs, files) triple per directory
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)


for file in gen_find('*.py', '/'):
    print(file)

Tim Golden

unread,
Jun 21, 2009, 6:06:23 AM6/21/09
to pytho...@python.org
rkl wrote:
> I might be a little late with my comment here.
>
> David Beazley in his PyCon'2008 presentation "Generator Tricks
> For Systems Programmers" had this very elegant example of handling an
> unlimited numbers of files:


David Beazley's generator stuff is definitely worth recommending.
I think the issue here is that anything which ultimately uses
os.listdir (and os.walk does) is bound by the fact that it will
create a long list of every file before handing it back. Certainly
there are techniques (someone posted a ctypes wrapper for opendir;
I recommended FindFirst/NextFile on Windows) which could be applied,
but those are all outside the stdlib.
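
For later readers: since Python 3.5 the stdlib has os.scandir(), which iterates over a directory lazily. Below is a rough sketch of the gen_find idea built on it. The name gen_find_lazy is made up here, and the sketch does no error handling for unreadable directories.

```python
import fnmatch
import os

def gen_find_lazy(filepat, top):
    """Yield paths under top whose file name matches filepat.

    Entries are read one at a time with os.scandir, so no per-directory
    list of file names is built.
    """
    stack = [top]  # directories still to visit
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif fnmatch.fnmatch(entry.name, filepat):
                    yield entry.path
```

Only the stack of pending subdirectories is held in memory. Note that os.walk in 3.5+ also uses scandir internally, but it still collects each directory's names into lists before yielding them.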

TJG

Mel

unread,
Jun 23, 2009, 10:29:21 AM6/23/09
to
Steven D'Aprano wrote:

Yeah. The "dump it on the user" idea, or more politely "can't decide
anything until the user has seen everything" is evident in the most
"characteristic" GUIs.

Mel.


Steven D'Aprano

unread,
Jun 23, 2009, 10:20:11 PM6/23/09
to


Perhaps you're using different GUIs to me. In my experience, most GUIs
tend to *hide* data from the user rather than give them everything under
the sun.

The classic example is Windows, which hides certain files in the GUI file
manager even if you tell it to show all files.


--
Steven

Lawrence D'Oliveiro

unread,
Jun 24, 2009, 1:08:57 AM6/24/09
to
In message <pan.2009.06...@REMOVE.THIS.cybersource.com.au>, Steven
D'Aprano wrote:

> On Tue, 23 Jun 2009 10:29:21 -0400, Mel wrote:
>
>> Steven D'Aprano wrote:
>>
>>> Lawrence D'Oliveiro wrote:
>>>
>>>>> Ok, now pipe ls to less, take three days to browse through all the
>>>>> filenames to locate the file you want to see.
>>>>
>>>> Sounds like you're approaching the issue with a GUI-centric mentality,
>>>> which is completely hopeless at dealing with this sort of situation.
>>>
>>> Piping the output of ls to less is a GUI-centric mentality?
>>
>> Yeah. The "dump it on the user" idea, or more politely "can't decide
>> anything until the user has seen everything" is evident in the most
>> "characteristic" GUIs.
>
> Perhaps you're using different GUIs to me. In my experience, most GUIs
> tend to *hide* data from the user rather than give them everything under
> the sun.

Which is getting a bit away from what we're discussing here, but certainly
it is characteristic of GUIs to show you all 400,000 files in a directory,
or at least try to do so, and either hang for half an hour or run out of
memory and crash, rather than give you some intelligent way of prefiltering
the file display up front.

Lie Ryan

unread,
Jun 24, 2009, 3:30:24 AM6/24/09
to

In many debugging cases, you don't even know what to filter, which is
what I was referring to when I said "Even with glob and grep ..."

For example, when the problem mysteriously disappears when the file is
isolated.

Lawrence D'Oliveiro

unread,
Jul 8, 2009, 7:49:54 PM7/8/09
to
In message <kCk0m.406$ze1...@news-server.bigpond.net.au>, Lie Ryan wrote:

> Lawrence D'Oliveiro wrote:
>
>> ... certainly it is characteristic of GUIs to show you all 400,000 files
>> in a directory, or at least try to do so, and either hang for half an
>> hour or run out of memory and crash, rather than give you some
>> intelligent way of prefiltering the file display up front.
>
> In many debugging cases, you don't even know what to filter ...

So pick a random sample to start with. Do you know of a GUI tool that can do
that?
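
From a script, at least, a random sample is easy to pull without ever loading the full listing. A minimal sketch, assuming Python 3.5+ for os.scandir() and using reservoir sampling (Algorithm R); the name sample_names and its parameters are made up for illustration:

```python
import os
import random

def sample_names(directory, k, rng=random):
    """Return up to k entry names chosen uniformly at random.

    Uses reservoir sampling (Algorithm R): only k names are kept in
    memory, however many entries the directory holds.
    """
    sample = []
    with os.scandir(directory) as entries:
        for i, entry in enumerate(entries):
            if i < k:
                # Fill the reservoir with the first k entries.
                sample.append(entry.name)
            else:
                # Keep entry i with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = entry.name
    return sample
```

Something like sample_names('/some/big/dir', 20) would give twenty names to start poking at; the path is only a placeholder.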
