In other languages one can avoid generating such an object by walking
a directory as a linked list. For example, in C, Perl, or PHP one can
use opendir() and then repeatedly call readdir() until reaching the end
of the file list. It seems this could be more efficient in some
applications.
Is there a way to do this in Python? I'm relatively new to the
language. I looked through the documentation and tried googling but
came up empty.
If you're on Windows, you can use the win32file.FindFilesIterator
function from the pywin32 package (which wraps the Win32 API
FindFirstFile / FindNextFile pattern).
TJG
Thanks, Tim.
However, I'm not using Windows; FreeBSD and OS X.
Presumably, if Perl etc. can do it then it should be simple
enough to drop into ctypes and call the same library code, no?
(I'm not a BSD / OS X person, I'm afraid, so perhaps this isn't
so easy...)
TJG
What kind of directories are these, where just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...
--
André Engels, andre...@gmail.com
Some time ago we had a discussion about turning os.listdir() into a
generator. No conclusion was reached. We also thought about exposing
the functions opendir(), readdir(), closedir() and friends, but as far as
I know, and as far as I've checked the C code in Modules/posixmodule.c,
none of those functions has been added.
For now you are on your own to implement wrappers for the system calls.
In the distant future you may see the appropriate functions in the os
module. A mail to the python-ideas list may increase your chances. ;)
Christian
You did not specify a version. In Python 3, os.walk has become a generator
function. So, to answer your question, use 3.1.
tjr
I've seen directories with several hundred thousand files. Depending
on the file system and IO capacity, it can take from one to several
minutes until 'ls' even starts to print out file names. It's no fun on
directories on CIFS storage or ext3 without a btree dir index.
Christian
I'm sorry to inform you that Python 3.x still returns a list, not a
generator.
Python 3.1rc1+ (py3k:73396, Jun 12 2009, 22:45:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> type(os.listdir("."))
<class 'list'>
Christian
> On Sun, Jun 14, 2009 at 6:35 PM, tom<f...@thefsb.org> wrote:
>
>> in other languages one can avoid generating such an object by walking
>> a directory as a liked list.
I suppose it depends on how well-liked it is. Nerdy lists may work better, but
they tend not to be liked.
> What kind of directories are those that just a list of files would
> result in a "very large" object? I don't think I have ever seen
> directories with more than a few thousand files...
I worked on an application system which, at one point, routinely dealt with
directories containing hundreds of thousands of files. But even that kind of
directory contents only adds up to a few megabytes.
Since at least 2.4, os.walk has itself been a generator.
However, the contents of the directory (the 3rd element of the
yielded tuple) is a list produced by listdir() instead of a
generator. Unless listdir() has been changed to a generator
instead of a list (which other respondents seem to indicate has
not been implemented), this doesn't address the OP's issue of
"lots of files in a single directory".
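The point can be checked directly. A minimal sketch (modern Python assumed, for the tempfile helper): even though os.walk itself is a generator, every tuple it yields carries a complete list of names built by listdir:

```python
import os
import tempfile

# Build a throwaway directory with a few files, then look at what
# os.walk actually hands back for it.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        open(os.path.join(tmp, "file%d.txt" % i), "w").close()

    walker = os.walk(tmp)                      # a generator...
    dirpath, dirnames, filenames = next(walker)

    print(type(walker).__name__)               # generator
    print(type(filenames).__name__)            # list: all names at once
    print(sorted(filenames))
```

So the generator buys you nothing within a single huge directory; the whole name list is still materialized per directory.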
-tkc
You haven't looked very hard :)
$ pwd
/home/steve/.thumbnails/normal
$ ls | wc -l
33956
And I periodically delete thumbnails, to prevent the number of files
growing to hundreds of thousands.
--
Steven
Here is a ctypes generator listdir for unix-like OSes. I tested it
under linux.

#!/usr/bin/python
"""
An equivalent of os.listdir, but as a generator, using ctypes
"""

from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are exactly the correct types!
    _fields_ = (
        ('d_ino', c_long),         # inode number
        ('d_off', c_long),         # offset to the next dirent
        ('d_reclen', c_ushort),    # length of this record
        ('d_type', c_byte),        # type of file; not supported by all file system types
        ('d_name', c_char * 4096)  # filename
    )
c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"))

opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# FIXME Should probably use readdir_r here
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    dir_p = opendir(path)
    if not dir_p:
        raise OSError("could not open directory %r" % path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name
            if name not in (".", ".."):
                yield name
    finally:
        closedir(dir_p)

if __name__ == "__main__":
    for name in listdir("."):
        print name
--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick
> You did not specify version. In Python3, os.walk has become a
> generater function. So, to answer your question, use 3.1.
os.walk has been a generator function all along, but that doesn't help
the OP, because it still uses os.listdir internally. This means that it
both creates huge lists for huge directories, and holds on to those
lists until the iteration over the directory (and all subdirectories)
is finished.
In fact, os.walk is not suited for this kind of memory optimization
because yielding a *list* of files (and a separate list of
subdirectories) is specified in its interface. This hasn't changed in
Python 3.1:
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs
> Here is a ctypes generator listdir for unix-like OSes.
ctypes code scares me with its duplication of the contents of system
headers. I understand its use as a proof of concept, or for hacks one
needs right now, but can anyone seriously propose using this kind of
code in a Python program? For example, this seems much more
"Linux-only", or possibly even "32-bit-Linux-only", than "unix-like":
> i can traverse a directory using os.listdir() or os.walk(). but if a
> directory has a very large number of files, these methods produce very
> large objects taking a lot of memory.
If we assume the number of files to be a million (which certainly qualifies
as one of the larger directory sizes one encounters...), and the average
filename length to be 20, you'd end up with 20 megs of data.
Is that really a problem on today's multi-gigabyte machines? And we are
talking about a rather freakish case here.
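As a rough cross-check of that estimate (a sketch; exact sizes vary by Python version and platform), note that the 20 MB figure counts only the characters, while each entry is a full Python string object with its own overhead:

```python
import sys

# Back-of-envelope: a million filenames of 20 characters each.
n_files = 1000 * 1000
name = "a" * 20

raw_bytes = n_files * 20              # characters only: the "20 megs" figure
per_string = sys.getsizeof(name)      # real object size, header included
list_slots = n_files * 8              # roughly one pointer per list slot
total = n_files * per_string + list_slots

print("raw characters: %d MB" % (raw_bytes // 2**20))
print("with object and list overhead: ~%d MB" % (total // 2**20))
```

The real footprint is a few times larger than the raw character count, but still well within a modern machine's memory, so the conclusion stands.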
Diez
It was a proof of concept, certainly.
It can be done properly with gccxml though which converts structures
into ctypes definitions.
That said the dirent struct is specified by POSIX so if you get the
correct types for all the individual members then it should be correct
everywhere. Maybe ;-)
The problem is that POSIX specifies the fields with types like off_t and
ino_t. Since ctypes doesn't know anything about these types, application
code has to specify their size and other attributes. As these vary from
platform to platform, you can't get it correct without asking a real C
compiler.
In other words, POSIX talks about APIs and ctypes deals with ABIs.
http://pypi.python.org/pypi/ctypes_configure/0.1 helps with the problem,
and is a bit more accessible than gccxml.
It is basically correct to say that using ctypes without using something
like gccxml or ctypes_configure will give you non-portable code.
Jean-Paul
>>> type(os.walk('.'))
<class 'generator'>
However, it is a generator of directory tuples that include a filename
list produced by listdir, rather than a generator of filenames
themselves, as I was thinking. I wish listdir had been changed in 3.0
along with map, filter, and range, but I made no effort and hence cannot
complain.
tjr
These types could be part of ctypes. After all, ctypes knows how big a
long is on all platforms, and it knows that a uint32_t is the same on
all platforms, so it could conceivably know how big an off_t or an ino_t
is too.
> In other words, POSIX talks about APIs and ctypes deals with ABIs.
>
> http://pypi.python.org/pypi/ctypes_configure/0.1 helps with the problem,
> and is a bit more accessible than gccxml.
I haven't seen that before - looks interesting.
> It is basically correct to say that using ctypes without using something
> like gccxml or ctypes_configure will give you non-portable code.
Well, it depends on whether the API is specified in types that ctypes
understands, e.g. short, int, long, int32_t, uint64_t etc. A lot of
interfaces are specified exactly like that and work just fine with
ctypes in a portable way. I agree with you that struct dirent
probably isn't one of those, though!
I think it would be relatively easy to implement the code I demonstrated
in a portable way, though... I'd do it by defining dirent as a block
of memory, and then on the first run finding a known filename in the
block, establishing the offset of the name field, since that is all we
are interested in for the OP's problem.
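That probing idea is not in the thread's code, but it can be sketched (glibc-style systems assumed; reading a fixed window of bytes past the entry pointer works in practice because the entry sits inside a much larger DIR buffer, though POSIX makes no such guarantee):

```python
import ctypes
import ctypes.util
import os
import tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.opendir.restype = ctypes.c_void_p
libc.opendir.argtypes = [ctypes.c_char_p]
libc.readdir.restype = ctypes.c_void_p
libc.readdir.argtypes = [ctypes.c_void_p]
libc.closedir.argtypes = [ctypes.c_void_p]

def probe_d_name_offset():
    """Find the byte offset of d_name inside struct dirent by planting
    a file with a known name and scanning the raw entry bytes for it."""
    marker = b"dname_offset_probe"
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, marker.decode()), "w").close()
        dirp = libc.opendir(tmp.encode())
        if not dirp:
            raise OSError("opendir failed")
        try:
            while True:
                entry = libc.readdir(dirp)
                if not entry:
                    raise RuntimeError("marker file never seen")
                raw = ctypes.string_at(entry, 512)  # raw window over the entry
                offset = raw.find(marker)
                if offset != -1:
                    return offset
        finally:
            libc.closedir(dirp)

print(probe_d_name_offset())  # e.g. 19 on x86-64 Linux/glibc
```

Once the offset is known, names can be pulled out of each entry without declaring the rest of the struct at all.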
> It can be done properly with gccxml though which converts structures
> into ctypes definitions.
That sounds interesting.
> That said the dirent struct is specified by POSIX so if you get the
> correct types for all the individual members then it should be
> correct everywhere. Maybe ;-)
AFAIK POSIX specifies the names and types of the members, but not
their order in the structure, nor alignment.
Not proud of this, but...:
[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197
all .jpg files between 40 and 250KB with the path stored in a database
field... *sigh*
Oddly enough, I'm relieved that others have had similar folder sizes
(I've been waiting for this to burst to the top of my list for a while
now).
Bjorn
Just in case anyone is interested, here is an implementation using Cython.
Compile with python setup.py build_ext --inplace
And run listdir.py
This would have been much easier if Cython supported yield, but
unfortunately it doesn't (yet - I think it is in the works).
This really should work on any platform!
--setup.py----------------------------------------------------------
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("directory", ["directory.pyx"])]
)
--directory.pyx----------------------------------------------------------
# Cython interface for listdir
# python setup.py build_ext --inplace

import cython

cdef extern from "dirent.h":
    struct dirent:
        char d_name[0]
    struct dir_handle:
        pass
    ctypedef dir_handle DIR "DIR"
    DIR *opendir(char *name)
    int closedir(DIR *dirp)
    dirent *readdir(DIR *dirp)

cdef class Directory:
    """Represents an open directory"""
    cdef DIR *handle
    def __init__(self, path):
        self.handle = opendir(path)
    def readdir(self):
        cdef dirent *p
        p = readdir(self.handle)
        if p is NULL:
            return None
        return p.d_name
    def close(self):
        closedir(self.handle)
--listdir.py----------------------------------------------------------
from directory import Directory

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    d = Directory(path)
    try:
        while True:
            name = d.readdir()
            if not name:
                break
            if name not in (".", ".."):
                yield name
    finally:
        d.close()

if __name__ == "__main__":
    for name in listdir("."):
        print name
------------------------------------------------------------
> Not proud of this, but...:
>
> [django] www4:~/datakortet/media$ ls bfpbilder|wc -l
> 174197
>
> all .jpg files between 40 and 250KB with the path stored in a database
> field... *sigh*
Why not put the images themselves into database fields?
> Oddly enough, I'm relieved that others have had similar folder sizes ...
One of my past projects had 400000-odd files in a single folder. They were
movie frames, to allow assembly of movie sequences on demand.
For both scenarios:
Why not use hex representation of md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4 ?
That way, you won't have to deal with many-files-in-path problem, and,
since there's thousands of them anyway, name readability shouldn't
matter.
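The layout described above can be sketched in a few lines (the function name and nesting depth here are illustrative, not from any particular library):

```python
import hashlib
import os.path

def hashed_path(root, key, depth=3):
    """Map an id to a nested path like root/f/9/e/95ea4926a4...,
    one hex digit of the SHA-1 digest per directory level."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    parts = list(digest[:depth]) + [digest[depth:]]
    return os.path.join(root, *parts)

print(hashed_path("/path", "some-record-id"))
```

With three levels you get 16^3 = 4096 leaf directories, so even millions of files leave only a few hundred entries per directory.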
In fact, on modern filesystems it doesn't matter whether you're accessing
/path/f9e95ea4926a4 with a million files in /path or /path/f/9/e/95ea
with only a hundred of them in each path. The former case (all-in-one-path)
would even outperform the latter with ext3 or reiserfs by a small
margin.
Sadly, that's not the case with filesystems like FreeBSD ufs2 (at least
in the 6.x branch), so it's better to play safe and create subdirs if the
app might be run on different machines, than to keep everything in one
path.
It might not matter for the filesystem, but the file explorer (and ls)
would still suffer. Subfolder structure would be much better, and much
easier to navigate manually when you need to.
> Mike Kazantsev wrote:
> > In fact, on modern filesystems it doesn't matter whether you
> > accessing /path/f9e95ea4926a4 with million files in /path
> > or /path/f/9/e/95ea with only hundred of them in each path. Former
> > case (all-in-one-path) would even outperform the latter with ext3
> > or reiserfs by a small margin.
> > Sadly, that's not the case with filesystems like FreeBSD ufs2 (at
> > least in sixth branch), so it's better to play safe and create
> > subdirs if the app might be run on different machines than keeping
> > everything in one path.
>
> It might not matter for the filesystem, but the file explorer (and ls)
> would still suffer. Subfolder structure would be much better, and much
> easier to navigate manually when you need to.
It's an insane idea to navigate any structure with hash-based names
and hundreds of thousands of files *manually*: "What do we have here?
Hashies?" ;)
> The problem is that POSIX specifies the fields with types like off_t and
> ino_t. Since ctypes doesn't know anything about these types, application
> code has to specify their size and other attributes. As these vary from
> platform to platform, you can't get it correct without asking a real C
> compiler.
Just to add to the complications, on 32-bit platforms, off_t can be either
64 bits or 32 bits, depending on whether a C program is compiled with
-D_FILE_OFFSET_BITS=64 or not. This causes all kinds of aliasing of POSIX
routines to the appropriate variants of the underlying libc routine names.
With ctypes, you have to directly access the underlying routine names,
unless you implement some kind of equivalent aliasing scheme on top.
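The aliasing is visible from ctypes, since ctypes can only look up the symbol names a library actually exports; a quick probe (glibc assumed here) shows the plain and 64-bit variants living side by side:

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))

# With -D_FILE_OFFSET_BITS=64 on 32-bit glibc, the compiler silently
# rewrites readdir -> readdir64, open -> open64, etc.  ctypes never sees
# that macro magic, so it must pick the exported name explicitly.
for symbol in ("readdir", "readdir64", "open", "open64"):
    found = hasattr(libc, symbol)
    print(symbol, "exported" if found else "not exported")
```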
> On Wed, 17 Jun 2009 14:52:28 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> In message
>> <234b19ac-7baf-4356...@z9g2000yqi.googlegroups.com>,
>> thebjorn wrote:
>>
>> > Not proud of this, but...:
>> >
>> > [django] www4:~/datakortet/media$ ls bfpbilder|wc -l
>> > 174197
>> >
>> > all .jpg files between 40 and 250KB with the path stored in a
>> > database field... *sigh*
>>
>> Why not put the images themselves into database fields?
>>
>> > Oddly enough, I'm a relieved that others have had similar folder
>> > sizes ...
>>
>> One of my past projects had 400000-odd files in a single folder. They
>> were movie frames, to allow assembly of movie sequences on demand.
>
> For both scenarios:
> Why not use hex representation of md5/sha1-hashed id as a path,
> arranging them like /path/f/9/e/95ea4926a4 ?
>
> That way, you won't have to deal with many-files-in-path problem ...
Why is that a problem?
> > Why not use hex representation of md5/sha1-hashed id as a path,
> > arranging them like /path/f/9/e/95ea4926a4 ?
> >
> > That way, you won't have to deal with many-files-in-path problem ...
>
> Why is that a problem?
So you can os.listdir them?
Don't ask me what for, however, since that's the original question.
Also not every fs still in use handles this situation effectively, see
my original post.
> On Wed, 17 Jun 2009 17:53:33 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> > Why not use hex representation of md5/sha1-hashed id as a path,
>> > arranging them like /path/f/9/e/95ea4926a4 ?
>> >
>> > That way, you won't have to deal with many-files-in-path problem ...
>>
>> Why is that a problem?
>
> So you can os.listdir them?
Why should you have a problem os.listdir'ing lots of files?
--Scott David Daniels
Scott....@Acm.Org
> In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev wrote:
>
> > On Wed, 17 Jun 2009 17:53:33 +1200
> > Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
> >
> >> > Why not use hex representation of md5/sha1-hashed id as a path,
> >> > arranging them like /path/f/9/e/95ea4926a4 ?
> >> >
> >> > That way, you won't have to deal with many-files-in-path problem ...
> >>
> >> Why is that a problem?
> >
> > So you can os.listdir them?
>
> Why should you have a problem os.listdir'ing lots of files?
I shouldn't, and I don't ;)
Then why did you suggest that there was a problem being able to os.listdir
them?
> In message <20090617214535.108667ca@coercion>, Mike Kazantsev wrote:
>
> > On Wed, 17 Jun 2009 23:04:37 +1200
> > Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
> >
> >> In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev wrote:
> >>
> >>> On Wed, 17 Jun 2009 17:53:33 +1200
> >>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
> >>>
> >>>>> Why not use hex representation of md5/sha1-hashed id as a path,
> >>>>> arranging them like /path/f/9/e/95ea4926a4 ?
> >>>>>
> >>>>> That way, you won't have to deal with many-files-in-path problem ...
> >>>>
> >>>> Why is that a problem?
> >>>
> >>> So you can os.listdir them?
> >>
> >> Why should you have a problem os.listdir'ing lots of files?
> >
> > I shouldn't, and I don't ;)
>
> Then why did you suggest that there was a problem being able to os.listdir
> them?
I didn't; the OP did, and that's what the topic "walking directory with
many files" is about.
I wonder whether you're unable to read past the first line, trying to
make some point, or just some kind of alternatively-gifted (i.e.
brain-handicapped) person who keeps interpreting posts without context
like that.
He didn't, the OP did.
(asun@lucrezia:~/pit/lsa/act:5)$ ls -1 | wc -l
142607
There, you've seen one with 142 thousand now! :P
Like... when you're trying to debug code that generates an error with
a specific file...
Yeah, it might be possible to just mv the file from outside, but not
being able to enter a directory just because you've got too many files
in it is kind of silly.
> On Thu, 18 Jun 2009 10:33:49 +1200
> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>
>> In message <20090617214535.108667ca@coercion>, Mike Kazantsev wrote:
>>
>>> On Wed, 17 Jun 2009 23:04:37 +1200
>>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>>>
>>>> In message <20090617142431.2b25faf5@malediction>, Mike Kazantsev
>>>> wrote:
>>>>
>>>>> On Wed, 17 Jun 2009 17:53:33 +1200
>>>>> Lawrence D'Oliveiro <l...@geek-central.gen.new_zealand> wrote:
>>>>>
>>>>>>> Why not use hex representation of md5/sha1-hashed id as a path,
>>>>>>> arranging them like /path/f/9/e/95ea4926a4 ?
>>>>>>>
>>>>>>> That way, you won't have to deal with many-files-in-path problem
>>>>>>> ...
>>>>>>
>>>>>> Why is that a problem?
>>>>>
>>>>> So you can os.listdir them?
>>>>
>>>> Why should you have a problem os.listdir'ing lots of files?
>>>
>>> I shouldn't, and I don't ;)
>>
>> Then why did you suggest that there was a problem being able to
>> os.listdir them?
>
> I didn't, OP did ...
Then why did you reply to my question "Why is that a problem?" with "So that
you can os.listdir them?", if you didn't think there was a problem (see
above)?
> He didn't ...
He replied to my question "Why is that a problem?" with "So you can
os.listdir them?". Why reply with an explanation of why it's a problem if
you don't think it's a problem?
> Yeah, it might be possible to just mv the file from outside, but not
> being able to enter a directory just because you've got too many files
> in it is kind of silly.
Sounds like a problem with your file/directory-manipulation tools.
Why do you think that if I didn't suggest there is a problem, I think
there is no problem?
I do think there might be such a problem, and even I may have to face it
someday. So, out of sheer curiosity about how much more ridiculous this
topic can get, I'll try to rephrase and extend what I wrote in the first
place:
Why would you want to listdir them?
I can imagine at least one simple scenario: you had some nasty crash
and you want to check that every file has a corresponding, valid db
record.
What's the problem with listdir if there are 10^x of them?
Well, imagine that the db record also holds the file modification time
(say, the files are some kind of cache), so not only do you need to
compare the listdir results with the db, you also have to do os.stat on
every file, and some filesystems will do that very slowly with so many
of them in one place.
Now, I think I made this point in the first answer, no?
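That reconciliation loop might look like this sketch (db_mtimes is a made-up stand-in for whatever the database query returns):

```python
import os

def missing_or_stale(path, db_mtimes):
    """Yield names whose on-disk state disagrees with the database.

    db_mtimes: hypothetical dict of filename -> expected mtime,
    standing in for a real database query.
    """
    seen = set()
    for name in os.listdir(path):       # one big list for huge directories...
        seen.add(name)
        expected = db_mtimes.get(name)
        if expected is None:
            yield name                  # on disk, but unknown to the db
        elif os.stat(os.path.join(path, name)).st_mtime != expected:
            yield name                  # modified since the db last saw it
    for name in db_mtimes:
        if name not in seen:
            yield name                  # in the db, but gone from disk
```

Every file costs a stat call on top of the listdir itself, which is exactly where slow filesystems hurt.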
Of course you can make it more ridiculous with your
I-can-talk-away-any-problem-I-can't-see-or-solve approach, by asking "why
would you want to use such filesystems?", "why do you have to use
FreeBSD?", "why do you have to work for such an employer?", "why do you
have to eat?" etc., but you know, sometimes it's easier and better for
the project/work just to solve it than to talk everyone else away from it
just because you don't like an otherwise acceptable solution.
Try an `ls` on a folder with 10000+ files.
See how long it takes to print all the file names.
Ok, now pipe ls to less, and take three days to browse through all the
filenames to locate the file you want to see.
The file manipulation tool may not have problems with it; it's the user
that would have a hard time sorting through the huge number of files.
Even with glob and grep, some types of queries are just too difficult, or
it is plain silly to write a full-fledged one-time-use program just to
locate a few files.
It wasn't that you didn't suggest there was a problem, but that you
suggested a "solution" as though there was a problem.
> Why would you want to listdir them?
It's a common need, to find out what's in a directory.
> I can imagine at least one simple scenario: you had some nasty crash
> and you want to check that every file has corresponding, valid db
> record.
But why would that be relevant to this case?
> Lawrence D'Oliveiro wrote:
>
>> In message <%Zv_l.19493$y61....@news-server.bigpond.net.au>, Lie Ryan
>> wrote:
>>
>>> Yeah, it might be possible to just mv the file from outside, but not
>>> being able to enter a directory just because you've got too many files
>>> in it is kind of silly.
>>
>> Sounds like a problem with your file/directory-manipulation tools.
>
> try an `ls` on a folder with 10000+ files.
>
> See how long is needed to print all the files.
As I've mentioned elsewhere, I had scripts routinely dealing with
directories containing around 400,000 files.
> Ok, now pipe ls to less, take three days to browse through all the
> filenames to locate the file you want to see.
Sounds like you're approaching the issue with a GUI-centric mentality, which
is completely hopeless at dealing with this sort of situation.
>> Ok, now pipe ls to less, take three days to browse through all the
>> filenames to locate the file you want to see.
>
> Sounds like you're approaching the issue with a GUI-centric mentality,
> which is completely hopeless at dealing with this sort of situation.
Piping the output of ls to less is a GUI-centric mentality?
--
Steven
I might be a little late with my comment here.
David Beazley, in his PyCon 2008 presentation "Generator Tricks
For Systems Programmers", had this very elegant example of handling an
unlimited number of files:

import os, fnmatch

def gen_find(filepat, top):
    """gen_find(filepat, top) - find matching files in a directory tree,
    starting the search from top

    expects: a file pattern as string, and a directory path as string
    yields: a sequence of filenames (including paths)
    """
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

for file in gen_find('*.py', '/'):
    print file
David Beazley's generator stuff is definitely worth recommending
on. I think the issue here is that: anything which ultimately uses
os.listdir (and os.walk does) is bound by the fact that it will
create a long list of every file before handing it back. Certainly
there are techniques (someone posted a ctypes wrapper for opendir;
I recommended FindFirst/NextFile on Windows) which could be applied,
but those are all outside the stdlib.
TJG
Yeah. The "dump it on the user" idea, or more politely "can't decide
anything until the user has seen everything", is evident in the most
"characteristic" GUIs.
Mel.
Perhaps you're using different GUIs to me. In my experience, most GUIs
tend to *hide* data from the user rather than give them everything under
the sun.
The classic example is Windows, which hides certain files in the GUI file
manager even if you tell it to show all files.
--
Steven
> On Tue, 23 Jun 2009 10:29:21 -0400, Mel wrote:
>
>> Steven D'Aprano wrote:
>>
>>> Lawrence D'Oliveiro wrote:
>>>
>>>>> Ok, now pipe ls to less, take three days to browse through all the
>>>>> filenames to locate the file you want to see.
>>>>
>>>> Sounds like you're approaching the issue with a GUI-centric mentality,
>>>> which is completely hopeless at dealing with this sort of situation.
>>>
>>> Piping the output of ls to less is a GUI-centric mentality?
>>
>> Yeah. The "dump it on the user" idea, or more politely "can't decide
>> anything until the user has seen everything" is evident in the most
>> "characteristic" GUIs.
>
> Perhaps you're using different GUIs to me. In my experience, most GUIs
> tend to *hide* data from the user rather than give them everything under
> the sun.
Which is getting a bit away from what we're discussing here, but certainly
it is characteristic of GUIs to show you all 400,000 files in a directory,
or at least try to do so, and either hang for half an hour or run out of
memory and crash, rather than give you some intelligent way of prefiltering
the file display up front.
In many debugging cases, you don't even know what to filter, which is
what I was referring to when I said "Even with glob and grep ...".
For example, when the problem mysteriously disappears when the file is
isolated.
> Lawrence D'Oliveiro wrote:
>
>> ... certainly it is characteristic of GUIs to show you all 400,000 files
>> in a directory, or at least try to do so, and either hang for half an
>> hour or run out of memory and crash, rather than give you some
>> intelligent way of prefiltering the file display up front.
>
> In many debugging cases, you don't even know what to filter ...
So pick a random sample to start with. Do you know of a GUI tool that can do
that?