I'm developing a software program
which stores 100,000 or more files in
one single directory.
Is directory entry lookup based on some
hashing type of scheme? Or is it a linear
lookup?
It's a linear lookup, i.e. it gets very slow.
Good luck,
Jurriaan
--
In that case, I shall prepare my Turnip Surprise.
And the surprise is?
There's nothing else in it except turnip.
Baldrick on Haute Cuisine
GNU/Linux 2.4.5-ac4 SMP/ReiserFS 2x1402 bogomips load av: 0.01 0.01 0.00
example:
all files with names starting with 'a' go into /a. Same with other
letters. If that doesn't break it up enough, start going with two
letters, or three letters, in a tree arrangement.
/a
/a/aa
/a/ab
/a/ac
/a/ad
/b
/b/ba
/b/bb
and so on. The reason for the tree is to make it so that no directory
has too many files. Your program would then look at the filename to get
the right path to the file it's looking for. A good way to do it is to
make a function that, given a filename, returns the path where you would
expect to find the file.
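A rough sketch of such a function, in Python. The name shard_path, the
root argument and the default depth of 2 are made up for illustration;
adjust them to whatever layout your program settles on.

```python
import os


def shard_path(root, filename, depth=2):
    """Return the path under root where filename is expected to live.

    With depth=2, 'abc.txt' maps to root/a/ab/abc.txt, mirroring the
    /a, /a/ab tree layout described above.
    (shard_path, root and depth are hypothetical names for this sketch.)
    """
    name = os.path.basename(filename)
    if not name:
        raise ValueError("empty filename")
    # Build prefixes of length 1..depth; names shorter than depth
    # simply get fewer directory levels.
    levels = min(depth, len(name))
    parts = [name[:i] for i in range(1, levels + 1)]
    return os.path.join(root, *parts, name)
```

Every part of the program that opens or creates a file would go through
this one function, so the layout can be changed later in a single place.
If your filenames cluster on a few prefixes, splitting on a hash of the
name instead of its first letters would probably spread them more evenly.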
To see another example of this, take a look at your terminfo database
which on my Debian system is in /usr/share/terminfo. I have 2139 files
in that hierarchy, which was enough to warrant splitting into the tree.
You should definitely do this if you have 100,000 files.
> It's a linear lookup, i.e. it gets very slow.
One thing I noticed is that even for as many
as 60,000 directory entries, the performance
isn't all that bad.
I wonder why?
Depending on your access patterns, the directory cache will kick in, and
do most of the real work.
And the dcache uses a pretty efficient hashing mechanism, regardless of
what the underlying filesystem is doing.
But you should realize that the dcache is nothing but a cache, and while
very good for most normal loads you can still get into nasty performance
behaviour by having the "wrong" access patterns.
Linus
>Depending on your access patterns, the directory cache will kick in, and
>do most of the real work.
>
>And the dcache uses a pretty efficient hashing mechanism, regardless of
>what the underlying filesystem is doing.
>
>But you should realize that the dcache is nothing but a cache, and while
>very good for most normal loads you can still get into nasty performance
>behaviour by having the "wrong" access patterns.
IIRC INN was great at having the "wrong" patterns. But storing news as
one file per article is probably one of the worst things you can do.