
Big volumes, small files


the....@gmail.com

Apr 18, 2007, 9:48:30 AM
Hello,

I'd like to plan a new storage solution for a system currently in
production.

The system's storage is based on code which writes many files to the
file system, with overall storage needs currently around 40TB and
expected to reach hundreds of TBs. The average file size of the system
is ~100K, which translates to ~500 million files today, and billions
of files in the future. This storage is accessed over NFS by a rack of
40 Linux blades, and is mostly read-only (99% of the activity is
reads). While I realize calling this sub-optimal system design is
probably an understatement, the design of the system is beyond my
control and isn't likely to change in the near future.
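As a sanity check on those numbers, a quick back-of-the-envelope calculation (the 300 TB stand-in for "hundreds of TBs" is my assumption; the result lands in the same ballpark as the ~500 million files quoted):

```python
# Back-of-the-envelope file counts from total capacity / average file size.
TB = 10**12              # decimal terabyte
AVG_FILE = 100 * 10**3   # ~100 KB average file size

files_now = 40 * TB // AVG_FILE       # ~400 million files at 40 TB
files_future = 300 * TB // AVG_FILE   # ~3 billion files at 300 TB (assumed)

print(f"today:  ~{files_now / 10**6:.0f} million files")
print(f"future: ~{files_future / 10**9:.1f} billion files")
```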

The system's current storage is based on 4 VxFS filesystems, created
on SVM meta-devices each ~10TB in size. A 2-node Sun Cluster serves
the filesystems, 2 filesystems per node. Each of the filesystems
undergoes growfs as more storage is made available. We're looking for
an alternative solution, in an attempt to improve performance and
ability to recover from disasters (fsck on half a billion files isn't
practical, and that worries me - even the smallest filesystem
inconsistency could leave me with lots of useless bits).

Question is - can someone with experience with large filesystems and
many small-files share his stories? Is it practical to base such a
solution on a few (8) large volumes, each with single large filesystem
in it?

Many thanks in advance for any advice,
- Yaniv

carmelomcc

Apr 19, 2007, 11:37:06 AM

Your best bet would be to go to a NAS appliance, e.g. EMC or NetApp.
Their NAS devices can handle this load better than any Veritas
solution. The NSX model from EMC will let you go to 32 TB per file
system per data mover. It also allows for backups via snapshots. Don't
use ATA storage; try to use low-cost Fibre Channel drives - they are
rated for a higher duty cycle than standard ATA. If you used a DMX3 and
an NSX you would be able to handle 2 to 3 years' worth of growth within
a two-unit environment.

Faeandar

Apr 19, 2007, 1:54:25 PM

For your particular situation there is a universal statement that
applies; you're screwed.

The simple answer is there is nothing out there yet that will handle
lots of small files well, except maybe a RamSan.

I agree with carmelomcc that a proprietary NAS may be a good fit, but
I disagree that EMC makes a NAS. It's a pile of crap.
Depending on budget NetApp may do fine. There's also BlueArc and
Agami. I don't think clustered storage is going to help you in any
way.

However, if you are mostly reads what is keeping you from using a
boatload of cache?

Backups are going to be painful, no way around it. The best thing I
can think of given your current environment is using cache devices for
client performance enhancement and FlashBackup for backups and
recovery.
FlashBackup will take the entire image as a backup and not spend eons
mapping blocks to files. I'm not exactly sure how it works but people
swear by it.

~F

Bill Todd

Apr 20, 2007, 12:40:13 AM

Though I have no direct experience with it, my impression is that this
may be a workload which ZFS could handle well (I don't know what level
of maturity ZFS has attained by now, but Apple's recent embrace of it
suggests that it may be pretty solid). Its maximum block size goes to
128KB, so many/most files could fit in a single block. It grows
dynamically as required, without the 16TB (?) limit of ext[2|3]fs on
Linux (though other mature Linux file systems like JFS and XFS might be
worth considering - possibly even ReiserFS if V3 is sufficiently
stable). Sun may support ZFS as a cluster file system by now (IIRC
plans were in place to).

Any mature journaling file system with snapshots should address the fsck
and backup issues (one of the nice things about ZFS is that its
background integrity-checking and increased metadata-replication
mechanisms reduce even further the chances that the system will ever get
sufficiently hosed that fsck would be required). If directories must be
large (are all the files in just one?), you'd want a file system with
b-tree or hash-indexed directories (I think everything I mentioned above
qualifies).

Good luck,

- bill

Faeandar

Apr 20, 2007, 11:01:54 AM
On Fri, 20 Apr 2007 00:40:13 -0400, Bill Todd <bill...@metrocast.net>
wrote:


ZFS is great in concept and I think they are on the right path,
however it's not yet ready for primetime imo.

The integrated integrity checking is extremely CPU intensive. It does
not cluster yet, at least not as of 2 weeks ago. Many file systems
grow dynamically so I would make that a check in ZFS's column. No
practical TB limit is a win if you need to go beyond 16TB in a single
FS.
I'm not sure I see how snapshots or journaling helps with backups. It
still has to map blocks to files, which is the long part of a backup.
I know when NetApp backups occur it takes the snapshot and then tries
to do a dump. If you have millions of files it can be hours before
data is actually transferred, I believe ZFS is no different.

Since the OP's IO pattern is mostly reads the cpu load may not be an
issue but writes suffer a serious penalty if you are not cpu-rich.
I've spoken with people who ran an Oracle db on ZFS and said they had
to move back until they had a T2000 or so.

~F

Bill Todd

Apr 20, 2007, 10:24:31 PM
Faeandar wrote:

...

> ZFS is great in concept and I think they are on the right path,
> however it's not yet ready for primetime imo.

Though (as I already noted) I don't have any direct experience with it,
my impression is that people are using it in production systems
successfully - so a description of your specific reservations would be
useful.

>
> The integrated integrity checking is extremely CPU intensive.

I suspect that you're mistaken: IIRC it occurs as part of an
already-existing data copy operation at a very low level in the disk
read/write routines, and at close to memory-streaming speeds (i.e.,
mostly using CPU cycles that are being used anyway just to copy the data).

> It does not cluster yet, at least not as of 2 weeks ago.

It was not clear that this was a requirement in this case - but since
the OP mentioned clustering, I mentioned the soon-to-arrive capability.

> Many file systems grow dynamically so I would make that a check in
> ZFS's column.

I'm not sure they grow dynamically quite as painlessly as ZFS does:
usually, you first have to arrange to expand the underlying disk storage
at the volume-manager level, and then have to incorporate the increase
in volume size into the file system.

> No practical TB limit is a win if you need to go beyond 16TB in a single
> FS.
> I'm not sure I see how snapshots or journaling helps with backups.

I should have added the word 'respectively', I guess: journaling helps
avoid the need for fsck, and snapshots help expedite backups (by
avoiding any need for down-time while making them).

> It still has to map blocks to files, which is the long part of a backup.
> I know when NetApp backups occur it takes the snapshot and then tries
> to do a dump. If you have millions of files it can be hours before
> data is actually transferred, I believe ZFS is no different.

Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
WAFL IIRC, though if WAFL does a good job of defragmenting files the
difference may not be too substantial). With the OP's 100 KB file
sizes, this means that each file can be accessed (backed up) with a
single disk access, yielding a fairly respectable backup bandwidth of
about 6 MB/sec (assuming that such an access takes about 16 ms. for a
7200 rpm drive, including transfer time, and that the associated
directory accesses can be batched during the scan).
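The ~6 MB/s figure above can be reproduced directly from the post's own assumptions (one disk access per ~100 KB file, ~16 ms per access on a 7200 rpm drive):

```python
FILE_SIZE = 100 * 10**3   # ~100 KB read in a single disk access
ACCESS_TIME = 0.016       # ~16 ms per access: seek + rotation + transfer

bandwidth = FILE_SIZE / ACCESS_TIME   # bytes/second per spindle
files_per_hour = 3600 / ACCESS_TIME   # one file per access

print(f"~{bandwidth / 10**6:.2f} MB/s per spindle")      # ~6.25 MB/s
print(f"~{files_per_hour:,.0f} files/hour per spindle")  # ~225,000
```

At half a billion files, that per-spindle rate is also a reminder of why full backups of such a system take so long unless many spindles stream in parallel.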

>
> Since the OP's IO pattern is mostly reads the cpu load may not be an
> issue but writes suffer a serious penalty if you are not cpu-rich.

I'm not sure why that would be the case even if the integrity-checking
*were* CPU-intensive, since the overhead to check the integrity on a
read should be just about the same as the overhead to generate the
checksum on a write. True, one must generate it all the way back up to
the system superblock for a write (one reason why I prefer a
log-oriented implementation that can defer and consolidate such
activity), but below the root unless you've got many of the
intermediate-level blocks cached you have to access and validate them on
each read (and with on the order of a billion files, my guess is that
needed directory data will quite frequently not be cached).

> I've spoken with people who ran an Oracle db on ZFS and said they had
> to move back until they had a T2000 or so.

Now in *that* application I suspect that a lot of the intermediate
blocks *are* often cached on reads, which does drive up relative write
overhead substantially (not so much due to integrity-checking per se -
since as I already noted I think that it piggybacks on a copy operation
- as due to the need to write back the entire block-tree path on each
update).

- bill

Jan-Frode Myklebust

Apr 21, 2007, 1:18:25 PM
On 2007-04-18, the....@gmail.com <the....@gmail.com> wrote:

[saw your question on the GPFS forum, but prefer news..]

> The system's storage is based on code which writes many files to the
> file system, with overall storage needs currently around 40TB and
> expected to reach hundreds of TBs. The average file size of the system
> is ~100K, which translates to ~500 million files today, and billions
> of files in the future. This storage is accessed over NFS by a rack of
> 40 Linux blades, and is mostly read-only (99% of the activity is
> reads). While I realize calling this sub-optimal system design is
> probably an understatement, the design of the system is beyond my
> control and isn't likely to change in the near future.

> Question is - can someone with experience with large filesystems and


> many small-files share his stories? Is it practical to base such a
> solution on a few (8) large volumes, each with single large filesystem
> in it?

My largest GPFS system is 10 TB usable (700 GB currently in use), with
an average file size of 70-80KB and 9M inodes used. Not quite as large
as yours, but it might have some of the same properties. It's a
Maildir-based mailserver cluster, and is working very well. The only
issue we've had is that writing to directories with tens of thousands
of files can be too expensive. We will probably need to move metadata
to separate volumes, to give it more cache than the data volumes, to
fix this.

If I were to do your huge system with GPFS, I would first try to spread
the files over as many directories as possible, and also across as many
separate file systems as possible. Spreading over many directories,
because I think GPFS is doing directory-level locking for some
operations (adding new files?), and spread over as many file systems
as possible to reduce the fsck time (GPFS does do online fsck, but
not everything can be fixed while online) and make sure that a
catastrophic file system error doesn't take down everything.
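Spreading files over many directories can be made deterministic with a hash-based fan-out; a minimal sketch (the two-level 256x256 layout and the `shard_path` helper are illustrative choices of mine, not anything GPFS provides):

```python
import hashlib
import os.path

def shard_path(root: str, name: str, levels: int = 2) -> str:
    """Map a file name to a stable subdirectory based on its hash.

    Two hex-pair levels give 256 * 256 = 65,536 leaf directories, so
    even a billion files averages only ~15K entries per directory.
    """
    digest = hashlib.md5(name.encode()).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]  # e.g. ['a3', 'f0']
    return os.path.join(root, *parts, name)

print(shard_path("/data", "message-0001234"))  # /data/<xx>/<yy>/message-0001234
```

Since the mapping is a pure function of the name, readers recompute the same path on lookup and no central index is needed.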

I would also try to avoid NFS if possible. Having the clients mount the
fs's as GPFS clients (tcpip or SAN) will probably be much better, and
will avoid the bottlenecks and SPOFs of NFS.

-jf

Pete

Apr 21, 2007, 7:39:10 PM

What features does GPFS have that will make it better than NFS? GPFS is
a good filesystem with cluster support, but I don't see anything special
that will help with the large numbers of files the OP is trying to deal
with.

Pete

the wharf rat

Apr 22, 2007, 12:59:10 AM
In article <1176904109....@b75g2000hsg.googlegroups.com>,
<the....@gmail.com> wrote:
>Hello,

>
>Question is - can someone with experience with large filesystems and
>many small-files share his stories? Is it practical to base such a
>solution on a few (8) large volumes, each with single large filesystem
>in it?

You've already got Sun. Why not just migrate to ZFS? ZFS
operations are constant speed regardless of size of file system or
size of directory. Lots of other neat stuff too.

Aknin

Apr 22, 2007, 5:44:19 AM
On Apr 22, 7:59 am, w...@panix.com (the wharf rat) wrote:
> In article <1176904109.982831.76...@b75g2000hsg.googlegroups.com>,

>
> <the.ak...@gmail.com> wrote:
> >Hello,
>
> >Question is - can someone with experience with large filesystems and
> >many small-files share his stories? Is it practical to base such a
> >solution on a few (8) large volumes, each with single large filesystem
> >in it?
>
> You've already got Sun. Why not just migrate to ZFS? ZFS
> operations are constant speed regardless of size of file system or
> size of directory. Lots of other neat stuff too.

We have seen unexplained performance issues with NFS/ZFS. We've tried
several configurations in which we ran the SPEC SFS test against
identical systems, one exporting NFS over ZFS, the other over UFS.
UFS' performance was an order of magnitude better.
While there may be several explanations, we're still pretty worried
about switching such a large installation to a relatively new
filesystem, especially with the performance questions hovering over its
head. We've followed several pieces of performance and tuning advice,
including setting zil_disable, to no avail. UFS beat ZFS every time (in
other tests, not based on SFS and NFS, ZFS was superior).

Anyone else with information about ZFS/NFS is very much invited to
share his or her experience.

Jan-Frode Myklebust

Apr 22, 2007, 6:26:10 AM
On 2007-04-21, Pete <ni...@EGGSANDSPAMblueyonder.co.uk> wrote:
>>
>> I would also try to avoid NFS if possible. Having the clients mount the
>> fs's as GPFS clients (tcpip or SAN) will probably be much better, and
>> will avoid the bottlenecks and SPOFs of NFS.
>
> What features does GPFS have that will make it better than NFS? GPFS is
> a good filesystem with cluster support, but I don't see anything special
> that will help with the large numbers of files the OP is trying to deal
> with.

Comparing GPFS to NFS is a bit apples to oranges, in that one is an access
method to an underlying fs, while the other is a real fs. But, besides my
own experience that NFS is slow at serving small files, I would point
at a few features making it better than NFS for the OP's problem:

With GPFS you have two modes of giving access to disk. Either you give
all nodes direct access through the SAN, or you let only a subset of the
nodes access it through the SAN and have them serve the disks as (in GPFS
speak) Network Shared Disks (NSDs). These NSDs can be accessed directly
on the SAN by the nodes that see them there, or they can be accessed over
tcp/ip via a node that in turn can access them on the SAN. A single NSD
will typically have a primary and a secondary node serving it, to avoid a
SPOF. So already here we have higher availability than NFS, in that there's
not a single node the client depends on.

Further, once you have more than one NSD in the same file system, the nodes
will typically load-balance the I/O over several NSD-serving nodes.

Further, the NSD-serving nodes won't be busy with file system operations,
as I believe the NSDs are more like network block devices, so GPFS has
distributed the filesystem operations away from the single NFS server
and out to the clients. I would assume this will work especially well for
the OP's problem, in that he's mostly doing reads, and then won't have
to worry about overloading the lock manager.

A random result from Google comparing GPFS/NSD to NFS:
http://www.nus.edu.sg/comcen/svu/publications/hpc_nus/sep_2005/Performance.pdf

But finding benchmark results for small-file I/O is not easy. Filesystems
seem to be too focused on high-throughput, streaming I/O...


-jf

Faeandar

Apr 24, 2007, 8:17:08 PM
On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd <bill...@metrocast.net>
wrote:

>Faeandar wrote:


>
>...
>
>> ZFS is great in concept and I think they are on the right path,
>> however it's not yet ready for primetime imo.
>
>Though (as I already noted) I don't have any direct experience with it,
>my impression is that people are using it in production systems
>successfully - so a description of your specific reservations would be
>useful.
>
>>
>> The integrated integrity checking is extremely CPU intensive.
>
>I suspect that you're mistaken: IIRC it occurs as part of an
>already-existing data copy operation at a very low level in the disk
>read/write routines, and at close to memory-streaming speeds (i.e.,
>mostly using CPU cycles that are being used anyway just to copy the data).

According to Sun, the integrity check and file system self-healing
process is a permanent background process as well as the foreground
checks you mention. In the case of a system that was completely idle
of actual IO, the system sat at around 40% CPU performing these
consistency checks. When IO is going on it backs off to some extent,
but it's still a hog.
There is no disk IO that is close to memory speeds. The consistency
checks and verifications involve checking data on the platter.

>
>> It does not cluster yet, at least not as of 2 weeks ago.
>
>It was not clear that this was a requirement in this case - but since
>the OP mentioned clustering, I mentioned the soon-to-arrive capability.

Soon-to-arrive means 1.0. It's worth noting points like that. While
ZFS is great in design it is still new.

>
>> Many file systems grow dynamically so I would make that a check in
>> ZFS's column.
>
>I'm not sure they grow dynamically quite as painlessly as ZFS does:
>usually, you first have to arrange to expand the underlying disk storage
>at the volume-manager level, and then have to incorporate the increase
>in volume size into the file system.

It depends on the system, but these days those tasks are fairly
simple. ZFS gets this extreme ease of use by not having a RAID
controller between itself and the disks, which means a JBOD (not
everyone is keen on that yet). If you put a RAID controller between
them then Sun recommends turning off the consistency checking. A lot
of what ZFS is depends on direct control of blocks.

>
>> No practical TB limit is a win if you need to go beyond 16TB in a single
>> FS.
>> I'm not sure I see how snapshots or journaling helps with backups.
>
>I should have added the word 'respectively', I guess: journaling helps
>avoid the need for fsck, and snapshots help expedite backups (by
>avoiding any need for down-time while making them).

True, but my example of the NetApp filer demonstrates that even though
you don't need downtime to do the backup, it can still be extremely
painful in an environment like the one the OP describes.

>
>> It still has to map blocks to files, which is the long part of a backup.
>> I know when NetApp backups occur it takes the snapshot and then tries
>> to do a dump. If you have millions of files it can be hours before
>> data is actually transferred, I believe ZFS is no different.
>
>Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
>WAFL IIRC, though if WAFL does a good job of defragmenting files the
>difference may not be too substantial). With the OP's 100 KB file
>sizes, this means that each file can be accessed (backed up) with a
>single disk access, yielding a fairly respectable backup bandwidth of
>about 6 MB/sec (assuming that such an access takes about 16 ms. for a
>7200 rpm drive, including transfer time, and that the associated
>directory accesses can be batched during the scan).

It's not the transfer I was referring to but rather the mapping (phases
I and II of a dump). I believe ZFS still has to map the files to
blocks even if it's a one-to-one ratio. At millions of files this can
be painful. Once those phases are done the transfer rates are
probably full pipe.
Also, in the 100KB-file-to-128KB-block ratio you lose what, 20% of
your capacity? A big trade-off in some environments.
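That fragmentation estimate is easy to pin down, assuming each ~100 KB file really did occupy a fixed 128 KB record:

```python
BLOCK = 128 * 1024       # 128 KB maximum ZFS record size
FILE_SIZE = 100 * 1024   # ~100 KB average file

wasted = BLOCK - FILE_SIZE             # 28 KB of slack per file
loss_of_raw = wasted / BLOCK           # fraction of allocated space lost
overhead_vs_data = wasted / FILE_SIZE  # extra space relative to payload

print(f"{loss_of_raw:.1%} of allocated capacity is slack")  # 21.9%
print(f"{overhead_vs_data:.1%} overhead vs. file data")     # 28.0%
```

In practice the picture is better than this worst case: ZFS's recordsize is an upper bound, and a file smaller than it is stored in a single block sized to fit the file, so the full 28 KB of slack per file shouldn't actually materialize.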

>
>>
>> Since the OP's IO pattern is mostly reads the cpu load may not be an
>> issue but writes suffer a serious penalty if you are not cpu-rich.
>
>I'm not sure why that would be the case even if the integrity-checking
>*were* CPU-intensive, since the overhead to check the integrity on a
>read should be just about the same as the overhead to generate the
>checksum on a write. True, one must generate it all the way back up to
>the system superblock for a write (one reason why I prefer a
>log-oriented implementation that can defer and consolidate such
>activity), but below the root unless you've got many of the
>intermediate-level blocks cached you have to access and validate them on
>each read (and with on the order of a billion files, my guess is that
>needed directory data will quite frequently not be cached).

In this case ZFS would also be doing the raid. If you're using a raid
controller the rules change, as do the features.

~F

Bill Todd

Apr 24, 2007, 11:48:15 PM
Faeandar wrote:
> On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd <bill...@metrocast.net>
> wrote:
>
>> Faeandar wrote:
>>
>> ...
>>
>>> ZFS is great in concept and I think they are on the right path,
>>> however it's not yet ready for primetime imo.
>> Though (as I already noted) I don't have any direct experience with it,
>> my impression is that people are using it in production systems
>> successfully - so a description of your specific reservations would be
>> useful.
>>
>>> The integrated integrity checking is extremely CPU intensive.
>> I suspect that you're mistaken: IIRC it occurs as part of an
>> already-existing data copy operation at a very low level in the disk
>> read/write routines, and at close to memory-streaming speeds (i.e.,
>> mostly using CPU cycles that are being used anyway just to copy the data).
>
> According to Sun, the integrity check and file system self-healing
> process is a permanent background process as well as the foreground
> checks you mention.

Yes, but there's no reason for that to take up very much in the way of
resources (e.g., the last study I saw in this area indicated that a full
integrity sweep once every couple of months was more than adequate to
cut the incidence of latent errors - unnoticed corruption that jumps up
to bite you after the *good* copy dies - by at least an order of
magnitude).

> In the case of a system that is completely idle
> of actual IO the system hung at around 40% performing these
> consistency checks.

That's a ridiculous amount to use as the default (well, at least for
production software - if they're still using pure idle time heavily to
reassure customers due to ZFS's newness that might explain it), and I
would be very surprised if it weren't at least tunable to a much lesser
amount.

> When IO is going on it backs off to some extent
> but it's still a hog.
> There is no disk IO that is close to memory speeds. The consistency
> checks and verifications involve checking data on platter.

Of course they do, and I never suggested otherwise. What can move at
close to memory speeds is the *CPU* overhead involved in the checks, and
it can piggyback on a memory-to-memory data move that is happening
anyway (such that few *extra* CPU cycles beyond what would already be
consumed in the move are required).

>
>>> It does not cluster yet, at least not as of 2 weeks ago.
>> It was not clear that this was a requirement in this case - but since
>> the OP mentioned clustering, I mentioned the soon-to-arrive capability.
>
> Soon-to-arrive means 1.0. It's worth noting points like that. While
> ZFS is great in design it is still new.

Everything starts off new. The question is when a product becomes
usable in production, and that's something that's measured far more by
customer experience than by a clock.

My impression is that *some* customers have workloads that have found
ZFS to be very stable already, while others push corner cases that are
still uncovering bugs (I haven't heard of any for a while that involve
actual data corruption, but I haven't been paying close attention, either).

>
>>> Many file systems grow dynamically so I would make that a check in
>>> ZFS's column.
>> I'm not sure they grow dynamically quite as painlessly as ZFS does:
>> usually, you first have to arrange to expand the underlying disk storage
>> at the volume-manager level, and then have to incorporate the increase
>> in volume size into the file system.
>
> It depends on the system, but these days those tasks are fairly
> simple. ZFS gets this extreme ease of use by not having a RAID
> controller between itself and the disks, which means a jbod (not
> everyone is keen on that yet).

Their loss, unless they need the raw single-operation low-latency
write-through performance that NVRAM hardware assist can give to a
hardware RAID box.

...

>>> It still has to map blocks to files, which is the long part of a backup.
>>> I know when NetApp backups occur it takes the snapshot and then tries
>>> to do a dump. If you have millions of files it can be hours before
>>> data is actually transferred, I believe ZFS is no different.
>> Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
>> WAFL IIRC, though if WAFL does a good job of defragmenting files the
>> difference may not be too substantial). With the OP's 100 KB file
>> sizes, this means that each file can be accessed (backed up) with a
>> single disk access, yielding a fairly respectable backup bandwidth of
>> about 6 MB/sec (assuming that such an access takes about 16 ms. for a
>> 7200 rpm drive, including transfer time, and that the associated
>> directory accesses can be batched during the scan).
>
> It's not the transfer I was referring to but rather the mapping (phase
> I and II of a dump). I believe ZFS still has to map the files to
> blocks even if it's a one to one ratio.

The one-to-one ratio is what makes the difference (at least in this
particular case, and even in general the ratio is considerably better
than in a non-extent-based file system that uses a 4 KB block size).

> At millions of files this can be painful.

Not with ZFS in this instance, unless one constructs a pathological case
with a deep directory structure and only one or two files mapped per
deep path traversal: otherwise, the mapping can proceed at less than
one mapping access per 100 KB file (if each leaf directory has multiple
files to be mapped), plus the eventual transfer access itself.

> Once those phases are done the transfer rates are probably full pipe.
> Also, in the 100KB file to 128KB block ratio you lose what, 20% of
> your capacity? Big trade off in some environments.

But likely not in this one: it's just not that large a system, nor are
the disks very expensive if they're SATA.

>
>>> Since the OP's IO pattern is mostly reads the cpu load may not be an
>>> issue but writes suffer a serious penalty if you are not cpu-rich.
>> I'm not sure why that would be the case even if the integrity-checking
>> *were* CPU-intensive, since the overhead to check the integrity on a
>> read should be just about the same as the overhead to generate the
>> checksum on a write. True, one must generate it all the way back up to
>> the system superblock for a write (one reason why I prefer a
>> log-oriented implementation that can defer and consolidate such
>> activity), but below the root unless you've got many of the
>> intermediate-level blocks cached you have to access and validate them on
>> each read (and with on the order of a billion files, my guess is that
>> needed directory data will quite frequently not be cached).
>
> In this case ZFS would also be doing the raid. If you're using a raid
> controller the rules change, as do the features.

I have no idea how your comment is meant to relate to the material it's
responding to above.

- bill
