newfs creates far too few inodes for 2TB filesystems? [solaris 10]


Peter Eriksson

Sep 8, 2005, 5:19:30 AM
I just created a 2TB filesystem on a Solaris 10/x86 system (32bit) and that
worked fine. However - I just noticed (when my rsync run started failing)
that it allocated very few inodes for that filesystem:

sunky# gdf -h /export/mirror
Filesystem Size Used Avail Use% Mounted on
/dev/md/dsk/d101 2.0T 261G 1.7T 13% /export/mirror

sunky# gdf -i /export/mirror
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md/dsk/d101 2190272 2190272 0 100% /export/mirror


Compare this to this 100GB filesystem (on another server):

adhara# gdf -h /export/staff
Filesystem Size Used Avail Use% Mounted on
/dev/md/dsk/d32 114G 73G 40G 65% /export/staff

adhara# gdf -i /export/staff
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md/dsk/d32 14414400 1140274 13274126 8% /export/staff


No wonder my rsync run started failing with "No space left on device"
after a little while (I have about 1.6TB worth of files to transfer)...

This surely must be a bug? :-)

- Peter
--
Peter Eriksson <pe...@ifm.liu.se> Phone: +46 13 28 2786
Computer Systems Manager/BOFH Cell/GSM: +46 705 18 2786
Physics Department, Linköping University Room: Building F, F203

Casper H.S. Dik

Sep 8, 2005, 5:22:38 AM
Peter Eriksson <pe...@ifm.liu.se> writes:

>I just created a 2TB filesystem on a Solaris 10/x86 system (32bit) and that
>worked fine. However - I just noticed (when my rsync run started failing)
>that it allocated very few inodes for that filesystem:


Unfortunately, that is a "known limitation": one inode per MB is the default
and minimum or fsck times would be preposterous.

I think there's talk about lowering that limit.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Bernd Haug

Sep 8, 2005, 7:35:22 AM
Casper H.S Dik <Caspe...@Sun.COM> wrote:
> and minimum or fsck times would be preposterous.

Isn't fsck more or less obsolete in times of default logging?
Or is the logging code not to be *completely* trusted?

I'm asking because with Linux-based logging filesystems I had the
experience that things don't go wrong easily w/ logging, but when bad
things happen, grab the tapes.

Never had such problems with Solaris UFS logging so far, though.

lg, Bernd
--
When emailing me, excuse my annoying spamfilter - it works for me.

Casper H.S. Dik

Sep 8, 2005, 8:06:16 AM
Bernd Haug <ha...@berndhaug.net> writes:

>Casper H.S Dik <Caspe...@Sun.COM> wrote:
>> and minimum or fsck times would be preposterous.

>Isn't fsck more or less obsolete in times of default logging?
>Or is the logging code not to be *completely* trusted?

There are always cases when an fsck is done (needed or not).

Logging is mandatory for such large filesystems.

Bernd Haug

Sep 8, 2005, 9:04:28 AM
Casper H.S Dik <Caspe...@Sun.COM> wrote:
> Bernd Haug <ha...@berndhaug.net> writes:
>>Casper H.S Dik <Caspe...@Sun.COM> wrote:
>>> and minimum or fsck times would be preposterous.
>>Isn't fsck more or less obsolete in times of default logging?
>>Or is the logging code not to be *completely* trusted?
> There are always cases when an fsck is done (needed or not).

Could you point me to authoritative docs regarding when this happens?
I'd rather know beforehand when to expect pretty long fsck's on boot or
somesuch.

> Logging is mandatory for such large filesystems.

As in, not mounted w/o the option, or silently enabled or..?

Thank you very much for your response.

Casper H.S. Dik

Sep 8, 2005, 9:44:17 AM
Bernd Haug <ha...@berndhaug.net> writes:

>Could you point me to authoritative docs regarding when this happens?
>I'd rather know beforehand when to expect pretty long fsck's on boot or
>somesuch.

Generally, this only happens when there is an I/O error when writing
the log or rolling the log. So this happens when there are issues
with the hardware.

>> Logging is mandatory for such large filesystems.

>As in, not mounted w/o the option, or silently enabled or..?

As in "you cannot mount w/o logging".

Peter Eriksson

Sep 9, 2005, 8:04:43 AM
Casper H.S. Dik <Caspe...@Sun.COM> writes:

>Peter Eriksson <pe...@ifm.liu.se> writes:

>>I just created a 2TB filesystem on a Solaris 10/x86 system (32bit) and that
>>worked fine. However - I just noticed (when my rsync run started failing)
>>that it allocated very few inodes for that filesystem:


>Unfortunately, that is a "known limitation": one inode per MB is the default
>and minimum or fsck times would be preposterous.

>I think there's talk about lowering that limit.

Sigh.

It then seems the largest usable filesystem one can create is 999GB - as soon as one tries a 1TB one it switches to the silly amount of inodes:


# metainit d102 -p d100 990G
d102: Soft Partition is setup
# newfs /dev/md/rdsk/d102
newfs: construct a new file system /dev/md/rdsk/d102: (y/n)? y
/dev/md/rdsk/d102: 2076180480 sectors in 63360 cylinders of 128 tracks, 256 sectors
1013760.0MB in 21120 cyl groups (3 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98592, 197152, 295712, 394272, 492832, 591392, 689952, 788512, 887072,
Initializing cylinder groups:
.....^C
# metaclear d102
d102: Soft Partition is cleared
# metainit d102 -p d100 999G
d102: Soft Partition is setup
# newfs /dev/md/rdsk/d102
newfs: construct a new file system /dev/md/rdsk/d102: (y/n)? y
/dev/md/rdsk/d102: 2095054848 sectors in 63936 cylinders of 128 tracks, 256 sectors
1022976.0MB in 21312 cyl groups (3 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 98592, 197152, 295712, 394272, 492832, 591392, 689952, 788512, 887072,
Initializing cylinder groups:


Irritating limitation. Ah well, then I guess I'll have to clone the
multi-filesystem configuration from my primary server to the backup one
instead of simply using one fat partition for the backup...

Logan Shaw

Sep 9, 2005, 6:31:46 PM
Peter Eriksson wrote:
> Sigh.
>
> It then seems the largest usable filesystem one can create is 999GB - as
> soon as one tries a 1TB one it switches to the silly amount of inodes:

The Solaris 10 manual page for newfs says that the default number of
bytes per i-node (which determines the total i-node count for a
filesystem of a given size) jumps from 8192 to 1048576 when the
filesystem size goes beyond the 1 TB boundary.

What's really strange about that is that it's such a huge jump. It's
more than 2 orders of magnitude. It means that a 9.0 TB filesystem
has less than 1/10th as many i-nodes as a 0.9 TB filesystem.
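
As a rough check of the figures above, here is a minimal sketch that estimates inode counts as filesystem size divided by bytes per i-node. The two-tier rule, the function names, and the treatment of the exact 1 TB boundary are simplifying assumptions for illustration; real newfs rounds to cylinder-group boundaries and uses smaller defaults for small filesystems.

```python
TIB = 1024 ** 4  # bytes in one binary terabyte

def default_nbpi(size_bytes):
    """Default bytes per i-node as described in this thread.

    Simplified assumption: tiers below 8192 for small filesystems are
    ignored, and exactly 1 TB is treated as already over the boundary.
    """
    return 8192 if size_bytes < TIB else 1048576

def estimated_inodes(size_bytes, nbpi=None):
    """Approximate i-node count: one i-node per nbpi bytes of filesystem."""
    if nbpi is None:
        nbpi = default_nbpi(size_bytes)
    return int(size_bytes) // nbpi

small = estimated_inodes(0.9 * TIB)  # roughly 121 million
large = estimated_inodes(9.0 * TIB)  # roughly 9.4 million
print(small, large, large / small)
```

Under these assumptions the 9.0 TB estimate is under a tenth of the 0.9 TB one, and a 2 TB filesystem gets about 2.1 million i-nodes, close to the 2190272 reported by gdf.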

But, since there's so much room between 8192 bytes per i-node and
1048576 bytes per i-node, what about using "newfs -i NNNN" to choose
something more in the middle?

The "df" output you posted shows that at 1048576 bytes per i-node
you can only fill the disk to 13% disk space used. Which means
that, for that mix of file sizes, you have about 1/8th as many
i-nodes as you need.

So, you don't need to go all the way to 8192 bytes per i-node. Just
going to 131072 bytes per i-node should be sufficient to give you
enough i-nodes to fill that filesystem to 100% with that mix of files.

Of course, you'd want a safety factor, so what about creating the
2 TB filesystem with "newfs -i 32768"? That would create a filesystem
with the same number of i-nodes as the system would, by default, create
in a 0.5 TB filesystem. And we know the system can handle an 0.5 TB
filesystem with that number of i-nodes, so it stands to reason it would
likely be able to handle a 2 TB filesystem with the same number of i-nodes.
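
The reasoning above can be checked with a few lines of arithmetic. This is only a sketch using the rounded figures from the gdf output earlier in the thread (261G used, 2190272 i-nodes used); the variable names are illustrative, and because gdf -h rounds its sizes the result is approximate.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

used_bytes = 261 * GIB   # "Used" column from gdf -h (rounded by gdf)
used_inodes = 2190272    # "IUsed" column from gdf -i (every i-node consumed)

# Average space consumed per file. If nbpi is larger than this value, the
# filesystem runs out of i-nodes before it runs out of blocks.
avg_bytes_per_inode = used_bytes / used_inodes
print(round(avg_bytes_per_inode))

# Estimated i-node counts for a 2 TB filesystem at the densities discussed.
for nbpi in (1048576, 131072, 32768, 8192):
    print(nbpi, 2 * TIB // nbpi)

# "newfs -i 32768" on 2 TB would match the default count for 0.5 TB.
same_as_half_tb = (2 * TIB // 32768) == (TIB // 2 // 8192)
print(same_as_half_tb)
```

With these rounded inputs the average comes out just under 128 KB per file, so 131072 bytes per i-node is borderline at best, while 32768 leaves roughly a fourfold margin.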

That is, unless newfs enforces some kind of hard limit on the "-i" value,
which I think personally would make no sense. (A limit on the maximum
number of i-nodes is something I could understand, but that's a different
thing.)

- Logan

Frank Batschulat

Sep 10, 2005, 5:00:02 AM
On Thu, 08 Sep 2005 15:04:28 +0200, Bernd Haug <ha...@berndhaug.net> wrote:

> Casper H.S Dik <Caspe...@Sun.COM> wrote:
>> Bernd Haug <ha...@berndhaug.net> writes:
>>> Casper H.S Dik <Caspe...@Sun.COM> wrote:
>>>> and minimum or fsck times would be preposterous.
>>> Isn't fsck more or less obsolete in times of default logging?
>>> Or is the logging code not to be *completely* trusted?
>> There are always cases when an fsck is done (needed or not).
>
> Could you point me to authoritative docs regarding when this happens?
> I'd rather know beforehand when to expect pretty long fsck's on boot or
> somesuch.

The only reasons for a logging filesystem to require a full-blown
fsck run are A) UFS self-induced panics, like the "freeing free XXX" panics
that point to metadata inconsistencies, B) a user manually
forcing it, or C) the log having been trashed, e.g. due to I/O problems.

>> Logging is mandatory for such large filesystems.
>
> As in, not mounted w/o the option, or silently enabled or..?

It is not strictly enforced by the code, but it is recommended for
MTBufs (multi-terabyte UFS) filesystems.

---
frankB

Frank Batschulat

Sep 10, 2005, 5:16:18 AM
On Thu, 08 Sep 2005 11:19:30 +0200, Peter Eriksson <pe...@ifm.liu.se>
wrote:

> I just created a 2TB filesystem on a Solaris 10/x86 system (32bit) and that
> worked fine. However - I just noticed (when my rsync run started failing)
> that it allocated very few inodes for that filesystem:
>
> sunky# gdf -h /export/mirror
> Filesystem Size Used Avail Use% Mounted on
> /dev/md/dsk/d101 2.0T 261G 1.7T 13% /export/mirror
>
> sunky# gdf -i /export/mirror
> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/md/dsk/d101 2190272 2190272 0 100% /export/mirror
>
>
> Compare this to this 100GB filesystem (on another server):
>
> adhara# gdf -h /export/staff
> Filesystem Size Used Avail Use% Mounted on
> /dev/md/dsk/d32 114G 73G 40G 65% /export/staff
>
> adhara# gdf -i /export/staff
> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/md/dsk/d32 14414400 1140274 13274126 8% /export/staff
>
>
> No wonder my rsync run started failing with "No space left on device"
> after a little while (I have about 1.6TB worth of files to transfer)...
>
> This surely must be a bug? :-)

Not directly. MTBufs was not meant simply to enlarge UFS
into a bigger data store for many more files; it was intended
for special needs like databases, which tend to have a few files
that are really large in size. To prevent weeks or months
of fsck time when logging is disabled, the restrictions
you encountered on the inode density (and thus the number of inodes)
have been put in place.

we're currently evaluating to remove this limit perhaps.

---
frankB

Peter Eriksson

Sep 10, 2005, 8:42:26 AM
Logan Shaw <lshaw-...@austin.rr.com> writes:

>Peter Eriksson wrote:
>So, you don't need to go all the way to 8192 bytes per i-node. Just
>going to 131072 bytes per i-node should be sufficient to give you
>enough i-nodes to fill that filesystem to 100% with that mix of files.

>Of course, you'd want a safety factor, so what about creating the
>2 TB filesystem with "newfs -i 32768"? That would create a filesystem
>with the same number of i-nodes as the system would, by default, create
>in a 0.5 TB filesystem. And we know the system can handle an 0.5 TB
>filesystem with that number of i-nodes, so it stands to reason it would
>likely be able to handle a 2 TB filesystem with the same number of i-nodes.

>That is, unless newfs enforces some kind of hard limit on the "-i" value,
>which I think personally would make no sense. (A limit on the maximum
>number of i-nodes is something I could understand, but that's a different
>thing.)

It seems it silently enforces the limit :-(


# metainit d101 -p d100 2T
d101: Soft Partition is setup

# newfs -i 32768 /dev/md/rdsk/d101
newfs: construct a new file system /dev/md/rdsk/d101: (y/n)? y
Warning: 2048 sector(s) in last cylinder unallocated
/dev/md/rdsk/d101: 4294967296 sectors in 699051 cylinders of 48 tracks, 128 sectors
2097152.0MB in 4889 cyl groups (143 c/g, 429.00MB/g, 448 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 878752, 1757472, 2636192, 3514912, 4393632, 5272352, 6151072, 7029792,
7908512,
Initializing cylinder groups:
...............................................................................
..................
super-block backups for last 10 cylinder groups at:
4286652320, 4287531040, 4288409760, 4289288480, 4290167200, 4291045920,
4291924640, 4292803360, 4293682080, 4294560800,

# mount /dev/md/dsk/d101 /mnt

# /pkg/gnu/bin/gdf -i /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/md/dsk/d101 2190272 4 2190268 1% /mnt
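
The total reported by gdf can be reproduced from the newfs summary line, which makes it possible to spot that the -i value was ignored even before mounting. A minimal sketch: the regular expression and helper name are my own, and it assumes the summary format shown in this thread with MB meaning 2^20 bytes.

```python
import re

def inodes_from_newfs_summary(line):
    """Return (total_inodes, approx_bytes_per_inode) from a newfs summary
    line such as '... in 4889 cyl groups (143 c/g, 429.00MB/g, 448 i/g)'."""
    m = re.search(r"(\d+) cyl groups \(\d+ c/g, ([\d.]+)MB/g, (\d+) i/g\)", line)
    if m is None:
        raise ValueError("unrecognized newfs summary line")
    groups = int(m.group(1))
    mb_per_group = float(m.group(2))
    inodes_per_group = int(m.group(3))
    total = groups * inodes_per_group
    # Assumption: "MB" in the newfs output means 1024 * 1024 bytes.
    density = mb_per_group * 1024 * 1024 / inodes_per_group
    return total, density

total, density = inodes_from_newfs_summary(
    "2097152.0MB in 4889 cyl groups (143 c/g, 429.00MB/g, 448 i/g)")
print(total)           # 2190272, the same total gdf -i reports
print(round(density))  # close to 1 MB per i-node, not the requested 32768
```

Run on the 990G summary line from earlier in the thread (21120 cyl groups, 5824 i/g), the same helper gives about 123 million i-nodes at a density a little above 8192 bytes.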

Peter Eriksson

Sep 10, 2005, 8:49:55 AM
"Frank Batschulat" <frank.batschulat@i_hate_spam_sun.com> writes:

>Not directly. MTBufs was not meant simply to enlarge UFS
>into a bigger data store for many more files; it was intended
>for special needs like databases, which tend to have a few files
>that are really large in size. To prevent weeks or months
>of fsck time when logging is disabled, the restrictions
>you encountered on the inode density (and thus the number of inodes)
>have been put in place.

>we're currently evaluating to remove this limit perhaps.

Long fsck times would be a non-issue for _my_ application. If the
filesystem went bad I'd simply newfs it and then resync the files
from the source system (our primary fileserver).

I don't even use redundancy for the 14x400GB SATA drives that are used
to make the (5.1TB) basic metadevice... :-)

(It is built out of 2 groups of 7 striped drives for maximum performance :-)

# metastat d100
d100: Concat/Stripe
Size: 10936457595 blocks (5.1 TB)
Stripe 0: (interlace: 512 blocks)
Device Start Block Dbase State Reloc Hot Spare
c0t0d0s2 0 No Okay Yes
c0t1d0s2 16065 No Okay Yes
c0t2d0s2 16065 No Okay Yes
c0t3d0s2 16065 No Okay Yes
c0t4d0s2 16065 No Okay Yes
c0t5d0s2 16065 No Okay Yes
c0t6d0s2 16065 No Okay Yes
Stripe 1: (interlace: 512 blocks)
Device Start Block Dbase State Reloc Hot Spare
c0t7d0s2 16065 No Okay Yes
c0t8d0s2 16065 No Okay Yes
c0t9d0s2 16065 No Okay Yes
c0t10d0s2 16065 No Okay Yes
c0t11d0s2 16065 No Okay Yes
c0t12d0s2 16065 No Okay Yes
c0t13d0s2 16065 No Okay Yes

By the way - I attempted to clone my metadevice configuration from the primary server
(I ran "metastat -p | fgrep -- '-p d100'" and then tried to use the output to create an
identical setup on my "clone" server) - but it failed because the first soft partition
on the source server (Solaris 9/SPARC) looks like this:

# metastat -p |egrep d31
d31 -p d100 -o 1 -b 209715200

If I try to run "metainit d31 -p d100 -o 1 -b 209715200" on the clone server
(Solaris 10/x86 32bit) I just get an error:

# metainit d31 -p d100 -o 1 -b 209715200
metainit: sunky.ifm.liu.se: d31: overlapping extents specified

Annoying....

Logan Shaw

Sep 10, 2005, 4:13:27 PM
Frank Batschulat wrote:
> Not directly. MTBufs was not meant simply to enlarge UFS
> into a bigger data store for many more files; it was intended
> for special needs like databases, which tend to have a few files
> that are really large in size. To prevent weeks or months
> of fsck time when logging is disabled, the restrictions
> you encountered on the inode density (and thus the number of inodes)
> have been put in place.

If there is going to be a limit, why not put the limit directly on
the number of i-nodes instead of on the i-node density? I agree
too many i-nodes causes slow fsck times, but why is it possible to
make a 0.9 TB filesystem with more i-nodes than is possible with
a 9.0 TB filesystem[1]?

If the limit is lowered (1048576 is changed to another number) but
the type of limit isn't changed, that would offer some relief, but
it still would be odd and inconsistent. Doesn't it seem like maximum
number of i-nodes as a function of filesystem size should be a
non-decreasing function?

- Logan

[1] For 0.9 TB, 8192 bytes per i-node density is allowed, giving
about 121 million i-nodes. For 9.0 TB, only 1048576 density
is allowed, giving only about 9.4 million i-nodes.

Logan Shaw

Sep 10, 2005, 4:17:49 PM
Peter Eriksson wrote:
> It seems it silently enforces the limit :-(
>
>
> # metainit d101 -p d100 2T
> d101: Soft Partition is setup
>
> # newfs -i 32768 /dev/md/rdsk/d101

> # /pkg/gnu/bin/gdf -i /mnt


> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/md/dsk/d101 2190272 4 2190268 1% /mnt

Darn.

Well, here's an idea for a really stupid hack to get the number of
i-nodes you want:

(1) Create 0.99 TB metadevice.
(2) newfs to get the number of i-nodes you ultimately want for
your 2.0 TB filesystem.
(3) Expand metadevice to 2.0 TB.
(4) growfs filesystem to 2.0 TB.

Presumably, growfs isn't going to delete already-existing i-nodes.
And it probably won't even pay attention to the i-node density
issue for filesystems of 1 TB or larger.

On the other hand, I can't guarantee it won't refuse to operate,
or crash, or trash the filesystem... :-)

(Whether you think this is a good procedure to use on a production
server is another question entirely...)

- Logan

Daniel Rock

Sep 10, 2005, 7:36:03 PM
Logan Shaw <lshaw-...@austin.rr.com> wrote:
> (1) Create 0.99 TB metadevice.
> (2) newfs to get the number of i-nodes you ultimately want for
> your 2.0 TB filesystem.
> (3) Expand metadevice to 2.0 TB.
> (4) growfs filesystem to 2.0 TB.

Won't work. growfs checks whether the filesystem was created as MTB. If not, it will
refuse to grow it beyond 1TB.

But why don't you grab the OpenSolaris source and recompile mkfs/newfs
yourself? Just redefine:

#define MTB_NBPI (MB) /* Number Bytes Per Inode for multi-terabyte */

in mkfs.c

I don't think the inode limits are enforced in the ufs kernel code, only
in mkfs.


--
Daniel

Martin Maclaren

Sep 15, 2005, 4:47:46 AM
>we're currently evaluating to remove this limit perhaps.

Should support calls/RFEs be raised? Our situation looks like:

We are looking to expand `/mail':
Filesystem kbytes used avail capacity Mounted on
393214384 248176623 105716323 71% /mail
Filesystem iused ifree %iused Mounted on
9297284 187310716 5% /mail

But the new `/mail' looks like:

Filesystem kbytes used avail capacity Mounted on
1430721384 65560 1416348616 1% /mail
Filesystem iused ifree %iused Mounted on
4 1459580 0% /mail
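
The effective i-node density of an existing filesystem can be estimated from the two df listings by dividing total size by total i-nodes (iused + ifree). The sketch below applies this to the figures above; the function name is illustrative, and since the kbytes column excludes some filesystem overhead the result is only an approximation.

```python
def bytes_per_inode(kbytes, iused, ifree):
    """Approximate bytes of filesystem per i-node from df -k and df -o i."""
    return kbytes * 1024 / (iused + ifree)

old_mail = bytes_per_inode(393214384, 9297284, 187310716)
new_mail = bytes_per_inode(1430721384, 4, 1459580)
print(round(old_mail), round(new_mail))
```

By this estimate the existing /mail has about 2 KB per i-node, while the new one has about 1 MB per i-node, which matches the multi-terabyte default discussed in this thread.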

Martin

Peter Eriksson

Sep 15, 2005, 5:50:44 AM
cc...@bath.ac.uk (Martin Maclaren) writes:

>>we're currently evaluating to remove this limit perhaps.

>Should support calls/RFEs be raised? Our situation looks like:


I can report that it works to build your own "mkfs" from the OpenSolaris
sources if you replace:

#define MTB_NBPI (MB) /* Number Bytes Per Inode for multi-terabyte */

with:

#define MTB_NBPI 8192 /* Number Bytes Per Inode for multi-terabyte */

This made it possible for me to create a 2TB filesystem with plenty of inodes:

sunky:~> df -o i /export/mirror
Filesystem iused ifree %iused Mounted on
/dev/md/dsk/d101 11807794 242648590 5% /export/mirror

sunky:~> df -k /export/mirror
Filesystem kbytes used avail capacity Mounted on
/dev/md/dsk/d101 2114976840 1594704104 499122968 77% /export/mirror

Brave admins may find my patched sources together with a Makefile at:

ftp://ftp.ifm.liu.se/pub/unix/mkfs-sol10-ifm.tar.gz

Don't blame me if you use this and your server then catches fire and/or
explodes.

- Peter

Frank Batschulat

Sep 17, 2005, 5:07:01 AM
On Thu, 15 Sep 2005 10:47:46 +0200, Martin Maclaren <cc...@bath.ac.uk>
wrote:

>> we're currently evaluating to remove this limit perhaps.
>
> Should support calls/RFEs be raised? Our situation looks like:

that exists already:

5015643 Remove or reduce the nbpi restriction from *mkfs_ufs* for UFS mtb
filesystems

---
frankB
