
ZFS NFS service hanging on Sunday morning problem


Use-Author-Suppli...@127.1

May 29, 2012, 7:39:11 AM
Dear All,
Can anyone give any tips on diagnosing the following recurring problem?

I have a Solaris box (server5, SunOS server5 5.10 Generic_147441-15
i86pc i386 i86pc ) whose ZFS FS NFS exported service fails every so
often, always in the early hours of Sunday morning. I am barely
familiar with Solaris, but here is what I have managed to discern when
the problem occurs;

Jobs on other machines which access server5's shares (via automounter)
hang, and attempts to manually remote-mount shares just time out.

Remotely, showmount -e server5 shows all the exported FS are available.

On server5, the following services are running;

root@server5:/var/adm# svcs | grep nfs
online May_25 svc:/network/nfs/status:default
online May_25 svc:/network/nfs/nlockmgr:default
online May_25 svc:/network/nfs/cbd:default
online May_25 svc:/network/nfs/mapid:default
online May_25 svc:/network/nfs/rquota:default
online May_25 svc:/network/nfs/client:default
online May_25 svc:/network/nfs/server:default

On server5, I can list and read files on the affected FSs w/o problem
but any attempt to write to the FS (eg. copy a file to or rm a file
on the FS) just hangs the cp/rm process.

On server5, the command 'zfs get sharenfs pptank/local_linux'
displays the expected list of hosts/IPs with remote ro & rw access.

Here is the O/P from some other hopefully relevant commands;

root@server5:/# zpool status
  pool: pptank
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pptank      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
            c3t6d0  ONLINE       0     0     0

errors: No known data errors

root@server5:/# zpool list
NAME     SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
pptank  12.6T   384G  12.3T   2%  ONLINE  -

root@server5:/# zpool history
History for 'pptank':
<just hangs here>

root@server5:/# zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pptank       384G  12.3T     92    115  3.08M  1.22M
pptank       384G  12.3T  1.11K    629  35.5M  3.03M
pptank       384G  12.3T    886    889  27.1M  3.68M
pptank       384G  12.3T    837    677  24.9M  2.82M
pptank       384G  12.3T  1.19K    757  37.4M  3.69M
pptank       384G  12.3T  1.02K    759  29.6M  3.90M
pptank       384G  12.3T    952    707  32.5M  3.09M
pptank       384G  12.3T  1.02K    831  34.5M  3.72M
pptank       384G  12.3T    707    503  23.5M  1.98M
pptank       384G  12.3T    626    707  20.8M  3.58M
pptank       384G  12.3T    816    838  26.1M  4.26M
pptank       384G  12.3T    942    800  30.1M  3.48M
pptank       384G  12.3T    677    675  21.7M  2.91M
pptank       384G  12.3T    590    725  19.2M  3.06M


top shows the following runnable processes. Nothing excessive here AFAICT?

last pid: 25282; load avg: 1.98, 1.95, 1.86; up 1+09:02:05 07:46:29
72 processes: 67 sleeping, 1 running, 1 stopped, 3 on cpu
CPU states: 81.5% idle, 0.1% user, 18.3% kernel, 0.0% iowait, 0.0% swap
Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap

  PID USERNAME LWP PRI NICE  SIZE   RES STATE   TIME   CPU COMMAND
  748 root      18  60  -20  103M 9752K cpu/1  78:44 6.62% nfsd
24854 root       1  54    0 1480K  792K cpu/1   0:42 0.69% cp
25281 root       1  59    0 3584K 2152K cpu/0   0:00 0.02% top

The cp job above is the one mentioned earlier, attempting to copy a file
to an affected FS. I've noticed it is apparently not completely hung.

The only thing that appears specific to Sunday morning is a cronjob to
remove old .nfs* files,

root@server5:/# crontab -l | grep nfsfind
15 3 * * 0 /usr/lib/fs/nfs/nfsfind

Any suggestions on how to proceed?

Many thanks
Tom Crane

Ps. The email address in the header is just a spam-trap.
--
Tom Crane, IT support, RHUL Particle Physics.,
Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email: T.Crane at rhul dot ac dot uk
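
A rough sketch for putting numbers on the 'zpool iostat 5' samples above. It averages the read and write operations per second and skips the first row, which zpool iostat reports as the average since boot rather than a 5-second sample. The function name iostat_avg_ops is made up for illustration, the K/M suffixes are assumed to be powers of 1024, and on Solaris 10 the script probably needs nawk or /usr/xpg4/bin/awk, since it uses -v and a user-defined function.

```shell
# Average the per-interval read/write ops from saved "zpool iostat" output.
# Usage: iostat_avg_ops POOLNAME < saved-output
iostat_avg_ops() {
    awk -v pool="$1" '
        # Convert "1.11K" style figures to plain numbers (assumes 1024-based).
        function num(s,   v) {
            v = s + 0
            if (s ~ /K$/) v *= 1024
            else if (s ~ /M$/) v *= 1024 * 1024
            return v
        }
        $1 == pool {
            if (++seen == 1) next      # first row is the since-boot average
            r += num($4); w += num($5); n++
        }
        END {
            if (n) printf "read %.0f write %.0f ops/s over %d samples\n", r / n, w / n, n
        }
    '
}

# Hypothetical usage on the server (not run here):
#   zpool iostat pptank 5 12 | iostat_avg_ops pptank
```

A steady load of this size while local writes stall at least shows the pool is still servicing I/O; it says nothing on its own about who is generating it.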

John D Groenveld

May 29, 2012, 8:25:41 AM
In article <jq2cgv$t8r$1...@mklab.ph.rhul.ac.uk>,
<Use-Author-Supplied-Address-Header@[127.1]> wrote:
>On server5, I can list and read files on the affected FSs w/o problem
>but any attempt to write to the FS (eg. copy a file to or rm a file
>on the FS) just hangs the cp/rm process.

Local access to server5's pptank zpool hangs, so your NFS issues
are merely a symptom of some other problem.

>root@server5:/# zpool status
> pool: pptank
> state: ONLINE
>status: The pool is formatted using an older on-disk format. The pool can
> still be used, but some features are unavailable.
>action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
> pool will no longer be accessible on older software versions.
> scan: none requested
>config:
>
> NAME STATE READ WRITE CKSUM
> pptank ONLINE 0 0 0
> raidz1-0 ONLINE 0 0 0
> c3t0d0 ONLINE 0 0 0
> c3t1d0 ONLINE 0 0 0
> c3t2d0 ONLINE 0 0 0
> c3t3d0 ONLINE 0 0 0
> c3t4d0 ONLINE 0 0 0
> c3t5d0 ONLINE 0 0 0
> c3t6d0 ONLINE 0 0 0
>
>errors: No known data errors

What does "zpool upgrade" report for pptank's zpool version?

Can you scrub pptank?

>root@server5:/# zpool list
>NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
>pptank 12.6T 384G 12.3T 2% ONLINE -
>
>root@server5:/# zpool history
>History for 'pptank':
><just hangs here>

That suggests your pptank zpool is broken.
Does fmadm(1M) report any of your disks as faulty?

Do you have lots of ZFS snapshots in pptank?

>The only thing that appears specific to Sunday morning is a cronjob to
>remove old .nfs* files,
>
>root@server5:/# crontab -l | grep nfsfind
>15 3 * * 0 /usr/lib/fs/nfs/nfsfind

Did you check the cronjobs of your non-root users?

John
groe...@acm.org
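
One possible way to check every user's crontab for jobs that can fire on a Sunday, sketched under a few assumptions: crontabs live in /var/spool/cron/crontabs (the usual Solaris 10 location), Sunday is written as 0 in the day-of-week field, and the function names sunday_jobs and scan_all_crontabs are invented for this example. Entries with a "*" day of week are reported too, since they also run on Sundays.

```shell
# Print crontab entries (read from stdin) that can fire on a Sunday.
# Field 5 is the day of week; 0 means Sunday.
sunday_jobs() {
    awk '
        /^[ \t]*#/ || NF < 6 { next }      # skip comments and short lines
        {
            n = split($5, dow, ",")
            for (i = 1; i <= n; i++) {
                if (dow[i] == "*" || dow[i] == "0" || dow[i] ~ /^0-/) {
                    print
                    next
                }
            }
        }
    '
}

# Scan each crontab file and prefix matches with the owning user's name.
# The directory is an assumption; adjust it for other systems.
scan_all_crontabs() {
    for f in /var/spool/cron/crontabs/*; do
        [ -f "$f" ] || continue
        sunday_jobs < "$f" | sed "s|^|$(basename "$f"): |"
    done
}
```

Running scan_all_crontabs as root should list the nfsfind entry quoted above along with anything else scheduled for Sunday.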

Use-Author-Suppli...@127.1

May 30, 2012, 3:25:01 PM
John D Groenveld <groe...@cse.psu.edu> wrote:
: In article <jq2cgv$t8r$1...@mklab.ph.rhul.ac.uk>,
: <Use-Author-Supplied-Address-Header@[127.1]> wrote:
: >On server5, I can list and read files on the affected FSs w/o problem
: >but any attempt to write to the FS (eg. copy a file to or rm a file
: >on the FS) just hangs the cp/rm process.

Thanks for the followup.

: Local access to server5's pptank zpool hang, your NFS issues
: are merely a symptom of some other problem.

: >root@server5:/# zpool status
: > pool: pptank
: > state: ONLINE
: >status: The pool is formatted using an older on-disk format. The pool can
: > still be used, but some features are unavailable.
: >action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
: > pool will no longer be accessible on older software versions.
: > scan: none requested
: >config:
: >
: > NAME STATE READ WRITE CKSUM
: > pptank ONLINE 0 0 0
: > raidz1-0 ONLINE 0 0 0
: > c3t0d0 ONLINE 0 0 0
: > c3t1d0 ONLINE 0 0 0
: > c3t2d0 ONLINE 0 0 0
: > c3t3d0 ONLINE 0 0 0
: > c3t4d0 ONLINE 0 0 0
: > c3t5d0 ONLINE 0 0 0
: > c3t6d0 ONLINE 0 0 0
: >
: >errors: No known data errors

: What does "zpool upgrade" report for pptank's zpool version?

root@server5:/# zpool upgrade pptank
This system is currently running ZFS pool version 29.

Pool 'pptank' is already formatted using the current version.


: Can you scrub pptank?

Sure. Just done it. (zpool scrub pptank). The result is,

root@server5:/# zpool status
  pool: pptank
 state: ONLINE
  scan: scrub repaired 0 in 2h51m with 0 errors on Wed May 30 20:03:35 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pptank      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
            c3t6d0  ONLINE       0     0     0

errors: No known data errors


: >root@server5:/# zpool list
: >NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
: >pptank 12.6T 384G 12.3T 2% ONLINE -
: >
: >root@server5:/# zpool history
: >History for 'pptank':
: ><just hangs here>

: That suggests your pptank zpool is broken.
: Does fmadm(1M) report any of your disks as faulty?

fmadm faulty -a -v

finds no faults, (produces no O/P).

: Do you have lots of ZFS snapshots in pptank?

It has 258 snapshots and the FS is 12.6TB.

: >The only thing that appears specific to Sunday morning is a cronjob to
: >remove old .nfs* files,
: >
: >root@server5:/# crontab -l | grep nfsfind
: >15 3 * * 0 /usr/lib/fs/nfs/nfsfind

: Did you check the cronjobs of your non-root users?

Thanks for that tip. The only other cronjob running early on Sunday
morning is a benign-looking one, belonging to the lp user, which
renames/copies 'request' files in /var/lp/logs. Currently the directory is
empty and /var has plenty of free space and is not in pptank.


Many thanks
Tom

: John
: groe...@acm.org
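
For the snapshot question, a small sketch that breaks the total down per filesystem, which may show whether the 258 snapshots are spread evenly or piled up on one dataset. It only parses a list of snapshot names; the function name snapshots_per_fs is made up for illustration.

```shell
# Tally snapshots per dataset from a list of snapshot names on stdin.
# Each name has the form dataset@snapname.
snapshots_per_fs() {
    awk -F@ 'NF == 2 { n[$1]++ } END { for (fs in n) print n[fs], fs }' |
        sort -k1,1nr -k2,2
}

# Hypothetical usage on the server (not run here):
#   zfs list -H -t snapshot -o name -r pptank | snapshots_per_fs
```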

John D Groenveld

May 30, 2012, 4:24:21 PM
In article <jq5s6d$j3m$1...@mklab.ph.rhul.ac.uk>,
<Use-Author-Supplied-Address-Header@[127.1]> wrote:
>Sure. Just done it. (zpool scrub pptank). The result is,
>
>root@server5:/# zpool status
> pool: pptank
> state: ONLINE
> scan: scrub repaired 0 in 2h51m with 0 errors on Wed May 30 20:03:35 2012

Good.

>: >root@server5:/# zpool history
>: >History for 'pptank':
>: ><just hangs here>
>
>: That suggests your pptank zpool is broken.
>: Does fmadm(1M) report any of your disks as faulty?
>
>fmadm faulty -a -v
>
>finds no faults, (produces no O/P).

What does truss(1) hint about where "zpool history pptank" is hanging?

John
groe...@acm.org
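
A minimal sketch of one way to follow up on the truss suggestion: save the trace with timestamps, then count which system calls dominate it. The -d (timestamp) and -o (output file) options are standard truss(1) flags; the output path and the function name syscall_counts are assumptions for this example.

```shell
# Count system calls by name in saved truss output (read from stdin).
# Works with or without the leading timestamp column added by "truss -d".
syscall_counts() {
    awk '
        {
            # Tally the first token that looks like "name(...".
            for (i = 1; i <= NF; i++) {
                p = index($i, "(")
                if (p > 1) { count[substr($i, 1, p - 1)]++; break }
            }
        }
        END { for (name in count) print count[name], name }
    ' | sort -k1,1nr -k2,2
}

# Hypothetical usage on the server (not run here):
#   truss -d -o /tmp/zpool-history.truss zpool history pptank
#   syscall_counts < /tmp/zpool-history.truss | head
```

With -d in place, the gaps between consecutive timestamps show whether the command is blocked inside one call or making slow progress through many.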

Use-Author-Suppli...@127.1

Jun 12, 2012, 7:23:34 AM
John D Groenveld <groe...@cse.psu.edu> wrote:
: In article <jq5s6d$j3m$1...@mklab.ph.rhul.ac.uk>,
Thanks for that suggestion. Truss shows that in fact it is not a true
hang; it just slows to a crawl. I've put the truss O/P on
http://mklab.ph.rhul.ac.uk/~tom/server5_10246-truss.txt. In real time
towards the end of the run (before <ctrl>-Cing out) it might be
executing only ~ one brk() call every second. I am wondering, does
ZFS have any kind of debugging mode -- where it logs some/all its
operations, that might help here?

Regards
Tom

: John
: groe...@acm.org

Ps. The email address in the header is just a spam-trap.
--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,

John D Groenveld

Jun 12, 2012, 11:10:24 AM
In article <jr78rm$v8h$1...@mklab.ph.rhul.ac.uk>,
<Use-Author-Supplied-Address-Header@[127.1]> wrote:
>executing only ~ one brk() call every second. I am wondering, does
>ZFS have any kind of debugging mode -- where it logs some/all its
>operations, that might help here?

I would send your system and zpool configuration and status
to OpenSolaris.ORG's zfs-discuss m/l and see if any of
the ZFS developers there recognize your bug.

John
groe...@acm.org

cindy.sw...@oracle.com

Jun 12, 2012, 12:20:50 PM
Hi Tom,

I think SunOS server5 5.10 Generic_147441-15 is the Solaris 10 8/11
release. Is this correct?

We looked at your truss output briefly and it looks like it is hanging
trying to allocate memory. At least, that's what the "br ...." statements
are at the end.

I will see if I can find out what diagnostic info would be helpful in
this case.

You might get a faster response on zfs-discuss as John suggested.

Thanks,

Cindy
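
Since the trace points at memory allocation, here is a small sketch that turns the "Memory:" line from the top output earlier in the thread into a free-memory percentage. With the figures posted there (2048M physical, 32M free) it works out to about 1.6%. ZFS normally uses otherwise idle memory for its ARC cache, so a low figure alone is not proof of a fault. The function name free_mem_pct is made up, and the parsing assumes both figures carry the same unit suffix.

```shell
# Report free memory as a percentage from a top(1) "Memory:" line on stdin.
# Assumes the layout "Memory: <phys> phys mem, <free> free mem, ..."
# with the same unit suffix on both figures.
free_mem_pct() {
    awk '/^Memory:/ {
        phys = $2 + 0      # "2048M" -> 2048
        free = $5 + 0      # "32M"   -> 32
        if (phys > 0) printf "%.1f\n", 100 * free / phys
    }'
}

# Possible follow-up commands on the server (not run here):
#   echo ::memstat | mdb -k          # kernel / anon / free page breakdown
#   kstat -p zfs:0:arcstats:size     # current ZFS ARC size in bytes
```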

Use-Author-Suppli...@127.1

Jun 13, 2012, 6:38:41 AM
cindy.sw...@oracle.com wrote:
Hi Cindy,
Thanks for the followup

: I think SunOS server5 5.10 Generic_147441-15 is the Solaris 10 8/11
: release. Is this correct?

I think so,...
root@server5:/# cat /etc/release
Solaris 10 10/08 s10x_u6wos_07b X86
Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 27 October 2008


: We looked at your truss output briefly and it looks like it is hanging
: trying to allocate memory. At least, that's what the "br ...." statements
: are at the end.

: I will see if I can find out what diagnostic info would be help in
: this case.

Thanks. That would be much appreciated.

: You might get a faster response on zfs-discuss as John suggested.

I will CC to zfs-discuss.

Best regards
Tom.

: Thanks,

: Cindy

Ps. The email address in the header is just a spam-trap.
--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,