[Samba] client hangs

Doug Tucker

Oct 3, 2013, 11:40:02 AM

All,

I've exhausted myself on this issue. Our Samba server has been up and
running for ages without any issues. About 6 weeks ago, quite suddenly,
we began having intermittent client hangs network-wide, and I'm at a
loss to find the cause. The users have named it the "Windows Explorer
status bar of death". It has been extremely disruptive when it
happens. Looking at the logs at the time of the event, there doesn't
seem to be anything particularly unusual anywhere. It's as if all is
well in the world at every level. Network is quiet, file server is fine,
Samba server is fine, but a client attempt to access a resource on a
shared drive, either by saving or just by clicking on a folder on
the shared drive, can take minutes to complete. Is anyone else suddenly
experiencing this?

Clients are mostly Windows 7, though even the Mac clients as well as the
Linux clients are seeing the slowness.

Running Samba: samba-3.0.33-3.39.el5_8
CentOS 5 x86_64

I know I'm not providing much here, but I simply can't find anything
relevant to send.

--
Sincerely,

Doug Tucker

--
To unsubscribe from this list go to the following URL and read the
instructions: https://lists.samba.org/mailman/options/samba

Klaus Hartnegg

Oct 3, 2013, 12:10:02 PM

On 03.10.2013 17:20, Doug Tucker wrote:
> client attempts to access a resource on a
> shared drive either by saving, or just simply clicking on a folder on
> the shared drive can takes minutes to complete.

Is it reproducible by clicking the same folder again after rebooting the
client?

Do you have the same antivirus software on Win and Mac? I saw such
behaviour years ago after an antivirus update, when accessing a remote
directory containing a certain PowerPoint file that suddenly took
minutes to scan. The scan can start as soon as you enter that
directory, even without clicking on the specific file.

Klaus

Doug Tucker

Oct 3, 2013, 12:20:01 PM

Virus scanning was one of the early suspects, though for no real reason,
as nothing had changed. The Mac and Linux clients are affected too,
and neither have antivirus software installed.

That's a hugely frustrating point about it. It is completely and wildly
random. I can't reproduce it at all; I can only see it when it happens
if someone calls and I run down there really quick. The only common
thing is that when it's happening to one, it's happening to all. And
during the time it takes to reboot, it probably would have cleared up
anyway.

Yesterday during a bad hang a user called, so I immediately tried to
smbmount my home directory (I usually just have it mounted) and it hung
for quite a while, then returned "resource unavailable". The server
seemed completely fine though. About 2 minutes later, after the caller
said it had cleared up, I was able to mount it. Looking at the server,
everything seemed fine. I could ping the server. I could telnet to 139
and 445, so they were listening. Load was less than 1. The file server
seemed fine. Communication between the two was fine. It seems like an
internal issue with Samba somehow, but Samba itself hasn't been updated
since this started happening (it was already at the latest version for
the distro).
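
Since the hangs are only seen when someone calls, one way to record when they start and stop might be to time a directory listing on a mounted share at regular intervals from a test client. The following is a minimal sketch under that assumption; the function names and the 2-second threshold are made up for illustration, and the share path is whatever mount point you choose.

```python
import os
import time

def timed_listdir(path):
    """List a directory and return (entry_count, elapsed_seconds)."""
    start = time.monotonic()
    entries = os.listdir(path)
    return len(entries), time.monotonic() - start

def sample_line(path, slow_after=2.0):
    """Return one timestamped log line for a single listing attempt.

    slow_after is an arbitrary threshold in seconds, not a Samba setting.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        count, elapsed = timed_listdir(path)
    except OSError as exc:
        return f"{stamp} {path} ERROR {exc.strerror}"
    flag = "SLOW" if elapsed >= slow_after else "ok"
    return f"{stamp} {path} {flag} {count} entries in {elapsed:.3f}s"
```

Calling `sample_line()` every few seconds and appending the result to a file would give a timeline to compare against the server logs. Note that a listing that blocks will only be logged once it returns, so a hang shows up as a large elapsed time after the fact.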

Sincerely,

Doug Tucker

Doug Tucker

Oct 3, 2013, 1:10:01 PM

I see a lot of this in the logs, but can't determine if it really means
anything:

Oct 2 09:45:28 agentsmith2 smbd[21954]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:28 agentsmith2 smbd[25948]: write_data: write failure in
writing to client 129.119.104.44. Error Connection reset by peer
Oct 2 09:45:28 agentsmith2 smbd[25971]: write_data: write failure in
writing to client 129.119.105.246. Error Connection reset by peer
Oct 2 09:45:28 agentsmith2 smbd[25883]: write_data: write failure in
writing to client 129.119.103.96. Error Connection reset by peer
Oct 2 09:45:28 agentsmith2 smbd[25987]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:28 agentsmith2 smbd[25988]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:28 agentsmith2 smbd[25986]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:29 agentsmith2 smbd[25985]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:29 agentsmith2 smbd[25989]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:29 agentsmith2 smbd[25704]: write_data: write failure in
writing to client 129.119.105.119. Error Broken pipe
Oct 2 09:45:29 agentsmith2 smbd[21702]: write_data: write failure in
writing to client 129.119.105.139. Error Connection reset by peer
Oct 2 09:45:29 agentsmith2 smbd[21954]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:write_data(568)
Oct 2 09:45:29 agentsmith2 smbd[25948]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:send_smb(767)
Oct 2 09:45:29 agentsmith2 smbd[25971]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:send_smb(767)
Oct 2 09:45:29 agentsmith2 smbd[25883]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:send_smb(767)
Oct 2 09:45:29 agentsmith2 smbd[25987]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:get_peer_addr(1232)
Oct 2 09:45:29 agentsmith2 smbd[25988]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:get_peer_addr(1232)
Oct 2 09:45:29 agentsmith2 smbd[25986]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:get_peer_addr(1232)
Oct 2 09:45:29 agentsmith2 smbd[25985]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:get_peer_addr(1232)
Oct 2 09:45:29 agentsmith2 smbd[25989]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:get_peer_addr(1232)
Oct 2 09:45:29 agentsmith2 smbd[25704]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:send_smb(767)
Oct 2 09:45:29 agentsmith2 smbd[21702]: [2013/10/02 09:45:29, 0]
lib/util_sock.c:send_smb(767)
Oct 2 09:45:29 agentsmith2 smbd[21954]: write_data: write failure in
writing to client 129.119.103.85. Error Connection reset by peer
Oct 2 09:45:29 agentsmith2 smbd[25948]: Error writing 60 bytes to
client. -1. (Connection reset by peer)
Oct 2 09:45:29 agentsmith2 smbd[25971]: Error writing 60 bytes to
client. -1. (Connection reset by peer)
Oct 2 09:45:29 agentsmith2 smbd[25883]: Error writing 60 bytes to
client. -1. (Connection reset by peer)
Oct 2 09:45:29 agentsmith2 smbd[25987]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:29 agentsmith2 smbd[25988]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:29 agentsmith2 smbd[25986]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:30 agentsmith2 smbd[25985]: getpeername failed. Error was
Transport endpoint is not connected
Oct 2 09:45:30 agentsmith2 smbd[25989]: getpeername failed. Error was
Transport endpoint is not connected
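
To judge whether messages like these cluster in time or around particular clients, it may help to tally them rather than read them by eye. The sketch below assumes the syslog layout shown in the excerpt above (including the mail-wrapped continuation lines); the regular expressions and function names are illustrative, not part of Samba.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the excerpt:
# "Oct 2 09:45:28 agentsmith2 smbd[21954]: ..."
PREFIX = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ smbd\[(\d+)\]: (.*)$")
CLIENT = re.compile(r"writing to client (\d+\.\d+\.\d+\.\d+)\. Error (.+)$")

def unwrap(lines):
    """Rejoin wrapped entries: a line without a syslog prefix continues the previous one."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if PREFIX.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries

def summarize(lines):
    """Count write failures per timestamp (one-second resolution) and per client IP."""
    per_second, per_client = Counter(), Counter()
    for entry in unwrap(lines):
        m = PREFIX.match(entry)
        if not m:
            continue
        c = CLIENT.search(m.group(3))
        if c:
            per_second[m.group(1)] += 1
            per_client[c.group(1)] += 1
    return per_second, per_client
```

Feeding it the full log would show whether many different clients are reset within the same second, as in the excerpt above, or whether the failures are spread out.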

Sincerely,

Doug Tucker

Jeremy Allison

Oct 3, 2013, 1:20:02 PM

All this is saying is that the client disconnected - smbd doesn't
know why. I'd start suspecting a network failure somewhere. Check
switches, cables and other hardware.

Jeremy.

Doug Tucker

Oct 3, 2013, 3:10:01 PM

Already been down that path. I can't find a network issue anywhere.
Our Samba server itself is set up with a bonded interface which attaches
to 2 different cards in the switch. I've pulled each Ethernet cable to
see the results and there is no ping loss or interruption of any sort,
and when shutting down the port at the logical level I see the same result.
I monitor all of the switches and routers in our network with OpenNMS
and cannot find any network interruption there or in any of the logs.
Additionally, Samba seems to be the only server affected. We
have 50 or so Linux servers on the network that aren't experiencing any
file server interruption. As for the Samba server itself, I moved some
clients over to our backup and the problem follows them.

Sincerely,

Doug Tucker

Doug Tucker

Oct 3, 2013, 3:20:02 PM

Additionally, this has happened from time to time (again, no idea what
it means exactly), but it doesn't necessarily correlate with when users
are seeing the hang. Any idea if this is fatal?

Oct 3 08:31:57 agentsmith2 kernel: INFO: task smbd:26597 blocked for
more than 120 seconds.
Oct 3 08:31:57 agentsmith2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 3 08:31:57 agentsmith2 kernel: smbd D ffffffff80157f0a
0 26597 6359 26677 26482 (NOTLB)
Oct 3 08:31:57 agentsmith2 kernel: ffff81172b963af8 0000000000000082
ffff81183f0db400 ffffffff884cfe7a
Oct 3 08:31:57 agentsmith2 kernel: ffff8115f9621888 0000000000000009
ffff81183fbd70c0 ffff810c3ff110c0
Oct 3 08:31:57 agentsmith2 kernel: 0001fafb7157cdc0 0000000000000909
ffff81183fbd72a8 000000113fb24bf8
Oct 3 08:31:57 agentsmith2 kernel: Call Trace:
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff884cfe7a>]
:sunrpc:xprt_end_transmit+0x2c/0x39
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8006ed98>]
do_gettimeofday+0x40/0x90
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff80029172>] sync_page+0x0/0x43
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800637de>]
io_schedule+0x3f/0x67
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800291b0>]
sync_page+0x3e/0x43
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff80063a0a>]
__wait_on_bit+0x40/0x6e
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800355f7>]
wait_on_page_bit+0x6c/0x72
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800a3cfd>]
wake_bit_function+0x0/0x23
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800482e8>]
pagevec_lookup_tag+0x1a/0x21
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8004a2d0>]
wait_on_page_writeback_range+0x62/0x133
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800ca3ee>]
filemap_write_and_wait+0x26/0x31
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8852cc9c>]
:nfs:nfs_setattr+0x8e/0xfc
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000d01d>]
do_lookup+0x8f/0x24b
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000d57f>] dput+0x2c/0x114
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000a7b9>]
__link_path_walk+0xf10/0xf39
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8002d0f6>]
mntput_no_expire+0x19/0x89
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000e4a2>]
current_fs_time+0x3b/0x40
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000ec03>]
link_path_walk+0xac/0xb8
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8002cf2d>]
notify_change+0x145/0x2f5
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800e401f>]
do_utimes+0x106/0x129
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000d73b>]
inotify_inode_queue_event+0xad/0xe8
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff80016bbf>]
vfs_write+0x13f/0x174
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff800e407e>]
sys_futimesat+0x3c/0x4b
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8005d116>]
system_call+0x7e/0x83
Oct 3 08:31:57 agentsmith2 kernel:
Oct 3 08:31:57 agentsmith2 kernel: INFO: task smbd:29945 blocked for
more than 120 seconds.
Oct 3 08:31:57 agentsmith2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 3 08:31:57 agentsmith2 kernel: smbd D ffffffff80157f0a
0 29945 6359 29946 29942 (NOTLB)
Oct 3 08:31:57 agentsmith2 kernel: ffff8115f260fd98 0000000000000082
ffff8115f260fd48 ffffffff8000d01d
Oct 3 08:31:57 agentsmith2 kernel: ffff8115f260fd58 000000000000000a
ffff8102ae715040 ffff810c3fea7040
Oct 3 08:31:57 agentsmith2 kernel: 0001fb0e52713c02 000000000002b635
ffff8102ae715228 0000000fca752ca8
Oct 3 08:31:57 agentsmith2 kernel: Call Trace:
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000d01d>]
do_lookup+0x8f/0x24b
Oct 3 08:31:57 agentsmith2 kernel: [<ffffffff8000a7b9>]
__link_path_walk+0xf10/0xf39
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff80063c63>]
__mutex_lock_slowpath+0x60/0x9b
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff80063cad>]
.text.lock.mutex+0xf/0x14
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff8852c9bb>]
:nfs:nfs_getattr+0x45/0xd9
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff80028f4a>]
vfs_stat_fd+0x32/0x4a
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff800671cf>]
do_page_fault+0x4cc/0x842
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff80023cc3>]
sys_newstat+0x19/0x31
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff8005ddf9>]
error_exit+0x0/0x84
Oct 3 08:31:58 agentsmith2 kernel: [<ffffffff8005d116>]
system_call+0x7e/0x83
Oct 3 08:31:58 agentsmith2 kernel:
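
Both traces above contain frames from kernel modules (`:nfs:nfs_setattr`, `:nfs:nfs_getattr`, `:sunrpc:xprt_end_transmit`), which suggests the blocked smbd processes were waiting on NFS at that moment. To see how often that happens and at what times, a small script could pull the hung-task reports out of the log. This is only a sketch that assumes the message layout shown above; the function name and patterns are illustrative.

```python
import re

# First line of a hung-task report; the "more than N seconds" part may be
# wrapped onto the next line, so it is not included in the pattern.
HUNG = re.compile(r"INFO: task (\S+):(\d+) blocked for")
# Frames belonging to a loadable module are printed as ":module:function+0x..."
MODFRAME = re.compile(r":(\w+):(\w+)\+0x")

def hung_tasks(lines):
    """Return one dict per hung-task report: task name, pid, and module frames seen."""
    reports = []
    for line in lines:
        h = HUNG.search(line)
        if h:
            reports.append({"task": h.group(1), "pid": int(h.group(2)), "frames": []})
            continue
        if reports:
            for module, func in MODFRAME.findall(line):
                reports[-1]["frames"].append(f"{module}:{func}")
    return reports
```

Comparing the timestamps of reports whose frames include `nfs:` entries with the times users report hangs would show whether the two are related, which the excerpt alone cannot establish.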

Sincerely,

Doug Tucker

Jeremy Allison

Oct 3, 2013, 3:50:02 PM

On Thu, Oct 03, 2013 at 02:07:05PM -0500, Doug Tucker wrote:
> Already been down that path. I can't find a network issue anywhere.
> Our samba server itself is set up with a bonded interface which
> attaches to 2 different cards in the switch. I've pulled each
> ethernet cable to see the results and there is no ping loss or
> interruption of any sort and shutting down the port at the logical
> level I see the same result. I monitor all of the switches and
> routers in our network with opennms and cannot find there, or in any
> of the logs any network interruption anywhere. Additionally samba
> seems to be the only server affected. We have 50 or so linux
> servers on the network that aren't experiencing any file server
> interruption. As for the samba server itself, I moved some clients
> over to our backup and the problem follows them.

Then you need to look at the clients. All smbd knows is
that the client disconnected. It doesn't know why.

Doug Tucker

Oct 3, 2013, 4:00:02 PM

I wasn't suggesting that those were issues; I was asking if they were. It
sounds like they probably have nothing to do with the issue going on and
are just normal disconnects. I thought a Windows update may have gone in,
as this literally just started occurring suddenly about 6 weeks ago.
But alas, Mac and Linux clients see the same issue.

Sincerely,

Doug Tucker

Colb, Andrew

Oct 14, 2013, 12:00:03 AM

Doug,

Has anything changed on your DCs?

When we had a similar-sounding issue, it took us about a month to connect two facts: a) a Windows domain controller had had its IP address changed, and b) the old IP address was still lurking in the DNS managed by the DC. Once the obsolete addressing was repaired, Samba started working correctly again. We were not able, however, to create a scenario that would lead to the failure, so we solved the problem only by inference. The one (simple) test that we did use was to put the DC name-to-address mappings into the Samba server's /etc/hosts, after which the issue disappeared.
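
A quick way to look for this kind of stale record, run on the Samba server, might be to resolve each DC name, time the lookup, and flag any address that is not on a known-good list. This is a rough sketch only; the function name and the idea of a hand-maintained expected list are assumptions for illustration, and the `resolver` parameter exists just so the check can be exercised without a live DNS server.

```python
import socket
import time

def check_name(name, expected, resolver=socket.gethostbyname_ex):
    """Resolve name, time the lookup, and report addresses not in the expected list."""
    start = time.monotonic()
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        _, _, addresses = resolver(name)
    except OSError:
        addresses = []  # lookup failed; an empty list signals that to the caller
    elapsed = time.monotonic() - start
    unexpected = sorted(set(addresses) - set(expected))
    return {
        "name": name,
        "addresses": list(addresses),
        "unexpected": unexpected,
        "seconds": elapsed,
    }
```

A slow lookup or a non-empty `unexpected` list for a DC name would be a hint to inspect the DNS records before resorting to /etc/hosts entries.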

Andy Colb