Network link broken but user still connected

Cedric Fontaine

Oct 3, 2009, 9:12:12 AM
to OpenQM
Hello,

On our ASP server, we're offering access using ssh to our customers.

Each linux login has one or more fixed users on QM and they get
connected directly to qm -AACCOUNT -12 for example using a bash login
procedure.

We have keepalive enabled both on the ssh server itself and on the client
side in the emulator. It works great in 95% of situations, but
sometimes some lines get stuck when the network link breaks.

This morning 3 lines were broken in QM and the only solution was to
restart the whole of QM. Here is some information:

qm -U
12 4146 0 /dev/pts/0 tca
13 29595 0 /dev/pts/1 tca
14 17279 0 /dev/pts/2 tca

A ps -auxw on pid 4146 gave me:
tca 4146 0.0 0.2 3408 1904 ? S Oct02 0:02 /usr/qmsys/bin/qm -12 -ATCA

A netstat shows me that there is no ESTABLISHED or PENDING ssh
connection on the OS part.

A PSTAT gave me no information except that the processes are not
responding:

:pstat
User Detail
12 (Not responding)

13 (Not responding)

14 (Not responding)

And LOGOUT hangs
:LOGOUT 12
Force logout initiated for user 12

At this point we had to stop and start QM.

Thanks,

Martin Phillips

Oct 5, 2009, 5:27:18 AM
to ope...@googlegroups.com
Hi Cedric,

We have seen something similar where QM is not notified of loss of the
network connection and the process hangs inside a Linux library call where
we cannot see the logout request.

Although we need a better solution for this, you should be able to kill the
QM processes from Linux rather than a complete restart. Our cleanup
mechanism will then recover the licences within five minutes. You can speed
this up by doing
qm -cleanup

I have forwarded your email to one of our dealers who has identified a
problem in Linux ssh that might explain this.


Martin Phillips
Ladybridge Systems Ltd
17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB
+44-(0)1604-709200

eppick77

Oct 5, 2009, 10:08:25 AM
to OpenQM
Cedric,

We also get the same problem on occasion. We are running CentOS
5.3. What are you running?

Eugene

Cedric Fontaine

Oct 5, 2009, 11:31:51 AM
to ope...@googlegroups.com
eppick77 wrote:
> Cedric,
>
> We also get the same problem on occasion. We are running CentOS
> 5.3. What are you running?

We are running Gentoo base system version 1.6.14.

--
Cedric Fontaine
http://www.terroirsquebec.com

Cedric Fontaine

Oct 5, 2009, 12:38:58 PM
to ope...@googlegroups.com
Martin Phillips wrote:
> Hi Cedric,
>
> We have seen something similar where QM is not notified of loss of the
> network connection and the process hangs inside a Linux library call where
> we cannot see the logout request.

So I should just kill -9 the qm process on linux and then qm -cleanup ?

Martin Phillips

Oct 7, 2009, 3:31:29 AM
to OpenQM
On 5 Oct, 17:38, Cedric Fontaine <cfonta...@spidmail.net> wrote:
>
> So I should just kill -9 the qm process on linux and then qm -cleanup ?

Although use of kill -9 is not a good idea in most situations, it
should be safe to do when QM is not responding to other termination
requests.
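
A minimal sketch of that sequence in Python: ask the stuck processes to terminate normally, fall back to kill -9 only for those that ignore the request, then run qm -cleanup so the licences are recovered straight away. The function names, the 5 second grace period and the choice of SIGTERM as the "other termination request" are assumptions for illustration, not part of QM; the qm path is the one shown in the ps output earlier in the thread.

```python
import os
import signal
import subprocess
import time

QM_BIN = "/usr/qmsys/bin/qm"  # path as shown in the ps output in this thread


def is_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def kill_stuck(pids, grace=5.0, send=os.kill, alive=is_alive, sleep=time.sleep):
    """SIGTERM every pid, wait up to `grace` seconds, SIGKILL the survivors.

    Returns the list of pids that needed SIGKILL.
    """
    def signal_if_alive(pid, sig):
        if not alive(pid):
            return False
        try:
            send(pid, sig)
        except ProcessLookupError:
            return False  # exited between the check and the signal
        return True

    for pid in pids:
        signal_if_alive(pid, signal.SIGTERM)

    deadline = time.monotonic() + grace
    while time.monotonic() < deadline and any(alive(p) for p in pids):
        sleep(0.2)

    return [pid for pid in pids if signal_if_alive(pid, signal.SIGKILL)]


def recover_licences():
    """Run qm -cleanup instead of waiting for QM's own periodic cleanup."""
    subprocess.run([QM_BIN, "-cleanup"], check=True)
```

With the pids from the qm -U listing above, this would be called as kill_stuck([4146, 29595, 17279]) followed by recover_licences(), run as a user allowed to signal those processes.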

We do need to understand this problem more fully and come up with a
solution, though all the evidence we have so far suggests that the hang
is deep inside the Linux networking system and hence outside of our
control.


Martin Phillips, Ladybridge Systems

Tony G

Oct 9, 2009, 12:42:23 AM
to Ope...@googlegroups.com
I can see it now - Cedric files a support request with his Linux
provider (because we all know he's paying for support on his
FOSS, right?) and he tells them his DBMS provider says there is a
bug in the networking system. Yes, and the issue will be
resolved quickly as a million highly motivated people devote
their free time to solving the problem. Somehow I don't think
Cedric is going to get a resolution to this issue anytime soon.

I'm sorry Martin, I really don't expect you to be resolving Linux
issues, but I do see a great deal of irony in all of this.

T

Ashley Chapman

Oct 9, 2009, 1:12:25 AM
to ope...@googlegroups.com
2009/10/9 Tony G <wosc...@sneakemail.com>:

I sometimes get this sort of finger pointing. It often happens on
Windows systems, and another MV database that I use. Nice to see the
same thing happening with Linux. Don't want the FOSS people missing
out! ;-)

Anyway, I've found an effective way to stop the finger pointing is to
ask the person doing the pointing for EVIDENCE that the bug is where
they say it is.

So, Martin. Do you have proof that the bug is in the Linux networking code?


Ashley Chapman

Martin Phillips

Oct 9, 2009, 4:58:51 AM
to ope...@googlegroups.com
Hi Tony,

> I'm sorry Martin, I really don't expect you to be resolving
> Linux issues, but I do see a great deal of irony in all of this.

I agree that it is not our job but, in this particular instance, one of our
dealers has identified and fixed a problem that sounds like it could be the
same issue. I have asked him to communicate directly with Cedric (or perhaps
via this list) and he has agreed to do so as soon as time permits.

Re Ashley's comment...


> So, Martin. Do you have proof that the bug is in the Linux
> networking code?

We have seen two network connection problems that appear to be in Linux. The
one that fits closest to Cedric's problem is where we hang inside a kernel
function (as shown by strace) and never return to QM. This makes it
difficult for us to catch the error.

The other one involves poll() or select() saying "yes, there is data waiting
to be read" and read() saying "no there isn't", resulting in a loop trying
to recover the non-existent data. We have worked around this one inside QM.
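
The shape of that second problem can be shown with a short Python sketch. This is not QM's actual workaround, which is not described in this thread; it only illustrates one defensive pattern, assuming a Linux system where select.poll() is available. A zero-length read is treated as a lost connection, and the number of times poll() may claim data that read() cannot deliver is capped so the loop cannot spin forever. The function name and the limits are hypothetical.

```python
import os
import select


def read_available(fd, timeout_ms=1000, max_spurious=3):
    """Read from fd once poll() reports it ready.

    Returns the bytes read, b"" on timeout, or None if the connection
    looks dead (end of file, or too many wakeups with nothing to read).
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN | select.POLLHUP | select.POLLERR)
    spurious = 0
    while True:
        if not poller.poll(timeout_ms):
            return b""  # nothing became ready within the timeout
        try:
            data = os.read(fd, 4096)
        except BlockingIOError:
            # Only possible on a non-blocking fd: poll() said "data waiting"
            # but read() found none. Count it instead of retrying forever.
            spurious += 1
            if spurious >= max_spurious:
                return None
            continue
        if data == b"":
            return None  # zero bytes after a ready poll() means the peer has gone
        return data
```

A caller would treat a None result as a dropped line and log the user out, rather than going back to poll() again.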

Cedric

Oct 9, 2009, 2:48:46 PM
to OpenQM
> I agree that it is not our job but, in this particular instance, one of our
> dealers has identified and fixed a problem that sounds like it could be the
> same issue. I have asked him to communicate directly with Cedric (or perhaps
> via this list) and he has agreed to do so as soon as time permits.

I haven't received any direct support so far. I must admit that we have
stopped our migration to QM on those servers for now, as
this point is a show stopper. We haven't had any new hangs since last
week, but I'm not sure that a kill will help because it's pretty much
what I've been doing.

Our experience with D3 is that this can also happen there, but a
logoff will just bring the line back, whereas in QM it breaks the
whole QM server. Is it possible at least to fix the LOGOUT problem in
this case?

Thanks,

Cedric

Ashley Chapman

Oct 9, 2009, 4:47:17 PM
to ope...@googlegroups.com
2009/10/9 Cedric <cfon...@spidmail.net>:

Just a thought...

If there's a suspected problem in the linux internals, then presumably
this problem does not exist for QM on the Windows or BSD platforms.
If that's the case, perhaps you can consider using QM on top of
FreeBSD. Unless you are tightly tied to Gentoo.

Ashley

Martin Phillips

Oct 10, 2009, 4:37:56 AM
to OpenQM
Hi Cedric,

We need to investigate this more fully. Please let us have full
details of how your connections are set up (direct into QM, via Linux
shell, ssh, etc) and the kernel revision in use.

A core dump of the process when it is stuck would be very helpful.
Failing this, please run strace to record the state of the process and
let us have the output.


Martin Phillips, Ladybridge Systems.

Tony G

Oct 11, 2009, 5:07:19 AM
to Ope...@googlegroups.com
It occurs to me that one of the best ways to get people to
recognize FOSS OpenQM is to present a reproducible case to the
Linux distro developers with OpenQM as the focal point. To fix
the Linux problem they might need to install OpenQM, and in doing
so they may want to know more about what it is. I hope it plays
out like this.

In other words, Martin, it may be better to be less eager to fix
this on your own, even if you can.

T

Martin Phillips

Oct 15, 2009, 6:45:44 AM
to ope...@googlegroups.com
Hi Cedric,

Any sign of the requested diagnostics?

We have tried repeatedly to reproduce this here but have so far failed. It
is tough to diagnose the cause without an example to look at.


Martin Phillips
Ladybridge Systems Ltd
17b Coldstream Lane, Hardingstone, Northampton, NN4 6DB
+44-(0)1604-709200

Cedric Fontaine

Oct 16, 2009, 3:43:15 PM
to OpenQM


On 10 oct, 04:37, Martin Phillips <MartinPhill...@ladybridge.com>
wrote:
> Hi Cedric,
>
> We need to investigate this more fully. Please let us have full
> details of how your connections are set up (direct into QM, via Linux
> shell, ssh, etc) and the kernel revision in use.

Users connect via ssh and are then redirected to QM by a
bash_profile that executes "/usr/qmsys/bin/qm -12 -AACCOUNT", for example.

Linux 2.6.28.4-xxxx-std-ipv4-32 #2 SMP Wed Feb 18 16:34:04 UTC 2009 i686 AMD Athlon(tm) X2 Dual Core Processor BE-2300 AuthenticAMD GNU/Linux

> A core dump of the process when it is stuck would be very helpful.
> Failing this, please run strace to record the state of the process and
> let us have the output.

What are the command lines for core dump or strace ?

Sorry for the late answer. We didn't get any problem since then, but
we didn't change any settings either.

Martin Phillips

Oct 19, 2009, 1:26:59 PM
to ope...@googlegroups.com
Hi Cedric,

This sounds very much like the Linux problem that one of our users tracked
down. I will ask him again to reply.

> What are the command lines for core dump or strace ?

Depending on your system, you should be able to force a core dump with
kill -4 (SIGILL) but it doesn't seem to work on all systems.

Where supported, the strace command is
strace -p 1234
where 1234 is the pid of the process that you want to trace.
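
A small Python sketch of how that capture might be scripted on Linux. It reads the process state and kernel wait channel from /proc, which can already hint at where a hung process is blocked, and builds an strace command line that writes to a file. The helper names and the output path are hypothetical; the strace options used (-f, -tt, -o, -p) are standard ones, and /proc/<pid>/wchan may be unavailable on some kernels or without sufficient permissions.

```python
import os


def process_snapshot(pid):
    """Return the scheduler state and kernel wait channel of a process."""
    info = {"pid": pid, "state": None, "wchan": None}
    with open(os.path.join("/proc", str(pid), "status")) as f:
        for line in f:
            if line.startswith("State:"):
                # e.g. "S (sleeping)" or "D (disk sleep)"
                info["state"] = line.split(":", 1)[1].strip()
                break
    try:
        with open(os.path.join("/proc", str(pid), "wchan")) as f:
            info["wchan"] = f.read().strip() or None
    except OSError:
        pass  # not readable on this system; leave as None
    return info


def strace_argv(pid, outfile):
    """Build an strace command that attaches to pid and logs to outfile."""
    return ["strace", "-f", "-tt", "-o", outfile, "-p", str(pid)]
```

For a stuck process such as pid 4146 from the listing earlier in the thread, process_snapshot(4146) gives a quick first look, and the list from strace_argv(4146, "/tmp/qm-4146.strace") can be passed to subprocess.run() and interrupted after a few seconds to obtain a trace file to send in.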
