Annoying death blocks

Alexey UImanov

Jul 10, 2011, 8:38:19 AM
to funto...@googlegroups.com
I don't know whether this is the right place to post this; I just don't know where to report the problem to get help. Someone has already posted this problem on kernel.org here, and he is experiencing it too.

I should also warn you about my English.

The problem is an unpredictable blocking of processes doing I/O. It looks like this: you try to launch a program and cannot, while applications that are already running keep running without any problems. If you launch an application from an already-running xterm, the command line hangs; you cannot abort the launch with Ctrl-C to get the shell back, and you cannot kill the process, so the terminal is useless from then on. "su -" hangs too, as does starting a new xterm from a running xterm or with Win-t (an openbox keybinding). Some programs can still be launched from the already-running xterm; for example, htop, ls and find work fine.
Then you try to switch to another console with Ctrl-Alt-F1, and when you enter "root" at the login prompt, login just hangs and nothing happens. You try to switch back to the X console and see a black screen; then you switch to the first console again and see a black screen there too.
At that point you reboot with SysRq.
It seems that the hung processes are simply stuck in some syscall and become immortal zombies because of it.
I tried running strace on the hanging applications when this happens. I ran "strace -o xterm xterm" and the last system call was wait4, waiting for pid 11139; I looked up this pid and it was utempter. Then I tried "strace -o xterm2 -f xterm" to trace the syscalls of the child processes as well, and xterm just launched! It does not launch when I simply type "xterm" at the shell prompt; it hangs instead.
Then I tried "strace -o su -f su -" and it did nothing, because su somehow does not allow itself to be launched this way.
The cause seems to be heavy I/O load, possibly together with heavy CPU and/or memory load. It is too unpredictable: I tried to reproduce it with dbench and got nothing.
The person who posted this bug on kernel.org definitely has a problem with his /home partition: processes hang in syscalls that read from or write to the /home filesystem. I do not have anything that definite. As I said, I can run find and ls on any partition of my system, but "su", "xterm" and some other programs hang.
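
For anyone who wants to check for the same symptom: processes that cannot be killed are often in uninterruptible sleep (state "D" in /proc/<pid>/stat). The following is only a minimal sketch, assuming a Linux /proc filesystem; the function names are made up for illustration, and /proc/<pid>/wchan may be empty or unreadable depending on the kernel configuration and permissions. It lists the processes currently in state D together with the kernel symbol they are waiting in, if available.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Read a small /proc file; return '' if it vanished or is unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def blocked_processes(proc_root="/proc"):
    """List (pid, comm, wchan) for every process in uninterruptible sleep."""
    if not os.path.isdir(proc_root):
        return []
    result = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        stat_text = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat_text:
            continue  # process exited while we were scanning
        pid, comm, state = parse_stat(stat_text)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            result.append((pid, comm, wchan))
    return result


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(pid, comm, wchan or "?")
```

Run from a shell that was opened before the hang, this could show which processes are stuck and roughly where; that output is the kind of detail worth attaching to a bug report.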

I posted this here in the hope that someone can tell me how to diagnose this problem and find its cause.
PS: Russian speakers can read my post on linux.org.ru here

Alexey UImanov

Jul 10, 2011, 8:41:33 AM
to funto...@googlegroups.com
I forgot to say: I have already run fsck on all partitions, checked the hard drive for bad blocks with badblocks, and checked the memory with memtest. I do not believe this is a hardware problem.

Adrien Dessemond

Jul 10, 2011, 9:53:19 PM
to funto...@googlegroups.com
Does this happen even with the latest kernel (3.0-rc7)? Do you have
any oops or BUG() messages in the system logs?


Adrien Dessemond

Jul 10, 2011, 9:57:14 PM
to funto...@googlegroups.com
Sorry, I missed a typo in my reply: the latest kernel is 3.0-rc6 at the time of writing, not 3.0-rc7.

Daniel Robbins

Jul 10, 2011, 10:58:37 PM
to funto...@googlegroups.com
On Sun, Jul 10, 2011 at 7:57 PM, Adrien Dessemond <adess...@funtoo.org> wrote:
> Sorry, missed a typo in my reply: 3.0-rc6 at date of writing, not 3.0-rc7

What kernel are you using? Try using a known-stable kernel and see if the problem goes away (ubuntu-server would be a good one to try.)

-Daniel 

Alexey UImanov

Jul 11, 2011, 12:02:55 AM
to funto...@googlegroups.com
This problem appears on 2.6.39, 2.6.38 and 2.6.36 for me; I have not checked other versions. I just know that the 2.6.39 kernel suffers from this problem much more.
> BUG() messages
There is not a single BUG or oops message in syslog or dmesg; syslog, by the way, hangs too.
I have a hand-configured kernel with almost all drivers compiled in (monolithic). You can download my kernel config from kernel.org.

Alexey UImanov

Jul 11, 2011, 12:07:36 AM
to funto...@googlegroups.com
The problem is that this bug is hard to reproduce. I do not know what exactly I must do to reproduce it on ubuntu-server. I can add that when the system is in the hung state, htop shows a very high load average but no CPU load or high memory consumption; yesterday the load average was almost 10, and when the system "unhung" the load average dropped.
There is also a new post on kernel.org with straces.
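
A note on the load average: on Linux it counts tasks in uninterruptible sleep as well as runnable ones, so a load near 10 with an idle CPU is consistent with roughly ten tasks blocked on I/O. The sketch below is only an illustration of that reasoning, assuming the usual /proc/loadavg layout ("load1 load5 load15 running/total lastpid"); the helper names and the subtraction heuristic are my own and not an exact measurement.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    # Fourth field is "currently runnable tasks / total tasks".
    running, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "running": running,
        "total": total,
    }


def blocked_estimate(info):
    """Rough guess at how much of the 1-minute load is NOT runnable tasks.

    This is a heuristic: load1 is a decaying average while 'running' is an
    instantaneous count (and includes the process reading the file).
    """
    return max(0.0, info["load1"] - info["running"])


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        info = parse_loadavg(f.read())
    print(info, "blocked estimate:", round(blocked_estimate(info), 2))
```

A large estimate while the CPU is idle points at tasks waiting in the kernel rather than at CPU contention.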

Daniel Robbins

Jul 11, 2011, 12:22:07 AM
to funto...@googlegroups.com
On Sun, Jul 10, 2011 at 10:07 PM, Alexey UImanov <s9gf...@gmail.com> wrote:
> The problem is that this bug is hard to reproduce. I do not know what exactly I must do to reproduce it on ubuntu-server. I can add that when the system is in the hung state, htop shows a very high load average but no CPU load or high memory consumption; yesterday the load average was almost 10, and when the system "unhung" the load average dropped.
> There is also a new post on kernel.org with straces.

If the problem appears in 2.6.39, 2.6.38 and 2.6.36 for you, then the best recommendation I have is to not use those kernels. I do not trust stock kernels or near-stock kernels and wouldn't recommend that people use them, due to issues like this. If you are a kernel developer and want to debug the problem, then maybe it is worth trying to struggle with this issue, but that also implies you have the skills to troubleshoot it and investigate further. If you are a regular user who just wants their Linux system to work, just don't use those kernels.

That is why I added a section on recommended kernels here:


Also, if you are using an Intel Xeon 5500 or 5600 series processor, make sure you have updated to the latest BIOS so that you have current microcode, which fixes CPU bugs that can cause issues like this. The 5500 and 5600 have had quite a few bad ones, and they cause strange, unexplainable issues, lock-ups, etc.

Regards,

Daniel

Alexey UImanov

Jul 11, 2011, 1:35:51 AM
to funto...@googlegroups.com
I do not have Intel Xeon.

I am a regular user with some programming skills (primarily Python). I will try a more mature kernel, such as 2.6.32 from Ubuntu Server. If the problem does not appear on the Ubuntu kernel, what can I do to get it fixed in newer kernels? This is a really serious bug.

Daniel Robbins

Jul 11, 2011, 11:06:20 AM
to funto...@googlegroups.com
On Sun, Jul 10, 2011 at 11:35 PM, Alexey UImanov <s9gf...@gmail.com> wrote:
> I do not have Intel Xeon.
>
> I am a regular user with some programming skills (primarily Python). I will try a more mature kernel, such as 2.6.32 from Ubuntu Server. If the problem does not appear on the Ubuntu kernel, what can I do to get it fixed in newer kernels? This is a really serious bug.

Well, you have a set of general symptoms of failure, but so far you do not have the specific root cause of failure. For a kernel developer to fix it, they must first reproduce it, and then they must identify the root cause of failure. If they can't reproduce it, then they can't fix it. If it is specific to your hardware, then they will need your hardware to reproduce it. If your hardware is common, eventually a kernel developer will experience it, troubleshoot it, and fix it.

Options for fixing it sooner:

1) report it to kernel mailing list, ask what the root cause may be, and maybe kernel developers will help you troubleshoot further and identify the root cause, and maybe send you a few patches to test.
2) look for reports of identical issue and reply to the list letting people know it is a serious bug so it gets attention, and assist with testing.
3) find a helpful kernel developer and give him/her access to your hardware so they can troubleshoot it directly (most kernel devs are too busy to do this, but you never know... good option if you know a kernel developer locally :) )
4) pay a kernel developer for their time to resolve this issue for you (usually only an option for commercial distributions or Linux-based companies.) They will then take responsibility for troubleshooting the problem, identifying the root cause, and fixing it. Note that this can be time consuming.

But the biggest help is probably to find a reliable way to reproduce this problem on other hardware, and post instructions on how to do it. Once others can repro the bug, they can troubleshoot it themselves.
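
As one possible aid to finding a reproducer, here is a rough sketch of a "canary" that repeatedly launches a small command and logs when it is slow to finish, so that hangs can be correlated with whatever workload was running at the time. Everything here is an assumption for illustration: the function names, the thresholds, and the command being probed (pick one that actually hangs on the affected system). The start line is printed and flushed before each launch because the launch itself may block if exec stalls in the kernel; in that case the last "start" line without a matching result marks when the hang began.

```python
import subprocess
import sys
import time


def probe(cmd, timeout):
    """Launch cmd and wait up to `timeout` seconds; return (elapsed, timed_out)."""
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout)
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
        # Best effort only: a task stuck in the kernel will not react to
        # SIGKILL, so we deliberately do not wait for it again.
        proc.kill()
    return time.monotonic() - start, timed_out


def watch(cmd, interval=5.0, threshold=2.0, rounds=None, out=sys.stdout):
    """Probe cmd every `interval` seconds; return how many probes were slow."""
    slow = 0
    done = 0
    while rounds is None or done < rounds:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # Log before launching, in case the launch itself never returns.
        print(f"{stamp} start {cmd!r}", file=out, flush=True)
        elapsed, timed_out = probe(cmd, timeout=threshold)
        if timed_out:
            slow += 1
        status = "SLOW" if timed_out else "ok"
        print(f"{stamp} {status} after {elapsed:.2f}s", file=out, flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return slow
```

For example, `watch(["xterm", "-e", "true"])` could be left running in a terminal opened before starting a heavy I/O workload; the timestamps of any SLOW lines then show which workload phase triggers the hang.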

Make sense?

Regards,

Daniel


