WM8850 hangs in nanosleep

321 views
Skip to first unread message

Ivan Krylov

unread,
Sep 5, 2015, 6:00:39 PM9/5/15
to vt8500-wm8505...@googlegroups.com
I've just upgraded my WM8850 from 3.13.something (from
github.com/linux-wmt/linux-vtwm, probably testing branch) to 4.2 from
kernel.org and almost immediately noticed that programs gradually start
to hang, then the whole system becomes unresponsive (even VTs cannot be
switched; at least SysRq+EISUB works). strace shows that the frozen
programs are inside nanosleep({0,n},...) call, which, AFAIU, should
return after less than one second (tv_sec = 0, tv_nsec < 999999999 ns ~
1 sec). If I send SIGTERM, nanosleep() is restarted after gettimeofday().

Attaching my .config (which is vt8500_v6_v7_defconfig with some
additional features enabled) and dmesg (there is nothing interesting
even when the bug shows up).

I had also encountered a BUG when the whole system froze for a while:
> Sep 5 21:52:59 mahavishnu kernel: [ 1989.490000] hrtimer: interrupt took 18334 ns
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [minissdpd:2077]
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] CPU: 0 PID: 2077 Comm: minissdpd Not tainted 4.2.0 #3
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] Hardware name: VIA/Wondermedia SoC (Device Tree Support)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] task: ee686380 ti: ee670000 task.ti: ee670000
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] PC is at __getnstimeofday64+0x104/0x168
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] LR is at __getnstimeofday64+0x44/0x168
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] pc : [<c004e9f4>] lr : [<c004e934>] psr: 800f0013
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] sp : ee671f28 ip : edacca9a fp : ee671f78
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] r10: 00000268 r9 : a25c451f r8 : ff2e4aa8
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] r7 : ffffffff r6 : c4653600 r5 : 00000000 r4 : 3b9ac9ff
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] r3 : 5173e317 r2 : 2a9b0d40 r1 : fffffff7 r0 : 00000009
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] Control: 10c53c7d Table: 2d5c8059 DAC: 00000015
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] CPU: 0 PID: 2077 Comm: minissdpd Not tainted 4.2.0 #3
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] Hardware name: VIA/Wondermedia SoC (Device Tree Support)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0012e18>] (unwind_backtrace) from [<c0011534>] (show_stack+0x10/0x14)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0011534>] (show_stack) from [<c005c088>] (watchdog_timer_fn+0x138/0x17c)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c005c088>] (watchdog_timer_fn) from [<c0049b80>] (__hrtimer_run_queues.constprop.23+0xd0/0x138)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0049b80>] (__hrtimer_run_queues.constprop.23) from [<c004a32c>] (hrtimer_interrupt+0x9c/0x1cc)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c004a32c>] (hrtimer_interrupt) from [<c02ada84>] (vt8500_timer_interrupt+0x3c/0x4c)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c02ada84>] (vt8500_timer_interrupt) from [<c0042a00>] (handle_irq_event_percpu+0x50/0x15c)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0042a00>] (handle_irq_event_percpu) from [<c0042b38>] (handle_irq_event+0x2c/0x40)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0042b38>] (handle_irq_event) from [<c00450f0>] (handle_level_irq+0x94/0x118)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c00450f0>] (handle_level_irq) from [<c00423fc>] (generic_handle_irq+0x20/0x30)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c00423fc>] (generic_handle_irq) from [<c0042458>] (__handle_domain_irq+0x4c/0xa0)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0042458>] (__handle_domain_irq) from [<c000937c>] (vt8500_handle_irq+0x6c/0x74)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c000937c>] (vt8500_handle_irq) from [<c0011f80>] (__irq_svc+0x40/0x54)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] Exception stack(0xee671ee0 to 0xee671f28)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] 1ee0: 00000009 fffffff7 2a9b0d40 5173e317 3b9ac9ff 00000000 c4653600 ffffffff
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] 1f00: ff2e4aa8 a25c451f 00000268 ee671f78 edacca9a ee671f28 c004e934 c004e9f4
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] 1f20: 800f0013 ffffffff
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0011f80>] (__irq_svc) from [<c004e9f4>] (__getnstimeofday64+0x104/0x168)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c004e9f4>] (__getnstimeofday64) from [<c004ea60>] (getnstimeofday64+0x8/0x24)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c004ea60>] (getnstimeofday64) from [<c004f060>] (do_gettimeofday+0x14/0x40)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c004f060>] (do_gettimeofday) from [<c0047edc>] (SyS_gettimeofday+0x1c/0xb4)
> Sep 5 21:52:59 mahavishnu kernel: [ 2016.080000] [<c0047edc>] (SyS_gettimeofday) from [<c000f0e0>] (ret_fast_syscall+0x0/0x3c)

Is there another clocksource I should be able to switch to on WM8850? Is
there a kernel option I should have switched to avoid this behavour?
Is it a sign of faulty hardware I can't do anything about?

--
Best regards,
Ivan
.config
dmesg.1441489596

Alexey Charkov

unread,
Sep 7, 2015, 6:38:37 AM9/7/15
to VT8500/WM8505 Linux Kernel
Hi Ivan,

Looks like it might be a race condition in the clocksource code (our
hardware doesn't support direct reads or writes of the clock value or
the alarm setting, requiring a busy loop to sync to the bus). Maybe a
spinlock or such in the read/write functions will help.

Could you please reproduce this with the mainline kernel (master branch)?

Thanks,
Alexey

Ivan Krylov

unread,
Sep 7, 2015, 10:01:27 AM9/7/15
to vt8500-wm8505...@googlegroups.com
Alexey Charkov wrote:
> Could you please reproduce this with the mainline kernel (master branch)?

Do you mean current master branch of
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git>?
I'll try it on the weekend.

Meanwhile, downgrading from 4.2.0 to 4.1.6 (both from
https://kernel.org/) seems to have fixed the problem: there are no soft
lockups in dmesg, and all my processes still work after a lot of manual
nanosleep() calls and 16 hours of uptime (4.2.0 evidently failed overnight).

--
Best regards,
Ivan

Alexey Charkov

unread,
Sep 7, 2015, 10:18:26 AM9/7/15
to VT8500/WM8505 Linux Kernel
2015-09-07 17:01 GMT+03:00 Ivan Krylov <krylo...@gmail.com>:
> Alexey Charkov wrote:
>> Could you please reproduce this with the mainline kernel (master branch)?
>
> Do you mean current master branch of
> <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git>?
> I'll try it on the weekend.

Yes, that one. I think it should already include most of the 8850 code
anyway (except for the known-unstable parts like serial flash)

> Meanwhile, downgrading from 4.2.0 to 4.1.6 (both from
> https://kernel.org/) seems to have fixed the problem: there are no soft
> lockups in dmesg, and all my processes still work after a lot of manual
> nanosleep() calls and 16 hours of uptime (4.2.0 evidently failed overnight).

Quite unusual. I guess 'git bisect' would be somewhat tricky/lengthy
to run with that test case? If not, that could help a lot in
diagnosing where the problem comes from.

Thanks a lot,
Alexey

Ivan Krylov

unread,
Sep 13, 2015, 5:31:24 AM9/13/15
to vt8500-wm8505...@googlegroups.com
Alexey Charkov wrote:
> I guess 'git bisect' would be somewhat tricky/lengthy
> to run with that test case? If not, that could help a lot in
> diagnosing where the problem comes from.

It looks like f7dc7fd1c08c6d8109688694fd93d80645df89ff is the first bad
commit; the previous good one ran overnight without soft lookups, and
the last good one (32b0ed3) has sustained running top,
transmission-remote-cli and `while perl -MTime::HiRes=nanosleep
-e'nanosleep 1e4 for 1..100000'; do date; done` for 20 minutes. I
started to suspect that the bug had gone away (bisect session began with
a series of good commits, then a pause while I was busy, then a series
of good commits), but the issue is still replicable if I reboot into 4.2.0.

--
Best regards,
Ivan

Ivan Krylov

unread,
Sep 13, 2015, 5:40:10 AM9/13/15
to vt8500-wm8505...@googlegroups.com
Oops.

Ivan Krylov wrote:
> (bisect session began with
> a series of good commits, then a pause while I was busy, then a series
> of good commits)

I meant: "a series of *bad* commits, then a pause while I was busy, then
a series of good commits"

Sorry for the typo.

--
Best regards,
Ivan

Alexey Charkov

unread,
Sep 16, 2015, 8:39:08 AM9/16/15
to VT8500/WM8505 Linux Kernel


On 13 Sep 2015 11:31, "Ivan Krylov" <krylo...@gmail.com> wrote:
>
> Alexey Charkov wrote:
> > I guess 'git bisect' would be somewhat tricky/lengthy
> > to run with that test case? If not, that could help a lot in
> > diagnosing where the problem comes from.
>
> It looks like f7dc7fd1c08c6d8109688694fd93d80645df89ff is the first bad

This commit looks completely unrelated to the issue. Most likely, your last good one wasn't really all that good :)

> commit; the previous good one ran overnight without soft lookups, and
> the last good one (32b0ed3) has sustained running top,
> transmission-remote-cli and `while perl -MTime::HiRes=nanosleep
> -e'nanosleep 1e4 for 1..100000'; do date; done` for 20 minutes. I
> started to suspect that the bug had gone away (bisect session began with
> a series of good commits, then a pause while I was busy, then a series
> of good commits), but the issue is still replicable if I reboot into 4.2.0.

Can you try spawning several such loops in parallel? As many as the machine can sustain would be best. I suspect your problem may be due to a low-level race condition in the clocksource code, so adding many parallel calls should help trigger it quicker.

Best,
Alexey

Ivan Krylov

unread,
Sep 21, 2015, 4:07:14 PM9/21/15
to vt8500-wm8505...@googlegroups.com
Alexey Charkov wrote:
> Can you try spawning several such loops in parallel? As many as the
> machine can sustain would be best. I suspect your problem may be due to
> a low-level race condition in the clocksource code, so adding many
> parallel calls should help trigger it quicker.

As it turns out after a lot of bisecting,

$ perl -MTime::HiRes=nanosleep -Mthreads -MThread::Queue -s -E'my $i =
0; my $q = Thread::Queue::->new; async { { nanosleep 1e4 for 1..50000;
$q->enqueue(threads::->tid." ".localtime); redo } }->detach for 1..$nth;
while ($_ = $q->dequeue) { say; $i++; print "\e[${nth}A" if $i==$nth;
$i%=$nth; }' -- -nth=4

hangs with

> Process 2196 attached with 5 threads - interrupt to quit
> [pid 2200] nanosleep({0, 10000}, <unfinished ...>
> [pid 2199] nanosleep({0, 10000}, <unfinished ...>
> [pid 2198] nanosleep({0, 10000}, <unfinished ...>
> [pid 2197] nanosleep({0, 10000}, <unfinished ...>
> [pid 2196] futex(0x2a177c, FUTEX_WAIT_PRIVATE, 1, NULL

even on 4.1.6. I'm not sure what should I do next. I'll try 4.0 and
3.18.21, though I'm starting again to suspect that this might be a
hardware problem after all.

--
Best regards,
Ivan

Ivan Krylov

unread,
Nov 1, 2015, 5:03:34 AM11/1/15
to vt8500-wm8505...@googlegroups.com
Well, I finally found the time to do some experiments.

Ivan Krylov wrote:
> I'll try 4.0 and
> 3.18.21, though I'm starting again to suspect that this might be a
> hardware problem after all.

3.18.21 did survive a night of `perl -MTime::HiRes=nanosleep -Mthreads
-MThread::Queue -s -E'my $i = 0; my $q = Thread::Queue::->new; async { {
nanosleep 1e4 for 1..50000; $q->enqueue(threads::->tid." ".localtime);
redo } }->detach for 1..$nth; while ($_ = $q->dequeue) { say; $i++;
print "\e[${nth}A" if $i==$nth; $i%=$nth; }' -- -nth=3` and `top`
running in a detached tmux session, and both programs were still running
normally when I came back to check them, but there were 9 soft lockups
in dmesg (for some reason, nearly always CPU was stuck for 22 seconds
and the stuck application was minissdpd) and the clock drifted 3 hours
and 35 minutes back.

Should I just give up and declare this a hardware issue?

--
Best regards,
Ivan

Alexey Charkov

unread,
Nov 9, 2015, 10:24:33 AM11/9/15
to VT8500/WM8505 Linux Kernel
So it looks like the problem could be related to the hardware timer
wrapping around out of sync vs. what the kernel expects... Though our
32-bit timer at 3MHz should wrap around every ~24 minutes.

> Should I just give up and declare this a hardware issue?

I think it's not necessarily a hardware issue per se - but most likely
a consequence of the peculiar way how WMT timer works. It doesn't
allow direct reads/writes, but instead requires a busy waiting loop on
a status register while the timer value is being read/written to. This
probably needs some kind of a spinlock or such to guard against
concurrent reads of the register.

Vendor code doesn't use any locking there, though (but then they don't
use the proper clocksource/clockevents frameworks either - rather they
just hack something directly into the core timekeeping code).

Could you please try recompiling your kernel with SMP and preemption
disabled in config, and check if the problem persists? Or just add a
global spinlock around the vt8500_timer_read() contents to check if
that helps?

Thanks a lot,
Alexey

Ivan Krylov

unread,
Nov 15, 2015, 2:45:39 AM11/15/15
to vt8500-wm8505...@googlegroups.com
Alexey Charkov wrote:
> Could you please try recompiling your kernel with SMP and preemption
> disabled in config, and check if the problem persists? Or just add a
> global spinlock around the vt8500_timer_read() contents to check if
> that helps?

I believe that these options in config mean that SMP and preemtion are
disabled:
$ zgrep -e SMP -e PREEMPT /proc/config.gz
CONFIG_BROKEN_ON_SMP=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_SMP=y
# CONFIG_SMP is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

The spinlock I've added looks like this:

--- - 2015-11-15 10:33:36.492323975 +0300
+++ linux-4.1.6/drivers/clocksource/vt8500_timer.c 2015-11-14
23:10:06.995785533 +0300
@@ -52,14 +52,21 @@

static void __iomem *regbase;

+// FIXME: trying to investigate hangs in nanosleep
+#include <linux/spinlock.h>
+DEFINE_SPINLOCK(vt8500_timer_read_lock);
static cycle_t vt8500_timer_read(struct clocksource *cs)
{
- int loops = msecs_to_loops(10);
+ int loops; cycle_t retval;
+ spin_lock(&vt8500_timer_read_lock);
+ loops = msecs_to_loops(10);
writel(3, regbase + TIMER_CTRL_VAL);
while ((readl((regbase + TIMER_AS_VAL)) & TIMER_COUNT_R_ACTIVE)
&& --loops)
cpu_relax();
- return readl(regbase + TIMER_COUNT_VAL);
+ retval = readl(regbase + TIMER_COUNT_VAL);
+ spin_unlock(&vt8500_timer_read_lock);
+ return retval;
}

static struct clocksource clocksource = {

There are no soft lockups in dmesg after ~10 hours spent running the
same script + top in tmux (as before), only one warning:
> [ 6599.250000] hrtimer: interrupt took 19000 ns
and the clock drifted back 1.5 hours. However, running ntpdate to fix
the drift caused nanosleep calls to start hanging again, so I had to
reboot the system.

--
Best regards,
Ivan

Roman Volkov

unread,
Dec 13, 2015, 5:53:48 AM12/13/15
to VT8500/WM8505 Linux Kernel
Hi Ivan, Alexey

The fix is available here:
https://github.com/v1ron/linux-mainline/tree/4.3_vt8500_clocksource_breakage
Git bisect pointed me to c6eb3f70d4482806dc2d3e1e3c7736f497b1d418, the
actual problem is that the value of the minimum delta passed to
clockevents_config_and_register() is less than checked in
vt8500_timer_set_next_event(), the latter function fails, if called with minimum
available delta.

Roman

Alexey Charkov

unread,
Dec 15, 2015, 1:08:49 PM12/15/15
to VT8500/WM8505 Linux Kernel

Roman, that's pretty cool, thanks a lot!

You mention that PXA driver is the same - any chance to use it directly?

--
You received this message because you are subscribed to the Google Groups "VT8500/WM8505 Linux Kernel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vt8500-wm8505-linux...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roman Volkov

unread,
Dec 17, 2015, 2:56:32 AM12/17/15
to VT8500/WM8505 Linux Kernel
I would not say that it exactly the same, vt8500 timer is based on PXA. vt8500 has the
same registers as PXA, with additional two registers for bus synchronization. So the
difference should be in readi\write functions, other logic and constants should be the
same.

Regarding reusing the PXA file, it looks quite difficult due to static declarations.
I would just make the code as similar as possible, someone may try to merge these
files later.
To unsubscribe from this group and stop receiving emails from it, send an email to vt8500-wm8505-linux-kernel+unsub...@googlegroups.com.

Ivan Krylov

unread,
Jan 6, 2016, 3:54:43 PM1/6/16
to vt8500-wm8505...@googlegroups.com
Roman Volkov wrote:
> Hi Ivan, Alexey
>
> The fix is available here:
> https://github.com/v1ron/linux-mainline/tree/4.3_vt8500_clocksource_breakage

Roman, thank you very much for your work!

I git am'ed commits 4cbfbaa24d, 78a9e29290, c25c0e9318, 210a265342 from
your tree on top of latest stable 4.3.3 and stumbled upon an impressive
memory leak (which probably isn't relevant); then I tried current LTS
4.1.15 with minor modifications and succeeded, so now the timer in my
WM8500 netbook works normally.

--
Best regards,
Ivan

Roman Volkov

unread,
Jan 6, 2016, 6:34:13 PM1/6/16
to VT8500/WM8505 Linux Kernel
Ivan,

You may try to change min_oscr_delta to (16 * 2) for the leaking kernel 4.3 or apply the patch 1/3 from my newer branch 4.5. This patch goes to the -stable.
My code does not allocate memory, let me know if you find where it leaks.

Regards,
Roman
Reply all
Reply to author
Forward
0 new messages