Kernel crash in "cpu_freq"

Torsten Beyer

Jul 13, 2022, 4:28:46 AM
to linux-sunxi
Hi all,

I am trying to track down a bug on an open-source air navigation box for gliders called openvario. It is based on a Cubieboard (A20) plus some additional serial connections and an optional sensor board for various flight-related pressures.

The system runs on kernel 5.18.5, built with Yocto 4.0 (kirkstone). It tends to run for a couple of hours and then freezes or crashes. At the bottom of this post I have pasted typical kernel debug output from one of these freezes. The crash always happens in the cpu_freq driver. If I pin the CPU to a fixed frequency (setting min=max frequency), those crashes disappear. This seems to be a workaround, at the cost of fixing the CPU speed.

So it _seems_ the crash is caused by cpu_freq trying to change the cpu frequency (at least at some point in time). 
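
For anyone who wants to reproduce the min=max workaround, here is a minimal sketch in Python. It assumes the standard cpufreq sysfs files scaling_min_freq and scaling_max_freq under /sys/devices/system/cpu/cpufreq/policy0; the helper name pin_frequency and the example frequency are illustrative and not taken from the openvario image. It has to run as root on the target.

```python
from pathlib import Path

# Assumed location of the first cpufreq policy; adjust if your system differs.
POLICY_DIR = Path("/sys/devices/system/cpu/cpufreq/policy0")


def pin_frequency(khz: int, policy_dir: Path = POLICY_DIR) -> None:
    """Set scaling_min_freq and scaling_max_freq to the same value (in kHz)."""
    current_max = int((policy_dir / "scaling_max_freq").read_text())
    # Write in an order that never asks for min > max at any point.
    if khz > current_max:
        order = ("scaling_max_freq", "scaling_min_freq")
    else:
        order = ("scaling_min_freq", "scaling_max_freq")
    for name in order:
        (policy_dir / name).write_text(f"{khz}\n")


if __name__ == "__main__":
    # 960000 kHz is only an example; pick a value listed in
    # scaling_available_frequencies on your board.
    pin_frequency(960000)
```

With min and max equal, the governor has no other frequency to switch to, which is presumably why the crash no longer triggers.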

To be honest, I am rather clueless on how to go about finding the root of this issue, let alone fixing it. So I thought I'd ask around here whether this bug looks familiar and may have been tackled (or even fixed) previously (I didn't find anything via the search function, though). In other words: I am thankful for any hint that gets me nearer to a fix.

thanks for any pointers
Torsten

[26996.004010] Unable to handle kernel paging request at virtual address 08d80050
[26996.011337] [08d80050] *pgd=00000000
[26996.014952] Internal error: Oops: 5 [#1] SMP ARM
[26996.019590] Modules linked in:
[26996.022663] CPU: 1 PID: 95 Comm: sugov:0 Not tainted 5.18.5 #1
[26996.028509] Hardware name: Allwinner sun7i (A20) Family
[26996.033738] PC is at ccu_div_recalc_rate+0x48/0x90
[26996.038555] LR is at ccu_mux_helper_apply_prediv+0x18/0x1c
[26996.044054] pc : [] lr : [] psr: 600b0113
[26996.050326] sp : f09e5dc8 ip : 00000000 fp : c1938200
[26996.055554] r10: c1867440 r9 : 1f78a400 r8 : c1302d00
[26996.060781] r7 : 1312d000 r6 : 1f78a400 r5 : 00000002 r4 : 08d80084
[26996.067311] r3 : 00000000 r2 : ffffffff r1 : 00000001 r0 : 1f78a400
[26996.073843] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[26996.080985] Control: 10c5387d Table: 41ff006a DAC: 00000051
[26996.086733] Register r0 information: non-paged memory
[26996.091799] Register r1 information: non-paged memory
[26996.096858] Register r2 information: non-paged memory
[26996.101915] Register r3 information: NULL pointer
[26996.106627] Register r4 information: non-paged memory
[26996.111688] Register r5 information: non-paged memory
[26996.116746] Register r6 information: non-paged memory
[26996.121805] Register r7 information: non-paged memory
[26996.126863] Register r8 information: slab kmalloc-128 start c1302d00 pointer offset 0 size 128
[26996.135514] Register r9 information: non-paged memory
[26996.140574] Register r10 information: slab task_struct start c1867440 pointer offset 0
[26996.148517] Register r11 information: slab kmalloc-128 start c1938200 pointer offset 0 size 128
[26996.157244] Register r12 information: NULL pointer
[26996.162049] Process sugov:0 (pid: 95, stack limit = 0xf4bf205c)
[26996.167985] Stack: (0xf09e5dc8 to 0xf09e6000)
[26996.172361] 5dc0: c0d81584 c03db530 00000000 1f78a400 c1355700 c03d181c
[26996.180547] 5de0: c1355600 c1355700 1f78a400 c03d34ec 00000000 c1355600 1f78a400 39387000
[26996.188733] 5e00: c1302d00 1f78a400 c1867440 c03d3554 00000000 c1302d00 016e3600 39387000
[26996.196917] 5e20: c1302d00 1f78a400 c1867440 c03d3554 c1355600 00000000 1f78a400 c1867440
[26996.205101] 5e40: c1302d00 1f78a400 c1867440 c03d39f0 1f78a400 00000000 ffffffff 1f78a400
[26996.213287] 5e60: c0d81bd0 df7bf617 c193a340 1f78a400 1f78a400 c1938300 ef7dc050 1f78a400
[26996.221474] 5e80: c1867440 c03d3c28 c18b3b00 c1938500 1f78a400 c1938300 ef7dc050 c06122a4
[26996.229659] 5ea0: c1938300 00000001 ffffffff df7bf617 c0d81bd0 c18b3b00 ef7dc050 1f78a400
[26996.237844] 5ec0: 00000007 c1867440 c1938500 c0db652c 00080e80 c0612674 00000000 c0db617c
[26996.246030] 5ee0: 1f78a400 df7bf617 c1812800 c1812800 00000000 c0dfd944 000ea600 00000000
[26996.254214] 5f00: 00000002 c0617054 00000001 c1867440 00000000 00000000 f09e5f5c c1812800
[26996.262400] 5f20: 000ea600 00080e80 00000024 df7bf617 00000004 c184ba00 c184ba14 00000000
[26996.270585] 5f40: 00080e80 c184ba2c 00000001 c0a34650 00000000 c0159c98 00000000 c184ba28
[26996.278770] 5f60: c1867440 c0dea144 c184ba2c c0136954 c193a500 c1867440 c01368e0 c184ba28
[26996.286955] 5f80: c13c2100 f0891c44 00000000 c0138194 c193a500 c01380c4 00000000 00000000
[26996.295138] 5fa0: 00000000 00000000 00000000 c0100148 00000000 00000000 00000000 00000000
[26996.303321] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[26996.311505] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[26996.319695] ccu_div_recalc_rate from clk_recalc+0x34/0x78
[26996.325215] clk_recalc from clk_change_rate+0xa4/0x29c
[26996.330461] clk_change_rate from clk_change_rate+0x10c/0x29c
[26996.336226] clk_change_rate from clk_change_rate+0x10c/0x29c
[26996.341991] clk_change_rate from clk_core_set_rate_nolock+0x16c/0x234
[26996.348539] clk_core_set_rate_nolock from clk_set_rate+0x30/0x154
[26996.354741] clk_set_rate from _set_opp+0x268/0x550
[26996.359644] _set_opp from dev_pm_opp_set_rate+0xe8/0x20c
[26996.365062] dev_pm_opp_set_rate from __cpufreq_driver_target+0x584/0x6e4
[26996.371876] __cpufreq_driver_target from sugov_work+0x48/0x54
[26996.377741] sugov_work from kthread_worker_fn+0x74/0x1a4
[26996.383167] kthread_worker_fn from kthread+0xd0/0xec
[26996.388242] kthread from ret_from_fork+0x14/0x2c
[26996.392967] Exception stack(0xf09e5fb0 to 0xf09e5ff8)
[26996.398032] 5fa0: 00000000 00000000 00000000 00000000
[26996.406216] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[26996.414398] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[26996.421027] Code: e0055231 e244102c e3e02000 eb0001f3 (e5143034)

Samuel Holland

Jul 16, 2022, 12:16:16 AM
to tb5...@gmail.com, linux-sunxi
Hi Torsten,

On 7/13/22 3:18 AM, Torsten Beyer wrote:
> Hi all,
>
> I am trying to debug a bug on an open source air navigation box for
> gliders called openvario <https://www.openvario.org/doku.php>. It is
> based on a cubieboard (A20) plus some additional serial connections
> and an optional sensor board for various flight related pressures.
>
> System runs on kernel 5.18.5 generated using Yocto 4.0 kirkstone. The
> system tends to run for a couple of hours and then freezes/crashes.
> At the bottom of this post I have pasted a typical kernel debug
> output once these freezes happen. The crash always happens in the
> cpu_freq driver. If I set cpu frequency to a fixed frequency (setting
> min=max frequency) those crashes disappear. This seems to be a
> workaround at the cost of fixing cpu speed.
>
> So it _seems_ the crash is caused by cpu_freq trying to change the
> cpu frequency (at least at some point in time).
>
> To be honest, I am rather clueless on how to go about finding the
> root of this issue, let alone fixing it. So I thought, I'd ask around
> here whether this bug somehow looks familiar and may have been
> tackled (or even fixed) previously (didn't find anything, though, via
> the search function). In other words: I am thankful for any hint
> people may be able to give me to get nearer to a fix. 

I have not seen something like this before. It looks like hardware
flakiness. Can you provide a disassembly of ccu_div_recalc_rate
from the kernel this splat came from, to confirm my analysis?

> thanks for any pointers
> Torsten
>
> [26996.004010] Unable to handle kernel paging request at virtual address 08d80050
> [26996.011337] [08d80050] *pgd=00000000
> [26996.014952] Internal error: Oops: 5 [#1] SMP ARM
> [26996.019590] Modules linked in:
> [26996.022663] CPU: 1 PID: 95 Comm: sugov:0 Not tainted 5.18.5 #1
> [26996.028509] Hardware name: Allwinner sun7i (A20) Family
> [26996.033738] PC is at ccu_div_recalc_rate+0x48/0x90
> [26996.038555] LR is at ccu_mux_helper_apply_prediv+0x18/0x1c

The crash is between the calls to ccu_mux_helper_apply_prediv and
divider_recalc_rate, so we are loading arguments for the call to
divider_recalc_rate.

> [26996.044054] pc : [] lr : [] psr: 600b0113
> [26996.050326] sp : f09e5dc8 ip : 00000000 fp : c1938200
> [26996.055554] r10: c1867440 r9 : 1f78a400 r8 : c1302d00
> [26996.060781] r7 : 1312d000 r6 : 1f78a400 r5 : 00000002 r4 : 08d80084

Assuming r4 is "hw", then the faulting address is cd->div.flags.
This is weird because r5 already contains cd->div.width...

> [26996.067311] r3 : 00000000 r2 : ffffffff r1 : 00000001 r0 : 1f78a400

..and r3 already contains cd->div.table. So we were already able
to access parts of the struct both before and after the faulting
address.

> [26996.073843] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
> [26996.080985] Control: 10c5387d Table: 41ff006a DAC: 00000051
> [26996.086733] Register r0 information: non-paged memory
> [26996.091799] Register r1 information: non-paged memory
> [26996.096858] Register r2 information: non-paged memory
> [26996.101915] Register r3 information: NULL pointer
> [26996.106627] Register r4 information: non-paged memory
> [26996.111688] Register r5 information: non-paged memory
> [26996.116746] Register r6 information: non-paged memory
> [26996.121805] Register r7 information: non-paged memory
> [26996.126863] Register r8 information: slab kmalloc-128 start c1302d00 pointer offset 0 size 128
> [26996.135514] Register r9 information: non-paged memory
> [26996.140574] Register r10 information: slab task_struct start c1867440 pointer offset 0
> [26996.148517] Register r11 information: slab kmalloc-128 start c1938200 pointer offset 0 size 128
> [26996.157244] Register r12 information: NULL pointer
> [26996.162049] Process sugov:0 (pid: 95, stack limit = 0xf4bf205c)
> [26996.167985] Stack: (0xf09e5dc8 to 0xf09e6000)
> [26996.172361] 5dc0: c0d81584 c03db530 00000000 1f78a400 c1355700 c03d181c

What I think is happening is that the value in r4 got corrupted from
0xc0d81584 (the saved value on the top of the stack) to 0x08d80084.
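
The arithmetic behind this can be checked with a short sketch in Python that uses only values from the splat. It decodes the faulting instruction word e5143034 (the one in parentheses on the "Code:" line) as an A32 load with a 12-bit immediate offset, confirms that r4 plus that offset gives the faulting address, and shows which bits differ between r4 and the value on top of the stack. The helper name decode_ldr_imm is made up for illustration, and the idea that r4 originally held 0xc0d81584 remains a hypothesis.

```python
R4 = 0x08D80084          # r4 from the register dump
FAULT_ADDR = 0x08D80050  # address from "Unable to handle kernel paging request"
INSN = 0xE5143034        # faulting instruction word from the "Code:" line
STACK_TOP = 0xC0D81584   # first word of the stack dump


def decode_ldr_imm(insn: int) -> dict:
    """Decode an A32 load/store word that uses a 12-bit immediate offset."""
    assert (insn >> 25) & 0b111 == 0b010, "not an immediate-offset load/store"
    return {
        "load": bool(insn & (1 << 20)),  # L bit: set for LDR, clear for STR
        "add": bool(insn & (1 << 23)),   # U bit: offset is added when set
        "rn": (insn >> 16) & 0xF,        # base register
        "rt": (insn >> 12) & 0xF,        # destination register
        "imm12": insn & 0xFFF,
    }


fields = decode_ldr_imm(INSN)
offset = fields["imm12"] if fields["add"] else -fields["imm12"]
computed = (R4 + offset) & 0xFFFFFFFF
flipped = R4 ^ STACK_TOP

print(fields)                      # base r4, destination r3, offset 0x34 subtracted
print(hex(computed))               # 0x8d80050, the faulting address
print(hex(flipped), bin(flipped).count("1"))  # 0xc8001500, 6 differing bits
```

So the faulting instruction is a load into r3 from r4 minus 0x34, and r4 differs from the stack value in six bits spread over two bytes.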

Can you try increasing the voltage of the lower OPPs by 100 mV? And
if that doesn't work, try setting all of the OPPs to 1.4 V. That
should rule out any instability due to an insufficient CPU supply
voltage, and also due to any delay in slewing the regulator output.

Regards,
Samuel

Torsten Beyer

Jul 16, 2022, 2:54:32 AM
to linux-sunxi
Hi Samuel,

thanks for your insights. Will try to follow them and will report back here. 

In the meantime I have built a kernel with dynamic debug, and I can see that cpu_freq and the associated calls shown in my earlier post run millions of times without a problem. Then, out of the blue, a crash. So hardware flakiness came to my mind, too.

regards
Torsten

Torsten Beyer

Jul 18, 2022, 3:26:58 AM
to linux-sunxi
Hi Samuel,

I am stuck trying to figure out how to increase the voltages. Can you point me to some documentation or quickly explain how I would do that?

regards
Torsten

Torsten Beyer

Jul 18, 2022, 8:13:47 AM
to linux-sunxi
Hi again,

I believe I found the place. Can you confirm that the OPP entries in "arch/arm/boot/dts/sun7i-a20.dtsi" are the right place to raise the microvolt values for the lower frequencies?

cheers
-tb

Samuel Holland

Jul 18, 2022, 10:13:48 PM
to tb5...@gmail.com, linux-sunxi
Hi Torsten,

On 7/18/22 7:13 AM, Torsten Beyer wrote:
> Hi again,
>
> I believe I found the place. Can you confirm that the OPP entries in
> "arch/arm/boot/dts/sun7i-a20.dtsi" are the right place to raise the
> microvolt values for the lower frequencies?

Yes, in that file, in the operating-points property of the cpu nodes.
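
To illustrate the kind of change this means: the operating-points property is a flat list of (frequency in kHz, voltage in uV) pairs. The Python sketch below raises every voltage that is below a chosen floor and prints the result in DTS form. The table values are an assumption about what sun7i-a20.dtsi contains in 5.18 and must be checked against the actual tree; raise_floor and format_dts are hypothetical helper names.

```python
# (kHz, uV) pairs. ASSUMED values, verify against your own kernel tree.
A20_OPPS = [
    (960000, 1400000),
    (912000, 1400000),
    (864000, 1300000),
    (720000, 1200000),
    (528000, 1100000),
    (312000, 1000000),
    (144000, 1000000),
]


def raise_floor(opps, floor_uv):
    """Return a copy of the table with every voltage at least floor_uv."""
    return [(khz, max(uv, floor_uv)) for khz, uv in opps]


def format_dts(opps):
    """Render the table as a device tree operating-points property."""
    lines = ["operating-points = <", "\t/* kHz\t  uV */"]
    lines += [f"\t{khz}\t{uv}" for khz, uv in opps]
    lines.append("\t>;")
    return "\n".join(lines)


if __name__ == "__main__":
    # A floor of 1.1 V is 100 mV above the lowest entries in the assumed table.
    print(format_dts(raise_floor(A20_OPPS, 1100000)))
```

Only the entries below the floor change; the higher OPPs keep their original voltages.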

Regards,
Samuel

Torsten Beyer

Jul 19, 2022, 4:44:53 AM
to linux-sunxi
Hi Samuel,

Fantastic - I built a kernel with the changed setting yesterday afternoon (increased min OPP to 1.1V) and the system has been running for about 20 hours now without freezes. Thanks for your help - looks like this patch fixes it.

cheers
Torsten

Samuel Holland

Jul 19, 2022, 11:24:27 PM
to Torsten Beyer, linux-sunxi
Hi Torsten,

On 7/19/22 3:44 AM, Torsten Beyer wrote:
> Fantastic - I built a kernel with the changed setting yesterday afternoon
> (increased min OPP to 1.1V) and the system has been running for about 20 hours
> now without freezes. Thanks for your help - looks like this patch fixes it.

Thanks for the confirmation. The voltage regulator supplying the CPU (reg_dcdc2)
has a 25 mV step size, so you could see if a smaller change to the OPP is enough
to make the board stable.
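
As a rough sketch of what "a smaller change" could look like, the Python snippet below lists the voltages on a 25 mV grid between an assumed original minimum of 1.0 V and the 1.1 V that is known to be stable. The 1.0 V starting point is an assumption about the stock OPP table, and the function name candidates is made up for illustration.

```python
STEP_UV = 25_000  # step size of reg_dcdc2, as stated above


def candidates(low_uv: int, high_uv: int, step_uv: int = STEP_UV) -> list:
    """Voltages from low_uv up to and including high_uv, one per regulator step."""
    return list(range(low_uv, high_uv + 1, step_uv))


if __name__ == "__main__":
    for uv in candidates(1_000_000, 1_100_000):
        print(f"{uv} uV = {uv / 1_000_000:.3f} V")
```

Testing the intermediate values from the top down (1.075 V, then 1.050 V, and so on) would show how much margin the board actually needs.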

I assume you are powering the board with a reasonably stable power supply? In
that case, it would be good to apply your change upstream, in case any other
Cubieboard 2 users are experiencing crashes. If you want to submit a patch, I
would suggest overriding the OPP table in the board-specific DTS. See here[1]
for an example of a board that does this.

Regards,
Samuel

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/sun7i-a20-bananapi.dts#n105

Torsten Beyer

Jul 20, 2022, 1:33:39 AM
to linux-sunxi
Hi Samuel,

thanks for the additional hints. I will try to follow them (this is the first time I have poked around in a Unix kernel since 4.3BSD, so I'm still a bit rusty). The power supply is suitably stable. I will experiment with lower voltages and see what happens. I may run into questions and would appreciate your further support if I need it.

cheers
Torsten

Torsten Beyer

Jul 22, 2022, 10:16:44 AM
to linux-sunxi
Hi Samuel,

I tried setting the OPP voltages in 50 mV increments. That doesn't seem to work. Shouldn't /sys/class/regulator/regulator.9/microvolts show the actual voltage supplied to the CPU? When I set the lowest value to 1.15 V, I never see a value lower than 1.2 V there. I would say either I am looking in the wrong place or the regulator is rounding the value. Any comments?
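
To check whether regulator.9 is really the CPU supply, one option is to list every regulator together with its name and current voltage. The following is a minimal sketch in Python, assuming the standard regulator sysfs attributes "name" and "microvolts"; the helper name read_regulators is made up, and the expectation that the CPU rail carries a name such as "vdd-cpu" is an assumption about the board's device tree.

```python
from pathlib import Path

REGULATOR_ROOT = Path("/sys/class/regulator")


def read_regulators(root: Path = REGULATOR_ROOT) -> dict:
    """Return {sysfs directory name: (regulator name, microvolts or None)}."""
    result = {}
    for entry in sorted(root.glob("regulator.*")):
        name_file = entry / "name"
        name = name_file.read_text().strip() if name_file.exists() else "?"
        try:
            uv = int((entry / "microvolts").read_text())
        except (OSError, ValueError):
            # Not every regulator exposes a readable voltage.
            uv = None
        result[entry.name] = (name, uv)
    return result


if __name__ == "__main__":
    for directory, (name, uv) in read_regulators().items():
        print(f"{directory}: {name} {uv}")
```

The reading is only meaningful while the CPU sits at the lowest OPP, so it helps to sample it on an idle system and compare against scaling_cur_freq at the same moment.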

regards
Torsten
