Re: Network debug info level breaks nx_start -> group_setupidlefiles

13 views
Skip to first unread message

Sebastien Lorquet

unread,
May 19, 2022, 6:09:57 PM5/19/22
to nu...@googlegroups.com

hi all,

I would be really grateful if any one of you can throw a bit of time at this issue, because I think it could manifest in several boards and not just the nucleo h743.

I could be loosing time on something obvious that you will identify faster than me, because I've not looked at NuttX for a long time.


net init is like this:

nx_start -> up_initialize -> arm_netinitialize -> stm32_ethinitialize -> netdev_register

/dev/console is initialized in: arm_serialinit, which is called in up_initialize right before arm_netinitialize


And this is executed JUST BEFORE drivers_initialize


I am also seeing that all debug calls all end up in syslog()

Which itself is by default a driver that will lead to file_open("/dev/console"), am I right?


So any debug message emitted in arm_netinitialize/stm32_ethinitialize is executed while the driver subsystem is not initialized yet.

What does syslog do in that case? Is there an early syslog mechanism?


And this is a recent change, still visible in the current github display for nx_start.c:

https://github.com/apache/incubator-nuttx/commit/a594a5d7a8dcd756884ca7d3449f37ba77f96af3

I cant find a pull request for this change. Nor any discussion. And the commit message makes no sense, there are no clear explanations.

This is an important change that can have lots of consequences with drivers and chip init, specifically in the case where I am stuck.


I believe that nx_open in group_setupidlefiles expects to get fd 0 (because it's the first opened file?) but it does not, because syslog or something else has opened a file before and this breaks an assumption.

I might be seeing different issues. I'll probably put an early breakpoint in fopen to see which values are returned.


I have a workaround by removing the DEBUGASSERT, but I fear an importnat issue hides somewhere else, and I want to find the root cause of this instead of hiding a bug under the carpet.

Sebastien


On 5/19/22 22:34, Petro Karashchenko wrote:
Hi,

Most probably we need to trace when network init is done. Usually in the
early stages the "#  define showprogress(c) arm_lowputc(c)" and
"arm_lowputs" is used to trace the init process, so maybe you can
instrument the code to find out what is going on.

Best regards,
Petro

чт, 19 трав. 2022 р. о 22:15 Nathan Hartman <hartman...@gmail.com> пише:

On Thu, May 19, 2022 at 12:40 PM Sebastien Lorquet <seba...@lorquet.fr>
wrote:
Hi,

I have a nucleo-h743zi, in which I have enabled network.

It seems to work so far.

When I enable Network Info Debug, I get a crash.

The crash does not happen with Network info disabled (Enabling network
warnings and errors is ok)

I am writing on the mailing list to get feedback and discussion.

Here is the crash, pretty obvious cause:

A▒DE
stm32_ethinitialize: intf: 0
stm32_ifdown: Taking the network down
netdev_register: Registered MAC: 2e:7e:27:a2:69:88 as dev: eth0
up_assert: Assertion failed at file:init/nx_start.c line: 703 task: Idle
Task

First of all I dont know where this MAC is obtained but whatever, thats
not my problem. I will put the real mac later.


Note that nx_start is a pretty general file for nuttx, not related to
board or chip.

(gdb) break nx_start.c:703
Breakpoint 1 at 0x8002b9a: file init/nx_start.c, line 703.
Note: automatically using hardware breakpoints for read-only addresses.
(gdb) mon reset init
SWD DPIDR 0x6ba02477
target halted due to debug-request, current mode: Thread
xPSR: 0x01000000 pc: 0x080002e0 msp: 0x240059fc
(gdb) c
Continuing.

Breakpoint 1, nx_start () at init/nx_start.c:703
703 DEBUGVERIFY(group_setupidlefiles(&g_idletcb[i]));
(gdb) s
group_setupidlefiles (tcb=0x24001114 <g_idletcb>) at
group/group_setupidlefiles.c:61
61        FAR struct task_group_s *group = tcb->cmn.group;
(gdb) n
66        DEBUGASSERT(group != NULL);
(gdb) n
74        fd = nx_open("/dev/console", O_RDWR);
(gdb) n
75        if (fd == 0)
(gdb) n
88            if (fd > 0)
(gdb) n
98            return -ENFILE;

=>assert

This function group_setupidlefiles fails to open /dev/console and
triggers an assert.

Now please dont tell me that this code will work when I remove debug, I
dont care. This is not the root cause.


Does this code fails to open the console because some network debug has
been emitted first?

What to do, then?

It is pretty common to have boards emit debug messages before the OS is
started. I cant believe I'm the only one that will see this.

It seems to happen before I have CONFIG_DEV_CONSOLE, which is the
default, and is ok, since i'm just using a serial port with this board.

If I disable CONFIG_DEV_CONSOLE, then nsh does not start, so I believe
this setting is required.

USB console is not enabled nor planned to be used.


Can you please help me understand and fix this?

It does not look to be recent code.

Unfortunately I don't know the cause, but I have actually run into a
rather similar problem just a few days ago. In my case it's on Tiva
platform (armv7m-based) and there is an assertion failure related in
some way to the console device. In my case, the board is a custom
board that does not have a serial console (though it does have a UART
which controls an another chip). For everything else, this board has
only network connectivity. For the time being I have worked around the
crash by disabling a bunch of Kconfigs. I didn't know it was the
network debugging but I do remember that being one of the Kconfigs I
turned off. I will try to look at the specific areas you mentioned and
I'll post back if I figure out what's going on with that. It might
take me a few days though. Inundated with other work at the moment...

Nathan


    

Xiang Xiao

unread,
May 20, 2022, 2:13:29 AM5/20/22
to NuttX
On Fri, May 20, 2022 at 6:09 AM Sebastien Lorquet <seba...@lorquet.fr> wrote:

hi all,

I would be really grateful if any one of you can throw a bit of time at this issue, because I think it could manifest in several boards and not just the nucleo h743.

I could be loosing time on something obvious that you will identify faster than me, because I've not looked at NuttX for a long time.


net init is like this:

nx_start -> up_initialize -> arm_netinitialize -> stm32_ethinitialize -> netdev_register

/dev/console is initialized in: arm_serialinit, which is called in up_initialize right before arm_netinitialize



could you ensure "/dev/console" correctly registers after arm_serialinit return?
 

And this is executed JUST BEFORE drivers_initialize


I am also seeing that all debug calls all end up in syslog()

Which itself is by default a driver that will lead to file_open("/dev/console"), am I right?


So any debug message emitted in arm_netinitialize/stm32_ethinitialize is executed while the driver subsystem is not initialized yet.

What does syslog do in that case? Is there an early syslog mechanism?



syslog is independent from the file system. You can call syslog from any point(except some sched related code to avoid the infinite recursive).
 

And this is a recent change, still visible in the current github display for nx_start.c:

https://github.com/apache/incubator-nuttx/commit/a594a5d7a8dcd756884ca7d3449f37ba77f96af3

I cant find a pull request for this change. Nor any discussion. And the commit message makes no sense, there are no clear explanations. 

This is an important change that can have lots of consequences with drivers and chip init, specifically in the case where I am stuck.


 


I believe that nx_open in group_setupidlefiles expects to get fd 0 (because it's the first opened file?) but it does not, because syslog or something else has opened a file before and this breaks an assumption.


Kernel shouldn't use open/read/write/close, but call file_open/file_read/file_write/file_close instead.
 

I might be seeing different issues. I'll probably put an early breakpoint in fopen to see which values are returned.



Yes, we need to find the caller and correct the function usage.
 
--
This group has moved to the Apache Foundation:
http://nuttx.incubator.apache.org/community/
---
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/53381675-ce05-cf21-818b-5b799c6c2eb1%40lorquet.fr.

--
This group has moved to the Apache Foundation:
http://nuttx.incubator.apache.org/community/
---
You received this message because you are subscribed to the Google Groups "NuttX" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nuttx+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nuttx/53381675-ce05-cf21-818b-5b799c6c2eb1%40lorquet.fr.

Sebastien Lorquet

unread,
May 20, 2022, 5:17:51 AM5/20/22
to nu...@googlegroups.com

Hi,

Complex message.

See below for specific answers.

What is WEIRD is that

-when I put a breakpoint on syslog and continue each time it is reached, the boad boots. So it sounds like a race condition.

-when the function group_setupidlefiles executes, the function inode_search does not find /dev/console. The tree only seems to contain /dev/null

Could that be a cache issue on the stm32h7?

Sébastien

Le 20/05/2022 à 08:13, Xiang Xiao a écrit :


On Fri, May 20, 2022 at 6:09 AM Sebastien Lorquet <seba...@lorquet.fr> wrote:

hi all,

I would be really grateful if any one of you can throw a bit of time at this issue, because I think it could manifest in several boards and not just the nucleo h743.

I could be loosing time on something obvious that you will identify faster than me, because I've not looked at NuttX for a long time.


net init is like this:

nx_start -> up_initialize -> arm_netinitialize -> stm32_ethinitialize -> netdev_register

/dev/console is initialized in: arm_serialinit, which is called in up_initialize right before arm_netinitialize



could you ensure "/dev/console" correctly registers after arm_serialinit return?

It does:

register_driver (path=0x802fb18 "/dev/console", fops=0x803019c <g_serialops>, mode=438, priv=0x24000068 <g_usart3priv>)
    at driver/fs_registerdriver.c:72
72        ret = inode_semtake();
(gdb) n
73        if (ret < 0)
(gdb)
78        ret = inode_reserve(path, mode, &node);
(gdb)
79        if (ret >= 0)
(gdb)
85            INODE_SET_DRIVER(node);
(gdb)
87            node->u.i_ops   = fops;
(gdb)
88            node->i_private = priv;
(gdb)
89            ret             = OK;
(gdb)
92        inode_semgive();
(gdb)
93        return ret;
(gdb)
94      }
(gdb)



 

And this is executed JUST BEFORE drivers_initialize


I am also seeing that all debug calls all end up in syslog()

Which itself is by default a driver that will lead to file_open("/dev/console"), am I right?


So any debug message emitted in arm_netinitialize/stm32_ethinitialize is executed while the driver subsystem is not initialized yet.

What does syslog do in that case? Is there an early syslog mechanism?



syslog is independent from the file system. You can call syslog from any point(except some sched related code to avoid the infinite recursive).

I have checked that syslog indeed ends up calling low level serial functions without using the VFS.


 

And this is a recent change, still visible in the current github display for nx_start.c:

https://github.com/apache/incubator-nuttx/commit/a594a5d7a8dcd756884ca7d3449f37ba77f96af3

I cant find a pull request for this change. Nor any discussion. And the commit message makes no sense, there are no clear explanations. 

This is an important change that can have lots of consequences with drivers and chip init, specifically in the case where I am stuck.


Here is the related PR:

OK...

 


I believe that nx_open in group_setupidlefiles expects to get fd 0 (because it's the first opened file?) but it does not, because syslog or something else has opened a file before and this breaks an assumption.


Kernel shouldn't use open/read/write/close, but call file_open/file_read/file_write/file_close instead.

No idea why it's made like that. It's old code.

Sebastien Lorquet

unread,
May 20, 2022, 5:30:31 AM5/20/22
to nu...@googlegroups.com

Not related to cache, disabling it changes nothing.

If I setup a breakpoint on syslog() just once, then delete the break point after it is hit, then continue, the board starts

When it boots like that, /dev is fully populated

But when group_setupidlefiles, only /dev/null is present, so /dev/console is not found, and then nx_open fails and asserts.

I'm loosing my mind!

Sebastien

Sebastien Lorquet

unread,
May 20, 2022, 10:09:49 AM5/20/22
to nu...@googlegroups.com

new info found just now: the board boots with network info debug when I comment the call to netdev_register.

When I enable netdev_register here:

https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/stm32h7/stm32_ethernet.c#L4474

, the board hangs with a stack dump pointing to here:

https://github.com/apache/incubator-nuttx/blob/master/sched/init/nx_start.c#L703

But if I set a breakpoint on netdev_register, the board starts correctly after a "continue"


Has anyone ever seen something like that?


Could we have stack limitations with the initial boot stack?


Aaaaand the answer is yes: setting CONFIG_IDLETHREAD_STACKSIZE to 2048 fixes my problem. Same with 1536. But 1024 just hangs.

I did not bisect the exact limit.

I consider the root cause found.


1024 might be a bit too small for the idle thread. Might happen to someone else...


Sébastien

Reply all
Reply to author
Forward
0 new messages