[OVN][nbctld][bug] ovn-nbctl daemon hits an infinite loop?

Girish Moodalbail

Dec 17, 2020, 11:55:08 AM
to ovs dev, ovn-kub...@googlegroups.com
Hello all,

Say ovn-nbctl is started in daemon mode with options set for certs, and those certs do not exist on the file system. For example, in the following invocation, assume that the `/ovn-cert` folder is empty:

ovn-nbctl -vconsole:dbg --pidfile=/tmp/ovn-nbctl.pid --db=ssl:10.0.64.7:6641,ssl:10.0.64.6:6641,ssl:10.0.64.4:6641 --log-file=/tmp/ovn-nbctl.log --detach -p /ovn-cert/ovncontroller-privkey.pem -c /ovn-cert/ovncontroller-cert.pem -C /ovn-cert/ca-cert.pem

Now, if we run a command against that daemon via...

ovs-appctl -t /var/run/ovn/ovn-nbctl.32254.ctl list-commands

...then the ovn-nbctl daemon goes into an infinite loop, hogging the CPU. In the log file, I see millions of the following messages:

2020-12-17T08:00:10Z|37902584|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902585|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902586|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902587|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902588|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902589|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902590|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902591|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902592|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902593|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902594|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902595|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)
2020-12-17T08:00:10Z|37902596|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (/var/run/ovn/ovn-nbctl.32254.ctl<->) at lib/stream-fd.c:274 (36% CPU usage)



This is my theory. In ovn-nbctl.c`server_loop(), we have this infinite loop:

    for (;;) {
        if (ovsdb_idl_has_ever_connected(idl)) {
            daemonize_complete();
            unixctl_server_run(server);
        }
        ovsdb_idl_wait(idl);
        unixctl_server_wait(server);
        poll_block();
    }


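Since ovsdb_idl_has_ever_connected() never becomes true due to the missing certs, unixctl_server_run() is never called, so the command from ovs-appctl is never read from the control socket. The socket therefore stays readable, poll_block() returns immediately every time, and we end up spinning in this loop?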
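
To illustrate the theory outside of OVS, here is a minimal sketch that assumes poll_block() behaves like a level-triggered poll(): a descriptor with unread data is reported readable on every call until somebody reads it. The function name count_wakeups() and the pipe standing in for the unixctl control socket are hypothetical and are not taken from the OVS/OVN sources.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Polls a pipe holding one unread byte `iterations` times.
 * If `drain` is nonzero, the byte is read on the first wakeup, which is
 * roughly what a "run" step would do for a pending request.
 * Returns the number of poll() calls that reported POLLIN, or -1 on error. */
int count_wakeups(int drain, int iterations)
{
    int fds[2];
    char byte = 'x';
    int wakeups = 0;

    assert(iterations > 0);
    if (pipe(fds) != 0) {
        return -1;
    }
    /* One pending "request" that nobody has read yet. */
    if (write(fds[1], &byte, 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    for (int i = 0; i < iterations; i++) {
        struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };

        /* Timeout 0 so the demo never hangs once the pipe is empty. */
        int n = poll(&pfd, 1, 0);
        if (n < 0) {
            wakeups = -1;
            break;
        }
        if (n > 0 && (pfd.revents & POLLIN)) {
            wakeups++;
            if (drain) {
                char buf;
                if (read(fds[0], &buf, 1) != 1) {
                    wakeups = -1;
                    break;
                }
            }
        }
    }

    close(fds[0]);
    close(fds[1]);
    return wakeups;
}
```

Calling count_wakeups(0, 1000) reports a wakeup on every iteration because the byte is never consumed, while count_wakeups(1, 1000) reports a single wakeup. The first case corresponds to waiting on the control socket without ever running the server.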

Regards,
~Girish


Dumitru Ceara

Dec 18, 2020, 6:47:19 AM
to Ben Pfaff, Girish Moodalbail, ovs dev, ovn-kub...@googlegroups.com
On 12/17/20 7:55 PM, Ben Pfaff wrote:
> On Thu, Dec 17, 2020 at 08:54:56AM -0800, Girish Moodalbail wrote:
>> Hello all,
>>
>> Say, ovn-nbctl is started in daemon mode with options set for certs, and
>> those certs do not exist on the file system. For example, in the following
>> invocation assume that `/ovn-cert` folder is empty
>>
>> ovn-nbctl -vconsole:dbg --pidfile=/tmp/ovn-nbctl.pid --db=ssl:10.0.64.7:6641
>> ,ssl:10.0.64.6:6641,ssl:10.0.64.4:6641 --log-file=/tmp/ovn-nbctl.log
>> --detach -p /ovn-cert/ovncontroller-privkey.pem -c
>> /ovn-cert/ovncontroller-cert.pem -C /ovn-cert/ca-cert.pem
>>
>> Now, if we run a command against that daemon via....
>>
>> ovs-appctl -t /var/run/ovn/ovn-nbctl.32254.ctl list-commands
>
> [...]
>
>> This is my theory. In ovn-nbctl.c`server_loop(), we have this infinite loop
>>
>>     for (;;) {
>>         if (ovsdb_idl_has_ever_connected(idl)) {
>>             daemonize_complete();
>>             unixctl_server_run(server);
>>         }
>>         ovsdb_idl_wait(idl);
>>         unixctl_server_wait(server);
>>         poll_block();
>>     }
>>
>> Since ovsdb_idl_has_ever_connected() is not true due to missing certs, we
>> never get a chance to run the command from ovs-appctl and then poll_block()
>> will return immediately and we enter an infinite loop?
>
> (The above is a partial snippet, there's actually more in the loop.)
>
> It's always an infinite loop, it's just that it wastes CPU in that case.
> I think that you're right about the cause. I think we should only call
> unixctl_server_wait() if we'd call unixctl_server_run(), so the right
> thing to do appears to be to move the unixctl_server_wait() call into the
> "if" condition.

In that case an "ovn-appctl -t ... <command>" will just block until the
IDL connects at least once.

Instead, would there be a concern with calling unixctl_server_run()
unconditionally?

This would allow the users to actually interact with the nbctl daemon
and, for example, gracefully stop it if it can't connect for whatever
reason:

# Start ovn-nbctl daemon without first starting the NB DB:
# This blocks because the IDL cannot connect.
export OVN_NB_DAEMON=$(ovn-nbctl --detach)

# In a different terminal, enable debug logs, exit, etc.
ovn-appctl -t /var/run/ovn/ovn-nbctl.18042.ctl vlog/set dbg
ovn-appctl -t /var/run/ovn/ovn-nbctl.18042.ctl exit
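
The trade-off between the placements discussed above can be sketched with a toy model. This is only an illustration under stated assumptions: it does not call any OVS function, the IDL is assumed never to connect, a single request is assumed to be pending on the control socket, and the names simulate(), enum variant and struct result are hypothetical. It also only models requests that the unixctl server can answer by itself (such as vlog/set or exit), not ovn-nbctl commands that need the database.

```c
#include <assert.h>
#include <stdbool.h>

enum variant {
    ORIGINAL,            /* run inside the "if", wait outside it */
    WAIT_INSIDE_IF,      /* run and wait both inside the "if" */
    RUN_UNCONDITIONALLY  /* run and wait both outside the "if" */
};

struct result {
    int spins;   /* iterations in which the poll would return at once */
    int served;  /* requests read from the control socket */
};

/* Models `iterations` passes of the server loop while the IDL never
 * connects and one request is pending on the control socket. */
struct result simulate(enum variant v, int iterations)
{
    bool idl_connected = false;  /* stand-in for the "ever connected" check */
    int pending = 1;             /* unread requests on the socket */
    struct result r = { 0, 0 };

    assert(iterations >= 0);
    for (int i = 0; i < iterations; i++) {
        bool run = idl_connected || v == RUN_UNCONDITIONALLY;
        bool wait = idl_connected || v != WAIT_INSIDE_IF;

        if (run) {
            /* The "run" step consumes whatever is pending. */
            r.served += pending;
            pending = 0;
        }
        if (wait && pending > 0) {
            /* Socket registered and still readable: immediate wakeup. */
            r.spins++;
        }
    }
    return r;
}
```

In this model ORIGINAL spins on every iteration and serves nothing, WAIT_INSIDE_IF neither spins nor serves (the client just blocks), and RUN_UNCONDITIONALLY serves the request without spinning.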

Regards,
Dumitru


Dumitru Ceara

Dec 21, 2020, 11:00:20 AM
to Ben Pfaff, Girish Moodalbail, ovs dev, ovn-kub...@googlegroups.com
I went ahead and sent a patch in this direction:

http://patchwork.ozlabs.org/project/ovn/patch/1608566295-1324-1-g...@redhat.com/

Regards,
Dumitru

Dumitru Ceara

Dec 22, 2020, 4:38:56 AM
to Ben Pfaff, Girish Moodalbail, ovs dev, ovn-kub...@googlegroups.com
On 12/22/20 12:03 AM, Ben Pfaff wrote:
> OK.
>
> I can think of two possible reasons it isn't already outside the "if".
> One is a simple mistake. The other is that it's so that the server
> doesn't have to buffer ovn-nbctl requests until they can be serviced (or
> reject them before they can be serviced). If it's the latter, then some
> decision about what to do about those now will have to be made.
>

OK, I see. I'm not sure what the original reason was, but wouldn't
requests be buffered in both cases, that is, with
unixctl_server_run/wait() either inside or outside the "if"?

If the unixctl server is run unconditionally, then server_cmd_run() ->
main_loop() blocks until the IDL connects for the first time. Subsequent
requests are buffered in the socket buffer.

If unixctl_server_run/wait() are called inside the "if", that is, only
once the IDL has connected at least once, then requests are just
buffered in the socket buffer until the IDL connects for the first time.

Also, I'm still confused about why this would be useful only when the
IDL didn't connect at least once.

Thanks,
Dumitru

Dumitru Ceara

Jan 11, 2021, 4:32:25 AM
to Ben Pfaff, Girish Moodalbail, ovs dev, ovn-kub...@googlegroups.com
On 1/7/21 9:21 PM, Ben Pfaff wrote:
> Looking closer, I don't have a good reason why not to move it outside
> the loop. I suggest making the change.
>

I guess you meant "move it outside the 'if' block" in which case the
following still applies:

http://patchwork.ozlabs.org/project/ovn/patch/1608566295-1324-1-g...@redhat.com/

Thanks,
Dumitru
