[PATCH 2/2] suricatta process notification: improve ipc_get_status scheduling


Dominique Martinet

Mar 21, 2024, 1:48:33 AM
to stefan...@swupdate.org, swup...@googlegroups.com, shiita....@atmark-techno.com, Dominique Martinet
The process notification thread gets messages from swupdate to submit
them to the hawkbit server.

Before this patch, ipc_get_status_timeout() was called in a tight loop,
probably with the assumption that it would wait until the timeout
happens if no message was available.
Unfortunately, the server just replies immediately that there is no
message, so this would just hammer the server in a tight loop.

Furthermore, on slow devices, if the server takes more than 100ms to
reply, we close our socket, so the server would crash with SIGPIPE
without the previous fix.

Adjust this loop to:
- since ipc_get_status() always returns immediately, the timeout is
useless; use the version without a timeout.
- wait 100ms before calling ipc_get_status() again if no message was
available.

Signed-off-by: Dominique Martinet <dominique...@atmark-techno.com>
---
suricatta/server_hawkbit.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/suricatta/server_hawkbit.c b/suricatta/server_hawkbit.c
index 1633c5cca65b..7ceb1dc038fd 100644
--- a/suricatta/server_hawkbit.c
+++ b/suricatta/server_hawkbit.c
@@ -997,7 +997,7 @@ static void *process_notification_thread(void *data)
 	for (;;) {
 		ipc_message msg;
 		bool data_avail = false;
-		int ret = ipc_get_status_timeout(&msg, 100);
+		int ret = ipc_get_status(&msg);
 
 		data_avail = ret > 0 && (strlen(msg.data.status.desc) != 0);
 
@@ -1060,6 +1060,10 @@ static void *process_notification_thread(void *data)
 
 		if (stop && !data_avail)
 			break;
+
+		// wait a bit for next message...
+		if (!data_avail)
+			usleep(100000);
 	}
 
 	pthread_mutex_unlock(&notifylock);
--
2.39.2


Stefano Babic

Mar 21, 2024, 11:23:38 AM
to Dominique Martinet, stefan...@swupdate.org, swup...@googlegroups.com, shiita....@atmark-techno.com
Hi Dominique,

On 21.03.24 06:41, Dominique Martinet wrote:
> The process notification thread gets messages from swupdate to submit
> them to the hawkbit server.
>
> Before this patch, ipc_get_status_timeout() was called in a tight loop,
> probably with the assumption that it would wait until the timeout
> happens if no message was available.

Correct.

> Unfortunately, the server just replies immediately that there is no
> message, so this would just hammer the server in a tight loop.
>
> Furthermore, on slow devices, if the server takes more than 100ms to
> reply, we close our socket, so the server would crash with SIGPIPE
> without the previous fix.

Thanks for the previous fix, I agree there is a race and the patch fixes it.

>
> Adjust this loop to:
> - since ipc_get_status() always returns immediately, the timeout is
> useless; use the version without a timeout.
> - wait 100ms before calling ipc_get_status() again if no message was
> available.

It looks safe to me.

Acked-by: Stefano Babic <stefan...@swupdate.org>

Best regards,
Stefano

Dominique Martinet

Mar 24, 2024, 9:33:06 PM
to Stefano Babic, swup...@googlegroups.com, shiita....@atmark-techno.com
Stefano Babic wrote on Thu, Mar 21, 2024 at 04:23:33PM +0100:
> Hi Dominique,

Thanks for the prompt review!

> > Adjust this loop to:
> > - since ipc_get_status() always returns immediately, the timeout is
> > useless; use the version without a timeout.

Just an extra comment here: it was pointed out during internal review
that this was the last in-tree user of ipc_get_status_timeout().

Given that the function is user-facing (network_ipc) I don't think we
should remove it in this patch, but we might want to consider whether
it's worth keeping, or whether to mark it deprecated
(__attribute__((deprecated)) or equivalent?) in a follow-up commit. As
far as I can see we don't have any waiting-style IPC, so leaving it
as-is might cause more misunderstandings like the one we had here.
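
Something like this, perhaps (untested sketch; SWU_DEPRECATED is just
an illustrative name, and the prototype is assumed from network_ipc.h):

#if defined(__GNUC__) || defined(__clang__)
#define SWU_DEPRECATED(reason) __attribute__((deprecated(reason)))
#else
#define SWU_DEPRECATED(reason)
#endif

/* keep the symbol for backward compatibility, but warn new users at
 * compile time that the timeout is not actually honored */
SWU_DEPRECATED("returns immediately; use ipc_get_status() instead")
int ipc_get_status_timeout(ipc_message *msg, unsigned int timeout_ms);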

--
Dominique


Stefano Babic

Mar 27, 2024, 5:02:05 AM
to Dominique Martinet, Stefano Babic, swup...@googlegroups.com, shiita....@atmark-techno.com
Hi Dominique,

On 25.03.24 02:32, Dominique Martinet wrote:
> Stefano Babic wrote on Thu, Mar 21, 2024 at 04:23:33PM +0100:
>> Hi Dominique,
>
> Thanks for the prompt review!
>
>>> Adjust this loop to:
>>> - since ipc_get_status() always returns immediately, the timeout is
>>> useless; use the version without a timeout.
>
> Just an extra comment here: it was pointed out during internal review
> that this was the last in-tree user of ipc_get_status_timeout().
>
> Given that the function is user-facing (network_ipc) I don't think we
> should remove it in this patch, but we might want to consider whether
> it's worth keeping, or whether to mark it deprecated
> (__attribute__((deprecated)) or equivalent?)

Well, the whole interface is mostly deprecated (for external usage) and
replaced by the progress interface. But the latter asks more of the
integrator: it has to run as a background task, it is event driven,
etc. So older applications would have to be ported to the new
interface, and that is not what many want. I won't remove the function
from the API, to avoid breakage and to remain backward compatible;
everyone should decide on the most suitable approach for their own
project.
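
Just to illustrate, a minimal client for the progress interface looks
roughly like this (untested sketch, assuming the client helpers
declared in progress_ipc.h):

#include <stdbool.h>
#include <stdio.h>
#include "progress_ipc.h"

/* runs as its own background task and blocks until swupdate pushes an
 * event, i.e. it is event driven rather than polled */
static void *progress_client(void *data)
{
	struct progress_msg msg;
	int fd = progress_ipc_connect(true);	/* true = retry until up */

	(void)data;
	while (progress_ipc_receive(&fd, &msg) > 0) {
		/* react to the event, e.g. forward status and progress */
		fprintf(stderr, "status=%d image=%s %u%%\n",
			msg.status, msg.cur_image, msg.cur_percent);
	}
	return NULL;
}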

Best regards,
Stefano

James Hilliard

Jul 2, 2024, 8:42:24 PM
to Dominique Martinet, stefan...@swupdate.org, swup...@googlegroups.com, shiita....@atmark-techno.com
On Wed, Mar 20, 2024 at 11:48 PM Dominique Martinet
<dominique...@atmark-techno.com> wrote:
>
> The process notification thread gets messages from swupdate to submit
> them to the hawkbit server.
>
> Before this patch, ipc_get_status_timeout() was called in a tight loop,
> probably with the assumption that it would wait until the timeout
> happens if no message was available.
> Unfortunately, the server just replies immediately that there is no
> message, so this would just hammer the server in a tight loop.

This patch appears to have broken swupdate's ability to send logs to
hawkbit; reverting it fixed the issue for me.

Dominique Martinet

Jul 2, 2024, 10:19:59 PM
to James Hilliard, stefan...@swupdate.org, swup...@googlegroups.com, shiita....@atmark-techno.com
James Hilliard wrote on Tue, Jul 02, 2024 at 06:42:09PM -0600:
> On Wed, Mar 20, 2024 at 11:48 PM Dominique Martinet
> <dominique...@atmark-techno.com> wrote:
> >
> > The process notification thread gets messages from swupdate to submit
> > them to the hawkbit server.
> >
> > Before this patch, ipc_get_status_timeout() was called in a tight loop,
> > probably with the assumption that it would wait until the timeout
> > happens if no message was available.
> > Unfortunately, the server just replies immediately that there is no
> > message, so this would just hammer the server in a tight loop.
>
> This patch appears to have broken swupdate's ability to send logs to
> hawkbit; reverting it fixed the issue for me.

Hmm, now that you say that, I guess this will lose the last 100ms of
logs, but you should still get some logs -- is that correct?

Anyway, we should convert this to use ipc_notify_receive instead of
polling the server for status in a loop; I think that should get us as
much data as we had before.
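
Roughly something like this (completely untested; prototypes assumed
from network_ipc.h, and the hawkbit forwarding is elided):

#include "network_ipc.h"

static void *process_notification_thread(void *data)
{
	ipc_message msg;
	/* subscribe once instead of polling ipc_get_status() */
	int notifyfd = ipc_notify_connect();

	(void)data;
	if (notifyfd < 0)
		return NULL;

	for (;;) {
		/* blocks until swupdate pushes the next notification */
		if (ipc_notify_receive(&notifyfd, &msg) <= 0)
			break;

		/* forward msg to the hawkbit server as today */
	}
	return NULL;
}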


$job no longer actively supports hawkbit and I was just looking into it
because a customer reported swupdate crashes (because of the SIGPIPE
fixed in the previous patch), but I'll try to find some time to look
at it next week... They probably wouldn't have noticed the CPU load,
but this was bad enough to make the server thread take longer than
100ms to reply, so I don't think we want to just revert this.

I don't think I'll have time this week, but I'll get back to you if you
don't have time either.

> > Furthermore, on slow devices, if the server takes more than 100ms to
> > reply, we close our socket, so the server would crash with SIGPIPE
> > without the previous fix.
> >
> > Adjust this loop to:
> > - since ipc_get_status() always returns immediately, the timeout is
> > useless; use the version without a timeout.
> > - wait 100ms before calling ipc_get_status() again if no message was
> > available.
--
Dominique


James Hilliard

Jul 2, 2024, 11:06:51 PM
to Dominique Martinet, stefan...@swupdate.org, swup...@googlegroups.com, shiita....@atmark-techno.com
On Tue, Jul 2, 2024 at 8:19 PM Dominique Martinet
<dominique...@atmark-techno.com> wrote:
>
> James Hilliard wrote on Tue, Jul 02, 2024 at 06:42:09PM -0600:
> > On Wed, Mar 20, 2024 at 11:48 PM Dominique Martinet
> > <dominique...@atmark-techno.com> wrote:
> > >
> > > The process notification thread gets messages from swupdate to submit
> > > them to the hawkbit server.
> > >
> > > Before this patch, ipc_get_status_timeout() was called in a tight loop,
> > > probably with the assumption that it would wait until the timeout
> > > happens if no message was available.
> > > Unfortunately, the server just replies immediately that there is no
> > > message, so this would just hammer the server in a tight loop.
> >
> > This patch appears to have broken swupdate's ability to send logs to
> > hawkbit; reverting it fixed the issue for me.
>
> Hmm, now that you say that, I guess this will lose the last 100ms of
> logs, but you should still get some logs -- is that correct?

Nope, the logs from this loop don't seem to be getting uploaded to
hawkbit at all.

>
> Anyway, we should convert this to use ipc_notify_receive instead of
> polling the server for status in a loop; I think that should get us as
> much data as we had before.

Ah

>
>
> $job no longer actively supports hawkbit and I was just looking into it
> because a customer reported swupdate crashes (because of the SIGPIPE
> fixed in the previous patch), but I'll try to find some time to look
> at it next week... They probably wouldn't have noticed the CPU load,
> but this was bad enough to make the server thread take longer than
> 100ms to reply, so I don't think we want to just revert this.
>
> I don't think I'll have time this week, but I'll get back to you if you
> don't have time either.

Thanks.