[PATCH 1/2 v2] progress_thread: add message if notification couldn't be sent

18 views
Skip to first unread message

Dominique Martinet

unread,
May 21, 2026, 3:36:02 AMMay 21
to swup...@googlegroups.com, Dominique Martinet
If the socket is closed it could cause other errors down the road,
having a log here helps understand such issues.

Note this always happens at the end of installs, so this cannot be a
warning, but a debug log should not hurt.. We could check for EPIPE
and use warn if not EPIPE but it's probably not worth it.

Signed-off-by: Dominique Martinet <dominique...@atmark-techno.com>
---
v1->v2:
- add SoB
- lowoer from WARN to DEBUG as this happens after successful installs
too

core/progress_thread.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/core/progress_thread.c b/core/progress_thread.c
index 4b115ba63658..eb105a31154d 100644
--- a/core/progress_thread.c
+++ b/core/progress_thread.c
@@ -92,10 +92,11 @@ static void send_progress_msg(void)
n = send(conn->sockfd, buf, count, MSG_NOSIGNAL | MSG_DONTWAIT);
attempt++;
tryagain = n <= 0 && (errno == EWOULDBLOCK || errno == EAGAIN);
- if (tryagain)
+ if (tryagain && attempt < maxAttempts)
sleep(1);
} while (tryagain && attempt < maxAttempts);
if (n <= 0) {
+ DEBUG("Removed progress socket %d after error: %m", conn->sockfd);
close(conn->sockfd);
SIMPLEQ_REMOVE(&pprog->conns, conn,
progress_conn, next);
--
2.47.3


Dominique Martinet

unread,
May 21, 2026, 3:36:04 AMMay 21
to swup...@googlegroups.com, Dominique Martinet
Running swupdate with a SWU (streamed through swupdate_async_start such
as with `swupdate -i` or `swupdate-client`) can hang if the update
contains many small steps followed by a large step:
- swupdate_async_thread() is stuck trying to write the SWU content to
the server
- but the server sent many progress events and filled progressfd
backlog, and closes the server side of in send_progress_msg() after
retries failed (client is still blocked)
- when server continued installing more items the client is unblocked
and reads up on progress notifications, notices the socket is closed
(POLLHUP) and fails
- the server fails the update as it is not complete

This fails with logs similar to this:
```
progress_ipc_receive_nb failed (-1)
Cannot consume progress events. Fail.
[ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate : 55 : SWUpdate *failed* !
[TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
[TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl
[ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : hash_compare : 521 : HASH mismatch : 4395313e6f79b6554521e1933304e95d0bea836d2bc72e9b3828fd62c5125baf <--> 3eb039c6666dfd614a73b930c98c38ab77ba362ce3b9d745c7684d0d10335726
```

This patch makes the sending socket nonblocking, allows partial sends of
SWU content (there is no need for it to be a full write unlike IPC
messages), and polls both progressfd and connfd to handle both sockets
simultaneously.

The poll is somewhat redundant with the other poll on progressfd in
progress_ipc_receive_nb(), but progress_ipc_receive_nb() is part of the
progress_ipc API and cannot be changed, so is left as is.

Fixes: 8b4efc0d23db ("swupdate_async_start: fix early termination blocking in client")
Link: https://groups.google.com/g/swupdate/c/F3XKUdP0KTc/m/m2jk4EdkBgAJ
Signed-off-by: Dominique Martinet <dominique...@atmark-techno.com>
---
v1 -> v2:
- add SoB
- fix hang when trying to install empty image: if size == 0 then we
shouldn't poll at all, because the server didn't get anything to
unblock us and end the loop.
This somewhat simplifies the logic a bit too.

ipc/network_ipc-if.c | 55 +++++++++++++++++++++++++++++++++++---------
ipc/network_ipc.c | 5 ++++
2 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/ipc/network_ipc-if.c b/ipc/network_ipc-if.c
index 7c209ad484fc..340715855221 100644
--- a/ipc/network_ipc-if.c
+++ b/ipc/network_ipc-if.c
@@ -5,13 +5,15 @@
* SPDX-License-Identifier: LGPL-2.1-or-later
*/

+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <poll.h>
+#include <pthread.h>
+#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
-#include <errno.h>
-#include <signal.h>
-#include <pthread.h>
-#include <inttypes.h>
#include <unistd.h>
#include "network_ipc.h"
#include "progress_ipc.h"
@@ -155,14 +157,15 @@ static int consume_progress_events(int *progressfd)
static void *swupdate_async_thread(void *data)
{
char *pbuf;
- int size;
+ int size = 0, buf_offset = 0;
+ struct pollfd pfds[2];
sigset_t sigpipe_mask;
sigset_t saved_mask;
struct timespec zerotime = {0, 0};
struct async_lib *rq = (struct async_lib *)data;
int swupdate_result = FAILURE;
int progressfd = -1;
- int ret;
+ int ret, nfd, old_flags;
int early_status = -1;

sigemptyset(&sigpipe_mask);
@@ -183,19 +186,49 @@ static void *swupdate_async_thread(void *data)
goto out;
}

+ /* make connfd non-blocking while we're writing image */
+ old_flags = fcntl(rq->connfd, F_GETFL);
+ if (old_flags < 0) {
+ fprintf(stderr, "Could not get connfd flags? %m\n");
+ goto out;
+ }
+ fcntl(rq->connfd, F_SETFL, old_flags | O_NONBLOCK);
+
/* Start writing the image until end */

do {
if (!rq->wr)
break;
+ if (size == buf_offset) {
+ buf_offset = 0;
+ rq->wr(&pbuf, &size);
+ }

- rq->wr(&pbuf, &size);
if (size) {
- if (swupdate_image_write(pbuf, size) != size) {
- perror("swupdate_image_write failed");
- swupdate_result = FAILURE;
+ pfds[0].fd = progressfd;
+ pfds[0].events = POLLIN;
+ pfds[1].fd = rq->connfd;
+ pfds[1].events = POLLOUT;
+ nfd = 2;
+ do {
+ ret = poll(pfds, nfd, -1);
+ } while (ret < 0 && errno == EINTR);
+
+ if (ret < 0) {
+ fprintf(stderr, "poll failed: %m\n");
goto out;
}
+ if (pfds[1].revents & POLLOUT) {
+ ret = swupdate_image_write(pbuf + buf_offset,
+ size - buf_offset);
+ if (ret < 0 && errno != EAGAIN) {
+ perror("swupdate_image_write failed");
+ swupdate_result = FAILURE;
+ goto out;
+ } else if (ret > 0) {
+ buf_offset += ret;
+ }
+ }
}
/* Consume progress events so that the pipe does not get full
* and block the daemon */
@@ -214,7 +247,7 @@ static void *swupdate_async_thread(void *data)
/* interrupt the transfer */
break;
}
- } while(size > 0);
+ } while (size > 0);

ipc_end(rq->connfd);

diff --git a/ipc/network_ipc.c b/ipc/network_ipc.c
index 96a7701d0c9a..ede80dfcffe7 100644
--- a/ipc/network_ipc.c
+++ b/ipc/network_ipc.c
@@ -320,6 +320,11 @@ int ipc_send_data(int connfd, char *buf, int size)

while (len) {
ret = write(connfd, buf, (size_t)len);
+ if (ret < 0 && errno == EINTR)
+ continue;
+ /* return partially written length if any */
+ if (ret < 0 && errno == EAGAIN && len != size)
+ return size - len;
if (ret < 0)
return ret;
len -= ret;
--
2.47.3


Stefano Babic

unread,
May 28, 2026, 5:58:00 AM (8 days ago) May 28
to Dominique Martinet, swup...@googlegroups.com
Hi Dominique,

On 5/21/26 09:35, Dominique Martinet wrote:
> If the socket is closed it could cause other errors down the road,
> having a log here helps understand such issues.
>
> Note this always happens at the end of installs, so this cannot be a
> warning, but a debug log should not hurt.. We could check for EPIPE
> and use warn if not EPIPE but it's probably not worth it.
>
> Signed-off-by: Dominique Martinet <dominique...@atmark-techno.com>
> ---
> v1->v2:
> - add SoB
> - lowoer from WARN to DEBUG as this happens after successful installs
> too
>
> core/progress_thread.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/core/progress_thread.c b/core/progress_thread.c
> index 4b115ba63658..eb105a31154d 100644
> --- a/core/progress_thread.c
> +++ b/core/progress_thread.c
> @@ -92,10 +92,11 @@ static void send_progress_msg(void)
> n = send(conn->sockfd, buf, count, MSG_NOSIGNAL | MSG_DONTWAIT);
> attempt++;
> tryagain = n <= 0 && (errno == EWOULDBLOCK || errno == EAGAIN);
> - if (tryagain)
> + if (tryagain && attempt < maxAttempts)

This is fine.

> sleep(1);
> } while (tryagain && attempt < maxAttempts);
> if (n <= 0) {
> + DEBUG("Removed progress socket %d after error: %m", conn->sockfd);

But the debug can hurt - there are already cases where the client
crashes and it is restarted, and this causes a flood of traces that at
the end can block the daemon. There are a few points inside the code (in
the notify() for example) where logs are forbidden.

If some more information are requested from the daemon, we should then
extend the IPC. But I will avoid this.

Regards,
Stefano

> close(conn->sockfd);
> SIMPLEQ_REMOVE(&pprog->conns, conn,
> progress_conn, next);

--
_______________________________________________________________________
Nabla Software Engineering GmbH
Hirschstr. 111A | 86156 Augsburg | Tel: +49 821 45592596
Geschäftsführer : Stefano Babic | HRB 40522 Augsburg
E-Mail: sba...@nabladev.com

Stefano Babic

unread,
May 28, 2026, 6:11:19 AM (8 days ago) May 28
to Dominique Martinet, swup...@googlegroups.com
Hi Dominique,

On 5/21/26 09:35, Dominique Martinet wrote:
> Running swupdate with a SWU (streamed through swupdate_async_start such
> as with `swupdate -i` or `swupdate-client`) can hang if the update
> contains many small steps followed by a large step:
> - swupdate_async_thread() is stuck trying to write the SWU content to
> the server
> - but the server sent many progress events and filled progressfd
> backlog, and closes the server side of in send_progress_msg() after
> retries failed (client is still blocked)
> - when server continued installing more items the client is unblocked
> and reads up on progress notifications, notices the socket is closed
> (POLLHUP) and fails
> - the server fails the update as it is not complete
>

Understood.

> This fails with logs similar to this:
> ```
> progress_ipc_receive_nb failed (-1)
> Cannot consume progress events. Fail.
> [ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate : 55 : SWUpdate *failed* !
> [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
> [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl
> [ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : hash_compare : 521 : HASH mismatch : 4395313e6f79b6554521e1933304e95d0bea836d2bc72e9b3828fd62c5125baf <--> 3eb039c6666dfd614a73b930c98c38ab77ba362ce3b9d745c7684d0d10335726
> ```
>

My point is that I prefer to have the blocking mechanism that is
reliable, and I am trying to check for options - anyway, I understand
the issue that should be solved.

I am also releasing 2026.05, so I guess this requires more discussion
and it cannot follow into the release. If needed, I can offer later a
2026.05.x.

One option is swupdate_async_thread(), that gets the error, understands
it and do something. For example, calling get_status() and reconnecting
to the progress interface, that is already in non blocking mode.

This also breaks with the past because internal callers of
ipc_send_data() do not manage partial sending of data, because up now
the whole buffer is sent or an error is returned.

Regards,
Stefano

Dominique Martinet

unread,
May 28, 2026, 7:47:16 PM (7 days ago) May 28
to Stefano Babic, swup...@googlegroups.com
Stefano Babic wrote on Thu, May 28, 2026 at 11:57:55AM +0200:
> > + DEBUG("Removed progress socket %d after error: %m", conn->sockfd);
>
> But the debug can hurt - there are already cases where the client crashes
> and it is restarted, and this causes a flood of traces that at the end can
> block the daemon. There are a few points inside the code (in the notify()
> for example) where logs are forbidden.
>
> If some more information are requested from the daemon, we should then
> extend the IPC. But I will avoid this.

Ok - it took me time to notice the problem was with the client and such
a debug message would have helped me here, but if we fix the problem I
don't really need this, let's drop this patch

--
Dominique

Dominique Martinet

unread,
May 28, 2026, 7:59:07 PM (7 days ago) May 28
to Stefano Babic, swup...@googlegroups.com
Stefano Babic wrote on Thu, May 28, 2026 at 12:11:12PM +0200:
> > This fails with logs similar to this:
> > ```
> > progress_ipc_receive_nb failed (-1)
> > Cannot consume progress events. Fail.
> > [ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate : 55 : SWUpdate *failed* !
> > [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
> > [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl
> > [ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : hash_compare : 521 : HASH mismatch : 4395313e6f79b6554521e1933304e95d0bea836d2bc72e9b3828fd62c5125baf <--> 3eb039c6666dfd614a73b930c98c38ab77ba362ce3b9d745c7684d0d10335726
> > ```
> >
>
> My point is that I prefer to have the blocking mechanism that is reliable,
> and I am trying to check for options - anyway, I understand the issue that
> should be solved.
>
> I am also releasing 2026.05, so I guess this requires more discussion and it
> cannot follow into the release. If needed, I can offer later a 2026.05.x.

No worry - thanks for the review and sorry for dumping a big patch on
you before release.
I'm not following releases strictly (building master+patches) so I don't
need an extra tag; this bug only happens with "pathological" SWU images
so I don't think this warrants an urgent fix for most people.

> One option is swupdate_async_thread(), that gets the error, understands it
> and do something. For example, calling get_status() and reconnecting to the
> progress interface, that is already in non blocking mode.

Hmm, the problem with reconnecting is that we might miss the done event,
that the progress interface polling was added to fix in the first place.
I'd rather avoid reconnecting if we can. If you want blocking we can add
yet another thread and some synchronisation between the two... but:

> This also breaks with the past because internal callers of ipc_send_data()
> do not manage partial sending of data, because up now the whole buffer is
> sent or an error is returned.

This is not correct, current callers should not be broken.

ipc_send_data() didn't change much: its behavior only changes if the
underlying socket has been marked O_NONBLOCK, in which case it returns
partial lengths instead of a hard error (so something that didn't work
up till now). If the fd is blocking then EAGAIN should never happen,
and not handling EINTR was a pre-existing bug that should have been
handled anyway in my opinion.

What changed was the internal (static) swupdate_async_thread function,
which flags a fd it opened itself as O_NONBLOCK, and that fd never
leaves the function either, so it's a self-contained change that does
not impact the API.

My personal stance is that I prefer nonblocking to an extra thread, but
if you prefer to keep things blocking then I can rewrite this with
another thread to handle progress and a pthread cond to terminate the
sender loop

Thanks,
--
Dominique

Stefano Babic

unread,
May 29, 2026, 5:50:11 AM (7 days ago) May 29
to Dominique Martinet, Stefano Babic, swup...@googlegroups.com
Hi Dominique,

On 5/29/26 01:58, Dominique Martinet wrote:
> Stefano Babic wrote on Thu, May 28, 2026 at 12:11:12PM +0200:
>>> This fails with logs similar to this:
>>> ```
>>> progress_ipc_receive_nb failed (-1)
>>> Cannot consume progress events. Fail.
>>> [ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate : 55 : SWUpdate *failed* !
>>> [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
>>> [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl
>>> [ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : hash_compare : 521 : HASH mismatch : 4395313e6f79b6554521e1933304e95d0bea836d2bc72e9b3828fd62c5125baf <--> 3eb039c6666dfd614a73b930c98c38ab77ba362ce3b9d745c7684d0d10335726
>>> ```
>>>
>>
>> My point is that I prefer to have the blocking mechanism that is reliable,
>> and I am trying to check for options - anyway, I understand the issue that
>> should be solved.
>>
>> I am also releasing 2026.05, so I guess this requires more discussion and it
>> cannot follow into the release. If needed, I can offer later a 2026.05.x.
>
> No worry - thanks for the review and sorry for dumping a big patch on
> you before release.
> I'm not following releases strictly (building master+patches) so I don't
> need an extra tag; this bug only happens with "pathological" SWU images
> so I don't think this warrants an urgent fix for most people.
>

Ok, fine.

>> One option is swupdate_async_thread(), that gets the error, understands it
>> and do something. For example, calling get_status() and reconnecting to the
>> progress interface, that is already in non blocking mode.
>
> Hmm, the problem with reconnecting is that we might miss the done event,
> that the progress interface polling was added to fix in the first place.

Yes.

> I'd rather avoid reconnecting if we can. If you want blocking we can add
> yet another thread and some synchronisation between the two... but:

mmhh....not, and yes, I thought the same, but another thread adds more
asynchronicity and require sync, and....

>
>> This also breaks with the past because internal callers of ipc_send_data()
>> do not manage partial sending of data, because up now the whole buffer is
>> sent or an error is returned.
>
> This is not correct, current callers should not be broken.
>
> ipc_send_data() didn't change much: its behavior only changes if the
> underlying socket has been marked O_NONBLOCK, in which case it returns
> partial lengths instead of a hard error (so something that didn't work
> up till now). If the fd is blocking then EAGAIN should never happen,
> and not handling EINTR was a pre-existing bug that should have been
> handled anyway in my opinion.

Ah ok, yes, I was wrong. It is fine.

>
> What changed was the internal (static) swupdate_async_thread function,
> which flags a fd it opened itself as O_NONBLOCK, and that fd never
> leaves the function either, so it's a self-contained change that does
> not impact the API.
>
> My personal stance is that I prefer nonblocking to an extra thread,

I absolutely agree on this.

> but
> if you prefer to keep things blocking then I can rewrite this with
> another thread to handle progress and a pthread cond to terminate the
> sender loop

I will check (and test) your patch after the release, but we agree on
using non-blocking over an additional thread.

Best regards,
Stefano
Reply all
Reply to author
Forward
0 new messages