[PATCH] progress_thread: warn if notification couldn't be sent

20 views
Skip to first unread message

Dominique Martinet

unread,
May 10, 2026, 9:56:26 PM (3 days ago) May 10
to swup...@googlegroups.com, Dominique Martinet
If the socket is closed it could cause other errors down the road,
having a log here helps understand such issues
---

I had a bug (in my code, that I should have dropped after commits
8b4efc0d23db (swupdate_async_start: fix early termination blocking in
client) and e6b2081bae16 (swupdate_async_start: fix race condition) but
for some reason forgot to...) that was hard to track down due to lack of
messages here:
the client disconnected after notify socket exited, then the server
had not received the full swu so install errored out on bad hash;
this was hard to understand so while I did fix my bug I think having a
log here would be helpful if similar failures happen again in the
future.

The extra check before the last sleep ensures errno is not touched after
the send failure on last iteration.

Thanks!

core/progress_thread.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/core/progress_thread.c b/core/progress_thread.c
index 4b115ba63658..c4bbfbf8126f 100644
--- a/core/progress_thread.c
+++ b/core/progress_thread.c
@@ -92,10 +92,11 @@ static void send_progress_msg(void)
n = send(conn->sockfd, buf, count, MSG_NOSIGNAL | MSG_DONTWAIT);
attempt++;
tryagain = n <= 0 && (errno == EWOULDBLOCK || errno == EAGAIN);
- if (tryagain)
+ if (tryagain && attempt < maxAttempts)
sleep(1);
} while (tryagain && attempt < maxAttempts);
if (n <= 0) {
+ WARN("Removed progress socket %d after error: %m", conn->sockfd);
close(conn->sockfd);
SIMPLEQ_REMOVE(&pprog->conns, conn,
progress_conn, next);
--
2.47.3


Dominique Martinet

unread,
May 11, 2026, 1:01:15 AM (3 days ago) May 11
to swup...@googlegroups.com
Dominique Martinet wrote on Mon, May 11, 2026 at 10:56:16AM +0900:
> I had a bug (in my code, that I should have dropped after commits
> 8b4efc0d23db (swupdate_async_start: fix early termination blocking in
> client) and e6b2081bae16 (swupdate_async_start: fix race condition) but
> for some reason forgot to...) that was hard to track down due to lack of
> messages here:
> the client disconnected after notify socket exited, then the server
> had not received the full swu so install errored out on bad hash;
> this was hard to understand so while I did fix my bug I think having a
> log here would be helpful if similar failures happen again in the
> future.

Ah, actually I didn't check properly and it's broken in master too, so
it wasn't my bug (not sure if I should be glad or not..)

This patch (extra warning) still makes sense to me, but I think some
discussion is needed to make a fix, so describing the bug a bit more:
basically, running swupdate -i` with a SWU containing many small files
followed by a large file ought to do it, but for some reason it only
bugs if compression is used on my machine, which is why I didn't
reproduce at first.

I've attached 'repro.sh' which will create repro/sw-desc-many and a few
files to feed to `swugenerator create`, which fails installing with
swupdate -i (or swupdate + swupdate-client) as per the attached
swupdate.log (relevant snippet below):
```
2026-05-11 13:11:26.584663 [TRACE] : SWUPDATE running : [extract_files] : Installing STREAM foo24.tar.zstd, 285 bytes
2026-05-11 13:11:26.584695 [TRACE] : SWUPDATE running : [install_single_image] : Found installer for stream foo24.tar.zstd archive
2026-05-11 13:11:26.584754 [TRACE] : SWUPDATE running : [install_archive_image] : Installing file foo24.tar.zstd on /tmp/sw/f24, ignoring attributes
2026-05-11 13:11:30.585185 [WARN ] : SWUPDATE running : [send_progress_msg] : Removed progress socket 5 after error: Resource temporarily unavailable
[...]
2026-05-11 13:11:31.523934 [TRACE] : SWUPDATE running : [install_archive_image] : Installing file bigfile.tar.zstd on /tmp/sw/big, ignoring attributes
2026-05-11 13:11:31.525010 progress_ipc_receive_nb failed (-1)
2026-05-11 13:11:31.525077 Cannot consume progress events. Fail.
2026-05-11 13:11:31.525113 [ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate : 55 : SWUpdate *failed* !
2026-05-11 13:11:31.525154 [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
2026-05-11 13:11:31.525184 [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl
2026-05-11 13:11:31.525454 [ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : hash_compare : 521 : HASH mismatch : 4395313e6f79b6554521e1933304e95d0bea836d2bc72e9b3828fd62c5125baf <--> 3eb039c6666dfd614a73b930c98c38ab77ba362ce3b9d745c7684d0d10335726
2026-05-11 13:11:31.525527 [ERROR] : SWUPDATE failed [0] ERROR archive_handler.c : install_archive_image : 344 : Error copying extracted file
2026-05-11 13:11:31.525561 [ERROR] : SWUPDATE failed [0] ERROR archive_handler.c : copy_data : 64 : archive_read_data_block(): Truncated tar archive for 'bigfile': Unknown error -1
```


What goes wrong is the swupdate_async_thread alternates between writing
the image and reading progress, but the thread is stuck writing the swu
image when the socket is full, and meanwhile the server filled the
progress socket as well and had to give up:
```
static void *swupdate_async_thread(void *data)
{
...
/* Start writing the image until end */

do {
...
if (size) {
if (swupdate_image_write(pbuf, size) != size) {
^ stuck here writing SWU to rq->connfd...
perror("swupdate_image_write failed");
swupdate_result = FAILURE;
goto out;
}
}
/* Consume progress events so that the pipe does not get full
* and block the daemon */
ret = consume_progress_events(&progressfd);
^ ... and stops reading new progress events here
```

I can think of two ways to fix this:
- move consuming progress to another thread...
swupdate_async_start() already starts a thread, so it won't hurt much to
start another one, but it's another synchronization point to take care
of (early failures will make image write fail as well so it actually
shouldn't be _too_ bad in practice, but I don't like more threads)

- make rq->connfd NONBLOCK, so write would stop at however how many
bytes it could handle, and poll progressfd/connfd to handle events as
they come and write as we can (at this point we're writing the SWU
binary data so there's no harm in having short writes, so I think this
is cleaner, but it's likely a bit more churn)


Do you have a preference?
I personally prefer the later, so if you're fine with that I can send a
follow-up patch in the following weeks.

Thanks,
--
Dominique
repro.sh
swupdate.log

Dominique Martinet

unread,
May 12, 2026, 2:52:25 AM (yesterday) May 12
to swup...@googlegroups.com
Dominique Martinet wrote on Mon, May 11, 2026 at 10:56:16AM +0900:
> diff --git a/core/progress_thread.c b/core/progress_thread.c
> index 4b115ba63658..c4bbfbf8126f 100644
> --- a/core/progress_thread.c
> +++ b/core/progress_thread.c
> @@ -92,10 +92,11 @@ static void send_progress_msg(void)
> n = send(conn->sockfd, buf, count, MSG_NOSIGNAL | MSG_DONTWAIT);
> attempt++;
> tryagain = n <= 0 && (errno == EWOULDBLOCK || errno == EAGAIN);
> - if (tryagain)
> + if (tryagain && attempt < maxAttempts)
> sleep(1);
> } while (tryagain && attempt < maxAttempts);
> if (n <= 0) {
> + WARN("Removed progress socket %d after error: %m", conn->sockfd);

Ugh, testing with a fix to the problem in sibling mail with this log
enabled, I noticed this warns all the time on success as the client
disconnects after success (progress report fails with EPIPE)

I guess we could make this log only if errno != EPIPE, but I'm thinking
it'd be better to have this as a DEBUG() log instead?
If it's expected I think DEBUG or TRACE is more appropriate and it
likely won't hurt to add even if it always logs at this level.

I'll send a v2 with either check or lower level depending on your
opinion here...
--
Dominique
Reply all
Reply to author
Forward
0 new messages