Dominique Martinet wrote on Mon, May 11, 2026 at 10:56:16AM +0900:
> I had a bug (in my code, that I should have dropped after commits
> 8b4efc0d23db (swupdate_async_start: fix early termination blocking in
> client) and e6b2081bae16 (swupdate_async_start: fix race condition) but
> for some reason forgot to...) that was hard to track down due to lack of
> messages here:
> the client disconnected after notify socket exited, then the server
> had not received the full swu so install errored out on bad hash;
> this was hard to understand so while I did fix my bug I think having a
> log here would be helpful if similar failures happen again in the
> future.
Ah, actually I didn't check properly and it's broken in master too, so
it wasn't my bug (not sure if I should be glad or not..)
This patch (extra warning) still makes sense to me, but I think some
discussion is needed to make a fix, so describing the bug a bit more:
basically, running swupdate -i` with a SWU containing many small files
followed by a large file ought to do it, but for some reason it only
bugs if compression is used on my machine, which is why I didn't
reproduce at first.
I've attached 'repro.sh' which will create repro/sw-desc-many and a few
files to feed to `swugenerator create`, which fails installing with
swupdate -i (or swupdate + swupdate-client) as per the attached
swupdate.log (relevant snippet below):
```
2026-05-11 13:11:26.584663 [TRACE] : SWUPDATE running : [extract_files] : Installing STREAM foo24.tar.zstd, 285 bytes
2026-05-11 13:11:26.584695 [TRACE] : SWUPDATE running : [install_single_image] : Found installer for stream foo24.tar.zstd archive
2026-05-11 13:11:26.584754 [TRACE] : SWUPDATE running : [install_archive_image] : Installing file foo24.tar.zstd on /tmp/sw/f24, ignoring attributes
2026-05-11 13:11:30.585185 [WARN ] : SWUPDATE running : [send_progress_msg] : Removed progress socket 5 after error: Resource temporarily unavailable
[...]
2026-05-11 13:11:31.523934 [TRACE] : SWUPDATE running : [install_archive_image] : Installing file bigfile.tar.zstd on /tmp/sw/big, ignoring attributes
2026-05-11 13:11:31.525010 progress_ipc_receive_nb failed (-1)
2026-05-11 13:11:31.525077 Cannot consume progress events. Fail.
2026-05-11 13:11:31.525113 [ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate : 55 : SWUpdate *failed* !
2026-05-11 13:11:31.525154 [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
2026-05-11 13:11:31.525184 [TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl
2026-05-11 13:11:31.525454 [ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : hash_compare : 521 : HASH mismatch : 4395313e6f79b6554521e1933304e95d0bea836d2bc72e9b3828fd62c5125baf <--> 3eb039c6666dfd614a73b930c98c38ab77ba362ce3b9d745c7684d0d10335726
2026-05-11 13:11:31.525527 [ERROR] : SWUPDATE failed [0] ERROR archive_handler.c : install_archive_image : 344 : Error copying extracted file
2026-05-11 13:11:31.525561 [ERROR] : SWUPDATE failed [0] ERROR archive_handler.c : copy_data : 64 : archive_read_data_block(): Truncated tar archive for 'bigfile': Unknown error -1
```
What goes wrong is the swupdate_async_thread alternates between writing
the image and reading progress, but the thread is stuck writing the swu
image when the socket is full, and meanwhile the server filled the
progress socket as well and had to give up:
```
static void *swupdate_async_thread(void *data)
{
...
/* Start writing the image until end */
do {
...
if (size) {
if (swupdate_image_write(pbuf, size) != size) {
^ stuck here writing SWU to rq->connfd...
perror("swupdate_image_write failed");
swupdate_result = FAILURE;
goto out;
}
}
/* Consume progress events so that the pipe does not get full
* and block the daemon */
ret = consume_progress_events(&progressfd);
^ ... and stops reading new progress events here
```
I can think of two ways to fix this:
- move consuming progress to another thread...
swupdate_async_start() already starts a thread, so it won't hurt much to
start another one, but it's another synchronization point to take care
of (early failures will make image write fail as well so it actually
shouldn't be _too_ bad in practice, but I don't like more threads)
- make rq->connfd NONBLOCK, so write would stop at however how many
bytes it could handle, and poll progressfd/connfd to handle events as
they come and write as we can (at this point we're writing the SWU
binary data so there's no harm in having short writes, so I think this
is cleaner, but it's likely a bit more churn)
Do you have a preference?
I personally prefer the later, so if you're fine with that I can send a
follow-up patch in the following weeks.
Thanks,
--
Dominique