Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

[PATCH] Fix REQ_INSTALL/GET_STATUS synchronization

29 views
Skip to first unread message

Frederic Hoerni

unread,
Jul 10, 2024, 4:47:32 AM7/10/24
to swup...@googlegroups.com
From time to time, there was an install error raised by swupdate-client when
the daemon processed GET_STATUS too quickly after a REQ_INSTALL:

- when starting, inst.last_install=0, inst.status=IDLE
- swupdate-client:
* sends REQ_INSTALL to the daemon
* ... followed by the payload (the image), which is buffered by the OS
* ... and swupdate-client continues its execution whereas the daemon has
not started receiving the payload so far
* sends REQ_STATUS to the daemon to get the progress status
- swupdate daemon:
* receives REQ_INSTALL and sends ACK
* (1) receives GET_STATUS and answers with inst.last_install=0,
inst.status=IDLE
* (2) notifies another thread that sets inst.status=RUN and proceeds with
the installation
- swupdate-client:
* receives GET_STATUS with inst.last_install=0, inst.status=IDLE
* ... and considers this as a failure, and exits with code 1 (error)
- swupdate daemon:
* continues normally and succeeds

The fact that (1) occurs before (2) is the root cause of the issue as it gives
a wrong information to the client.

It happens when the thread "network_thread" re-acquires the mutex before
pthread_cond_wait (in the thread "network_initializer") re-acquires it himself
(the re-acquiring of the mutex is not said atomic in the man page).

This has been seen on some machines with a swu image of 20480 bytes.

Other improvements:
- remove useless stream_wkup
- move locking to before starting the other thread in order not to miss the
first signal
- do not release the mutex between status=IDLE and pthread_cond_wait in order
not to miss a signal

Signed-off-by: Frederic Hoerni <fho...@witekio.com>

---
core/network_thread.c | 4 +++-
core/stream_interface.c | 24 ++++++++++--------------
include/network_interface.h | 1 -
3 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/core/network_thread.c b/core/network_thread.c
index d7b713fb..e033f3ab 100644
--- a/core/network_thread.c
+++ b/core/network_thread.c
@@ -486,8 +486,10 @@ void *network_thread (void *data)
/* Drop all old notification from last run */
cleanum_msg_list();

+ /* Switch to RUN */
+ instp->status = RUN;
+
/* Wake-up the installer */
- stream_wkup = true;
pthread_cond_signal(&stream_cond);
} else {
msg.type = NACK;
diff --git a/core/stream_interface.c b/core/stream_interface.c
index 5f3ad2e3..59449815 100644
--- a/core/stream_interface.c
+++ b/core/stream_interface.c
@@ -67,7 +67,6 @@ static pthread_t network_thread_id;
* reception of an install request
*
*/
-bool stream_wkup = false;
pthread_mutex_t stream_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t stream_cond = PTHREAD_COND_INITIALIZER;

@@ -533,6 +532,9 @@ void *network_initializer(void *data)
inst.status = IDLE;
inst.software = software;

+ /* Lock in order not to miss the first signal (before pthread_cond_wait) */
+ pthread_mutex_lock(&stream_mutex);
+
/* fork off the local dialogs and network service */
network_thread_id = start_thread(network_thread, &inst);

@@ -543,12 +545,7 @@ void *network_initializer(void *data)
ret = 0;

/* wait for someone to issue an install request */
- pthread_mutex_lock(&stream_mutex);
- while (stream_wkup != true) {
- pthread_cond_wait(&stream_cond, &stream_mutex);
- }
- stream_wkup = false;
- inst.status = RUN;
+ pthread_cond_wait(&stream_cond, &stream_mutex);
pthread_mutex_unlock(&stream_mutex);
notify(START, RECOVERY_NO_ERROR, INFOLEVEL, "Software Update started !");
TRACE("Software update started");
@@ -731,13 +728,6 @@ void *network_initializer(void *data)
swupdate_remove_directory(DATADST_DIR_SUFFIX);
#endif

- pthread_mutex_lock(&stream_mutex);
- inst.status = IDLE;
- inst.req.source = SOURCE_UNKNOWN;
- pthread_mutex_unlock(&stream_mutex);
- TRACE("Main thread sleep again !");
- notify(IDLE, RECOVERY_NO_ERROR, INFOLEVEL, "Waiting for requests...");
-
/*
* Last step, if no restart is required,
* SWUpdate can send automatically the feedback.
@@ -768,6 +758,12 @@ void *network_initializer(void *data)
ipc_send_cmd(&msg);
}

+ pthread_mutex_lock(&stream_mutex);
+ inst.status = IDLE;
+ inst.req.source = SOURCE_UNKNOWN;
+ /* Keep stream_mutex locked until pthread_cond_wait */
+ TRACE("Main thread sleep again !");
+ notify(IDLE, RECOVERY_NO_ERROR, INFOLEVEL, "Waiting for requests...");

}

diff --git a/include/network_interface.h b/include/network_interface.h
index 0a60cad4..53e7fa44 100644
--- a/include/network_interface.h
+++ b/include/network_interface.h
@@ -9,6 +9,5 @@
void *network_initializer(void *data);
void *network_thread(void *data);

-extern bool stream_wkup;
extern pthread_mutex_t stream_mutex;
extern pthread_cond_t stream_cond;
--
2.34.1

Stefano Babic

unread,
Jul 17, 2024, 5:51:10 AM7/17/24
to Frederic Hoerni, swup...@googlegroups.com
Hi Frederic,

On 10.07.24 10:28, 'Frederic Hoerni' via swupdate wrote:
> From time to time, there was an install error raised by swupdate-client
> when
> the daemon processed GET_STATUS too quickly after a REQ_INSTALL:
>
> - when starting, inst.last_install=0, inst.status=IDLE
> - swupdate-client:
>     * sends REQ_INSTALL to the daemon
>     * ... followed by the payload (the image), which is buffered by the OS
>     * ... and swupdate-client continues its execution whereas the
> daemon has
>           not started receiving the payload so far
>     * sends REQ_STATUS to the daemon to get the progress status
> - swupdate daemon:
>     * receives REQ_INSTALL and sends ACK
>     * (1) receives GET_STATUS and answers with inst.last_install=0,
>       inst.status=IDLE
>     * (2) notifies another thread that sets inst.status=RUN and
> proceeds with
>       the installation
> - swupdate-client:
>     * receives GET_STATUS with inst.last_install=0, inst.status=IDLE
>     * ... and considers this as a failure, and exits with code 1 (error)
> - swupdate daemon:
>     * continues normally and succeeds
>

Thanks for analyses and yes this can happen when the SWU is very small.

However, your patch is putting the status back like the first
implementation. Later, commit 7e80ad491 introduced the condition to be
checked ( stream_wkup ). According to several documentation about
pthread_cond_wait(), the condition could change before the awakened
thread locks the mutex and returns from pthread_cond_wait(), so the
recommended method is like the one in commit 7e80ad491, that you are
removing.

Apart of this, you are moving the writer of inst.status to another
thread, and this seems dangerous. The network thread has just received
the message to trigger an install, but this is managed by the thread
defined in network_initializer (core/stream_interface.c). This is the
only writer for that structure., and in fact, it is in charge to start /
reject an update.

I would say that the issue is outside this code. The GET_STATUS message
(polling) was the first implementation to retrieve it. Later, I
implemented the "progress" interface, that means an event driven, whose
events are managed by the core, and the race cannot happen. But this
requires to listen for events to get the changes in the status machinery
(IDLE --> RUN --> [SUCCESS | FAILURE].

Best regards,
Stefano Babic

Frederic Hoerni

unread,
Jul 18, 2024, 4:41:38 AM7/18/24
to Stefano Babic, swup...@googlegroups.com
Hi Stefano,
Sure, I agree with that.
>
> Apart of this, you are moving the writer of inst.status to another thread, and this seems dangerous. The network thread has just received the message to trigger an install, but this is managed by the thread defined in network_initializer (core/stream_interface.c). This is the only writer for that structure., and in fact, it is in charge to start / reject  an update.
>
> I would say that the issue is outside this code. The GET_STATUS message (polling) was the first implementation to retrieve it. Later, I implemented the "progress" interface, that means an event driven, whose events are managed by the core, and the race cannot happen. But this requires to listen for events to get the changes in the status machinery (IDLE --> RUN --> [SUCCESS | FAILURE].

If I correctly understand, this means that we should rather patch the client side.
I will figure out how to do this. I guess this should be done somewhere in swupdate_async_start, using progress_ipc_connect and progress_ipc_receive.

Best regards,
Frederic

Stefano Babic

unread,
Jul 18, 2024, 4:45:19 AM7/18/24
to Frederic Hoerni, Stefano Babic, swup...@googlegroups.com
In the IPC code, yes.

> I will figure out how to do this. I guess this should be done somewhere
> in swupdate_async_start, using progress_ipc_connect and
> progress_ipc_receive.

Yes, the race is in this code, swupdate_async_start and
ipc_wait_for_complete.

Best regards,
Stefano

Frederic Hoerni

unread,
Jul 19, 2024, 5:22:07 AM7/19/24
to Stefano Babic, swup...@googlegroups.com
Hi Stefano,

On 18/07/2024 10:45, Stefano Babic wrote:
>
>> I will figure out how to do this. I guess this should be done somewhere in swupdate_async_start, using progress_ipc_connect and progress_ipc_receive.
>
> Yes, the race is in this code, swupdate_async_start and ipc_wait_for_complete.
>


I looked at how we could use progress_ipc_receive() to have the client notified about the result of the installation and I see 2 solutions. I would like to have your point of view, and if you think of other solutions that would be simpler or more consistent.

Solution 1:

In swupdate_async_thread:
- start listening to progress events before sending the image to the server (progress_ipc_connect)
- send the image to the server by chunks, interleaved with progress_ipc_receive_nb (a non-blocking variant of progress_ipc_receive, to be created)
- wait until SUCCESS or FAILURE (progress_ipc_receive)

Example in simplified code:

progressfd = progress_ipc_connect(0 /* no reconnect */);

do {
if (!rq->wr)
break;

rq->wr(&pbuf, &size);
if (size) {
if (swupdate_image_write(pbuf, size) != size) {
perror("swupdate_image_write failed");
swupdate_result = FAILURE;
goto out;
}
}

/* Consume progress events */
int ret = progress_ipc_receive_nb(&progressfd, &msg);
/* ... leave the loop on error ... */

} while(size > 0);

/* Wait until the end of the installation (FAILURE or SUCCESS) */
while (1) {
int ret = progress_ipc_receive(progressfd, &msg);
/* Leave the loop on SUCCESS or FAILURE */
}


Reasons:
- we start listening before sending the image to be sure not to miss the final result (SUCCESS or FAILURE);
- we read events interleaved with swupdate_image_write in order to consume the events and prevent having the socket full, which would lead to errors on the server side and the server closing the progress socket.


Solution 2:

In swupdate_async_start or swupdate_async_thread, create a new thread in charge of listening to progress events and notifying its parent thread when reaching SUCCESS or FAILURE. This solution is more complex as it involves 1 more thread, and more synchronization plumbing between threads.



However, with any solution using progress_ipc_receive, we won't have the messages on stdout as before:

Status: 1 message: Software Update started !
Status: 2 message: [network_initializer] : Software update started
Status: 2 message: [extract_file_to_tmp] : Found file
Status: 2 message: [extract_file_to_tmp] : filename sw-description
Status: 2 message: [extract_file_to_tmp] : size 352
...

The progress_ipc_receive API does not provide these messages, as far as I know.
Would it be acceptable for you to change these messages on the stdout interface? (only swupdate-client is impacted)

Best regards,
Frederic



> Best regards,
> Stefano
>
>>
>> Best regards,
>> Frederic

Frederic Hoerni

unread,
Aug 6, 2024, 9:01:44 AM8/6/24
to swup...@googlegroups.com
Hi,
I will send another patch for addressing this issue on the client side.

Frederic
--
Frederic Hoerni
Witekio
Reply all
Reply to author
Forward
0 new messages