CrapFileTest sometimes hangs in CI/CD


Mark Jonas

Sep 30, 2024, 2:37:54 PM
to swupdate
Hi,

I have a CI/CD in GitLab which runs all swupdate tests once every week.

https://gitlab.com/toertel/docker-image-swupdate-contribute

Since about September 22 I have been seeing the problem that CrapFileTest
might run endlessly. It is not connected to a specific *_defconfig.
It seems to be random whether it hangs or not.

I ran "make V=1 test" to get insight into what might be going wrong.

When it hangs it looks like this:

[DEBUG] : SWUPDATE running : [read_module_settings] : No config
settings found for module versions
[TRACE] : SWUPDATE running : [listener_create] : creating socket at
/tmp/swupdateprog
[TRACE] : SWUPDATE running : [network_initializer] : Main loop daemon
[TRACE] : SWUPDATE running : [listener_create] : creating socket at
/tmp/sockinstctrl
[DEBUG] : SWUPDATE running : [read_module_settings] : No config
settings found for module download
[TRACE] : SWUPDATE running : [start_swupdate_subprocess] : Started
chunks_downloader with pid 19491 and fd 8
[TRACE] : SWUPDATE running : [network_thread] : Incoming network
request: processing...
[INFO ] : SWUPDATE started : Software Update started !
[TRACE] : SWUPDATE running : [network_initializer] : Software update started
[TRACE] : SWUPDATE running : [start_delta_downloader] : Starting
Internal process for downloading chunks
[ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : get_cpiohdr : 52 :
CPIO Format not recognized: magic not found
[ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : extract_cpio_header
: 732 : CPIO Header corrupted, cannot be parsed
[ERROR] : SWUPDATE failed [1] Image invalid or corrupted. Not installing ...
[TRACE] : SWUPDATE running : [network_initializer] : Main thread sleep again !
[INFO ] : No SWUPDATE running : Waiting for requests...

When it does not hang, it finishes the test nicely, like this:

[ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : get_cpiohdr : 52 :
CPIO Format not recognized: magic not found
[ERROR] : SWUPDATE failed [0] ERROR cpio_utils.c : extract_cpio_header
: 732 : CPIO Header corrupted, cannot be parsed
[ERROR] : SWUPDATE failed [1] Image invalid or corrupted. Not installing ...
[TRACE] : SWUPDATE running : [network_initializer] : Main thread sleep again !
[INFO ] : No SWUPDATE running : Waiting for requests...
[ERROR] : SWUPDATE failed [0] ERROR install_from_file.c : endupdate :
55 : SWUpdate *failed* !
[TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/swupdateprog
[TRACE] : SWUPDATE running : [unlink_sockets] : unlink socket /tmp/sockinstctrl

I cannot reproduce the problem when running the tests locally on my
PC. And I cannot imagine what could go wrong on the GitLab build
machine. It worked there nicely for years.

Has somebody seen something like that before?

Have there been recent changes in swupdate which could explain that?

Cheers
Mark

Frederic Hoerni

Oct 1, 2024, 2:56:43 AM
to swupdate
Hi,

On 30/09/2024 20:37, Mark Jonas wrote:
I managed to reproduce the issue after ~40 iterations.
I think it appeared after commit e6b2081bae166c09bc542229088ea302eb4f0899, which fixed a race condition between the client and the daemon by using the progress API.

But it looks like that patch was not enough and introduced the current issue: the client misses the progress event about the failure and waits forever.

To fix that, I am working on a patch where the client subscribes to the progress events even before starting the transfer of the image to be installed.

Frederic

> Cheers
> Mark
>

Mark Jonas

Oct 3, 2024, 10:25:25 AM
to Frederic Hoerni, swupdate
Hi Frederic,

> > I have a CI/CD in GitLab which runs all swupdate tests once every week.
> >
> > https://gitlab.com/toertel/docker-image-swupdate-contribute
> >
> > Since about September 22 I see the problem that the CrapFileTest might
> > run endlessly. There is not a specific *_defconfig connected to this.
> > It seems to be random whether it will hang or not.

> I managed to reproduce the issue, after ~40 iterations.

Meanwhile, I was also able to reproduce it on my local computer.

> I think it appeared after commit e6b2081bae166c09bc542229088ea302eb4f0899, that fixed a race condition between the client and the daemon by using the progress API.
>
> But it looks like this patch was not enough and introduces the current issue: the client misses the progress event about the failure and waits forever.
>
> To fix that, I am working on a patch where the client subscribes to the progress events even before starting the transfer of the image to be installed.

Great that you are looking into it! :)

Cheers,
Mark

Frederic Hoerni

Oct 3, 2024, 3:25:48 PM
to swupdate
Hi,
It looks like the progress API may not be reliable enough for that case, as the client cannot know when the daemon is actually ready to send progress events (the end of the subscription procedure is not atomic and not notified to the client). Therefore the client cannot be sure it will not miss any event.

In order to fix this race condition, I can imagine different technical solutions:

1. Modify the progress API so that the daemon notifies the client when the subscription is effective (for example, the daemon sends an ACK to the client after it has inserted the socket fd into progress.conns). This impacts existing clients using the progress API, as they will receive an ACK that they do not expect, but we can minimize the impact by having progress_ipc_connect() handle this ACK.

2. Revert to the old way of polling the daemon with GET_STATUS, and on the daemon side, set instp->status = START on REQ_INSTALL (that would be a simplified version of a previous patch that I submitted a few months ago which was not merged - https://groups.google.com/g/swupdate/c/T8SEuwjNxxU/m/D3ZqgKHMAQAJ).

What do you think, guys? Any other suggestions?

Frederic

Stefano Babic

Oct 5, 2024, 5:30:21 AM
to Frederic Hoerni, swupdate
Hi Frederic,

On 03.10.24 21:25, 'Frederic Hoerni' via swupdate wrote:
> Hi,
>
> On 03/10/2024 16:25, Mark Jonas wrote:
>>
>> Hi Frederic,
>>
>>>> I have a CI/CD in GitLab which runs all swupdate tests once every week.
>>>>
>>>> https://gitlab.com/toertel/docker-image-swupdate-contribute
>>>>
>>>> Since about September 22 I see the problem that the CrapFileTest might
>>>> run endlessly. There is not a specific *_defconfig connected to this.
>>>> It seems to be random whether it will hang or not.
>>
>>> I managed to reproduce the issue, after ~40 iterations.
>>
>> Meanwhile, I was also able to reproduce it on my local computer.
>>
>>> I think it appeared after commit
>>> e6b2081bae166c09bc542229088ea302eb4f0899, that fixed a race condition
>>> between the client and the daemon by using the progress API.
>>>
>>> But it looks like this patch was not enough and introduces the
>>> current issue: the client misses the progress event about the failure
>>> and waits forever.
>>>
>>> To fix that, I am working on a patch where the client subscribes to
>>> the progress events even before starting the transfer of the image to
>>> be installed.
>>
>> Great that you are looking into it! :)
>
> It looks like the progress API may not be reliable enough for that case,
> as the client cannot know when the daemon is actually ready to send
> progress events (the end of the subscription procedure is not atomic and
> not notified to the client). Therefore the client cannot be sure it will
> not miss any event.
>
> In order to fix this race condition, I can imagine different technical
> solutions:
>
> 1. Modify the progress API so that the daemon notifies the client when
> the subscription is effective. (for example the daemon sends an ACK to
> the client after it inserted the socket fd in progress.conns). This
> impacts existing clients using the progress API, as they will receive
> an ACK that they do not expect, but we can minimize the impact by
> having progress_ipc_connect() handle this ACK.

This seems to me a clean solution - it will remove any possible race. If
the ACK is managed by progress_ipc_connect(), I do not see any
disadvantages for the client.

>
> 2. Revert back to the old way of polling the daemon with GET_STATUS, and
> on the daemon side, set instp->status = START on REQ_INSTALL (that would
> be a simplified version of a previous patch that I submitted a few
> months ago and was not merged - https://groups.google.com/g/swupdate/c/
> T8SEuwjNxxU/m/D3ZqgKHMAQAJ).
>
> What do you think guys? Any other suggestion?
>

Best regards,
Stefano