SWUpdate+EBG: The impossible state and how it's being handled so far

Jan Kiszka

Feb 21, 2023, 2:09:35 AM

to efibootguard-dev, cip-dev, Christian Storm, Quirin Gylstorff, Adler, Michael (CT RDA IOT SES-DE)

Hi again,

playing with updates, I maneuvered the EBG envs on a system into this
weird state:

----------------------------
Config Partition #0 Values:
in_progress: yes
revision: 4
kernel: C:BOOT1:linux.efi
kernelargs:
watchdog timeout: 0 seconds
ustate: 3 (FAILED)

user variables:
recovery_status = failed

----------------------------
Config Partition #1 Values:
in_progress: no
revision: 3
kernel: C:BOOT1:linux.efi
kernelargs:
watchdog timeout: 0 seconds
ustate: 2 (TESTING)

user variables:

To get there, I started an update with swupdate and booted into testing
path #1. But then I didn't confirm this update and instead started it
again, using the same swu. That didn't complete because the UUID clash
was detected. swupdate terminated, and I was left with the above.

I can still boot this constellation, EBG will select path #1 (endless
testing, so to say). OTOH:

# bg_printenv -c
Using latest config partition
Values:
in_progress: yes
revision: 4
kernel: C:BOOT1:linux.efi
kernelargs:
watchdog timeout: 0 seconds
ustate: 3 (FAILED)

user variables:
recovery_status = failed

That is not quite correct. To be fair, bg_printenv deals with an illegal
state here. Still...
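
A minimal sketch of why the two views can disagree. It assumes that a
"latest config" lookup simply takes the highest revision, while the boot
path skips environments that are in_progress or FAILED. Both rules, and
the names Env, latest_by_revision and bootable, are assumptions made for
illustration; EFI Boot Guard's real selection logic may differ.

```python
from dataclasses import dataclass

# ustate codes as printed in the dumps above
TESTING, FAILED = 2, 3


@dataclass
class Env:
    index: int
    revision: int
    in_progress: bool
    ustate: int


def latest_by_revision(envs):
    """Assumed reporting rule: the highest revision wins."""
    return max(envs, key=lambda e: e.revision)


def bootable(envs):
    """Assumed boot rule: ignore in_progress and FAILED environments."""
    usable = [e for e in envs if not e.in_progress and e.ustate != FAILED]
    return max(usable, key=lambda e: e.revision) if usable else None


# The two config partitions from the dump above
envs = [
    Env(index=0, revision=4, in_progress=True, ustate=FAILED),
    Env(index=1, revision=3, in_progress=False, ustate=TESTING),
]

print("reported as latest:", latest_by_revision(envs).index)
print("selected for boot: ", bootable(envs).index)
```

Under these assumptions the reporting rule points at partition #0 and
the boot rule at partition #1, which matches what is observed here.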

The key question is: where is the best place to prevent entering this
state in the first place?

Jan

--
Siemens AG, Technology
Competence Center Embedded Linux

Stefano Babic

Feb 21, 2023, 3:31:27 AM

to Jan Kiszka, efibootguard-dev, cip-dev, Christian Storm, Quirin Gylstorff, Adler, Michael (CT RDA IOT SES-DE)

Hi Jan,

On 21.02.23 08:09, Jan Kiszka wrote:
> Hi again,
>
> playing with updates, I maneuvered the EBG envs on a system into this
> weird state:
>
>
> ----------------------------
> Config Partition #0 Values:
> in_progress: yes
> revision: 4
> kernel: C:BOOT1:linux.efi
> kernelargs:
> watchdog timeout: 0 seconds
> ustate: 3 (FAILED)
>
> user variables:
> recovery_status = failed
>
>
>
> ----------------------------
> Config Partition #1 Values:
> in_progress: no
> revision: 3
> kernel: C:BOOT1:linux.efi
> kernelargs:
> watchdog timeout: 0 seconds
> ustate: 2 (TESTING)
>
> user variables:
>

I see - we should *never* reach this state.

>
> To get there, I started an update with swupdate and booted into testing
> path #1.

Ok

> But then didn't confirm this update and rather started it
> again, using the same swu.

It looks to me like this is the point. SWUpdate requires the transaction
to be closed, either for itself or for the deployment server (Hawkbit).
If a system boots with TESTING, the glue logic should start SWUpdate and
ask it to close the transaction, with OK or FAILED, by passing the -c
parameter.

However, this was designed to work together with the deployment server,
because it handles the state machine on Hawkbit. The parameter is
ignored if another deployment interface (Webserver, USB, ...) is used.
In such situations this is managed (again) by the glue logic, and the
transaction is closed (that is, ustate is set) before starting SWUpdate.
Or, in the case of U-Boot, it is also managed with the help of
additional (and custom) variables.

In your case, it seems that nothing is done at boot time, and SWUpdate
is started. SWUpdate does not know (because it expects that someone has
already decided, and ustate is not checked) that new software is
running, and the same SWU is loaded again.
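
As a rough illustration of the kind of check such glue logic could make
before SWUpdate is started (a sketch only, not code shipped with
SWUpdate): a new installation is refused while the previous transaction
is still open. The ustate values 2 and 3 are taken from the dump above;
the values for OK and INSTALLED, the function names, and the policy
itself are assumptions for the example and would be project-specific.

```python
from enum import IntEnum


class Ustate(IntEnum):
    OK = 0         # assumed value, not shown in the dump
    INSTALLED = 1  # assumed value, not shown in the dump
    TESTING = 2    # as printed by bg_printenv above
    FAILED = 3     # as printed by bg_printenv above


def transaction_is_open(ustate: Ustate) -> bool:
    """An update was installed or is under test, but nobody has decided yet."""
    return ustate in (Ustate.INSTALLED, Ustate.TESTING)


def may_start_install(ustate: Ustate) -> bool:
    """Hypothetical project policy: no new install while a transaction is open."""
    return not transaction_is_open(ustate)


for state in Ustate:
    verdict = "allowed" if may_start_install(state) else "refused"
    print(f"{state.name}: new install {verdict}")
```

With such a guard in place, the second run of the same SWU would have
been refused while ustate was still TESTING.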

> That didn't complete because the UUID clash
> was detected. swupdate terminated, and I was left with the above.
>
> I can still boot this constellation, EBG will select path #1 (endless
> testing, so to say). OTOH:
>
> # bg_printenv -c
> Using latest config partition
> Values:
> in_progress: yes
> revision: 4
> kernel: C:BOOT1:linux.efi
> kernelargs:
> watchdog timeout: 0 seconds
> ustate: 3 (FAILED)
>
> user variables:
> recovery_status = failed
>
>
> That is not quite correct. To be fair, bg_printenv deals with an illegal
> state here.

Agree.

> Still...
>
> The key question is where to avoid best entering this state in the first
> place?

My question is why the transaction was not closed before running
SWUpdate. This is a common pattern even with other bootloaders, but it
is more important here because EBG stores a history (well, with depth=1)
of the previous run.

SWUpdate can check the state when it is running, but there is no general
case. There are use cases where the OK comes from the application, and
SWUpdate waits for the result via IPC (but then SWUpdate is started with
the WAIT option and does not try to load a new SWU). So SWUpdate cannot
decide by itself that TESTING is a wrong ustate, because that depends on
the individual project.

Stefano

--
=====================================================================
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, 82194 Groebenzell, Germany
Phone: +49-8142-66989-53 Fax: +49-8142-66989-80 Email: sba...@denx.de
=====================================================================

Jan Kiszka

Feb 21, 2023, 4:21:38 AM

to Stefano Babic, efibootguard-dev, cip-dev, Christian Storm, Quirin Gylstorff, Adler, Michael (CT RDA IOT SES-DE)

I was running swupdate manually from the command line. No backend
involved, just the desire to intentionally break things. ;)

Stefano Babic

Feb 21, 2023, 4:33:59 AM

to Jan Kiszka, Stefano Babic, efibootguard-dev, cip-dev, Christian Storm, Quirin Gylstorff, Adler, Michael (CT RDA IOT SES-DE)

Hi Jan,

The best way to reach the goal...:-D

And yes, this can happen because the part that decides whether the
previous update was ok is missing. In most projects, if the system is up
and running, it is considered ok. That means the decision is made in
SWUpdate's systemd unit (or SystemV init script), see also the glue
logic under /usr/lib/swupdate. In some other cases, the update is ok
only if the application is running, a migration of a custom database was
ok, and so on... that means the decision is made outside SWUpdate.
SWUpdate supports all these use cases.

To avoid the issue you are seeing, the decision should be made inside
SWUpdate: something like a transition TESTING ==> OK, because SWUpdate
is running. But as I said, this can only be done if it is configurable,
or it will break the use cases I mentioned.

Regards,

Christian Storm

Feb 21, 2023, 4:20:26 PM

to efibootguard-dev, cip-dev

Hi,

> > > > playing with updates, I maneuvered the EBG envs on a system into this
> > > > weird state:
> > > >
> > > >
> > > > ----------------------------
> > > > Config Partition #0 Values:
> > > > in_progress:      yes
> > > > revision:         4
> > > > kernel:           C:BOOT1:linux.efi
> > > > kernelargs:
> > > > watchdog timeout: 0 seconds
> > > > ustate:           3 (FAILED)
> > > >
> > > > user variables:
> > > > recovery_status = failed

Hm, did you start with a clean environment and SWUpdate >= 2022.12?

> > > > ----------------------------
> > > > Config Partition #1 Values:
> > > > in_progress:      no
> > > > revision:         3
> > > > kernel:           C:BOOT1:linux.efi
> > > > kernelargs:
> > > > watchdog timeout: 0 seconds
> > > > ustate:           2 (TESTING)
> > > >
> > > > user variables:
> > > >
> > >
> > > I see - we should *never* reach this state.
> > >
> > > >
> > > > To get there, I started an update with swupdate and booted into testing
> > > > path #1.
> > >
> > > Ok
> > >
> > > > But then didn't confirm this update and rather started it
> > > > again, using the same swu.
> > >
> > > It looks to me that this is the point. SWUpdate requires to close the
> > > transaction, for itself or for the deployment server (Hawkbit). If a
> > > system boots with TESTING, the glue logic should start SWUpdate asking
> > > to close the transaction - with OK or FAILED by passing the -c parameter.
> > >
> > > However, this was thought to work together with the deployment server,
> > > because it handles the state machine on Hawkbit. The parameter is
> > > ignored if another deployment interface (Webserver, USB, ..) is used.

The suricatta modules handle this for you, both as a "convenience"
feature and to keep the (hawkBit, ...) server's view of things consistent
with the device's, which is more important than the convenience aspect :)

If you're running it with other modules/modes, you're on your own.
Then, you have to play along with the (convention) rules to close the
transaction, as there's nothing preventing you from getting into this
situation with EFI Boot Guard.

Hence the valid question: should this be allowed / denied by EFI Boot
Guard, or by the tools (SWUpdate in this case) making use of it?

> > > This is managed (again) on such situation on glue logic, and the
> > > transaction (that is set of ustate) is done before starting SWUpdate. Or
> > > in case of U-Boot, it is also managed with the help of additional (and
> > > custom) variables.
> > >
> > > In your case, it seems that nothing is done at boot time, and SWUpdate
> > > is started. SWUpdate does not know (because it expects that someone has
> > > already decided, and ustate is not checked) that a new software is
> > > running, and the same SWU is loaded again.

Exactly, here you're on your own. You have to instrument EFI Boot Guard
so that it's happy... which is convention and not enforced, currently.
Granted, this requires a lot of context knowledge about how to integrate
things properly and seamlessly...

One common pattern is to have a "health" target, and once that's reached
you start SWUpdate with the appropriate parameters (or set them yourself
via some gluing method). But again, that is convention, not enforced, and
it's currently the responsibility of the system integrator to get right.
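
To make the "health" target idea a bit more concrete, here is a minimal
sketch, assuming the project expresses health as a list of boolean
checks and has some way to write ustate back. The function name
close_transaction, the checks, and the write_ustate callback are
hypothetical; the value 0 for OK is an assumption, while 2 and 3 match
the dump quoted above.

```python
from typing import Callable, Iterable

OK, TESTING, FAILED = 0, 2, 3  # 0 is an assumed value; 2 and 3 match the dump


def close_transaction(
    ustate: int,
    checks: Iterable[Callable[[], bool]],
    write_ustate: Callable[[int], None],
) -> int:
    """Close an open TESTING transaction based on the health checks."""
    if ustate != TESTING:
        # Nothing to close; leave the environment untouched.
        return ustate
    result = OK if all(check() for check in checks) else FAILED
    write_ustate(result)
    return result


# Stand-in for the real environment write: just record what would be written.
written = []
final = close_transaction(TESTING, [lambda: True, lambda: True], written.append)
print("final ustate:", final, "written:", written)
```

Only after this step has run would the integration start SWUpdate for
further updates.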

> > I was running swupdate manually from the command line. No backend
> > involved, just the desire to intentionally break things. ;)
>
> The best way to reach the goal...:-D

If you had used suricatta, you would have missed this :)

> And yes, this can happen because the part deciding if previous update was ok,
> is missing. In most projects, if system is up and running, it is considered
> ok. That means the decision is done in SWUpdate's systemd run unit (or SystemV
> init script), see also glue logic under /usr/lib/swupdate. In some other
> cases, update is ok only if application is running, a migration of a custom
> database was ok, and so on... that means it is decided outside SWUpdate. SWUpdate supports
> all these use cases.

Yes, that's the codified context knowledge. Still, if you miss out on
one thing, the whole integration will crash and burn. And it's quite
easy to miss a thing...

The question is whether there is a generic pattern like the "health"
target I sketched above so that SWUpdate can handle and abstract
the bootloader interactions?

Then, any SWUpdate mode/module will behave the same and it's all in one
place, reducing the need for having all the context knowledge...

> To avoid the issue you are seeing, the decision should be done inside SWUpdate:
> something like a transition TESTING ==> OK, because SWUpdate is running. But
> as I said, this can be done if it will be configurable, or it will break the
> use cases I mentioned.

This is essentially promoting the current suricatta behavior to all
SWUpdate modes/modules w/o the remote reporting part if not run from
a suricatta module. Would be a starter...

Kind regards,
Christian

--
Dr. Christian Storm
Siemens AG, Technology, T CED SES-DE
Otto-Hahn-Ring 6, 81739 München, Germany

Stefano Babic

Feb 22, 2023, 3:36:40 AM

to efibootguard-dev, cip-dev

Hi Christian, Jan,

On 21.02.23 22:21, Christian Storm wrote:
> Hi,
>
>>>>> playing with updates, I maneuvered the EBG envs on a system into this
>>>>> weird state:
>>>>>
>>>>>
>>>>> ----------------------------
>>>>> Config Partition #0 Values:
>>>>> in_progress:      yes
>>>>> revision:         4
>>>>> kernel:           C:BOOT1:linux.efi
>>>>> kernelargs:
>>>>> watchdog timeout: 0 seconds
>>>>> ustate:           3 (FAILED)
>>>>>
>>>>> user variables:
>>>>> recovery_status = failed
>
> Hm, did you start with a clean environment and SWUpdate >= 2022.12?

I think we can reach this state with any SWUpdate version.

Right, this was an initial decision. For the non-suricatta (aka
non-Hawkbit) use case, this is handled outside SWUpdate, often before
running SWUpdate. It is the duty of the integrator to understand this
and add the required glue logic.

I am just asking whether this should be handled completely by
SWUpdate, if configured. The "state" itself is part of SWUpdate's state
machine, too, and it could be moved into core, informing suricatta to
send the correct feedback to the deployment server.

>
> If you're running it with other modules/modes, you're on your own.

Right.

> Then, you have to play along the (convention) rules to close the
> transaction as there's nothing preventing you to get into this
> situation with EFI Boot Guard.

Exactly - the issue arises because the transaction was not closed, and
the glue logic is missing in Jan's use case.

>
> Hence, the valid question whether this should be allowed / denied by EFI
> Boot Guard or the tools (SWUpdate in this case) making use of it?

IMHO EFI Boot Guard should be transparent, with someone else making the
decision. My question here is whether we should add a way to avoid
external glue logic and put it into SWUpdate's core.

>
>
>>>> This is managed (again) on such situation on glue logic, and the
>>>> transaction (that is set of ustate) is done before starting SWUpdate. Or
>>>> in case of U-Boot, it is also managed with the help of additional (and
>>>> custom) variables.
>>>>
>>>> In your case, it seems that nothing is done at boot time, and SWUpdate
>>>> is started. SWUpdate does not know (because it expects that someone has
>>>> already decided, and ustate is not checked) that a new software is
>>>> running, and the same SWU is loaded again.
>
> Exactly, here you're on your own.

Yes, you are on your own !!

Exactly.

>
>
>>> I was running swupdate manually from the command line. No backend
>>> involved, just the desire to intentionally break things. ;)
>>
>> The best way to reach the goal...:-D
>
> If you would have used suricatta, you would have missed this :)
>
>
>> And yes, this can happen because the part deciding if previous update was ok,
>> is missing. In most projects, if system is up and running, it is considered
>> ok. That means the decision is done in SWUpdate's systemd run unit (or SystemV
>> init script), see also glue logic under /usr/lib/swupdate. In some other
>> cases, update is ok only if application is running, a migration of a custom
>> database was ok, and so on... that means it is decided outside SWUpdate. SWUpdate supports
>> all these use cases.
>
> Yes, that's the codified context knowledge. Still, if you miss out on
> one thing, the whole integration will crash and burn. And it's quite
> easy to miss a thing...

There is a balance between the flexibility to cover all use cases and
convenience.

>
> The question is whether there is a generic pattern like the "health"
> target I sketched above so that SWUpdate can handle and abstract
> the bootloader interactions?

Yes - and yes, the generic pattern is:

- the transaction is closed by SWUpdate and not by another process
- SWUpdate evaluates ustate and closes the transaction, regardless of
whether the update was done via suricatta, Webserver, USB, or command line
- related processes like suricatta are informed and do what they
need to do: suricatta sends feedback to Hawkbit.

Nevertheless, the "open" approach must still remain in case a custom
acknowledgment is required. This is also very common, for example if an
operator must acknowledge the update via a GUI.
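
The pattern above, including the "open" variant, could be sketched
roughly as follows. This is an illustration of the idea only, not
existing SWUpdate code: the class UpdateCore, the auto_confirm switch,
and the listener callbacks are hypothetical names, and 0 for OK is an
assumed value (2 and 3 match the dump quoted earlier).

```python
from dataclasses import dataclass, field
from typing import Callable, List

OK, TESTING, FAILED = 0, 2, 3  # 0 is an assumed value; 2 and 3 match the dump


@dataclass
class UpdateCore:
    ustate: int
    auto_confirm: bool = True  # False keeps the "open" approach
    listeners: List[Callable[[int], None]] = field(default_factory=list)

    def startup(self) -> None:
        """Close a pending transaction at startup, if configured to do so."""
        if self.ustate == TESTING and self.auto_confirm:
            self.confirm(OK)

    def confirm(self, result: int) -> None:
        """Close the transaction and inform interested parties."""
        self.ustate = result
        for notify in self.listeners:
            notify(result)


# Closed by the core: a listener (standing in for suricatta) gets the result.
feedback = []
core = UpdateCore(ustate=TESTING, listeners=[feedback.append])
core.startup()
print("auto:", core.ustate, feedback)

# "Open" approach: the core waits until someone (e.g. a GUI) acknowledges.
manual = UpdateCore(ustate=TESTING, auto_confirm=False)
manual.startup()
print("open, before ack:", manual.ustate)
manual.confirm(OK)
print("open, after ack: ", manual.ustate)
```

The same confirm path is used in both cases, so the way the update
arrived (suricatta, Webserver, USB, command line) would not matter.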

>
> Then, any SWUpdate mode/module will behave the same and there's all
> in one place reducing the need for having all the context knowledge...

Correct.

>
>> To avoid the issue you are seeing, the decision should be done inside SWUpdate:
>> something like a transition TESTING ==> OK, because SWUpdate is running. But
>> as I said, this can be done if it will be configurable, or it will break the
>> use cases I mentioned.
>
> This is essentially promoting the current suricatta behavior to all
> SWUpdate modes/modules w/o the remote reporting part if not run from
> a suricatta module. Would be a starter...

Exactly, this is what should be done.

Regards,
Stefano

>
>
> Kind regards,
> Christian
