How should reboot/rollback work if upgrade fails?

Don Cross

Aug 7, 2018, 5:15:56 PM
to men...@lists.mender.io
Hi everyone,

I just had an interesting experience where I made a mistake that caused an upgrade to fail. Fortunately, this is still in the early stages of development so it was just on my own test machine. My question isn't about that; instead it is about why the system didn't immediately reboot itself and roll back to the previous artifact. I ended up manually rebooting and then the rollback occurred.

More details:  As mentioned in another thread, I am now installing ca-certificates instead of using /etc/mender/server.crt. But I forgot to change this part of my mender.conf:

"ServerCertificate": "/etc/mender/server.crt"

Oops!  When the upgrade finished downloading and the device booted into it, I got the following errors in my mender.log:

ERRO[0000] /etc/mender/server.crt is inaccessible: open /etc/mender/server.crt: no such file or directory  module=client
ERRO[0000] error initializing mender controller: error creating HTTP client: cannot initialize server trust: open /etc/mender/server.crt: no such file or directory  module=main

And then the machine just sat there.  I thought, OK, I will just manually fix the bad ServerCertificate line in mender.conf, reboot, and see if it starts working. After the reboot I realized I was back in the previous artifact on the other root partition: it had rolled back. So somehow Mender detected that the upgrade had failed, but only after I manually rebooted.

So I'm guessing that my non-Yocto build is not quite right. Whose responsibility is it when Mender gets a fundamental error like this (one that will break its ability to upgrade again) to know to immediately reboot so the rollback can occur without manual intervention?  Maybe my watchdog should have tested for this somehow?  Or is this a bug in Mender itself?  I can reproduce this if that would help diagnose it.
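
A minimal sketch of the kind of pre-deployment check that would catch this mistake before the artifact ships. It assumes mender.conf is JSON with the "ServerCertificate" key quoted above; the function name and the injectable `exists` parameter are hypothetical, added only for illustration and testing, and are not part of Mender.

```python
import json
import os


def missing_server_certificate(conf_text, exists=os.path.exists):
    """Return the configured ServerCertificate path if that file is absent.

    Returns None when the key is not set or the file exists.
    `exists` is injectable so the check can be exercised without real files.
    """
    conf = json.loads(conf_text)
    cert_path = conf.get("ServerCertificate")
    if cert_path and not exists(cert_path):
        return cert_path
    return None


# Example: the stale setting from this message, with the file no longer present.
conf_text = '{"ServerCertificate": "/etc/mender/server.crt"}'
print(missing_server_certificate(conf_text, exists=lambda path: False))
```

Run against the root filesystem image at build time (with `exists` pointed at the image's file tree), a non-None result would be a reason to fail the build rather than discover the problem on the device.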

Thanks,
Don



Vladimir Bashkirtsev

Aug 7, 2018, 8:45:00 PM
to men...@lists.mender.io
I believe the Mender client should reboot the system if it cannot initialize properly for whatever reason, because otherwise you would lose the ability to update and would have no way to replace a bad image should things go wrong. But by the looks of it, your current client does not do so. So the question is: which version of the Mender client do you use?

Vladimir Bashkirtsev

Aug 7, 2018, 8:47:53 PM
to men...@lists.mender.io
The reason it rolled back is simple: Mender did not reach the commit point on boot, so on reboot it simply booted the previous image. If it had committed and then failed, that would be a major bug.

Don Cross

Aug 7, 2018, 9:19:26 PM
to men...@lists.mender.io
This is Mender 1.5.0.  I think the failure is that it didn't reboot itself automatically after entering a non-functional state.  This would require someone to manually fix the device.

Vladimir Bashkirtsev

Aug 7, 2018, 10:28:53 PM
to men...@lists.mender.io
I guess you should create a ticket in Mender's tracker so it will not be forgotten. That's clearly wrong behaviour by the Mender client.

Kristian Amlie

Aug 8, 2018, 2:00:08 AM
to men...@lists.mender.io, Don Cross

How long did you wait?

Mender will use the interval specified in RetryPollIntervalSeconds to determine how long to wait before retrying to contact the server, and will do so at least three times. In the production template for Mender, the interval is 5 minutes, which means that at least 15 minutes will go by before Mender gives up and rolls back.

The reason for this approach is that there are many uncertainties during the boot process and sometimes things just take a long time for a new update (slow NTP is a very common occurrence). Mender must not reboot immediately or it might erroneously roll back a valid update which just happens to be a bit slow.
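
As a rough illustration of the arithmetic above, the minimum wait can be computed from the configuration. This is only a sketch: it assumes mender.conf is JSON containing RetryPollIntervalSeconds, the function name is hypothetical, and the default of three attempts is the lower bound stated above, so the real wait may be longer.

```python
import json


def minimum_wait_seconds(conf_text, attempts=3):
    """Lower bound on the time before giving up: interval times attempts."""
    conf = json.loads(conf_text)
    return conf["RetryPollIntervalSeconds"] * attempts


# A 5-minute interval (300 seconds) and three attempts.
print(minimum_wait_seconds('{"RetryPollIntervalSeconds": 300}') / 60)  # 15.0 minutes
```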

--
Kristian

Don Cross

Aug 8, 2018, 5:25:03 PM
to men...@lists.mender.io
Thanks Kristian,

You were right. When I repeated the experiment (this time by creating a deliberately bad filename for the certificate bundle, /etc/ssl/ca-bundle.bad instead of /etc/ssl/ca-bundle.pem) it did eventually give up, reboot the system, and roll back to the previous deployment like it should. It tried 13 times to report status, at 5 minute intervals, for a total of just over one hour before recovering.  So it works fine if I just know to be patient!  :)

Thanks for your help again.