Suspend issues on X1 Carbon gen 5th

Marek Marczykowski-Górecki

Jan 29, 2018, 1:53:05 PM
to Simon Gaiser, qubes-devel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

[moving to qubes-devel]

On Mon, Jan 29, 2018 at 01:55:20AM +0100, Marek Marczykowski-Górecki wrote:
> Interesting, sometimes it works on my machine, but indeed sometimes
> it doesn't. In particular, it worked just after installation, and after
> one (warm) reboot. But after a cold reboot it mostly doesn't (I get
> one successful suspend, then about 10 failures).
>
> It appears that downgrading kernel for sys-net and sys-usb helps:
>
> sudo qubes-dom0-update --action=downgrade 'kernel-qubes-vm-4.9*'
> # set default kernel back to 4.14.13-1 - required for PVH - most of
> # VMs
> qubes-prefs default-kernel 4.14.13-1
> # then set sys-net and sys-usb to 4.9
> qvm-prefs sys-net kernel 4.9.56-21
> qvm-prefs sys-usb kernel 4.9.56-21
>
> This applies only on X1 Carbon. On T460p (one generation older than X1)
> it works just fine with 4.14 in VM.

Some more info:

VM suspend fails for any HVM running 4.14.13-1 kernel, not only those
with PCI devices (it just happens that other VMs are PVH by default,
where suspend works just fine). It's easy to test even without host
suspend:

virsh -c xen:/// dompmsuspend VMNAME mem

When it works, it finishes immediately. When it fails, the above waits
for the 60s timeout and then fails with:

error: Domain VMNAME could not be suspended
error: internal error: Failed to suspend domain '...'
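
A minimal sketch for scripting this check, not part of the original
report: the try_suspend name and the VIRSH override are hypothetical,
added only so the helper can be dry-run without a Xen host. It reports
whether a single dompmsuspend attempt succeeded:

```shell
# Hypothetical helper (not from the original report).
# VIRSH can be overridden for a dry run; it defaults to the real virsh.
VIRSH="${VIRSH:-virsh}"

# try_suspend VMNAME: print "ok" if the suspend request succeeds,
# "failed" if virsh returns an error (e.g. after the 60s timeout).
try_suspend() {
    if $VIRSH -c xen:/// dompmsuspend "$1" mem >/dev/null 2>&1; then
        echo ok
    else
        echo failed
    fi
}
```

Running "try_suspend VMNAME" in dom0 prints either ok or failed.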

And in /var/log/libvirt/libxl/libxl-driver.log (loglevel debug):

libxl: libxl_dom_suspend.c:206:domain_suspend_callback_common: issuing PVH/HVM suspend request via XenBus control node
libxl: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0x7aa824004680 wpath=/local/domain/13/control/shutdown token=15/61: register slotnum=15
libxl: libxl.c:982:libxl_domain_suspend: ao 0x7aa8240127e0: inprogress: poller=0x7aa824009030, flags=i
libxl: libxl_event.c:573:watchfd_callback: watch w=0x7aa824004680 wpath=/local/domain/13/control/shutdown token=15/61: event epath=/local/domain/13/control/shutdown
libxl: libxl_event.c:673:libxl__ev_xswatch_deregister: watch w=0x7aa824004680 wpath=/local/domain/13/control/shutdown token=15/61: deregister slotnum=15
libxl: libxl_dom_suspend.c:288:domain_suspend_common_pvcontrol_suspending: guest acknowledged suspend request
libxl: libxl_dom_suspend.c:307:domain_suspend_common_wait_guest: wait for the guest to suspend
libxl: libxl_event.c:636:libxl__ev_xswatch_register: watch w=0x7aa824004698 wpath=@releaseDomain token=15/62: register slotnum=15
libxl: libxl_event.c:548:watchfd_callback: watch w=0x7aa824004698 epath=/local/domain/13/control/shutdown token=15/61: counter != 62
libxl: libxl_event.c:573:watchfd_callback: watch w=0x7aa824004698 wpath=@releaseDomain token=15/62: event epath=@releaseDomain

(...60s...)

libxl: libxl_dom_suspend.c:380:suspend_common_wait_guest_timeout: guest did not suspend, timed out
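
To see how often this timeout was hit, the log can be searched for the
message above. A minimal sketch, assuming debug logging is enabled as
noted; the count_suspend_timeouts name is hypothetical:

```shell
# Hypothetical helper (not from the original report).
# count_suspend_timeouts LOGFILE: print how many times libxl logged
# a guest suspend timeout in the given file.
count_suspend_timeouts() {
    grep -c 'suspend_common_wait_guest_timeout: guest did not suspend' "$1"
}
```

Usage: count_suspend_timeouts /var/log/libvirt/libxl/libxl-driver.log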

It fails only the first time after system startup, or suspend/resume.
Later, if you kill the VM and start it again, it works most of the time
(but not always). If for some VM it works once, it will work next time
too, until VM shutdown or host suspend.

If the VM is running 4.9.56 kernel, the problem doesn't happen.

I've also tried disabling PTI on the 4.14.13 kernel in the VM, but it
doesn't appear to change anything.

It is just my observation; it may be totally independent of those actions
- - for example some race condition, being affected by some data being
cached or not...

Since I've seen similar reports from other users as well, I'll try to
convince our anaconda to install both kernel packages on 4.0rc4, to make
the workaround easier to apply. Nonetheless, it would be better to
properly fix the issue.

Simon, do you have any idea? If not, it's probably worth asking on xen-devel.

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEhrpukzGPukRmQqkK24/THMrX1ywFAlpvbR4ACgkQ24/THMrX
1ywVqgf+Nk7m5gEo1iA24z9tDl0phrcz8xqBRGSnHjNr+MOfIXy31/AQGqgob0Ct
mjqKQl1gTRHIVuj1hRFCGrmtl1h2Z6sCr6CTmxfg4q2QBHWJDwQQ19QXsWnPIBzM
a8KUmrOvTK1iRVuLGjLfC2DJzdm6Mfn+B1p2YBDFoqUEHzMEuf0nzqC3awY0tCLn
RZk5F8QOy05msG9ElkvOhgON2kmrwEbZNF/txOlY2IotZaz/t/JfR1V2xzNH0Ccs
AghdU+Xk8K4FN8YMH3eCQrMV6QDLHiHCxBl+UuquHjfOYFMMh1mkQVHPKzkIioMP
MSXk1lMCK5TPXW3rUS5PBTr4bk8MMw==
=94FB
-----END PGP SIGNATURE-----

Simon Gaiser

Jan 29, 2018, 7:45:03 PM
to Marek Marczykowski-Górecki, qubes-devel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Marek Marczykowski-Górecki:
> [moving to qubes-devel]
>
> On Mon, Jan 29, 2018 at 01:55:20AM +0100, Marek Marczykowski-Górecki wrote:
>> Interesting, sometimes it works on my machine, but indeed sometimes
>> it doesn't. In particular, it worked just after installation, and after
>> one (warm) reboot. But after a cold reboot it mostly doesn't (I get
>> one successful suspend, then about 10 failures).
>
>> It appears that downgrading kernel for sys-net and sys-usb helps:
>
>> sudo qubes-dom0-update --action=downgrade 'kernel-qubes-vm-4.9*'
>> # set default kernel back to 4.14.13-1 - required for PVH - most of
>> # VMs
>> qubes-prefs default-kernel 4.14.13-1
>> # then set sys-net and sys-usb to 4.9
>> qvm-prefs sys-net kernel 4.9.56-21
>> qvm-prefs sys-usb kernel 4.9.56-21
>
>> This applies only on X1 Carbon. On T460p (one generation older than X1)
>> it works just fine with 4.14 in VM.
>
> Some more info:
>
> VM suspend fails for any HVM running 4.14.13-1 kernel, not only those
> with PCI devices (it just happens that other VMs are PVH by default,
> where suspend works just fine).

Oh, that's very useful info. I was looking for a pass-through issue.
After poking around, some more observations:

- I can observe this on a different machine as well (i5-3340M). So most
likely not directly X1 Carbon related.

- If suspend does not fail the first time after a VM start, it seems to
never fail (at least it worked ~200 times for me). After restarting the
VM it can happen again.

- It also happens with qemu in dom0. So more likely an upstream bug.

- I can confirm Linux 4.9.56 inside the VM seems to solve (or hide?)
the problem.

WIP:

- Test Linux 4.15

- Test vanilla Xen
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEE3E8ezGzG3N1CTQ//kO9xfO/xly8FAlpvwAsACgkQkO9xfO/x
ly+ayg//Zd3k7wgfofZMNv6rS80QS7GBZe86xj/Z/4fDiS4cPY13zvzJORfIjoAk
pSbMZuZQoAZyYZxBgGzqyUB0l0D9N0f9mY4mRMOYvf2r9OrpJUfPupo0TAMueFIK
aIvphPhFa/Z6T43Qk6qUD3T+pAQJTiV4y2mD4FUS4eunYTzSYwg7kaeFNFmoWukb
Hd7lzfb5HGPAQhntVSnV+/RLQteYqkupWblwd9zgx5x6wo1UQrI9HkBixKkM25ei
eta+znLxbg3YQUYkLjBUeIns/389G2FTGOr+nNVWQQxWP+WKdjYw5cZVkNGOGvp6
kE/mTQYrzBs6qvIPNOvDt4s6McLhhbFSz2/67ZNxw4wnfXvHlenlJbNELuIKx8ao
SOeOzh3JgpJXPgjVn27wNUa9Gc0FGuahM3MqLKxzwN5eHNaRte/1nywBrtDAYVyw
NK1WH55NJccEiOZknRgskgEZFA/YyMMXKXXuiMpxo92Yl5ExYBdkZsaAj9+SnJVJ
yfvEsHJDciyxXNaYXQGMkgPzZI3QHe04rrcYjJG2QQp1VDh7S/7WWoBiKRPnIr+e
0cawwaQRbG/6bF3M19GvF3Py1dfQD0ZT8oMQfUhBHkcq2H0f0dKAEB6uxGeBgSRK
ni86fMCuMYn7cw/6Wm9QjT/oXUff2un0yyNdjDye3mPL7SVoXIU=
=Guye
-----END PGP SIGNATURE-----

Simon Gaiser

Jan 30, 2018, 4:10:19 AM
to Marek Marczykowski-Górecki, qubes-devel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Simon Gaiser:
The culprit seems to be commit fd8aa9095a95 "xen: optimize xenbus
driver for multiple concurrent xenstore accesses".

https://github.com/HW42/qubes-linux-kernel/tree/hw42/hvm-suspend
probably works around the issue, until somebody understands what exactly
is broken with the "optimized" version.
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEE3E8ezGzG3N1CTQ//kO9xfO/xly8FAlpwNnwACgkQkO9xfO/x
ly/qwQ/+PF3qooS44Sn8xdBj+s84jJujAO2uzj4dP1s1nJTfNnXXZLDqoEpnVYHw
vtSn1aO6Ps1Uo3DsT5sULAalVATKRG3VUunhVgarsiMIMPVdZZFB035idZJlGlCj
ORBUCQXqmgmx6VkKrX5N5iptr8pcGDokVKT4tgGCAaJN72L4N8jYfS0y5Q8pTdTS
U5qfV4j/oA8JgbrmvebU8cnnYQqEmPYnDuFcu5DxfAo81Ba8kiSC9alvirMAheR2
KK9+XV2IGVZ3V9tdn7lxY73HNLHkA7P97tJfpLpPl2vWHFd46D+A51oFQI8E1NY3
jHtnNehNqd8uB9UlTMHLbq7GJb+Vm9+4xB8aj+L0J2sE2TndGeddi/2hjDcashqK
m1+8GVHeY6hqljI3Lv0shMs7w7QDioC3fYT/nZcefc5ff/PtQDi8oELAhJWKXPFA
d8qfmZ04zACuZ1Etb7mOWqor96DAgBkqs6aBP0Cx4ntBFmUzullK10tNpwAFCSHM
t5ONoSStrTQHU+uBJ08IneQvVkmwQWoWb+hSJmc5sWfVofAMHTxwzr+Plh5+G8GX
WnGcDW/p3dun9O20Lh9GaFXMIrVPCmDeZYFMN8eKVnk4oywXW347HLGXUTUXq8UZ
fXUgwSD8vuMlXXAP+soe7XK3ZU6QgukK6rQRlX+kdQyYq5s/Emg=
=t15x
-----END PGP SIGNATURE-----