Qubes 3 HVM suspend issue


mkru...@gmail.com

Nov 13, 2015, 6:27:15 AM
to qubes-users
Hi everyone
I installed Qubes 3 from official ISO.
I am trying to solve two issues, both of which lead to the same underlying problem. Here they are:
1. I want to save the state of an HVM and resume it later.
2. I want to be able to close the laptop's lid, which sends it to sleep, and have all VMs working again after I open the lid.

I noticed that when the computer goes to sleep, Qubes calls some scripts to prepare each VM to be suspended (or maybe paused), and then resumed (or woken up) afterwards.

First thing: after doing some research in the DVM creation scripts, I noticed that saving state also involves calling the suspend function for that specific VM, so calling
virsh -c xen:/// save <domain> <filename>
will also call
virsh -c xen:/// dompmsuspend <domain> mem
This comes into play later.

The second thing sounds simple.
PV VMs can handle it if they are not attached to any PCI device.
PV VMs with attached PCI devices usually freeze after resuming from sleep. A workaround for this is here:
https://groups.google.com/forum/#!msg/qubes-users/XtdtD20BiSA/1AJ-zZuX5IkJ
HVM VMs freeze after coming back from sleep, as reported here: https://groups.google.com/forum/#!topic/qubes-users/UI7cKNCjj4w
It doesn't matter whether the HVM runs Windows or another OS; after I open the lid, all their windows are frozen.
To make sure, I also tested with a Windows 7 installation and an Ubuntu installation in HVMs, and the same thing happens.
I also tried changing kernels at boot, and I tried 3 different laptops; all lead to the same problem.

Now, after digging around with my Ubuntu HVM installation, I did the following:

[user@dom0 ~]$ qvm-start ubuntu ; xl list
--> Loading the VM (type = HVM)...
--> Starting Qubes DB...
--> Setting Qubes DB info for the VM...
--> Updating firewall rules...
--> Starting the VM...
--> Starting Qubes GUId (full screen)...
Connecting to VM's GUI agent: .connected
--> Starting the qrexec daemon...
Name          ID   Mem  VCPUs  State   Time(s)
dom0           0  9150      8  r-----    427.2
sys-net        1   291      8  -b----     46.5
sys-firewall   2  1339      8  -b----     27.3
ubuntu         9  1023      1  ------      0.0
ubuntu-dm     10    44      1  -b----      0.0

As we can see, ubuntu actually runs as two domains: the VM itself and the HVM stub domain.
After that we clear the logs and try a suspend with virsh:

[user@dom0 ~]$ sudo bash -c 'echo > /var/log/libvirt/libxl/ubuntu.log'
[user@dom0 ~]$ sudo bash -c 'echo > /var/log/xen/console/guest-ubuntu-dm.log'
[user@dom0 ~]$ virsh -c xen:/// dompmsuspend ubuntu mem
Domain ubuntu successfully suspended
[user@dom0 ~]$ xl list
Name          ID   Mem  VCPUs  State   Time(s)
dom0           0  9150      8  r-----    471.0
sys-net        1   291      8  -b----     49.5
sys-firewall   2  1422      8  -b----     28.8
ubuntu         9  1023      8  ---ss-     11.5
ubuntu-dm     10    44      1  ---sc-     11.9

Now I notice that suspending the ubuntu VM puts both ubuntu and ubuntu-dm into a suspended state, but just before suspending completes on the stubdom, ubuntu-dm crashes and shows the 'sc' flags, which mean SUSPENDED+CRASHED.
This also happens when doing "virsh -c xen:/// save <domain> <filename>", which leads to "dompmsuspend".
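
As a quick aside, this crashed-stubdom condition can be spotted mechanically. A small sketch (assuming the default `xl list` column layout shown above; `check_stubdoms` is a hypothetical helper, run in dom0):

```shell
# Print any stubdomain whose State column contains 'c' (crashed).
# Hedged sketch: assumes the default `xl list` column layout.
check_stubdoms() {
    xl list | awk 'NR>1 && $1 ~ /-dm$/ && $5 ~ /c/ {print $1 ": " $5}'
}
```
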
Now, checking the logs, I found the following:

[user@dom0 ~]$ sudo cat /var/log/libvirt/libxl/ubuntu.log

libxl: debug: libxl.c:776:libxl_domain_suspend: ao 0x7efe9000cd00: create: how=(nil) callback=(nil) poller=0x7efe900049b0
libxl: debug: libxl_dom.c:1074:libxl__domain_suspend_common_callback: issuing PVHVM suspend request via XenBus control node
libxl: debug: libxl_dom.c:1078:libxl__domain_suspend_common_callback: wait for the guest to acknowledge suspend request
libxl: debug: libxl_dom.c:1125:libxl__domain_suspend_common_callback: guest acknowledged suspend request
libxl: debug: libxl_dom.c:1129:libxl__domain_suspend_common_callback: wait for the guest to suspend
libxl: debug: libxl_dom.c:1143:libxl__domain_suspend_common_callback: guest has suspended
libxl: debug: libxl_dom.c:987:libxl__domain_suspend_device_model: Saving device model state to /var/lib/xen/qemu-save.9
libxl: error: libxl_exec.c:227:libxl__xenstore_child_wait_deprecated: Device Model not ready <-----------------could this be a problem?
libxl: debug: libxl_event.c:1600:libxl__ao_complete: ao 0x7efe9000cd00: complete, rc=0
libxl: debug: libxl.c:798:libxl_domain_suspend: ao 0x7efe9000cd00: inprogress: poller=0x7efe900049b0, flags=ic
libxl: debug: libxl_event.c:1572:libxl__ao__destroy: ao 0x7efe9000cd00: destroy
libxl: debug: libxl_event.c:518:watchfd_callback: watch w=0x7efe90008ed0 wpath=@releaseDomain token=3/2d: event epath=@releaseDomain
libxl: debug: libxl.c:1012:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] nentries=1 rc=1 9..9
libxl: debug: libxl.c:1023:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] got=domaininfos[0] got->domain=9
libxl: debug: libxl.c:1050:domain_death_xswatch_callback: exists shutdown_reported=0 dominf.flags=20006
libxl: debug: libxl.c:1062:domain_death_xswatch_callback: shutdown reporting
libxl: debug: libxl.c:1016:domain_death_xswatch_callback: [evg=0] all reported
libxl: debug: libxl.c:1079:domain_death_xswatch_callback: domain death search done
libxl: debug: libxl_event.c:518:watchfd_callback: watch w=0x7efe90008ed0 wpath=@releaseDomain token=3/2d: event epath=@releaseDomain
libxl: debug: libxl.c:1012:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] nentries=1 rc=1 9..9
libxl: debug: libxl.c:1023:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] got=domaininfos[0] got->domain=9
libxl: debug: libxl.c:1050:domain_death_xswatch_callback: exists shutdown_reported=1 dominf.flags=20006
libxl: debug: libxl.c:1016:domain_death_xswatch_callback: [evg=0] all reported
libxl: debug: libxl.c:1079:domain_death_xswatch_callback: domain death search done
libxl: debug: libxl_event.c:1155:egc_run_callbacks: event 0x7efeb4c603b0 callback type=domain_shutdown

[user@dom0 ~]$ sudo cat /var/log/xen/console/guest-ubuntu-dm.log

xs_read_watch() -> /local/domain/0/device-model/9/command dm-command
dm-command: pause and save state
device model saving state
xs_read_watch() -> /local/domain/0/device-model/9/command dm-command
xs_read(/local/domain/0/device-model/9/command): ENOENT
******************* CONSFRONT for device/console/1 **********


Failed to read device/console/1/backend-id.
Page fault at linear address 0x0, rip 0x1048cd, regs 0x5ff568, sp 0x5ff618, our_sp 0x5ff530, code 0
Thread: main
RIP: e030:[<00000000001048cd>]
RSP: e02b:00000000005ff618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000002002c76e40 RCX: 0000000000000001
RDX: 0000002002c03730 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000005ff618 R08: 0000002002c04910 R09: 000000000057a000
R10: 000000000000276d R11: 000000000000276d R12: 0000000000000000
R13: 0000000000000000 R14: 0000002002c03150 R15: 000000000000001c
base is 0x5ff618 caller is 0xead7d
base is 0x5ff678 caller is 0xeb32d
base is 0x5ff8d8 caller is 0xe4c49
base is 0x5ff938 caller is 0xe4d69
base is 0x5ff958 caller is 0x100d94
base is 0x5ff978 caller is 0xfd5ab
base is 0x5ff9b8 caller is 0x7d75e
base is 0x5ff9e8 caller is 0x660e
base is 0x5ffa18 caller is 0x21f9e
base is 0x5ffa68 caller is 0x952d
base is 0x5ffdf8 caller is 0xdfbb7
base is 0x5fffe8 caller is 0x343b

5ff600: 18 f6 5f 00 00 00 00 00 2b e0 00 00 00 00 00 00
5ff610: a0 37 c0 02 20 00 00 00 78 f6 5f 00 00 00 00 00
5ff620: 7d ad 0e 00 00 00 00 00 48 f6 5f 00 00 00 00 00
5ff630: 20 49 c0 02 20 00 00 00 98 f6 5f 00 00 00 00 00

5ff600: 18 f6 5f 00 00 00 00 00 2b e0 00 00 00 00 00 00
5ff610: a0 37 c0 02 20 00 00 00 78 f6 5f 00 00 00 00 00
5ff620: 7d ad 0e 00 00 00 00 00 48 f6 5f 00 00 00 00 00
5ff630: 20 49 c0 02 20 00 00 00 98 f6 5f 00 00 00 00 00

1048b0: 5d 41 5e 41 5f 5d c3 66 0f 1f 84 00 00 00 00 00
1048c0: 55 40 f6 c7 07 48 89 f8 48 89 e5 75 57 48 8b 07
1048d0: 49 b8 ff fe fe fe fe fe fe fe 48 be 80 80 80 80
1048e0: 80 80 80 80 4a 8d 14 00 48 f7 d0 48 21 c2 48 89
Pagetable walk from virt 0, base 57b000:
L4 = 00000000aa995067 (0x57c000) [offset = 0]
L3 = 00000000aa994067 (0x57d000) [offset = 0]
L2 = 00000000aa993067 (0x57e000) [offset = 0]
L1 = 0000000000000000 [offset = 0]

As I see here, the stubdom crashes exactly before going into suspend mode, not after coming back.
So this is the explanation for the freezing after resuming the laptop from sleep: it actually happens before the sleep even occurs.
It is also clear now that saving state leads to the same issue.

Now here is the problem.
I don't know how to trace the stack of the error to any of the code inside the stubdom. Or maybe there is something that could be changed in a Qubes OS script to solve this problem.
I also suspect this could be a memory problem in the DM, or a stack overflow, but I don't know for sure.
Can someone help me, or advise me what to do next to debug the problem?

Marek Marczykowski-Górecki

Nov 13, 2015, 7:41:57 AM
to mkru...@gmail.com, qubes-users
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Fri, Nov 13, 2015 at 03:27:15AM -0800, mkru...@gmail.com wrote:
> Hi everyone
> I installed Qubes 3 from official ISO.
> I am trying to solve two issues which i'm facing which both leads to a single problem here they are:
> 1. I want to save state of a HVM VM and resume it later.
> 2. I want to be able to close the laptop's lid which sends it to Sleep and after i open the lid
> again to have all VM working again.
>
> I noticed that when the computer goes to sleep, Qubes calls some scripts to prepare each VM to be to suspended or maybe paused and then resumed or maybe woken up after that.
>
> First thing. After doing some research in DVM creation scripts noticed that saving state also involves suspend function to be called for that specific vm, so when calling
> virsh -c xen:/// save <domain> <filename>
> will also call

This one is (semi-intentionally) broken for Qubes HVMs, because it
requires qemu in dom0 (in addition to the one in the stubdomain).

> virsh -c xen:/// dompmsuspend <domain> mem

Not sure about this one. I think it could be made to work on HVMs. But
previously it wasn't needed; a simple "pause" ("suspend" in libvirt) was
enough.
Yes, see below.

> libxl: debug: libxl_event.c:1600:libxl__ao_complete: ao 0x7efe9000cd00: complete, rc=0
> libxl: debug: libxl.c:798:libxl_domain_suspend: ao 0x7efe9000cd00: inprogress: poller=0x7efe900049b0, flags=ic
> libxl: debug: libxl_event.c:1572:libxl__ao__destroy: ao 0x7efe9000cd00: destroy
> libxl: debug: libxl_event.c:518:watchfd_callback: watch w=0x7efe90008ed0 wpath=@releaseDomain token=3/2d: event epath=@releaseDomain
> libxl: debug: libxl.c:1012:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] nentries=1 rc=1 9..9
> libxl: debug: libxl.c:1023:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] got=domaininfos[0] got->domain=9
> libxl: debug: libxl.c:1050:domain_death_xswatch_callback: exists shutdown_reported=0 dominf.flags=20006
> libxl: debug: libxl.c:1062:domain_death_xswatch_callback: shutdown reporting
> libxl: debug: libxl.c:1016:domain_death_xswatch_callback: [evg=0] all reported
> libxl: debug: libxl.c:1079:domain_death_xswatch_callback: domain death search done
> libxl: debug: libxl_event.c:518:watchfd_callback: watch w=0x7efe90008ed0 wpath=@releaseDomain token=3/2d: event epath=@releaseDomain
> libxl: debug: libxl.c:1012:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] nentries=1 rc=1 9..9
> libxl: debug: libxl.c:1023:domain_death_xswatch_callback: [evg=0x7efe7c00ddd0:9] got=domaininfos[0] got->domain=9
> libxl: debug: libxl.c:1050:domain_death_xswatch_callback: exists shutdown_reported=1 dominf.flags=20006
> libxl: debug: libxl.c:1016:domain_death_xswatch_callback: [evg=0] all reported
> libxl: debug: libxl.c:1079:domain_death_xswatch_callback: domain death search done
> libxl: debug: libxl_event.c:1155:egc_run_callbacks: event 0x7efeb4c603b0 callback type=domain_shutdown
>
> [user@dom0 ~]$ sudo cat /var/log/xen/console/guest-ubuntu-dm.log
>
> xs_read_watch() -> /local/domain/0/device-model/9/command dm-command
> dm-command: pause and save state
> device model saving state

Is this log from "save" or "dompmsuspend"? I guess the latter
(unfortunately). "dompmsuspend" alone shouldn't require dumping
stubdomain state to a file; that is needed only for "save". But
"dompmsuspend" (without actually saving to a file, or migrating) isn't fully
supported by Xen. And probably we've done something wrong while adding
this feature.

> xs_read_watch() -> /local/domain/0/device-model/9/command dm-command
> xs_read(/local/domain/0/device-model/9/command): ENOENT
> ******************* CONSFRONT for device/console/1 **********
>
>
> Failed to read device/console/1/backend-id.

And this is the thing requiring qemu in dom0, which we don't have. Shame it
crashes that badly, instead of giving some nice error message...

The Qubes suspend scripts pause each VM before going to sleep (virsh
suspend), with an exception for VMs with PCI devices, which are told to go
to sleep for real so that PCI device state is handled properly (first the
qubes.SuspendPre qrexec service, then virsh dompmsuspend).
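
That per-VM dispatch might be sketched roughly like this (not the actual Qubes script; `qvm_has_pci`, the `qrexec_suspend_pre` placeholder, and the `run` indirection are assumptions for illustration):

```shell
# Rough sketch of the suspend dispatch described above (hypothetical, dom0).
run() { "$@"; }              # indirection so the sketch can be dry-run
qvm_has_pci() { return 1; }  # placeholder: does VM $1 have a PCI device?

suspend_all_vms() {
    for vm in $(run qvm-ls --raw-list); do
        if qvm_has_pci "$vm"; then
            # PCI VMs are put to sleep for real: SuspendPre, then dompmsuspend
            run qrexec_suspend_pre "$vm"   # placeholder for qubes.SuspendPre
            run virsh -c xen:/// dompmsuspend "$vm" mem
        else
            run virsh -c xen:/// suspend "$vm"   # plain pause
        fi
    done
}
```
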

Check whether virsh suspend also pauses the stubdomain - it should. If not,
you can do it manually with xl pause. AFAIR the order of pausing the actual
VM and the stubdomain is important: first the stubdomain, then the actual
domain, and resume in the same order. At least that was working in R2 :)
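
That ordering can be written down as a pair of helpers (a sketch; assumes xl in dom0 and the usual <name>-dm stubdomain naming):

```shell
# Pause/unpause an HVM together with its stubdomain, stubdomain first
# in both directions, as described above. Hedged sketch.
pause_hvm() {     # usage: pause_hvm <vmname>
    xl pause "$1-dm"       # stubdomain first
    xl pause "$1"          # then the actual domain
}
unpause_hvm() {   # usage: unpause_hvm <vmname>
    xl unpause "$1-dm"     # same order on resume
    xl unpause "$1"
}
```
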

> Also i suspect this could be a memory problem in DM or stack overflow but i don't know for sure.
> Can someone help me? or can some one advice what to do next to debug the problem?
>


- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCAAGBQJWRdqQAAoJENuP0xzK19csmogH/RF0I/Hv+F7ld9oY1GCGIbZy
6fJuGJGfsdwOj0kzp9Xrwkg3RhjmJCmrPb6U/uO6Rr1BH6zsogzM8Jgpvm0P+6sw
v2J6sSimJsgGuOGHOsoHhZspw/tZI/7T9KrLXLL3cunWqPY1rSnxSrI4ybepo3Ay
OpmZ4J2LyAKNeqUm5FvceuF9DFzWVBGZ2u6i2O+AEMw6QbVUd1YT1u5jY0TuKgdw
laBe0WVXpvgIMTyYIBhQ+NM7I7bkPET7J/LlrYPKR9dAmf8ZJ0Am0HLr437yoVH+
wHqFjwpfTNif7h6rcbnq4oo8Y0dZOZKbQp6SVJn1Gzj6Goim/6Ik9Yj+qutdY9g=
=NFfk
-----END PGP SIGNATURE-----

mkru...@gmail.com

Nov 13, 2015, 9:45:02 AM
to qubes-users, mkru...@gmail.com
> Check if virsh suspend also pauses stubdomain - it should. If not, you
> can do it manually with xl pause. AFAIR the order of pausing actual VM
> and stubdomain is important - first stubdomain, then actual domain, and
> resume in the same order. At least that was working in R2 :)
I checked some things now: qubes does "virsh suspend" and "virsh resume", which are the exact
equivalents of "xl pause" and "xl unpause".
I also checked that virsh suspend does not pause the stubdom, and because of this the stubdom crashes immediately after resume.
A few days ago I had indeed already found that doing "xl pause domain-dm" before sleep and "xl unpause domain-dm" afterwards solves the problem.
I now realize that "dompmsuspend" has nothing to do with the sleep issue, but it still matters for the save/restore of VM state, which I want to try.
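
That workaround can be generalized to every running stubdomain before sleep (a sketch, assuming the `xl list` output format shown earlier; not an official Qubes script):

```shell
# Pause all stubdomains before sleep, unpause them after resume.
stubdoms()         { xl list | awk 'NR>1 && $1 ~ /-dm$/ {print $1}'; }
pause_stubdoms()   { for d in $(stubdoms); do xl pause   "$d"; done; }
unpause_stubdoms() { for d in $(stubdoms); do xl unpause "$d"; done; }
# typical use around sleep:
#   pause_stubdoms; systemctl suspend; unpause_stubdoms
```
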


> > [user@dom0 ~]$ sudo cat /var/log/xen/console/guest-ubuntu-dm.log
> >
> > xs_read_watch() -> /local/domain/0/device-model/9/command dm-command
> > dm-command: pause and save state
> > device model saving state
>
> This log is from "save" or "dompmsuspend"? I guess the later
> (unfortunately). The sole "dompmsuspend" shouldn't require dumping
> stubdomain state to a file. This is needed only in "save". But
> "dompmsuspend" (without actual saving to file, or migrating) isn't fully
> supported by Xen.

The log output above is from "dompmsuspend".


> > virsh -c xen:/// save <domain> <filename>

> This one is (semi-intentionally) broken for Qubes HVMs, because it
> requires qemu in dom0 (in addition to the one in stubdomain).

> > xs_read_watch() -> /local/domain/0/device-model/9/command dm-command


> > xs_read(/local/domain/0/device-model/9/command): ENOENT
> > ******************* CONSFRONT for device/console/1 **********
> >
> >
> > Failed to read device/console/1/backend-id.
>
> And this the thing requiring qemu in dom0, which we don't have. Shame it
> crashes that badly, instead of some nice error message...

I managed to do a successful save, but before that I got an error that
a file called /var/lib/xen/qemu-save.DOMID was missing.
I did "touch /var/lib/xen/qemu-save.DOMID" and then "virsh -c xen:/// save <domain> <filename>", and it worked.
I didn't know at the time that something else was missing too; I only found out about that now.
Anyway, after doing the save procedure and running "xl list", I noticed the stubdom was again crashed and suspended, so I assume "virsh save" also calls "dompmsuspend".
After that I also managed to do a successful restore; running "xl list" showed the domain resumed and running, but the stubdom was crashed, because it had crashed before the save.
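
The save workaround above, as a small sketch (the domain name and target path are examples; `run` is indirection so the sequence can be dry-run):

```shell
# Create the qemu-save file virsh expects, then save the domain.
run() { "$@"; }

save_hvm() {    # usage: save_hvm <vmname> <savefile>
    domid="$(run xl domid "$1")"                     # numeric domain ID
    run sudo touch "/var/lib/xen/qemu-save.$domid"   # file virsh looks for
    run virsh -c xen:/// save "$1" "$2"
}
```
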

So now, back to save/restore.
If I can somehow fix "dompmsuspend", will save/restore work after that, or is the missing qemu part essential?
What exactly are the missing parts of qemu?

Marek Marczykowski-Górecki

Nov 13, 2015, 10:23:31 AM
to mkru...@gmail.com, qubes-users
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Fri, Nov 13, 2015 at 06:45:02AM -0800, mkru...@gmail.com wrote:
> > Check if virsh suspend also pauses stubdomain - it should. If not, you
> > can do it manually with xl pause. AFAIR the order of pausing actual VM
> > and stubdomain is important - first stubdomain, then actual domain, and
> > resume in the same order. At least that was working in R2 :)
> I checked some things now and qubes does "virsh suspend" and "virsh resume" which is the exact
> equivalent of "xl pause" "xl unpause"
> I also checked and virsh suspend does not pauses stubdom and by not doing this stubdom crashes immediately after resume

> Few days ago, indeed, i already found that doing "xl pause domain-dm" before sleep and then after "xl unpause domain-dm" solves the problem.

OK, so this is the thing to fix:
https://github.com/QubesOS/qubes-issues/issues/1417

> I now realize that "dompmsuspend" has nothing to do with the sleep issue but still has something to do with save/restore state of the VM which i want to try

Yes, the domain is "dompmsuspend"-ed before save.

> I managed to do a successful save. but before that i got an error that
> a file called /var/lib/xen/qemu-save.DOMID was missing.
> I did a "touch /var/lib/xen/qemu-save.DOMID" and then "virsh -c xen:/// save <domain> <filename>" it worked.
> I didn't know that there was something else missing at that time. I did found about this now.
> Anyway. After doing the save procedure and after doing "xl list" i noticed again the stubdom is crashed and suspended so i assumed that "virsh save" is also calling "dompmsuspend"
> After that i also managed to do a successfull restore and after running "xl list" i noticed domain was resumed successfully and was running but stubdom was crashed because it crashed before save
>
> So now back to the save/restore.
> If i can somehow fix "dompmsuspend", will save/restore work after that? or that qemu part that is missing is something essential?
> What are the missing parts of qemu exactly?

The missing part is an additional console channel for qemu to dump its
state to some file in dom0 - so that it can be restored later (on virsh
restore), which BTW requires a third console channel...

But xenconsoled supports only one console channel per VM (which, in the
case of a stubdomain, is used for logs). Additional channels require qemu
in dom0 - then all the console channels for such a VM are handled by qemu
(including the first one, for logs).

The patches which "break" this are here:
https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.4/patches.qubes/xen-libxl-qubes-minimal-stubdom.patch
https://github.com/QubesOS/qubes-vmm-xen/blob/xen-4.4/patches.qubes/xen-disable-dom0-qemu.patch

(the first one is harmless, but required for the second one)

If you really want to, and really know what you are doing, you can enable
an additional serial console for a VM[1] (using a manual libvirt config
modification, then qvm-start --custom-config=...), which will bring back
qemu in dom0. But you shouldn't do that.

[1] https://www.qubes-os.org/doc/windows-debugging/ (somewhat outdated,
because it is about R2)

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCAAGBQJWRgBvAAoJENuP0xzK19csO20H/i5+onwAcSTUIqEv2L7Y0XeV
SuNxxSvAxxReHne9gIUUp75nPtfQgYCgnb1wDbfuBhMv8BWqEoULbkZpkLtKmnlw
T1PSND6OTDB765sB6rAtkzQ2+aDj3qBam98joAmnhJ6OhgV0xNPW0H5Pw+TpHVF7
SPBhwNWMjZGDeHdvSqjIuW9pOHjlXCdBZNCAFd+dHnSjHb2wPCyaAN20V+R5F9Ly
AD/1atZGZ5RJRL3gCXIFv8o4sZTsTi2alSo2Fv5Ri/c6C39RDfiQq6SpLbfgip/X
vxBol97YRcT7Wlr1mttn8IXfZzl10UFv+p/IU7H7qYkU5r7YmeSOqgLKjKRByqU=
=7W4L
-----END PGP SIGNATURE-----

mkru...@gmail.com

Nov 13, 2015, 12:52:54 PM
to qubes-users, mkru...@gmail.com
Hi Marek,

So if I understand correctly, there is currently no other way to do this, because xenconsoled doesn't support more than one channel. And I consider qemu in dom0 to add extra risk, so I decided to give up on this.
Anyway, just to have a replacement, I did a test save/restore of a PV VM, and it works very nicely.
