It works pretty well, definitely better than 4.0, but there are some weird boot issues. If I let it boot with everything as default, it will boot loop before reaching the disk password screen. I found I can get it to boot successfully if I add to the Xen commandline
noreboot=1 loglvl=all
and remove from the linux commandline
rhgb quiet rd.qubes.hide_all_usb
Still working on narrowing down which of those is/are responsible for fixing the problem (I can't figure out why any of them would).
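In case it helps anyone reproduce this, here is a rough sketch of the "remove from the linux commandline" half as a string operation. `strip_args` is a hypothetical helper and the sample command line is made up; where the real command line lives (xen.cfg, GRUB config, or the boot menu editor) depends on the install and is not covered here.

```shell
#!/bin/sh
# Remove the named options from a kernel command line string.
# strip_args is a hypothetical helper, not part of Qubes.
strip_args() (
    set -f                      # no glob expansion while word-splitting
    cmdline=$1; shift
    out=
    for word in $cmdline; do
        keep=1
        for drop in "$@"; do
            [ "$word" = "$drop" ] && keep=0
        done
        [ "$keep" -eq 1 ] && out="${out:+$out }$word"
    done
    printf '%s\n' "$out"
)

# Made-up example command line; a real one will differ.
strip_args "root=/dev/mapper/example ro rhgb quiet rd.qubes.hide_all_usb" \
    rhgb quiet rd.qubes.hide_all_usb
```

Removing the options one at a time (one `strip_args` call per reboot) is the obvious way to find out which of them actually matters.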
Improvements since 4.0:
Screen power management works - brightness controls, and screen poweroff after inactivity (in 4.0
it would just blank but not power off)
Audio works, which it did not in 4.0 even after many days of troubleshooting
amdgpu works correctly - doesn't freeze when booting without nomodeset
Multimedia keys - not sure if they worked in 4.0 or not
Still working:
UEFI mode
wifi
touchpad
keyboard
Still NOT working:
Suspend/resume
Not tested (yet):
Legacy mode
HDMI audio & video
USB Qube
SD card reader
Microphone
Webcam
Wired networking
I'll try to do some more testing and update this thread when I have a chance. Just putting this out
there for now.
> This is R4.1 build 20191013
>
> It works pretty well, definitely better than 4.0, but there are some weird boot issues. If I let it
> boot with everything as default, it will boot loop before reaching the disk password screen. I
> found I can get it to boot successfully if I add to the Xen commandline
> noreboot=1 loglvl=all
> and remove from the linux commandline
> rhgb quiet rd.qubes.hide_all_usb
>
> Still working on narrowing down which of those is/are responsible for fixing the problem (I can't
> figure out why any of them would).
Looks like rd.qubes.hide_all_usb is what's causing it to crash. When I remove it, it boots fine with the graphical splash and passphrase prompt. Another AMD Ryzen user mentioned having the same problem a while back. It seems to have something to do with AMD's IOMMU grouping of USB controllers.
I'm planning on installing kernel-latest and I'll test it again when I do.
December 19, 2019 12:13 AM, "Claudia" <clau...@disroot.org> wrote:
> This is R4.1 build 20191013
>
> It works pretty well, definitely better than 4.0, but there are some weird boot issues. If I let it
> boot with everything as default, it will boot loop before reaching the disk password screen. I
> ...Looks like rd.qubes.hide_all_usb is what's causing it to crash. When I remove it, it boots fine with the graphical splash and passphrase prompt. Another AMD Ryzen user mentioned having the same problem a while back. Something about AMD's IOMMU grouping of USB controllers, or something.
Thanks for the info. Yes, that sounds correct from what I could tell, too. More specifically, what rd.qubes.hide_all_usb actually does is look for USB controllers among the devices in /sys/bus/pci, write each one's address to its driver/unbind file, and then write it to pciback's new_slot and bind files. So basically, it forces the Linux USB driver (xhci_pci, in my case) to detach from the USB controller, and then attaches the controller to Xen's pciback driver so that it can be used by sys-usb later (although I don't know if this step is actually necessary, since starting sys-usb without hide_all_usb works just fine). I'm assuming this step happens before udev trigger, which probes devices including USB devices. Maybe that's why the hook assigns pciback instead of just unbinding the USB driver: so udev doesn't see that the device has no driver and attempt to bind the USB driver to it again. Or maybe the act of binding pciback is what actually places the USB controller under IOMMU isolation by Xen (otherwise the USB controller could still perform a DMA attack even with no driver bound to it).
I don't know which step -- unbinding xhci_pci, or binding pciback -- actually causes the crash in this case. I can say, however, that one of them does cause an immediate crash, before sys-usb ever starts or has a chance to take over the USB controllers.
The same hook does the same thing for networking devices as well, so that those are never exposed. In my case this doesn't cause a problem because both network cards are their own devices on their own busses and have their own IOMMU groups, unlike my USB controllers.
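To see which driver currently owns each of these devices, something like the following sketch may help. It only reads sysfs; `list_pci_class` and the `SYSFS_PCI` override are hypothetical names introduced here, and the class prefixes (0c03 for USB controllers, 02 for network controllers) are the same ones the hook script matches on.

```shell
#!/bin/sh
# List PCI devices whose class code starts with the given hex prefix,
# together with the driver currently bound to each one.
# SYSFS_PCI can be overridden to point at a test directory.
list_pci_class() (
    prefix=$1
    root=${SYSFS_PCI:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        [ -r "$dev/class" ] || continue
        class=$(cat "$dev/class")            # e.g. 0x0c0330
        case $class in
            0x"$prefix"*) ;;
            *) continue ;;
        esac
        if [ -e "$dev/driver" ]; then
            driver=$(basename "$(readlink "$dev/driver")")
        else
            driver="(none)"
        fi
        printf '%s %s %s\n' "${dev##*/}" "$class" "$driver"
    done
)

list_pci_class 0c03   # USB controllers
list_pci_class 02     # network controllers
```

Run from dom0 with and without rd.qubes.hide_all_usb, the driver column should show whether the controllers ended up on pciback or on the regular USB driver.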
Here's the actual code from /usr/lib/dracut/modules.d/99qubes-pciback/qubes-pciback.sh
#!/usr/bin/sh
type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh
# Find all networking devices currently installed...
HIDE_PCI="`lspci -mm -n | grep '^[^ ]* "02'|awk '{print $1}'`"
# ... and optionally all USB controllers...
if getargbool 0 rd.qubes.hide_all_usb; then
    HIDE_PCI="$HIDE_PCI `lspci -mm -n | grep '^[^ ]* "0c03'|awk '{print $1}'`"
fi
HIDE_PCI="$HIDE_PCI `getarg rd.qubes.hide_pci | tr ',' ' '`"
modprobe xen-pciback 2>/dev/null || :
# ... and hide them so that Dom0 doesn't load drivers for them
for dev in $HIDE_PCI; do
    BDF=0000:$dev
    if [ -e /sys/bus/pci/devices/$BDF/driver ]; then
        echo -n $BDF > /sys/bus/pci/devices/$BDF/driver/unbind
    fi
    echo -n $BDF > /sys/bus/pci/drivers/pciback/new_slot
    echo -n $BDF > /sys/bus/pci/drivers/pciback/bind
done
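The class matching in that script can be tried out on sample input without touching any hardware. A small sketch: `select_class` is a hypothetical wrapper around the same grep/awk pipeline, and the sample lines are simplified (real `lspci -mm -n` output carries more fields) with a made-up network device added.

```shell
#!/bin/sh
# Print the slot (first field) of every device whose class code starts with
# the given prefix, reading `lspci -mm -n` style lines on stdin.
select_class() {
    grep "^[^ ]* \"$1" | awk '{print $1}'
}

# Illustrative input: the two USB controllers identified later in this
# thread, plus a made-up network device.
sample='03:00.3 "0c03" "1022" "15e0"
03:00.4 "0c03" "1022" "15e1"
01:00.0 "0280" "ffff" "0001"'

printf '%s\n' "$sample" | select_class 02     # prints 01:00.0
printf '%s\n' "$sample" | select_class 0c03   # prints 03:00.3 and 03:00.4
```

The prefix "02" covers the whole network controller class, which is why both wired and wireless cards get hidden, while "0c03" is specifically the USB controller subclass.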
As for USB controllers being on the CPU, yes that's what I found as well; all current CPUs bundle their integrated USB controllers right on the chip. I don't think code in the Linux kernel expects the USB controllers to be available. Rather I think it has to do with IOMMU grouping, which tends to be structured differently for AMD than Intel. I'm going to start a new thread about that here soon.
> Hello,
>
> Dec 22, 2019 15:17:27 Claudia :
>
>
>> I don't know which step -- unbinding xhci_pci, or binding pciback -- actually causes the crash in
>> this case. I can say, however, that one of them does cause an immediate crash, before sys-usb ever
>> starts or has a chance to take over the USB controllers.
>
> That (and the specific script) is an interesting finding. Maybe it would be possible to run the
> commands one-by-one to see which one crashes the system.
>
I think I might try that sometime, just out of curiosity.
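For anyone who wants to try the one-by-one approach, a possible sketch follows. `pciback_steps` and the `RUN` and `PAUSE` variables are hypothetical names; the three writes mirror the hook script quoted above. By default it only prints what it would do. With RUN=1 it performs the writes, so expect the same crash and only run it from a session you can afford to lose.

```shell
#!/bin/sh
# Print (or, with RUN=1, perform) the three sysfs writes for one device,
# syncing and pausing before each so the last message seen before a crash
# identifies the failing step.
pciback_steps() (
    bdf=0000:$1
    sys=/sys/bus/pci
    step() {
        echo "step: write $bdf to $1"
        if [ "${RUN:-0}" = 1 ]; then
            [ -e "$1" ] || { echo "  (skipped: $1 does not exist)"; return 0; }
            sync
            sleep "${PAUSE:-5}"
            printf '%s' "$bdf" > "$1"
        fi
    }
    step "$sys/devices/$bdf/driver/unbind"
    step "$sys/drivers/pciback/new_slot"
    step "$sys/drivers/pciback/bind"
)

pciback_steps 03:00.3   # dry run: prints the three steps only
```

The pause gives the console time to show each message, so the step printed last before the reboot is the one to suspect.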
> BTW, could you confirm that the USB Controller is an AMD one? This could be helpful for anyone
> wanting a Qubes laptop with AMD.
>
Yes, as far as I can tell.
03:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1 [1022:15e0]
03:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raven USB 3.1 [1022:15e1]
How recent? Is it present in Xen 4.8-fc25 (R4.0)? Xen 4.12-fc29 (R4.1)?
> For many AMD systems (eg. Trinity/Richland) CPUID changes after suspend (some of the high bits),
> resulting in Xen Panic (see xen/arch/x86/acpi/power.c). So, more investigation would be needed to
> check why the CPUID bits are changing after resume and whether it had any security implications or
> not.
> For the time being - if you accept the possible security implications - you can disable that check
> eg. by commenting the panic line out after "recheck_cpu_features" in the above mentioned power.c
> file, compile xen for dom0 via qubes builder and test it in your system.
Thanks for the info.
I'm not sure this is the problem, though, because I get the same symptoms when suspending in a Fedora 25 livecd. Which makes me think it's a Fedora problem not a Xen problem, at least for R4.0. In Fedora 29 I think the symptoms were slightly different, the system was responsive but the screen just didn't power back on after resume. I don't think suspend/resume actually worked correctly until Fedora 30. We should have an F31-based R4.1 developer release by the end of the month, which would be a more accurate test.
What are the symptoms of a Xen panic? Would it prevent the screen from powering back on? Would it reboot after five seconds? Or would it just hang?
I'll try booting qubes R4.1 on bare metal without Xen and try suspend/resume. If it works I'll post cpuinfo before and after.
Awesome, in time for Christmas even! Downloading it now. Looks like it failed a few tests, so I don't know if it'll be usable enough to really test suspend/resume on it, but we'll see. Not sure if I'll get a chance to install it today but I'll follow up when I do. Thanks for the link, Brendan.
When you say it remains blank, do you mean the screen is totally powered off, or do you mean the backlight comes on but it just displays a black screen?
Going from memory here, but I *think* in F29 (without Xen) the backlight would come on, but the screen was just blank, and I could make the caps lock light come on and hear sounds from the OS, and ctrl-alt-delete would cause it to reboot after 60 seconds as expected. Possibly a graphics driver problem.
In Qubes R4.1 (F29-based) with Xen, when I resume I can hear the fans come on, but that's it. The backlight remains powered off, the caps lock light won't come on, sound doesn't resume playing, and I have to hold the power button to force reboot. Sounds to me like it could be a Xen panic, although I believe this is the same as what happened in F25, if memory serves.
Also, I don't know what a debug card is, but my BIOS has an option called "USB Debugging" which is enabled. Do you know anything about that, or how to make use of it? I'm not looking to get into any serial/UART type stuff, but USB might be an option, depending on what it does, what you need to have, and how difficult it is.
Installed the new F31-based 4.1. Near the end of the installation, it said
The following error occurred while installing the boot loader. This system will not be bootable. Would you like to continue?
failed to write boot loader configuration
I don't know why that happened, but sure enough grub.cfg was missing. I guess the installer keeps logs in /tmp, but I didn't know that at the time. Rebooted into recovery, chrooted, and ran grub2-mkconfig.
I also noticed the UEFI boot menu situation changed. It added two entries, "QubesOS" and "Fedora", and installed some other EFI file in the default path \EFI\BOOT\. The former two pointed to the shim binary (under different paths); not sure about the default one. All three of them caused an instant reboot. Weird, but I didn't really investigate. I ended up just using my old "Qubes" boot entry (for grubx64.efi) after generating grub.cfg.
After installation it worked fine during my hour or two of casual usage. When I was using it, it seemed noticeably faster than previous versions. So you're right it probably just had to do with the new Xen under openqa. Suspend/resume still doesn't work though.
> Suspend/resume problem is most likely caused by a recently added security feature in Xen, that
> checks CPUID after resume with the previously (at boot time) known CPUID. This is to ensure, that
> the CPU microcode level - along with the resulting Spectre/Meltdown etc. mitigations - still
> persist after system resume and there are no features missing.
>
> For many AMD systems (eg. Trinity/Richland) CPUID changes after suspend (some of the high bits),
> resulting in Xen Panic (see xen/arch/x86/acpi/power.c). So, more investigation would be needed to
> check why the CPUID bits are changing after resume and whether it had any security implications or
> not.
> For the time being - if you accept the possible security implications - you can disable that check
> eg. by commenting the panic line out after "recheck_cpu_features" in the above mentioned power.c
> file, compile xen for dom0 via qubes builder and test it in your system.
So I installed the new F31-based R4.1 with Xen 4.13. Suspend/resume still isn't working; same
symptoms as before. A few corrections: I dug up some old threads and found that suspend/resume
actually did work correctly in F29, and on F25 the screen would power on, but just remain blank. So
in fact I never got the same symptoms on Qubes as I did with Fedora 25. This means it very likely
could be a Xen panic.
Something new: I booted R4.1 on bare metal without Xen, and it resumes fine. It probably would
under R4.0 without Xen, too, but I haven't tried yet. So apparently it's not a version issue.
While booted without Xen, I checked /proc/cpuinfo before and after resume and they were the same
except for clock rates. The output is significantly different under Xen than bare metal, but the
microcode version is the same. In Xen, I obviously can't compare before- and after-resume outputs.
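The before/after comparison described here can be made repeatable with a small filter that drops the clock-rate lines, so that diff only reports real differences. A sketch, with `stable_cpuinfo` as a hypothetical name; keep in mind /proc/cpuinfo shows the kernel's view of the CPU, not necessarily the raw CPUID words that Xen compares.

```shell
#!/bin/sh
# Print /proc/cpuinfo (or the file given as $1) without the "cpu MHz" lines,
# which change from one read to the next.
stable_cpuinfo() {
    grep -v '^cpu MHz' "${1:-/proc/cpuinfo}"
}

# Intended use (file names are arbitrary):
#   stable_cpuinfo > /tmp/cpuinfo.before
#   ... suspend, then resume ...
#   stable_cpuinfo > /tmp/cpuinfo.after
#   diff /tmp/cpuinfo.before /tmp/cpuinfo.after && echo "no change"

# Quick look at the microcode field, where the file exists.
if [ -r /proc/cpuinfo ]; then
    stable_cpuinfo | grep '^microcode' | sort -u
fi
```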
Not sure what to do. I'm really not looking forward to patching Xen.
> I use a Corebooted system, where the latest AMD microcode is compiled into
> the BIOS statically. And yes, I use a newer version of the AMD Fam15h
> microcode, than the version in the Linux Firmware package.
> This change in some of the CPUID bits after resume could be a result of
> xen/kernel trying to load the published microcode, and then fails because
> the BIOS version is newer.
> However, the /proc/cpuinfo reported microcode version always stays the same
> - the BIOS version. (..assuming the /proc/cpuinfo output is updated on any
> microcode upgrade attempts..)
>
> As noted, I have a "special" use case, so testing the recommended change in
> power.c for Claudia's newer AMD system could show, that this CPUID change
> issue after resume is "special" for my case or "general" for some AMD users.
That kind of makes sense. With your patch applied, can you see the CPUID bits in /proc/cpuinfo change after resume, or is the output the same as before?
Like I said, in my case nothing in /proc/cpuinfo changes before and after resume without Xen (although it could be different under Xen).
> For many AMD systems (eg. Trinity/Richland) CPUID changes after suspend (some of the high bits),
> resulting in Xen Panic (see xen/arch/x86/acpi/power.c). So, more investigation would be needed to
> check why the CPUID bits are changing after resume and whether it had any security implications or
> not.
> For the time being - if you accept the possible security implications - you can disable that check
> eg. by commenting the panic line out after "recheck_cpu_features" in the above mentioned power.c
> file, compile xen for dom0 via qubes builder and test it in your system.
I decided to give this a try, but I don't really know how to use the build system. I did `make vmm-xen`, modified the file chroot-dom0-fc29/home/user/rpmbuild/BUILD/xen-4.12.1/xen/arch/x86/acpi/power.c, but it appears after running `make vmm-xen` again my changes have been reverted. After it finishes the line is no longer commented out. Do I have to commit the change, or generate a patch file, or something like that?
Answering your earlier question, my CPU capability information bits change like this after suspend:
(XEN) Entering ACPI S3 state.
(XEN) AMD-Vi: Applying erratum 746 workaround for IOMMU at 0000:00:00.2
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) CPU0: cap[ 1] is 3e98320b (expected b698320b)
(XEN) Missing previously available feature(s).
(XEN) Enabling non-boot CPUs ...
Without the patch this results in a Xen panic.
PS: my patch looks like this (it will show the CPUID capability bits changing in the hypervisor log)
diff -ruN a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
--- a/xen/arch/x86/acpi/power.c 2019-12-15 18:26:11.183000000 +0100
+++ b/xen/arch/x86/acpi/power.c 2019-12-15 18:23:15.439000000 +0100
@@ -257,7 +257,7 @@
     microcode_resume_cpu(0);
     if ( !recheck_cpu_features(0) )
-        panic("Missing previously available feature(s).");
+        printk(XENLOG_ERR "Missing previously available feature(s).\n");
     /* Re-enabled default NMI/#MC use of MSR_SPEC_CTRL. */
     ci->spec_ctrl_flags |= (default_spec_ctrl_flags & SCF_ist_wrmsr);
Brendan
Thanks for the patch and the instructions. The qubes-builder documentation is outdated and sorely
lacking (it doesn't even mention ./setup!). I applied the patch to marek's 4.1 repo but I couldn't
get it to produce an fc31 package. It kept building for fc29, which I don't currently have installed.
Then I built it for fc25 4.0 stable, but the patch wouldn't apply cleanly, so I just modified the
existing patch-x86-check-feature-flags-after-resume.patch to print an error instead of panicking, and
changed the message slightly.
patch-x86-check-feature-flags-after-resume.patch
diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index 3d26d4be31..e8fb3f6f31 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -255,6 +255,9 @@ static int enter_state(u32 state)
     microcode_resume_cpu(0);
+    if ( !recheck_cpu_features(0) )
+        printk(XENLOG_ERR "Missing previously available feature(s). Ignoring.\n");
+
     /* Re-enabled default NMI/#MC use of MSR_SPEC_CTRL. */
     ci->spec_ctrl_flags |= (default_spec_ctrl_flags & SCF_ist_wrmsr);
     spec_ctrl_exit_idle(ci);
Installed the seven packages already present in dom0. In case anyone is wondering those are:
xen-libs-4.8.5-14custom.fc25.x86_64.rpm
xen-4.8.5-14custom.fc25.x86_64.rpm
xen-hypervisor-4.8.5-14custom.fc25.x86_64.rpm
xen-runtime-4.8.5-14custom.fc25.x86_64.rpm
python3-xen-4.8.5-14custom.fc25.x86_64.rpm
xen-licenses-4.8.5-14custom.fc25.x86_64.rpm
xen-hvm-4.8.5-14custom.fc25.x86_64.rpm
Note that 4.8.5-14 -> 4.8.5-14custom shows up as a downgrade.
Ran `strings -a /boot/efi/EFI/qubes/xen.efi | grep Ignoring` to check for my unique message, just to be sure.
Rebooted. Checked xl info. Looks good. (Yes, it actually truncated the last character of the
version, apparently. Odd.)
xen_major : 4
xen_minor : 8
xen_extra : .5-14custom.fc2
xen_version : 4.8.5-14custom.fc2
cc_compile_date : Wed Jan 1 01:11:51 UTC 2020
Hit suspend from the XFCE menu. Waited 30 seconds or so. Crossed my fingers and resumed.
And... SUCCESS!
xl dmesg
(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Entering ACPI S3 state.
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) CPU0: cap[ 1] is 7ed8320b (expected f6d8320b)
(XEN) Missing previously available feature(s). Ignoring.
(XEN) Enabling non-boot CPUs ...
Thank you for your help! It appears your machine is not a special case. Exact same result for both of us. Bit 27 flips on and bit 31 flips off (xor of 0x88000000). No idea what those mean, though.
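The XOR can be checked with plain shell arithmetic. A minimal sketch (`changed_bits` is a hypothetical helper), fed with the actual and expected values from the two hypervisor logs in this thread:

```shell
#!/bin/sh
# Print the XOR of two 32-bit hex words and the bit positions that differ.
changed_bits() (
    x=$(( 0x$1 ^ 0x$2 ))
    printf 'xor=0x%08x bits:' "$x"
    bit=31
    while [ "$bit" -ge 0 ]; do
        [ $(( (x >> bit) & 1 )) -eq 1 ] && printf ' %d' "$bit"
        bit=$(( bit - 1 ))
    done
    printf '\n'
)

changed_bits 7ed8320b f6d8320b   # xor=0x88000000 bits: 31 27
changed_bits 3e98320b b698320b   # xor=0x88000000 bits: 31 27
```

Both machines differ from their boot-time value in exactly the same two bits, even though the rest of the word is different on each.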
However, I still have a long road ahead of me. I did several suspend/resume cycles, and each time I had a different combination of problems, including the mouse sticking, the keyboard not working, and finally input/output errors and segmentation faults in the terminal. But the Xen problem has been identified nonetheless. I'll try kernel-latest and see if that changes anything.
Thanks again.
BTW, have you reported this to upstream or do you have any plans to?
> However, I still have a long road ahead of me. I did several suspend/resume cycles, and each time I
> had a different combination of problems, including the mouse sticking, the keyboard not working,
> and finally input/output errors and segmentation faults in the terminal. But the Xen problem has
> been identified nonetheless. I'll try kernel-latest and see if that changes anything.
Installed kernel-latest from stable, 5.3.11-1.qubes.x86, and no difference as far as I can tell. It usually resumes fine the first time, but after the second or third cycle, I get a bunch of I/O errors, as though someone unplugged the SATA connector. I think this is actually the underlying cause of the other symptoms. This is with no VMs running. No swap.
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: limiting SATA link speed to 3.0 Gbps
ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 3c 9f [...]
blk_update_request: I/O error, dev sda, sector [...] op 0x1: (WRITE) flags 0x100000 phys_seg 1 prio class 0
BTRFS error (device dm-0): bdev /dev/mapper/luks-[...] errs: wr 1, rd 0, flush 0, corrupt 0, gen0
Note this is different from the Fedora 25 resume behavior. In F25 with 4.8.6, the screen doesn't power on, but the system seems responsive otherwise. For example ctrl-alt-delete reboots after 60 seconds as expected. (In Qubes, after resuming a second or third time and getting disk errors, when you try to shut down it will just hang indefinitely.) But F25 was running from a USB drive, so I wouldn't necessarily know if there were SATA errors in that case.
I'll see if I can figure out how to apply the patch to the latest 4.1 (F31-based) and try it from there. In the mean time, if anyone has any ideas please share.
January 1, 2020 5:09 PM, "Claudia" <clau...@disroot.org> wrote:
> I'll see if I can figure out how to apply the patch to the latest 4.1 (F31-based) and try it from there. In the mean time, if anyone has any ideas please share.
Since it appears the old made-for-purpose USB 2.0 EHCI Debug port dongles are impossible to find these days, I've been looking around for alternatives and stumbled upon use of the raspberry pi zero w/ USB Gadget drivers to log chromebook coreboot debug data. Pretty sure (but not 100%) the same could be done for Xen debug data:
https://johnlewis.ie/pi-zero-w-flashrom-and-usb-gadget-debug/
https://johnlewis.ie/wp-content/uploads/2017/04/ehcidebug.gif
raspberrypi/linux#1907
https://gist.github.com/gbaman/50b6cca61dd1c3f88f41
So...now I have a pi zero on the way.
Brendan
January 3, 2020 7:17 PM, brend...@gmail.com wrote:
> Since it appears the old made-for-purpose USB 2.0 EHCI Debug port dongles are impossible to find
> these days, I've been looking around for alternatives and stumbled upon use of the raspberry pi
> zero w/ USB Gadget drivers to log chromebook coreboot debug data. Pretty sure (but not 100%) the
> same could be done for Xen debug data:
...
> So...now I have a pi zero on the way.
Funny you should mention that. I happened to have a Pi Zero W lying around, and I almost did go that route. However when I started looking into USB 2.0 EHCI debug (thanks to user Qubes123 for the tip), it looked pretty complicated and somewhat unreliable, so I decided to try some simpler techniques first. Also my USB controllers don't list the debug capability so I don't think it would work on this machine.
Luckily Qubes123's patch worked, or at least fixed the Xen panic, so I don't think I have a need for USB debugging at the moment.
However it is something I'd like to learn more about in case I need it in the future. Please let me know how you make out!
Also something you might be interested in is USB 3.0 XHCI Debug Capability, or DbC, which is built into the USB 3.0 spec. It's a host-to-host protocol so it doesn't require any OTG/gadget hardware, just two devices that support USB 3.0 Enhanced Superspeed, a USB 3.0 Enhanced Superspeed cable, and the target device (USB controller) must support XHCI Debug Capability (DbC). https://www.kernel.org/doc/html/v4.17/driver-api/usb/usb3-debug-port.html
The machine I was trying to debug does have a USB 3.1 controller, but it doesn't list either the XHCI or the EHCI debug capability, even when USB debug is enabled in firmware. Just because there's a setting for it in firmware doesn't necessarily mean the hardware supports it, I suppose.
And... SUCCESS 2.0!
Perhaps it's still too early to celebrate, but after six months of troubleshooting I think I might finally have working suspend/resume. I did some googling around, and eventually came across a rather inconspicuous post[1] from 2013 in the Xen archives that mentioned something I hadn't tried or heard about before. All I had to do was add to the Xen command line "dom0_max_vcpus=1 dom0_vcpus_pin". And that's it. Couldn't have been simpler. I should not have had to go to the 20th page of search results to find out about this.
This runs dom0 on CPU0 and only CPU0. My understanding is that it has to be running on the boot CPU at the exact moment of suspend and resume. Or something like that. Not sure of the specifics. Note that this may have a performance impact depending on your situation.
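A sketch of adding the two options without duplicating them if they are already there. `add_args` is a hypothetical helper and the starting command line is made up; where the Xen options are stored (for example the options= line in xen.cfg, or GRUB's GRUB_CMDLINE_XEN_DEFAULT) depends on how the system boots, which is an assumption to check on your own install.

```shell
#!/bin/sh
# Append each option to a command line string unless it is already present.
add_args() (
    cmdline=$1; shift
    for opt in "$@"; do
        case " $cmdline " in
            *" $opt "*) ;;                              # already there
            *) cmdline="${cmdline:+$cmdline }$opt" ;;
        esac
    done
    printf '%s\n' "$cmdline"
)

# Made-up starting point; a real Xen command line will differ.
add_args "loglvl=all dom0_mem=min:1024M" dom0_max_vcpus=1 dom0_vcpus_pin
```

After rebooting, `xl vcpu-list Domain-0` should show a single dom0 vCPU if the options took effect.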
At first, I thought maybe this would render the Xen patch unnecessary, i.e. that it was suspending on one core and resuming on another, causing an apparent change in cpuid bits. But I can see from the log the cpuid capability bits are still changing as before. (For those of you just tuning in, the patch and instructions are earlier in this thread. However, you probably won't need it unless you have an AMD Fam15h processor. Note that there may be security implications associated with this patch.)
I've only had a chance to test about 15-20 cycles or so, but it works great so far. Suspends fast, resumes fast, lid-switch triggers both suspend and resume, WiFi automatically reconnects. I suspended in the middle of a YouTube video and it came back up seamlessly. However, after resume all instances of Firefox seem to jump to 100% CPU (but not frozen) until I close them, but that appears to be a known issue outside of Qubes and Xen also.
Tested on R4.0 stable with kernel-latest 5.3.11-1.qubes.x86 on Xen 4.8.5-14.fc25 (patched). I haven't tried this yet on the default kernel but I think it would probably work just as well. It also very well might work on other Qubes/Xen versions. I'll update my HCL accordingly when I have a chance.
For me, the boot and install mostly just worked out of the box. I never experienced the installer drop to shell or "X failed to start" or anything like that. I did have the installer screen freeze sometimes on one version, I think 4.0.2-rc2, but I was able to get past it and never really investigated the cause. In my case, it was the post-installation stuff that took some real troubleshooting. So I don't have much to offer beyond the generic troubleshooting tips.
I looked at your thread, but it doesn't appear to have /tmp/X.log, please post that if you can. You're at least making it to the console, so that's good. I would definitely try booting with nomodeset if you haven't already. It can fix a wide variety of different X-related problems.
Also please mention what Qubes ISO versions and kernel versions you've tried. You may want to try an R4.1 pre-release build. Look for the link Brendan posted earlier in this thread. You may also want to try installing Qubes on a different machine, upgrading to kernel-latest, and then moving the disk or USB drive to the target machine.
For what it's worth, the "[Firmware bug]" and "ACPI Error" lines are quite common, if not universal, on Ryzen systems. However they don't seem to be related to any specific problems in practice, so I wouldn't worry too much about those. The "Failed to load kernel modules" error seems to be common in Qubes and even other OSes, regardless of hardware, so I wouldn't worry about that either. I doubt any of those are directly related to the X failure you're experiencing.
I can also say, sys-usb causes all manner of problems on my machine and for some other Ryzen users as well. So when you do finally make it to that point, I definitely would not recommend enabling that option until you have everything else working.
> Thank you for your answer.
>
> What do you mean by “nomodeset” ? - is it regarding legacy and UEFI mode or... ?
In 4.0, to enable nomodeset you have to edit the bootloader files in the installation media. I just realized, since you're using DVDs instead of USB, this is going to be a lot more difficult. You'll have to unpack the ISO, modify the boot loader file, and then repack the ISO and burn it. I would recommend using a USB drive in this case if you can. That way you can make the modifications directly on the USB drive, and you don't have to waste additional DVDs.
In R4.1, you just have to press 'e' at the boot menu, and you can make last minute changes to the boot parameters without modifying anything. This would probably be the easiest option.
nomodeset is a kernel command line option that disables kernel-modesetting and prevents graphics drivers from being loaded, so they just use a basic minimal driver essentially. In 4.0 this would be the "kernel=" line of the xen.cfg file.
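As a rough illustration of the xen.cfg edit, the filter below appends nomodeset to any kernel= line that doesn't already have it. `add_nomodeset` is a hypothetical helper and the fragment it is run on is made up; try it on a copy of the file first and compare the result before replacing anything on the installation media.

```shell
#!/bin/sh
# Append "nomodeset" to every kernel= line read on stdin, unless it is
# already present as a separate word. Other lines pass through unchanged.
add_nomodeset() {
    awk '/^kernel=/ && $0 !~ /(^| )nomodeset( |$)/ { $0 = $0 " nomodeset" }
         { print }'
}

# Made-up xen.cfg fragment for illustration.
printf '%s\n' '[qubes]' 'options=loglvl=all' \
    'kernel=vmlinuz root=/dev/example quiet' | add_nomodeset
```

Usage would be along the lines of `add_nomodeset < xen.cfg > xen.cfg.new`, then reviewing the new file by hand.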
> As I only have tried with Qubes OS stable version 4.0.1 and 4.0.2 and is now going to try 4.0.3 the
> kernel version is 4.19. How can I try to install Qubes with a newer kernel.
I'm not sure if there's any easy way to install a newer kernel into the installer. The way most people do it is to do the installation on a different machine, install kernel-latest, and then move the drive to the other machine. However 4.0.3 should come with a newer LTS kernel at least, so try that first.
When the installer fails, copy or screenshot /tmp/X.log and post it.
> Could an idea be to try to install Linux Mint or Fedora 31 if 4.0.3 doesn’t work either ? - just to
> make sure they work and rule basic things out.
R4.0 is based on Fedora 25, so you could try booting that to rule out a basic Fedora problem. However, there's still a big difference between Qubes and Fedora 25, so it won't tell us very much.