zoffgard waverleigh hesperia

Glendora Starr

Aug 2, 2024, 8:21:08 PM

to seivikcyapan

This issue shows up as early as the Arch Linux installation; however, no data is ever lost and the filesystem stays consistent. BTW, RTIRQ and threaded IRQs don't seem to influence this issue at all.
Moreover, I got two hard freezes today (blinking Caps Lock LED, i.e. a kernel panic); unfortunately, I was unable to determine the cause.

There are no freezes whatsoever when running Win10 on this machine. The NVMe SSD is new, with only several hours of runtime reported by smartctl. I also updated the BIOS firmware to the newest one from Dell (1.24.4) for this machine. There are no updates to the SSD firmware.

I kindly ask for your help. I've been using Arch on much older Dell machines for ages without any problems. What can I do to solve this issue?
How can I determine the cause of the panics when no dump is displayed (I was running a graphical session with dmesg -w on the console)? The journal just stops at the time of the freeze; there is no info there.
Thanks for any help in advance.

If APST is enabled but no non-zero states appear in the table, the latencies might be too high for any states to be enabled by default. The output of # nvme id-ctrl /dev/nvme[0-9] should show the available non-operational power states of the NVMe controller. If the total latency of any state (enlat + xlat) is greater than 25000 µs (25 ms), you must pass a value at least that high as the parameter default_ps_max_latency_us for the nvme_core kernel module. This should enable APST and make the table in # nvme get-feature show the entries.
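The latency rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the example latencies are hypothetical, the 25000 µs threshold is the one quoted above, and the latency pairs have to be read manually from the nvme id-ctrl output.

```python
from typing import Iterable, Optional, Tuple

# Threshold quoted in the text above (25000 us = 25 ms).
DEFAULT_MAX_LATENCY_US = 25_000


def required_ps_max_latency_us(
    states: Iterable[Tuple[int, int]],
    default_us: int = DEFAULT_MAX_LATENCY_US,
) -> Optional[int]:
    """Return the smallest latency limit that admits every listed state.

    `states` holds (entry latency, exit latency) pairs in microseconds for
    the non-operational power states. Returns None when the default limit
    already covers every state, or when no states are listed at all.
    """
    totals = [enlat + exlat for enlat, exlat in states]
    if not totals:
        return None
    worst = max(totals)
    return worst if worst > default_us else None


# Hypothetical latencies, not taken from any real drive.
example_states = [(1_500, 2_500), (10_000, 40_000)]
needed = required_ps_max_latency_us(example_states)
if needed is not None:
    print(f"nvme_core.default_ps_max_latency_us={needed}")
```

With the made-up values above, the second state totals 50000 µs, so the sketch prints a kernel parameter with that value. An empty list (a controller that defines no power states) yields None.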

Interestingly, there are no power states defined by the controller. I tried disabling APST altogether by passing nvme_core.default_ps_max_latency_us=0 on the kernel command line. I used nvme-cli get-feature to confirm APST was disabled and commenced the test.

As I expected, nothing changed with a 500000us timeout. I managed to catch the missing part of the dump: it was a General Protection Fault. Find the links to the images below (sorry for the poor quality, these are single frames extracted from a 60fps mobile phone video).

I also noticed an interesting behavior: after each crash, power cycling won't wake it up. Only running the onboard diagnostics drive test (which fails because no drive is detected) followed by a reboot will. WTH?

I have seen the same behavior on some GPUs (mostly AMD) that suffer from the reset bug.
You have to shut down the machine and power it back up, not just reboot, because some internal state is not flushed out. If you could boot, I suspect you would actually receive some PCIe header errors.

The errors also appeared on Fedora and Windows, albeit after some uptime, not as immediately as on Arch.
The symptoms were exactly the same: freezes for a few seconds, and either "corrected PCIe errors" (when Intel RST was disabled) or "I/O errors/timeouts/retries" (when RST was enabled) in the logs.

Hello! I'm working on a custom board with two TCAN4550s connected to the same SPI bus of an nRF52832, but I can't read any register. I've tried all the lowest registers (0x0000, 0x0004, 0x0008 and 0x000C), which should work even with crystal problems, but I'm always getting 88000000 as the response. I tried everything from 500kHz to 18MHz bus speed, with CPOL=0 and CPHA=0 (but for troubleshooting I've tried all SPI modes).
I see with my logic analyzer (Saleae Logic Pro 8) that the correct bytes are sent from the nRF52832, but the response is almost always 88000000 (sometimes 00000000). If I disconnect the 12V, I get FFFFFFFF, so I suspect the TCAN is responding to something. Is this an error code? Since the messages are sent in 4 bursts of 8 bits, does the 88000000 response mean bits 31 and 27 are set? What errors does that indicate?

I ported your official driver to Zephyr RTOS, but after failing to read even the lowest registers, I focused on simple write/read commands to figure out if it was hardware or software. But now I'm stuck.

We had a few hardware problems in the beginning. We first assembled the board with a 40MHz resonator (OT201640MJBA4SL), whose pinout does not match the footprint, instead of the crystal we intended to use (ABM11W40.0000MHZ8B1UT).
Resonator pinout:
1 Tri-state, 2 GND, 3 OUTPUT, 4 VDD
Footprint pinout:
1 OSC1, 2 GND, 3 OSC2, 4 GND
I don't think this damaged the TCAN4550, but could it have?
My assembly person replaced the resonator with the proper crystal on his board, and he could see it oscillating fine. We work in different countries, and I don't have the crystals. Instead I removed the resonator and caps, soldered OSC2 to GND and fed OSC1 from my signal generator set to 40MHz, 3.2Vpp, offset 1.6V. No luck. But even with a faulty or missing oscillator, the lowest registers should be accessible anyway?

Next, the standard drive strength of the nRF52832 was causing too slow a rise time on SCK (measured at about 40ns). We increased it, and the rise time now measures about 8ns (within the 10ns spec). Still no improvement. I take it the 10ns rise time spec is measured between 0.3*VIO (max L) and 0.7*VIO (min H)? Anyway, no luck.

From the logic analyzer plot, I see the SPI enable pin is toggling high after the first 4 bytes and not remaining low for the entire 8 byte SPI transaction. The TCAN4550 treats the transition of the enable signal as the end of the SPI transaction, which is why you are not having any success with your communication. You will need to modify the firmware so the pin remains low for the full 8 bytes, or control the enable pin like a GPIO pin if that isn't possible.

There are a couple of reasons the TCAN4550 requires this. The first is that it counts the number of clock cycles to make sure it is an exact multiple of 32 bits (one word of data), and because the R/W opcode, address, and data length information take up 4 bytes, there is a minimum of 64 bits required for a register SPI transaction. It is possible to read/write multiple consecutive registers in a single transaction, allowing for 64 bits, 96 bits, 128 bits, etc., but always a multiple of 32.
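A minimal Python sketch of this framing rule follows. The opcode values 0x41 (read) and 0x61 (write) are the ones quoted later in this reply; the header layout used here (one opcode byte, a 16-bit address sent most significant byte first, one length byte) and all function names are assumptions for illustration and should be checked against the datasheet.

```python
from typing import Sequence

READ_OPCODE = 0x41   # read command, as quoted in this reply
WRITE_OPCODE = 0x61  # write command, as quoted in this reply


def _header(opcode: int, address: int, n_words: int) -> bytes:
    """Build the 4-byte header. The byte order is an assumption."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if not 1 <= n_words <= 0xFF:
        raise ValueError("this sketch only handles 1..255 words")
    return bytes([opcode, (address >> 8) & 0xFF, address & 0xFF, n_words])


def build_read_frame(address: int, n_words: int = 1) -> bytes:
    """Header plus 4 dummy bytes per word to clock the data out."""
    return _header(READ_OPCODE, address, n_words) + bytes(4 * n_words)


def build_write_frame(address: int, words: Sequence[int]) -> bytes:
    """Header plus each 32-bit word, most significant byte first (assumed)."""
    payload = b"".join(w.to_bytes(4, "big") for w in words)
    return _header(WRITE_OPCODE, address, len(words)) + payload


def frame_is_well_formed(frame: bytes) -> bool:
    """At least 64 bits and an exact multiple of 32 bits."""
    return len(frame) >= 8 and len(frame) % 4 == 0


frame = build_read_frame(0x0000)
print(frame.hex(), frame_is_well_formed(frame))
```

The whole frame has to go out under a single chip-select assertion. Sending only the first 4 bytes before releasing the enable pin corresponds to a frame that fails `frame_is_well_formed`.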

The second reason is that the TCAN4550 will return the Global Fault Flag bits (register 0x0820[7:0]) on the MISO signal immediately following the enable signal transitioning low. This is to inform the MCU of any important interrupt or status flags that may be time critical. SPI errors are included in this and can be a way for the MCU to monitor the success or failure of the previous SPI transaction. If the TCAN4550 detects an error such as too few or too many clock pulses, it will assume there was noise or some other error that corrupted the data. If this was a write command (0x61) to the device, the TCAN4550 will discard the data and keep the register's old value. If this was a read command (0x41), the MCU should consider the value returned as potentially corrupted and repeat the transaction. Other SPI errors can come from a mismatch between the Length field and the actual data. For example, if 2 words of data were indicated in the Length field, but the enable pin transitioned high after only a single word (register) was written or read, then the device would treat this as an error. Likewise, if the Length field was 1, and 2 words of data were written, this too would be treated as an error.

As you have observed, the device is returning 0x88000000 indicating that the Global Error (GLOBALERR) and SPI Error (SPIERR) bits are being set due to the enable pin transitioning high in the middle of your transaction.
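As a rough illustration, the status byte in such a response can be decoded like this in Python. The function name is hypothetical, and the assignment of GLOBALERR to bit 7 and SPIERR to bit 3 of the first byte is an assumption consistent with the 0x88 value discussed here; the remaining bits are left unnamed on purpose.

```python
from typing import List

# Assumed bit positions within the first byte clocked out on MISO.
KNOWN_FLAGS = {7: "GLOBALERR", 3: "SPIERR"}


def decode_status(response: int) -> List[str]:
    """List the flags set in the top byte of a 32-bit MISO response."""
    status = (response >> 24) & 0xFF
    return [
        KNOWN_FLAGS.get(bit, f"bit{bit}")
        for bit in range(7, -1, -1)
        if status & (1 << bit)
    ]


print(decode_status(0x88000000))  # ['GLOBALERR', 'SPIERR']
```

Bits 31 and 27 of the 32-bit response are bits 7 and 3 of that first byte, which is why 88000000 decodes to exactly these two flags.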

Thank you so much! I didn't notice it, since I wrote my code to explicitly keep the chip select low until the transaction was complete. Problem was that Zephyr also controlled the chip select and pulled it high after each individual spi_write / spi_read.

From these outputs I think there is some problem with the Btrfs filesystem on the NVMe.
Are these problems serious? Can they compromise the data? Do they indicate that the NVMe is corrupted? Could they be the cause of the kernel not booting?

I also created a new Btrfs partition of the same size on the same disk and ran the same commands to compare the outputs and check whether the same errors appear. No errors are reported there, so perhaps I can rule out NVMe hardware problems.
Could it just be related to the single partition on which Fedora is installed?

Finally, I realized that I occasionally experience boot problems even with the 6.8.10 kernel running: basically, once I boot, the system crashes before GDM appears and I can only interact with the tty shell. I have to reboot in order to get a system with a working GNOME Shell.

If the disk were damaged at the hardware level, shouldn't I experience problems on the other partitions as well? Instead, all other Btrfs partitions are healthy and error-free.
dmesg shows the errors I listed in the initial post.

Instead, the corruption apparently still affects booting even now that scrubbing has solved most of the problems, since in some cases GNOME Shell gets stuck before the GDM login screen appears. In that case I am forced to reboot via a tty.

What I wrote in the reply to your message is the entire output of smartctl -a.
If any details are missing the reason could be that the NVMe is connected to the motherboard via a PCIe adapter. This perhaps does not allow all the SMART data to be detected.

Yes, I ran the command as root.
I did some research on this, and apparently the most likely reason for the reduced output is that the NVMe is connected to the motherboard via a PCIe adapter.

I have formatted the whole NVMe disk and run the various Btrfs commands for checksum checking and I do not encounter any errors.
I have also followed the other advice you suggested and tried to isolate the cause: the NVMe disk is healthy and working properly, the problem was caused by the backup mishandled by Rescuezilla.

Regarding the disk statistics counter: after running dmesg I found the corrupt files and deleted them, and the counter of corruption errors stopped increasing.
I also ran the btrfs commands to repair and rebuild checksums, so the boot problems seem to be gone, but the csum errors remain.