RAID card error after upgrading to 4.x.x

Oskar

Jun 15, 2022, 2:55:04 PM
to esos-users
Hi!

I've been running 3.0.12 for a while now, and while doing some maintenance I saw that version 4.0.7 was out. I installed it on the server using the in-place upgrade, but the script reported that some signature didn't match, so it had to do a clean install instead.

At first, everything seemed to work correctly, but after about 3-4 minutes I got an error through the iLO (it is an HP server, an HP ProLiant DL380e Gen 8 to be exact) saying that the storage controller had reported a critical failure. I initially thought it was just an unlucky coincidence that it broke down now. But I had already started updating another server, and I got the same error when I rebooted that one. The chance of both RAID cards failing at almost exactly the same moment is so small that it is very unlikely to be a coincidence.

That made me curious, so I investigated a little and came to the conclusion that something changed between versions 3.0.14 and 4.0.1 that introduces this error.

I should probably also mention that the error is cleared when you reboot the server: not directly when the server is shut down, but during POST. The error is also reported during POST. I'm attaching a picture so you can see for yourself! :) But just in case somebody searches for this error in the future, here it is:

1719-Slot 1 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x14)

[Attached image: VideoCapture_20220615-201626.jpg]

I've searched for this error on the web and found that there was a firmware update that should fix it. But the firmware mentioned was version 3-something, if I remember correctly, and as you can see in the screenshot, the cards are already running version 8.32, which is the latest version available on HP's support site.

Another thing worth mentioning is that none of the drives connected through the RAID card are listed by the lsblk command after the error shows up. (I haven't checked whether they show up before the error, though.)

I assume that something that runs at startup, or shortly after, sends a command to the RAID card that it can't handle, which in turn makes it crash or lock up, as the error indicates.

Is there anything that I can try to find the problem?

Listing some info about my setup (So that it is easier to find than to scroll through the text):
Servers (2x): HP ProLiant DL380e Gen 8
RAID card (Same card in both servers): HP Smart Array P822

Best regards
Oskar

Marc Smith

Jun 15, 2022, 3:11:05 PM
to esos-...@googlegroups.com
Hmm... 3.x.x and 4.x.x both use Linux 5.4.x kernels; if those were different, I'd suspect a driver version bump. I wouldn't expect much change in a patch release bump of the 5.4.x LTS kernel. That said, I can't imagine that changing/affecting much of anything internal to the adapter firmware/operation.

So if you put the 3.x.x version back on, you don't have this problem?

--Marc



Oskar

Jun 15, 2022, 3:16:44 PM
to esos-users
Exactly. I've been jumping back and forth between ESOS versions, and the last version that works is 3.0.14. If I go to version 4.0.1, I get the error again.

Marc Smith

Jun 15, 2022, 3:23:39 PM
to esos-...@googlegroups.com
But the error only occurs in the BIOS POST messages for the add-on card? No I/O impact while the OS is running?


Oskar

Jun 15, 2022, 3:35:39 PM
to esos-users
Well, that's the only place where I've seen an error message. I haven't checked the OS logs yet. But the drives connected to the RAID card don't show up in ESOS after the error appears. (I'm not sure whether they show up before the error, as I haven't checked that.)
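
Both open questions here (are the drives visible before the lock-up, and what did the kernel log around it) can be checked from a shell on the host. What follows is only a minimal sketch. It assumes the P822 is handled by the in-kernel `hpsa` driver and that IOMMU-related kernel messages contain `DMAR` or `IOMMU`; neither is confirmed in this thread. The function names are hypothetical.

```shell
# Keep only kernel-log lines that mention the controller driver or the IOMMU.
# Assumption: the Smart Array P822 is driven by "hpsa"; adjust the pattern
# if lspci/dmesg on your system show a different driver name.
filter_controller_lines() {
    grep -i -E 'hpsa|DMAR|IOMMU'
}

# Take one snapshot: a timestamp, the visible block devices, and the
# matching kernel messages (reading dmesg may require root).
snapshot() {
    echo "=== $(date) ==="
    lsblk
    dmesg | filter_controller_lines
}

# Hypothetical usage: run once right after boot and once after the iLO
# reports the failure, then compare the two files.
#   snapshot > /tmp/before.txt
#   snapshot > /tmp/after.txt
```

If the drives are listed in the first snapshot and missing from the second, the controller was working until something triggered the lock-up, and the kernel lines logged in between are the place to look.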

Oskar

Jun 21, 2022, 4:39:35 PM
to esos-users
Hi again Marc!

I've been looking into this a bit, and with the help of git bisect I found that this is the commit that introduces the bug:

    dafb570e8bbd6c375dfa76a760a1e208e9bffa94 is the first bad commit
    commit dafb570e8bbd6c375dfa76a760a1e208e9bffa94
    Author: Marc A. Smith <marc....@quantum.com>
    Date:   Mon Oct 26 22:03:16 2020 -0400

        updated sedutil package

     CHECKSUM.MD5    | 2 +-
     CHECKSUM.SHA256 | 2 +-
     Makefile.in     | 2 +-
     3 files changed, 3 insertions(+), 3 deletions(-)
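
For readers unfamiliar with it, the mechanics of `git bisect` can be sketched with a self-contained toy repository. This is not the ESOS tree: the five commits, the `BROKEN` marker file (standing in for "the controller locks up") and the automated test command are all made up for illustration.

```shell
# Build a throwaway repository with five commits; the fourth one adds a
# marker file that plays the role of the regression.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "demo@example.invalid"
git config user.name "demo"
for i in 1 2 3 4 5; do
    echo "$i" > version.txt
    if [ "$i" -eq 4 ]; then
        touch BROKEN
    fi
    git add -A
    git commit -q -m "commit $i"
done

# Mark the newest commit as bad and the oldest as good, then let git
# drive the search. The test command exits 0 (good) while BROKEN is
# absent and 1 (bad) once it exists.
git bisect start HEAD HEAD~4
git bisect run sh -c 'test ! -e BROKEN'

# refs/bisect/bad now points at the first bad commit.
first_bad_subject=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad_subject"

# Show the recorded decisions, then return to the original branch.
git bisect log
git bisect reset
```

On real hardware each step is manual instead: build and install the checked-out revision, boot, wait long enough for the failure to appear, then run `git bisect good` or `git bisect bad`. Saving the output of `git bisect log` makes it possible to spot a mislabeled step later and redo the search with `git bisect replay`.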

Here are the changes made to the Makefile in this commit:

    diff --git a/Makefile.in b/Makefile.in
    index dd37b2c..6d5a5e6 100644
    --- a/Makefile.in
    +++ b/Makefile.in
    @@ -325,7 +325,7 @@ dist_files  = $(addprefix $(DIST_FILES_DIR)/, \
                    iotop-master_20200318.tar.xz \
                    node_exporter-1.0.0.linux-amd64.tar.xz \
                    libcap-2.36.tar.gz \
    -               sedutil-master_20201009.tar.xz \
    +               sedutil-master_20201026.tar.xz \
                    win_binaries-20180813.tar.xz)
     dist_files_repo        = http://download.esos-project.com/dist_files


As you can see, the sedutil update crashes my RAID card.

So somewhere between those dates, the developers of sedutil seem to have introduced the bug. I'm going to see if I can change the version of sedutil when building ESOS, so that I can try to locate the bug in their code. But I'm not really sure how you get the code for sedutil. I'm trying to go through the code, but maybe you could tell me how it works, to save some time? :)

Best regards
Oskar

Oskar

May 15, 2023, 3:13:16 PM
to esos-users
Sorry for necroposting, but I thought I should post the solution to my problem in case someone else runs into it in the future! :)

I had to put this project on hold for some other work, but I came back to it recently and bisected again, as the error didn't seem to originate from the commit I previously mentioned. I probably made a mistake when I bisected the last time...

This time, however, I came to the conclusion that this commit was the culprit:

    commit 4ef5ccf0b8990bda1f5c6ae87cd814af05a72219 (HEAD)
    Author: Marc A. Smith <marc....@quantum.com>
    Date:   Thu Dec 24 16:32:54 2020 -0500

        enabled additional IOMMU and VFIO kernel drivers

After some googling I found this page, which mentions a few different workarounds for their problem with IOMMU. (It is not the same problem as mine, but the workarounds worked nonetheless.) I chose to disable the VT-d setting in the BIOS, and after rebooting the server, the newer versions of ESOS ran perfectly!
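
To confirm whether the IOMMU is actually in use on a given boot, something like the following could be run on the host. It is a sketch with hypothetical function names. It relies on `/sys/kernel/iommu_groups` being populated only when the kernel has set up an IOMMU, and on the documented kernel parameters `intel_iommu=`, `amd_iommu=` and `iommu=`.

```shell
# Print any IOMMU-related parameters found on a kernel command line.
# Reads the command line from stdin so it can be fed /proc/cmdline.
iommu_params() {
    tr ' ' '\n' | grep -E '^(intel_iommu|amd_iommu|iommu)='
}

# Succeed if the running kernel has created at least one IOMMU group.
# The directory is empty or absent when no IOMMU is active.
iommu_active() {
    [ -d /sys/kernel/iommu_groups ] &&
        [ -n "$(ls -A /sys/kernel/iommu_groups 2>/dev/null)" ]
}

# Hypothetical usage on the host:
#   iommu_params < /proc/cmdline
#   if iommu_active; then echo "IOMMU active"; else echo "IOMMU not active"; fi
```

With VT-d disabled in the BIOS, `iommu_active` would be expected to fail. Booting with `intel_iommu=off` on the kernel command line is a commonly used software-side alternative on Intel systems, but it was not tested in this thread, so treat it as an assumption for this hardware.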

Best regards
Oskar

Marc Smith

May 20, 2023, 12:05:28 AM
to esos-...@googlegroups.com
Good to know, thank you for sharing!

--Marc

 