MDADM Failed to RUN_ARRAY

Alec Myers

Sep 11, 2022, 4:57:39 PM
to esos-users
Preface: I've been working with ESOS for a few years, trying to create a redundant FC storage array using shared disks.

I've just recently picked back up where I left off and managed to get 4 RAID1 arrays built. After manually starting SCST and plugging everything in, I was even able to get ESXi to connect and run VMs, using round-robin across both servers!

However, I wasn't done, and it's nowhere near ready for production. When I rebooted one of the hosts, I encountered something odd: the MD arrays were no longer there. Looking in /var/log/boot, I saw "mdadm: failed to RUN_ARRAY /dev/md0: Transport endpoint is not connected".

I rebooted the second host and found it had the same errors. When I run "mdadm --assemble --scan", each host only assembles a random subset of the arrays, but if I give it some time and run the command again, they all get detected and rebuilt.
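
For reference, the manual recovery I'm doing after each boot is roughly the following (the dlm_tool check is just my guess that the lock space has to be joined before assembly works; the rest is plain mdadm):

  dlm_tool ls                 # is this node actually in the DLM lock spaces yet?
  mdadm --assemble --scan     # only picks up some of the arrays on the first pass
  cat /proc/mdstat            # see which ones came up, then repeat the assemble after a minute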

Is there a step that I'm missing? I was hoping to end up with something similar to Marc's NVMe array, but using spinning rust (http://marcitland.blogspot.com/2017/05/millions-of-iops-with-esos-nvme.html).

P.S.: I also see that mdadm now allows RAID10 in clustered arrays as an experimental feature. Has this been added to ESOS?

Current Details:
Supermicro SSG-2027B-DE2R24L (2-Node, 24x2.5")
ESXi Hypervisor (until I finish configuring)
VM - ESOS 3.0.12-z

Andrei Wasylyk

Sep 11, 2022, 5:57:45 PM
to esos-...@googlegroups.com
If I remember correctly, I finally got this working to a semi-reliable degree by tuning the timing between when the node boots -> starts the cluster stack -> starts dlm -> joins the lock space -> tries to assemble the clustered raid set.

For example, I found that if I put the node into standby, disabled every crm resource on the standby node except dlm and its dependencies, rebooted it, made sure it did not touch the MD array, then took it out of standby, waited a while, and slowly re-enabled the services, it worked every time.
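
In crm shell terms, the sequence was roughly the following (resource names are placeholders, not my actual config):

  crm node standby node2            # park the node before rebooting it
  crm resource stop p_scst          # stop everything except dlm and its dependencies
  crm resource stop p_md_raid
  # ...reboot node2 and check that it does not touch the md arrays...
  crm node online node2             # take it out of standby
  dlm_tool ls                       # wait until node2 shows up in the lock space
  crm resource start p_md_raid      # then bring the rest back one at a time
  crm resource start p_scst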

I eventually gave up on clustered RAID10, but maybe you could try that, see at which step it starts to fail, and let that clue you in. How's your Pacemaker knowledge? I think there is also a default behavior where mdadm always runs a consistency check after boot (even when stopped cleanly). I am not sure why it does that; I don't *think* there is a fundamental requirement to do so. Disabling that behavior solved some problems as well, but I never put that into production.
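
If you want to see whether that check is what's biting you, the standard md knobs are along these lines (md0 as an example):

  cat /proc/mdstat                              # shows a resync/check in progress, if any
  echo frozen > /sys/block/md0/md/sync_action   # pause it
  echo idle > /sys/block/md0/md/sync_action     # or abort it entirely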

I ended up doing RAID sets that are active on one node only, but are spread evenly across both nodes and fail over between them.
My nodes sound like the same setup as yours, except iSCSI instead of FC: SAS JBODs with dual expanders, connected to two nodes with one HBA each.
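
Roughly, each array becomes an ordinary (non-clustered) Raid1 resource with a location preference for its "home" node, something like this (IDs and device names are examples, not my actual config):

  crm configure primitive p_md0 ocf:heartbeat:Raid1 \
      params raidconf=/etc/mdadm.conf raiddev=/dev/md0 \
      op monitor interval=30s
  crm configure location l_md0_home p_md0 100: node1   # prefer node1, fail over to node2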


Alec Myers

Dec 29, 2022, 5:06:39 PM
to esos-users
So, I've been trying over and over again, but I can't seem to tune it right. It also appears we don't actually have support for RAID10 yet in the clustered MD currently in ESOS, so I'm having to go back to the drawing board.

I'd love to hear more about the split RAID sets that you're running, because that sounds like my best step from here without essentially throwing away all my money on more disks; it would be cheaper to buy JovianDSS than to do that. I'm sure the iSCSI vs. FC difference won't be a major issue.