loop-mounted /usr partition disappears after some idle time


Alexander Stielau

Oct 5, 2021, 11:32:20 AM10/5/21
to Flatcar Container Linux User

We got new hardware: Dell PowerEdge R6515 with 10x NVMe on a PCIe backplane and 2x SSD (M.2) on their own controller (BOSS-S1).

We are running Flatcar 2905.2.3; we also tested 2905.2.4 (not usable due to the known Mellanox driver problems) and now 2905.2.5, and we see the same problem with older versions (2605.5.0 was available on our Matchbox servers).

Installation is via Matchbox, with Ignition files created by Terraform. We use some persistent partitions (for persistent stuff: etcd, ... storage), / is on a wipe-on-reboot partition, and /usr is loop-mounted.
We switched the /usr-a //usr-b update feature off by masking the update-engine and locksmithd services.
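For reference, disabling the update machinery in a Container Linux Config looks roughly like this (a sketch only; our actual Terraform-generated config differs):

```yaml
systemd:
  units:
    # Mask both units so they can never be started,
    # which disables the /usr-a //usr-b A/B update flow.
    - name: update-engine.service
      mask: true
    - name: locksmithd.service
      mask: true
```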

The following example is with all relevant partitions on a RAID1 built from the small boot disks on the BOSS-S1 controller:

core@storage-0012 ~ $ sudo lsblk -o  name,mountpoint,label,size,uuid

loop0   /usr                      347.6M
sda                               223.5G
|-sda1  /var/lib/persistent          20G d5604c8b-1c78-4506-8920-115a077512f6
|-sda2  /var/lib/rook                50G 8a7316f5-b310-4aa4-9d10-2daebc233c83
|-sda3  /var/lib/etcd                50G 200eee13-9d92-4a68-a042-5097715b21f1
`-sda4  /                   ROOT  103.5G 2174585f-c1df-4f5f-8e1c-071eb96171ca
nvme3n1                             2.9T
nvme2n1                             2.9T
nvme0n1                             2.9T
nvme5n1                             2.9T
nvme4n1                             2.9T
nvme7n1                             2.9T
nvme8n1                             2.9T
nvme1n1                             2.9T
nvme6n1                             2.9T
nvme9n1                             2.9T

We also did the same with all partitions on an NVMe, for good measure. :-)

Problem: some time (we guess: hours) after boot/installation, the /usr partition is no longer available, and with it we lose all binaries needed to check what happened.

We managed to redirect the dmesg output to a persistent partition, but we found no interesting information in it - just the connect of the iDRAC virtual console, since ssh login is not possible after the loss of all binaries.

At first we suspected this new behavior was connected to the BOSS-S1 controller and its disks, so we removed them completely from one node and ran it without a RAID configuration (using one NVMe instead). But it does not look related to the controller: the problem also happens on the node that has it removed.

I have now followed the spawn-a-toolbox-container advice and run it with a detached tmux (https://kinvolk.io/docs/flatcar-container-linux/latest/setup/debug/install-debugging-tools/#spawn-a-toolbox-with-tmux-in-the-background), and tried to link /run/log/journal to a persistent partition - I will see tomorrow whether this shows us more information.

To me it looks like the loop module just unloads, but that's a wild guess, and I have no evidence for it.
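To check that guess from a shell that survives the failure (e.g. the detached toolbox), something like the following could tell whether the loop driver is still registered and whether loop0 is still attached (a hypothetical diagnostic sketch, not something we have run yet):

```shell
# Does the kernel still know about the loop driver?
loop_present=no
[ -e /sys/module/loop ] && loop_present=yes
grep -q '^loop ' /proc/modules 2>/dev/null && loop_present=yes
echo "loop driver present: $loop_present"

# Is /dev/loop0 (the /usr squashfs) still attached?
loop0_attached=no
losetup --list 2>/dev/null | grep -q '^/dev/loop0' && loop0_attached=yes
echo "loop0 attached: $loop0_attached"
```

If the driver is present but loop0 is gone, the device was detached rather than the module unloaded.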

We run some other hardware (Supermicro) in a similar configuration, also as storage nodes for our clusters, without these problems.

Does anybody here have an idea how to approach this in a better way?
Which information do you need for a better guess?


Alexander Stielau

Oct 7, 2021, 4:15:07 AM10/7/21
to Flatcar Container Linux User

To get a better picture, I enabled debugging in systemd-journald as described in https://kinvolk.io/docs/flatcar-container-linux/latest/setup/debug/reading-the-system-log/#enable-debugging-via-a-container-linux-config and installed a systemd timer (every 15 minutes) to copy the journal, lsmod and dmesg output to a persistent partition.
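The timer setup looks roughly like this (a sketch; the unit names and the /var/lib/persistent/debug path are my own, our real units differ in detail):

```ini
# /etc/systemd/system/debug-snapshot.service
[Unit]
Description=Copy journal, lsmod and dmesg to a persistent partition

[Service]
Type=oneshot
# %% escapes the percent sign inside systemd unit files
ExecStart=/bin/sh -c 'ts=$(date +%%Y%%m%%dT%%H%%M%%S); \
  mkdir -p /var/lib/persistent/debug; \
  journalctl -b > /var/lib/persistent/debug/journal-$ts.log; \
  lsmod       > /var/lib/persistent/debug/lsmod-$ts.log; \
  dmesg       > /var/lib/persistent/debug/dmesg-$ts.log'

# /etc/systemd/system/debug-snapshot.timer
[Unit]
Description=Run debug-snapshot every 15 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=15min

[Install]
WantedBy=timers.target
```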

The problem disappeared - I could log into all nodes this morning.
I will remove the systemd timer on one node and revert both changes on another node to verify.

And I will create a GitHub issue to get help and/or for further research into the root cause.
