OMV5 on Debian Buster w/ kernel 6.1.1 causes OOM

Conway Smith (Beolach)

May 8, 2023, 9:51:30 PM
to GnuBee
I'm finally taking my GnuBee PC2 out of the box & trying to set it up.  I've got it working pretty well w/ a basic Debian Buster installation w/ Neil Brown's latest 6.1.1 kernel, but when I try to install OMV5 following Antoine Besnier's script [1], it ends up with the oom-killer going crazy & hanging the machine.   On the serial console I get repeated oom-killer & Stack Call Trace messages, eventually ending up at:

Out of memory and no killable processes...
Kernel panic - not syncing: System is deadlocked on memory
Rebooting in 90 seconds..

And every reboot it just keeps doing that again, until I use a USB thumbdrive w/ gnubee-config.txt & CONSOLE_SHELL=yes to switch to a working GNUBEE-ROOT w/out OMV5 installed.  Then if I chroot into the broken GNUBEE-ROOT w/ OMV5 & `apt purge openmediavault`, that GNUBEE-ROOT will start working again w/out the OOMs.
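
(For reference, the recovery amounts to roughly the following once the
working root is up; the md device name here is just an example.)

mount /dev/md126 /mnt                 # the broken GNUBEE-ROOT
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt apt purge openmediavault
umount /mnt/dev /mnt/proc /mnt/sys /mnt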

I've tried tweaking the omv5-install.sh script to build a few different OMV5 releases, most recently openmediavault_5.6.27-1_all.deb, and I've also tried switching to Neil Brown's previous kernel 5.19.13, but I always end up w/ OOM panics (or some of the OMV versions failed to build for me).  I've also tried both w/ & w/out swap partitions, and various root filesystem configurations (btrfs, ext4; on MD-RAID, or a single plain partition).

Does anyone have a working OMV5 install?  What specific version of OMV5 (git commit) is it?  What kernel version are you running?  Is there anything else anyone can think of that might be causing the OOM?  Should I just give up on OMV5 & switch to OpenWRT for the web interface? (I haven't used OMV at all in the past; OpenWRT I've used, but not for a NAS).


Thanks,
Conway S. Smith

Brett Neumeier

May 9, 2023, 9:28:00 AM
to Conway Smith (Beolach), GnuBee
On Mon, May 8, 2023 at 8:51 PM Conway Smith (Beolach) <beo...@gmail.com> wrote:
I'm finally taking my GnuBee PC2 out of the box & trying to set it up.  I've got it working pretty well w/ a basic Debian Buster installation w/ Neil Brown's latest 6.1.1 kernel, but when I try to install OMV5 following Antoine Besnier's script [1], it ends up with the oom-killer going crazy & hanging the machine.   On the serial console I get repeated oom-killer & Stack Call Trace messages, eventually ending up at:

Hi Conway,

I have not tried to get OMV set up or working, but one thing that might help is to upgrade to a later version of Debian -- I am using Bullseye on both of my GnuBee units, and it is quite stable. You may find that works better.

It may also be that the default configuration for some process is causing it to allocate too much memory. From the install script, it looks like OMV uses nginx and php-fpm; I wouldn't expect nginx to be a problem, but php-fpm might be trying to spawn too many server processes at startup. Or maybe OMV itself runs some process(es) that are very memory-hungry; I don't know anything about how it works.
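
For example (purely illustrative - I don't know what pool config OMV
actually ships, so the path and values here are guesses), you could cap
php-fpm's workers with something like this in its pool file:

; e.g. in /etc/php/7.3/fpm/pool.d/openmediavault-webgui.conf
pm = ondemand
pm.max_children = 4
pm.process_idle_timeout = 10s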

If you set up a job that just logs memory usage of all processes every minute, you might be able to do some forensic investigation of exactly what process or processes are triggering the behavior you see.
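
Something as simple as a cron entry would do, e.g. (log path is arbitrary):

# /etc/cron.d/mem-snapshot - log the top memory consumers once a minute
* * * * * root { date; ps -eo rss,vsz,comm --sort=-rss | head -20; } >> /var/log/mem-snapshot.log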

Good luck! Hope this helps.

Cheers,

Brett


--
Brett Neumeier (bneu...@gmail.com)

Neil Brown

May 9, 2023, 6:11:14 PM
to GnuBee
I'm surprised that you get OOM panics if you have swap configured.  How big did you make the swap partition?
(Though I admit I don't use OMV and so don't know how much memory it might need.  A few Gig *should* be enough)

Neil Brown

Beolach

May 9, 2023, 10:15:19 PM
to Brett Neumeier, GnuBee
On Tue, May 9, 2023 at 7:28 AM Brett Neumeier <bneu...@gmail.com> wrote:
>
> I have not tried to get OMV set up or working, but one thing that might help is to upgrade to a later version of Debian -- I am using Bullseye on both of my GnuBee units, and it is quite stable. You may find that works better.
>
Brett,

Thanks for the reply. I actually did start out trying w/ Bullseye &
OMV6; but OMV doesn't officially support the GnuBee or mips
architecture, and w/ OMV6 they changed their build process, and I
couldn't get it to build for me at all. IIRC it was their Salt
dependency that caused the problems - OMV depends on more recent
versions of Salt than are packaged in Debian, and it was actually the
Salt project that changed their build process, so I couldn't get it to
work on my GnuBee.

OMV5 is specifically tied to Buster, as OMV4 was on Stretch. Those
two versions are the ones I've seen documentation on for the GnuBee -
OMV4 at launch in the official GnuBee site (and w/ some later
unofficial updated references); and OMV5 was discussed here in this
group several months back:
https://groups.google.com/g/gnubee/c/3IpIUXEgM54 (that's where I found
Antoine Besnier's script).

> It may also be that the default configuration settings being used for some process are trying to allocate too much memory. From the install script, it looks like OMV uses nginx and php-fpm; I wouldn't expect nginx to be a problem, but php-fpm might be trying to spawn too many server processes at startup. Or maybe OMV itself runs some process(es) that are very memory-hungry, I don't know anything about how it works.
>
> If you set up a job that just logs memory usage of all processes every minute, you might be able to do some forensic investigation of exactly what process or processes are triggering the behavior you see.
>

It's not an over-time thing I can check on every minute; it happens
immediately upon installing the openmediavault package, and then
immediately following a reboot - it never even gets to a login on the
serial console. I'm now working on getting a log of the serial
console into shape to post here w/ full details, but so far it's
beyond my comprehension - it doesn't look to me like any (userspace?)
process is actually using much memory, but I'm not sure I'm
parsing the oom log correctly.


Thanks,
Conway S. Smith

Beolach

May 9, 2023, 10:35:43 PM
to Neil Brown, GnuBee
On Tue, May 9, 2023 at 4:11 PM Neil Brown <ne...@brown.name> wrote:
>
> I'm surprised that you get OOM panics if you have swap configured. How big did you make the swap partition?
> (Though I admit I don't use OMV and so don't know how much memory it might need. A few Gig *should* be enough)
>
> Neil Brown
>

I have 1.5G swap partitions on each of the 6 HDDs; `free -m` shows
9143M of swap space total. I'm still working on putting serial console
logs into shape to post publicly, but one thing I've noticed already
is after a reboot, the OOM killer starts its rampage before swap is
enabled. I've tried running swapon in the initramfs shell, but the
OOM killer still hangs the system during boot - it does show the swap
space is available (after swapon in initramfs), but it's not used at
all.


Thanks,
Conway S. Smith

Beolach

May 10, 2023, 6:04:00 AM
to GnuBee
OK, I've narrowed it down to a much more specific cause.
Openmediavault includes a
/etc/udev/rules.d/99-openmediavault-md-raid.rules file, with:

SUBSYSTEM=="block", KERNEL=="md*", ACTION=="add|change", TEST=="md/stripe_cache_size", ATTR{md/stripe_cache_size}="8192"

When I disable that udev rule, the GnuBee boots w/out the OOM issues,
and the OMV web interface does work. I haven't started doing anything
w/ that yet though - I want to dig more into the OOM problem first.

Currently I have 3 md arrays, none actually in use - they're left over
from previous attempts at getting OMV working. One is raid1, so it
doesn't have md/stripe_cache_size & doesn't trigger that udev rule.
The other two are raid6; one is degraded, but I don't *think* that's
relevant - the reason it's degraded is I removed a member partition
(and later a 2nd one) in order to test root on a normal partition, not
on a RAID device. Unless I'm misremembering, I was seeing these same
OOM issues when I was using the clean array as the actual GNUBEE-ROOT.

I can directly trigger the OOM by running `echo 8192 >
/sys/block/mdX/md/stripe_cache_size` (replacing X w/ the number) - but
it doesn't trigger the OOM the first time I do that. It's only when I
do it again to the second raid6 array that the OOM goes off. Another
reason I don't think the degraded array is relevant is because it
doesn't matter which array I set stripe_cache_size to 8192 first or
second - it's always the 2nd time that the OOM triggers.

I have serial console logs taken with commands like `screen -L
-Logfile GnuBee-OOM-screen.1.log /dev/ttyUSB0 57600`, but they end up
with a lot of control characters (^M, ^G, ^[[, ^H), especially on the
shell prompts when I'm entering commands; ansifilter helps clean them
up a lot, but many of the commands I enter still end up mangled, and
for now I've given up trying to manually clean them up. I think `bash
-v` might help keep the commands executed clean in the log, but I
haven't done that yet. Below are a couple of small snippets from when
I trigger the OOM w/ `echo 8192 > /sys/block/mdX/md/stripe_cache_size`.
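
(E.g. something like the following; exact options may vary by
ansifilter version, and `col -b` takes care of the ^H backspaces.)

ansifilter -i GnuBee-OOM-screen.1.log | col -b > GnuBee-OOM-clean.log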

At least I'm feeling like I'm making progress now. I'll keep working
on this (fix the degraded array, get better logs w/ `bash -v`) & send
more updates.

### This first time I tested, I set md125 (the degraded array) first,
and md127 (the clean array) second:
root@GnuBee0:/etc/udev/rules.d.disabled# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active (auto-read-only) raid6 sdf5[5] sde5[4] sdd5[3] sdc5[2]
167636992 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [__UUUU]
bitmap: 1/2 pages [4KB], 16384KB chunk

md126 : active (auto-read-only) raid1 sdf3[5] sde3[4] sdd3[3] sdc3[2]
sdb3[1] sda3[0]
8379392 blocks super 1.2 [6/6] [UUUUUU]
bitmap: 0/1 pages [0KB], 16384KB chunk

md127 : active (auto-read-only) raid6 sdf2[5] sde2[4] sdd2[3] sdc2[2]
sdb2[1] sda2[0]
150861824 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 0/2 pages [0KB], 16384KB chunk

unused devices: <none>
root@GnuBee0:/etc/udev/rules.d.disabled# echo 8192 > /sys/block/md125/md/stripe_cache_size
root@GnuBee0:/etc/udev/rules.d.disabled# echo 8192 > /sys/block/md127/md/stripe_cache_size
[ 740.834868] bash invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL),
order=0, oom_score_adj=0
### ... snip the rest of the oom messages

### This second time I tested, I set md126 (the clean array) first,
and md125 (the degraded array) second, with a bit more activity in
between:
root@GnuBee0:~# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active (auto-read-only) raid6 sdf5[5] sde5[4] sdd5[3] sdc5[2]
167636992 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [__UUUU]
bitmap: 1/2 pages [4KB], 16384KB chunk

md126 : active (auto-read-only) raid6 sdf2[5] sde2[4] sdd2[3] sda2[0]
sdb2[1] sdc2[2]
150861824 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 0/2 pages [0KB], 16384KB chunk

md127 : active (auto-read-only) raid1 sdf3[5] sde3[4] sdd3[3] sda3[0]
sdb3[1] sdc3[2]
8379392 blocks super 1.2 [6/6] [UUUUUU]
bitmap: 0/1 pages [0KB], 16384KB chunk

unused devices: <none>
root@GnuBee0:~# ls /sys/block/md12*/md/stripe_cache_size
/sys/block/md125/md/stripe_cache_size /sys/block/md126/md/stripe_cache_size
root@GnuBee0:~# cat /sys/block/md12*/md/stripe_cache_size
256
256
root@GnuBee0:~# echo 8192 > /sys/block/md126/md/stripe_cache_size
root@GnuBee0:~# cat /sys/block/md12*/md/stripe_cache_size
256
8192
root@GnuBee0:~# uptime
01:35:58 up 8 min, 1 user, load average: 0.09, 0.87, 0.64
root@GnuBee0:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            496         305          20          26         170         139
Swap:          9143          12        9131
root@GnuBee0:~# logout
### ... snip logging back in ...
You have mail.
root@GnuBee0:~# uptime
01:36:30 up 8 min, 1 user, load average: 0.11, 0.79, 0.62
root@GnuBee0:~# cat /sys/block/md12*/md/stripe_cache_size
256
8192
root@GnuBee0:~# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md125 : active (auto-read-only) raid6 sdf5[5] sde5[4] sdd5[3] sdc5[2]
167636992 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [__UUUU]
bitmap: 1/2 pages [4KB], 16384KB chunk

md126 : active (auto-read-only) raid6 sdf2[5] sde2[4] sdd2[3] sda2[0]
sdb2[1] sdc2[2]
150861824 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 0/2 pages [0KB], 16384KB chunk

md127 : active (auto-read-only) raid1 sdf3[5] sde3[4] sdd3[3] sda3[0]
sdb3[1] sdc3[2]
8379392 blocks super 1.2 [6/6] [UUUUUU]
bitmap: 0/1 pages [0KB], 16384KB chunk

unused devices: <none>
root@GnuBee0:~# echo 8192 > /sys/block/md125/md/stripe_cache_size
[ 609.614295] bash invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL),
order=0, oom_score_adj=0
### ... snip the rest of the oom messages


Thanks,
Conway S. Smith

Beolach

May 12, 2023, 11:24:18 AM
to GnuBee
I've got the OOM issue figured out now, and it's not a bug; it's an
actual legitimate OOM situation. I feel kinda silly now - it's pretty
obvious in hindsight. md/stripe_cache_size uses memory, and
increasing it uses more memory. Turns out the GnuBee's 512M RAM is
enough for one array w/ 8192 md/stripe_cache_size, but not two -
trying to set the second array to the larger stripe_cache_size uses up
all the memory, and triggers a legitimate OOM. And obviously swap
space won't help here, since the stripe cache is unswappable kernel memory.
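
(Back-of-the-envelope, using the formula from the kernel's md
documentation, memory_consumed = system_page_size * nr_disks *
stripe_cache_size:

4096 bytes * 6 disks * 8192 stripes = 192 MiB per array

so two raid6 arrays at 8192 want roughly 384 MiB of the GnuBee's
512 MiB, on top of everything else that's already running.)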

The light bulb went on for me when I was testing: I'd switched one of
the two raid6 arrays to raid10 & had it running as GNUBEE-ROOT, and
kept the other raid6 for testing, re-adding the missing drives so it's
no longer degraded. Setting 8192 md/stripe_cache_size on it worked
w/out triggering OOM, same as before for a single array.

Then I created a new raid5 array for testing w/ two arrays w/
md/stripe_cache_size, and when I tried `echo 8192 >
/sys/block/md124/md/stripe_cache_size`, it did trigger the OOM. But
to my surprise this time it actually managed to recover from the OOM -
and checking `cat /sys/block/md124/md/stripe_cache_size` showed 7184 -
not the 8192 I had echoed into it. And that's when the light bulb
went on.

To confirm, I kept the first array at 8192 & set the second array to
512, then doubled it to 1024, 2048, and 4096, checking free memory
each time; every step succeeded, w/ less free memory remaining. I kept
increasing in smaller steps until 7168 succeeded, showing 8M free
memory, and the next test at 7680 triggered the OOM.

So now I'm happy I understand what was triggering the OOM, and it's
not a bug. I'm just going to leave the OMV udev rule disabled (w/
dpkg-divert to prevent it getting installed again), and stay at the
default 256 stripe_cache_size.
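
(Roughly like this - untested as written; the divert target is
arbitrary, just anything udev won't load because it doesn't end
in .rules.)

dpkg-divert --add --rename \
  --divert /etc/udev/rules.d/99-openmediavault-md-raid.rules.disabled \
  /etc/udev/rules.d/99-openmediavault-md-raid.rules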

Thanks,
Conway S. Smith

P.S.
Here's some of my testing, getting close to but not triggering the OOM.

root@GnuBee0 ~# cat /sys/block/md12?/md/stripe_cache_size
256
256
root@GnuBee0 ~# free -m
              total        used        free      shared  buff/cache   available
Mem:            496         132          45          29         318         320
Swap:          9143           2        9141
root@GnuBee0 ~# echo 8192 > /sys/block/md126/md/stripe_cache_size
root@GnuBee0 ~# cat /sys/block/md12?/md/stripe_cache_size
256
8192
root@GnuBee0 ~# free -m
              total        used        free      shared  buff/cache   available
Mem:            496         303          38          29         155         149
Swap:          9143          22        9121
root@GnuBee0 ~# echo 4096 > /sys/block/md125/md/stripe_cache_size
root@GnuBee0 ~# cat /sys/block/md12?/md/stripe_cache_size
4096
8192
root@GnuBee0 ~# free -m
              total        used        free      shared  buff/cache   available
Mem:            496         395          20          16          80          70
Swap:          9143          40        9103
root@GnuBee0 ~# echo 7168 > /sys/block/md125/md/stripe_cache_size
root@GnuBee0 ~# cat /sys/block/md12?/md/stripe_cache_size
7168
8192
root@GnuBee0 ~# free -m
              total        used        free      shared  buff/cache   available
Mem:            496         437           5           4          53          40
Swap:          9143          89        9054
root@GnuBee0 ~# echo 256 > /sys/block/md125/md/stripe_cache_size
root@GnuBee0 ~# echo 256 > /sys/block/md126/md/stripe_cache_size
root@GnuBee0 ~# cat /sys/block/md12?/md/stripe_cache_size
256
256
root@GnuBee0 ~# free -m
              total        used        free      shared  buff/cache   available
Mem:            496          66         368           4          61         411
Swap:          9143          88        9055

Brett Neumeier

May 12, 2023, 11:44:55 AM
to Beolach, GnuBee
On Fri, May 12, 2023, 10:24 AM Beolach <beo...@gmail.com> wrote:
I've got the OOM issue figured out now, and it's not a bug; it's an
actual legitimate OOM situation.  I feel kinda silly now - it's pretty
obvious in hindsight.  md/stripe_cache_size uses memory, and
increasing it uses more memory.  Turns out the GnuBee's 512M RAM is
enough for one array w/ 8192 md/stripe_cache_size, but not two -
trying to set the second array to the larger stripe_cache_size uses up
all the memory, and triggers a legitimate OOM.  Also obviously swap
space won't help here.

Thank you for closing the loop! That makes perfect sense.
