KASAN-related VMAP allocation errors in debug kernels with many logical CPUs


David Hildenbrand

Oct 6, 2022, 9:47:05 AM
to Alexander Potapenko, Uladzislau Rezki (Sony), Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
Hi,

we're currently hitting a weird vmap issue in debug kernels with KASAN enabled
on fairly large VMs. I reproduced it on v5.19 (did not get the chance to
try 6.0 yet because I don't have access to the machine right now, but
I suspect it persists).

It seems to trigger when udev probes a massive number of devices in parallel
while the system is booting up. Once the system has booted, I no longer see
any such issues.


[ 165.818200] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.836622] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.837461] vmap allocation for size 315392 failed: use vmalloc=<size> to increase size
[ 165.840573] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841059] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841428] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.841819] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.842123] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.843359] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.844894] vmap allocation for size 2498560 failed: use vmalloc=<size> to increase size
[ 165.847028] CPU: 253 PID: 4995 Comm: systemd-udevd Not tainted 5.19.0 #2
[ 165.935689] Hardware name: Lenovo ThinkSystem SR950 -[7X12ABC1WW]-/-[7X12ABC1WW]-, BIOS -[PSE130O-1.81]- 05/20/2020
[ 165.947343] Call Trace:
[ 165.950075] <TASK>
[ 165.952425] dump_stack_lvl+0x57/0x81
[ 165.956532] warn_alloc.cold+0x95/0x18a
[ 165.960836] ? zone_watermark_ok_safe+0x240/0x240
[ 165.966100] ? slab_free_freelist_hook+0x11d/0x1d0
[ 165.971461] ? __get_vm_area_node+0x2af/0x360
[ 165.976341] ? __get_vm_area_node+0x2af/0x360
[ 165.981219] __vmalloc_node_range+0x291/0x560
[ 165.986087] ? __mutex_unlock_slowpath+0x161/0x5e0
[ 165.991447] ? move_module+0x4c/0x630
[ 165.995547] ? vfree_atomic+0xa0/0xa0
[ 165.999647] ? move_module+0x4c/0x630
[ 166.003741] module_alloc+0xe7/0x170
[ 166.007747] ? move_module+0x4c/0x630
[ 166.011840] move_module+0x4c/0x630
[ 166.015751] layout_and_allocate+0x32c/0x560
[ 166.020519] load_module+0x8e0/0x25c0
[ 166.024623] ? layout_and_allocate+0x560/0x560
[ 166.029586] ? kernel_read_file+0x286/0x6b0
[ 166.034269] ? __x64_sys_fspick+0x290/0x290
[ 166.038946] ? userfaultfd_unmap_prep+0x430/0x430
[ 166.044203] ? lock_downgrade+0x130/0x130
[ 166.048698] ? __do_sys_finit_module+0x11a/0x1c0
[ 166.053854] __do_sys_finit_module+0x11a/0x1c0
[ 166.058818] ? __ia32_sys_init_module+0xa0/0xa0
[ 166.063882] ? __seccomp_filter+0x92/0x930
[ 166.068494] do_syscall_64+0x59/0x90
[ 166.072492] ? do_syscall_64+0x69/0x90
[ 166.076679] ? do_syscall_64+0x69/0x90
[ 166.080864] ? do_syscall_64+0x69/0x90
[ 166.085047] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 166.090984] ? lockdep_hardirqs_on+0x79/0x100
[ 166.095855] entry_SYSCALL_64_after_hwframe+0x63/0xcd



Some facts:

1. The #CPUs seems to be more important than the #MEM

Initially we thought the memory size would be the relevant trigger,
because we had only seen it on 8TiB machines. But I was also able to
reproduce it on a "small" machine with ~450GiB.

We've seen this issue only on machines with a lot (~448) of logical CPUs.

On such systems, I was not able to reproduce it so far when booting the kernel
with "nosmt", which could indicate some kind of concurrency problem.


2. CONFIG_KASAN_INLINE seems to be relevant

This issue only seems to trigger with KASAN enabled and, from what I can tell,
only with CONFIG_KASAN_INLINE=y:

CONFIG_KASAN_INLINE: "but makes the kernel's .text size much bigger" -- and that
should include the kernel modules to be loaded.


3. All systems have 8 equally sized NUMA nodes

... which implies that at least one node is practically completely filled with
KASAN data. I remember adjusting the system size with "mem=" such that some
nodes were memory-less but node 0 would still have some free memory; it
still triggered.



My current best guess is that this is a combination of large VMAP demands
(e.g., kernel modules with quite a size due to CONFIG_KASAN_INLINE) and
possibly a concurrency issue with a large number of CPUs. But I might be wrong
and this might be something zone/node related.

Does any of that ring a bell -- especially why it would fail with 448 logical
CPUs but succeed with 224 logical CPUs (nosmt)?

My best guess would be that the purge_vmap_area_lazy() logic in alloc_vmap_area()
might not be sufficient when there is a lot of concurrency: simply purging
once and then failing might be problematic in corner cases where there is a lot of
concurrent vmap action going on. But that's just my best guess.
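
For reference, this is the path I mean -- a simplified excerpt of the
overflow handling in alloc_vmap_area() as of v5.19/6.0 (from memory, so
not verbatim): each caller purges the lazily-freed areas at most once
before giving up, so many concurrent allocators can all fail even though
space is about to become available again.

<snip>
overflow:
        if (!purged) {
                /* Drain lazily-freed vmap areas once, then retry the search. */
                purge_vmap_area_lazy();
                purged = 1;
                goto retry;
        }

        freed = 0;
        blocking_notifier_call_chain(&vmap_notify_list, 0, &freed);

        if (freed > 0) {
                purged = 0;
                goto retry;
        }

        if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
                pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
                        size);
<snip>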


Cheers!

--
Thanks,

David / dhildenb

Uladzislau Rezki

Oct 6, 2022, 11:35:54 AM
to David Hildenbrand, Alexander Potapenko, Uladzislau Rezki (Sony), Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
Can it be that we do not have a big enough "module section"? I mean the
section size, MODULES_END - MODULES_VADDR, is rather small, so some
modules may not get loaded due to lack of space.

CONFIG_RANDOMIZE_BASE also creates some offset overhead if it is enabled on
your box. But it looks like that is rather negligible.

Maybe try to increase the module-section size to see if it solves the
problem.

--
Uladzislau Rezki

David Hildenbrand

Oct 6, 2022, 12:12:44 PM
to Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
Right, I suspected both points -- but was fairly confused why the
number of CPUs would matter.

What would make sense is that, if we're tight on module vmap space, the
race I think could happen with purging only once and then failing
could become relevant.

>
> Maybe try to increase the module-section size to see if it solves the
> problem.

What would be the easiest way to do that?

Thanks!

Uladzislau Rezki

Oct 7, 2022, 11:34:52 AM
to David Hildenbrand, Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
Sorry for the late answer. I was trying to reproduce it on my box. What I
did was try to load all modules on my system with the KASAN_INLINE option:

<snip>
#!/bin/bash

# Exclude test_vmalloc.ko
MODULES_LIST=(`find /lib/modules/$(uname -r) -type f \
\( -iname "*.ko" -not -iname "test_vmalloc*" \) | awk -F"/" '{print $NF}' | sed 's/.ko//'`)

function moduleExist(){
        MODULE="$1"
        if lsmod | grep "$MODULE" &> /dev/null ; then
                return 0
        else
                return 1
        fi
}

i=0

for module_name in ${MODULES_LIST[@]}; do
        sudo modprobe $module_name

        if moduleExist ${module_name}; then
                ((i=i+1))
                echo "Successfully loaded $module_name counter $i"
        fi
done
<snip>

As you wrote, it looks like it is not easy to reproduce, so I do not see
any vmap-related errors.

Returning to the question: I think you could increase the MODULES_END
address and shift FIXADDR_START a little forward; see dump_pagetables.c.
But it might be that they are pretty compact and located at the end, so I am
not sure there is room there.

Second, it would be good to understand whether vmap only fails when allocating
for a module:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd6cdb201195..53026fdda224 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1614,6 +1614,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
         va->va_end = addr + size;
         va->vm = NULL;
 
+        trace_printk("-> alloc %lu size, align: %lu, vstart: %lu, vend: %lu\n", size, align, vstart, vend);
+
         spin_lock(&vmap_area_lock);
<snip>
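
(Note: trace_printk() output goes to the ftrace ring buffer, e.g.
/sys/kernel/debug/tracing/trace, rather than to dmesg.)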

--
Uladzislau Rezki

David Hildenbrand

Oct 10, 2022, 2:57:00 AM
to Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
>>> Maybe try to increase the module-section size to see if it solves the
>>> problem.
>>
>> What would be the easiest way to do that?
>>
> Sorry for late answer. I was trying to reproduce it on my box. What i
> did was trying to load all modules in my system with KASAN_INLINE option:
>

Thanks!

> <snip>
> #!/bin/bash
>
> # Exclude test_vmalloc.ko
> MODULES_LIST=(`find /lib/modules/$(uname -r) -type f \
> \( -iname "*.ko" -not -iname "test_vmalloc*" \) | awk -F"/" '{print $NF}' | sed 's/.ko//'`)
>
> function moduleExist(){
> MODULE="$1"
> if lsmod | grep "$MODULE" &> /dev/null ; then
> return 0
> else
> return 1
> fi
> }
>
> i=0
>
> for module_name in ${MODULES_LIST[@]}; do
> sudo modprobe $module_name
>
> if moduleExist ${module_name}; then
> ((i=i+1))
> echo "Successfully loaded $module_name counter $i"
> fi
> done
> <snip>
>
> as you wrote it looks like it is not easy to reproduce. So i do not see
> any vmap related errors.

Yeah, it's quite a mystery and only seems to trigger on these systems with
a lot of CPUs.

>
> Returning back to the question. I think you could increase the MODULES_END
> address and shift the FIXADDR_START little forward. See the dump_pagetables.c
> But it might be they are pretty compact and located in the end. So i am not
> sure if there is a room there.

That's what I was afraid of :)

>
> Second. It would be good to understand if vmap only fails on allocating for a
> module:
>
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index dd6cdb201195..53026fdda224 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1614,6 +1614,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> va->va_end = addr + size;
> va->vm = NULL;
>
> + trace_printk("-> alloc %lu size, align: %lu, vstart: %lu, vend: %lu\n", size, align, vstart, vend);
> +
> spin_lock(&vmap_area_lock);
> <snip>

I'll try grabbing a suitable system again and add some more debugging
output. Might take a while, unfortunately.

Uladzislau Rezki

Oct 10, 2022, 8:19:08 AM
to David Hildenbrand, Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
Yes, that makes sense -- especially to understand whether it fails in the
MODULES_VADDR - MODULES_END range or somewhere else. According to your trace
output it looks like that, but it would be good to confirm it by adding some traces.

BTW, the vmap code lacks good trace events. It is probably worth adding
some basic ones.

--
Uladzislau Rezki

David Hildenbrand

Oct 11, 2022, 3:52:48 PM
to Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
I was lucky to grab that system again. I compiled a custom 6.0 kernel in which I printk
all vmap allocation errors, including the range, similarly to what you suggested above
(but printk only on the failure path).

So these are the failing allocations:

# dmesg | grep " -> alloc"
[ 168.862511] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.863020] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.863841] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.864562] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.864646] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.865688] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.865718] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.866098] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.866551] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.866752] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867147] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867210] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867312] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867650] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867767] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867815] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.867815] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.868059] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.868463] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.868822] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.868919] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.869843] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.869854] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.870174] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.870611] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.870806] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.870982] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 168.879000] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.449101] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.449834] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.450667] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.451539] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.452326] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.453239] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.454052] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.454697] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.454811] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.455575] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.455754] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.461450] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.805223] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.805507] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.929577] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.930389] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.931244] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.932035] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.932796] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.933592] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.934470] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.935344] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 169.970641] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.191600] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.191875] -> alloc 40960 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.241901] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.242708] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.243465] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.244211] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.245060] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.245868] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.246433] -> alloc 40960 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.246657] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.247451] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.248226] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.248902] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.249704] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.250497] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.251244] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.252076] -> alloc 319488 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.587168] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 170.598995] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 171.865721] -> alloc 2506752 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400
[ 172.138557] -> alloc 917504 size, align: 4096, vstart: 18446744072639352832, vend: 18446744073692774400


Really looks like only module vmap space. ~ 1 GiB of vmap module space ...

I did try:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd6cdb201195..199154a2228a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,8 @@ early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);
+
 bool is_vmalloc_addr(const void *x)
 {
         unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -1574,7 +1576,6 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
         struct vmap_area *va;
         unsigned long freed;
         unsigned long addr;
-        int purged = 0;
         int ret;
 
         BUG_ON(!size);
@@ -1631,23 +1632,22 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
         return va;
 
 overflow:
-        if (!purged) {
+        if (atomic_long_read(&vmap_lazy_nr)) {
                 purge_vmap_area_lazy();
-                purged = 1;
                 goto retry;
         }
 
         freed = 0;
         blocking_notifier_call_chain(&vmap_notify_list, 0, &freed);
 
-        if (freed > 0) {
-                purged = 0;
+        if (freed > 0)
                 goto retry;
-        }
 
-        if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
+        if (!(gfp_mask & __GFP_NOWARN)) {
                 pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
                         size);
+                printk("-> alloc %lu size, align: %lu, vstart: %lu, vend: %lu\n", size, align, vstart, vend);
+        }
 
         kmem_cache_free(vmap_area_cachep, va);
         return ERR_PTR(-EBUSY);
@@ -1690,8 +1690,6 @@ static unsigned long lazy_max_pages(void)
         return log * (32UL * 1024 * 1024 / PAGE_SIZE);
 }
 
-static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);
-


But that didn't help at all. That system is crazy:

# lspci | wc -l
1117


What I find interesting is that we have these recurring allocations of similar sizes failing.
I wonder if user space is capable of loading the same kernel module concurrently to
trigger a massive amount of allocations, and module loading code only figures out
later that it has already been loaded and backs off.

My best guess would be that module loading is serialized completely, but for some reason,
something seems to go wrong with a lot of concurrency ...

Uladzislau Rezki

Oct 12, 2022, 12:36:12 PM
to David Hildenbrand, Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
OK. It is related to a module vmap space allocation when a module is
inserted. I wonder why it requires 2.5MB for a module? That seems like a lot
to me.

>
> Really looks like only module vmap space. ~ 1 GiB of vmap module space ...
>
If an allocation request for a module is 2.5MB we can load ~400 modules
having 1GB address space.
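
(Working through the numbers from your trace: vend - vstart =
18446744073692774400 - 18446744072639352832 = 1053421568 bytes, i.e. roughly
1 GiB, and 1053421568 / 2506752 is about 420, so only about 400 allocations of
the failing 2.5MB size fit in there even with no fragmentation at all.)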

"lsmod | wc -l"? How many modules your system has?
If an allocation fails, the next step is to drain outstanding vmap
areas. A caller does that from its own context and then retries one more
time; only after that is the failure message printed.

>
> # lspci | wc -l
> 1117
>
So you probably need a lot of modules in order to make your HW fully functional :)

>
> What I find interesting is that we have these recurring allocations of similar sizes failing.
> I wonder if user space is capable of loading the same kernel module concurrently to
> trigger a massive amount of allocations, and module loading code only figures out
> later that it has already been loaded and backs off.
>
If there is a request to allocate memory, it has to succeed
unless there is some error like no space or no memory.

>
> My best guess would be that module loading is serialized completely, but for some reason,
> something seems to go wrong with a lot of concurrency ...
>
lazy_max_pages() depends on the number of online CPUs. Probably something
related...
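
For reference, lazy_max_pages() scales with the CPU count roughly like this
(from mm/vmalloc.c, quoted from memory, so double-check against your tree):

<snip>
static unsigned long lazy_max_pages(void)
{
        unsigned int log;

        log = fls(num_online_cpus());

        return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}
<snip>

With 448 CPUs fls() gives 9, so up to roughly 9 * 32MB of vmap space can sit on
the lazy-purge list before a purge is forced; with 224 CPUs it is 8 * 32MB.
Not a huge difference, but it shows the dependency on the number of CPUs.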

I wrote a small patch to dump the modules address space when a failure occurs:

<snip v6.0>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 83b54beb12fa..88d323310df5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1580,6 +1580,37 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
                 kmem_cache_free(vmap_area_cachep, va);
 }
 
+static void
+dump_modules_free_space(unsigned long vstart, unsigned long vend)
+{
+        unsigned long va_start, va_end;
+        unsigned int total = 0;
+        struct vmap_area *va;
+
+        if (vend != MODULES_END)
+                return;
+
+        trace_printk("--- Dump a modules address space: 0x%lx - 0x%lx\n", vstart, vend);
+
+        spin_lock(&free_vmap_area_lock);
+        list_for_each_entry(va, &free_vmap_area_list, list) {
+                va_start = (va->va_start > vstart) ? va->va_start:vstart;
+                va_end = (va->va_end < vend) ? va->va_end:vend;
+
+                if (va_start >= va_end)
+                        continue;
+
+                if (va_start >= vstart && va_end <= vend) {
+                        trace_printk("  va_free: 0x%lx - 0x%lx size=%lu\n",
+                                va_start, va_end, va_end - va_start);
+                        total += (va_end - va_start);
+                }
+        }
+
+        spin_unlock(&free_vmap_area_lock);
+        trace_printk("--- Total free: %u ---\n", total);
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend.
@@ -1663,10 +1694,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
                 goto retry;
         }
 
-        if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
+        if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
                 pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
                         size);
 
+                dump_modules_free_space(vstart, vend);
+        }
+
         kmem_cache_free(vmap_area_cachep, va);
         return ERR_PTR(-EBUSY);
 }
<snip>

It would be good to understand whether we really run out of space.
Adding a print of lazy_max_pages() and vmap_lazy_nr would also be good.

--
Uladzislau Rezki

David Hildenbrand

Oct 13, 2022, 12:21:31 PM
to Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
>>
> OK. It is related to a module vmap space allocation when a module is
> inserted. I wounder why it requires 2.5MB for a module? It seems a lot
> to me.
>

Indeed. I assume KASAN can go wild when it instruments each and every
memory access.

>>
>> Really looks like only module vmap space. ~ 1 GiB of vmap module space ...
>>
> If an allocation request for a module is 2.5MB we can load ~400 modules
> having 1GB address space.
>
> "lsmod | wc -l"? How many modules your system has?
>

~71, so not even close to 400.

>> What I find interesting is that we have these recurring allocations of similar sizes failing.
>> I wonder if user space is capable of loading the same kernel module concurrently to
>> trigger a massive amount of allocations, and module loading code only figures out
>> later that it has already been loaded and backs off.
>>
> If there is a request about allocating memory it has to be succeeded
> unless there are some errors like no space no memory.

Yes. But as I found out we're really out of space because module loading
code allocates module VMAP space first, before verifying if the module
was already loaded or is concurrently getting loaded.

See below.

[...]
Thanks!

I can spot the same module getting loaded over and over again
concurrently from user space, only failing after all the allocations
when realizing that the module is in fact already loaded in
add_unformed_module(), failing with -EEXIST.

That looks quite inefficient. Here is how often user space tries to load
the same module on that system. Note that I print *after* allocating
module VMAP space.

# dmesg | grep Loading | cut -d" " -f5 | sort | uniq -c
896 acpi_cpufreq
1 acpi_pad
1 acpi_power_meter
2 ahci
1 cdrom
2 compiled-in
1 coretemp
15 crc32c_intel
307 crc32_pclmul
1 crc64
1 crc64_rocksoft
1 crc64_rocksoft_generic
12 crct10dif_pclmul
16 dca
1 dm_log
1 dm_mirror
1 dm_mod
1 dm_region_hash
1 drm
1 drm_kms_helper
1 drm_shmem_helper
1 fat
1 fb_sys_fops
14 fjes
1 fuse
205 ghash_clmulni_intel
1 i2c_algo_bit
1 i2c_i801
1 i2c_smbus
4 i40e
4 ib_core
1 ib_uverbs
4 ice
403 intel_cstate
1 intel_pch_thermal
1 intel_powerclamp
1 intel_rapl_common
1 intel_rapl_msr
399 intel_uncore
1 intel_uncore_frequency
1 intel_uncore_frequency_common
64 ioatdma
1 ipmi_devintf
1 ipmi_msghandler
1 ipmi_si
1 ipmi_ssif
4 irdma
406 irqbypass
1 isst_if_common
165 isst_if_mbox_msr
300 kvm
408 kvm_intel
1 libahci
2 libata
1 libcrc32c
409 libnvdimm
8 Loading
1 lpc_ich
1 megaraid_sas
1 mei
1 mei_me
1 mgag200
1 nfit
1 pcspkr
1 qrtr
405 rapl
1 rfkill
1 sd_mod
2 sg
409 skx_edac
1 sr_mod
1 syscopyarea
1 sysfillrect
1 sysimgblt
1 t10_pi
1 uas
1 usb_storage
1 vfat
1 wmi
1 x86_pkg_temp_thermal
1 xfs


For each of these loading requests, we'll reserve module VMAP space, and
free it only once we realize later that the module was already previously loaded.

So with a lot of CPUs we might end up trying to load the same module so
often at the same time that we actually run out of module VMAP space.
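
To illustrate the ordering (a rough sketch of the relevant steps in
load_module(), simplified and not the exact kernel code):

<snip>
        /*
         * Allocates the module vmap space (module_alloc() ->
         * __vmalloc_node_range()) and copies the sections into it.
         */
        mod = layout_and_allocate(info, flags);
        if (IS_ERR(mod))
                return PTR_ERR(mod);

        /*
         * Only here do we notice that a module with the same name already
         * exists or is concurrently being loaded; we then bail out with
         * -EEXIST and free the vmap space we just allocated.
         */
        err = add_unformed_module(mod);
        if (err)
                goto free_module;
<snip>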

I have a prototype patch that seems to fix this in module loading code.

Thanks!

Uladzislau Rezki

Oct 15, 2022, 5:23:20 AM
to David Hildenbrand, Uladzislau Rezki, Alexander Potapenko, Andrey Konovalov, linu...@kvack.org, Andrey Ryabinin, Dmitry Vyukov, Vincenzo Frascino, kasa...@googlegroups.com
OK, that explains the problem :) Indeed it is inefficient. Allocating first and only
later figuring out that the module is already there looks weird. Furthermore, an
attack from user space could be mounted this way.
Good! I am glad the problem can be solved :)

--
Uladzislau Rezki