On this - urgency is less as vmg_nomem() really can only occur with fault
injection afaict as it is a 'too small to fail' scenario, so we won't see
this IRL.
So we can relax a bit given LSF timing etc, and do something for 6.15rc1
I'd say :) then we can backport as needed obviously.
Also if this did somehow happen under such extreme memory pressure the
process would soon be torn down along with the system...
Sorry Jann, no security flaw here I would say :P
Oh lord... thanks for your analysis Pedro! But also, oh lord.
Yeah this 'pointer actually is an error value' thing is a thing I've seen
(and *ahem* caused) before.
It's funny because userfaultfd_clear_vma() -already has handling_ for errors...:
if (!IS_ERR(ret))
userfaultfd_reset_ctx(ret);
And yet...
> >
> > prev = NULL;
> > for_each_vma(vmi, vma) {
> > cond_resched();
> > BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> > !!(vma->vm_flags & __VM_UFFD_FLAGS));
> > if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> > prev = vma;
> > continue;
> > }
> >
> > vma = userfaultfd_clear_vma(&vmi, prev, vma,
> > vma->vm_start, vma->vm_end);
> > prev = vma;
> > }
> >
> > So, if uffd gets an IS_ERR(vma), it keeps going and takes that vma as the prev value,
> > which leads to that ERR_PTR(-ENOMEM) deref crash (-12 = -ENOMEM = 0xffffff4).
> > This situation is kind of awkward because ->release() errors don't mean a thing.
> > So, I have another idea (pasting for syzbot) which might just be cromulent.
> > Untested, but thoughts?
Yeah that seems to match the values, I did wonder about this 'underflowed
u64' value and whether it was an error one and yeah.. sigh.
> >
> > #syz test
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index d06453fa8aba..fb835d82eb84 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -2023,6 +2023,8 @@ void userfaultfd_release_all(struct mm_struct *mm,
> >
> > vma = userfaultfd_clear_vma(&vmi, prev, vma,
> > vma->vm_start, vma->vm_end);
> > + if (WARN_ON(IS_ERR(vma)))
> > + break;
>
> If this WARN_ON() was ever actually hit, I think we'd leave dangling
> pointers in VMAs? As much as Linus hates BUG_ON(), I personally think
> that would be a situation warranting BUG_ON(), or at least
> CHECK_DATA_CORRUPTION(). That said:
Yeah indeed, agreed.
>
> > prev = vma;
> > }
> > mmap_write_unlock(mm);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 71ca012c616c..b2167b7dc27d 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> > merged = vma_merge_existing_range(vmg);
> > if (merged)
> > return merged;
> > - if (vmg_nomem(vmg))
> > + if (vmg_nomem(vmg)) {
> > + /* If we can avoid failing the whole modification
> > + * due to a merge OOM and validly keep going
> > + * (we're modifying the whole VMA), return vma intact.
> > + * It won't get merged, but such is life - we're avoiding
> > + * OOM conditions in other parts of mm/ this way */
> > + if (start <= vma->vm_start && end >= vma->vm_end)
> > + return vma;
I do not like this solution at all, sorry.
I mean I get what you're doing and it's smart to try to find a means out of
this in general :) but let me explain my reasoning:
For one this is uffd's fault, and the fix clearly needs to be there.
But also, we _can't be sure_ vma is valid any more. The second it goes off
to vma_merge_existing_range() it might be removed, which is why it's
critical to only use 'merged'.
Now you might be able to prove that _right now_ it'd be ok, if you do this
check, because vma_complete() does the delete and only if either
vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
these happen _before_ vma_complete(), but that's an _implementation detail_
and now we've made an assumption that this is the case here.
An implicit effectively precondition on something unexpressed like that is
asking for trouble, really don't want to go that way.
> > return ERR_PTR(-ENOMEM);
> > + }
>
> Along the lines of your idea, perhaps we could add a parameter "bool
> never_fail" to vma_modify() that is passed through to
> vma_merge_existing_range(), and guarantee that it never fails when
> that parameter is set? Then we could also check that never_fail is
> only used in cases where no split is necessary. That somewhat avoids
> having this kind of check that only ever runs in error conditions...
Yeah hmmm, again this is _not where the problem is_. And we're doing it for
_one case only_, who _must_ succeed. Right?
Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
bool, hey what are helper structs for right? :P)
We coould add a special mode that says we __GFP_NOFAIL, we do that in
vms_abort_munmap_vmas() and man that was under similar circumstances (hey
remember that fun Liam? :)
But at the same time, it feels icky, and we probably don't want to
proliferate this pattern too much.
So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
release code could invoke.
Perhaps we could genericise the vms_abort_munmap_vmas():
mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
And make that available or some form of it, to do the 'simplest' thing in
this scenario.
If we called that say vma_emergency_clear() then we could do something
like:
void userfaultfd_release_all(struct mm_struct *mm,
struct userfaultfd_ctx *ctx)
{
...
for_each_vma(vmi, vma) {
unsigned long start = vma->vm_start;
unsigned long end = vma->vm_end;
...
vma = userfaultfd_clear_vma(&vmi, prev, vma,
vma->vm_start, vma->vm_end);
if (IS_ERR(vma)) {
/*
* We can no longer rely on VMA state, we must
* unconditionally leave a gap.
*/
vma_emergency_clear(mm, start, end);
vma_iter_reset(&vmi); /* Probably? */
prev = NULL; /* Probably? */
continue;
}
}
}
I mean this isn't fun either. I wonder if we (that probably mean sme)
should go audit cases like this.
Another possible solution is to add a flag that _explicitly asserts and
documents_ that you require that no VMA be removed before attempting to
allocate.
Or we could make that an _explicit_ assumption?
And then the uffd code itself could cache vma and take Pedro's solution on
that basis?
void userfaultfd_release_all(struct mm_struct *mm,
struct userfaultfd_ctx *ctx)
{
...
for_each_vma(vmi, vma) {
struct vm_area_struct *old = vma;
...
vma = userfaultfd_clear_vma(&vmi, prev, vma,
vma->vm_start, vma->vm_end);
if (IS_ERR(vma)) {
BUG_ON(vma != -ENOMEM); /* Sorry Linus! */
/*
* OK we assert above that vma must remain intact if we fail to allocate,
* We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
*/
prev = old;
}
...
}
}
I mean it's going to be dirty whichever way round we do this.
Thoughts guys?
But key point being - this is not quite as urgent as it might seem :)