On Tue, Dec 26, 2023 at 2:32 PM Chris Li <
chr...@kernel.org> wrote:
>
> Hi Nhat,
>
> Thanks for the first stab on this bug.
> > > RIP: 0010:scatterwalk_start include/crypto/scatterwalk.h:63 [inline] <========= This is the offending line.
> > > RIP: 0010:scatterwalk_pagedone include/crypto/scatterwalk.h:83 [inline]
> > > RIP: 0010:scatterwalk_pagedone include/crypto/scatterwalk.h:72 [inline]
> > > RIP: 0010:scatterwalk_copychunks+0x3e0/0x560 crypto/scatterwalk.c:50
>
> This is the call stack with an inline function. Very nice annotations
> on the inline function expansions call stacks.
>
> > > Code: f0 48 c1 e8 03 80 3c 08 00 0f 85 81 01 00 00 49 8d 44 24 08 4d 89 26 48 bf 00 00 00 00 00 fc ff df 48 89 44 24 10 48 c1 e8 03 <0f> b6 04 38 84 c0 74 08 3c 03 0f 8e 47 01 00 00 48 8b 44 24 08 41
> > > RSP: 0018:ffffc90003a8ecf0 EFLAGS: 00010202
> > > RAX: 0000000000000001 RBX: 0000000000000000 RCX: dffffc0000000000
> > > RDX: ffff88802785d940 RSI: ffffffff8465df74 RDI: dffffc0000000000
> > > RBP: 0000000000001000 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000002 R11: 82d8bd1b6060f805 R12: 0000000000000000
> > > R13: 0000000000000014 R14: ffffc90003a8ed88 R15: 0000000000001000
> > > FS: 00005555565c5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000000d5e538 CR3: 0000000079f3a000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <TASK>
> > > scatterwalk_map_and_copy+0x151/0x1d0 crypto/scatterwalk.c:67
> > > scomp_acomp_comp_decomp+0x3a3/0x780 crypto/scompress.c:149
> >
> > Looks like it's this line:
> >
> > scatterwalk_map_and_copy(scratch->dst, req->dst, 0, req->dlen,1);
>
> Yes indeed.
>
> The offending instruction is actually this one:
>
> The inlined part of the call stack:
> RIP: 0010:scatterwalk_start include/crypto/scatterwalk.h:63 [inline]
> <========= This is the offending line.
> static inline void scatterwalk_start(struct scatter_walk *walk,
> struct scatterlist *sg)
> {
> walk->sg = sg;
> walk->offset = sg->offset; <============== RIP pointer
> }
>
> RIP: 0010:scatterwalk_pagedone include/crypto/scatterwalk.h:83 [inline]
>
> static inline void scatterwalk_pagedone(struct scatter_walk *walk, int out,
> unsigned int more)
> {
> if (out) {
> struct page *page;
>
> page = sg_page(walk->sg) + ((walk->offset - 1) >> PAGE_SHIFT);
> flush_dcache_page(page);
> }
>
> if (more && walk->offset >= walk->sg->offset + walk->sg->length)
> scatterwalk_start(walk, sg_next(walk->sg)); <==========================
> }
>
> RIP: 0010:scatterwalk_pagedone include/crypto/scatterwalk.h:72 [inline]
> RIP: 0010:scatterwalk_copychunks+0x3e0/0x560 crypto/scatterwalk.c:50
>
> void scatterwalk_copychunks(void *buf, struct scatter_walk *walk,
> size_t nbytes, int out)
> {
> for (;;) {
> unsigned int len_this_page = scatterwalk_pagelen(walk);
> u8 *vaddr;
>
> if (len_this_page > nbytes)
> len_this_page = nbytes;
>
> if (out != 2) {
> vaddr = scatterwalk_map(walk);
> memcpy_dir(buf, vaddr, len_this_page, out);
> scatterwalk_unmap(vaddr);
> }
>
> scatterwalk_advance(walk, len_this_page);
>
> if (nbytes == len_this_page)
> break;
>
> buf += len_this_page;
> nbytes -= len_this_page;
>
> scatterwalk_pagedone(walk, out & 1, 1); <=========================
> }
> }
>
> then the unlined portion of the call stack:
> scatterwalk_map_and_copy+0x151/0x1d0 crypto/scatterwalk.c:67
>
> void scatterwalk_map_and_copy(void *buf, struct scatterlist *sg,
> unsigned int start, unsigned int nbytes, int out)
> {
> struct scatter_walk walk;
> struct scatterlist tmp[2];
>
> if (!nbytes)
> return;
>
> sg = scatterwalk_ffwd(tmp, sg, start);
>
> scatterwalk_start(&walk, sg);
> scatterwalk_copychunks(buf, &walk, nbytes, out); <===========================
> scatterwalk_done(&walk, out, 0);
> }
>
> scomp_acomp_comp_decomp+0x3a3/0x780 crypto/scompress.c:149
> if (dir)
> ret = crypto_scomp_compress(scomp, scratch->src, req->slen,
> scratch->dst, &req->dlen, *ctx);
> else
> ret = crypto_scomp_decompress(scomp, scratch->src, req->slen,
> scratch->dst, &req->dlen, *ctx);
> if (!ret) {
> if (!req->dst) {
> req->dst = sgl_alloc(req->dlen, GFP_ATOMIC, NULL);
> if (!req->dst) {
> ret = -ENOMEM;
> goto out;
> }
> }
> scatterwalk_map_and_copy(scratch->dst, req->dst, 0, req->dlen,
> <=======================
> 1);
> }
>
> crypto_acomp_compress include/crypto/acompress.h:302 [inline]
> zswap_store+0x98b/0x2430 mm/zswap.c:1666
> swap_writepage+0x8e/0x220 mm/page_io.c:198
> pageout+0x399/0x9e0 mm/vmscan.c:656
> shrink_folio_list+0x2f47/0x3ea0 mm/vmscan.c:1319
> reclaim_folio_list+0xe4/0x3a0 mm/vmscan.c:2104
> reclaim_pages+0x483/0x6a0 mm/vmscan.c:2140
> madvise_cold_or_pageout_pte_range+0x129e/0x1f70 mm/madvise.c:526
> walk_pmd_range mm/pagewalk.c:143 [inline]
> >
> The decompressed output can be bigger than input. However, here we
> specify the output size limit to be one page. The decompressor should
> not output more than the 1 page of the dst buffer.
>
> I did check the lzo1x_decompress_safe() function.
I think you meant lzo1x_1_do_compress() right? This error happens on
the zswap_store() path, so it is a compression bug.
> It is supposed to use the NEED_OP(x) check before it stores the output buffer.
I can't seem to find any check in compression code. But that function
is 300 LoC, with no comment :)
> However I do find one place that seems to be missing that check, at
> least it is not obvious to me how that check is effective. I will
> paste it here let other experts take a look as well:
> Line 228:
>
> if (op - m_pos >= 8) {
> unsigned char *oe = op + t;
> if (likely(HAVE_OP(t + 15))) {
> do {
> COPY8(op, m_pos);
> op += 8;
> m_pos += 8;
> COPY8(op, m_pos);
> op += 8;
> m_pos += 8;
> } while (op < oe);
> op = oe;
> if (HAVE_IP(6)) {
> state = next;
> COPY4(op, ip); <========================= This COPY4 does not have
> obvious NEED_OP() check. As far as I can tell, this output is not
> converted by the above HAVE_OP(t + 15)) check, which means to protect
> the unaligned two 8 byte stores inside the "do while" loops.
> op += next;
> ip += next;
> continue;
> }
> } else {
> NEED_OP(t);
> do {
> *op++ = *m_pos++;
> } while (op < oe);
> }
> } else
>
>
> >
> > Not 100% sure about linux kernel's implementation though. I'm no
> > compression expert :)
>
> Me neither. Anyway, if it is a decompression issue. For this bug, it
> is good to have some debug print assert to check that after
> decompression, the *dlen is not bigger than the original output
> length. If it does blow over the decompression buffer, it is a bug
> that needs to be fixed in the decompression function side, not the
> zswap code.
But regardless, I agree. We should enforce the condition that the
output should not exceed the given buffer's size, and gracefully fails
if it does (i.e returns an interpretable error code as opposed to just
walking off the cliff like this).
I imagine that many compression use-cases are best-effort
optimization, i.e it's fine to fail. zswap is exactly this - do your
best to compress the data, but if it's too incompressible, failure is
acceptable (we have plenty of fallback options - swapping, keeping the
page in memory, etc.).
Furthermore, this is a bit of a leaky abstraction - currently there's
no contract on how much bigger can the "compressed" output be with
respect to the input. It's compression algorithm-dependent, which
defeats the point of the abstraction. In zswap case, 2 * PAGE_SIZE is
just a rule of thumb - now imagine we use a compression algorithm that
behaves super well in most of the cases, but would output 3 *
PAGE_SIZE in some edge cases. This will still break the code. Better
to give the caller the ability to bound the output size, and
gracefully fail if that bound cannot be respected for a particular
input.
>
> Chris