I also did the thing and asked Claude about this. I'm not betting a lot on it being right, but here's what it thinks after I fed it my findings...
❯ Given all findings discussed so far, the current codebase and the error logged when the issue manifests:
* what is the most likely trigger for the issue?
* where to go from here in terms of dissection?
● Most likely trigger
Pulling everything together, the failure mode is structurally:
A monotonic growth of the bufferpool's shard count, driven by concurrent-allocation contention against a lock-free bufferpool_make_shard, terminating against the kernel module's hard MAX_SHM_AREAS=16 ceiling.
Why this rather than the alternatives:
- It's not a refcount imbalance — your per-shard patch demonstrated individual shards complete fill/drain/recycle cycles. The __free_* paths are doing their job.
- It's not the via-branch=auto or failure-route cleanup angle in kamailio — the docs and config audit show the SIP-side lifecycle is clean for the sbcprov path.
- It's not the DTX random-seq bug — rtpengine isn't transcoding for you; that codepath is dormant. The seq jumps you saw are an Asterisk/peer-side artefact, unrelated to memory.
- It's the only mechanism left that produces exactly this curve: shards recycle correctly, pin count climbs only at discrete events, plateaus in between, never decreases, and abruptly aborts at 16.
The smoking gun is the structure of bufferpool_make_shard itself (lib/bufferpool.c:143-203): bufferpool_new_shard invokes bp->alloc() (→ kernel_alloc → kernel_pin_memory) before the CAS that inserts the new shard into the pool. Multiple concurrent allocators that simultaneously find "no shard has room" each call bufferpool_make_shard, each pin 16 MB in the kernel, and then each succeed at inserting into a different slot. The losing thread's shard isn't discarded — it joins the pool, contributing +1 to nshms permanently. There is no path to give a shard back to the kernel.
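To make that interleaving concrete, here is a toy single-threaded replay of it. This is not rtpengine code: every name (toy_pool, toy_has_room, toy_make_shard, toy_replay) is made up for illustration, the slot claim is modelled with a fetch-add rather than the real CAS, and the "pin" is just a counter. It assumes the shape described above: check for room, pin, then insert, with no re-check in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SLOTS 16                 /* mirrors the MAX_SHM_AREAS ceiling */

struct toy_pool {
    atomic_bool full[MAX_SLOTS];     /* per-shard "full" flag */
    atomic_uint num_shards;          /* shards inserted so far */
    unsigned int pins;               /* stand-in for the kernel pin count */
};

/* Phase 1 of an allocation: does any existing shard have room? */
static bool toy_has_room(struct toy_pool *p) {
    unsigned int n = atomic_load(&p->num_shards);
    for (unsigned int i = 0; i < n; i++)
        if (!atomic_load(&p->full[i]))
            return true;
    return false;
}

/* Phase 2: pin first, then claim a slot. Nothing re-checks for room. */
static void toy_make_shard(struct toy_pool *p) {
    p->pins++;                                        /* "kernel pin" */
    unsigned int slot = atomic_fetch_add(&p->num_shards, 1);
    assert(slot < MAX_SLOTS);                         /* ceiling reached */
    atomic_store(&p->full[slot], false);              /* new shard is empty */
}

/* Replay: every caller runs phase 1 before any caller runs phase 2. */
static unsigned int toy_replay(unsigned int callers) {
    assert(callers < MAX_SLOTS);
    struct toy_pool p = { .pins = 1 };
    atomic_init(&p.num_shards, 1);
    atomic_init(&p.full[0], true);                    /* the only shard is full */

    bool needs_shard[MAX_SLOTS] = { false };
    for (unsigned int t = 0; t < callers; t++)
        needs_shard[t] = !toy_has_room(&p);           /* all see "no room" */
    for (unsigned int t = 0; t < callers; t++)
        if (needs_shard[t])
            toy_make_shard(&p);                       /* all pin */
    return p.pins;
}
```

With one full shard and four callers that all pass the room check before any of them inserts, toy_replay(4) ends at 5 pins where 2 would have been enough, i.e. n-1 = 3 extra shards.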
The acceleration "rare last week, 2–3 times per day since yesterday" is the part that's hardest to attribute without more data, but it's consistent with this trigger because the failure mode is inherently bursty: a single moment of concurrent fill-cliff against n worker threads adds up to n-1 phantom shards. Whatever changed yesterday made that moment more likely — could be a load step (more concurrent calls), a traffic-shape change (more allocation-heavy calls coinciding), or a code path that synchronises allocators (e.g., a periodic timer that touches all packet streams at once).
The MT#55283 commit cluster is consistent with this being a recently-introduced regression: bcbe751d MT#55283 make bufferpool (mostly) lock-free, the "delegate shm refcount to table" series, and 847d0043 revamp kernel shm strategy together rewrote both the shard accounting and the kernel-pin path while making the hot path lock-free — exactly the sequence of changes that would surface this kind of issue.
Where to go from here
In order of actionability:
1. Confirm or falsify the race hypothesis directly. One patch, three counters in bufferpool_make_shard:
static struct bpool_shard *bufferpool_make_shard(struct bufferpool *bp) {
    // ── instrument before pinning ──
    // snapshot how many shards exist and how many of them are marked full
    unsigned int snap_num = atomic_get_na(&bp->num_shards);
    unsigned int n_full = 0, n_has_space = 0;
    for (unsigned int i = 0; i < snap_num; i++) {
        struct bpool_shard *s = atomic_get_na(&bp->shards[i]);
        if (atomic_get(&s->full)) n_full++;
        else n_has_space++;
    }
    // pthread_t is an opaque type; cast it explicitly to match %lu
    ilog(LOG_WARN, "bufferpool_make_shard: num=%u full=%u space=%u tid=%lu",
         snap_num, n_full, n_has_space, (unsigned long) pthread_self());
    // ── end instrumentation ──
    struct bpool_shard *shard = bufferpool_new_shard(bp);
    ...
}
Then watch the log lines around a pin-count jump:
- If every make_shard call shows full == num and space == 0, every new shard was genuinely needed — the issue is real peak growth, not a race.
- If you see multiple make_shard calls within a millisecond from different tids, and at least one of them shows space > 0, the race is firing.
This one experiment cleanly decides whether the fix is operational (raise the limit) or code-level (fix the race).
2. If the race is real, the fix in bufferpool_make_shard is straightforward. Either re-check for a shard with free space after pinning and before inserting, or claim the slot with the CAS first and pin only afterwards, so that only the thread that owns the slot pins. Both are small patches; the second is structurally cleaner. Worth proposing upstream alongside the issue.
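A rough sketch of the "claim first, pin afterwards" idea, again as a toy model and not a patch against lib/bufferpool.c. The growing flag and all toy2_* names are hypothetical, and the pin is a plain counter. The assumption is that a caller who loses the claim can simply retry its allocation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SLOTS 16

struct toy_pool2 {
    atomic_bool full[MAX_SLOTS];
    atomic_uint num_shards;
    atomic_bool growing;      /* hypothetical: set while one caller grows the pool */
    unsigned int pins;        /* stand-in for the kernel pin count */
};

static bool toy2_has_room(struct toy_pool2 *p) {
    unsigned int n = atomic_load(&p->num_shards);
    for (unsigned int i = 0; i < n; i++)
        if (!atomic_load(&p->full[i]))
            return true;
    return false;
}

/* Returns true if this caller pinned a new shard, false if it should
 * just retry its allocation against the existing shards. */
static bool toy2_make_shard(struct toy_pool2 *p) {
    bool expected = false;
    /* Claim the right to grow before touching the kernel. */
    if (!atomic_compare_exchange_strong(&p->growing, &expected, true))
        return false;                         /* someone else is growing */

    bool pinned = false;
    if (!toy2_has_room(p)) {                  /* re-check while holding the claim */
        unsigned int slot = atomic_load(&p->num_shards);
        assert(slot < MAX_SLOTS);
        p->pins++;                            /* pin only once the slot is ours */
        atomic_store(&p->full[slot], false);
        atomic_store(&p->num_shards, slot + 1);   /* publish the new shard */
        pinned = true;
    }
    atomic_store(&p->growing, false);         /* release the claim */
    return pinned;
}

/* Same replay as the racy case: all callers decide to grow before any does. */
static unsigned int toy2_replay(unsigned int callers) {
    struct toy_pool2 p = { .pins = 1 };
    atomic_init(&p.num_shards, 1);
    atomic_init(&p.full[0], true);
    atomic_init(&p.growing, false);

    unsigned int wants = 0;
    for (unsigned int t = 0; t < callers; t++)
        if (!toy2_has_room(&p))
            wants++;
    for (unsigned int t = 0; t < wants; t++)
        toy2_make_shard(&p);
    return p.pins;
}
```

In this model four simultaneous callers end at 2 pins instead of 5, because the later callers find room on the re-check. Note that the flag is effectively a try-lock on the slow path, so it gives up a little of the lock-free design; whether that trade-off is acceptable is a question for upstream.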
3. Operational mitigation regardless of root cause. Two things to do now, since the abort is hitting 2–3× per day:
- Watch the "Memory pins:" value in /proc/rtpengine/<id>/status on a 1-minute cadence. Alert at 12. Plan a rolling rtpengine restart via cron (every N hours, where N is conservative enough that even worst-case growth keeps you below 14).
- Bump MAX_SHM_AREAS from 16 to 32 in kernel-module/nft_rtpengine.c:442 and rebuild the kernel module. This is a one-line patch that doubles your headroom and roughly doubles time-to-abort. It buys you days while you investigate, with negligible runtime cost (the array is per-table, sized once at table allocation).
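For the pin-count watch in the first bullet, the check itself could look roughly like the sketch below. It assumes the status file contains a line of the form "Memory pins: <number>"; that exact layout is a guess and should be verified against the real file. The function names and the threshold macro are invented here; 12 is the alert level suggested above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PIN_ALERT_THRESHOLD 12   /* alert level suggested above */

/* Scan status text for a "Memory pins:" line (assumed format) and
 * return its value, or -1 if no such line is found. */
static int parse_memory_pins(const char *status_text) {
    const char *key = "Memory pins:";
    const char *line = status_text;
    while (line && *line) {
        int value;
        line += strspn(line, " \t");               /* tolerate indentation */
        if (strncmp(line, key, strlen(key)) == 0 &&
            sscanf(line + strlen(key), "%d", &value) == 1)
            return value;
        line = strchr(line, '\n');                 /* advance to next line */
        if (line)
            line++;
    }
    return -1;
}

/* Non-zero when the pin count has reached the alert threshold. */
static int pins_need_alert(const char *status_text) {
    return parse_memory_pins(status_text) >= PIN_ALERT_THRESHOLD;
}
```

A cron job or small daemon would read the file into a buffer once a minute and call pins_need_alert() on it; a missing line returns -1 and therefore never alerts, which may or may not be what you want.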
4. Capture the workload at the moment of growth. Independent of the race vs. peak diagnosis, you want to know what allocation site is consuming a shard's 16 MB. A small patch adding a per-allocation site tag (one byte in the back-pointer header is enough — there are only six callsites of bufferpool_alloc* against shm_bufferpool) and a periodic histogram dump would tell you whether you're filling shards with stream stats, SSRC entries, or per-PT entries. That information narrows where, if at all, the per-call footprint could be reduced.
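The accounting side of such a site tag might be sketched as follows. The enum values are placeholders loosely based on the categories mentioned above (the real six callsites would each need their own tag), and none of these functions exist in rtpengine. The sketch only tracks live bytes per site; storing the tag in the back-pointer header is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical allocation-site tags; one byte is plenty. */
enum bp_site {
    SITE_STREAM_STATS,
    SITE_SSRC,
    SITE_PER_PT,
    SITE_OTHER,
    SITE_COUNT
};

static const char *const site_names[SITE_COUNT] = {
    "stream_stats", "ssrc", "per_pt", "other"
};

/* Live bytes per allocation site, updated without locks. */
static atomic_size_t site_bytes[SITE_COUNT];

/* Call where a tagged allocation succeeds. */
static void site_account(enum bp_site site, size_t len) {
    atomic_fetch_add(&site_bytes[site], len);
}

/* Call where the matching allocation is released. */
static void site_release(enum bp_site site, size_t len) {
    atomic_fetch_sub(&site_bytes[site], len);
}

/* Site currently holding the most live bytes. */
static enum bp_site site_largest(void) {
    enum bp_site best = SITE_STREAM_STATS;
    for (int i = 1; i < SITE_COUNT; i++)
        if (atomic_load(&site_bytes[i]) > atomic_load(&site_bytes[best]))
            best = (enum bp_site) i;
    return best;
}

/* Periodic histogram dump, e.g. from a timer. */
static void site_dump(FILE *out) {
    for (int i = 0; i < SITE_COUNT; i++)
        fprintf(out, "%s=%zu\n", site_names[i], atomic_load(&site_bytes[i]));
}
```

Dumping this at each pin-count jump would show which site's live bytes grew in the interval leading up to the new shard.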