--
You received this message because you are subscribed to a topic in the Google Groups "AtoM Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ica-atom-users/aW9dDL1TWGc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ica-atom-user...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/ica-atom-users/cd0860ce-4f28-449b-98a1-763e91c7ecd3n%40googlegroups.com.

Hi Edgar,
Thanks, that's a thorough response, and it's good to hear the sibling-sort field and the transaction work are in hand. And I'll happily concede the coexistence decision: I'd underweighted the migration lock window.
For a multi-million-object instance a 24h+ transition is a real operational constraint, and a config flag that lets large sites stage the migration, small sites (<20k) skip it, and anyone roll back is clearly the more pragmatic call. Agreed.
On the subtree-move reindex, here's an idea that may give you the middle ground 4a and 4b didn't, while keeping the denormalised ancestors field (so aggregations and ACL filters that depend on it don't have to change):
The key observation is that on a move, the ancestor delta is identical for every descendant in the moved subtree. Moving node N from P1 to P2 only changes the prefix above N: every descendant loses N's old ancestor chain and gains the new one (P2 plus P2's ancestors), while the portion of the chain within the subtree (N downward) is unchanged. So you don't need to recompute each document from the closure table; you can apply one delta to the whole subtree server-side:
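POST <index>/_update_by_query?wait_for_completion=false&conflicts=proceed
{
  "query": { "term": { "ancestors": <id of N> } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.ancestors.removeAll(params.oldPrefix); ctx._source.ancestors.addAll(0, params.newPrefix);",
    "params": { "oldPrefix": [<ancestors of N before the move>], "newPrefix": [<P2 + its ancestors>] }
  }
}
The term query matches every descendant of N, because each one already lists N in its ancestors. Two things to check: if a document's ancestors field does not include the document's own id, N's own document has to be matched separately (for example with an extra ids clause), and because addAll(0, ...) prepends, newPrefix must be passed in the same order the field is stored in.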
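Selecting the subtree with term: ancestors = N sidesteps the 65k terms limit entirely (no need to pass the descendant id list), and the whole rewrite is a single async request that ES/OpenSearch executes server-side, with no per-document roundtrips from PHP and no per-document DB reads. It works on both ES 5.x and OpenSearch, though as far as I recall ES 5.x releases before 5.6 expect the script key to be inline rather than source, so that is worth checking against your version.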
It layers naturally on your 4a: the Gearman job issues this one _update_by_query instead of looping documents. Two caveats: under concurrent moves on overlapping subtrees, you'd want conflicts=proceed plus a re-run (or version_type=external as you already planned), and if you prefer maximum robustness over speed, you can keep the per-document recompute-from-closure path as a fallback. But for the common case it replaces S per-document round trips (S being the subtree size) with a single request; the cluster still rewrites S documents, but without the per-document PHP and DB overhead.
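To make the "same delta for every descendant" claim concrete, here is a minimal Python sketch that mirrors the Painless script on a toy tree and compares the result with a full recompute. The tree, the node ids and the helper names (ancestor_chain, apply_move_delta) are illustrative assumptions, not AtoM code, and it assumes the ancestors field is stored root first and excludes the node itself.

```python
from typing import Dict, List, Optional


def ancestor_chain(parent: Dict[int, Optional[int]], node: int) -> List[int]:
    """Ancestors of `node`, root first, excluding `node` itself."""
    chain = []
    current = parent[node]
    while current is not None:
        chain.append(current)
        current = parent[current]
    chain.reverse()
    return chain


def apply_move_delta(ancestors: List[int], old_prefix: List[int],
                     new_prefix: List[int]) -> List[int]:
    """Python mirror of the script: removeAll(oldPrefix), then addAll(0, newPrefix)."""
    removed = set(old_prefix)
    kept = [a for a in ancestors if a not in removed]
    return list(new_prefix) + kept


# Toy tree (hypothetical ids): 1 is the root, 2 and 3 are its children,
# 4 sits under 2, 5 and 6 sit under 4, and 7 sits under 5.
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 4, 6: 4, 7: 5}
n, new_parent = 4, 3  # move N=4 from under 2 to under 3

old_prefix = ancestor_chain(parent, n)                           # ancestors of N before the move
new_prefix = ancestor_chain(parent, new_parent) + [new_parent]   # P2's ancestors plus P2

# "Indexed" ancestors before the move, one entry per document.
docs_before = {node: ancestor_chain(parent, node) for node in parent}

# Equivalent of term: ancestors = N. Note that it matches descendants only, not N itself.
subtree = [node for node, anc in docs_before.items() if n in anc]

# Apply the single delta to the descendants and, separately, to N's own document.
targets = subtree + [n]
updated = {node: apply_move_delta(docs_before[node], old_prefix, new_prefix)
           for node in targets}

# Ground truth: perform the move, then recompute every chain from scratch.
parent[n] = new_parent
recomputed = {node: ancestor_chain(parent, node) for node in targets}

print(updated == recomputed)
```

The comparison only holds because ids are unique within a chain and the move target is outside the moved subtree; a real job would need to validate the latter before issuing the request.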
On the performance report (2026-05-27), I've now been through it, and it makes the write case conclusively. Testing two real production datasets at 60k and 600k, with ES isolated via the on/off passes, is exactly the right design, and the 10× volume step is the decisive result: A3 going from 341× to 4,092× as nodes scale demonstrates O(n) vs O(1) write cost directly, not merely "closure is faster." The candour about where Nested Set wins (B1/B2 point reads, sub-millisecond) makes the rest credible. A few constructive observations and one data gap that connects to the ES question above:
1. Concurrency would strengthen the case further. The measurements are single-operation latency (10 sequential reps). The real production pain of Nested Set is that an lft/rgt renumber locks the table and blocks all other users, so under concurrent load the write advantage is even larger than the single-thread numbers suggest. A small concurrency pass is unlikely to weaken the case and would probably make the gap more dramatic.
2. The large subtree move at scale is the missing number, and it's the one most relevant to the ES discussion above. A4 (move-subtree) ran only at 60k with ~9 and ~84 descendants; moving a ~10k-node subtree in a 600k tree wasn't measured. That's precisely the scenario that drives the expensive ancestors reindex, so there's currently no closure-side figure for the case the 4a/_update_by_query strategy is meant to address. It would be very useful to have it.
3. B3's 473× is partly the Propel path, not the data structure alone. Since the Nested Set side hydrates full Propel objects with i18n N+1 queries, B3 is really "hydrated objects vs a JOIN returning rows." It's a fair as-AtoM-actually-behaves comparison, but worth framing as an implementation win rather than a purely structural one (a raw WHERE lft BETWEEN returning columns would also be fast). The memory_peak_kb figures for B3 would be a nice complement here: closure should avoid the large allocation of hydrating thousands of objects.
4. Closure-table size isn't reported. Given O(n·depth) rows, the 600k dataset is presumably several million closure rows, and you note A5-delete slows at 600k for exactly that reason. Publishing the row counts and table/buffer-pool footprint would round out the picture and likely explain the B2 cold-cache variance (P95 ≈ 35 ms).
5. The migration build looks like it could be much faster. ~1,000 records/30s (17h for 2M) is the unoptimised per-record path. A set-based build (seed the depth-0 self rows, then iteratively INSERT the depth+1 rows by joining the closure to the parent edges) is typically orders of magnitude faster and could shrink the maintenance window from hours to minutes, which also softens the coexistence trade-off.
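For points 4 and 5, this is roughly what I mean by a set-based build, sketched in Python against an in-memory SQLite database so it runs anywhere. The table and column names (node, closure, parent_id, ancestor, descendant, depth) are placeholders I made up for the example, not the actual AtoM schema, and the MySQL version would need its own indexing and batching choices. It issues one INSERT ... SELECT per tree level rather than one pass per record, and returns the closure row count, which is the figure point 4 asks for.

```python
import sqlite3


def build_closure(conn: sqlite3.Connection) -> int:
    """Rebuild closure(ancestor, descendant, depth) level by level; return its row count."""
    cur = conn.cursor()
    cur.execute("DELETE FROM closure")
    # Depth 0: every node is its own ancestor.
    cur.execute("INSERT INTO closure (ancestor, descendant, depth) SELECT id, id, 0 FROM node")
    depth = 0
    while True:
        # Extend every depth-d path by one parent->child edge to get the depth d+1 rows.
        cur.execute(
            """
            INSERT INTO closure (ancestor, descendant, depth)
            SELECT c.ancestor, n.id, c.depth + 1
            FROM closure AS c
            JOIN node AS n ON n.parent_id = c.descendant
            WHERE c.depth = ?
            """,
            (depth,),
        )
        added = cur.execute(
            "SELECT COUNT(*) FROM closure WHERE depth = ?", (depth + 1,)
        ).fetchone()[0]
        if added == 0:
            break  # no deeper level exists, so the closure is complete
        depth += 1
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM closure").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.execute(
    "CREATE TABLE closure (ancestor INTEGER, descendant INTEGER, depth INTEGER, "
    "PRIMARY KEY (ancestor, descendant))"
)
# Toy tree (hypothetical ids): 1 is the root, 2 and 3 under 1, 4 under 2, 5 and 6 under 4, 7 under 5.
conn.executemany(
    "INSERT INTO node (id, parent_id) VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 2), (5, 4), (6, 4), (7, 5)],
)

rows = build_closure(conn)
print(rows)  # one row per (node, ancestor-or-self) pair
```

The number of statements grows with tree depth, not node count, which is where the speed-up over a per-record loop would come from. The row count equals the sum over all nodes of (depth + 1), so it can be estimated before migrating.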
None of these change the conclusion. The report already establishes that Nested Set is unviable past ~30–50k nodes; they'd just make it airtight and fill in the one scenario (large-subtree move at scale) that both documents currently leave open.
And yes, we'd be glad to collaborate. I'm happy to review the refactored 2.10 code, particularly the lft/rgt consumer inventory and the ES integration. One honest note on testing: our own instances are modest in size, so where we can help most is code review and functional-correctness testing rather than million-row performance runs. If that's useful, send the branch over.
Groete / Regards
Johan Pieterse
082 337-1406