Hi all,
I’m still pretty new to low-latency transport design, so some of this may be naive. I built a Java RPC-over-Aeron prototype quickly as a learning exercise, got it into a working state, and have now hit a wall: I no longer have a good intuition for which parts of the design are causing the remaining tail-latency and scaling issues. I’d really appreciate advice from people who have seen similar systems before.
I’m not asking anyone to review the whole project or codebase. I’m mainly hoping for architectural/performance guidance based on the design shape and the benchmark numbers below.
Environment:
- WSL2
- Linux 6.6.87.2
- OpenJDK HotSpot 25.0.2
- Intel i7-13620H
- 16 vCPU visible to WSL
- payload = 32 bytes
- handler mode = OFFLOAD only
- idle strategy = YIELDING
- no listeners, no protocol handshake, no reconnect logic enabled during these runs
- closed-loop benchmark: one in-flight request per caller thread
High-level design:
- synchronous request/response RPC over Aeron UDP
- one logical RPC channel owns one `ConcurrentPublication`, one `Subscription`, one pending-call registry, one correlation-id generator, one heartbeat/liveness state, and one handler registry
- client call path:
1. acquire reusable pending-call slot from a pool
2. generate correlation id
3. register `correlationId -> pendingCall` in a hash map protected by a short lock
4. encode request into a thread-local direct staging buffer
5. publish using `ConcurrentPublication.tryClaim()` fast path, fallback to `offer()`
6. wait synchronously for response
- wait strategy is 3-phase: short spin, then yield, then park
- receive side uses a node-level shared RX poller
- channels are grouped by idle strategy
- each group has N long-lived poller lanes/threads
- each lane iterates its assigned channels and polls them
- one subscription is never polled concurrently by multiple threads
- response path:
1. lookup/remove pending call by correlation id
2. validate expected response type
3. copy response payload into a reusable direct buffer inside the pending-call slot
4. unpark the waiting caller thread
- handlers are offloaded, so by design no user code should be running on the RX poller thread in these measurements
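To make the synchronous wait concrete, here is a minimal sketch of the 3-phase spin/yield/park strategy from the call path. All names (`PendingCall`, `await`, `complete`) are hypothetical stand-ins, not the actual classes in the repo; the real slot also carries the response buffer.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical pending-call slot: the caller spins, then yields, then parks
// until the RX poller marks the response complete and unparks it.
final class PendingCall {
    private final AtomicBoolean done = new AtomicBoolean();
    private volatile Thread waiter;

    void await(int spinLimit, int yieldLimit) {
        waiter = Thread.currentThread();
        int attempts = 0;
        while (!done.get()) {
            if (attempts < spinLimit) {
                Thread.onSpinWait();            // phase 1: busy spin
            } else if (attempts < spinLimit + yieldLimit) {
                Thread.yield();                 // phase 2: yield to the scheduler
            } else {
                LockSupport.park(this);         // phase 3: park until unparked
            }
            attempts++;
        }
        waiter = null;
    }

    // Called from the RX completion path after the payload copy.
    void complete() {
        done.set(true);                         // publish completion first...
        Thread t = waiter;
        if (t != null) {
            LockSupport.unpark(t);              // ...then wake the caller
        }
    }
}
```

Note the ordering: `done` must be set before `unpark`, and the waiter re-checks `done` after every `park` return, so a spurious wakeup or a `complete()` racing ahead of `await()` is still correct.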
What I’m trying to understand:
1. how to reduce p99 when it stays much worse than p90
2. how to scale better when increasing the number of channel/stream pairs
3. how to choose an RX poller lane count that improves median latency without making the tails much worse
Benchmark results (microseconds):
| Scenario | p50 | p90 | p99 | p99.9 | Rate |
|---|---:|---:|---:|---:|---:|
| raw Aeron, 1 thread, 1 stream pair | 4.7 | 7.0 | 44.6 | 245.6 | ~99k |
| rpc-core, 1 thread, 1 channel, rx=1 | 6.5 | 16.5 | 78.1 | 204.8 | ~94k |
| rpc-core, 1 thread, 1 channel, rx=4 | 7.4 | 19.2 | 75.5 | 204.8 | ~82k |
| rpc-core, 8 threads, 1 channel, rx=4 | 40.4 | 121.1 | 273.7 | 987.6 | ~126k |
| rpc-core, 8 threads, 4 channels, rx=1 | 56.3 | 125.0 | 268.8 | 1001.0 | ~107k |
| rpc-core, 8 threads, 4 channels, rx=2 | 40.7 | 88.9 | 253.4 | 1281.0 | ~132k |
| rpc-core, 8 threads, 4 channels, rx=4 | 37.3 | 82.8 | 519.2 | 2914.3 | ~120k |
| rpc-core, 8 threads, 8 channels, rx=1 | 68.5 | 136.1 | 266.2 | 633.9 | ~93k |
| rpc-core, 8 threads, 8 channels, rx=2 | 56.3 | 116.4 | 332.5 | 1106.9 | ~99k |
| rpc-core, 8 threads, 8 channels, rx=4 | 45.0 | 117.4 | 797.2 | 2414.6 | ~89k |
| rpc-core, 8 threads, 8 channels, rx=2, lower target rate | 42.9 | 96.9 | 238.5 | 953.9 | ~93k |
What seems suspicious to me:
- extra RX poller threads do not help for a single channel
- with 8 callers on one channel, p99 is already about 2.25x p90
- with 4 channels, 2 RX poller threads gives the best balance I found
- with 4 or 8 channels, increasing RX poller threads can improve p50/p90 but make p99/p99.9 much worse
- at higher fan-out I seem to hit a throughput ceiling before reaching the requested rate
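Since several of these observations involve the lane topology, here is a minimal sketch of the loop I mean by an “RX poller lane.” `ChannelPoller` is a hypothetical stand-in for a channel’s `Subscription` plus fragment handler; a real lane would call `subscription.poll(handler, limit)`.

```java
import java.util.List;

// Stand-in for one channel's subscription + fragment handler.
interface ChannelPoller {
    int poll(); // returns number of fragments processed
}

// One long-lived RX lane: iterates its assigned channels; each subscription
// belongs to exactly one lane, so it is never polled concurrently.
final class PollerLane implements Runnable {
    private final List<ChannelPoller> channels;
    private volatile boolean running = true;

    PollerLane(List<ChannelPoller> channels) {
        this.channels = channels;
    }

    @Override
    public void run() {
        while (running) {
            int work = 0;
            for (ChannelPoller c : channels) {
                work += c.poll();
            }
            if (work == 0) {
                Thread.yield(); // YIELDING idle strategy when the lane is idle
            }
        }
    }

    void stop() {
        running = false;
    }
}
```

My suspicion is that the tail blow-up at higher lane counts comes from more yielding lanes than free cores, but I have not confirmed that.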
If you were looking at this kind of system, what would you investigate first?
- pending-call registry lock / correlation lifecycle
- park/unpark behavior in the synchronous waiter
- shared RX poller lane topology
- publication contention
- completion path / response copy
- scheduler effects from too many poller lanes
- something else more obvious?
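On the first bullet: one alternative I’ve considered for the lock-protected `correlationId -> pendingCall` map is deriving the slot index directly from a monotonically increasing correlation id, so register/remove become plain array operations with no shared lock. This is a hypothetical sketch (not what the repo currently does); it assumes the number of in-flight calls never exceeds the capacity.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical lock-free registry: slot index = correlationId & (capacity - 1).
// Safe only while in-flight calls < capacity, which a closed-loop benchmark
// with a bounded caller count guarantees.
final class PendingCallRegistry<T> {
    private final AtomicLong nextId = new AtomicLong();
    private final AtomicReferenceArray<T> slots;
    private final int mask;

    PendingCallRegistry(int capacityPow2) {
        if (Integer.bitCount(capacityPow2) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new AtomicReferenceArray<>(capacityPow2);
        mask = capacityPow2 - 1;
    }

    long register(T call) {
        long id = nextId.getAndIncrement();
        slots.set((int) (id & mask), call); // no lock: slot is exclusively ours
        return id;
    }

    T remove(long correlationId) {
        int idx = (int) (correlationId & mask);
        T call = slots.get(idx);
        slots.set(idx, null);
        return call;
    }
}
```

Whether the lock is actually a bottleneck at these rates is exactly the kind of thing I’d like a second opinion on.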
If it is helpful and anyone is curious about the concrete implementation I’m referring to, the code is here:
https://github.com/VadimKrut/rpc-core

I’d be very grateful for any direction on what is most likely wrong, naive, or simply expensive in this kind of design.