Advice wanted on reducing p99 tail latency and scaling channel fan-out in Java RPC over Aeron


Вадик Рабочий

Apr 21, 2026, 6:02:23 PM
to mechanical-sympathy
Hi all,

I’m still pretty new to low-latency transport design, so some of this may be naive. I built a Java RPC-over-Aeron prototype as fast as I could to learn, got it into a working state, and now I feel like I’ve hit a wall. I no longer have a good intuition for which parts of the design are causing the remaining tail-latency and scaling issues, so I’d really appreciate advice from people who have seen similar systems before.

I’m not asking anyone to review the whole project or codebase. I’m mainly hoping for architectural/performance guidance based on the design shape and the benchmark numbers below.

Environment:
- WSL2
- Linux 6.6.87.2
- OpenJDK HotSpot 25.0.2
- Intel i7-13620H
- 16 vCPU visible to WSL
- payload = 32 bytes
- handler mode = OFFLOAD only
- idle strategy = YIELDING
- no listeners, no protocol handshake, no reconnect logic enabled during these runs
- closed-loop benchmark: one in-flight request per caller thread

High-level design:
- synchronous request/response RPC over Aeron UDP
- one logical RPC channel owns one `ConcurrentPublication`, one `Subscription`, one pending-call registry, one correlation-id generator, one heartbeat/liveness state, and one handler registry
- client call path:
  1. acquire reusable pending-call slot from a pool
  2. generate correlation id
  3. register `correlationId -> pendingCall` in a hash map protected by a short lock
  4. encode request into a thread-local direct staging buffer
  5. publish using `ConcurrentPublication.tryClaim()` fast path, fallback to `offer()`
  6. wait synchronously for response
- wait strategy is 3-phase: short spin, then yield, then park
- receive side uses a node-level shared RX poller
- channels are grouped by idle strategy
- each group has N long-lived poller lanes/threads
- each lane iterates its assigned channels and polls them
- one subscription is never polled concurrently by multiple threads
- response path:
  1. lookup/remove pending call by correlation id
  2. validate expected response type
  3. copy response payload into a reusable direct buffer inside the pending-call slot
  4. unpark the waiting caller thread
- handlers are offloaded, so user code is not intentionally running on the RX poller thread in these measurements
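
For concreteness, the synchronous wait in step 6 of the call path can be sketched roughly like this. This is not the project's actual `SyncWaiter`, just a minimal stdlib-only illustration of the spin -> yield -> park pattern described above (class name, field names, and the spin/yield counts are made up):

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical stand-in for the waiter described above: spin, then yield,
// then park until the RX poller delivers the response and unparks the caller.
final class SyncWaiterSketch {
    private static final int SPIN_TRIES  = 1_000;
    private static final int YIELD_TRIES = 100;

    private volatile boolean done;    // set by the RX poller thread
    private volatile Thread  waiter;  // caller thread to unpark, if parked

    // Called by the RX poller after copying the response into the slot.
    void complete() {
        done = true;                  // publish completion first...
        Thread t = waiter;            // ...then wake the caller if it parked
        if (t != null) {
            LockSupport.unpark(t);
        }
    }

    // Called by the synchronous caller right after publishing the request.
    void await() {
        for (int i = 0; i < SPIN_TRIES; i++) {    // phase 1: busy spin
            if (done) return;
            Thread.onSpinWait();
        }
        for (int i = 0; i < YIELD_TRIES; i++) {   // phase 2: yield to scheduler
            if (done) return;
            Thread.yield();
        }
        waiter = Thread.currentThread();          // phase 3: park until woken
        while (!done) {                           // loop guards spurious wakeups
            LockSupport.park(this);
        }
        waiter = null;
    }
}
```

The `done`/`waiter` pair being volatile is what makes the park phase safe: `complete()` writes `done` before reading `waiter`, and `await()` writes `waiter` before re-checking `done`, so at least one side always sees the other's write and a lost wakeup cannot occur.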

What I’m trying to understand:
1. how to reduce p99 when it stays much worse than p90
2. how to scale better when increasing the number of channel/stream pairs
3. how to choose the RX poller lane count, given that adding lanes can improve median latency while making the tails much worse

Benchmark results (microseconds):

| Scenario | p50 | p90 | p99 | p99.9 | Rate |
|---|---:|---:|---:|---:|---:|
| raw Aeron, 1 thread, 1 stream pair | 4.7 | 7.0 | 44.6 | 245.6 | ~99k |
| rpc-core, 1 thread, 1 channel, rx=1 | 6.5 | 16.5 | 78.1 | 204.8 | ~94k |
| rpc-core, 1 thread, 1 channel, rx=4 | 7.4 | 19.2 | 75.5 | 204.8 | ~82k |
| rpc-core, 8 threads, 1 channel, rx=4 | 40.4 | 121.1 | 273.7 | 987.6 | ~126k |
| rpc-core, 8 threads, 4 channels, rx=1 | 56.3 | 125.0 | 268.8 | 1001.0 | ~107k |
| rpc-core, 8 threads, 4 channels, rx=2 | 40.7 | 88.9 | 253.4 | 1281.0 | ~132k |
| rpc-core, 8 threads, 4 channels, rx=4 | 37.3 | 82.8 | 519.2 | 2914.3 | ~120k |
| rpc-core, 8 threads, 8 channels, rx=1 | 68.5 | 136.1 | 266.2 | 633.9 | ~93k |
| rpc-core, 8 threads, 8 channels, rx=2 | 56.3 | 116.4 | 332.5 | 1106.9 | ~99k |
| rpc-core, 8 threads, 8 channels, rx=4 | 45.0 | 117.4 | 797.2 | 2414.6 | ~89k |
| rpc-core, 8 threads, 8 channels, rx=2, lower target rate | 42.9 | 96.9 | 238.5 | 953.9 | ~93k |

What seems suspicious to me:
- extra RX poller threads do not help for a single channel
- with 8 callers on one channel, p99 is already about 2.25x p90
- with 4 channels, 2 RX poller threads gives the best balance I found
- with 4 or 8 channels, increasing RX poller threads can improve p50/p90 but make p99/p99.9 much worse
- at higher fan-out I seem to hit a throughput ceiling before reaching the requested rate

If you were looking at this kind of system, what would you investigate first?
- pending-call registry lock / correlation lifecycle
- park/unpark behavior in the synchronous waiter
- shared RX poller lane topology
- publication contention
- completion path / response copy
- scheduler effects from too many poller lanes
- something else more obvious?

If it is helpful and anyone is curious about the concrete implementation I’m referring to, the code is here:
https://github.com/VadimKrut/rpc-core

I’d be very grateful for any direction on what is most likely wrong, naive, or simply expensive in this kind of design.

Faraz Babar

Apr 21, 2026, 7:28:17 PM
to mechanica...@googlegroups.com
I did not expect to see AI slop on this group. Mechanical sympathy is a passion of mine, learnt by acutely understanding how and where the CPU spends each cycle, and this is not it. I don't want to discourage you either, but ask yourself: first, why are you using AI to generate tight code (AI will let you down every single time), and second, why bring it to this group if you don't even have a handle on the generated slop?

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/mechanical-sympathy/1f4ba9e9-5d10-4a38-b1cc-ec31ca8fbf7bn%40googlegroups.com.

Piyush A

Apr 21, 2026, 7:33:45 PM
to mechanica...@googlegroups.com
I've been part of this group a long time, mostly reading. We can't call everything AI slop now; this is just what writing looks like these days, and one can always ignore it. Grateful for this community.

Sincerely,
Piyush | 206-915-3736


Вадик Рабочий

Apr 22, 2026, 2:38:38 AM
to mechanica...@googlegroups.com
Hi, I'm sorry you got that impression. I do use AI for documentation and tests, but the core itself was conceived and structured by me. The ideas for the core were taken from Aeron, and the synchronization partly from Netty and partly from the implementation of virtual threads in Java. I can explain every part of the code inside the core, but I cannot see how the processor or the HotSpot compiler behaves, which is why I came to a group of engineers instead of to an AI. AI may be responsible for code refactoring and documentation, but the core is my own ideas, and no neural network can write code that approaches Aeron in performance.

Wed, 22 Apr 2026 at 02:33, Piyush A <papiyu...@gmail.com>:

Peter Veentjer

Apr 22, 2026, 3:25:47 AM
to mechanica...@googlegroups.com
Did you look with a profiler at what is going on?

Did you check if there is any GC going on? That can be a big cause of tail latency increase. 

I would definitely not rely on Windows for good performance, even if you use WSL2.

I would also not use any parking because that involves the OS scheduler. Typically you only want to spin.

It is difficult to suggest that you need to apply optimization X if it isn't clear where you currently stand. E.g. there is no point in replacing the metal car hood with a carbon-fiber one if the whole family is in the car; only once you have already optimized the car does that kind of tweak make sense.

I would take 20 steps back. The first thing I would determine is what is important: maximum throughput, or latency at some rate of requests. If you focus on maximum throughput, and hence typically high utilization, latency typically goes through the roof due to queueing effects. For a low-latency application, you typically have 2 extremes:

1) What is the maximum throughput I can get such that percentile X (e.g. p99.99) is <= 1 ms?

2) What is my latency distribution for a fixed rate of requests? This is often the easiest one to test for.

The next step is to start with the simplest possible system and then make tweaks to determine what improves performance and what doesn't.

Last but not least, RPC is typically terrible because a thread sits waiting for a response, due to the synchronous nature of RPC (unless you have async RPC). And if you measure latency by placing a stopwatch around every request start/completion, your actual latencies are probably significantly worse due to coordinated omission. Throughput benchmarking typically suffers from this because there is no intended schedule. So if you want low latency, I would ditch sync RPC unless you can use e.g. virtual threads.
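
To illustrate the coordinated-omission point: a closed-loop stopwatch hides the time a delayed request should already have been in flight. A minimal sketch of measuring against an intended fixed-rate schedule instead (the class name and the `doRequest` callback are hypothetical stand-ins, not anything from the thread's codebase):

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

// Measures each request from its *intended* start on a fixed-rate schedule,
// not from when the (possibly delayed) closed loop got around to sending it.
// `doRequest` is a hypothetical stand-in for one synchronous RPC call.
final class IntendedScheduleBench {
    // Returns per-request latencies in nanoseconds, sorted for percentiles.
    static long[] run(int requests, long periodNanos, LongUnaryOperator doRequest) {
        long[] latencies = new long[requests];
        long base = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            long intendedStart = base + i * periodNanos;  // the schedule
            while (System.nanoTime() < intendedStart) {   // wait for the slot
                Thread.onSpinWait();
            }
            doRequest.applyAsLong(intendedStart);
            // If one call stalls, later intended starts are already in the
            // past, so the queueing delay shows up in their latencies too.
            latencies[i] = System.nanoTime() - intendedStart;
        }
        Arrays.sort(latencies);
        return latencies;
    }
}
```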

Вадик Рабочий

Apr 22, 2026, 5:12:11 AM
to mechanica...@googlegroups.com
Hi, thanks for the reply. 

1) About profiling:
I reran the larger tests with async-profiler in wall / cpu / lock / alloc modes, and also checked JFR.

What I see so far:
- a lot of ThreadPark in the synchronous wait path
- noticeable GC activity
- one clearly unnecessary steady-state allocation in the RX path (`CopyOnWriteArrayList.iterator()` in `SharedReceivePoller`)
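
On the third point, one common zero-allocation shape for that poll loop is to hold the channels in a copy-on-write array and index it directly, since `CopyOnWriteArrayList.iterator()` allocates a fresh iterator on every call. A sketch under that assumption (the `Channel` type here is a made-up stand-in, not the project's real RX types):

```java
import java.util.Arrays;

// Copy-on-write *array* of channels: mutation copies (rare, off the hot
// path), while the poll loop does one volatile read and an indexed walk,
// allocating nothing per poll. `Channel` is a hypothetical stand-in.
final class ChannelLane {
    interface Channel { int poll(); }

    private volatile Channel[] channels = new Channel[0];

    synchronized void add(Channel c) {
        Channel[] next = Arrays.copyOf(channels, channels.length + 1);
        next[channels.length] = c;
        channels = next;              // publish the new snapshot
    }

    int pollAll() {                   // hot path: zero allocation
        Channel[] snapshot = channels;
        int work = 0;
        for (int i = 0; i < snapshot.length; i++) {
            work += snapshot[i].poll();
        }
        return work;
    }
}
```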

After profiling I already started optimizing some of the obvious things. But my main question is still about the compromise itself.

My synchronous model currently uses `spin -> yield -> park` in `SyncWaiter.await(...)`, so yes, parking is really there. In other places I use yield, but not in this waiting path. I also tried spin-only, but in my tests the difference between spin and yield was small, around 5%. On Windows it looked more noticeable, but on WSL it is almost negligible.

So the main thing I am trying to understand now is:
how to reduce the cost of waiting and make wakeup faster in a synchronous request-response model, without turning everything into constant heavy spinning and burning CPU all the time.

If I remove park/unpark and leave only spin or yield, CPU usage becomes too high. If I keep parking, the tail latency gets worse. This is the part I do not understand well enough yet: where is the right compromise here?

2) About Windows / WSL2:
Yes, I agree. I do not treat WSL2 as a final low-latency environment. I use it only as a first pass because right now I do not have access to a real Linux machine.

3) About the design itself:
I was inspired by Aeron and first built similar ideas on top of `DatagramSocket`, then kept moving from there. But I specifically need synchronous request-response, and this is exactly the place where I realized I do not yet understand the best design well enough.

With 1 channel and 1 thread I can get reasonably close to raw Aeron RTT, roughly within 20% including the full round trip. But once I add more channels and more caller threads, things become much less clear.

One important thing here is that I tried to isolate channel state as much as possible. In practice each channel has its own data and its own objects for the request-response path, so they do not share much state directly. The main thing they still share is that they all write into the same MediaDriver.

I also tried virtual threads, but in this workload they were worse in my tests.

And one more thing I wanted to ask:
does it make sense at all to look at native calls through FFM for this kind of hot path, for example for lower-level waiting / signaling, or even pinning hot channel paths to specific CPU cores, or is that the wrong direction at this stage and I should first focus on the compromise between waiting strategy, CPU load, and the obvious costs inside Java?

Wed, 22 Apr 2026 at 10:25, Peter Veentjer <alarm...@gmail.com>:

Вадик Рабочий

Apr 22, 2026, 6:23:37 AM
to mechanica...@googlegroups.com
One clarification from my side: I described this part incorrectly before.

I do use virtual threads, but specifically for offloaded handler tasks. I do not mean that the whole system is built entirely on virtual threads. In the OFFLOAD path, the handler task is submitted through an executor, and in the default setup that ends up using virtual threads.

I also tried other variants for the same place:
- fixed platform-thread pool
- channel-affine worker model
- and I also tested the synchronous waiter without park, leaving only spin + yield

What is interesting is that removing park from the waiter did not materially improve the benchmark results. p50 / p90 / p99 stayed almost the same, and in some runs the deeper tail was even a bit worse. So at least from these tests it does not look like park itself is the main blocking factor.

Wed, 22 Apr 2026 at 12:04, Вадик Рабочий <mne1...@gmail.com>:

Peter Veentjer

Apr 22, 2026, 7:49:35 AM
to mechanica...@googlegroups.com
Go back 20 steps and start with a super simple echo server using aeron. Ditch all your existing code for the moment.

So the load generator sends an echo message and puts the (intended) start time in nanoseconds in the payload. The server returns the payload as is. The generator then waits for the response and determines the latency from the timestamp in the response.

You can only use spinning (parking forbidden)/pinning/isolation/gc-free. 

So effectively there are only 2 threads you need to create:
1) load generator thread
2) server thread

And then measure what kind of latency you get on your system.

If all goes well, this latency should be pretty good. And then slowly start to add more features to match your intended design.

But RPC over networking is a disaster in the making; all the parking/unparking and limited amount of concurrency...  
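
The two-thread echo measurement above can be sketched as follows. Here the Aeron transport is replaced by two single-slot `AtomicLong` mailboxes purely so the sketch runs standalone; in the real test each mailbox would be an Aeron publication/subscription pair:

```java
import java.util.concurrent.atomic.AtomicLong;

// Two threads, spin-only, GC-free after setup. The load generator sends the
// start timestamp as the payload, the server echoes it back unchanged, and
// latency is computed from the echoed timestamp. The two AtomicLong
// "mailboxes" stand in for the Aeron publication/subscription pairs.
final class EchoSketch {
    private static final long EMPTY = Long.MIN_VALUE;
    private final AtomicLong toServer = new AtomicLong(EMPTY);
    private final AtomicLong toClient = new AtomicLong(EMPTY);

    long[] run(int messages) throws InterruptedException {
        Thread server = new Thread(() -> {        // 2) server thread
            for (int i = 0; i < messages; i++) {
                long payload;
                while ((payload = toServer.getAndSet(EMPTY)) == EMPTY) {
                    Thread.onSpinWait();          // spin for the next request
                }
                toClient.set(payload);            // echo the payload as is
            }
        });
        server.start();

        long[] latencies = new long[messages];    // 1) load generator thread
        for (int i = 0; i < messages; i++) {
            toServer.set(System.nanoTime());      // payload = start timestamp
            long echoed;
            while ((echoed = toClient.getAndSet(EMPTY)) == EMPTY) {
                Thread.onSpinWait();              // spin for the response
            }
            latencies[i] = System.nanoTime() - echoed;
        }
        server.join();
        return latencies;
    }
}
```

Because the loop is closed (one message in flight), the single-slot exchange cannot lose a value; with real Aeron in place of the mailboxes, the same loop shape gives the RTT baseline to compare against.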



Вадик Рабочий

Apr 22, 2026, 9:04:44 AM
to mechanica...@googlegroups.com
Thank you for the advice.

Regarding the echo service, I initially made one like that, and at the moment the implementation with 1 channel gives about the same result, plus or minus 1 or 2 percent. The problem is specifically when scaling channels.

Thank you, Peter Veentjer. At first I used a profiler, and then I stopped, thinking I wouldn't clutter things up any further.
Thanks to new runs with the profiler I improved the tails and the overall time even more. My mistake was in virtual-thread-per-task plus thread-local first-touch: with each new task, an empty ThreadLocalMap was initialized. Now I just pre-allocate 1024 initialized ones and add more as needed. This reduced time and leveled the tails in multi-channel mode.

Apparently, further optimization means fun weeks with the profiler.

Wed, 22 Apr 2026 at 14:49, Peter Veentjer <alarm...@gmail.com>: