Hey all,
I’m following up on behalf of @Carl Sandland to share our experience.
Our production Riemann environment consists of around 20 c5.9xlarge instances, receiving around 1.5-2 million events/sec.
In our testing, we found that upgrading from JDK 8 to JDK 11 provided some small improvements to stream latency, but nothing too spectacular.
However, once we changed our garbage collector from CMS to ZGC, we saw a significant improvement in stream latencies: a 15-20x reduction. We’re now seeing sub-millisecond latencies at the 99.9th percentile. Forwarding our metrics to Kafka used to be our most computationally expensive stream, hovering around 5 ms, but it is now consistently around 130 µs.
We saw a slight decrease in CPU load averages with JDK 11 and ZGC compared to JDK 8 and CMS. We tested briefly with JDK 11 and G1GC, but this performed comparably to JDK 11 and CMS. We also tested JDK 17 and ZGC, but this gave only very small improvements to 99.9th percentile stream latencies over JDK 11 and ZGC.
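For anyone repeating this comparison, it can be worth confirming which collector the JVM actually picked up before reading too much into the graphs. Below is a minimal sketch using the standard `java.lang.management` API. The class name `GcCheck` and its helper methods are made up for illustration, and the launch flags in the comments are the stock HotSpot ones, not necessarily the exact options we run with.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Riemann.
// Typical HotSpot flags for the collectors discussed above (assumed, adjust as needed):
//   CMS (JDK 8/11):  -XX:+UseConcMarkSweepGC
//   G1:              -XX:+UseG1GC
//   ZGC on JDK 11:   -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
//   ZGC on JDK 17:   -XX:+UseZGC
public class GcCheck {

    // Returns the names of the garbage collector MXBeans registered in this JVM.
    static List<String> activeCollectors() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // ZGC registers beans whose names start with "ZGC"; other collectors use
    // names such as "ParNew" / "ConcurrentMarkSweep" or "G1 Young Generation".
    static boolean looksLikeZgc(List<String> names) {
        for (String name : names) {
            if (name.startsWith("ZGC")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = activeCollectors();
        System.out.println("Collectors: " + names);
        System.out.println("ZGC active: " + looksLikeZgc(names));
    }
}
```

Running it with the same JVM options as the Riemann process should show whether the flag change took effect.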
One word of warning: ZGC appears to do some funny things with heap allocation, and reported memory usage may actually be higher than the physical memory used. More information here:
https://stackoverflow.com/a/62934057
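One rough way to sanity-check this is to compare what the JVM itself says about its heap with what the OS reports for the process. The sketch below assumes Linux (it reads `/proc/self/status`) and assumes the over-reporting shows up in OS-level figures such as RSS. The class name `HeapVsRss` and the method `parseVmRssKb` are hypothetical names for this example.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper: prints JVM-reported heap next to OS-reported RSS.
public class HeapVsRss {

    // Extracts the VmRSS value (in kB) from the lines of /proc/<pid>/status.
    // Returns empty if the field is missing or not a number.
    static OptionalLong parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Expected shape after splitting: "VmRSS:", "<number>", "kB"
                String[] parts = line.trim().split("\\s+");
                if (parts.length >= 2) {
                    try {
                        return OptionalLong.of(Long.parseLong(parts[1]));
                    } catch (NumberFormatException e) {
                        return OptionalLong.empty();
                    }
                }
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) throws IOException {
        // Heap figures as seen by the JVM itself.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("JVM heap used:      %d MiB%n", heap.getUsed() >> 20);
        System.out.printf("JVM heap committed: %d MiB%n", heap.getCommitted() >> 20);

        // Resident set size as seen by the kernel (Linux only).
        Path status = Paths.get("/proc/self/status");
        if (Files.isReadable(status)) {
            OptionalLong rssKb = parseVmRssKb(Files.readAllLines(status));
            if (rssKb.isPresent()) {
                System.out.printf("OS-reported RSS:    %d MiB%n", rssKb.getAsLong() >> 10);
            }
        }
    }
}
```

If the OS figure is far above the committed heap under ZGC, that is a hint the monitoring number is inflated and the host is not necessarily about to run out of memory.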
Attached are a few screenshots of our metrics showing performance improvements. The latency graph may look a bit strange, but that’s due to us performing a phased rollout on our hosts.
Standard disclaimer that the above statements are specific to our use case and may not apply to your use cases, but in any case ZGC is worth trying.
Thanks
Matt