Run riemann using JDK8... Cargo Cult or good advice?

Carl Sandland

Feb 16, 2022, 12:37:45 AM

to Riemann Users

Hi,

At my place of work we have this idea that Riemann has been specifically optimised to run on JDK 8. I haven't been able to find a definitive source for that idea, so I thought I'd ask here: is it still true today, and if so, are there any links I can follow up on?

We have moved all our other services to JDK11 containers.

I know the garbage collectors have been improved, which might offset older gains?

Thanks for any pointers; worst case we run some experiments which I can share here if anyone else wants to know.

Cheers,
Carl

Sanel Zukan

Feb 16, 2022, 7:36:06 AM

to Riemann Users

Hi,

I don't think I've seen any information online saying that Riemann is
specifically optimized for JDK8, and if you look at the tests [1], they are
regularly run against the major JDK versions.

But if there are any visible differences, I'd be curious to hear about them.

[1] https://app.circleci.com/pipelines/github/riemann/riemann

ap...@aphyr.com

Feb 16, 2022, 7:47:09 AM

to rieman...@googlegroups.com

Ha! Yes, when I was first optimizing Riemann performance, that was probably around JDK8, but that was a decade ago! I would absolutely give a more modern JVM a try. G1 and ZGC are quite good now and will probably reduce latency, though my understanding is that CMS remains an excellent choice for throughput.

--Kyle

Carl Sandland

Feb 16, 2022, 4:52:36 PM

to Riemann Users

Thanks for the info; CMS being a good fit complicates matters I guess as it was deprecated in JDK9 (apparently).

We operate a fleet of 20 monster servers running partitioned streams of metrics into riemann so even small differences in memory and cpu can add up.

Without wasting a lot of time, I'd guess "it's complicated" could be followed by "let's just move and monitor closely".

Cheers,

Carl

Kyle Kingsbury

Feb 16, 2022, 5:00:47 PM

to rieman...@googlegroups.com

On Wed, 2022-02-16 at 13:52 -0800, Carl Sandland wrote:
> Thanks for the info; CMS being a good fit complicates matters I guess
> as it was deprecated in JDK9 (apparently).
> We operate a fleet of 20 monster servers running partitioned streams
> of metrics into riemann so even small differences in memory and cpu
> can add up.
> Without wasting a lot of time I'd guess "it's complicated" could be
> followed by "let's just move and monitor closely".
> Cheers,
> Carl

Nothing in Riemann is actually optimized for a specific GC. I just
mention CMS because back when I was personally tuning Riemann heavily,
that was the collector I used. Try out G1 on, say, JDK17 (the collector
made significant advances over those years) and see how it goes. :-)

--Kyle

Vipin Menon

Mar 9, 2022, 12:35:32 AM

to Riemann Users

We upgraded our Riemann setup from JDK 8 to JDK 17, and it seems to be happy with it.

However, for reasons we are still trying to figure out, our Riemann setup is only happy with ParallelGC and MAXRam set to 75%. (Sadly, G1 and CMS brought down our Riemann VMs; the next round is to try ZGC.)
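
When switching collectors like this, it can help to confirm from inside the JVM which collector and heap cap are actually in effect. The following is a minimal sketch using only the standard `java.lang.management` API; the class name `GcReport` is made up for illustration. It assumes that "MAXRam set to 75%" refers to the HotSpot flag `-XX:MaxRAMPercentage=75`, which the post does not state explicitly.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Riemann: reports the collectors the
// running JVM has registered and the maximum heap it will try to use.
final class GcReport {

    /** One line per registered collector: name, collection count, accumulated time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: collections=%d, time=%dms",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
        System.out.println("max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it once per candidate configuration, for example `java -XX:+UseParallelGC -XX:MaxRAMPercentage=75 GcReport.java` and then again with `-XX:+UseG1GC` or `-XX:+UseZGC`, shows whether the flags took effect before any load is applied. On JDK 11 through 14, ZGC is experimental and also needs `-XX:+UnlockExperimentalVMOptions`.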

Matthew Millett

May 18, 2022, 1:45:49 AM

to Riemann Users

Hey all,
I’m following up here on behalf of @Carl Sandland regarding our experience here.

Our production Riemann environment consists of around 20 c5.9xlarge instances, receiving around 1.5-2 million events/sec.

In our testing, we found that upgrading from JDK 8 to JDK 11 provided some small improvements to stream latency, but nothing too spectacular.

However, once we changed our garbage collector from CMS to ZGC, we saw significant improvements in stream latencies, with a 15-20x reduction. We’re now seeing sub-millisecond latencies for our 99.9th percentile metrics. Forwarding our metrics to Kafka used to be our most computationally expensive stream, hovering around 5 ms, but is now consistently around 130 µs.
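
For readers who want to reproduce a 99.9th percentile figure from their own latency samples, here is a minimal sketch of the nearest-rank method. It is an illustration only: the post does not say how these percentiles were computed, and the class name `LatencyPercentile` is made up.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentile over raw latency samples.
final class LatencyPercentile {

    /**
     * Returns the smallest sample such that at least p percent of all
     * samples are less than or equal to it (nearest-rank definition).
     */
    static double nearestRank(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With few samples, a high percentile such as 99.9 simply returns the largest value, so the figure only becomes meaningful once there are well over a thousand samples per window.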

We experienced a slight decrease in CPU load averages with JDK 11 and ZGC as compared to JDK 8 and CMS. We tested briefly with JDK 11 and G1GC, but this performed comparably to JDK 11 and CMS. We also tested JDK 17 and ZGC, but this provided only very small improvements to 99.9 percentile stream latencies.

One word of warning: ZGC appears to do some funny things with heap allocation, and the reported memory usage may be higher than the physical memory actually used. More information here: https://stackoverflow.com/a/62934057
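
One way to cross-check figures reported by operating-system tools is to ask the JVM for its own heap accounting. This is a minimal sketch using the standard `MemoryMXBean`; the class name `HeapView` is made up, and it only shows the JVM's view of the heap, not a fix for how external tools count ZGC's memory mappings.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: the JVM's own view of heap usage, for comparison
// with figures from OS-level tools.
final class HeapView {

    /** Snapshot of heap usage as tracked by the JVM itself. */
    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Human-readable summary in MiB; max is reported as -1 if undefined. */
    static String summary() {
        MemoryUsage h = heap();
        long mib = 1024L * 1024L;
        long max = h.getMax() < 0 ? -1 : h.getMax() / mib;
        return String.format("heap used=%d MiB, committed=%d MiB, max=%d MiB",
                h.getUsed() / mib, h.getCommitted() / mib, max);
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

If the committed heap reported here is far below what the OS tool shows for the process, the difference is more likely an accounting artifact than real physical memory use.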

Attached are a few screenshots of our metrics showing performance improvements. The latency graph may look a bit strange, but that’s due to us performing a phased rollout on our hosts.

Standard disclaimer that the above statements are specific to our use case and may not apply to your use cases, but in any case ZGC is worth trying.

Thanks

Matt

riemann_host_latencies.png

riemann_host_cpu_usage.png

Sanel Zukan

May 18, 2022, 9:26:42 AM

to Matthew Millett, Riemann Users

Matthew Millett <matthew...@instaclustr.com> writes:
> Our production Riemann environment consists of around 20 c5.9xlarge
> instances, receiving around 1.5-2 million events/sec.

Impressive! May I ask how you distributed Riemann?

> Thanks
> Matt

Best,
Sanel

Carl Sandland

May 18, 2022, 6:35:37 PM

to Riemann Users

Hi Sanel,

It's an impressive clamp down on processing costs traded off against memory usage. Colour me extra paranoid now about GC, as if I wasn't already!

We shard incoming messages by cluster-id, via RMQ configuration, across a range of 'monitoring' nodes, each running a single Riemann core. This is done using a mix of round robin and static assignments (RMQ config). The core itself is unaware that any of this is happening; it just gets messages from its assigned RabbitMQ subscription. This is not ideal: it is highly managed and somewhat of an art form to balance, but it is effective for us.
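
As a rough model of the assignment scheme described above (static assignments for some clusters, round robin for the rest), here is a minimal sketch. It is purely illustrative: in the setup described, the routing lives in RabbitMQ configuration rather than in code, and the class `ShardAssigner`, the node names, and the sticky behaviour are all assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "static assignments plus round robin" sharding.
// Not RabbitMQ or Riemann API; names are invented for illustration.
final class ShardAssigner {
    private final List<String> nodes;
    private final Map<String, String> assignments = new HashMap<>();
    private int next = 0; // index of the next node to hand out

    ShardAssigner(List<String> nodes, Map<String, String> staticAssignments) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.nodes = List.copyOf(nodes);
        this.assignments.putAll(staticAssignments); // pinned clusters win
    }

    /**
     * Returns the monitoring node for a cluster. Pinned clusters keep their
     * static node; any other cluster gets the next node in round-robin order
     * and then stays on it.
     */
    synchronized String nodeFor(String clusterId) {
        return assignments.computeIfAbsent(clusterId, id -> {
            String node = nodes.get(next);
            next = (next + 1) % nodes.size();
            return node;
        });
    }
}
```

For example, with nodes `mon-1` and `mon-2` and a static entry pinning `big-cluster` to `mon-2`, new clusters alternate between the two nodes while `big-cluster` always lands on `mon-2`. Keeping each cluster on one node is what lets a single Riemann core see all events for that cluster without knowing about the sharding.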

Shameless plug, but I do find this quite a nice article series:
https://www.instaclustr.com/blog/the-introduction-of-apache-kafka-infrastructure/
https://www.instaclustr.com/blog/upgrades-to-our-internal-monitoring-pipeline-using-redis-as-a-cassandra-cache/
perhaps Matt can write another article ;)

It will be interesting to see what this additional 'headspace' on the Riemann core gives us in terms of scalability. (We keep managing more metrics across more nodes over time, and machines can only scale vertically so much!)
