Implementing rate limiting in Riemann


Vipin Menon

Mar 9, 2022, 12:50:23 AM
to Riemann Users
Our Riemann setup is currently running on 48 CPUs and 96 GB RAM (a c5.12xlarge EC2 instance). As of now it's happily running with a stream rate of 100k, CPU at 20%, and memory at around 20 GB. Essentially we have a large VM for our usual flow of traffic.

However, a few days back there was a surge of metrics where the stream rate touched 200k in less than 5 minutes. CPU shot up to 85-90% (after which we started losing metrics) and eventually the app restarted. During that time the Netty queue went from 0 to about 300k and memory touched 75 GB.

The first question is: is there any way to debug why this occurred despite the large CPU count? With almost twice the metrics, CPU usage shot up by more than 4x.

Additionally, are there ways to prevent this, e.g. some kind of rate-limiting setup?

- Regards
Vipin

Jordan Braiuka

Mar 9, 2022, 6:59:56 PM
to Riemann Users
We have seen similar things on some of our Riemann servers. Generally it is the result of doing some kind of aggregation at one point, which forces all the events into a single instance of a stream, which in turn forces all that work onto effectively a single core (which then thrashes the CPU quite hard, and everything starts to back up from there).
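
To make that concrete, here's an illustrative riemann.config fragment (the stream choice and field name are just examples, not your actual config): a single aggregating stream like (rate ...) is one shared piece of state that every incoming event has to pass through, whereas splitting it with (by ...) gives each host its own copy:

(streams
  ; funnel: every event updates this one rate stream's shared state
  (rate 5 prn)

  ; fan-out: one independent rate stream per distinct :host
  (by :host
    (rate 5 prn)))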

We find that the easiest way to find where this is happening is to use the built-in instrumentation, which can tell us specifically which part of our metrics pipeline is slowing down.
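
For reference, something along these lines in riemann.config exposes those internal metrics (the 5-second interval and the service regex are just examples of how to pick out the events Riemann emits about itself):

; report internal metrics every 5 seconds
(instrumentation {:interval 5})

(let [index (index)]
  (streams
    ; index Riemann's own events (streams latency, netty queue size, ...)
    (where (service #"^riemann ")
      index)))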

In terms of rate limiting - you could utilise either Kafka or RabbitMQ as a buffer in front of your Riemann servers if you have a bursty setup.
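
If you go the Kafka route, a sketch like this (assuming a Riemann version with kafka-consumer available in riemann.config; the broker address, group id and topic are placeholders) lets the topic absorb bursts while Riemann consumes at its own pace:

; Riemann pulls events from a Kafka topic instead of (or in addition to)
; accepting them directly, so the topic buffers the burst.
(kafka-consumer {:consumer.config {:bootstrap.servers "kafka:9092"
                                   :group.id          "riemann"}
                 :topics ["riemann-events"]})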

Cheers

Carl Sandland

Mar 9, 2022, 9:09:51 PM
to Riemann Users
Sounds like a fun little game lined up here: see what kind of incoming stream burst rate we can configure an available AWS instance to handle.
Another interesting measure would be "cost per metric" in AWS costs.
I'd also be very interested to hear how your GC experiments from the other post go.
... it might be cheaper to pick an inexpensive instance size and attempt to maximise throughput ;) It does appear bursts can push it too far.

By the way, my colleague Greg says: "'thread contention on a shared data structure' is probably more accurate than 'everything goes to 1 CPU' - there are still multiple threads, it's just that they all try to get at the same thing." If anyone has tricks to handle bursts better, that would be interesting; of course, as Jordan says, you can add external buffering in front of Riemann.

ap...@aphyr.com

Mar 9, 2022, 9:42:43 PM
to rieman...@googlegroups.com
It's been many years since I worked on Riemann's streams, but IIRC I left some very low-hanging fruit in terms of performance. If you can localize the issue to a particular kind of stream, let me know and I'll see what I can do. A YourKit snapshot from a running process would be especially useful. :-)

--Kyle


Avishai Ish-Shalom

Jun 13, 2022, 8:38:02 AM
to Riemann Users
  • You might want to look at GC metrics first. In any case, I'd limit the Netty executor queue by setting the "io.netty.eventexecutor.maxPendingTasks" system property - note that the actual size of the queue is maxPendingTasks * CPU count. This somewhat breaks applicative backpressure, but it's better than having your Riemann instance go into GC pressure and thrashing...
  • JDK 8 is notoriously bad with large heaps, so if that's what you are running, limit the heap to 32 GB or upgrade to JDK 17/18 and use ZGC.
  • There are some performance pain points (as @aphyr wrote, lots of work remains) which pop up under high concurrency. I profiled with JFR and discovered lots of low-hanging fruit.
  • As a last resort, you can always wrap your stream config with `(throttle)` to rate limit - a minimal sketch follows below.
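
A minimal sketch of that last option (the numbers are placeholders, not recommendations), with the Netty queue flag from the first point noted as a JVM option:

; start the JVM with e.g. -Dio.netty.eventexecutor.maxPendingTasks=16384
; (actual queue capacity = that value * CPU count)
(let [index (index)]
  (streams
    ; pass at most 100000 events per 1-second window, drop the rest
    (throttle 100000 1
      index)))

Note that throttle simply drops events over the limit, so this trades completeness for stability.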