JMX Server high CPU usage and thread safety

113 views
Skip to first unread message

Kristian Rosenvold

unread,
Aug 25, 2020, 8:17:24 AM8/25/20
to Prometheus Users
We just had our second server reboot with excessive CPU usage through the metrics servlet on java. In both cases our thread dumps have been littered with runnable threads along the lines of the sample at the bottom of this mail. (These are never visible on normal execution) This really has a clear smell of a thread-safety problem. We have been fine studying the overall thread safety of the JMX collector code.

We believe that the LinkedHashMap in the cache at  https://github.com/prometheus/jmx_exporter/blob/master/collector/src/main/java/io/prometheus/jmx/JmxMBeanPropertyCache.java#L36 is in violation of JVM thread safety rules. The unsynchronized initialization of the LinkedHashMap in this class at line 46 does not guarantee that clients running on other threads will see the correct values inside this map, and even if they do it is exposed to hashmap "get" thread safety issues and potential CPU leakage 

We have discussed the best way to fix this problem and are somewhat undecided as to what is the best approach. We will provide a patch, but we would appreciate your opinions on this issue up-front.

Kristian


java.lang.Thread.State: RUNNABLE
at io.prometheus.jmx.JmxCollector$Receiver.recordBean(JmxCollector.java:373)
at io.prometheus.jmx.JmxScraper.processBeanValue(JmxScraper.java:199)
at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:163)
at io.prometheus.jmx.JmxScraper.doScrape(JmxScraper.java:117)
at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:473)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:190)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:223)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:144)
at io.prometheus.client.exporter.common.TextFormat.write004(TextFormat.java:22)
at io.prometheus.client.exporter.MetricsServlet.doGet(MetricsServlet.java:49)

Brian Brazil

unread,
Aug 25, 2020, 9:08:43 AM8/25/20
to Kristian Rosenvold, Prometheus Users
On Tue, 25 Aug 2020 at 13:17, Kristian Rosenvold <kristian....@gmail.com> wrote:
We just had our second server reboot with excessive CPU usage through the metrics servlet on java. In both cases our thread dumps have been littered with runnable threads along the lines of the sample at the bottom of this mail. (These are never visible on normal execution) This really has a clear smell of a thread-safety problem. We have been fine studying the overall thread safety of the JMX collector code.

We believe that the LinkedHashMap in the cache at  https://github.com/prometheus/jmx_exporter/blob/master/collector/src/main/java/io/prometheus/jmx/JmxMBeanPropertyCache.java#L36 is in violation of JVM thread safety rules. The unsynchronized initialization of the LinkedHashMap in this class at line 46 does not guarantee that clients running on other threads will see the correct values inside this map, and even if they do it is exposed to hashmap "get" thread safety issues and potential CPU leakage 

The LinkedHashMap is never written to after it is inserted into the ConcurrentHashMap, only read from. So I'd expect that to be safe.

Brian
 

We have discussed the best way to fix this problem and are somewhat undecided as to what is the best approach. We will provide a patch, but we would appreciate your opinions on this issue up-front.

Kristian


java.lang.Thread.State: RUNNABLE
at io.prometheus.jmx.JmxCollector$Receiver.recordBean(JmxCollector.java:373)
at io.prometheus.jmx.JmxScraper.processBeanValue(JmxScraper.java:199)
at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:163)
at io.prometheus.jmx.JmxScraper.doScrape(JmxScraper.java:117)
at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:473)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:190)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:223)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:144)
at io.prometheus.client.exporter.common.TextFormat.write004(TextFormat.java:22)
at io.prometheus.client.exporter.MetricsServlet.doGet(MetricsServlet.java:49)

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/526dcf66-8151-4d8d-9487-807cd9ebda08n%40googlegroups.com.


--

Kristian Rosenvold

unread,
Aug 25, 2020, 9:26:51 AM8/25/20
to Brian Brazil, Prometheus Users
I must admit that I am somewhat unsure if things have changed in
recent years, but for the longest time hashmap "get" has not been
thread safe due to potential bucket recalculations. Trying to google
around a little, I seem to be unable to find my old favourite sources,
but this SO post seems to indicate nothing has changed:
https://stackoverflow.com/questions/35321481/thread-safe-linkedhashmap-without-collections-synchronized

I have the crashes to prove *something* is wrong and I've inspected
most of the collector code looking for thread safety holes. This is
the only one I've found so far.

Kristian

Brian Brazil

unread,
Aug 25, 2020, 9:49:49 AM8/25/20
to Kristian Rosenvold, Prometheus Users
On Tue, 25 Aug 2020 at 14:45, Kristian Rosenvold <kristian....@gmail.com> wrote:
If the object had been subject to a safe publication that's true. But
aren't you basically working on an unsafely published LinkedHashMap ?

What do you mean by unsafely published? I've never come across that term in my previous reading up on the Java memory model.

Brian
 

Kristian

On Tue, Aug 25, 2020 at 3:34 PM Brian Brazil
<brian....@robustperception.io> wrote:

>
> On Tue, 25 Aug 2020 at 14:26, Kristian Rosenvold <kristian....@gmail.com> wrote:
>>
>> I must admit that I am somewhat unsure if things have changed in
>> recent years, but for the longest time hashmap "get" has not been
>> thread safe due to potential bucket recalculations. Trying to google
>> around a little, I seem to be unable to find my old favourite sources,
>> but this SO post seems to indicate nothing has changed:
>> https://stackoverflow.com/questions/35321481/thread-safe-linkedhashmap-without-collections-synchronized
>
>

>
>>
>>
>>
>> I have the crashes to prove *something* is wrong and I've inspected
>> most of the collector code looking for thread safety holes. This is
>> the only one I've found so far.
>>
>> Kristian
>>
>> On Tue, Aug 25, 2020 at 3:08 PM Brian Brazil
>> <brian....@robustperception.io> wrote:
>> >
> --
> Brian Brazil
> www.robustperception.io


--
Reply all
Reply to author
Forward
0 new messages