We just had our second server reboot with excessive CPU usage through the metrics servlet on java. In both cases our thread dumps have been littered with runnable threads along the lines of the sample at the bottom of this mail. (These are never visible on normal execution) This really has a clear smell of a thread-safety problem. We have been fine studying the overall thread safety of the JMX collector code.We believe that the LinkedHashMap in the cache at https://github.com/prometheus/jmx_exporter/blob/master/collector/src/main/java/io/prometheus/jmx/JmxMBeanPropertyCache.java#L36 is in violation of JVM thread safety rules. The unsynchronized initialization of the LinkedHashMap in this class at line 46 does not guarantee that clients running on other threads will see the correct values inside this map, and even if they do it is exposed to hashmap "get" thread safety issues and potential CPU leakage
We have discussed the best way to fix this problem and are somewhat undecided as to what is the best approach. We will provide a patch, but we would appreciate your opinions on this issue up-front.Kristian
java.lang.Thread.State: RUNNABLE
at io.prometheus.jmx.JmxCollector$Receiver.recordBean(JmxCollector.java:373)
at io.prometheus.jmx.JmxScraper.processBeanValue(JmxScraper.java:199)
at io.prometheus.jmx.JmxScraper.scrapeBean(JmxScraper.java:163)
at io.prometheus.jmx.JmxScraper.doScrape(JmxScraper.java:117)
at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:473)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:190)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:223)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:144)
at io.prometheus.client.exporter.common.TextFormat.write004(TextFormat.java:22)
at io.prometheus.client.exporter.MetricsServlet.doGet(MetricsServlet.java:49)
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/526dcf66-8151-4d8d-9487-807cd9ebda08n%40googlegroups.com.
If the object had been subject to a safe publication that's true. But
aren't you basically working on an unsafely published LinkedHashMap ?
Kristian
On Tue, Aug 25, 2020 at 3:34 PM Brian Brazil
<brian....@robustperception.io> wrote:
>
> On Tue, 25 Aug 2020 at 14:26, Kristian Rosenvold <kristian....@gmail.com> wrote:
>>
>> I must admit that I am somewhat unsure if things have changed in
>> recent years, but for the longest time hashmap "get" has not been
>> thread safe due to potential bucket recalculations. Trying to google
>> around a little, I seem to be unable to find my old favourite sources,
>> but this SO post seems to indicate nothing has changed:
>> https://stackoverflow.com/questions/35321481/thread-safe-linkedhashmap-without-collections-synchronized
>
>
> Per https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html what we're doing is fine.
>
> Brian
>
>>
>>
>>
>> I have the crashes to prove *something* is wrong and I've inspected
>> most of the collector code looking for thread safety holes. This is
>> the only one I've found so far.
>>
>> Kristian
>>
>> On Tue, Aug 25, 2020 at 3:08 PM Brian Brazil
>> <brian....@robustperception.io> wrote:
>> >