[JIRA] (JENKINS-59966) Prometheus Plugin: Causes StackOverFlowerError

tpoerio@argo.ai (JIRA)

Oct 28, 2019, 1:22:02 PM
to jenkinsc...@googlegroups.com
Tony Poerio created an issue
 
Jenkins / Bug JENKINS-59966
Prometheus Plugin: Causes StackOverFlowerError
Issue Type: Bug
Assignee: Marky Jackson
Components: prometheus-plugin
Created: 2019-10-28 17:21
Environment: Jenkins ver. 2.176.2

Prometheus metrics plugin
2.0.6

OS: Ubuntu 16.04 / x86_64 x86_64 GNU/Linux
^ All nodes, master and workers, run the same OS.

Jenkins is running inside a Docker container. The agents on each node are running directly on VMs in AWS and GCP.

Installation: Master is run in a Docker container. Slave agent is via SSH.

Web browser: N/A. It was seen on Chrome Latest, but these error logs are taken from the docker container running the master node, directly.
Priority: Major
Reporter: Tony Poerio

Overview

=========

After installation of the Prometheus Plugin, we performed the following actions:

  1. Restart Jenkins => Jenkins started gracefully, and jobs began running as expected

  2. Navigate to <jenkins-url>/prometheus, and observe that there is no data being written there. Wait ~ 20 minutes with no change.

  3. Investigate Jenkins logs on the master and find critical errors, pasted below.

  4. Turn off the Prometheus Plugin and restart Jenkins, because: (a) no data was captured; and (b) the error messages were alarming.
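
The check in step 2 can also be done without a browser. Below is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format, in which lines starting with `#` are HELP/TYPE comments and every other non-blank line is a sample. The class name `ScrapeCheck` and the helper `countSamples` are hypothetical and are not part of Jenkins or the plugin.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Jenkins or the Prometheus plugin.
final class ScrapeCheck {

    // Counts sample lines in a text-format scrape body.
    // Blank lines and '#' comment lines (HELP/TYPE) are ignored.
    static long countSamples(String body) {
        return Arrays.stream(body.split("\\R"))
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .count();
    }

    // Reads the scrape body from standard input and reports whether it holds any samples.
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String body = in.lines().collect(Collectors.joining("\n"));
            long samples = countSamples(body);
            System.out.println(samples == 0
                    ? "no samples in scrape output"
                    : samples + " sample lines found");
        }
    }
}
```

One way to use it is to pipe the endpoint output into the program, for example `curl -s <jenkins-url>/prometheus | java ScrapeCheck` after compiling. A count of zero matches the "no data" symptom described in step 2.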

The message we are most concerned about is:
```

SEVERE: A thread (prometheus_async_worker thread/48636) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.

```

 

Errors (longer form)

-------------------------------

The specific message we are seeing, as soon as jobs begin running, is: 

```

Oct 28, 2019 2:48:34 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution createPlaceholderNodes
Oct 28, 2019 2:48:34 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution createPlaceholderNodes
INFO: Creating placeholder flownodes for execution: CpsFlowExecution[Owner[redacted]]
--
WARNING: Error initializing storage and loading nodes, will try to create placeholders for: CpsFlowExecution[Owner[redacted]]
java.io.IOException: Tried to load head FlowNodes for execution Owner[redacted] but FlowNode was not found in storage for head id:FlowNodeId 1:1469
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:679)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:716)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:662)
    at org.jenkinsci.plugins.prometheus.JobCollector.appendJobMetrics(JobCollector.java:270)
    at org.jenkinsci.plugins.prometheus.JobCollector.lambda$collect$0(JobCollector.java:176)
    at org.jenkinsci.plugins.prometheus.util.Jobs.forEachJob(Jobs.java:20)
    at org.jenkinsci.plugins.prometheus.JobCollector.collect(JobCollector.java:159)
    at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:183)
    at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:216)
    at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:137)
    at io.prometheus.client.exporter.common.TextFormat.write004(TextFormat.java:22)
    at org.jenkinsci.plugins.prometheus.service.DefaultPrometheusMetrics.collectMetrics(DefaultPrometheusMetrics.java:43)
    at org.jenkinsci.plugins.prometheus.service.PrometheusAsyncWorker.execute(PrometheusAsyncWorker.java:40)
    at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101)
    at java.lang.Thread.run(Thread.java:748)

```

Followed by...
Oct 28, 2019 3:08:00 PM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
SEVERE: A thread (prometheus_async_worker thread/48636) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.StackOverflowError
    at java.util.TreeMap.put(TreeMap.java:568)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:44)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
    at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
 

This message is repeated throughout the logs many times.

We believe this is the reason why no valid Prometheus metrics are reported. 

Does that sound correct, and is there any way we can validate/verify/fix?
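
For context on why the trace repeats `traverseTree` frames: a recursive walk over node parents uses one stack frame per ancestor, so a long enough chain can exhaust the thread stack. The sketch below illustrates that pattern and an iterative alternative that keeps pending nodes on the heap. It is an illustration under assumptions, not the plugin's code and not necessarily the fix adopted upstream; `Node`, `FlowGraphWalk`, and all method names are hypothetical stand-ins.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a pipeline flow node; NOT the Jenkins FlowNode class.
final class Node {
    final String id;
    final List<Node> parents = new ArrayList<>();

    Node(String id) {
        this.id = id;
    }
}

final class FlowGraphWalk {

    // Recursive walk: one stack frame per ancestor, so a long chain of
    // parents can throw StackOverflowError.
    static void visitRecursive(Node node, Set<String> seen) {
        if (!seen.add(node.id)) {
            return; // already visited
        }
        for (Node parent : node.parents) {
            visitRecursive(parent, seen);
        }
    }

    // Iterative walk: pending nodes are held in a heap-allocated deque,
    // so the call depth stays constant regardless of chain length.
    static Set<String> visitIterative(Node head) {
        Set<String> seen = new HashSet<>();
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(head);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            if (!seen.add(node.id)) {
                continue; // already visited
            }
            for (Node parent : node.parents) {
                pending.push(parent);
            }
        }
        return seen;
    }

    // Builds a linear chain of `length` nodes and returns the newest (head) node.
    static Node chain(int length) {
        Node current = new Node("0");
        for (int i = 1; i < length; i++) {
            Node next = new Node(Integer.toString(i));
            next.parents.add(current);
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Node head = chain(200_000);
        System.out.println("iterative visited: " + visitIterative(head).size());
        try {
            visitRecursive(head, new HashSet<>());
            System.out.println("recursive walk completed");
        } catch (StackOverflowError e) {
            System.out.println("recursive walk overflowed the stack");
        }
    }
}
```

Whether the recursive walk overflows depends on the JVM's thread stack size, so the chain length in `main` is only an illustrative value.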

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)

marcel@mahillmann.de (JIRA)

Nov 22, 2019, 2:12:04 AM
to jenkinsc...@googlegroups.com
Marcel Hillmann commented on Bug JENKINS-59966
 
Re: Prometheus Plugin: Causes StackOverFlowerError

Hi,

Are there any updates? We are facing the same issue.

Thanks,
Marcel

damien.coraboeuf@gmail.com (JIRA)

Nov 27, 2019, 11:34:02 AM
to jenkinsc...@googlegroups.com

We have the same issue in my company:

  • Jenkins LTS 2.190.1
  • Prometheus plugin 2.0.6

We see the same StackOverflowError as reported above.

However, we do have a second Jenkins instance, with the same versions, and there is no issue there. So the problem appears to depend on something in the setup.

mironov.v@torrowtech.com (JIRA)

Dec 28, 2019, 2:44:03 AM
to jenkinsc...@googlegroups.com

Facing the same issue in the same setup as mentioned above.

The plugin worked fine for about a week, I think, but at some point I noticed that the information on our dashboard had been stale for a week. So I checked the endpoint and confirmed that the metrics had not been updated for a while. After restarting the master, the endpoint became empty, and it stayed empty after reinstalling the plugin.

aburdajewicz@cloudbees.com (JIRA)

Mar 19, 2020, 2:53:03 AM
to jenkinsc...@googlegroups.com

This is supposedly solved in version 2.0.7 of the plugin: https://github.com/jenkinsci/prometheus-plugin/pull/143


kupchenko@gmail.com (JIRA)

Apr 26, 2020, 3:47:03 AM
to jenkinsc...@googlegroups.com
Daniil Kupchenko updated an issue
Change By: Daniil Kupchenko
Attachment: screenshot-1.png

kupchenko@gmail.com (JIRA)

Apr 26, 2020, 3:49:02 AM
to jenkinsc...@googlegroups.com

The same behavior with jenkins 2.204.2 and prometheus-plugin 2.0.6:

2020-04-26 07:24:17.409+0000 [id=13003] SEVERE  h.i.i.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler#uncaughtException: A thread (prometheus_async_worker thread/13003) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.StackOverflowError
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$TimingFlowNodeStorage.getNode(CpsFlowExecution.java:1806)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getNode(CpsFlowExecution.java:1181)
        at org.jenkinsci.plugins.workflow.graph.FlowNode.loadParents(FlowNode.java:164)
        at org.jenkinsci.plugins.workflow.graph.FlowNode.getParents(FlowNode.java:155)
        at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)
        at org.jenkinsci.plugins.prometheus.util.FlowNodes.traverseTree(FlowNodes.java:49)

With all options unchecked in the plugin configuration, it works as expected:

2020-04-26 07:27:47.283+0000 [id=13311] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started prometheus_async_worker
2020-04-26 07:27:47.311+0000 [id=13311] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished prometheus_async_worker. 28 ms
2020-04-26 07:29:47.283+0000 [id=13708] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started prometheus_async_worker
2020-04-26 07:29:47.310+0000 [id=13708] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished prometheus_async_worker. 27 ms
2020-04-26 07:31:47.283+0000 [id=13785] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started prometheus_async_worker
2020-04-26 07:31:47.308+0000 [id=13785] INFO    hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished prometheus_async_worker. 24 ms

Just wondering why version 2.0.7 has not been released yet.
