Jenkins cloud health reporting


Oliver Gondža

Mar 16, 2016, 3:00:48 PM
to jenkin...@googlegroups.com
Hi,

I am looking for a way to collect and report provisioning statistics.
Clouds, as implemented in Jenkins, work mostly in the background but
provide very little feedback to the user when something goes wrong. I am
looking for the following information:

- What provisioning activities are currently in progress?
- Where is an exceeded quota my bottleneck?
- What is the log of current/past provisioning activities?
- What percentage of past provisioning attempts failed (per cloud/template)?
- A short summary presented in the UI (widget?)

It may also be possible to make the subsystem more reliable by taking
the past success rate into account. (This cloud/template failed ten
times out of the last ten attempts; can it really provision the 11th?)
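
A minimal sketch of the success-rate idea, assuming a fixed-size window of recent outcomes per cloud/template. The class name RecentAttempts and the "skip only when every attempt in a full window failed" policy are illustrative assumptions, not anything Jenkins provides:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not a Jenkins API): outcomes of the last N provisioning attempts. */
final class RecentAttempts {
    private final int capacity;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();

    RecentAttempts(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Record one attempt, evicting the oldest outcome once the window is full. */
    synchronized void record(boolean succeeded) {
        if (outcomes.size() == capacity) outcomes.removeFirst();
        outcomes.addLast(succeeded);
    }

    /** Fraction of failed attempts in the window; 0.0 when nothing has been recorded. */
    synchronized double failureRate() {
        if (outcomes.isEmpty()) return 0.0;
        long failed = outcomes.stream().filter(ok -> !ok).count();
        return (double) failed / outcomes.size();
    }

    /** Assumed policy: distrust a template only when a full window contains nothing but failures. */
    synchronized boolean looksBroken() {
        return outcomes.size() == capacity && failureRate() == 1.0;
    }
}
```

One instance per cloud/template pair would be enough to answer both the "what percentage failed" question and the "can it really provision the 11th" one.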

I want to check that no existing project already aims to provide this
before I start working on it.

--
oliver

Baptiste Mathus

Mar 16, 2016, 6:24:49 PM
to Jenkins Developers
+1. That'd be valuable information. For example, my current master addresses a docker swarm manager, and when it's full, the only way I know it is 1) because I monitor the swarm itself, and 2) there's a bunch of stack traces in the logs :).

If you need feedback or beta testing, I'll be happy to help.

Cheers




--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/56E9AD5C.4040706%40gmail.com.
For more options, visit https://groups.google.com/d/optout.

ogondza

Mar 22, 2016, 11:31:29 AM
to Jenkins Developers, m...@batmat.net
Thanks, I have started to work on this and expect to have something to show later this week.

It seems to be possible to do a lot in a cloud-agnostic way (counting provisioning attempts, reporting failures with exceptions, timings, etc.). However, the more interesting things will require plugins to integrate explicitly (separating provisioning attempts by template, recording the listener log, etc.).

https://github.com/jenkinsci/cloud-stats-plugin

Oliver Gondža

Apr 8, 2016, 9:44:16 AM
to Jenkins Developers
I have dug into this a bit more and here is the general design I ended
up with after a couple of iterations:

- The plugin tracks the N most recent provisioning activities. One such
activity covers the whole lifecycle from provisioning to slave deletion.
- The activity has 4 hardcoded phases: provisioning, launching,
operating, completed. Operating starts with the first successful launch
and ends with slave deletion (it is the only productive phase). The
activity is completed once the slave is gone; from then on it is
effectively history.
- Each phase execution tracks its start time and a list of attachments.
(I will reconsider making it actionable and using actions instead of
attachments, as they are similar.) An attachment is extensible and can
be a mere piece of HTML, a hyperlink, or a model object with its own URL
subspace. This makes it possible to attach and present any kind of
information: logs, exceptions, etc.
- Each Attachment(/Action) has a state: ok, warn or fail. The worst of
the attached states is propagated to the phase execution and activity
level. (If a slave fails to launch, an exception will be attached
explaining why the launch phase, and thus the whole activity, has
failed.)
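
The design in the list above could be modelled roughly as follows. This is only a sketch under assumptions: the type names (Status, Phase, Attachment, PhaseExecution, TrackedActivity) are made up for illustration and are not the plugin's actual classes. The states are declared from best to worst so that "the worst attached state" is simply the maximum.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Declared from best to worst, so the worst state is the largest one. */
enum Status { OK, WARN, FAIL }

/** The four hardcoded phases described in the post. */
enum Phase { PROVISIONING, LAUNCHING, OPERATING, COMPLETED }

/** Hypothetical attachment: a title plus the state it contributes. */
final class Attachment {
    final String title;
    final Status status;

    Attachment(String title, Status status) {
        this.title = title;
        this.status = status;
    }
}

/** One execution of a phase: its start time and whatever was attached to it. */
final class PhaseExecution {
    final long startedMillis;
    private final List<Attachment> attachments = new ArrayList<>();

    PhaseExecution(long startedMillis) { this.startedMillis = startedMillis; }

    void attach(Attachment attachment) { attachments.add(attachment); }

    /** Worst state among the attachments; OK when there are none. */
    Status status() {
        Status worst = Status.OK;
        for (Attachment a : attachments) {
            if (a.status.compareTo(worst) > 0) worst = a.status;
        }
        return worst;
    }
}

/** Hypothetical activity covering the lifecycle from provisioning to deletion. */
final class TrackedActivity {
    private final Map<Phase, PhaseExecution> executions = new EnumMap<>(Phase.class);

    /** Start a phase, or return the existing execution if it was already started. */
    PhaseExecution enter(Phase phase, long nowMillis) {
        return executions.computeIfAbsent(phase, p -> new PhaseExecution(nowMillis));
    }

    /** Worst state among all phase executions started so far. */
    Status status() {
        Status worst = Status.OK;
        for (PhaseExecution e : executions.values()) {
            if (e.status().compareTo(worst) > 0) worst = e.status();
        }
        return worst;
    }
}
```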

While that sounds reasonable, there are a couple of problems:

1) Each phase is considered completed (as far as time measurement is
concerned) when the next phase starts. This is caused by core extension
points being called in rather random places. It is possible that
launching starts or even completes (ComputerListener#onOnline) before
provisioning is done (CloudProvisioningListener#onComplete). Each phase
execution therefore needs to stay open for new attachments even after
the next phase has started (for example, attaching the provisioning log
while launching is already under way).
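
To illustrate the timing rule in 1), here is a small sketch that derives a phase's duration from the start of the next phase that has been reported. The names TrackedPhase and PhaseTimings are hypothetical, and clamping negative values to zero is my assumption for the case where callbacks arrive out of order:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.OptionalLong;

enum TrackedPhase { PROVISIONING, LAUNCHING, OPERATING, COMPLETED }

/** Hypothetical bookkeeping of phase start times for one activity. */
final class PhaseTimings {
    private final Map<TrackedPhase, Long> starts = new EnumMap<>(TrackedPhase.class);

    /** Keep the first reported start; a listener may report the same phase more than once. */
    void started(TrackedPhase phase, long millis) {
        starts.putIfAbsent(phase, millis);
    }

    /**
     * Duration of a phase: start of the next reported phase minus its own start.
     * Empty while the phase has not started or no later phase has been reported.
     */
    OptionalLong duration(TrackedPhase phase) {
        Long begin = starts.get(phase);
        if (begin == null) return OptionalLong.empty();
        TrackedPhase[] all = TrackedPhase.values();
        for (int i = phase.ordinal() + 1; i < all.length; i++) {
            Long next = starts.get(all[i]);
            // Assumption: clamp to zero if the later phase was reported with an earlier timestamp.
            if (next != null) return OptionalLong.of(Math.max(0L, next - begin));
        }
        return OptionalLong.empty();
    }
}
```

Note that nothing here closes a phase for attachments; only the time measurement ends when the next phase begins.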

2) A provisioning activity can start without core's involvement
(provisioning from the UI on the /computer page). In such cases, plugins
will have to call a listener in cloud-stats-plugin to have this activity
tracked.

3) There is no concept of templates in the core cloud API. However, it
is used by most cloud implementations and it is valuable information for
statistics.

4) Tracking the same activity as it goes through the
PlannedNode/Computer/Slave phases turned out to be a lot trickier than I
expected. I tried several approaches:

- Almost no plugin uses PlannedNode#displayName as the actual slave
name, so it is of no use. Not to mention we would have to reflect slave
renames.
- Calculating a fingerprint based on the PlannedNode's identity and
attaching it to the Slave instance as a NodeProperty in
CloudProvisioningListener#onComplete was the closest thing to a working
solution I have found. The problem is that by the time it gets called,
launching may already have started. Some plugins even wait for the
launch to complete before leaving PlannedNode#future. (Plus, for
whatever reason, the computer passed to ComputerListener#preLaunch might
not have a Node assigned yet, which seems like a bug to me.)

Having said that, I see no other way than to require every plugin to
implement a custom interface in its PlannedNode (or its #future),
computer and node to have the activity tracked correctly. However
invasive this might be, it will remove problems #3 and #4 entirely.
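
As a rough sketch of what such a custom interface might look like: every representation of the same activity hands out one shared identifier, created once when provisioning starts, so nothing depends on display names or renames, and the template name travels along with it. All names below (ActivityId, Trackable, the Stub classes, ActivityRegistry) are hypothetical and chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

/** Hypothetical identity of one provisioning activity; equality is based on the token only. */
final class ActivityId {
    final String cloud;
    final String template; // may be null for clouds without templates
    private final UUID token = UUID.randomUUID();

    ActivityId(String cloud, String template) {
        this.cloud = Objects.requireNonNull(cloud);
        this.template = template;
    }

    @Override public boolean equals(Object o) {
        return o instanceof ActivityId && token.equals(((ActivityId) o).token);
    }

    @Override public int hashCode() { return token.hashCode(); }
}

/** Hypothetical interface a plugin would implement on its planned node, computer and node. */
interface Trackable {
    ActivityId getId();
}

/** Stand-in for a plugin's PlannedNode subclass. */
final class StubPlannedNode implements Trackable {
    private final ActivityId id;
    StubPlannedNode(ActivityId id) { this.id = id; }
    @Override public ActivityId getId() { return id; }
}

/** Stand-in for a plugin's Computer subclass, carrying the same id forward. */
final class StubComputer implements Trackable {
    private final ActivityId id;
    StubComputer(ActivityId id) { this.id = id; }
    @Override public ActivityId getId() { return id; }
}

/** Hypothetical tracker: finds the activity no matter which representation reports in. */
final class ActivityRegistry {
    private final Map<ActivityId, String> phaseById = new HashMap<>();

    void update(Trackable item, String phase) { phaseById.put(item.getId(), phase); }

    String phaseOf(Trackable item) { return phaseById.get(item.getId()); }
}
```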

At this point, any feedback is welcome!
--
oliver

ogondza

Apr 27, 2016, 4:45:32 AM
to Jenkins Developers
Finally, there is something to show:
history.png

Stephen Connolly

Apr 27, 2016, 8:08:12 AM
to jenkin...@googlegroups.com

On 8 April 2016 at 14:44, Oliver Gondža <ogo...@gmail.com> wrote:
Some plugins even wait for launch to complete before leaving PlannedNode#future

Those are broken plugins

Oliver Gondža

Apr 27, 2016, 8:54:18 AM
to jenkin...@googlegroups.com
On 04/27/2016 02:08 PM, Stephen Connolly wrote:
>
> On 8 April 2016 at 14:44, Oliver Gondža <ogo...@gmail.com> wrote:
>
> Some plugins even wait for launch to complete before leaving
> PlannedNode#future
>
>
> Those are broken plugins

As a maintainer of one such plugin I would like to hear your input on
[1], as it seems to be quite common. With regard to the cloud-stats
plugin, it would not change much even if they all got fixed, since a
launch can be started by Jenkins before it collects completed futures;
waiting in the future just makes that happen all the time instead of
rarely.

[1]
https://github.com/jenkinsci/jclouds-plugin/blob/master/jclouds-plugin/src/main/java/jenkins/plugins/jclouds/compute/JCloudsCloud.java#L299-L303

--
oliver

Stephen Connolly

Apr 27, 2016, 9:59:35 AM
to jenkin...@googlegroups.com
So you should not be checking that the slave is launched by running through computer.connect(false), as that requires that you return early. Instead you should be checking for e.g. the ssh service to be open and responding to connection requests (you can do that easily with your launcher... but a custom check would be better as you only want to check that the port is open); then you return and let Jenkins add the node and connect. The critical bug is https://github.com/jenkinsci/jclouds-plugin/blob/master/jclouds-plugin/src/main/java/jenkins/plugins/jclouds/compute/JCloudsCloud.java#L297 i.e. you should never add the node to Jenkins yourself; when you do that, you cause the second issue that you are then trying to correct.

In short the future should return the Node instance to be added when the actual Node instance is ready to connect. If you return the Node too soon, then you get over-provisioning... which then leads to people trying to "fix" that by copying the hacks in the EC2 plugin.
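
A "custom check" of the kind described above might look like the following sketch. It only verifies that a TCP port accepts connections and does not perform an SSH handshake; the class name PortProbe, the one-second connect timeout and the polling loop are my own assumptions, not code from any plugin.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks that a port (e.g. 22 for ssh) accepts TCP connections. */
final class PortProbe {
    private PortProbe() {}

    /** True if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, unreachable or timed out: the machine is not ready yet.
            return false;
        }
    }

    /** Poll until the port opens or roughly maxWaitMillis have elapsed. */
    static boolean awaitOpen(String host, int port, long maxWaitMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        do {
            if (isOpen(host, port, 1000)) return true;
            Thread.sleep(pollMillis);
        } while (System.currentTimeMillis() < deadline);
        return false;
    }
}
```

The callable behind the planned node's future would wait on something like awaitOpen and only then return the Node, leaving it to Jenkins to add the node and connect.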

The cloud API is a mess. I want to replace it with something that a sane individual has a hope of implementing correctly.

Do not fear, after dealing with the cloud API for many years now I have come to the conclusion that it is IMPOSSIBLE to write a correct implementation. The closest thing to a correct implementation is the oc-cloud implementation used by CJOC (and that only avoids the bugs that you cannot avoid because we can linearize operations in CJOC). 

Our nectar-vmware plugin is "ok" (largely due to a shake-down we did when we were using it for the predecessor to CJOC)

The next closest to a correct implementation is Docker.

After that... well it's all mostly sh1t... I lose the will to live when I start looking at other implementations.

If you pick EC2 as your template you will be following every wrong example in the book.




Kanstantsin Shautsou

Apr 27, 2016, 10:40:56 AM
to Jenkins Developers
@ogondza you can have a look at https://github.com/KostyaSha/yet-another-docker-plugin, where I cleaned up most of the code that I started working on in docker-plugin. It is not ideal, of course, but I un-hardcoded and cleaned up as much as I could. All the other plugins copy-paste code without understanding how it works (my impression after reading a lot of cloud plugins).
PS: feel free to contact me about stats if you need some experiments.

Oleg Nenashev

Apr 30, 2016, 8:57:38 AM
to Jenkins Developers
Aside from the "Cloud API sucks" discussion (I totally agree with it), I'm definitely interested in getting as many metrics as possible from the provisioning process. So +1 for all your efforts, Oliver.

As we discussed yesterday at Linuxwochen, it would be great to get this stuff integrated with https://wiki.jenkins-ci.org/display/JENKINS/Metrics+Plugin for Jenkins. It automatically makes the metrics pluggable and allows accessing them from APIs.

I wish to get Metrics and Monitoring plugins integrated as well...

BR, Oleg

On Wednesday, April 27, 2016 at 16:40:56 UTC+2, Kanstantsin Shautsou wrote:

ogondza

May 1, 2016, 2:37:37 PM
to Jenkins Developers
An example of what integration with provisioning plugins will look like: https://github.com/jenkinsci/openstack-cloud-plugin/pull/71

--
oliver