Issues with Jenkins

Paweł Albecki

Jan 19, 2018, 2:11:52 PM

to OpenLMIS Dev

Hi all,

Recently we have experienced some issues with Jenkins, such as "no space left on device" and out-of-memory errors. I would like to start a discussion on how we can improve our CI server so that these issues do not block us during development.

One idea is to add a Scalyr alert for low free disk space, so we can manually check what is taking up space unnecessarily and clean it up.

Another idea is to remove the workspace after every build job (at least when it succeeds), or with the Jenkins clean daily job. We can also remove all unused images, not just dangling ones. This is already done for mbili but not for master.
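As an illustration of the "all unused images, not just dangling ones" idea, here is a minimal sketch of what such a cleanup step could look like. The function names and the dry-run switch are assumptions made for this example and are not taken from the actual Jenkins clean job; the Docker subcommands themselves (`docker container prune`, `docker image prune --all`) are standard CLI commands.

```python
import subprocess


def prune_commands(include_tagged: bool = True) -> list[list[str]]:
    """Build the Docker commands for a cleanup pass (hypothetical helper)."""
    image_prune = ["docker", "image", "prune", "--force"]
    if include_tagged:
        # Without --all only dangling (untagged) images are removed;
        # with --all every image not used by a container is removed.
        image_prune.append("--all")
    return [
        # Stopped containers go first, so the images they reference
        # count as unused when the image prune runs.
        ["docker", "container", "prune", "--force"],
        image_prune,
    ]


def run_cleanup(dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in prune_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=False)
```

Calling `run_cleanup()` only prints the commands; `run_cleanup(dry_run=False)` would actually delete containers and images on the host, so it should be tried on a non-critical node first.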

I wonder if there is something that can be done to prevent running out of memory, given that we currently remove containers after every job. However, I think running containers are sometimes left behind when a job is stopped or killed, and they then need to be stopped and removed manually.

Feel free to reply with what you think about the proposed solutions and to suggest other approaches. As a result of this discussion I would like to create a ticket to implement the improvements, and I believe it can be prioritized.

Best regards,

Paweł

Paweł Gesek

Jan 22, 2018, 11:35:37 AM

to OpenLMIS Dev

Hello,

a few questions:

* Do we know what images/containers in particular clog the disk space because they are not cleaned up? Is this also an issue with memory, when running service instances are not killed?

* Perhaps jobs can also do the cleanup when they start and remove any potential danglers?

* What are the numbers? How many dangling images/containers will it take to paralyze Jenkins?

* Are there any Jenkins plugins that can help us with these container/image management issues?

A daily job sounds a bit hacky, but I guess it can get the job done if all else fails.

Regards,

Paweł

SolDevelo Sp. z o.o. [LLC] / www.soldevelo.com
Al. Zwycięstwa 96/98, 81-451, Gdynia, Poland
Phone: +48 58 782 45 40 / Fax: +48 58 782 45 41

--
You received this message because you are subscribed to the Google Groups "OpenLMIS Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openlmis-dev+unsubscribe@googlegroups.com.
To post to this group, send email to openlm...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openlmis-dev/720ef4e4-e969-49b5-9ce9-a2fce4d3c99b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Paweł Gesek
Technical Project Manager
pge...@soldevelo.com / +48 690 020 875

Paweł Albecki

Jan 24, 2018, 10:04:56 AM

to Paweł Gesek, OpenLMIS Dev

Hi Paweł,

Thanks for your input.

> Do we know what images/containers in particular clog the disk space because they are not cleaned up? Is this also an issue with memory, when running services instances are not killed?

I can see that we have a lot of cached images that could clog the disk. There are 3 images with different tags for every Docker image. I am not sure why we need images with the "latest" tag (don't we use only version tags?), but I have already removed all images without a tag, and the Jenkins clean daily job does that too. The two biggest disk space consumers are the images for the performance and Selenium tests (about 2 GB each). Regarding the second question, I think it is a memory issue when running service instances are not killed.

> Perhaps jobs can also do the clean up when they start and remove any potential danglers?

I think this is a good idea. We remove the workspace after the build is done for contract tests. I believe we also clean up containers, networks and volumes at the end of every job, but sometimes that step may not be called.
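A start-of-job cleanup along these lines could be sketched as follows. This assumes that the containers of a job can be recognised by a common name prefix (for example a Compose project name); that prefix, and both function names, are hypothetical. `docker ps --all --format` and `docker rm --force --volumes` are standard Docker CLI calls.

```python
import subprocess


def leftover_ids(ps_output: str, name_prefix: str) -> list[str]:
    """Pick container IDs whose name starts with name_prefix.

    ps_output is expected to contain one "<id><TAB><name>" pair per line.
    """
    ids = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        container_id, _, name = line.partition("\t")
        if name.startswith(name_prefix):
            ids.append(container_id)
    return ids


def remove_leftovers(name_prefix: str) -> None:
    """Force-remove containers (running or stopped) left by an earlier run."""
    listing = subprocess.run(
        ["docker", "ps", "--all", "--format", "{{.ID}}\t{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    ids = leftover_ids(listing, name_prefix)
    if ids:
        # --force stops running containers; --volumes also drops their
        # anonymous volumes.
        subprocess.run(["docker", "rm", "--force", "--volumes", *ids], check=True)
```

Running this as the first build step means a job that was killed mid-run gets cleaned up by its own next run, rather than relying on the end-of-job step having been reached.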

> What are the numbers? How many dangling images/containers will it take to paralyze Jenkins?

This is a good question. I have started monitoring our instance and will be back with numbers later.

> Are there any Jenkins plugins that can help us with these container/image management issues?

I didn't find any plugin that would help in this case. Do you have something in mind?

Best regards,

Paweł

--

Paweł Albecki
Software Developer
palb...@soldevelo.com

Josh Zamor

Jan 27, 2018, 4:26:07 PM

to OpenLMIS Dev

Thanks Pawel for taking a lead on this.

I used to spend more time maintaining Jenkins, but recently I haven't had the time. Today I took a look again and noticed a few things:

* Failed workspace cleanups continue to be the main cause of rapid storage depletion. As far as I know, workspace cleanups almost never succeed with our setup because of our Docker usage. We have a bug for this: https://openlmis.atlassian.net/browse/OLMIS-3101 though I'll caution that the potential fix I put in there never seems to work quite right in Docker; it's a bit of a rabbit hole. While that Docker issue may not be easy to resolve, I continue to recommend we never configure the job to clean up its workspace. In the meantime I've added a crontab entry to force-remove these.
* A few jobs, notably the requisition contract test and the stockmanagement contract test, weren't configured to dispose of old jobs and so had hundreds of saved job artifacts eating up GBs.
* I also added a quick Scalyr alert that fires if free disk space dips below 4 GB; we should see it in Slack.
* A few plugins, as well as Jenkins itself, have critical security patches. I've upgraded the plugins, and Jenkins should be next.
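For reference, the 4 GB threshold can also be checked locally, independently of Scalyr. The sketch below is a stand-in and not the Scalyr alert itself; the function names and the default path are assumptions, while `shutil.disk_usage` is part of the Python standard library.

```python
import shutil

GIB = 1024 ** 3


def is_low_on_disk(free_bytes: int, threshold_gib: float = 4.0) -> bool:
    """True when free space is below the threshold (4 GB as in the alert)."""
    return free_bytes < threshold_gib * GIB


def check_path(path: str = "/") -> bool:
    """Check the filesystem that holds `path` (the path is an assumption)."""
    return is_low_on_disk(shutil.disk_usage(path).free)
```

A cron entry or a pre-build step could call `check_path()` on the volume that holds the Jenkins workspaces and fail fast instead of letting a build die with "no space left on device".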

The above freed up an easy 20 GB. Looking at the Scalyr charts over the past 2 weeks, I only see 1 or 2 times when it ran out of storage or came close to it, yet Jenkins and Sonar have stopped many times. It doesn't seem like storage has always been the culprit. Similarly with memory: there are times when Jenkins appears to have stopped while Scalyr reports free memory. Something isn't quite adding up from what I'm seeing.

I'm making a snapshot and I'll see if I can get to upgrading Jenkins here. Otherwise I'm looking forward to what else you find.

Best,
Josh

Josh Zamor

Jan 29, 2018, 7:06:12 PM

to OpenLMIS Dev

So that you have a record, Pawel, of what I've found: this past weekend I did the following:

* upgraded Jenkins and all its plugins (which had multiple security warnings)
* changed our Jenkins node protocol away from JNLP, as there was a note that it might affect stability
* upgraded Sonar to the latest LTS
* as noted, freed up some disk space.

After all of this, and 24+ hours later, Jenkins and Sonar are still crashing. Process logs from Sonar clearly indicate it's running out of the memory available to it through the JVM. Since we're using a 64-bit JVM on a 64-bit architecture, it appears some other process is eating up all the available memory. The most likely cause is a Jenkins job (likely one which brings up many containers), though I haven't ruled out all other processes (e.g. some of our cron jobs cleaning up Docker artifacts might be causing memory spikes). Scalyr is monitoring the memory usage of all JVMs, Postgres, etc., and it's not showing memory spikes at the process level, which is why I wouldn't be so quick to rule out other processes that don't have the same detailed monitoring.

If you could pick this up from here Pawel I think we can figure it out together.

Best,

Josh

Paweł Albecki

Jan 30, 2018, 8:32:52 AM

to Josh Zamor, OpenLMIS Dev

Thanks Josh,

It's unfortunate that all these improvements didn't fix the main issue, but at least "no space left on device" should no longer be a problem thanks to the automatic cleanups and the low disk space alert.

I don't believe our cron jobs cleaning up Docker artifacts cause memory problems; I haven't noticed that. What I did notice is that Jenkins and Sonar usually crash when performance or functional tests are running. When I ran the referencedata performance tests build on master, free memory decreased drastically and then Jenkins crashed. I think that if we restrict these jobs to run only on mbili, it should resolve our problems.

Regards,

Paweł

Paweł Albecki

Feb 1, 2018, 9:01:29 AM

to Josh Zamor, OpenLMIS Dev

It turned out that the reason for the Jenkins and Sonar crashes is the referencedata performance tests. Many scenarios are executed concurrently, which causes the server to run out of memory. For now we have restricted this job to run only on mbili, but there is a downside: slaves restart very often and the workspace is wiped, so we don't have access to, e.g., logs with errors. I see two possible solutions:

1. Don't run all refdata tests at the same time. I'm not sure if we can achieve this in any way other than running Taurus in test.sh for every test file separately, one by one.

2. Limit the maximum amount of memory the Taurus container can use. I checked that Jenkins crashes when the container uses more than 5 GB; I'm not sure how a limit would affect the scenarios being run, though.
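Option 1 could look roughly like the sketch below: one Taurus invocation per test file, each started only after the previous one has exited. `bzt` is the Taurus command-line tool; the directory layout, the `*.yml` extension and both function names are assumptions, and in the real setup the command would presumably be wrapped in the existing container invocation from test.sh.

```python
import subprocess
from pathlib import Path


def bzt_commands(config_files) -> list[list[str]]:
    """One `bzt` invocation per config file, in a stable (sorted) order."""
    return [["bzt", str(path)] for path in sorted(config_files, key=str)]


def run_one_by_one(tests_dir: str) -> dict[str, int]:
    """Run each test file on its own and collect the exit codes."""
    exit_codes = {}
    for cmd in bzt_commands(Path(tests_dir).glob("*.yml")):
        # subprocess.run blocks until Taurus exits, so the scenarios of
        # only one file are loaded at any given time.
        exit_codes[cmd[1]] = subprocess.run(cmd, check=False).returncode
    return exit_codes
```

The trade-off is a longer total run time in exchange for a lower peak memory footprint; collecting the exit codes keeps a failure in one file from hiding the results of the others.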

What do you think?

Best regards,

Pawel

Łukasz Lewczyński

Feb 1, 2018, 9:09:24 AM

to Paweł Albecki, Josh Zamor, OpenLMIS Dev

I think we could try the memory limit. There is also a ticket to change how our performance tests are executed; details can be found in OLMIS-3735. Briefly, we should create a separate repository (like contract-tests and functional-tests) that will contain all performance tests. This change should make the tests easier to manage, because currently we have to make changes in several repositories instead of one.
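For the memory-limit experiment, a hard cap can be set with the standard `docker run` flags `--memory` and `--memory-swap`. The sketch below only builds the command line; the image name and arguments are placeholders, and the helper name is hypothetical.

```python
def limited_run_command(image: str, args: list[str], memory: str) -> list[str]:
    """Build a `docker run` command with a hard memory cap.

    Setting --memory-swap to the same value as --memory means the
    container cannot use swap on top of the limit.
    """
    return [
        "docker", "run", "--rm",
        "--memory", memory,
        "--memory-swap", memory,
        image, *args,
    ]
```

For example, `limited_run_command("some-taurus-image", ["test.yml"], "4g")` yields a command capped below the 5 GB at which Jenkins was observed to crash. Processes that exceed the cap may be killed by the kernel, so the effect on running scenarios would need to be checked.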

Łukasz Lewczyński
Software Developer
llewc...@soldevelo.com

Paweł Albecki

Feb 12, 2018, 11:33:56 AM

to OpenLMIS Dev

I tried to limit memory usage, but it does not seem to work very well. A "socket closed" error occurs for some executions when I limit the memory that the Taurus container can use. I think we should go with executing the test scenarios separately. Łukasz, do you think we can deal with it within OLMIS-3735? If so, I think we should prioritize this ticket. For now, the ReferenceData performance tests can run on a slave.

Łukasz Lewczyński

Feb 12, 2018, 11:51:26 PM

to Paweł Albecki, OpenLMIS Dev

Yes, Pawel. There are even requirements in the ticket description:

* It should be possible to set which tests should be executed. For instance, I would like to execute only reference data tests, or auth and fulfillment tests.
* Each package of tests should be executed separately (similar to now: the auth package, then the cce package, then the referencedata package, etc.). Parameters should be stored in the settings.env file. By default, all packages should be executed by the CI job.
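To make these requirements concrete, here is a minimal sketch of how a runner could read the package selection from settings.env. The variable name `PERFORMANCE_TEST_PACKAGES`, the comma-separated format and the function names are assumptions made for this example and are not taken from the ticket; the package list contains only the names mentioned in this thread.

```python
# Package names mentioned in this thread; the real list may be longer.
ALL_PACKAGES = ["auth", "cce", "fulfillment", "referencedata"]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def selected_packages(env_text: str,
                      key: str = "PERFORMANCE_TEST_PACKAGES") -> list[str]:
    """Return the packages to run, in order; all packages when unset."""
    raw = parse_env(env_text).get(key, "")
    chosen = [name.strip() for name in raw.split(",") if name.strip()]
    return chosen or list(ALL_PACKAGES)
```

With `PERFORMANCE_TEST_PACKAGES=auth,fulfillment` in settings.env, only those two packages would be run, one after the other; leaving the variable out falls back to all packages, which matches the default expected of the CI job.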
