updating a bosh deployment


chinc...@gmail.com

Jul 7, 2014, 12:38:51 AM
to bosh-...@cloudfoundry.org
Hi,

I have a BOSH deployment and I am trying to re-deploy it with increased memory and CPU. However, when I do the bosh deploy with the new memory and CPU settings, the stop jobs fail and the deploy subsequently fails. I debugged it and noticed that the jobs are not being stopped in the reverse order of their start. A couple of questions:

1. What happens when re-deploying? Does it call a monit stop all?
2. Is there any variable to make stopping the jobs go in the reverse order of the start?

Thanks,
-C

Dr Nic Williams

Jul 7, 2014, 1:31:09 AM
to bosh-...@cloudfoundry.org, bosh-...@cloudfoundry.org
I thought I saw that this issue was fixed in a recent BOSH release. Yes, it was annoying, and yes, you had to manually run "monit stop all" in each VM. If you're not using a recent BOSH release (not sure how recent; perhaps a few weeks), then try that workaround.
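If you do go the manual route, it is roughly the following (commands assume the Ruby bosh CLI of that era and the standard stemcell monit location; JOB_NAME and 0 are placeholders for each job/index in your deployment):
--
# open an SSH session to a VM in the deployment
bosh ssh JOB_NAME 0

# on the VM, stop everything monit manages and check the result
sudo /var/vcap/bosh/bin/monit stop all
sudo /var/vcap/bosh/bin/monit summary
--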

I'm not sure whether the bug fix was applied only to the go_agent.



Dr Nic Williams

Jul 7, 2014, 1:31:45 AM
to bosh-...@cloudfoundry.org, bosh-...@cloudfoundry.org
Re: stopping order - perhaps research options for this in monit docs as a starting point?


Chinchu Sup

Jul 7, 2014, 6:18:42 PM
to bosh-...@cloudfoundry.org
Thanks Dr Nic. I looked at the monit docs for stop/start ordering and tried "depends on". It worked with the older stemcell (bosh-stemcell-2427-vsphere-esxi-ubuntu.tgz), but on the newer stemcell (bosh-stemcell-2624-vsphere-esxi-ubuntu-trusty-go_agent.tgz) it doesn't.

The monit file:
--
check process hadoop-datanode
  with pidfile /var/vcap/sys/run/hadoop-datanode/hadoop-datanode.pid
  start program "/var/vcap/jobs/hadoop-datanode/bin/monit_debugger hadoop-datanode_ctl '/var/vcap/jobs/hadoop-datanode/bin/hadoop-datanode_ctl start'"
  stop program "/var/vcap/jobs/hadoop-datanode/bin/monit_debugger hadoop-datanode_ctl '/var/vcap/jobs/hadoop-datanode/bin/hadoop-datanode_ctl stop'"
<% if File.exist?("/var/vcap/jobs/hadoop-namenode") %>
  depends on hadoop-namenode
<% end %>
  group vcap

On the older stemcell, the Ruby templating was being evaluated correctly and I was seeing "depends on hadoop-namenode" in the monitrc. But on the newer stemcell I don't see the "depends on" line; it's as if /var/vcap/jobs/hadoop-namenode did not exist.

The monit files should be de-erb'ified on the deployed VM, correct?
-C



Rob Day-Reynolds

Jul 7, 2014, 9:48:21 PM
to bosh-...@cloudfoundry.org
Previously, the agent would render the ERB templates on the deployed VM. However, since moving to the golang agent, those templates are now rendered on the director and transferred down to the agent. So the director sees that '/var/vcap/jobs/hadoop-namenode' does not exist on the BOSH director VM, and therefore isn't adding the 'depends on hadoop-namenode' line.

-Rob

dmi...@pivotallabs.com

Jul 7, 2014, 11:06:14 PM
to bosh-...@cloudfoundry.org, chinc...@gmail.com
As Rob Day-Reynolds mentions, ERB files are now evaluated in a different environment (and that environment might change again without notice, since it's considered a BOSH Director implementation detail).

So to make your release more BOSH-friendly, ideally hadoop-datanode would not depend on hadoop-namenode starting first. If that's not possible at this time, exposing a boolean property for deployment operators would be the next best thing, along the lines of the sketch below.
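For illustration, a minimal sketch of that approach, assuming a hypothetical boolean property (say hadoop-datanode.depends_on_namenode, declared with a default in the job's spec file) and the p() template helper being available when the monit template is rendered:
--
check process hadoop-datanode
  with pidfile /var/vcap/sys/run/hadoop-datanode/hadoop-datanode.pid
  start program "/var/vcap/jobs/hadoop-datanode/bin/monit_debugger hadoop-datanode_ctl '/var/vcap/jobs/hadoop-datanode/bin/hadoop-datanode_ctl start'"
  stop program "/var/vcap/jobs/hadoop-datanode/bin/monit_debugger hadoop-datanode_ctl '/var/vcap/jobs/hadoop-datanode/bin/hadoop-datanode_ctl stop'"
<%# gate the dependency on a manifest property instead of the local filesystem %>
<% if p('hadoop-datanode.depends_on_namenode') %>
  depends on hadoop-namenode
<% end %>
  group vcap
--
That way the decision comes from the deployment manifest rather than from whichever machine happens to render the template.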

Regarding job template ordering: 

BOSH makes _no_ promise to start/stop job templates in any specific order (only that job templates will get started soon). That's done on purpose, and in the future I could see start/stop operations being randomized to expose accidental reliance on timing.

An ideal BOSH job template should work in a way such that it's not affected by restarts of other job templates. Of course, in practice, job templates rely on other job templates (either on the same or a different VM) to provide some information. So to avoid cascading failure, each job template should start, wait for its dependencies to become available (e.g. poll for a server to become available, wait for a database to _come back_ after it crashed mid-operation), and then continue with its work.
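For example, a rough sketch of what a start path could do while waiting on a dependency (the namenode host/port here are hypothetical placeholders that would presumably come from properties, and it assumes nc is on the stemcell; any TCP check would do):
--
#!/bin/bash
# Sketch: block the start until the dependency answers, instead of relying on
# monit or BOSH ordering. NAMENODE_HOST/NAMENODE_PORT are placeholders.
NAMENODE_HOST=10.0.0.10
NAMENODE_PORT=8020

until nc -z "$NAMENODE_HOST" "$NAMENODE_PORT"; do
  echo "waiting for namenode at $NAMENODE_HOST:$NAMENODE_PORT..."
  sleep 5
done

# ...then exec the real start command here
--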

Some of our job templates (e.g. some cf-release pieces) are not following the above guideline, but we are planning to fix that as we improve things.

-dk

Chinchu Sup

Jul 7, 2014, 11:14:38 PM
to dmi...@pivotallabs.com, bosh-...@cloudfoundry.org
Makes sense, but these are Apache components and I don't have control over their behavior.

But these components do wait, and the startup goes OK. The problem is during stop: when the master goes down first, the slave nodes don't stop, since they may have pending state to flush to the master. Do you have this scenario with cf-release pieces, and how do you plan to handle it?

-C

dmi...@pivotallabs.com

Jul 8, 2014, 1:00:41 AM
to bosh-...@cloudfoundry.org, dmi...@pivotallabs.com, chinc...@gmail.com
Ah I see. Job template drain scripts might help you here.

BOSH provides a way to run a script before a job update or shutdown. You can add a bin/drain script to the slave node job template so that it can flush all of its data and then exit safely. cf-release does something similar for the DEA and gorouter job templates. See spec files in
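For a rough idea, a minimal drain script sketch (the graceful-stop command is a placeholder; the firm part of the contract is that the script prints a single integer to stdout telling BOSH how long to wait before stopping the job, with a negative number meaning dynamic draining where BOSH re-runs the script):
--
#!/bin/bash
# Sketch of a bin/drain script for the slave job (placeholder flush logic).
# BOSH invokes it with arguments such as "job_shutdown hash_unchanged" and
# reads one integer from stdout:
#   N >= 0  -> wait N seconds, then proceed with monit stop
#   N <  0  -> dynamic drain: wait |N| seconds and run this script again

set -e

# Placeholder: ask the datanode to flush pending state to the namenode and
# shut down gracefully; replace with whatever hook your packaging provides.
/var/vcap/jobs/hadoop-datanode/bin/hadoop-datanode_ctl stop >/dev/null 2>&1 || true

# Give it some extra time before monit stop runs.
echo 10
--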

Chinchu Sup

Jul 8, 2014, 1:59:06 AM
to Dmitriy Kalinin, bosh-...@cloudfoundry.org
Cool! I am assuming that drain gets called before stop is called on the job, correct?

Also, on the same note, what's the sequence of steps for a shutdown (or job update)?

1. Call drain on each of the jobs.
2. monit stop all?
3. ??

Thanks guys,
-C

chinc...@gmail.com

Jul 8, 2014, 9:49:53 PM
to bosh-...@cloudfoundry.org, dmi...@pivotallabs.com, chinc...@gmail.com
Dmitriy,

The drain script doesn't get called for all the jobs. I have a job with 5 templates. The drain gets called only on the first template, and then for some reason it goes ahead and calls stop on all of them. This is on the bosh-vsphere-esxi-ubuntu-trusty-go_agent-2624 stemcell.

/var/vcap/bosh/log/current
==
2014-07-09_01:32:31.58079 2014/07/09 01:32:31 mail from: "monit@localhost"
2014-07-09_01:32:31.58080 [File System] 2014/07/09 01:32:31 DEBUG - Checking if file exists /var/vcap/jobs/common-utils/bin/drain
2014-07-09_01:32:31.58080 [Cmd Runner] 2014/07/09 01:32:31 DEBUG - Running command: /var/vcap/jobs/common-utils/bin/drain job_shutdown hash_unchanged
2014-07-09_01:32:31.58523 [Cmd Runner] 2014/07/09 01:32:31 DEBUG - Stdout: 0
2014-07-09_01:32:31.58525 [Cmd Runner] 2014/07/09 01:32:31 DEBUG - Stderr:
2014-07-09_01:32:31.58529 [Cmd Runner] 2014/07/09 01:32:31 DEBUG - Successful: true (0)
2014-07-09_01:32:31.62468 2014/07/09 01:32:31 mail from: "monit@localhost"
===

ls -l /var/vcap/jobs/*/bin/drain
===
-rwxr-xr-x 1 root root   21 Jul  8 18:26 /var/vcap/jobs/common-utils/bin/drain*
-rwxr-xr-x 1 root root 1370 Jul  8 18:27 /var/vcap/jobs/hadoop-datanode/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/hadoop-historyserver/bin/drain*
-rwxr-xr-x 1 root root  925 Jul  8 18:27 /var/vcap/jobs/hadoop-namenode/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/hadoop-nodemanager/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/hadoop-proxyserver/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/hadoop-resourcemanager/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/hbase-master/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/hbase-regionserver/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/opentsdb/bin/drain*
-rwxr-xr-x 1 root root 1059 Jul  8 18:27 /var/vcap/jobs/storm-nimbus/bin/drain*
-rwxr-xr-x 1 root root 1130 Jul  8 18:27 /var/vcap/jobs/storm-supervisor/bin/drain*
-rwxr-xr-x 1 root root   21 Jul  8 18:27 /var/vcap/jobs/storm-ui/bin/drain*
-rwxr-xr-x 1 root root 1312 Jul  8 18:27 /var/vcap/jobs/zookeeper/bin/drain*

-C

Karl Isenberg

Jul 9, 2014, 12:43:26 PM
to bosh-...@cloudfoundry.org, dmi...@pivotallabs.com, chinc...@gmail.com
Correct, the drain script is only called on the first job template today. There are two stories in the backlog to call the drain scripts of all jobs: 
https://www.pivotaltracker.com/story/show/70697490

We will let Greg (our PM) know that more people want this behavior changed. 

-Karl & Dmitriy