All instances running, health manager reports 0/1 instances


Brian McClain

Mar 20, 2014, 12:21:54 PM
to vcap...@cloudfoundry.org
Hey all,

Another one of my strange issues :)


Currently running health_manager_next on v161, stemcell 2200. We're seeing a strange behavior which is best described below:



Showing health and status for app rubytest in org org / space space as user...
OK

requested state: started
instances: 0/1
usage: 1G x 1 instances

     state     since                    cpu    memory        disk
#0   running   2014-03-20 12:00:19 PM   0.0%   76.2M of 1G   58.7M of 1G

Watching the health_manager_next logs, it's clear HMN believes the app is down and requests that a new instance be started, even though I can reach the app with no problem. However, despite the start request, a new instance is never actually started; at least, the instance ID of the app never changes. From the HMN logs:



{"timestamp":1395331896.365486,"message":"harmonizer: Analyzed 1 running 1 missing instances. Elapsed time: 0.009356267","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":229,"method":"finish_droplet_analysis"}
{"timestamp":1395331905.8794415,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":25176160,"fiber_id":25213960,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395331915.9453702,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":16697460,"fiber_id":25242680,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395331916.3971412,"message":"harmonizer: droplets_analysis","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":195,"method":"analyze_apps"}
{"timestamp":1395331916.3981845,"message":"harmonizer.missing-instances.processing","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":86,"method":"on_missing_instances"}
{"timestamp":1395331916.398576,"message":"nudger: queued: start instances [0] for a426e654-f754-4c8f-86cd-bb2027d9f97c priority: 1000","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/nudger.rb","lineno":46,"method":"start_instances"}
{"timestamp":1395331916.3990774,"message":"nudger: queueing: {:droplet=>\"a426e654-f754-4c8f-86cd-bb2027d9f97c\", :last_updated=>1395331216, :version=>\"a26c4b98-527c-48bc-aec0-1b915b01bb0f\", :indices=>[0], :running=>{\"a26c4b98-527c-48bc-aec0-1b915b01bb0f\"=>0}, :flapping=>false}, 1000","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/nudger.rb","lineno":90,"method":"queue"}

This happens maybe 60 seconds after the app is up and running, receiving and serving requests as expected, while "cf app" shows "0/1" despite listing the one instance as running. Eventually (maybe 3-5 minutes after push?), it seems to do a cleanup:

{"timestamp":1395332157.128465,"message":"harmonizer: Analyzed 1 running 1 missing instances. Elapsed time: 0.002016553","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":229,"method":"finish_droplet_analysis"}
{"timestamp":1395332157.2271812,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":16701260,"fiber_id":25316600,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395332157.949391,"message":"hm.nudger.request","log_level":"info","source":"hm","data":{"operation":"start","payload":{"droplet":"a426e654-f754-4c8f-86cd-bb2027d9f97c","last_updated":1395331216,"version":"a26c4b98-527c-48bc-aec0-1b915b01bb0f","indices":[0],"running":{"a26c4b98-527c-48bc-aec0-1b915b01bb0f":0},"flapping":false}},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/nudger.rb","lineno":27,"method":"publish_request_message"}
{"timestamp":1395332167.2962203,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":24704000,"fiber_id":16719440,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395332173.518845,"message":"harmonizer: droplet GC ran. Number of droplets before: 2, after: 1. 1 droplets removed","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":184,"method":"gc_droplets"}
{"timestamp":1395332175.085469,"message":"bulk: user counts received: {\"counts\"=>{\"user\"=>2}}","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/desired_state.rb","lineno":40,"method":"block (2 levels) in update_user_counts"}
{"timestamp":1395332175.097792,"message":"hm.desired-state.bulk-update-batch-received","log_level":"debug","source":"hm","data":{"size":1},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/desired_state.rb","lineno":182,"method":"process_response_and_get_next_bulk_token"}
{"timestamp":1395332175.098254,"message":"bulk: #set_desired_state","log_level":"debug","source":"hm","data":{"actual":{"instances":[{"state":"RUNNING","version":"170c8b12-7d3a-40e7-b36a-b6194d0dfab5","guid":"ab24129b1c8b4ead8b9d69ef343a1189","index":0,"crash_count":0}]},"desired_droplet":{"instances":1,"state":"STARTED","memory":1024,"package_state":"STAGED","version":"170c8b12-7d3a-40e7-b36a-b6194d0dfab5","id":"f1aabf32-db13-4323-802f-0195e47b438a","updated_at":"2014-03-20T16:10:52+00:00"}},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/droplet.rb","lineno":30,"method":"set_desired_state"}
{"timestamp":1395332175.1270504,"message":"hm.desired-state.bulk-update-done","log_level":"info","source":"hm","data":{"duration":0.061918503},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/desired_state.rb","lineno":177,"method":"process_response_and_get_next_bulk_token"}
{"timestamp":1395332177.201661,"message":"harmonizer: droplets_analysis","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":195,"method":"analyze_apps"}
{"timestamp":1395332177.203008,"message":"harmonizer: Analyzed 1 running 0 missing instances. Elapsed time: 0.001215878","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":229,"method":"finish_droplet_analysis"}

However, "cf app" still shows "0/1" instances, while listing the instance state as "running":


Showing health and status for app rubytest in org org / space space as user...
OK

requested state: started
instances: 0/1
usage: 1G x 1 instances

     state     since                    cpu    memory        disk
#0   running   2014-03-20 12:00:19 PM   0.0%   77.2M of 1G   58.7M of 1G

As mentioned, this is on health_manager_next, not hm9000, so I'm not sure what the current state of support for HMN is, but comparing this to other deployments, nothing immediately jumps out.

Thanks!
Brian

Eric Lee

Mar 20, 2014, 3:09:02 PM
to vcap...@cloudfoundry.org
Brian wrote:
> Currently running health_manager_next on v161, stemcell 2200. We're seeing a strange behavior which is best described below:

We're seeing the same thing on v161 - health manager reports 0/1 instances, but also says the instance is running.

We're also seeing a problem where, when we push an app multiple times to update it, we end up with multiple instances of the app actually executing on the DEA box (even though the CLI says there's only one instance officially running).

Eric

Brian McClain

Mar 21, 2014, 8:53:51 AM
to vcap...@cloudfoundry.org
Hi Eric,

Glad to know I'm not alone on this one :)

Out of curiosity, were you running any version before v161 that worked as expected? We have another cluster running v159 and a slightly older stemcell that does not have this behavior; it is running health_manager_next rather than hm9000.

Trying to figure out if it's a config issue, environment issue, etc

Thanks!
Brian

nicola...@gmail.com

Mar 25, 2014, 10:38:48 AM
to vcap...@cloudfoundry.org
Hello Brian

I'm facing exactly the same behavior with stemcell 2200, CF 163, and hm9000.

$ cf app myapp
Showing health and status for app myapp in org myorg / space development as myuser...

OK

requested state: started
instances: 0/1
usage: 256M x 1 instances
urls: myapp...


     state     since                    cpu    memory           disk          
#0   running   2014-03-25 02:27:13 PM   0.1%   255.7M of 256M   125.3M of 1G

I checked the logs on the hm job VM, where I see exactly three sorts of log entries containing the word error:

# pwd
/var/vcap/sys/log/hm9000
# grep error * | tail -3
hm9000_sender.stdout.log:{"timestamp":1395747849.180017710,"process_id":1600,"source":"vcap.hm9000.sender","log_level":"error","message":"Store is not fresh - Error:Actual state is not fresh","data":null}
hm9000_sender.stdout.log:{"timestamp":1395747849.180342436,"process_id":1600,"source":"vcap.hm9000.sender","log_level":"error","message":"Sender failed with error - Error:Actual state is not fresh","data":null}
hm9000_sender.stdout.log:{"timestamp":1395747849.180811405,"process_id":1600,"source":"vcap.hm9000.sender","log_level":"error","message":"Daemon returned an error. Continuining... - Error:Actual state is not fresh","data":null}

After the BOSH deployment to CF 163, which introduced etcd and hm9000 for the first time on this installation, I observed that none of the apps were started. After starting them manually, they run perfectly: they bind nicely to services (v1 as well as v2), and I can also delete apps and re-deploy them. Only hm9000 is obviously not able to do its job and have the cloud controller start the apps.

Is there anything else I can post here to help analyze this issue?

Thanks for your help.
Nicolas

Brian McClain

Mar 25, 2014, 11:09:23 AM
to vcap...@cloudfoundry.org, nicola...@gmail.com
Strangely enough, switching from health_manager_next to hm9000 seemed to actually resolve our issue. Still on release 163 on stemcell 2200; I simply changed the template and added a couple of etcd nodes.

- Brian

David Williams

Mar 25, 2014, 12:51:00 PM
to vcap...@cloudfoundry.org, nicola...@gmail.com
Hey Brian,

That is odd. We're still on v161 (stemcell 2175) in our lab (we're not comfortable enough with it yet, for this and other reasons, to move it further up the stack). We had deployed 161 with hm9000 (never added the health_manager_next job) along with the etcd nodes, and we have the issue you described. So I don't think that's the root of the issue, but I'm glad that change resolved things for you. I should also note that the number of instances doesn't matter. We can scale up and down, all instances appear to be running and are routed to, but "cf apps" always shows 0/n (where "n" is the number of instances).

Just for giggles, I killed one of the DEAs, and single-instance droplets it was running were never restarted on another DEA. So although initial app deployment and scaling works fine, the self-healing functionality definitely isn't working.

Onsi Fakhouri

Mar 26, 2014, 4:32:08 PM
to vcap...@cloudfoundry.org
Hey all,

By way of replying, allow me to describe how HM9000 works and give you some pointers/tips for debugging.  Hopefully this will help you reason about the system and figure out what logs to look at/send along.  Of course, I'm happy to help with specific issues you're having as well.

## Overview:

HM9000, like HM_next, has several responsibilities.  Among these:
- starting missing instances/stopping extra instances
- knowing about, and reporting on, the actual state of the world (how many instances of a given application are running)
- moving apps from one DEA to another during a rolling deploy

In order to do this HM9K needs to have two pieces of information:
- the desired state: which apps should be running and how many instances should be running?  This information is fetched from CC (over HTTP) periodically.
- the actual state: which instances are actually running?  This information is received, via NATS, from the DEAs.  Each DEA sends a heartbeat, periodically, identifying all running apps.

HM9K stores the desired state and the actual state in etcd. With knowledge of the actual state and the desired state, HM9K can make decisions about whether or not a given instance should be started or stopped.  This information is sent (over NATS) to the CC which is then responsible for taking action (by sending a directed NATS message to a DEA to either start or stop the instance).
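
To make that concrete, here's a rough sketch in Go of the kind of comparison the analyzer performs.  This is purely illustrative (it is not the actual HM9000 source, and the types are made up for the example); it just shows the shape of the desired-vs-actual decision:

// Illustrative only -- not the actual HM9000 source.
package main

import "fmt"

type Decision struct {
    AppGUID string
    Index   int
    Action  string // "start" or "stop"
}

// analyze compares the desired instance count for an app with the set of
// instance indices reported as running in DEA heartbeats.
func analyze(appGUID string, desiredInstances int, runningIndices map[int]bool) []Decision {
    var decisions []Decision
    // Indices below the desired count with no heartbeat are missing -> start them.
    for i := 0; i < desiredInstances; i++ {
        if !runningIndices[i] {
            decisions = append(decisions, Decision{appGUID, i, "start"})
        }
    }
    // Heartbeating indices at or above the desired count are extra -> stop them.
    for i := range runningIndices {
        if i >= desiredInstances {
            decisions = append(decisions, Decision{appGUID, i, "stop"})
        }
    }
    return decisions
}

func main() {
    // Desired: 1 instance; actual: index 0 is heartbeating -> no decisions needed.
    fmt.Println(analyze("a426e654-f754-4c8f-86cd-bb2027d9f97c", 1, map[int]bool{0: true}))
}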

As you can see, HM9K has much authority (it can start and stop apps!) and with great power comes great responsibility.  It is very important that HM9K knows that it has a complete picture of the world before it takes action.  To do this, we introduced the notion of "freshness":
- The actual state is deemed "fresh" if HM9K has an active NATS connection, is regularly receiving DEA heartbeats, and is able to successfully store the actual state into etcd. If any of these things breaks down (NATS/no DEA heartbeats/etcd writes fail) then the actual state will lose freshness and HM9K will stop taking *any* actions.
- The desired state is deemed "fresh" if HM9K successfully downloads the desired state payload from CC (without timing out) and successfully stores the desired state into etcd.  As with actual state, if the desired state is not fresh, HM9K will stop taking actions.
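
Here's a tiny sketch of that gate, again in Go and again purely illustrative (the TTL value and types are assumptions for the example, not HM9000's configuration).  The "Actual state is not fresh" errors in Nicolas's sender logs above are this kind of check failing:

// Illustrative only -- not HM9000's implementation.
package main

import (
    "errors"
    "fmt"
    "time"
)

const freshnessTTL = 30 * time.Second // assumed window, for the example only

type StateStore struct {
    actualUpdatedAt  time.Time // last successful write of DEA heartbeats to etcd
    desiredUpdatedAt time.Time // last successful bulk fetch from CC stored in etcd
}

// CheckFreshness is the gate the acting components consult before doing work.
func (s *StateStore) CheckFreshness(now time.Time) error {
    if now.Sub(s.actualUpdatedAt) > freshnessTTL {
        return errors.New("actual state is not fresh")
    }
    if now.Sub(s.desiredUpdatedAt) > freshnessTTL {
        return errors.New("desired state is not fresh")
    }
    return nil
}

func main() {
    store := &StateStore{
        actualUpdatedAt:  time.Now().Add(-5 * time.Minute), // heartbeats stopped arriving
        desiredUpdatedAt: time.Now(),
    }
    // Rather than start/stop anything or report instance counts, HM9K bails out here.
    if err := store.CheckFreshness(time.Now()); err != nil {
        fmt.Println("refusing to act:", err)
    }
}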

One last piece of the puzzle: HM9K is actually comprised of ~5 independent components, each with a simple single responsibility.  Each of these components has its own log file on the HM9K job.  Here's a quick overview of the main components:

- listener: listens for DEA heartbeats over NATS and populates the actual state.  If the actual state is not fresh, look at the listener's logs to understand why it is unable to maintain the actual state.
- desired_state_fetcher: periodically fetches the desired state from CC.  If the desired state is not fresh, look at the fetcher's logs to understand why.
- analyzer: analyzes the actual state and desired state and makes decisions
- sender: acts on the decisions made by the analyzer and sends notifications to CC
- api_server: responds to CC requests for app state (this is what reports the number of running instances to the CC)

The analyzer, sender, and api_server will only perform their jobs if the actual and desired state are *both fresh*.

## Debugging HM9K:

With that mental model of the world, here are some tips for debugging HM9K:

- Make sure that CC is configured to talk to the correct health manager:  CC has an hm9000_noop flag. If set to *true* then CC will only listen to directives from health_manager_next, and will ask health_manager_next for information about running instances.  If set to *false* then CC will communicate with HM9K instead.  (A minimal manifest sketch follows this list.)

- Make sure etcd is not in a bad state: etcd, sometimes, enters a bad state (typically when deployed, not while running).  When this happens etcd can, sometimes, be unable to write data.  This prevents HM9K from attaining freshness and operating correctly.  Etcd is continuously improving and we expect this to become a non-issue soon (in fact, we have a fix to mitigate one of these failure modes coming down the pipe).  However, if you notice that HM9K is failing to attain freshness, and if the listener complains that it is unable to write to etcd, you may need to restart etcd.  One way to do this is to `bosh ssh` onto each etcd node, run `monit stop all`, then delete the etcd data directory under `/var/vcap/store`.  Once you have done this for all etcd nodes, you can bring them up with `monit start all`.  (See the command sketch after this list.)

- Verify that CC is able to send the desired state in time.  If CC takes too long to send the desired state, HM9K times out the connection and the desired state loses freshness.  To verify that this is not happening, check the logs of the desired_state_fetcher.  If CC is taking too long, please let us know.  We have seen this behavior when CC is under heavy load and are continuously making CC more performant.

- Check the load on etcd.  We've run tests against an etcd cluster operating on AWS across two availability zones and have seen HM9K/etcd comfortably scale up to 10-20,000 apps.  If you have substantially more load than this, or are using slower hardware, then etcd may be unable to keep up with the listener and the actual state may lose freshness.  You can check this by looking at the listener's logs.  If you see that freshness is not being maintained because etcd is timing out or taking too long to commit the writes, one workaround is to switch from using an etcd cluster to just a single etcd node.  Yes, this comes at the cost of high availability.  While we may be able to optimize HM9K's etcd writes some more, we expect that the long-term solution for supporting very large deployments will likely involve sharding HM9K/etcd.  FWIW, CoreOS is just now working on optimizing etcd's performance and we're hopeful that big gains will be coming soon :)
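
Regarding the hm9000_noop flag mentioned above, a minimal manifest fragment might look like the following.  Note that the property path shown (cc.hm9000_noop) is an assumption; check the job spec in your cf-release version for the exact location:

# Assumed property path -- verify against your cf-release version.
properties:
  cc:
    hm9000_noop: false  # false: CC talks to HM9000; true: CC listens to health_manager_next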
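
And here is the etcd reset procedure from above, spelled out roughly as commands (the job/instance name is deployment-specific, and the data directory is assumed to be /var/vcap/store/etcd):

# Repeat the stop/wipe steps on every etcd node before starting any of them back up.
bosh ssh etcd_z1/0   # your etcd job name will differ
sudo -i
monit stop all
rm -rf /var/vcap/store/etcd
# ...once all nodes are stopped and wiped:
monit start all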

If you made it this far... thanks for reading!  I hope this helps you reason about how HM9K works and gives you a starting point to debug things.  As you look at the various logs, please reply to this thread if you come across anything curious/mysterious.

## One more thing:

HM9K has a very very very useful helper utility.  If you `bosh ssh` onto the HM9K box you can:

`/var/vcap/packages/hm9000/hm9000 dump --config=/var/vcap/jobs/hm9000/config/hm9000.json`

This will fetch both the desired and actual state from etcd and print out a pretty-formatted bird's-eye view of all running & desired apps on your Cloud Foundry installation.  You can also get raw JSON output by passing the --raw flag to dump.
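
For example, the raw form would be something like (flag placement assumed):

/var/vcap/packages/hm9000/hm9000 dump --raw --config=/var/vcap/jobs/hm9000/config/hm9000.json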

Onsi

nicola...@gmail.com

Mar 27, 2014, 7:05:04 AM
to vcap...@cloudfoundry.org
Hello Onsi.

Your post is an amazing insight into how hm9k works, written at exactly the right level of detail I need as a service provider. Many, many thanks.
I worked through your "debugging" checklist and was thus able to correct my deployment file. It took me less than 60 minutes to fix the problem.
Now my installation on OpenStack, stemcell 2200, CF 163 runs perfectly.

Again, thank you very much!

Nicolas

PS: I suggest you convert your text into markdown and stow it somewhere on http://docs.cloudfoundry.org/

Aristoteles Neto

Mar 27, 2014, 5:14:08 PM
to vcap...@cloudfoundry.org
Hi Onsi,

Great post indeed.

- Verify that CC is able to send the desired state in time

What was the rationale behind choosing 10 seconds for this value?

-- Neto




Onsi Fakhouri

Mar 27, 2014, 5:29:12 PM
to vcap...@cloudfoundry.org
truly? not much... 10 seconds "felt about right"

sorry to disappoint ;)

onsi

Brian McClain

Mar 28, 2014, 11:53:59 AM
to vcap...@cloudfoundry.org
Finally had a chance to read your post, Onsi. Awesome information to have!

As silly as the root cause is, I'm assuming it had to do with this:

"- Make sure that CC is configured to talk to the correct health manager:  CC has an hm9000_noop flag. If set to *true* then CC will only listen to directives from health_manager_next, and will ask health_manager_next for information about running instances.  If set to *false* then CC will communicate with HM9K instead."

I believe my manifest had hm9000_noop set to false, though I was running health_manager_next. Not ashamed to admit I was clueless about what this option did, but I had read in several threads to check this field. It didn't click that it would cause issues if I didn't have hm9000 deployed, which explains why everything works as expected now that hm9000 has replaced HMN in my deployment.

Thanks!
Brian

ci...@troyastle.com

May 30, 2014, 7:10:47 AM
to vcap...@cloudfoundry.org
I hit this same issue today running CF170.

Configuration was correct.
I needed to run 'monit restart all' on my hm9000 machine. Odd.
If useful, I can collect the logs and raise an issue on GitHub.

Troy.

James Bayer

May 30, 2014, 10:58:01 AM
to vcap...@cloudfoundry.org
I've seen this myself occasionally, as recently as today. I'm following up with Mark and we'll report back.


--
Thank you,

James Bayer

Troy Astle

Jun 24, 2014, 9:24:46 AM
to vcap...@cloudfoundry.org
@James

I hit this again today. For me the error was in my etcd cluster; HM9000 was completely fine.

Symptoms:
 - cf apps showing 0/n for all running apps.
 - no heartbeat logs appearing on the HM9000 box.

Issue (for me):
 - My etcd nodes had become unclustered and could not recover on their own.
 - Onsi had previously run into an issue that looked similar here: https://github.com/coreos/etcd/issues/381

My resolution:
 - Stop all etcd nodes.
 - Remove /var/vcap/store/etcd on each node.
 - Then restart and allow the nodes to re-cluster and rebuild their state.

Hope this is useful.
T.

Ryan Grenz

Jun 30, 2014, 5:45:29 AM
to vcap...@cloudfoundry.org
Thanks Troy - that was useful. Fixed the same problem for us too!
Cheers!

Ryan