Hey all,
Another one of my strange issues :)
Currently running health_manager_next on v161, stemcell 2200. We're seeing a strange behavior which is best described below:
Showing health and status for app rubytest in org org / space space as user...
OK
requested state: started
instances: 0/1
usage: 1G x 1 instances
state since cpu memory disk
#0 running 2014-03-20 12:00:19 PM 0.0% 76.2M of 1G 58.7M of 1G
Watching the health_manager_next logs, it truly believes the app is down and requests a new instance be started, despite the fact that I can reach the app no problem. However, despite the fact it requests a new instance, a new one is never started, or at least, the instance ID of the app never changes. From the HMN logs:
{"timestamp":1395331896.365486,"message":"harmonizer: Analyzed 1 running 1 missing instances. Elapsed time: 0.009356267","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":229,"method":"finish_droplet_analysis"}
{"timestamp":1395331905.8794415,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":25176160,"fiber_id":25213960,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395331915.9453702,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":16697460,"fiber_id":25242680,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395331916.3971412,"message":"harmonizer: droplets_analysis","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":195,"method":"analyze_apps"}
{"timestamp":1395331916.3981845,"message":"harmonizer.missing-instances.processing","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":86,"method":"on_missing_instances"}
{"timestamp":1395331916.398576,"message":"nudger: queued: start instances [0] for a426e654-f754-4c8f-86cd-bb2027d9f97c priority: 1000","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/nudger.rb","lineno":46,"method":"start_instances"}
{"timestamp":1395331916.3990774,"message":"nudger: queueing: {:droplet=>\"a426e654-f754-4c8f-86cd-bb2027d9f97c\", :last_updated=>1395331216, :version=>\"a26c4b98-527c-48bc-aec0-1b915b01bb0f\", :indices=>[0], :running=>{\"a26c4b98-527c-48bc-aec0-1b915b01bb0f\"=>0}, :flapping=>false}, 1000","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/nudger.rb","lineno":90,"method":"queue"}
This is done may 60 seconds after the app is up and running, receiving and serving requests as expected, and "cf app" shows "0/1 running", despite listing one instance running. Eventually (Maybe 3-5 minutes after push?), it seems to do a cleanup:
{"timestamp":1395332157.128465,"message":"harmonizer: Analyzed 1 running 1 missing instances. Elapsed time: 0.002016553","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":229,"method":"finish_droplet_analysis"}
{"timestamp":1395332157.2271812,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":16701260,"fiber_id":25316600,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395332157.949391,"message":"hm.nudger.request","log_level":"info","source":"hm","data":{"operation":"start","payload":{"droplet":"a426e654-f754-4c8f-86cd-bb2027d9f97c","last_updated":1395331216,"version":"a26c4b98-527c-48bc-aec0-1b915b01bb0f","indices":[0],"running":{"a26c4b98-527c-48bc-aec0-1b915b01bb0f":0},"flapping":false}},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/nudger.rb","lineno":27,"method":"publish_request_message"}
{"timestamp":1395332167.2962203,"message":"hm.actual-state.process-heartbeat","log_level":"debug","source":"hm","data":{"dea":"1-338ba66f0d4d4bb592a5f2f90faff93a"},"thread_id":24704000,"fiber_id":16719440,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/actual_state.rb","lineno":62,"method":"process_heartbeat"}
{"timestamp":1395332173.518845,"message":"harmonizer: droplet GC ran. Number of droplets before: 2, after: 1. 1 droplets removed","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":184,"method":"gc_droplets"}
{"timestamp":1395332175.085469,"message":"bulk: user counts received: {\"counts\"=>{\"user\"=>2}}","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/desired_state.rb","lineno":40,"method":"block (2 levels) in update_user_counts"}
{"timestamp":1395332175.097792,"message":"hm.desired-state.bulk-update-batch-received","log_level":"debug","source":"hm","data":{"size":1},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/desired_state.rb","lineno":182,"method":"process_response_and_get_next_bulk_token"}
{"timestamp":1395332175.098254,"message":"bulk: #set_desired_state","log_level":"debug","source":"hm","data":{"actual":{"instances":[{"state":"RUNNING","version":"170c8b12-7d3a-40e7-b36a-b6194d0dfab5","guid":"ab24129b1c8b4ead8b9d69ef343a1189","index":0,"crash_count":0}]},"desired_droplet":{"instances":1,"state":"STARTED","memory":1024,"package_state":"STAGED","version":"170c8b12-7d3a-40e7-b36a-b6194d0dfab5","id":"f1aabf32-db13-4323-802f-0195e47b438a","updated_at":"2014-03-20T16:10:52+00:00"}},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/droplet.rb","lineno":30,"method":"set_desired_state"}
{"timestamp":1395332175.1270504,"message":"hm.desired-state.bulk-update-done","log_level":"info","source":"hm","data":{"duration":0.061918503},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/desired_state.rb","lineno":177,"method":"process_response_and_get_next_bulk_token"}
{"timestamp":1395332177.201661,"message":"harmonizer: droplets_analysis","log_level":"debug","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":195,"method":"analyze_apps"}
{"timestamp":1395332177.203008,"message":"harmonizer: Analyzed 1 running 0 missing instances. Elapsed time: 0.001215878","log_level":"info","source":"hm","data":{},"thread_id":16810120,"fiber_id":24014660,"process_id":6740,"file":"/var/vcap/packages/health_manager_next/health_manager_next/lib/health_manager/harmonizer.rb","lineno":229,"method":"finish_droplet_analysis"}
However, "cf app" still shows "0/1" instances, but lists the instance state as "running"
Showing health and status for app rubytest in org org / space space as user...
OK
requested state: started
instances: 0/1
usage: 1G x 1 instances
state since cpu memory disk
#0 running 2014-03-20 12:00:19 PM 0.0% 77.2M of 1G 58.7M of 1G
As mentioned, this is on health_manager_next, not hm9000, so I'm not sure what the current state for support is for HMN, but comparing this to other deployments, nothing immediate jumps out
Thanks!
Brian