App fails to stage after push, but starts after restart

Brian McClain

Mar 5, 2014, 3:56:29 PM
to vcap...@cloudfoundry.org
Hey all,

I've run into a bit of odd behavior and was curious if anyone else has seen it. I'm currently running cf-159, and we're seeing the following:

After performing a "cf push" of a very simple Sinatra app, either with the buildpacks shipped with cf-159 or specifying the most recent buildpack, the app seems to stage without error but then errors out when starting the instance. I don't think I'm hitting a resource limit, as this is a brand-new environment we brought up, and we currently have 3 DEAs with plenty of memory, not running anything else. Output from "cf push":

Starting app sintest in org dev / space dev as admin...
OK
-----> Downloaded app package (4.0K)
-----> Using Ruby version: ruby-1.9.3
-----> Installing dependencies using Bundler version 1.3.2
       Running: bundle install --without development:test --path vendor/bundle --binstubs vendor/bundle/bin --deployment
       Fetching gem metadata from http://rubygems.org/..........
       Fetching gem metadata from http://rubygems.org/..
       Installing rack (1.5.2)
       Installing rack-protection (1.5.2)
       Installing tilt (1.4.1)
       Installing sinatra (1.4.4)
       Using bundler (1.3.2)
       Your bundle is complete! It was installed into ./vendor/bundle
       Cleaning up the bundler cache.
-----> Uploading droplet (23M)

FAILED
Server error, status code: 400, error code: 170001, message: Staging error: cannot get instances since staging failed

TIP: use 'cf logs sintest --recent' for more information

And from "cf logs":

2014-03-05T20:51:43.07+0000 [STG]     OUT -----> Uploading droplet (23M)
2014-03-05T20:51:44.58+0000 [API]     ERR Encountered error: Staging error: failed to stage application:
2014-03-05T20:51:44.58+0000 [API]     ERR Error uploading: http://api.cfv2.mydomain.com/v2/jobs/ad428cae-bbac-4f53-867e-eda1a41dfa87 (Polling status: 303 - )

HOWEVER

If I then simply run "cf restart sintest", it starts without error (though still with some strange behavior). Output of "cf restart sintest":

Starting app sintest in org dev / space dev as admin...
OK

1 of 1 instances running

App started

Showing health and status for app sintest in org dev / space dev as admin...
OK

requested state: started
instances: 0/1
usage: 1G x 1 instances
urls: sintest.cfv2.mydomain.com

     state     since                    cpu    memory        disk
#0   running   2014-03-05 08:53:21 PM   0.0%   69.9M of 1G   52.1M of 1G

And from the app logs:

2014-03-05T20:53:18.14+0000 [API]     OUT Updated app with guid f9016678-e851-4fbd-afb1-52acb62afe6c ({"state"=>"STOPPED"})
2014-03-05T20:53:18.24+0000 [API]     OUT Updated app with guid f9016678-e851-4fbd-afb1-52acb62afe6c ({"state"=>"STARTED"})
2014-03-05T20:53:18.62+0000 [DEA]     OUT Starting app instance (index 0) with guid f9016678-e851-4fbd-afb1-52acb62afe6c
2014-03-05T20:53:21.23+0000 [App/0]   ERR [2014-03-05 20:53:21] INFO  WEBrick 1.3.1
2014-03-05T20:53:21.23+0000 [App/0]   ERR [2014-03-05 20:53:21] INFO  ruby 1.9.3 (2013-11-22) [x86_64-linux]
2014-03-05T20:53:21.23+0000 [App/0]   ERR [2014-03-05 20:53:21] INFO  WEBrick::HTTPServer#start: pid=31 port=61009

While the app logs look as expected, I noticed that even though the app started without error and responds to cURL requests, it reports "0/1" instances running, which is weird as well. I thought this was a Health Manager issue and redeployed CF, switching from health_manager_next to hm9000; however, the exact same behavior still occurs. I figured I'd bring the issue to the group and see if anyone has seen this or has any idea of a possible cause.

Thanks!
Brian

Mark Kropf

Mar 5, 2014, 6:23:02 PM
to vcap...@cloudfoundry.org
Hey Brian,

I'm not aware of us ever returning a 303 "See Other" code in the CC. Can you run "cf curl /v2/jobs/ad428cae-bbac-4f53-867e-eda1a41dfa87" and get the job state back? I wonder if this is bubbling up from the nfs server.

Thanks,

Mark

Brian McClain

Mar 5, 2014, 11:37:47 PM
to vcap...@cloudfoundry.org
Hey Mark,

Great to hear from you as always. Below you'll find the output from the cf curl command:


{
  "metadata": {
    "guid": "0",
    "created_at": "1970-01-01T00:00:00+00:00",
    "url": "/v2/jobs/0"
  },
  "entity": {
    "guid": "0",
    "status": "finished"
  }
}


What's catching my eye is the "created_at" field, set to the epoch. Smells fishy, at least.

Thanks!
Brian

Eric Malm

Mar 6, 2014, 3:31:50 AM
to vcap...@cloudfoundry.org
Hi, Brian,

Actually, this response means that this job eventually finished successfully. If a job succeeds, Delayed::Job immediately prunes its record from its delayed_jobs table in CCDB. Consequently, CC has no way to report any information on a successful job, so if it receives a query for a job without a corresponding record in the table it assumes the job was successful and presents this 'null' job with an epoch timestamp and a 'finished' status.
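To make the behavior Eric describes concrete, here's a small sketch (not the actual Cloud Controller source; job_response and the record hash are illustrative names of my own): when no record exists for the requested guid, the response is synthesized with an epoch timestamp and a "finished" status.

```ruby
require 'json'
require 'time'

# Sketch of the "null" job response: when CC receives a query for a
# job guid with no corresponding delayed_jobs record, it assumes the
# job succeeded and synthesizes a finished job stamped at the epoch.
def job_response(record)
  record ||= { guid: "0",
               created_at: Time.at(0).utc.iso8601,  # 1970-01-01, like the output above
               status: "finished" }
  { "metadata" => { "guid"       => record[:guid],
                    "created_at" => record[:created_at],
                    "url"        => "/v2/jobs/#{record[:guid]}" },
    "entity"   => { "guid"   => record[:guid],
                    "status" => record[:status] } }
end

puts JSON.pretty_generate(job_response(nil))
```

Passing nil (no record found) reproduces the shape of the JSON Brian pasted: guid "0", epoch created_at, status "finished".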

If you're still getting this same error, can you check the status of that job right after you see it appear in the logs? Delayed::Job will retry the queued job periodically until it succeeds or reaches the maximum number of attempts.
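The check-before-it-prunes race can be sketched as a simple polling loop (this is illustrative, not the real client or Delayed::Job code; wait_for_job and the status strings are names of my own): keep re-reading the job status until it reaches a terminal state or you give up.

```ruby
# Re-check a job's status until it reports a terminal state.
# The caller's block stands in for the HTTP GET of /v2/jobs/:guid.
def wait_for_job(max_attempts: 10, interval: 1)
  max_attempts.times do
    status = yield                                    # e.g. read entity.status from cf curl
    return status if %w[finished failed].include?(status)
    sleep interval
  end
  "timed_out"
end

# statuses = %w[queued running finished]
# wait_for_job(interval: 0) { statuses.shift }        # => "finished"
```

The point of checking promptly is that once the job succeeds, its record is pruned and every later query returns the synthetic "finished" null job, hiding any intermediate retries.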

Also, if you have access to the worker logs on the CC VM, do you see anything in there related to the guid for this job?

Best,
Eric

Brian McClain

Mar 6, 2014, 4:10:40 PM
to vcap...@cloudfoundry.org
Thanks, Eric, that makes a lot of sense. Looking at the worker logs, nothing seems to stick out; these are the two entries in the log.

After Push:

[Worker(host:2f179f27-a031-43ca-adad-782ea70197d0 pid:7558)] Job VCAP::CloudController::Jobs::ExceptionCatchingJob (id=6) RUNNING
{"timestamp":1394139455.7768395,"message":"blobstore.cp-start","log_level":"info","source":"cc.blobstore","data":{"destination_key":"b7998ccb-0966-4eed-9a82-1cf3542a5395","source_path":"/var/vcap/data/cloud_controller_ng/tmp/safezipper20140306-7558-1q11q5/package.zip","bucket":"cc-packages"},"thread_id":16010940,"fiber_id":39815220,"process_id":7558,"file":"/var/vcap/data/packages/cloud_controller_ng/41.1/cloud_controller_ng/lib/cloud_controller/blobstore/client.rb","lineno":47,"method":"cp_to_blobstore"}
{"timestamp":1394139455.8918765,"message":"blobstore.cp-finish","log_level":"info","source":"cc.blobstore","data":{"destination_key":"b7998ccb-0966-4eed-9a82-1cf3542a5395","duration_seconds":0.115279247,"size":903},"thread_id":16010940,"fiber_id":39815220,"process_id":7558,"file":"/var/vcap/data/packages/cloud_controller_ng/41.1/cloud_controller_ng/lib/cloud_controller/blobstore/client.rb","lineno":60,"method":"cp_to_blobstore"}
[Worker(host:2f179f27-a031-43ca-adad-782ea70197d0 pid:7558)] Job VCAP::CloudController::Jobs::ExceptionCatchingJob (id=6) COMPLETED after 0.2888
[Worker(host:2f179f27-a031-43ca-adad-782ea70197d0 pid:7558)] 1 jobs processed at 3.3084 j/s, 0 failed

After Staging:


[Worker(host:2f179f27-a031-43ca-adad-782ea70197d0 pid:7627)] Job VCAP::CloudController::Jobs::ExceptionCatchingJob (id=7) RUNNING
{"timestamp":1394139479.8246639,"message":"blobstore.cp-start","log_level":"info","source":"cc.blobstore","data":{"destination_key":"b7998ccb-0966-4eed-9a82-1cf3542a5395/b222524b022b1eba5dc83ef669a638cdcd29714b","source_path":"/var/vcap/data/cloud_controller_ng/tmp/staged_droplet_uploads/0000000004","bucket":"cc-droplets"},"thread_id":8611500,"fiber_id":33206960,"process_id":7627,"file":"/var/vcap/data/packages/cloud_controller_ng/41.1/cloud_controller_ng/lib/cloud_controller/blobstore/client.rb","lineno":47,"method":"cp_to_blobstore"}
{"timestamp":1394139481.5314448,"message":"blobstore.cp-finish","log_level":"info","source":"cc.blobstore","data":{"destination_key":"b7998ccb-0966-4eed-9a82-1cf3542a5395/b222524b022b1eba5dc83ef669a638cdcd29714b","duration_seconds":1.706629006,"size":23388778},"thread_id":8611500,"fiber_id":33206960,"process_id":7627,"file":"/var/vcap/data/packages/cloud_controller_ng/41.1/cloud_controller_ng/lib/cloud_controller/blobstore/client.rb","lineno":60,"method":"cp_to_blobstore"}
[Worker(host:2f179f27-a031-43ca-adad-782ea70197d0 pid:7627)] Job VCAP::CloudController::Jobs::ExceptionCatchingJob (id=7) COMPLETED after 1.8844
[Worker(host:2f179f27-a031-43ca-adad-782ea70197d0 pid:7627)] 1 jobs processed at 0.5188 j/s, 0 failed

It's a bit tough to catch the job ID in time, but if I turn on tracing from the cf client, I see this both in the response from the server and in the Cloud Controller logs:

{"timestamp":1394139700.879901,"message":"Encountered error on stager with id 1-6a33b83af1a54ad5a94bd299df081bd1: Staging error: failed to stage application:\nError uploading: http://api.cfv2.mydomain.com/v2/jobs/2581e044-6ce6-4652-84e0-20657e35a359 (Polling status: 303 - )\n\n/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/lib/cloud_controller/app_stager_task.rb:162:in `check_staging_error!'

Which brings me back to the original error. The response to the client includes the full stacktrace; is this expected?

I've also upgraded to 161 to see if this was a weird one-off bug, but I'm still seeing the same behavior, so I'm thinking it's a config issue. Mark's comment got me curious; in my Cloud Foundry manifest, I've specified only the following:

ccng:
  resource_pool:
    resource_directory_key: cc-resources


I wasn't sure whether fog_connection defaulted to the Local provider, but explicitly defining it doesn't resolve the issue either:

ccng:
  resource_pool:
    resource_directory_key: cc-resources
    fog_connection:
      provider: Local
      local_root: /var/vcap/nfs/store

- Brian

Brian McClain

Mar 6, 2014, 4:12:43 PM
to vcap...@cloudfoundry.org
One clarification: I only see that stacktrace DURING staging. After the cf client reports the app is starting up, I don't see any exceptions being thrown or stacktraces being printed, only polling of the app's status.

- Brian

Brian McClain

Mar 7, 2014, 4:06:40 PM
to vcap...@cloudfoundry.org
I believe I've found the root cause, but I can't tell you why it works after a restart.

The issue is that the DEAs cannot resolve my API endpoint (i.e., api.cfv2.mydomain.com), which explains the error uploading the droplet after staging.

HOWEVER... I have no idea why it succeeds after a restart with no changes to the network; the app even starts.

The final error may be unrelated: all apps are reporting 0/1 instances running even though they're actually running. I might revert from hm9000 to health_manager_next and see if that helps.

I'll also have to revisit my DNS settings and see why it's not using Route53 as a recursor, as I had defined.
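A quick sanity check you can run on each DEA is to try resolving the API hostname from Ruby's stdlib Resolv (api.cfv2.mydomain.com is the endpoint from this thread; substitute your own, and note resolvable? is just a helper name I made up):

```ruby
require 'resolv'

# Can this box resolve a given hostname? Resolv checks /etc/hosts
# first, then falls back to the configured DNS recursors, so this
# exercises the same path the DEA's droplet upload would use.
def resolvable?(hostname)
  Resolv.getaddress(hostname)
  true
rescue Resolv::ResolvError
  false
end

# e.g. resolvable?("api.cfv2.mydomain.com")
```

If this returns false on a DEA but true from your workstation, the recursor configuration on the VM is the place to look.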

- Brian

Simon Johansson

Mar 8, 2014, 7:16:47 AM
to vcap...@cloudfoundry.org

Regarding hm9000 not reporting correct usage, have you set the hm9000_noop property? Sent from my phone, so I might have misremembered the name; have a look at the ccng specs.

Brian McClain

Mar 8, 2014, 3:10:45 PM
to vcap...@cloudfoundry.org
Hi Simon, I do have the ccng.hm9000_noop property currently set to false. I don't recall which thread I saw it in, but some users reported that setting solved the issue as well. I'll keep it in my manifest and stay on hm9000 until I resolve the DNS issue on my end; perhaps that will help too. I assume other machines in the cluster might not be behaving as expected since they can't resolve any subdomains.

I'll report back with my findings, thanks!

- Brian

Simon Johansson

Mar 8, 2014, 6:19:27 PM
to vcap...@cloudfoundry.org

Out of interest, are you using nats or gnats? I'm troubleshooting an issue in my installation where I don't get the correct status with hm9000_noop set to false if I'm using gnats; with nats it works as expected. If you are using gnats, could you try switching to nats to see if you can replicate my issue?

Brian McClain

Mar 9, 2014, 2:08:04 PM
to vcap...@cloudfoundry.org
Hey Simon,

I am using nats rather than gnats; nonetheless, I'll dive into the actual machines and make sure the noop setting is actually getting set. Thanks for the tip!

- Brian

Brian McClain

Mar 10, 2014, 10:19:30 AM
to vcap...@cloudfoundry.org
It looks like fixing my DNS so that all machines in the cluster can resolve my domains cleared up both issues I was having: apps not starting after staging, and CF reporting "0/1" instances running even though the app was running perfectly fine.

Moral of the story: it really is always a DNS issue :)

- Brian

Simon Johansson

Mar 10, 2014, 10:23:26 AM
to vcap...@cloudfoundry.org
Woho, nice one.
Was the fix to use the microbosh/bosh server as a DNS server?

Brian McClain

Mar 10, 2014, 10:37:28 AM
to vcap...@cloudfoundry.org
Embarrassingly enough, I was already using microbosh as a DNS server, which explained why all internal DNS addresses worked; it turned out I had fat-fingered the recursor used to resolve entries in our Route53 Hosted Zone. I fixed that in microbosh, and everything started working again.

- Brian