It seems the problem on an AWS region other than the default US one was that the /var/vcap/deploy/bosh/aws_registry/shared/config/aws_registry.yml file, as populated by the chef deployer, is missing the EC2 endpoint. After adding it manually to the aws section and restarting the aws_registry, the compilation job worked fine for 3 jobs and then failed, presumably because of a race condition similar to the one Martin fixed in
http://reviews.cloudfoundry.org/#/c/6507/
I guess the retry could also apply to InvalidInstanceID.NotFound in addition to InvalidAMIID::NotFound. I'll give it a try tomorrow if I get enough time (I'm still learning Ruby basics) before leaving on vacation.
However, I still could not work out why the EC2 instances created by BOSH don't seem to accept SSH connections. Digging into the root EBS volume of the compilation instance from another instance, I could see that the key pairs are not installed into /home/vcap/ or /root (no .ssh directory).
I've updated Dr Nic's tutorial with the workarounds the community gave for running in a non-US EC2 region (thanks!), and will submit a pull request after reviewing them. In the meantime, new users starting their install from scratch can have a look at
https://github.com/gberche/bosh-getting-started to avoid following each email in this thread.
Below are details on my current state and diagnostics, in case they help others.
Guillaume.
Current state:
Compiling packages
mysql/0.1-dev (00:03:03)
wordpress/0.1-dev (00:02:53)
apache2/0.1-dev (00:10:46)
mysqlclient/0.1-dev (00:02:47)
nginx/0.1-dev: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-498afc01' does not exist</Message></Error></Errors><RequestID>e0181565-379d-453a-a9ff-382c6a4e7d24</RequestID></Response> (00:00:01)
Error 5/6 00:19:30
Error 100: The instance ID 'i-498afc01' does not exist
Previous diagnostics:
Digging into the root EBS volume of the compilation instance from another EC2 instance, I could see that:
- the key pairs are not installed into /home/vcap/ or /root (no .ssh directory)
- the aws registry is returning 500 errors for the just-created EC2 instance:
/mnt/srv-recovery$ sudo less ./var/vcap/bosh/log/current
2012-07-05_20:43:22.84664 #[628] INFO: Starting agent 0.5.1...
2012-07-05_20:43:22.84674 #[628] INFO: Configuring agent...
2012-07-05_20:43:23.06420 #[628] INFO: Configuring instance
2012-07-05_20:43:23.37917 /var/vcap/bosh/agent/lib/agent/infrastructure/aws/registry.rb:53:in `get_json_from_url': Cannot read settings for `http://admin:ad...@mydns.com:25777/instances/i-2f94e267/settings' from registry, got HTTP 500 (RuntimeError)
2012-07-05_20:43:23.37920 from /var/vcap/bosh/agent/lib/agent/infrastructure/aws/registry.rb:91:in `get_settings'
2012-07-05_20:43:23.37921 from /var/vcap/bosh/agent/lib/agent/infrastructure/aws/settings.rb:32:in `load_settings'
2012-07-05_20:43:23.37921 from /var/vcap/bosh/agent/lib/agent/infrastructure/aws.rb:10:in `load_settings'
2012-07-05_20:43:23.37922 from /var/vcap/bosh/agent/lib/agent/bootstrap.rb:60:in `load_settings'
2012-07-05_20:43:23.37922 from /var/vcap/bosh/agent/lib/agent/bootstrap.rb:34:in `configure'
2012-07-05_20:43:23.37923 from /var/vcap/bosh/agent/lib/agent.rb:92:in `start'
2012-07-05_20:43:23.37924 from /var/vcap/bosh/agent/lib/agent.rb:71:in `run'
2012-07-05_20:43:23.37924 from /var/vcap/bosh/agent/bin/agent:97:in `<main>'
Following that up on the micro BOSH:
less /var/vcap/deploy/bosh/aws_registry/shared/logs/aws_registry.debug.log
E, [2012-07-05T20:43:18.396913 #11223] ERROR -- : AWS error: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-2f94e267' does not exist</Message></Error></Errors><RequestID>502c1485-6f02-49fd-81aa-de97a48d92c0</RequestID></Response> (Bosh::AwsRegistry::AwsError)
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:67:in `rescue in instance_private_ip'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:65:in `instance_private_ip'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:47:in `check_instance_ip'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:34:in `read_settings'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/api_controller.rb:20:in `block in <class:ApiController>'
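For reference, a quick way to check a registry config for the missing endpoint could look like the Ruby sketch below. It is only an illustration: the key name ec2_endpoint, the other keys in the sample, the eu-west-1 endpoint value and the helper name ec2_endpoint_configured? are assumptions, not copied from the actual aws_registry.yml.

```ruby
require 'yaml'

# Hypothetical helper: true when the aws section carries a non-empty
# ec2_endpoint entry. The key name "ec2_endpoint" is an assumption.
def ec2_endpoint_configured?(yaml_text)
  config = YAML.load(yaml_text) || {}
  aws = config['aws'] || {}
  !aws['ec2_endpoint'].to_s.strip.empty?
end

# Sample content with placeholder values, not taken from a real deployment.
sample = <<~YAML
  aws:
    access_key_id: PLACEHOLDER
    secret_access_key: PLACEHOLDER
    ec2_endpoint: ec2.eu-west-1.amazonaws.com
YAML

puts ec2_endpoint_configured?(sample)   # prints "true"
```

On a real micro BOSH the text would come from File.read on the aws_registry.yml path given at the top of this mail.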
After fixing the EC2 endpoint and restarting the registry with "sudo sv restart aws_registry", the first compilation jobs work until the apparent race condition is triggered:
bosh task 41 --debug
[…]
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-498afc01' does not exist</Message></Error></Errors><RequestID>e0181565-379d-453a-a9ff-382c6a4e7d24</RequestID></Response>
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/client.rb:277:in `return_or_raise'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/client.rb:337:in `client_request'
(eval):3:in `describe_instances'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/ec2/resource.rb:72:in `describe_call'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/ec2/instance.rb:631:in `get_resource'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/resource.rb:207:in `block (2 levels) in define_attribute_getter'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/cacheable.rb:64:in `retrieve_attribute'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/ec2/resource.rb:66:in `retrieve_attribute'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/resource.rb:207:in `block in define_attribute_getter'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/helpers.rb:37:in `block in wait_resource'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/helpers.rb:25:in `loop'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/helpers.rb:25:in `wait_resource'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/cloud.rb:119:in `block in create_vm'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_common-0.4.0/lib/common/thread_formatter.rb:46:in `with_thread_name'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/cloud.rb:89:in `create_vm'
even though the instance was indeed created in EC2.
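The kind of retry mentioned at the top of this mail could be sketched roughly as follows. This is not the change from the review linked above, nor the actual bosh_aws_cpi code: InstanceNotFound is a stand-in for the SDK's InvalidInstanceID.NotFound error so that the sketch runs without aws-sdk, and with_not_found_retry, max_tries and delay are hypothetical names.

```ruby
# Stand-in for the SDK error, so the sketch runs without aws-sdk installed.
class InstanceNotFound < StandardError; end

# Hypothetical helper: runs the block and retries while the instance is not
# yet visible, re-raising the error once max_tries attempts have been made.
def with_not_found_retry(max_tries: 5, delay: 1.0, errors: [InstanceNotFound])
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors
    raise if tries >= max_tries
    sleep(delay)
    retry
  end
end

# Simulated lookup that fails twice before the instance becomes visible.
calls = 0
status = with_not_found_retry(delay: 0) do
  calls += 1
  raise InstanceNotFound, "The instance ID 'i-498afc01' does not exist" if calls < 3
  :running
end

puts "status=#{status} after #{calls} calls"   # prints "status=running after 3 calls"
```

In the trace above, the lookup that raises is the attribute read inside wait_resource, so that is presumably where such a wrapper would go. The number of tries and the delay are guesses and would need tuning.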