Issue with AWS deployment

Richard Lennox

Jun 26, 2012, 10:29:36 PM
to bosh-users
Have followed this guide (thanks Dr Nic!)

https://github.com/drnic/bosh-getting-started/blob/master/creating-a-bosh-from-scratch.md

Everything works up to a point, but the deploy fails with a timeout
when pinging the new instances, so the machines are never provisioned.
I can see the instances created, VIPs bound, and then destroyed in
AWS, but communication somehow breaks down after an instance is
successfully created, and I'm not sure where to look next.

As I'm using a different EC2 endpoint (ec2.ap-southeast-1.amazonaws.com),
which doesn't have the default kernel_id available, I used the standard
AWS public stemcell image and modified its manifest file to:

---
name: bosh-stemcell
bosh_protocol: "1"
version: 0.5.1
cloud_properties:
  kernel_id: "aki-fe1354ac"

Then I repacked it and uploaded it to my BOSH instance so that I have
a private AMI I can use as a stemcell; this all seems to work OK.
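
Roughly, the repack looked like this (a sketch; stemcell.MF is the
manifest file inside the tarball, the output file name is arbitrary):

$ mkdir stemcell && tar xzf bosh-stemcell-aws-0.5.1.tgz -C stemcell
$ cd stemcell
$ vi stemcell.MF       # set cloud_properties.kernel_id for the target region
$ tar czf ../bosh-stemcell-aws-0.5.1-apse1.tgz *
$ bosh upload stemcell ../bosh-stemcell-aws-0.5.1-apse1.tgz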

Logs from the failed deploy are below.

$ bosh deploy
Getting deployment properties from director...
Unable to get properties list from director, trying without it...
Compiling deployment manifest...
Cannot get current deployment information from director, possibly a new deployment
Please review all changes carefully
Deploying `wordpress.yml' to `myfirstbosh' (type 'yes' to continue):
yes

Director task 31

Preparing deployment
  binding deployment (00:00:00)
  binding releases (00:00:00)
  binding existing deployment (00:00:00)
  binding resource pools (00:00:00)
  binding stemcells (00:00:00)
  binding templates (00:00:00)
  binding unallocated VMs (00:00:00)
  binding instance networks (00:00:00)
Done                    8/8 00:00:00

Reusing already compiled packages
  copying compiled packages (00:00:00)
Done                    1/1 00:00:00

Preparing DNS
  binding DNS (00:00:00)
Done                    1/1 00:00:00

Creating bound missing VMs
  common/1: Timed out pinging to d8046fb9-a058-4397-8087-07ae7e1e5437 after 300 seconds (00:05:43)
  common/0: Timed out pinging to 28844764-ef21-47de-961f-26ff1516d597 after 300 seconds (00:05:51)
  common/2: Timed out pinging to 5ce11760-9830-41af-a149-df59bd71d087 after 300 seconds (00:07:43)
Error                   3/3 00:07:43

Error 100: Timed out pinging to d8046fb9-a058-4397-8087-07ae7e1e5437 after 300 seconds


Guillaume Berche

Jun 27, 2012, 5:30:29 PM
to bosh-users
I'm not yet that far into the deploying-sample-release.md tutorial from
Dr Nic (thanks again). However, I'm curious to learn the details of a
complete AWS manifest for wordpress. Is there a complete example
available somewhere? Can you share your version?

I'm wondering about the details of the ACCESS_KEY, SECRET_ACCESS_KEY, and
BOSH_AWS_REGISTRY_DNS_NAME placeholders, as I could not find them in
https://github.com/drnic/bosh-getting-started/blob/master/examples/wordpress/deployment-manifest-initial.yml

Should those be specified in wordpress-aws.yml as access_key_id and
secret_access_key under properties.cloud_properties?
https://github.com/cloudfoundry/oss-docs/blob/master/bosh/documentation/documentation.md#bosh-deployment-manifest
mentions compilation.cloud_properties and resource_pools.cloud_properties
as required, but I guess this is only necessary to override global
properties. Could the defaults configured on the micro-bosh apply,
avoiding the need to specify them in properties.cloud_properties?
The samples under
https://github.com/cloudfoundry/oss-docs/blob/master/bosh/samples
do not specify properties.cloud_properties and seem to only set
vSphere VM sizing and VLAN default overrides.

Where should BOSH_AWS_REGISTRY_DNS_NAME be specified?

I ended up with the following in wordpress-aws.yml (without being quite
confident; I'll try it next):
[...]
properties:
  [...]
  mysql:
    [...]
  cloud_properties:
    access_key_id: XXXX
    secret_access_key: XXXX
    ec2_endpoint: ec2.eu-west-1.amazonaws.com
    default_key_name: keyname
    default_security_groups: ["privatesg"]
    endpoint: http://admin:ad...@mydns.com:25555

Besides, below are my notes from running creating-a-bosh-from-scratch.md;
I'll push Dr Nic some updates on GitHub if they can be useful to the
next newbies like me running the tutorial.

Thanks again,

Guillaume.

Differences in my context from
https://github.com/drnic/bosh-getting-started/blob/master/creating-a-bosh-from-scratch.md:
- running on eu-west-1 with ami-3d1f2b49 rather than us-east-1
- developer box running on an EC2 instance in the same security group
as the micro-bosh instance

Issues I met:
- The AMI for eu-west-1 has only the en_GB.utf8 locale (PostgreSQL
startup fails, complaining about an invalid lc_messages of
"en_US.UTF-8"; see
https://bugs.launchpad.net/ubuntu/+source/postgresql-8.2/+bug/162517).
I fixed it by adding the following after the ./prepare_instance.sh
command:

# check if en_US.UTF-8 is indeed available
locale -a
# if not, add it
sudo locale-gen en_US.UTF-8

Suggesting to state the developer box assumptions ("From another
terminal on your local machine"):
- GitHub SSH keys installed as in https://help.github.com/articles/generating-ssh-keys
- ruby, rubygems, and rake installed. What versions are required? What's
the safest way for a Ruby newbie (rbenv, rvm)? I reused the steps from
prepare_instance.sh (adding sudos and ruby1.8 to the listed commands).

Possibly add a word about not yet trying to run the tutorial on VPC
instances, linking to https://groups.google.com/a/cloudfoundry.org/group/bosh-dev/browse_thread/thread/30ca9b70b23fa4e7
as well as to the Ubuntu 10.04 AMI bug https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/615545

Dr Nic Williams

Jun 27, 2012, 6:05:43 PM
to bosh-...@cloudfoundry.org
Deployment manifests do not include AWS credentials or region information (sadly?). Instead, each BOSH owns the relationship with one cloud/region/account (a set of cloud properties). To deploy to multiple regions/multiple accounts, you need one BOSH per account/region. 

The cloud_properties you include in a deployment manifest are for VM-related attributes (CPUs, instance types, etc.).

If you're looking at a manifest yml that mentions BOSH_AWS_REGISTRY_DNS_NAME or ACCESS_KEY then it is a manifest for describing the creation of a BOSH itself.
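
For example, on AWS the cloud_properties of a resource pool would carry just VM attributes, something like this (a sketch, no credentials):

resource_pools:
  - name: common
    cloud_properties:
      instance_type: m1.small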

Nic

Dr Nic Williams - VP Developer Evangelism
Engine Yard
The Leading Platform as a Service
Mobile: +1 415 860 2185
Skype: nicwilliams
Twitter: @drnic

Dr Nic Williams

Jun 27, 2012, 6:20:00 PM
to bosh-...@cloudfoundry.org
Richard, sorry, I can't help with kernel-related questions. Hopefully someone else can!

Nic


Dr Nic Williams

Jun 27, 2012, 7:14:03 PM
to bosh-...@cloudfoundry.org
I understand where this question comes from. My tutorial hasn't been updated since I learnt what I just emailed: https://github.com/drnic/bosh-getting-started/blob/master/deploying-sample-release.md#sample-release

I'll fix this soon. Sorry.

Nic


Guillaume Berche

Jul 4, 2012, 6:26:21 PM
to bosh-...@cloudfoundry.org

Thanks Nic for the clarifications. I'm now seeing similar symptoms to Richard's above: when deploying the wordpress-aws deployment in the eu-west-1 region, BOSH instantiates a compilation EC2 instance but fails to connect to the agent within 300s, and thus terminates the instance.

I was unable to SSH into the compilation EC2 instance, either using the SSH key specified in the "key_name" property (which I can see properly in the debug traces for the task) or using the default root/vcap password specified in https://github.com/cloudfoundry/bosh/blob/master/agent/misc/stemcell/vmbuilder.erb#L6-8; SSH authentication fails in both cases.

The same happens if I manually launch the instance from the AWS console with the same parameters as found in the task debug logs (including user data).

I tried patching the stemcell to use aki-62695816, the pv-grub 1.02 kernel matching the default aki-825ea7eb referenced in bosh-stemcell-aws-0.5.1.tgz, but without any more luck:
eu-west-1      aki-62695816   amd64   kernel  ec2-public-images-eu/pv-grub-hd0_1.02-amd64.gz.manifest.xml
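
The matching kernel can be listed with the EC2 API tools (a sketch, assuming the tools are installed and credentials configured):

$ ec2-describe-images -o amazon --region eu-west-1 | grep pv-grub-hd0_1.02-amd64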

The AWS system log didn't show entries that looked abnormal; it just ended with "Creating SSH2 RSA key; this may take some time".

Having followed the https://github.com/drnic/bosh-getting-started tutorial with the chef-based installation, I don't have the micro commands mentioned by Doug MacEachern in https://groups.google.com/a/cloudfoundry.org/forum/?fromgroups#!searchin/bosh-users/password/bosh-users/I45ZjWve1rI/GS6vABqAXlIJ for changing the default password as part of "bosh deploy". Is that normal? Could this help?

Any hints on fixing this apparently invalid AMI, which produces unresponsive agents and refuses incoming SSH connections?

 Thanks in advance,

 Guillaume.

Guillaume Berche

Jul 4, 2012, 7:04:20 PM
to bosh-...@cloudfoundry.org
I also tried to apply one of Vadim's posts ("The username is vcap, password should be set via the env in the deployment manifest.") from https://groups.google.com/a/cloudfoundry.org/forum/?fromgroups#!search/The$20username$20is$20vcap,$20password$20should$20be$20set$20via$20the$20env$20in$20the$20deployment/bosh-dev/DlRKMnbhWlY/ggnx9V8d_2gJ

I added the "password" property to the env of both the compilation section and the resource pools (see the extract of my wordpress-aws.yml deployment manifest below), but without any more luck: interactive SSH access as the vcap user is still refused on the compilation VM.

compilation:
  workers: 1 # only the required number are provisioned
  network: default
  cloud_properties:
    instance_type: m1.small
    availability_zone: "eu-west-1c"
    key_name: "mykey"
  env:
    password: "xxx"

[...]
resource_pools:
  - name: common
    network: default
    size: 3
    stemcell:
      name: bosh-stemcell
      version: 0.5.1
    cloud_properties:
      instance_type: m1.small
      availability_zone: "eu-west-1c"
      key_name: "mykey"
    env:
      password: "xxx"

Guillaume.

Vadim Spivak

Jul 4, 2012, 11:32:51 PM
to bosh-...@cloudfoundry.org
How was the password provided? Plain text or encrypted in the manifest?

Thanks,
Vadim

Guillaume Berche

Jul 5, 2012, 4:39:17 AM
to bosh-...@cloudfoundry.org
The password was provided in clear text; is that what's expected?
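
If an encrypted value is expected instead, I suppose it would be a crypt(3)-style hash as in /etc/shadow, generated along these lines (an assumption on my side about the expected format):

$ mkpasswd -m sha-512 'mysecret'   # mkpasswd is in the whois package on Ubuntu
$ openssl passwd -1 'mysecret'     # alternative: MD5-based crypt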

Thanks,

Guillaume.

ps: since the AMI is EBS-backed, I may try to move the EBS volume to another instance in order to inspect the instance configuration and logs.
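
Something along these lines with the EC2 API tools (instance/volume IDs are placeholders; the device name seen by the rescue instance may differ):

$ ec2-stop-instances i-xxxxxxxx                    # stop, don't terminate, so the root volume survives
$ ec2-detach-volume vol-xxxxxxxx
$ ec2-attach-volume vol-xxxxxxxx -i i-yyyyyyyy -d /dev/sdf
$ sudo mount /dev/xvdf /mnt/srv-recovery           # run on the rescue instance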


Guillaume Berche

Jul 5, 2012, 7:25:13 PM
to bosh-...@cloudfoundry.org
It seems the problem on non-default (non-US) AWS regions was that the /var/vcap/deploy/bosh/aws_registry/shared/config/aws_registry.yml file, as populated by the chef deployer, is missing the EC2 endpoint. After adding it manually to the aws section and restarting the aws_registry, compilation worked fine for 3 jobs and then failed, presumably due to a race condition similar to the one Martin fixed in http://reviews.cloudfoundry.org/#/c/6507/. I guess the retry could also apply to InvalidInstanceID.NotFound in addition to InvalidAMIID::NotFound. I'll give it a try tomorrow if I get enough time (I'm still learning Ruby basics) before leaving for vacation.
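
For reference, the aws section ends up looking something like this (key names assumed from the chef-populated config; values are placeholders):

aws:
  access_key_id: XXXX
  secret_access_key: XXXX
  ec2_endpoint: ec2.eu-west-1.amazonaws.com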

However, I could not understand why the EC2 instances created by BOSH don't seem to accept SSH connections. Digging into the root EBS volume of the compilation instance from another instance, I could see that the key pairs are not installed in /home/vcap or /root (no .ssh directory).

I've updated Dr Nic's tutorial with the workarounds the community gave for running on a non-US EC2 region (thanks!), and will submit a pull request after reviewing them. In the meantime, the next users starting their install from scratch can have a look at https://github.com/gberche/bosh-getting-started to avoid having to follow each email in this thread.

Details on my current state and diagnostics follow below, in case they help others.

Guillaume.


Current state:

Compiling packages
  mysql/0.1-dev (00:03:03)
  wordpress/0.1-dev (00:02:53)
  apache2/0.1-dev (00:10:46)
  mysqlclient/0.1-dev (00:02:47)
  nginx/0.1-dev: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-498afc01' does not exist</Message></Error></Errors><RequestID>e0181565-379d-453a-a9ff-382c6a4e7d24</RequestID></Response> (00:00:01)
Error                                5/6 00:19:30

Error 100: The instance ID 'i-498afc01' does not exist

Previous diagnostics:

Digging into the root EBS volume of the compilation instance from another EC2 instance, I could see that:
- the key pairs are not installed in /home/vcap or /root (no .ssh directory)
- the AWS registry is returning 500 errors for the just-created EC2 instance.

/mnt/srv-recovery$ sudo less ./var/vcap/bosh/log/current
2012-07-05_20:43:22.84664 #[628] INFO: Starting agent 0.5.1...
2012-07-05_20:43:22.84674 #[628] INFO: Configuring agent...
2012-07-05_20:43:23.06420 #[628] INFO: Configuring instance
2012-07-05_20:43:23.37917 /var/vcap/bosh/agent/lib/agent/infrastructure/aws/registry.rb:53:in `get_json_from_url': Cannot read settings for `http://admin:ad...@mydns.com:25777/instances/i-2f94e267/settings' from registry, got HTTP 500 (RuntimeError)
2012-07-05_20:43:23.37920       from /var/vcap/bosh/agent/lib/agent/infrastructure/aws/registry.rb:91:in `get_settings'
2012-07-05_20:43:23.37921       from /var/vcap/bosh/agent/lib/agent/infrastructure/aws/settings.rb:32:in `load_settings'
2012-07-05_20:43:23.37921       from /var/vcap/bosh/agent/lib/agent/infrastructure/aws.rb:10:in `load_settings'
2012-07-05_20:43:23.37922       from /var/vcap/bosh/agent/lib/agent/bootstrap.rb:60:in `load_settings'
2012-07-05_20:43:23.37922       from /var/vcap/bosh/agent/lib/agent/bootstrap.rb:34:in `configure'
2012-07-05_20:43:23.37923       from /var/vcap/bosh/agent/lib/agent.rb:92:in `start'
2012-07-05_20:43:23.37924       from /var/vcap/bosh/agent/lib/agent.rb:71:in `run'
2012-07-05_20:43:23.37924       from /var/vcap/bosh/agent/bin/agent:97:in `<main>'

Following that, on the micro BOSH:

less /var/vcap/deploy/bosh/aws_registry/shared/logs/aws_registry.debug.log
E, [2012-07-05T20:43:18.396913 #11223] ERROR -- : AWS error: <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-2f94e267' does not exist</Message></Error></Errors><RequestID>502c1485-6f02-49fd-81aa-de97a48d92c0</RequestID></Response> (Bosh::AwsRegistry::AwsError)
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:67:in `rescue in instance_private_ip'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:65:in `instance_private_ip'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:47:in `check_instance_ip'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/instance_manager.rb:34:in `read_settings'
/var/vcap/deploy/bosh/aws_registry/current/aws_registry/lib/aws_registry/api_controller.rb:20:in `block in <class:ApiController>'

After fixing the EC2 endpoint and restarting the registry with "sudo sv restart aws_registry", the first compilation jobs work until the apparent race condition triggers:

bosh task 41 --debug
[…]
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-498afc01' does not exist</Message></Error></Errors><RequestID>e0181565-379d-453a-a9ff-382c6a4e7d24</RequestID></Response>
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/client.rb:277:in `return_or_raise'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/client.rb:337:in `client_request'
(eval):3:in `describe_instances'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/ec2/resource.rb:72:in `describe_call'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/ec2/instance.rb:631:in `get_resource'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/resource.rb:207:in `block (2 levels) in define_attribute_getter'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/cacheable.rb:64:in `retrieve_attribute'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/ec2/resource.rb:66:in `retrieve_attribute'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/aws-sdk-1.3.8/lib/aws/core/resource.rb:207:in `block in define_attribute_getter'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/helpers.rb:37:in `block in wait_resource'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/helpers.rb:25:in `loop'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/helpers.rb:25:in `wait_resource'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/cloud.rb:119:in `block in create_vm'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_common-0.4.0/lib/common/thread_formatter.rb:46:in `with_thread_name'
/var/vcap/deploy/bosh/director/shared/gems/ruby/1.9.1/gems/bosh_aws_cpi-0.4.1/lib/cloud/aws/cloud.rb:89:in `create_vm'

whereas the instance was indeed created in EC2.

Guillaume Berche

Jul 6, 2012, 11:16:47 AM
to bosh-...@cloudfoundry.org
An attempt to extend Martin's fix for the AWS race condition on EC2 instance creation: http://reviews.cloudfoundry.org/#/c/6967/

It seemed to work on my micro BOSH running in the eu-west-1 region (availability zone eu-west-1c), but I could not test much.

Guillaume.

$ bosh --non-interactive delete deployment wordpress; bosh --non-interactive deploy
[..]
Updating job nginx
  nginx/0 (canary) (00:00:34)
Done                                 1/1 00:00:34

Updating job wordpress
  wordpress/0 (canary) (00:00:38)
Done                                 1/1 00:00:38

Updating job mysql
  mysql/0 (canary) (00:01:04)
Done                                 1/1 00:01:04

Task 47 done
Started         Fri Jul 06 12:48:47 UTC 2012
Finished        Fri Jul 06 12:53:46 UTC 2012
Duration        00:04:59

Deployed `wordpress-aws.yml' to `myfirstbosh'

Stephen Kinser

Jul 8, 2012, 1:35:00 AM
to bosh-...@cloudfoundry.org
I think we need to handle these too:

AWS::EC2::Errors::InvalidVolume::NotFound - AWS sometimes raises this right after a create-volume call
AWS::Core::Resource::NotFound - AWS sometimes raises this right after the attach_to call attaching a volume to an instance

If we're deleting a resource, we don't want to keep retrying on NotFound errors. One option would be to change wait_resource to take a block, allowing the caller to specify how to handle errors that are raised. That way, for deletions, you wouldn't retry on NotFound but you would for creations. If the block isn't passed in, just let the error go up the call stack.
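
A minimal sketch of that idea, assuming the rough shape of bosh_aws_cpi's wait_resource helper (the signature and block protocol here are hypothetical, not the current API):

def wait_resource(resource, start_state, target_state, state_method = :status)
  state = start_state
  until state == target_state
    begin
      state = resource.send(state_method)
    rescue AWS::Core::Resource::NotFound,
           AWS::EC2::Errors::InvalidInstanceID::NotFound => e
      # No block given: let the error go up the call stack (deletion path).
      # Block returns true: swallow and retry (creation path, to ride out
      # AWS eventual consistency right after the create call).
      raise e unless block_given? && yield(e)
    end
    sleep(1) unless state == target_state
  end
end

# creation: retry on NotFound while AWS catches up
wait_resource(instance, :pending, :running) { |e| true }

# deletion: no block, so NotFound propagates immediately
wait_resource(volume, :in_use, :deleted)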
