Hi Bosh mailing list,
First, although it's not overly relevant to my problem, some background on what I'm doing and my progress so far. I'm attempting to use a combination of Bosh and Chef to automate a cloud deployment on top of AWS. This is actually a stopgap until Docker matures, at which point I'll probably transition to using Bosh to deploy VMs and set up my network, Docker to encapsulate services (running as a Bosh job), and Chef to install software into the Docker images. Why a combination of all three when each of them competes for the same space? After diving into each of them, I think they can all take on complementary roles in automating deployments that play to their strengths. While I agree in principle that installing everything from source is the most robust and least error-prone solution, in practice, as a developer, I just want a clean way to both quickly bring up complete environments and easily update my code base across those environments (hopefully supporting local installs along the way). If anyone is curious I'm happy to go into this further, but for now I just thought I'd mention it as it's probably a slightly non-standard use case.
Anyway, progress so far: I've managed to build my own stemcell, upgrade it to Ubuntu 12.04.3 with a 3.8 kernel (so it can support Docker), and add a Chef stage that installs Chef and Berkshelf into the stemcell image. The stemcell uploads without issue and I can deploy jobs onto it. I'm currently branched off the 1798 release; I'm happy to send a patch if you'd like to see my change set. As an aside, I ran into lots of issues trying to get a 13.XX stemcell working, so I stuck with 12.04.3 and just upgraded its kernel.
Next, I have a micro-bosh deployed on top of AWS and I can successfully use it to deploy jobs. I've created several test jobs and packages; the one I'm currently focusing on is an apt repository which Chef will pull from. My jobs use a wrapper script that monit invokes, which registers the box with Chef, does a sync, and de-registers from Chef on server shutdown. Everything seems to work fairly well (for the apt box, monit monitors nginx, which serves up the repository).
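For context, the wrapper is roughly shaped like a standard ctl script. Here's an illustrative sketch only — the command names and paths below are placeholders, not my actual script:

```shell
#!/bin/bash
# Illustrative sketch of the ctl-style wrapper monit invokes.
# Names/paths are placeholders, not the real job template.
case "$1" in
  start)
    # Register this box with the Chef server and converge (sync).
    chef-client --once
    # Hand off to the real service; monit watches its pid file.
    /usr/sbin/nginx
    ;;
  stop)
    /usr/sbin/nginx -s stop
    # De-register the node from Chef on shutdown.
    knife node delete "$(hostname)" -y
    ;;
esac
```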
The issue I'm running into is that no matter how I change my manifest, the root file system always comes up as 2 GB. The nginx install thus fails due to lack of space on the drive. I've tried altering both the resource_pools and jobs sections of the manifest, but Bosh seems to ignore the configuration when creating the root EBS volume. See below for an excerpt from my apt-repository.yml manifest:
resource_pools:
- name: common
  network: default
  size: 1
  stemcell:
    name: bosh-aws-xen-ubuntu
    version: 1798
  cloud_properties:
    instance_type: m1.medium
    disk: 8192

jobs:
- name: apt
  template:
  - apt
  instances: 1
  resource_pool: common
  networks:
  - name: default
    default:
    - dns
    - gateway
  cloud_properties:
    instance_type: m1.medium
    disk: 8192

I've also tried updating the manifest of the micro-bosh instance and redeploying it (--update) to see if it would re-size the disk, but that did not work either (it re-attached the same EBS volume; maybe I need to completely wipe out the micro-bosh instance?). I then tried altering the default disk size and rebuilding my stemcell, with the end result of bosh upload stemcell erroring out with:
E, [2014-01-30T03:37:35.024803 #4961] [task:256] ERROR -- : Unable to copy stemcell root image: command 'sudo -n /var/vcap/jobs/director/bin/stemcell-copy /var/vcap/data/tmp/director/stemcell20140130-4961-fv3nch/image /dev/xvdg 2>&1' failed with exit code 1
From the log it looks as though the volume gets created with the following properties:
I, [2014-01-30T03:35:05.798118 #4961] [task:256] INFO -- : Found stemcell image `bosh-aws-xen-ubuntu/1798', cloud properties are {"name"=>"bosh-aws-xen-ubuntu", "version"=>"1798", "infrastructure"=>"aws", "architecture"=>"x86_64", "root_device_name"=>"/dev/sda1"}
So it appears the copy failed because the stemcell image is larger than the underlying volume (which makes me think this is the wrong direction and I should be hooking into a point in the deployment process that occurs before the stemcell gets copied, i.e. disk should be listed as one of the cloud properties above...).
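One way I'm thinking of sanity-checking this is to compare the raw image size against the volume directly. Assuming a full (non-light) stemcell tarball — the filename below is illustrative — something along the lines of:

```shell
# Assumes a full (non-light) stemcell tarball; the filename is illustrative.
tar tzf bosh-stemcell-1798-aws-xen-ubuntu.tgz        # should list stemcell.MF and image
tar xzf bosh-stemcell-1798-aws-xen-ubuntu.tgz image
ls -lh image                                         # compare against the target volume size
```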
I'm going to continue working through the Ruby code to try to understand exactly where in the deployment process the volume is created, but any help pointing me in the right direction would be greatly appreciated. Another option is to change all my cookbooks to install into a chroot'd environment on top of /var/vcap/store (created using persistent_disk, which does work); perhaps this is a better direction from a Bosh standpoint (although I don't look forward to having to rework my cookbooks...)?
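For reference, the persistent_disk route that does work is just a one-line addition to the job (size in MB; the volume gets attached and mounted at /var/vcap/store):

```yaml
jobs:
- name: apt
  template:
  - apt
  instances: 1
  resource_pool: common
  persistent_disk: 8192   # MB; mounted at /var/vcap/store
  networks:
  - name: default
```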
Lastly, many thanks for open-sourcing Bosh; I think it has the potential to be a very powerful tool. My only concern is that some of the design decisions seem aimed more at locking the user into the Bosh use case than at creating a pluggable framework, but it's very likely I have that perception due to my lack of understanding of the project (or the fact that I'm a developer moonlighting as a dev-ops engineer...).
Some debug output to help the cause:
mt@mt:~/workspace/bosh/releases/apt-repository$ bosh status
Config
             /home/mt/.bosh_config

Director
  Name       aws
  URL        https://XX.XX.XX.XX:25555
  Version    1.5.0.pre.912 (release:70118c29 bosh:70118c29)
  User       admin
  UUID       XXXXXXXXXXXXXXXXXXXXXXX
  CPI        aws
  dns        enabled (domain_name: microbosh)
  compiled_package_cache disabled
  snapshots  disabled

Deployment
  Manifest   /home/mt/workspace/bosh/releases/apt-repository/apt-repository.yml

Release
  dev    apt/0.2-dev
  final  n/a
mt@mt:~/workspace/bosh/releases$ bosh stemcells
+---------------------+---------+--------------+
| Name | Version | CID |
+---------------------+---------+--------------+
| bosh-aws-xen-ubuntu | 1782 | ami-XXXXXXX |
| bosh-aws-xen-ubuntu | 1798 | ami-XXXXXXX |
+---------------------+---------+--------------+
(*) Currently in-use
Stemcells total: 2
Best,
- MT