Building eLife AWS deployment


Lee Roder

Jul 13, 2017, 11:09:13 AM
to elife-continuum-list
I have successfully built the eLife salt master (master-server, per the following URL: https://github.com/elifesciences/builder/blob/master/docs/master-server.md).  I next attempted to build a web server - ./bldr launch:elife-website,prod.  The EC2 instance appears to have been successfully launched -- here's the tail end of the console output:

[54.165.226.93] out: Summary for local
[54.165.226.93] out: -------------
[54.165.226.93] out: Succeeded: 56
[54.165.226.93] out: Failed:     0
[54.165.226.93] out: -------------
[54.165.226.93] out: Total states run:     56
[54.165.226.93] out: Total run time:    4.778 s
[54.165.226.93] out:

However, if I point my browser to http://[external ip address], there appears to be no web server running.  At that point I ssh'd into the EC2 instance, and there does not appear to be any web server installed.  This is nginx-based, correct?  nginx is definitely not installed.  Nor is apache.  There is no /var/www... folder, no www-data user.  Hmmmm... did I miss something?
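For reference, checks along these lines (a rough sketch, not the exact commands I ran) show the absence:

# quick sanity checks on the instance
which nginx apache2                      # no web server binaries found
ls -d /var/www                           # no web root
getent passwd www-data                   # no www-data user
sudo ss -tlnp | grep -E ':(80|443)\b'    # nothing listening on 80 or 443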

Giorgio Sironi

Jul 13, 2017, 12:18:44 PM
to Lee Roder, elife-continuum-list, Jennifer Strejevitch
Hi,
the `elife-website` project is not supported any more, as it was the 1.0 version of our website. The correct project to use (the one running on elifesciences.org) is `journal`.

Elaborating on the problem: the configuration for that project (the same can happen for other projects you want to use) is missing from https://github.com/elifesciences/builder-private-example/blob/master/salt/top.sls, so probably only '*' is being matched by top.sls and only a few base states (the 56 you saw) are being applied.

If you encounter a project behaving this way, you should configure it in your:
- https://github.com/elifesciences/builder-private-example/blob/master/salt/top.sls
- https://github.com/elifesciences/builder-private-example/blob/master/pillar/top.sls
following the examples set in
- https://github.com/elifesciences/journal-formula/blob/master/salt/example.top
- https://github.com/elifesciences/journal-formula/blob/master/salt/pillar/journal.sls respectively.
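For example, the entries end up looking roughly like this (a minimal sketch -- the 'journal-*' matcher and state names are illustrative, copy the real ones from the formula's example.top):

# salt/top.sls (and similarly pillar/top.sls)
base:
    '*':
        - elife
    'journal-*':
        - journal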

I will be adding the same examples to builder-private-example as I try them out in a test AWS environment separated from eLife.





--
Giorgio Sironi
@giorgiosironi

Lee Roder

Jul 13, 2017, 12:59:31 PM
to elife-continuum-list
Great!  Thanks for the update.  So, that brings me to another question.  There is a template salt pillar for the journal system: journal.sls.  Inside it are multiple values.  Which ones are required, and what other configuration must take place in order for the website to function properly?  Specifically, I have a domain registered with Route53 -- I haven't updated my name servers to point to Amazon yet, but will do so.  In Route53 I have configured a public zone and a private zone (.local instead of .internal, but that's a minor point, just a personal preference).  However, I see right at the top of the journal.sls file the following two values:

    api_url: http://baseline--api-dummy.thedailybugle.internal
    api_url_public: https://baseline--api-dummy.thedailybugle.org

Once I update those to, say, api.mydomain.local (my Route53 private zone being mydomain.local) and api.mydomain.com (my Route53 public zone being mydomain.com), do I need to perform any other configuration for those URLs to just 'start working'?
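In other words, something like this (api.mydomain.local / api.mydomain.com being stand-ins for my own zones):

    api_url: http://api.mydomain.local
    api_url_public: https://api.mydomain.com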

Also, here are the remaining values in journal.sls.  What should they be set to?

    api_key: a_authentication_key
    side_by_side_view_url: null
    default_host: dailybugle.org
    observer_url: http://prod--observer.elifesciences.org
    session_name: journal
    secret: random_string
    web_users: {}
    gtm_id: null
    disqus_domain: null
    status_checks:
        API dummy: ping

    mailer:
        # available only in some US regions
        host: email-smtp.us-east-1.amazonaws.com
        port: 587  # an *unthrottled* SES port. avoid port 25
        username: fake
        password: fake
        encryption: tls

Lee Roder

Jul 13, 2017, 5:16:16 PM
to elife-continuum-list
I selected what I thought were some reasonable values for the various settings for the journal.sls file.

I then attempted to run: ./bldr launch:journal,prod

That may not be exactly the correct syntax.  I was prompted with:

 2017-07-13 21:13:43,159 - WARNING - MainProcess - buildercore.decorators - TODO: `_pick` renamed from `_pick` to something. `choose` ?
please pick: (alternative config)
1 - skip this step
2 - fresh
    "uses a plain Ubuntu basebox instead of an ami"

3 - ci
4 - end2end
    "end2end environment for journal. ELB on top of multiple EC2 instances"

5 - continuumtest
6 - prod
    "prod environment for journal. ELB on top of multiple EC2 instances"

7 - cachetest
>

I selected 6.

During the CloudFormation Stack creation process, I received the following error from ./bldr

 (u'CREATE_FAILED',
  u"The specified SSL certificate doesn't exist, isn't in us-east-1 region, isn't valid, or doesn't include a valid certificate chain."),

and here was the corresponding error on AWS:


16:09:13 UTC-0500  CREATE_FAILED  AWS::CloudFront::Distribution  CloudFrontCDN  The specified SSL certificate doesn't exist, isn't in us-east-1 region, isn't valid, or doesn't include a valid certificate chain.

Luke Skibinski

Jul 13, 2017, 11:13:32 PM
to elife-continuum-list
Hi there, Lee. Thanks for your time and your patience working through this.

> I then attempted to run: ./bldr launch:journal,prod
> That may not be exactly the correct syntax.

That syntax is fine. In this case 'launch' is the command and the comma-separated values after the colon are arguments to it. If you leave them off, it will prompt you to pick a project ('journal' in this case) and then an identifier for the instance of the project you're about to create ('prod' in this case).
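In other words (a sketch of the two forms just described):

./bldr launch:journal,prod    # project and instance identifier given explicitly
./bldr launch                 # prompts for the project, then for an instance identifier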

What you were prompted with next was a set of possible alternative configurations to the default. We have these so that more (or less) powerful project instances can be created as necessary. All projects and their alternative configurations can be found in the 'project file' that lives at /path/to/builder/projects/elife.yaml: https://github.com/elifesciences/builder/blob/master/projects/elife.yaml#L222

The alternative configuration you picked was 'prod', which is what we use in our production setup: https://github.com/elifesciences/builder/blob/master/projects/elife.yaml#L285

It attempts to configure a CloudFront distribution, with some (very) eLife-specific values, that fronts multiple EC2 instances: https://github.com/elifesciences/builder/blob/master/projects/elife.yaml#L288

The 'project file' I linked to above works by taking the default values in the first section of that file (called 'defaults', handily enough) and merging the values in the project's section over the top. This means we can have sensible defaults the majority of the time and keep project descriptions brief. In this case, the 'elb' section in the alternative configuration 'prod' for the 'journal' project (still with me?) doesn't override this default value: https://github.com/elifesciences/builder/blob/master/projects/elife.yaml#L96
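Schematically, the merge works like this (a simplified, hypothetical structure -- not the real contents of elife.yaml):

defaults:
    elb:
        certificate: arn:aws:iam::123456789012:server-certificate/wildcard.example.org

journal:
    prod:
        elb:
            protocol: https
            # no 'certificate' key here, so the eLife default above is inherited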

eLife uses a certificate provided by an authority that isn't AWS, which AWS supports but doesn't really encourage: http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_server-certs.html

This is evidenced by the fact that there is no GUI within AWS to handle it. I'm quite certain we haven't tried deploying a CloudFront distribution with an AWS-provided certificate yet.

What I would suggest is to destroy the failed instance you just created and, when presented with that 'alternative config' prompt again, choose '1 - skip this step'. It will use the default configuration and not attempt to configure CloudFront or an ELB on top of multiple EC2 instances.
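That is, roughly (the exact task arguments may differ slightly -- check the builder docs):

./bldr destroy:journal--prod    # tear down the failed stack
./bldr launch:journal,prod      # relaunch, picking '1 - skip this step' at the prompt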

We'll take a look at this certificate configuration and try to improve it. It *is* a little arcane.

Luke Skibinski

Jul 13, 2017, 11:23:54 PM
to elife-continuum-list
Hi again, Lee.

These formulas and those values really could do with better exposition.

I hadn't considered it before, but the pillar files appear to be the interface to the formula for those not developing them. As such they are woefully unexplained. I'll see to it that we up our game on this.

Giorgio Sironi

Jul 14, 2017, 3:41:02 AM
to elife-continuum-list

I think that you (Lee) have probably already started using your own .yaml configuration file for builder, as otherwise you wouldn't have been able to create instances, given the need to configure a VPC, subnets and so on.
However, it may be easier to duplicate your .yaml configuration file from https://github.com/elifesciences/builder/blob/master/projects/continuum.yaml rather than https://github.com/elifesciences/builder/blob/master/projects/elife.yaml, as continuum.yaml is where we are putting generic configurations that strip the eLife-specific settings out of the projects.
 




--
Giorgio Sironi
@giorgiosironi

Giorgio Sironi

Jul 14, 2017, 4:05:25 AM
to Lee Roder, elife-continuum-list
We have now improved the pillar example, adding comments about each of these values:
https://github.com/elifesciences/builder-private-example/blob/master/pillar/journal.sls
Our goal is to fully populate:
- https://github.com/elifesciences/builder-private-example/blob/master/salt/top.sls (one entry per project, mapping a similar set of states to the project formula's example.top)
- https://github.com/elifesciences/builder-private-example/tree/master/pillar (one file per project, plus the base elife.sls)
Feel free to open pull requests to this repository if you set up projects that are currently missing from the example.




--
Giorgio Sironi
@giorgiosironi

Lee Roder

Jul 14, 2017, 8:09:03 AM
to elife-continuum-list
Thanks Luke and Giorgio for your most helpful responses and feedback!  I have skimmed your responses and will digest them further upon posting this reply.  In the meantime, I wished to ask a question that I believe I will have even after perusing all of the included links.  Namely, it appears as if you are suggesting that:
1. I use continuum.yaml as my baseline, updating with my AWS-specific values where appropriate
2. I run ./bldr launch:journal,prod and then select option 1 - skip this step
3. Push my SSL certs to... where exactly?  By the way, I am totally familiar with AWS's woeful lack of support for external certs.  I had to push some to a client's AWS instance some time back and figured out that, as you indicated, the only way to pull it off is from the AWS CLI.  Which leads me to a couple of observations/questions/suggestions.  But first, let me say that my next actions will be to perform the steps above and then get back with y'all.

Now, on to my observations/questions/suggestions.  First off, have you considered being a bit more cloud neutral?  That might enable folks to more easily adapt this process to their cloud platform of choice.  For example, create a load balancer VPS directly vs relying on Amazon ELBs.  Stay away from services like RDS and Route 53.

We actually went through this evolution a couple of years back for a very large client.  Originally, our deployment formula was very AWS-specific.  We have gradually moved away from that, such that now, through a web interface that we have built for this client, it is possible to select the Cloud Provisioning Platform, Region, VLAN (our generic term for what AWS refers to as a VPC), and then step through the configs for each of the instance types - in their case: Load Balancers (we have moved to installing/configuring HAProxy), Web Servers (we are using NGINX presently - originally it was Apache), API Servers (for a proprietary CZMQ API that we helped this client develop), and Database Servers (a Galera multi-master cluster).  A user is stepped through each of these and at each point can select system size (mem, cpu cores, disk capacity), OS version (presently Ubuntu 16.04 and CentOS 7), HAProxy version (1.6 or 1.7), NGINX version... you get the idea.  At the final step, they are presented with a summary screen and a Deploy button at the bottom.  When Deploy is hit, our python code gets called with a large payload and starts churning away, spinning up all required instances.  We happen to update a database table with progress info, and the UI code just queries that table every so often (30-second intervals, I believe) and updates the status for each machine as it is spun up, reaches highstate, completes post-configuration, etc.  I'd be happy to walk you through a quick demo of that system - it might give you some ideas.

In your specific case, there is a slight difference in that our 'Salt Master' lives outside of a deployment, which means that we can store and manage a great deal more data -- with the tradeoff being that this becomes our system to manage.  Then, for each deployment, there is a syndic server.  So, that probably does not make sense for you, but it COULD be possible for the first 'bootstrap' step - spinning up a Salt Master in the customer's deployment - to include a web-based UI that they can then hit, which becomes the front end for configuring your yaml files, sls pillar/state data, etc.  I think that would ease the process immensely for users -- and dramatically improve both the adoption rate and the ability to support it!

Lee Roder

Jul 14, 2017, 9:02:52 AM
to elife-continuum-list
Giorgio,

Will this image work as the default?

ami-2a0c313c

This seems to make more sense than ami-d90d92ce, as the AMI I mention above was released by Canonical on 6/29/17 while the one you mention dates back to 8/9/16.  That means it's a ton of updates/security patches behind.  Also, while on the topic of which version of Ubuntu, I would personally rest MUCH easier if you could roll forward to 16.04 (the most current LTS release).  18.04 will be out in just a few months and 14.04 is just over a year from end of life.  From a configuration management/support point of view, it seems a bad idea to essentially promote the use of an OS that will soon no longer be supported (meaning no more updates/bug fixes/security patches).

Now, I did notice (it came to light during the call yesterday) that the current stock Ubuntu 16.04 hvm image only had python 3 installed.  A good indication of things to come!  We must all move our python code to python 3!  However, it is quite easy to get 2.7.12 installed.  In fact, since you already need to install salt-minion on each node, might I suggest doing what we do for another client?  Namely, add the following apt repo and then, when you install salt-minion, you will also get python 2.7.12 installed.

wget -O - https://repo.saltstack.com/apt/ubuntu/16.04/amd64/latest/SALTSTACK-GPG-KEY.pub | sudo apt-key add -
echo 'deb http://repo.saltstack.com/apt/ubuntu/16.04/amd64/latest xenial main' | sudo tee /etc/apt/sources.list.d/saltstack.list

While we are on the topic of package repos, here is one that will get you NGINX 1.13:

wget http://nginx.org/keys/nginx_signing.key
sudo apt-key add nginx_signing.key
echo 'deb http://nginx.org/packages/mainline/ubuntu/ xenial nginx' | sudo tee -a /etc/apt/sources.list
echo 'deb-src http://nginx.org/packages/mainline/ubuntu/ xenial nginx' | sudo tee -a /etc/apt/sources.list

Do be forewarned, however, that the nginx.org repo above does NOT play nicely with the Ubuntu NGINX packages, so do NOT install nginx until after adding it and running apt update (otherwise you will end up needing to remove some other meta-packages first - nginx-core and something else).  Also, one last thing, learned from painful experience: if you are also using php-fpm, the default package for Ubuntu 16.04 expects the user to be www-data (and the socket perms are set accordingly), however the /etc/nginx/nginx.conf shipped by nginx.org has the user set to nginx.  The quickest/easiest fix is to simply change that user to www-data (this user gets created when installing the various php packages).  Of course, if you are not using/needing php, you may safely ignore this bit of advice!
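That is, the one-line change in question:

# /etc/nginx/nginx.conf - make the nginx worker user match php-fpm's socket permissions
user www-data;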

Lee Roder

Jul 14, 2017, 1:10:26 PM
to elife-continuum-list
OK.  I have an update.  I created a wildcard SSL cert, pushed it up to my AWS account (aws iam upload-server-certificate --server-certificate-name elifetest --certificate-body file://Certificate.pem --certificate-chain file://CertificateChain.pem --private-key file://PrivateKey.pem), and received a reply with my arn:aws:iam::.........  I edited builder/settings.yml to reference continuum.yaml instead of elife.yaml.  I then updated projects/continuum.yaml to reflect all of the relevant/necessary values to match my specific environment, then ran: ./bldr launch:journal,prod

At this point, I no longer had six choices, but only 2.  Either: 1. Skip this step or 2. Baseline.

I was feeling lucky.  I went with 2 -- and things ALMOST worked!  It's tough to know exactly what failed since so much console output flew by that I don't know if I can capture what the issue was.  I will include the last couple hundred lines at the end of this post.  I did verify that a CloudFormation stack was successfully created along with two EC2 instances: journal--prod--1 and journal--prod--2, each in different subnets.  Nice.  An ELB also got created.  However, neither web server instance seemed to be listening on ports 80 or 443.

So, I ssh'd into the first one, verified that nginx was indeed up and running, and then started looking around in /etc/nginx.  The first odd thing I noticed is that /etc/nginx/sites-enabled was completely empty.  No symlinks.  I then looked in /etc/nginx/sites-available and found a default and an unencrypted-redirect.conf.  Hmmm... that second file certainly is of your making, but it essentially looks like it just redirects port 80 traffic to port 443.  Still no trace of anything that would respond to incoming requests and point them to whatever your web root is.  I did some looking around and found that the web code seems to all be installed - in /srv/journal.  The handy-dandy README.md includes a link to:

 https://symfony.com/doc/current/setup/web_server_configuration.html

which describes, among other things, a proper nginx config file.  So.... something like that must exist somewhere.... but I cannot seem to find it!

Meanwhile, for what it's worth, here is a list of every default value that I changed in builder/projects/continuum.yaml:
defaults:
    # domain: thedailybugle.org
    #intdomain: thedailybugle.internal
    #private-repo: ssh://g...@github.com/elife-anonymous-user/builder-private
        #account_id: 531849302986
        #vpc-id: vpc-c23159a5
        #subnet-id: subnet-6b6c6e41
        #subnet-cidr: '172.31.48.0/20'
        #redundant-subnet-id: subnet-dffe0c96
        #redundant-subnet-cidr: '172.31.0.0/20'
                #- subnet-6b6c6e41
                #- subnet-dffe0c96
            #certificate: arn:aws:iam::531849302986:server-certificate/wildcard.thedailybugle.org

master-server:
                #cidr-ip: 172.31.0.0/16
                #cidr-ip: 172.31.0.0/16

console log output:

[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: journal-nginx-redirect-existing-paths
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/nginx/traits.d/redirect-existing-paths.conf
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: elife.nginx.nginx-config
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: journal-nginx-robots
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/nginx/traits.d/robots.conf
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: elife.nginx.nginx-config
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: journal-nginx-vhost
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/nginx/sites-available/journal.conf
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: journal.journal-nginx-redirect-existing-paths, journal.journal-nginx-robots, elife.nginx.nginx-config
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: running-gulp
[54.145.205.41] out:     Function: cmd.script
[54.145.205.41] out:         Name: retrying-gulp
[54.145.205.41] out:       Result: True
[54.145.205.41] out:      Comment: Command 'retrying-gulp' run
[54.145.205.41] out:      Started: 16:03:51.970713
[54.145.205.41] out:     Duration: 73570.872 ms
[54.145.205.41] out:      Changes:  
[54.145.205.41] out:               ----------
[54.145.205.41] out:               pid:
[54.145.205.41] out:                   26788
[54.145.205.41] out:               retcode:
[54.145.205.41] out:                   0
[54.145.205.41] out:               stderr:
[54.145.205.41] out:                   (node:26797) DeprecationWarning: quality: use jpeg({ quality: ... }), webp({ quality: ... }) and/or tiff({ quality: ... }) instead
[54.145.205.41] out:                   (node:26797) DeprecationWarning: progressive: use jpeg({ progressive: ... }) and/or png({ progressive: ... }) instead
[54.145.205.41] out:                   (node:26797) DeprecationWarning: withoutChromaSubsampling: use jpeg({ chromaSubsampling: "4:4:4" }) instead
[54.145.205.41] out:                   (node:26797) DeprecationWarning: compressionLevel: use png({ compressionLevel: ... }) instead
[54.145.205.41] out:               stdout:
[54.145.205.41] out:                   Stopping redis-server: redis-server.
[54.145.205.41] out:                   Attempt: 1
[54.145.205.41] out:                   [16:03:54] Using gulpfile /srv/journal/gulpfile.js
[54.145.205.41] out:                   [16:03:54] Starting 'assets:clean'...
[54.145.205.41] out:                   [16:03:54] Starting 'favicons:clean'...
[54.145.205.41] out:                   [16:03:54] Starting 'images:clean'...
[54.145.205.41] out:                   [16:03:54] Starting 'patterns'...
[54.145.205.41] out:                   [16:03:54] Finished 'assets:clean' after 18 ms
[54.145.205.41] out:                   [16:03:54] Finished 'favicons:clean' after 14 ms
[54.145.205.41] out:                   [16:03:54] Starting 'favicons:build'...
[54.145.205.41] out:                   [16:03:54] Finished 'images:clean' after 24 ms
[54.145.205.41] out:                   [16:03:54] Starting 'images:banners'...
[54.145.205.41] out:                   [16:03:54] Starting 'images:logos'...
[54.145.205.41] out:                   [16:03:54] Starting 'images:svgs'...
[54.145.205.41] out:                   [16:03:56] gulp-responsive: community.jpg -> community-450x264.jpg
[54.145.205.41] out:                   [16:03:56] gulp-responsive: community.jpg -> community-767x264.jpg
[54.145.205.41] out:                   [16:03:56] gulp-responsive: community.jpg -> community-1023x288.jpg
[54.145.205.41] out:                   [16:03:57] gulp-responsive: community.jpg -> community-900x528.jpg
[54.145.205.41] out:                   [16:03:58] gulp-responsive: community.jpg -> community-1114x336.jpg
[54.145.205.41] out:                   [16:03:58] gulp-responsive: alpsp.png -> alpsp-250.png
[54.145.205.41] out:                   [16:03:58] gulp-responsive: community.jpg -> community-1534x528.jpg
[54.145.205.41] out:                   [16:03:58] gulp-responsive: community.jpg -> community-2046x576.jpg
[54.145.205.41] out:                   [16:03:58] gulp-responsive: alpsp.png -> alpsp-500.png
[54.145.205.41] out:                   [16:03:58] gulp-responsive: clockss.png -> clockss-250.png
[54.145.205.41] out:                   [16:03:58] gulp-responsive: clockss.png -> clockss-500.png
[54.145.205.41] out:                   [16:03:58] gulp-responsive: cope.png -> cope-250.png
[54.145.205.41] out:                   [16:03:59] gulp-responsive: cope.png -> cope-500.png
[54.145.205.41] out:                   [16:03:59] gulp-responsive: community.jpg -> community-2228x672.jpg
[54.145.205.41] out:                   [16:03:59] gulp-responsive: europe-pmc.png -> europe-pmc-250.png
[54.145.205.41] out:                   [16:03:59] gulp-responsive: europe-pmc.png -> europe-pmc-500.png
[54.145.205.41] out:                   [16:04:00] gulp-responsive: labs.jpg -> labs-767x264.jpg
[54.145.205.41] out:                   [16:04:00] gulp-responsive: labs.jpg -> labs-450x264.jpg
[54.145.205.41] out:                   [16:04:00] gulp-responsive: labs.jpg -> labs-1023x288.jpg
[54.145.205.41] out:                   [16:04:00] gulp-responsive: labs.jpg -> labs-900x528.jpg
[54.145.205.41] out:                   [16:04:02] gulp-responsive: labs.jpg -> labs-1534x528.jpg
[54.145.205.41] out:                   [16:04:02] gulp-responsive: labs.jpg -> labs-1114x336.jpg
[54.145.205.41] out:                   [16:04:02] gulp-responsive: exeter.png -> exeter-250.png
[54.145.205.41] out:                   [16:04:02] gulp-responsive: exeter.png -> exeter-500.png
[54.145.205.41] out:                   [16:04:02] gulp-responsive: force11.png -> force11-250.png
[54.145.205.41] out:                   [16:04:04] gulp-responsive: force11.png -> force11-500.png
[54.145.205.41] out:                   [16:04:04] gulp-responsive: labs.jpg -> labs-2046x576.jpg
[54.145.205.41] out:                   [16:04:04] gulp-responsive: labs.jpg -> labs-2228x672.jpg
[54.145.205.41] out:                   [16:04:04] gulp-responsive: gooa.png -> gooa-250.png
[54.145.205.41] out:                   [16:04:04] gulp-responsive: gooa.png -> gooa-500.png
[54.145.205.41] out:                   [16:04:05] gulp-responsive: magazine.jpg -> magazine-767x264.jpg
[54.145.205.41] out:                   [16:04:05] gulp-responsive: magazine.jpg -> magazine-450x264.jpg
[54.145.205.41] out:                   [16:04:05] gulp-responsive: magazine.jpg -> magazine-1023x288.jpg
[54.145.205.41] out:                   [16:04:05] gulp-responsive: magazine.jpg -> magazine-900x528.jpg
[54.145.205.41] out:                   [16:04:06] gulp-responsive: magazine.jpg -> magazine-1114x336.jpg
[54.145.205.41] out:                   [16:04:06] gulp-responsive: magazine.jpg -> magazine-1534x528.jpg
[54.145.205.41] out:                   [16:04:06] gulp-responsive: jats4r.png -> jats4r-250.png
[54.145.205.41] out:                   [16:04:06] gulp-responsive: jats4r.png -> jats4r-500.png
[54.145.205.41] out:                   [16:04:06] gulp-responsive: magazine.jpg -> magazine-2046x576.jpg
[54.145.205.41] out:                   [16:04:09] Finished 'images:svgs' after 15 s
[54.145.205.41] out:                   [16:04:09] gulp-responsive: magazine.jpg -> magazine-2228x672.jpg
[54.145.205.41] out:                   [16:04:09] gulp-responsive: Created 24 images (matched 3 of 3 images)
[54.145.205.41] out:                   [16:04:09] Finished 'images:banners' after 15 s
[54.145.205.41] out:                   [16:04:10] gulp-responsive: naked-scientists.png -> naked-scientists-250.png
[54.145.205.41] out:                   [16:04:10] gulp-responsive: naked-scientists.png -> naked-scientists-500.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: pmc.png -> pmc-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: pmc.png -> pmc-500.png
[54.145.205.41] out:                   [16:04:11] Finished 'patterns' after 16 s
[54.145.205.41] out:                   [16:04:11] gulp-responsive: aws.svg -> aws-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: aws.svg -> aws-500.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: crossref.svg -> crossref-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: crossref.svg -> crossref-500.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: digirati.svg -> digirati-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: digirati.svg -> digirati-500.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: ed-office.svg -> ed-office-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: ed-office.svg -> ed-office-500.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: ejp.svg -> ejp-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: ejp.svg -> ejp-500.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: github.svg -> github-250.png
[54.145.205.41] out:                   [16:04:11] gulp-responsive: github.svg -> github-500.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: glencoe.svg -> glencoe-250.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: glencoe.svg -> glencoe-500.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: jira.svg -> jira-250.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: jira.svg -> jira-500.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: jisc.svg -> jisc-250.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: jisc.svg -> jisc-500.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: lockss.svg -> lockss-250.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: lockss.svg -> lockss-500.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: loggly.svg -> loggly-500.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: loggly.svg -> loggly-250.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: mendeley.svg -> mendeley-250.png
[54.145.205.41] out:                   [16:04:12] gulp-responsive: mendeley.svg -> mendeley-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: new-relic.svg -> new-relic-250.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: new-relic.svg -> new-relic-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: oaspa.svg -> oaspa-250.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: oaspa.svg -> oaspa-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: orcid.svg -> orcid-250.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: orcid.svg -> orcid-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: paperity.svg -> paperity-250.png
[54.145.205.41] out:                   [16:04:13] gulp-imagemin: Minified 25 images (saved 36.5 kB - 24.9%)
[54.145.205.41] out:                   [16:04:13] Finished 'favicons:build' after 18 s
[54.145.205.41] out:                   [16:04:13] Starting 'favicons'...
[54.145.205.41] out:                   [16:04:13] gulp-responsive: paperity.svg -> paperity-500.png
[54.145.205.41] out:                   [16:04:13] Finished 'favicons' after 8.73 ms
[54.145.205.41] out:                   [16:04:13] gulp-responsive: publons.svg -> publons-250.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: publons.svg -> publons-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: share.svg -> share-250.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: share.svg -> share-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: slack.svg -> slack-250.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: slack.svg -> slack-500.png
[54.145.205.41] out:                   [16:04:13] gulp-responsive: Created 58 images (matched 29 of 29 images)
[54.145.205.41] out:                   [16:04:13] Finished 'images:logos' after 19 s
[54.145.205.41] out:                   [16:04:13] Starting 'images'...
[54.145.205.41] out:                   [16:04:58] gulp-imagemin: Minified 101 images (saved 1.03 MB - 22.7%)
[54.145.205.41] out:                   [16:04:58] Finished 'images' after 45 s
[54.145.205.41] out:                   [16:04:58] Starting 'assets'...
[54.145.205.41] out:                   [16:05:05] Finished 'assets' after 6.98 s
[54.145.205.41] out:                   [16:05:05] Starting 'default'...
[54.145.205.41] out:                   [16:05:05] Finished 'default' after 5.22 μs
[54.145.205.41] out:                   Script succeeded
[54.145.205.41] out:                   Starting redis-server: redis-server.
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: maintenance-mode-end
[54.145.205.41] out:     Function: cmd.run
[54.145.205.41] out:         Name: ln -s /etc/nginx/sites-available/journal.conf /etc/nginx/sites-enabled/journal.conf
[54.145.205.41] out: /etc/init.d/nginx reload
[54.145.205.41] out:
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: journal.journal-nginx-vhost
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: maintenance-mode-check-nginx-stays-up
[54.145.205.41] out:     Function: cmd.run
[54.145.205.41] out:         Name: sleep 2 && /etc/init.d/nginx status
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: journal.maintenance-mode-end
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: syslog-ng-for-journal-logs
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/syslog-ng/conf.d/journal.conf
[54.145.205.41] out:       Result: True
[54.145.205.41] out:      Comment: File /etc/syslog-ng/conf.d/journal.conf updated
[54.145.205.41] out:      Started: 16:05:05.545542
[54.145.205.41] out:     Duration: 15.197 ms
[54.145.205.41] out:      Changes:  
[54.145.205.41] out:               ----------
[54.145.205.41] out:               diff:
[54.145.205.41] out:                   New file
[54.145.205.41] out:               mode:
[54.145.205.41] out:                   0644
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: logrotate-for-journal-logs
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/logrotate.d/journal
[54.145.205.41] out:       Result: True
[54.145.205.41] out:      Comment: File /etc/logrotate.d/journal updated
[54.145.205.41] out:      Started: 16:05:05.561008
[54.145.205.41] out:     Duration: 731.597 ms
[54.145.205.41] out:      Changes:  
[54.145.205.41] out:               ----------
[54.145.205.41] out:               diff:
[54.145.205.41] out:                   New file
[54.145.205.41] out:               mode:
[54.145.205.41] out:                   0644
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: listener_syslog-ng
[54.145.205.41] out:     Function: service.mod_watch
[54.145.205.41] out:         Name: syslog-ng
[54.145.205.41] out:       Result: True
[54.145.205.41] out:      Comment: Service reloaded
[54.145.205.41] out:      Started: 16:05:06.692775
[54.145.205.41] out:     Duration: 288.927 ms
[54.145.205.41] out:      Changes:  
[54.145.205.41] out:               ----------
[54.145.205.41] out:               syslog-ng:
[54.145.205.41] out:                   True
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: listener_nginx-server-service
[54.145.205.41] out:     Function: service.mod_watch
[54.145.205.41] out:         Name: nginx
[54.145.205.41] out:       Result: True
[54.145.205.41] out:      Comment: Service restarted
[54.145.205.41] out:      Started: 16:05:06.982046
[54.145.205.41] out:     Duration: 1070.506 ms
[54.145.205.41] out:      Changes:  
[54.145.205.41] out:               ----------
[54.145.205.41] out:               nginx:
[54.145.205.41] out:                   True
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: listener_redis-server
[54.145.205.41] out:     Function: service.mod_watch
[54.145.205.41] out:         Name: redis-server
[54.145.205.41] out:       Result: True
[54.145.205.41] out:      Comment: Service restarted
[54.145.205.41] out:      Started: 16:05:08.052890
[54.145.205.41] out:     Duration: 1107.546 ms
[54.145.205.41] out:      Changes:  
[54.145.205.41] out:               ----------
[54.145.205.41] out:               redis-server:
[54.145.205.41] out:                   True
[54.145.205.41] out:
[54.145.205.41] out: Summary for local
[54.145.205.41] out: --------------
[54.145.205.41] out: Succeeded: 122 (changed=102)
[54.145.205.41] out: Failed:      6
[54.145.205.41] out: --------------
[54.145.205.41] out: Total states run:     128
[54.145.205.41] out: Total run time:   563.630 s
[54.145.205.41] out: Error provisioning, state.highstate returned: 2
[54.145.205.41] out:

2017-07-14 16:05:09,864 - ERROR - 54.145.205.41 - buildercore.core - sudo() received nonzero return code 2 while executing!        Requested: /bin/bash /tmp/highstate.sh-20170714155540    Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "/bin/bash /tmp/highstate.sh-20170714155540"

Traceback (most recent call last):
  File "/root/builder/venv/local/lib/python2.7/site-packages/fabric/main.py", line 756, in main
    *args, **kwargs
  File "/root/builder/venv/local/lib/python2.7/site-packages/fabric/tasks.py", line 426, in execute
    results['<local-only>'] = task.run(*args, **new_kwargs)
  File "/root/builder/venv/local/lib/python2.7/site-packages/fabric/tasks.py", line 173, in run
    return self.wrapped(*args, **kwargs)
  File "/root/builder/src/decorators.py", line 58, in wrap2
    return func(pname, *args, **kwargs)
  File "/root/builder/src/cfn.py", line 143, in launch
    bootstrap.create_update(stackname)
  File "/root/builder/src/buildercore/decorators.py", line 19, in _wrapper
    return func(stackname, *args, **kwargs)
  File "/root/builder/src/buildercore/bootstrap.py", line 408, in create_update
    update_stack(stackname, part_filter)
  File "/root/builder/src/buildercore/decorators.py", line 19, in _wrapper
    return func(stackname, *args, **kwargs)
  File "/root/builder/src/buildercore/bootstrap.py", line 400, in update_stack
    [fn(stackname) for fn in subdict(service_update_fns, service_list).values()]
  File "/root/builder/src/buildercore/bootstrap.py", line 391, in <lambda>
    ('ec2', lambda stackname: update_ec2_stack(stackname, concurrency)),
  File "/root/builder/src/buildercore/bootstrap.py", line 459, in update_ec2_stack
    stack_all_ec2_nodes(stackname, _update_ec2_node, username=BOOTSTRAP_USER, concurrency=concurrency)
  File "/root/builder/src/buildercore/core.py", line 275, in stack_all_ec2_nodes
    return parallel_work(single_node_work, params)
  File "/root/builder/src/buildercore/core.py", line 286, in parallel_work
    return execute(parallel(single_node_work), hosts=params['public_ips'].values())
  File "/root/builder/venv/local/lib/python2.7/site-packages/fabric/tasks.py", line 420, in execute
    error(err)
  File "/root/builder/venv/local/lib/python2.7/site-packages/fabric/utils.py", line 358, in error
    return func(message)
  File "/root/builder/venv/local/lib/python2.7/site-packages/fabric/utils.py", line 54, in abort
    raise env.abort_exception(msg)
buildercore.config.FabricException: One or more hosts failed while executing task 'single_node_work'

Luke Skibinski

Jul 16, 2017, 9:24:48 PM
to elife-continuum-list
Hi Lee,

Regarding the broken state of the server: when you run Salt's "highstate" command - either by bringing up a new system, doing a ./bldr update on an existing system, or logging into the system and running "sudo salt-call state.highstate" - you'll be presented with a lot of output, with a coloured report at the end. The sections in red are states that failed. There is a summary at the end of the report of the number of states that failed - in this case, 6. A state will fail if a state it depends on fails, so while it may seem like there is an overwhelming number of failures, often it's just one or two states that actually failed and all of their descendant states were simply not run.

I can help you debug if I can see the whole report. You'll probably get the same errors again if you run an 'update' on the project instance that just failed. If your terminal doesn't scroll back that far and you don't have the ability to turn on a longer history, Salt stores its logs at /var/log/salt/ and they are timestamped.
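Concretely, on the affected instance:

sudo salt-call state.highstate    # re-run the highstate and watch the coloured report
ls -lt /var/log/salt/             # previous runs are logged here, timestamped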

Regarding your suggestions - and I speak only as the guy who originally built this thing and have no authority going forwards (I freelance on the other side of the world now) - a decision was made to go all-in on AWS and ignore other cloud offerings. eLife was already using AWS extensively when I arrived and there just wasn't the capacity or the necessity or knowledge or demand to do any of those things. Builder was private and internal for a very long time before I made it public, and I come from a background of managing multiple load balancers with HAProxy and all the skullduggery that is database replication and general server maintenance, but if the price of going cloud-neutral is giving up the time and convenience and peace of mind bought by vendor reliance, I would recommend against it. That said, there are other technologies out there now besides CloudFormation for automating the deployment of cloud services, and they could support different incarnations of the same idea depending on the preferred provider. But that isn't where eLife is right now.

Regarding friendliness of use and web interfaces, the audience for builder has always been devops. Prior to our current testing environment there was greater inconsistency in how code was tested, more ad-hoc instances being used, and more devs getting frustrated when things broke, but thanks to Giorgio and the shift of testing into CI, there is almost zero need to bring up ad-hoc instances these days. Unfortunately, continuum doesn't include the CI. I don't know who eLife thinks the audience for continuum should be going forwards.

In lieu of a web interface I would suggest persevering with the output. It can be overwhelming at first, but it's actually pleasantly plain and uncomplicated. You can see the commands being executed from start to finish, and a report at the end that is green on success.

Giorgio Sironi

Jul 17, 2017, 4:00:13 AM
to Lee Roder, elife-continuum-list
On Fri, Jul 14, 2017 at 1:09 PM, Lee Roder <l...@impremistech.com> wrote:
Thanks Luke and Giorgio for your most helpful responses and feedback!  I have skimmed your responses and will digest them further upon posting this reply.  In the meantime, I wished to ask a question that I believe I will have even after perusing all of the included links.  Namely, it appears as if you are suggesting that:
1. I use continuum.yaml as my baseline, updating with my AWS specific values where appropriate

Yes, correct.
2. I run ./bldr launch:journal,prod and then select option 1 - skip this step

Yes, but right now I have just tested the `baseline` configuration environment which should be easier to use:

./bldr deploy:journal,baseline

 
3. Push my SSL certs to... where exactly?  By the way, I am totally familiar with AWS's woeful lack of support for external certs.  I had to push some to a client's AWS instance some time back and figured out that, as you indicated, the only way to pull it off is from the AWS CLI.  Which leads me to a couple of observations/questions/suggestions.  But first, let me say that my next actions will be to perform the steps above and then get back with y'all.

You are right that you can only do this in the CLI.
aws iam upload-server-certificate \
    --path '/cloudfront/wildcard.elifesciences.org/' \
    --server-certificate-name 'wildcard.elifesciences.org' \
    --certificate-body file://your-certificate.crt \
    --private-key file://your-private-key.pem \
    --certificate-chain file://full-chain.pem
 

Now, on to my observations/questions/suggestions.  First off, have you considered being a bit more cloud neutral?  That might enable folks to more easily adapt this process to their cloud platform of choice.  For example, create a load balancer VPS directly vs relying on Amazon ELBs.  Stay away from services like RDS and Route 53.  [...] I'd be happy to walk you through a quick demo of that system - it might give you some ideas.

We want to move towards not depending on AWS, by abstracting it away or substituting technologies like S3 and SQS with generic solutions, but unfortunately at this time we have lots of dependencies that need to be addressed in the projects' code first in order to go in that direction. That's why we provide a "deploy on AWS" guide rather than a generic one at the moment. These open source projects were first created to serve the needs of elifesciences.org and were open sourced as a follow-up.



--
Giorgio Sironi
@giorgiosironi

Giorgio Sironi

Jul 17, 2017, 4:08:19 AM
to Lee Roder, elife-continuum-list
On Fri, Jul 14, 2017 at 6:10 PM, Lee Roder <l...@impremistech.com> wrote:
OK.  I have an update.  I created a wildcard SSL cert, pushed it up to my AWS account (aws iam upload-server-certificate --server-certificate-name elifetest --certificate-body file://Certificate.pem --certificate-chain file://CertificateChain.pem --private-key file://PrivateKey.pem), and received a reply with my arn:aws:iam::.........  I edited builder/settings.yml to reference continuum.yaml instead of elife.yaml.  I then updated projects/continuum.yaml to reflect all of the relevant/necessary values to match my specific environment, then ran: ./bldr launch:journal,prod

At this point, I no longer had six choices, but only 2.  Either: 1. Skip this step or 2. Baseline.

I was feeling lucky.  I went with 2 -- and things ALMOST worked!  It's tough to know exactly what failed since so much console output flew by that I don't know if I can capture what the issue was.  I will include the last couple hundred lines at the end of this post.  I did verify that a CloudFormation stack was successfully created along with two EC2 instances: journal--prod--1 and journal--prod--2, each in different subnets.  Nice.  An ELB also got created.  However, neither web server instance seemed to be listening on ports 80 or 443.  So, I ssh'd into the first one, verified that nginx was indeed up and running, and then started looking around in /etc/nginx.  The first odd thing I noticed is that /etc/nginx/sites-enabled was completely empty.  No symlinks.  I then looked in /etc/nginx/sites-available and found a default and an unencrypted-redirect.conf.  Hmmm... that second file certainly is of your making, but it essentially looks like it just redirects port 80 traffic to port 443.  Still no trace of anything that would respond to incoming requests and point them to whatever your web root is.
 
  I did some looking around and found that the web code seems to all be installed - in /srv/journal.  The handy-dandy README.md includes a link to:

 https://symfony.com/doc/current/setup/web_server_configuration.html

which describes, among other things, a proper nginx config file.  So.... something like that must exist somewhere.... but I cannot seem to find it!

The deployment process only creates the nginx site if the previous steps related to deploying the code have been successful; that's likely why you didn't find anything in sites-enabled.

Meanwhile, for what its worth - here is a list of every default value that I changed in builder/projects/continuum.yaml:
defaults:
    # domain: thedailybugle.org
    #intdomain: thedailybugle.internal

The lack of domain and intdomain should skip the creation of DNS entries - meaning it will be difficult but not impossible to reach the instance from outside (probably via the default ELB DNS). I gather you substituted these with your own?
 

        #account_id: 531849302986
        #vpc-id: vpc-c23159a5
        #subnet-id: subnet-6b6c6e41
        #subnet-cidr: '172.31.48.0/20'
        #redundant-subnet-id: subnet-dffe0c96
        #redundant-subnet-cidr: '172.31.0.0/20'
                #- subnet-6b6c6e41
                #- subnet-dffe0c96
            #certificate: arn:aws:iam::531849302986:server-certificate/wildcard.thedailybugle.org

All of these should be modified; you are on the right track.


master-server:
                #cidr-ip: 172.31.0.0/16
                #cidr-ip: 172.31.0.0/16

These should have been adapted to the IP range of your VPC, essentially limiting access to the master server to internal traffic.
 

console log output:

[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: journal-nginx-redirect-existing-paths
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/nginx/traits.d/redirect-existing-paths.conf
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: elife.nginx.nginx-config
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: journal-nginx-robots
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/nginx/traits.d/robots.conf
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: elife.nginx.nginx-config
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------
[54.145.205.41] out:           ID: journal-nginx-vhost
[54.145.205.41] out:     Function: file.managed
[54.145.205.41] out:         Name: /etc/nginx/sites-available/journal.conf
[54.145.205.41] out:       Result: False
[54.145.205.41] out:      Comment: One or more requisite failed: journal.journal-nginx-redirect-existing-paths, journal.journal-nginx-robots, elife.nginx.nginx-config
[54.145.205.41] out:      Changes:  
[54.145.205.41] out: ----------

These errors point to a problem with the nginx configuration; in particular, /etc/nginx/sites-available/journal.conf is the file you were looking for, and its deployment has been skipped because of previous errors. As you already know your way around Salt, we need to find the first state that failed (it should be earlier in the output), because these ones are only downstream dependencies.

The interesting bit stops here.

The Python traceback at the end of the message is only a client-side exception triggered by the Salt highstate failure.
(I see there are additional emails in this thread, but I'll reply to this anyway for anyone else troubleshooting this problem.)

--
Giorgio Sironi
@giorgiosironi

Lee Roder

Jul 17, 2017, 12:54:49 PM
to elife-continuum-list
Quick clarification.  I had modified each of the values in continuum.yaml to match our specific configuration, including the web domain.  I did go ahead and destroy my build from Friday and run it again, using the default Ubuntu AMI that was supplied in the original continuum.yaml, and it appears it failed in the same way.  The problem I am having in getting you the full console log is that there is so much output being generated that the error must show up prior to the top of what I can scroll back to.

I went and re-ran ./bldr.  I am attaching 3 files for your review:

my specific continuum.yaml
app.log for this most recent run
my console output for as far back as I am able to retrieve it (you will see lines that look like this: +++++++++++++++++ every so often; those are points where I was trying to capture/copy/paste log info in real time.  You will see that it is thousands of lines long.  I could perhaps redirect stdout to a file if you like, I'm just not exactly sure where to do that).

Once again, at the very tail end of the log output, I see 122 succeeded (102 changed) and 6 failed -- two EC2 instances are created, but neither completed the configuration process...

[34.204.99.43] out: Summary for local
[34.204.99.43] out: --------------
[34.204.99.43] out: Succeeded: 122 (changed=102)
[34.204.99.43] out: Failed:      6
[34.204.99.43] out: --------------
[34.204.99.43] out: Total states run:     128
[34.204.99.43] out: Total run time:   341.013 s
[34.204.99.43] out: Error provisioning, state.highstate returned: 2
[34.204.99.43] out:

continuum.yaml
conosole-log.txt
app.log

Giorgio Sironi

unread,
Jul 18, 2017, 4:06:03 AM7/18/17
to Lee Roder, elife-continuum-list
On Mon, Jul 17, 2017 at 5:54 PM, Lee Roder <l...@impremistech.com> wrote:
Quick clarification.  I had modified each of the values in continuum.yaml to match our specific configuration, including the web domain.

Understood and correct.
 
This output is key to finding out the problem, you can capture all of it by executing the command in this way:

./bldr any_command | tee console-log.txt

tee truncates console-log.txt, then both prints the lines it is fed to the console and writes them to the file.
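
If any of the interesting output ends up on stderr rather than stdout, redirect that into the same pipe as well:

./bldr deploy:journal,baseline 2>&1 | tee console-log.txt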

--
Giorgio Sironi
@giorgiosironi

Lee Roder

unread,
Jul 19, 2017, 12:08:22 AM7/19/17
to elife-continuum-list
Giorgio,

I have uploaded the complete log of a run of:

./bldr deploy:journal,baseline

I look forward to your recommendations!
console-log.txt

Lee Roder

unread,
Jul 19, 2017, 9:19:02 AM7/19/17
to elife-continuum-list
So, I'm logged in, looking at one of the production web servers that got spun up, trying to figure out why nginx doesn't seem to be completely configured, running git to pull down a couple of your key repos just to walk through the contents, etc.  I happened to notice that the version of salt installed is:

salt-minion 2017.7.0 (Nitrogen)

Is that going to present a problem?  How in the world did that version get installed?  That's awfully new. Interestingly, when grepping through various /etc/apt files, I noticed the following:

sources.list.d/saltstack.list.save:deb https://repo.saltstack.com/apt/ubuntu/14.04/amd64/2016.3 trusty main
sources.list.d/saltstack.list:deb https://repo.saltstack.com/apt/ubuntu/14.04/amd64/latest trusty main

Odd...

Lee Roder

unread,
Jul 20, 2017, 10:07:09 AM7/20/17
to elife-continuum-list
HELP!!!!  I had posted a complete console log of the output from my last build run (./bldr deploy:journal,baseline | tee console-log.txt) and am awaiting input on how to proceed!

Meanwhile, I have trawled through the console log hoping to find where things went wrong... could it be this (I removed a bunch of line breaks to render it more readable)?

Sensio\\Bundle\\DistributionBundle\\Composer\\ScriptHandler::clearCache\nScript Sensio\\Bundle\\DistributionBundle\\Composer\\ScriptHandler::clearCache handling the post-install-cmd event terminated with an exception

[RuntimeException]
An error occurred when executing the "\'cache:clear --no-warmup\'" command:
    [Symfony\\Component\\DependencyInjection\\Exception\\ServiceNotFoundException]
    The service "elife.journal.templating.promise_aware" has a dependency on a non-existent service "templating.globals"

Lee Roder

unread,
Jul 20, 2017, 6:41:15 PM7/20/17
to elife-continuum-list
OK... progress at long last.... dug deep today, looking at the console log, trying to trace back to the first issue, digging into salt-master bits and bobs... drilling down into pillars... I got one of my co-workers to team up with me.  It all seemed to surround some missing 'nginx' key that jinja was complaining about.  And there wasn't an nginx key in the pillar data.  Long and short, after quite a bit of testing/sifting through things, it turns out that the following was not present in file /srv/pillar/elife.sls and needed to be:
    nginx:
        certificate_folder: /etc/certificates

FINALLY, we were able to successfully build out the web servers.
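
For anyone else chasing this: a quick way to confirm the minion actually sees the new pillar key, and then re-apply the states without another full build, is roughly:

sudo salt-call pillar.items | grep -A 1 certificate_folder     # run on the affected instance
sudo salt-call state.highstate                                 # re-apply the formulas with the new pillar data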

Then, on to the next issue - nginx config file problems.  First off, the version of php that ended up being installed was php 7.0.  php-fpm/fastcgi is being used, but the socket file referenced in journal.conf was certainly not correct.  I had certificate problems on top of it, even when we were finally getting certs dropped into the correct place (ended up having to regenerate my certs -- problem on my end).  Also, ssl setup seemed off.  Anyhow, at LOOOOONG last, https://www.elife-test.info is alive!  Kinda sorta....  I'm not really sure what page should get served up by default, and pretty much none of the app_xxxxx.php pages end up displaying.  I can get app_dev.php to show something, but it looks like it needs a document reference and I'm not even certain that there is a document library present!
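
In case it helps anyone doing the same dance, two checks worth running on the instance (the /etc/php* path is what a stock php7.0-fpm install would use, so verify on your own box):

sudo nginx -t                    # validates the rendered nginx config before reloading
grep -rn "^listen" /etc/php*     # shows the socket php-fpm is actually listening on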

Ready for more assistance!!

Luke Skibinski

unread,
Jul 20, 2017, 8:12:34 PM7/20/17
to elife-continuum-list
Hey Lee, sorry for not getting back to you sooner. I suspect Giorgio may be sick or away and I don't work for eLife on Thursdays.

I'm going through your last messages now and that full log, thanks for posting those and good work on your deep dive.

Luke Skibinski

unread,
Jul 20, 2017, 9:21:52 PM7/20/17
to elife-continuum-list
> I happened to notice that the version of salt installed is:
> salt-minion 2017.7.0 (Nitrogen)

I have no idea how that was installed. Off the top of my head, I'd say the salt bootstrap script defaulted to the latest possible version because it didn't read the 'salt' value in your project file properly. Wild guess though.
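
A quick way to compare what was asked for with what actually got installed (the projects/ path is an assumption about where your builder checkout keeps its project files):

grep -n "salt:" projects/*.yaml     # in the builder checkout: the version the project file requests
salt-minion --version               # on the instance: the version that was actually installed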


> Is that going to present a problem?

It's certainly not going to help debug things ;) We'll debug what we can for now and keep this in the back of our minds.


> I noticed the following:
> sources.list.d/saltstack.list.save:deb https://repo.saltstack.com/apt/ubuntu/14.04/amd64/2016.3 trusty main
> sources.list.d/saltstack.list:deb https://repo.saltstack.com/apt/ubuntu/14.04/amd64/latest trusty main

We use the official bootstrap script here: https://github.com/saltstack/salt-bootstrap
I'm guessing it configures those sources itself.


> it turns out that the following was not present in file /srv/pillar/elife.sls and needed to be:
>    nginx:
>        certificate_folder: /etc/certificates

yes - a synchronisation problem it seems. Once upon a time when I first made builder public I knew we'd have these problems and wrote some builder tasks to help me sync up the examples (builder-private-example, example.top, example.pillar in formulas). Needless to say, they were never used and any syncing has been done by hand. I'll dig them out now and see if anything else is missing. We may be able to automate something to alert us when there is a divergence.



> [RuntimeException]
> An error occurred when executing the "\'cache:clear --no-warmup\'" command:
>     [Symfony\\Component\\DependencyInjection\\Exception\\ServiceNotFoundException]
>     The service "elife.journal.templating.promise_aware" has a dependency on a non-existent service
> "templating.globals"

Yes - this looks like a formula bug - it looks like that state should have failed. Further up in stderr we also see:

> [54.82.238.182] out:                     - Applying patches for bobthecow/mustache-bundle
> [54.82.238.182] out:                       https://github.com/bobthecow/BobthecowMustacheBundle/pull/9.patch (Remove superfluous argument)
> [54.82.238.182] out:                      Could not apply patch! Skipping. The error was: The "https://github.com/bobthecow/BobthecowMustacheBundle/pull/9.patch" file could not be downloaded (HTTP/1.1 502 Bad Gateway)
> [54.82.238.182] out:                  

not sure if the two are related or not, but buggy as hell. I'll open a ticket to get this looked at.


> Also, ssl setup seemed off. 

could you elaborate on this? the ssl setup is standard across all our projects and gets decent marks (https://www.ssllabs.com/ssltest/analyze.html?d=lax.elifesciences.org&hideResults=on). the bulk of the ssl configuration doesn't live in the per-project conf files but in the nginx conf file: https://github.com/elifesciences/builder-base-formula/blob/master/elife/config/etc-nginx-nginx.conf#L89-L108. The project conf just needs to turn it on and listen on the right port: https://github.com/elifesciences/journal-formula/blob/master/salt/journal/config/etc-nginx-sites-available-journal.conf#L45

If you have any suggestions here I'd be happy to hear them.

Cheers,

Luke

Luke Skibinski

unread,
Jul 20, 2017, 9:40:10 PM7/20/17
to elife-continuum-list
> I'm not really sure what page should get served up by default, and pretty much none of the app_xxxxx.php pages end up displaying. 
> I can get app_dev.php to show something, but it looks like it needs a document reference and I'm not even certain that there is a document library present!
> Ready for more assistance!!

I don't know much about journal internals. I do know it calls out to many other services for its content. I would suggest that if the log files aren't revealing anything and your Salt output is 100% green, there is an integration problem with the other services. I'm curious to know how journal should behave when it's completely cut off... Regardless, this is the territory of Chris Wilkinson, so I'll forward this thread to him.




Lee Roder

unread,
Jul 21, 2017, 8:34:22 AM7/21/17
to elife-continuum-list
Thanks, Luke!

For starters, I have attached the nginx config file that gets generated and placed on the 'journal' servers.  You will note that there is no listen 443 ssl; line.  I see the if test in the nginx config template from 'builder' that you linked, and it is clear that this particular build passed this test: {% if salt['elife.cfg']('project.elb') %}, which ended up giving me no port 443/ssl.  I guess that would explain why the pillar file had no nginx certs in it.  Yet elsewhere in the build process, some Jinja script expected those certs to be present/available.  Ugh... What are these servers, by the way?  I do not see them mentioned in the continuum doc.  I see things like:

yet I was told to launch something called journal, which doesn't seem to be in this list.


So, here's what I'd REALLY like to know.  Could someone perhaps tell me precisely which build steps I need to take, and in what order, to construct a 'complete' publishing environment?  I am quite happy to proceed... it just seems like I am missing parts.  For example, the error messages I can coax out of the web code deployed in a 'journal' build seem to indicate that it expects some sort of 'api' to be present -- yet I was not instructed to build an API server.  Also, should there be a database somewhere?  It seems when I originally attempted to build an elife web server, a database node was indeed deployed.  I do not see the same thing with my journal build...

Lee Roder

unread,
Jul 21, 2017, 8:35:41 AM7/21/17
to elife-continuum-list
oops!  forgot to attach the nginx conf file.  Here it is!
journal.conf

Luke Skibinski

unread,
Jul 23, 2017, 9:17:20 PM7/23/17
to elife-continuum-list
I took a look at that nginx conf file source on Friday for other reasons and found it quite confusing, so I sympathize.

I don't know what your mandate is, but I assume you're following these docs:

I don't know if the history of this effort is at all helpful in giving you context, so if it's not, just skip the next bit.

Here goes: Of all the different bits and pieces running at eLife prior to our recent revamp in ~May 2017, only a subset of those could be considered necessary to running one's own publishing system. This subset was picked out and documented and called 'continuum', a project name with far too many syllables. This documentation effort was occurring concurrently with another effort of replacing or re-purposing those systems being documented for this recent revamp (that we've been calling '2.0'). This is why this repo: https://github.com/elifesciences/elife-continuum-documentation is mostly confusing, misleading cruft. 

This doc, however: https://github.com/elifesciences/elife-continuum-documentation/tree/master/deploy_on_aws *is* being maintained; it relates to the post-2.0 architecture.

2.0 eLife is a microservice-based architecture. What this means is that we've swapped the problems of building and maintaining a monolithic piece of software for the problems of integrating smaller, more tightly focused pieces of software. There are many pros and cons to this approach, but it fits eLife better right now. From this cluster of microservices, a subset has again been picked out to be supported as 'continuum 2.0'.

I don't have any diagramming tools available to me, so I'll just explain the rough layout as succinctly as I can:

1 journal service. 
is dumb, has no database, no state, but knows where to find its content (api-gateway)

2 api-gateway
in charge of pulling all the services together under a single interface. 
API is described using RAML (like Swagger)

3 services
this is where it spreads out. 
I won't list all of them, just know there are many and more are planned.
they all live behind the api-gateway and must all return responses conforming to the api spec.

3.1 lax (article service)
this service is in charge of serving up json derived from xml

3.2 journal-cms (content service)
this service is in charge of serving up regular page content managed by Drupal

3.3 elife-metrics (metrics service)
this service is in charge of collecting page views, downloads and citations about articles

3.4 search
this service provides Lucene/Solr/Elastisearch type search capabilities.

3.5 recommendations
not entirely sure what this does ... I think it might be related content type searching.


Most of those services are fairly 'flat': it's just them and perhaps a reliance on another service or two; for example, metrics may depend on lax, and recommendations may depend on search and metrics.

the one exception here is Lax, which is the gatekeeper between legacy production systems and these shiny new 2.0 services. Lax is the tip of the iceberg. Beneath lax we have:

3.1.1 bot-lax-adaptor 
a 'bridge' for communication between elife-bot and lax. bot-lax-adaptor relies on a library called 'elife-tools' that provides many tiny functions for extracting specific bits of content from article xml.

3.1.1.1 elife-bot
the elife-bot is one of those 'legacy production systems' I mentioned - it's big, it's very complex, and does most of the automated heavy lifting and scheduling involved during production. it relies heavily on AWS infrastructure.

3.1.1.1.1 elife-dashboard
hangs off of the elife-bot, giving the production team insight into the status and history of articles.

3.1.1.1.1.1 elife-article-scheduled
tied to the elife-dashboard project, its job is scheduling and publishing scheduled articles


The 'jats-scraper' that you also mentioned is obsolete now. It scraped JATS xml and produced json in a simple format we called 'EIF' that we once used for ingesting into other systems. We're removing support for it from our systems.

Not on your list was the 'api-dummy' project, which can be a drop-in replacement for the api-gateway and services in order to get a working journal up. It's step #5: https://github.com/elifesciences/elife-continuum-documentation/tree/master/deploy_on_aws

Not mentioned at all is that eLife uses third parties for submission and pre-production and, at time of writing, has no alternatives. Continuum in its current state can't provide a 'complete' submission-to-publication publishing environment, even if we were to ignore the gaps in documentation and configuration.

If all you're trying to do is get steps 1-5 up and running, and your journal instance is still giving you a 'site broken' type page, I think the next step is looking at the integration between journal and api-dummy.
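
A quick sanity check from the journal box is to hit whatever API URL journal has been configured to talk to and see whether anything answers at all, e.g. (the hostname here is a placeholder):

curl -sI http://<your-api-dummy-host>/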

I'm not sure where the others are at the moment or why they're not responding, but these two systems are outside my knowledge. If I have time I'll try and bring up a local instance and get it talking to the dummy api.

Hope this helps clear things up a bit.

Lee Roder

unread,
Jul 24, 2017, 10:23:31 AM7/24/17
to elife-continuum-list
Luke,
Thanks much for that explanation.  It certainly helps a great deal.  Your overview was quite useful in providing context, and the 'deploy_on_aws' link was particularly helpful!

So, based on your comments as well as the 'deploy_on_aws' link, which I have quickly scanned, it would appear that 'api-dummy' would be the next piece to deploy.  I guess the only thing that leaves me feeling a bit odd is that 'api-dummy' does not sound very production-level... I haven't dug in to figure out exactly what it is, but I am hoping that it will give me the ability to build out/plug in/access whichever of the various services my client is looking to implement.

As one quick example, they have mentioned the dashboard.  Based on your overview, it would appear that I would need to build out/deploy the API server, the lax service, the elife dashboard, and possibly the elife article scheduler in order to have a functioning dashboard -- or does 'api-dummy' somehow provide all of that?

Meanwhile, armed with this overview of how the various pieces fit together, I will get back with my client to understand which components of this architecture they wish to deploy.

I suspect that I will be back with more questions once I have been able to visit with them!

Lee Roder

unread,
Jul 24, 2017, 12:27:45 PM7/24/17
to elife-continuum-list
Quick followup.  I have spoken with my client and what they would like to deploy initially is:

journal server
api server
lax server
elife bot
elife dashboard
elife article scheduler

Sadly, it appears as if the links from https://github.com/elifesciences/elife-continuum-documentation/tree/master/deploy_on_aws to:

WIP Deploying a(n) api-gateway instance

WIP Deploying a lax instance

WIP Deploying a(n) elife-bot instance

WIP Deploying a(n) elife-dashboard instance


are not yet available.  Also, I do not see a line item for elife-article-scheduler.  Will that be part of the elife-dashboard howto?


Given that my client wishes to have the components listed above installed and configured, what is your recommendation for the 'how to' recipe to achieve this today?

Luke Skibinski

unread,
Jul 24, 2017, 10:44:52 PM7/24/17
to elife-continuum-list
Hi Lee,

The article scheduler is included on the dashboard instance.

As for a recommendation on what to do next? I would suggest baby steps - get the api-dummy involved, get the journal instance displaying something other than a 'site broken' type page.

So (rough builder commands for the launch steps are sketched just after this list):

1. bring up an api-dummy instance
2. configure your journal instance to talk to your api-dummy instance
[celebrate]
3. bring up an api-gateway instance
4. bring up a lax instance
5. configure gateway to talk to lax
6. feed lax some elife xml 
7. expect article xml to appear on journal site
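
In builder terms, steps 1, 3 and 4 above are the usual launch commands, sketched here with the same syntax used for the other instances (swap 'prod' for whichever environment name you're actually using):

./bldr launch:api-dummy,prod
./bldr launch:api-gateway,prod
./bldr launch:lax,prod

Steps 2 and 5 are configuration changes rather than builder commands.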

The xml going into Lax needs to adhere to JATS but since JATS is so loose in specification, there are probably other undocumented idioms that need to be followed. I think if your client is serious about using our tech or simply wants to see their own content appear on the journal site, then we need to start looking at how elife JATS is different to their JATS. If they don't use JATS xml at all then it's perhaps not worth the effort for this exercise.

After that we can focus on the bot and the dashboard and feeding lax xml through it.

Luke Skibinski

unread,
Jul 24, 2017, 10:47:14 PM7/24/17