architecture for Blue/Green deployments


dans...@gmail.com

May 1, 2019, 4:35:20 PM
to Django users
My organization is moving into the AWS cloud. With some other projects that use MongoDB, Elasticsearch, and a web application framework that is not Django, we've had no problem.

I'm our "Systems/Applications Architect", and some years ago I helped choose Django over some other solutions.   I stand by that as correct for us, but our Cloud guys want to know how to do Blue/Green deployments, and the more I look at it the less happy I am.

Here's the problem:
  • Django's ORM has long shielded developers from fragile SQL of the "SELECT * FROM fubar ..." and "INSERT INTO fubar VALUES (...)" sort, because it always names columns explicitly.
  • However, if an existing "Blue" deployment knows about a column, it will try to retrieve it:
    • fubar = Fubar.objects.get(name='Samwise') 
  • If a new "Green" deployment is coming up, and we want to wait until Selenium testing has passed, we have the problem of migrations
I really don't see any simple way around building a new database cluster/instance when we bring up a new application cluster, with something like this:
  • Mark the live database as "in maintenance mode". The application will no longer write to the database, and we can also make the application user's access read-only to enforce this.
  • Take a snapshot
  • Restore the snapshot to build the new database instance/cluster (see the sketch after this list).
  • Mark the new database as "live", i.e. clear "maintenance mode". If the webapp user was made read-only, restore its full read/write permissions.
  • Run migrations in production
  • Bring up new auto-scaling group
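For what it's worth, the snapshot/restore steps can be scripted against RDS. Here is a rough sketch, assuming boto3 and an RDS PostgreSQL instance; all identifiers below are hypothetical:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Snapshot the live ("Blue") database.
rds.create_db_snapshot(
    DBInstanceIdentifier="blue-db",
    DBSnapshotIdentifier="blue-db-pre-green",
)
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="blue-db-pre-green",
)

# Restore the snapshot as the new ("Green") instance.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="green-db",
    DBSnapshotIdentifier="blue-db-pre-green",
    DBInstanceClass="db.t3.medium",
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="green-db",
)

# The maintenance-mode / read-only toggling is application- and
# database-specific; with PostgreSQL one option is
#   ALTER ROLE webapp SET default_transaction_read_only = on;
# (which affects new sessions only), and RESET the setting to restore
# read/write access.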
Of course, some things that Django does really help:
  • The database migration is first tested by the test process, which runs migrations
  • The unit tests have succeeded before we try to do this migration.

Does anyone have experience with cloud operations design for Blue/Green deployments with Django?

Shaheed Haque

May 2, 2019, 12:34:55 PM
to django...@googlegroups.com
Hi Dan,

I recently went through an exercise similar to the one you describe, moving our prototype code onto AWS.

First, my background includes a stint building a control plane for autoscaling VMs on OpenStack (and being generally long in the tooth), but this is my first attempt at a web app, and therefore at Django too. I also grew up on VAXes, so the notion of an always-up cluster is deeply rooted.

Technical comments follow inline...

On Wed, 1 May 2019 at 21:35, <dans...@gmail.com> wrote:
My organization is moving into the AWS cloud. With some other projects that use MongoDB, Elasticsearch, and a web application framework that is not Django, we've had no problem.

I'm our "Systems/Applications Architect", and some years ago I helped choose Django over some other solutions.   I stand by that as correct for us, but our Cloud guys want to know how to do Blue/Green deployments, and the more I look at it the less happy I am.

Here's the problem:
  • Django's ORM has long shielded developers from fragile SQL of the "SELECT * FROM fubar ..." and "INSERT INTO fubar VALUES (...)" sort, because it always names columns explicitly.
  • However, if an existing "Blue" deployment knows about a column, it will try to retrieve it:
    • fubar = Fubar.objects.get(name='Samwise') 
  • If a new "Green" deployment is coming up, and we want to wait until Selenium testing has passed, we have the problem of migrations
I really don't see any simple way around building a new database cluster/instance when we bring up a new application cluster, with something like this:
  • Mark the live database as "in maintenance mode". The application will no longer write to the database, and we can also make the application user's access read-only to enforce this.
  • Take a snapshot
  • Restore the snapshot to build the new database instance/cluster.
  • Mark the new database as "live", i.e. clear "maintenance mode". If the webapp user was made read-only, restore its full read/write permissions.
  • Run migrations in production
  • Bring up new auto-scaling group
We are not yet doing auto-scaling, but otherwise your description applies very well to us. Right now, we have a pair of VMs: a "logic" VM hosting Django, and a "db" VM hosting Postgres (long term, we may move to Aurora for the database, but we are not there right now). The logic VM is based on an Ubuntu base image, plus a load of extra stuff:
  • Django, our code and all Python dependencies
  • A whole host of non-Python dependencies starting with RabbitMQ (needed for Celery), nginx, etc
  • And a whole lot of configuration for the above (starting with production keys, passwords and the like)
The net result is that not only does it take 10-15 minutes for AWS to spin up a new db VM from a snapshot, but it also takes several minutes to spin up, install, and configure the logic VM. So, we have a piece of code that can do a "live-to-<scenario>" upgrade:
  • Where scenario is "live-to-live" or "live-to-test".
  • The logic is the same in both except for a couple of small pieces only in the live-to-live case where we:
    • Pause the live system (db and celery queues) before snapshotting it for the new spin up
    • Create an archive of the database
    • Switch the Elastic IP on successful sanity test pass
  • We also have a small piece of run-time code in our project/settings.py that, on a live system, enables HTTPS and so on (sketched below).
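That settings.py switch amounts to something like the following minimal sketch (the DEPLOY_SCENARIO environment variable name is hypothetical):

# project/settings.py (excerpt)
import os

IS_LIVE = os.environ.get("DEPLOY_SCENARIO", "test") == "live"

if IS_LIVE:
    # Only the live system forces HTTPS and secure cookies.
    SECURE_SSL_REDIRECT = True
    SESSION_COOKIE_SECURE = True
    CSRF_COOKIE_SECURE = True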
Before we do the "live-to-live" upgrade, we always do a "live-to-test" upgrade. This ensures we have run all migrations and pre-release sanity checks on virtually current data before we perform a *separate* live-to-live.

While this works, it creates a window during which the service must be down. There is also a finite window in which all those third-party dependencies on apt and pip/PyPI expose the "live-to-live" upgrade to potential failure.

So in the "long term", I would prefer to attempt something like the following:
  • Use a cluster of N logic VMs.
  • Use an LB at the front end.
  • Enforce a development process that ensures that (roughly speaking) all database changes result in a new column, and that the old column cannot be removed until a later update cycle. All migrations populate the new column.
  • We spin up an N+1th VM with the new logic and, once sanity testing has passed, switch the N+1th machine on in the LB and remove one of the original N.
    • Loop
  • Delete the old column
Of course, the $64k question is how to keep the old logic and the new logic in sync across the two columns. For that, I can only wave my arms at present and say that the old column cannot really be there in its bare form; instead there will be some kind of view that makes it look like it is, possibly with some old-school stored procedure/trigger logic in support (a rough sketch of that follows). Of course, I would love it if there were some magic tooling developed by the Django and database gurus before I have to tackle this. Then again, I don't believe in magic. And nor do I believe we'll have an army of devs to fake the magic.
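To show the shape of that arm-waving, here is a minimal sketch of one flavour of it (keeping both columns and syncing them with a trigger rather than a view), assuming PostgreSQL and a hypothetical rename of "name" to "display_name" on an app_fubar table. This is not a worked-out solution, just an illustration:

# Hypothetical "expand" migration: add the new column, backfill it, and keep
# the old and new columns in sync so old and new code can coexist.
from django.db import migrations, models

SYNC_SQL = """
ALTER TABLE app_fubar ADD COLUMN IF NOT EXISTS display_name varchar(100);
UPDATE app_fubar SET display_name = name WHERE display_name IS NULL;

CREATE OR REPLACE FUNCTION app_fubar_sync_name() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.display_name := COALESCE(NEW.display_name, NEW.name);
        NEW.name := COALESCE(NEW.name, NEW.display_name);
    ELSIF NEW.name IS DISTINCT FROM OLD.name THEN
        NEW.display_name := NEW.name;            -- old code wrote "name"
    ELSIF NEW.display_name IS DISTINCT FROM OLD.display_name THEN
        NEW.name := NEW.display_name;            -- new code wrote "display_name"
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS app_fubar_sync_name_trg ON app_fubar;
CREATE TRIGGER app_fubar_sync_name_trg
    BEFORE INSERT OR UPDATE ON app_fubar
    FOR EACH ROW EXECUTE PROCEDURE app_fubar_sync_name();
"""

class Migration(migrations.Migration):
    dependencies = [("app", "0041_previous")]

    operations = [
        # The raw SQL owns the schema change; only tell Django's migration
        # state that the new field exists.
        migrations.SeparateDatabaseAndState(
            state_operations=[
                migrations.AddField(
                    model_name="fubar",
                    name="display_name",
                    field=models.CharField(max_length=100, null=True),
                ),
            ],
            database_operations=[migrations.RunSQL(SYNC_SQL)],
        ),
    ]

A later "contract" migration would drop the trigger and the old column once no deployed code still reads "name".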

I'd love to be shown a better way...(e.g. a complete second cluster, with a rolling migration of data from old to new until the old is killed?) else I'll be on the hook for making the above work!

Thanks, Shaheed
 


dans...@gmail.com

May 2, 2019, 2:36:48 PM
to Django users
What gold!

The essential piece of your logic is here:
Enforce a development process that ensures that (roughly speaking) all database changes result in a new column, and that the old column cannot be removed until a later update cycle. All migrations populate the new column.

I assume that "enforce" is a software engineering management thing, not clever CI/CD code.

To change a column, you essentially do it in two steps:
  • Create new column with migrations, and do a release to the environment.
  • Drop old column with migrations, and do a release to the environment.
If you are only dropping an old column, it might go like this:
  • Drop the old column from the model, but not from the database (i.e. ensure that makemigrations has not been run for this change), test and deploy.
  • Add the migration that does away with the old column, then test and deploy (see the sketch after this list).
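A minimal sketch of the "create new column" step above, as a Django migration (the app, model and field names are hypothetical):

# Step 1 of 2: add and backfill the new column (hypothetical names).
from django.db import migrations, models


def copy_name(apps, schema_editor):
    # Backfill the new column from the old one.
    Fubar = apps.get_model("app", "Fubar")
    Fubar.objects.update(display_name=models.F("name"))


class Migration(migrations.Migration):
    dependencies = [("app", "0041_previous")]

    operations = [
        migrations.AddField(
            model_name="fubar",
            name="display_name",
            field=models.CharField(max_length=100, null=True),
        ),
        migrations.RunPython(copy_name, migrations.RunPython.noop),
    ]

The later release, once nothing deployed still reads the old column, removes it with an ordinary migrations.RemoveField migration.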
Personally, I have a similar career trajectory, but from systems programming in C/C++ to software architect for webapps. Understanding C/C++ and Linux gave me the ability to work with both developers and systems guys. I think it was in 2013 that I did the presentation that started to kill off Adobe ColdFusion (which I didn't want to learn). I instead spent the next 6 years getting all of our applications moved to Django; now we are transitioning to Python 3.6 and Django 2.2, with our first cloud system coming up.

On your slow cloud builds, I have an architecture that may help. The basic idea is easy to explain:
  • Do system provisioning through ansible roles
  • Install all the roles needed for all of your stacks when you build the AMI, but only run some of them, e.g. the basics.
  • If your build time in the cloud takes too long, the architecture ensures you can easily pivot to pre-baking more into the AMI, but you are not assuming you will have to.
Of course, you still cannot get to milliseconds. But it allows you to trade off building ahead of time against building during bring-up nearly to your heart's content. Even the role that knows how to install a Python webapp is already on the AMI.

Shaheed Haque

May 2, 2019, 3:30:04 PM
to django...@googlegroups.com
On Thu, 2 May 2019 at 19:37, <dans...@gmail.com> wrote:
What gold!

The essential piece of your logic is here:
Enforce a development process that ensures that (roughly speaking) all database changes result in a new column, and that the old column cannot be removed until a later update cycle. All migrations populate the new column.

I assume that "enforce" is a software engineering management thing, not clever CI/CD code.

Ack.

To change a column, you essentially do it in two steps:
  • Create new column with migrations, and do a release to the environment.
  • Drop old column with migrations, and do a release to the environment.
Just so.
 
If you are only dropping an old column, it might go like this:
  • Drop the old column from the model, but not from the database (i.e. ensure that makemigrations has not been run for this change), test and deploy.
  • Add the migration that does away with the old column, then test and deploy.
Indeed.

Personally, I have a similar career trajectory, but from systems programming in C/C++ to software architect for webapps. Understanding C/C++ and Linux gave me the ability to work with both developers and systems guys. I think it was in 2013 that I did the presentation that started to kill off Adobe ColdFusion (which I didn't want to learn). I instead spent the next 6 years getting all of our applications moved to Django; now we are transitioning to Python 3.6 and Django 2.2, with our first cloud system coming up.

LOL. I bet we could both tell some tales.

On your slow cloud builds, I have an architecture that may help. The basic idea is easy to explain:
  • Do system provisioning through ansible roles
  • Install all the roles needed for all of your stacks when you build the AMI, but only run some of them, e.g. the basics.
  • If your build time in the cloud takes too long, the architecture ensures you can easily pivot to pre-baking more into the AMI, but you are not assuming you will have to.
Of course, you still cannot get to milliseconds. But it allows you to trade off building ahead of time against building during bring-up nearly to your heart's content. Even the role that knows how to install a Python webapp is already on the AMI.

From the previous experience I alluded to, I appreciate that using Puppet (or Chef or Ansible) to pre-bake a VM image would, as you note, significantly reduce my re-spin "system down" window. Now, I hesitate to say the next bit out loud, because I don't yet know if I am right or whether the received wisdom about freezing dependencies is right, but here goes... comments welcome!

My current thinking is that the notion of trying to keep a public website secure by freezing dependencies (think virtualenv AND apt) is a Sisyphean task, given the implied need to track the transitive fan-out of ALL dependencies down to the kernel. So, given that we have decent test coverage, and are frequently running those "live-to-test" upgrades, I can trade security-vulnerability tracking for compatibility tracking by having each upgrade cycle rebuild from the latest apt and PyPI repositories. That's a win because we have to do the compatibility tracking anyway.

Thus, I get early exposure to occasional issues (e.g. pgAdmin was recently broken by psycopg2 2.8, and django-polymorphic is broken by Django 2.2, both of which I discovered within about a day of the issue arising) while ditching the need to continuously track the security perimeter of the whole shooting match. Another way to look at it is that I leverage everybody else's tracking of their own security issues (fixes for functional issues are a side benefit).

Assuming this analysis proves itself, those benefits outweigh the advantages of a pre-baked VM image over rebuilds from live repositories, at least for me (I might think differently if we had more human bandwidth to burn).

And anyway, I'll fix the downtime with a cluster. One day. :-)



Shaheed Haque

May 3, 2019, 4:10:44 AM
to django...@googlegroups.com
I should clarify one thing...

On Thu, 2 May 2019 at 20:28, Shaheed Haque <shahee...@gmail.com> wrote:
On Thu, 2 May 2019 at 19:37, <dans...@gmail.com> wrote:
What gold!

The essential piece of your logic is here:
Enforce a development process that ensures that (roughly speaking) all database changes result in a new column, and that the old column cannot be removed until a later update cycle. All migrations populate the new column.

I assume that "enforce" is a software engineering management thing, not clever CI/CD code.

Ack.

This response appears to contradict my original note, but what I actually think is that *some* automation along the lines of Django's existing migration capability is possible, and that careful software engineering management would likely be needed to (a) maximise the chance the automation will work, and (b) detect the cases when it won't.

As for the posited automation itself, one could imagine that the existing Django logic to analyse the need for migrations would work as-is. What would likely be needed is a different "backend" that targets blue-green upgrades. I would guess the overall level of automation possible is less than the near-magical current system, but I'm hopeful it might make the problem manageable.

(Of course, I'm still crossing my fingers that some real Django migration and SQL gurus will solve this before I have to).

Thanks, Shaheed

Dan Davis

Jun 24, 2019, 10:31:16 PM
to Django users
A related discussion on django-developers has turned up this module, written by one of the participants in the discussion there:


