Have you had success with Traffic Splitting? How do you use it to roll out new versions?

David Hardwick

Jul 21, 2015, 1:53:17 PM
to google-a...@googlegroups.com
Hello,

We haven't used Traffic Splitting yet, but it has been an available feature for a while, and rumor has it that traffic splitting for non-default modules could arrive in as little as a month.

Anyhow, if you have experience using it, I would like to hear how you are using it to roll out new features or versions.  I've heard the term 'canary' testing, where you roll out a new version to 10% of folks...you measure the results and then either roll back or fully roll it out.  So if anyone is doing 'canary' testing and deployments as I've described it, then I'd like to hear from you.

Thanks in advance,
  Hardwick

David Hardwick

Jul 24, 2015, 9:14:02 AM
to Google App Engine, david.h...@bettercloud.com
Oh boy, the lack of response here is not encouraging.

Paul Canning

Jul 24, 2015, 12:10:14 PM
to Google App Engine, david.h...@bettercloud.com
Any word on when we can use it with non-default modules?

Michael Spainhower

Jul 24, 2015, 2:50:33 PM
to Google App Engine, david.h...@bettercloud.com
@David, I started following this thread because I have the exact same question and agree the lack of response is worrisome.

We are in the Cloud Startup Program and I plan to ask about canary testing during my next engineering 1-on-1.  I will reply to this thread with what I learn from their engineer.

Jason Collins

Jul 24, 2015, 3:56:27 PM
to Google App Engine, david.h...@bettercloud.com, sp...@homebuddy.io
Traffic-splitting / canary releases on App Engine are definitely "a thing".

Traffic-splits on non-default modules are now available via API:

Michael Spainhower

Jul 24, 2015, 4:14:29 PM
to Google App Engine, david.h...@bettercloud.com, jason.a...@gmail.com
Sure, I don't think David or I have any question of whether canary testing is a possible or intended use case of traffic splitting.  I am interested in how folks have implemented it in practice for production apps.

For example, I don't have a great solution for canary testing a version which changes the ndb model schema.  I would love to hear the concrete lessons learned from anyone who has done such a thing.

Another example is how do you elegantly synchronize decoupled apps?  What I mean is that e.g., we run our APIs in a different project than our web front-end.  There are several ways to handle this, but again would love to get war stories from anyone who has run something similar in production.

David Hardwick

Jul 24, 2015, 4:44:06 PM
to Michael Spainhower, Google App Engine, jason.a...@gmail.com
(For the model/schema change, I think that comes down to being prepared to have the schema accessed both ways. E.g., if you need to remove a column, you first roll out new code that doesn't use the column; the old code still works because the column is still there. After you get to 100% on the new version, you can nix the column...stuff like that.  So the QA team would actually want to test the traffic split approach in lower (non-production) environments prior to the production deployment to make sure it works out as planned.)
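A minimal sketch of that column-removal sequence, using plain dicts rather than real ndb entities so it runs anywhere. The field name `legacy_notes` and all three helper functions are hypothetical, chosen only for illustration.

```python
# Hypothetical field being retired; records are plain dicts, not ndb entities.
LEGACY_FIELD = "legacy_notes"


def read_user_v2(record):
    """New code: never reads the legacy field, but tolerates its presence."""
    return {"name": record["name"], "email": record.get("email", "")}


def write_user_v2(record, **changes):
    """New code: updates fields without deleting the legacy one,
    so the old version can keep reading the same record during the split."""
    updated = dict(record)
    updated.update(changes)
    return updated


def drop_legacy_field(record):
    """Cleanup step, intended to run only once the new version has 100% of traffic."""
    return {key: value for key, value in record.items() if key != LEGACY_FIELD}
```

The point of the ordering is that, while traffic is split, both versions can read and write the same records; the destructive step is deferred until no old code is left to break.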

This link [1] provides more details on Traffic Splitting options (cookies vs IP) and some areas to be aware of with respect to caching issues.

It would be good to see if folks favor IP over Cookie for the splitting, and what lessons learned they had with regard to caching issues.
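For anyone weighing the two options, the difference largely comes down to which key identifies a visitor. The sketch below is a generic hash-bucketing illustration, not App Engine's actual implementation; `bucket` and `serves_canary` are hypothetical names, and the key could be an IP address or a random ID stored in a cookie.

```python
import hashlib


def bucket(key, buckets=1000):
    """Map a key (an IP address or a cookie value) to a stable bucket number."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def serves_canary(key, canary_fraction, buckets=1000):
    """True if this key falls inside the share of buckets given to the new version."""
    return bucket(key, buckets) < round(canary_fraction * buckets)
```

With an IP key, a user who changes networks may flip between versions, and many users behind one NAT land in the same bucket. With a cookie key, the assignment follows the browser, but clients that do not send cookies need some other rule.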

Rock on,
  Hardwick

--
David Hardwick | CTO | w. 646-237-5388 
3405 Piedmont Rd. NE, Suite 325, Atlanta, GA 30305

Alex Martelli

Jul 26, 2015, 3:45:34 PM
to google-a...@googlegroups.com, david.h...@bettercloud.com, jason.a...@gmail.com
On Fri, Jul 24, 2015 at 1:14 PM, Michael Spainhower <sp...@homebuddy.io> wrote:
> Sure, I don't think David or I have any question of whether canary testing is a possible or intended use case of traffic splitting.  I am interested in how folks have implemented it in practice for production apps.

I've done it in the past for some internal apps (on internal infrastructure, not app engine, but the architectural patterns are pretty much the same).

Rather than just "canary testing", I preferred to frame things as a "very gradual roll-out". Have the new version of whatever micro-service X we're updating start by handling, say, 5% of new incoming queries (depending on the overall QPS to your app, you may need to tweak that 5% -- you do need the new version to be handling enough queries to give you a statistically representative sample, but few enough that if a failure mode is revealed it will only inconvenience not-too-many customers... a delicate balance to draw).
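To make the "tweak that 5%" trade-off concrete, here is a back-of-the-envelope sketch. Both function names are hypothetical, and the target sample size is something you would have to choose for your own app.

```python
def canary_requests(total_qps, canary_fraction, seconds):
    """Approximate number of requests the new version handles in a time window."""
    return total_qps * canary_fraction * seconds


def seconds_to_collect(target_requests, total_qps, canary_fraction):
    """Approximate time needed for the new version to see target_requests."""
    canary_qps = total_qps * canary_fraction
    if canary_qps <= 0:
        raise ValueError("the canary must receive some traffic")
    return target_requests / canary_qps
```

For example, at 200 QPS overall a 5% split gives the new version about 10 QPS, so collecting 10,000 requests takes roughly 1,000 seconds. At 2 QPS overall the same sample takes about 100 times longer, which is one reason a low-traffic app may need a larger starting fraction.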

Monitor the new version in depth, not just for "health" (though that's of course crucial), but also resource consumption and latency in response for the new version compared to the existing one -- if it's a pretty "core" micro-service, doubling latency, or doubling the consumption of some constrained resource, may not be something you can afford -- probably better to roll back and return to the drawing board to find out what's happening and (one hopes:-) fix it. ("peripheral" micro-services, e.g ones only occasionally used, may give you more latitude in what extra latency and/or resource consumption you can tolerate).
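One way to turn that comparison into an explicit gate is sketched below. The `Metrics` fields, the `canary_verdict` function, and the default thresholds are all assumptions for illustration; a "core" service would likely use tighter limits than a "peripheral" one.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float      # 95th percentile response latency
    cpu_ms_per_request: float  # stand-in for any constrained resource


def canary_verdict(baseline, canary, max_ratio=1.5, max_error_delta=0.005):
    """Compare the new version against the existing one and list any regressions."""
    reasons = []
    if canary.error_rate - baseline.error_rate > max_error_delta:
        reasons.append("error rate")
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_ratio:
        reasons.append("latency")
    if canary.cpu_ms_per_request > baseline.cpu_ms_per_request * max_ratio:
        reasons.append("cpu")
    decision = "rollback" if reasons else "proceed"
    return decision, reasons
```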

If everything is fine at e.g 5%, then move to 10% -- rinse, repeat. Almost always, a new micro-service version that's fine at 5% will also be fine at 10%, 20%, etc, as well -- but, "almost" is not quite good enough for a mission-critical production app. E.g, there may be a rare but occasionally occurring "query of death" for the new version, specifically tickling some bug that's usually dormant there -- and you may not meet any occurrence of the QoD at, say, 40%, but you might occasionally see some at 60%.
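The "rinse, repeat" loop can be sketched as follows. `set_split` and `is_healthy` are hypothetical callbacks: the first would change the traffic split through whatever tooling you use, and the second would soak at that level long enough to judge the new version before answering.

```python
def gradual_rollout(steps, set_split, is_healthy):
    """Walk through increasing traffic fractions, rolling back on the first failure.

    Returns True if every step passed, False if the rollout was rolled back.
    """
    for fraction in steps:
        set_split(fraction)
        if not is_healthy(fraction):
            # Send all traffic back to the existing version.
            set_split(0.0)
            return False
    return True
```

Every step is checked, including the late ones, so a problem that only shows up at 60% still triggers a rollback even though 5% through 40% looked fine.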

Which is what drilled into me the "no shortcuts" stance -- "hope is not a strategy" and all that. I'd rather take several days to complete the roll-out (and, if I'm the team's manager or lead, play "lightning rod" to protect the rest of the team from pressure by stakeholders) and do so with complete confidence, than risk emergencies surging and hurting users' workloads -- "think of the user, and everything else follows" is a mantra I have long lived by.

So this is just one (though important!) of the architecture patterns that micro-services bring to the fore -- it would also exist for a monolithic app, just not quite so prominent and without the many bells and whistles micro-services suggest (e.g, the ability for any app component to back down to a "known good" version of a micro-service it consumes, if and when it can detect -- or strongly suspect, e.g by timeouts -- that it's being served by a new but alas defective version of some micro-service or other... with plenty of logging and pagers ringing of course, but, that goes without saying to anybody who knows what DevOps is all about:-).
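A minimal sketch of that "back down to a known good version" idea, assuming the consumer can call either version through a plain callable. `call_with_fallback`, its parameters, and the choice of `TimeoutError` as the trigger are hypothetical.

```python
def call_with_fallback(call_new, call_known_good, log, recoverable=(TimeoutError,)):
    """Try the new version first; on a suspicious failure, log and use the known-good one."""
    try:
        return call_new()
    except recoverable as exc:
        log("new version failed (%r); falling back to known-good version" % (exc,))
        return call_known_good()
```

Only the listed exception types trigger the fallback, so unrelated errors still surface normally, and every fallback is logged so it can feed alerting.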

I'm just back from OSCON, and micro-services were all over the place (I even had a couple of slides mentioning them, very much in passing alas!, as part of my own "Modern Python patterns and idioms" talk there:-) -- but I was seriously disappointed by not seeing any of these key architectural patterns explored in depth...

It's as if every one of these talks was "μS 101" with some specific twist wrt language and/or underlying platform, each interesting, mind you!, but none of the ones I attended dove deep enough into the core architecture patterns (only weakly dependent on platform and language) that we all need to learn and refine as the new architectures emerge.

An opportunity to submit appropriate talks for the next OSCON I guess (Austin, TX, May 2016)...:-).


> For example, I don't have a great solution for canary testing a version which changes the ndb model schema.  I would love to hear the concrete lessons learned from anyone who has done such a thing.

Alas, schema changes are always a bother, no matter the underlying technologies and architectures. I can't offer more than applause to David's post below, for starting to highlight some of the relevant issues -- TL;DR, that schema changes must be done in an incremental way, always having the code that wants/prefers/supports schema version N to operate non-destructively on versions N-1 and N+1 as well. A bother indeed, but, I have no silver bullet to slay that particular werewolf, sorry.
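As a rough illustration of "operate non-destructively on versions N-1 and N+1", here is a sketch with records as plain dicts. The `schema_version` marker, the `display_name` field, and the `load`/`save` helpers are all assumptions made up for the example.

```python
SCHEMA_VERSION = 2  # the version this code prefers (N)


def load(record):
    """Return a working copy usable by version-N code, whatever version wrote it."""
    data = dict(record)
    version = data.get("schema_version", 1)
    if version < SCHEMA_VERSION:
        # Written by N-1: supply a default for the field introduced in N.
        data.setdefault("display_name", data.get("name", ""))
    # Written by N+1: fields this code does not know simply stay in the copy.
    return data


def save(original, data):
    """Merge changes over the original so unknown fields from newer code survive."""
    merged = dict(original)
    merged.update(data)
    # Never downgrade a version marker written by newer code.
    merged["schema_version"] = max(original.get("schema_version", 1), SCHEMA_VERSION)
    return merged
```

The two rules doing the work are "default what is missing" on read and "preserve what you do not understand" on write.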

This isn't limited to canarying or incremental roll-out, though those patterns highlight the problem in particularly stark colors. But even back in the times of big-bang upgrades and hours of downtime to let them happen, I think I've witnessed more release/upgrade disasters tied to schema changes, than to any other single root cause... the risks are just more obvious and blocking today, rather than deeply hidden, and that extra visibility need not be a bad thing, in fact.

> Another example is how do you elegantly synchronize decoupled apps?  What I mean is that e.g., we run our APIs in a different project than our web front-end.  There are several ways to handle this, but again would love to get war stories from anyone who has run something similar in production.

I may be missing something here -- I'd expect the web front-end to be a consumer of the APIs just like any other front-end would (an excellent architectural separation), so e.g a new API version would be handled by the web front-end just as it would by any other client (mobile native apps, etc, etc) -- use explicit versioning, version negotiation, and so forth -- just general best-practice patterns of API architecture, no?
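For what "version negotiation" might look like at its simplest, here is a sketch that assumes integer API versions advertised by both sides; `negotiate_version` is a hypothetical helper, not part of any particular framework.

```python
def negotiate_version(client_versions, server_versions):
    """Pick the highest API version that both the client and the server support."""
    common = set(client_versions) & set(server_versions)
    if not common:
        raise ValueError("no API version in common")
    return max(common)
```

During a gradual roll-out of the API, the old and new server versions would both keep the previous API version in their supported list, so the web front-end and other clients keep working whichever version answers.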

I'm sure there are other use cases that show your point better, so, let's please discuss them!


Alex

 




David Hardwick

Jul 27, 2015, 11:54:34 AM
to Alex Martelli, Google App Engine, jason.a.collins
Thanks, Alex, that was a great post.  Very much appreciate your time in putting that together.  To sum it up, we have this architectural pattern for doing 'very gradual' roll-outs (and patterns are just distilled versions of many experiences, so we get a great start when we use them, but it is up to us to build out the last mile), and we need to consider how to use App Engine Traffic Splitting as a tool in this larger process/pattern.

Thanks again!
  Hardwick