How to handle (almost) zero-downtime deployments of a sharded persistent system

Denis Mikhaylov

Jun 9, 2016, 3:48:36 AM
to Akka User List
Hi, hakkers!

I have a distributed app that uses `akka-persistence` and `akka-sharding`.
For some of my shard coordinators I use the `remember entities` feature.
I heard (from Bonér or Kuhn, can't say for sure) that in production it's better to use blue/green deployments.
As I see it, the steps are:
1. Deploy and start the fresh application and let it form a cluster (don't start any ShardRegions yet).
2. Stop the ShardRegions that do not use the `remember entities` feature (e.g. Aggregate Roots) on the old system.
3. Start these ShardRegions on the freshly deployed system.
4. Switch external traffic to the new system (so that it can already accept external commands).
5. Stop the ShardRegions that use the `remember entities` feature (e.g. Long Running Processes that react to events from Aggregate Roots) on the old system.
6. Start these ShardRegions on the freshly deployed system (here I need Akka sharding to restart the entities that were alive on the previous system).
7. Shut down the old system.
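
The ordering of these steps can be sketched as a small model. This is only an illustration in plain Python, not Akka code: the `Cutover` class, `run_plan`, and the region names are hypothetical, and the one invariant it checks (a region type never runs on both systems at once) is an assumption about what the plan is meant to guarantee.

```python
from dataclasses import dataclass, field

OLD, NEW = "old", "new"


@dataclass
class Cutover:
    """Tracks which system (old or new) hosts each ShardRegion type."""
    running: dict = field(default_factory=dict)  # region name -> set of systems
    traffic: str = OLD
    log: list = field(default_factory=list)

    def start(self, region, system):
        hosts = self.running.setdefault(region, set())
        # Assumed invariant: the same region must not run on both systems,
        # otherwise the same entity could be alive twice.
        if hosts and system not in hosts:
            raise RuntimeError(f"{region} is still running on {sorted(hosts)}")
        hosts.add(system)
        self.log.append(f"start {region} on {system}")

    def stop(self, region, system):
        self.running.get(region, set()).discard(system)
        self.log.append(f"stop {region} on {system}")

    def switch_traffic(self, system):
        self.traffic = system
        self.log.append(f"traffic -> {system}")


def run_plan(plain_regions, remembered_regions):
    """Replays steps 2-7; step 1 (new cluster formed, no regions) is implicit."""
    c = Cutover()
    for region in plain_regions + remembered_regions:
        c.start(region, OLD)          # initial state: everything on the old system
    c.log.clear()
    for region in plain_regions:      # step 2
        c.stop(region, OLD)
    for region in plain_regions:      # step 3
        c.start(region, NEW)
    c.switch_traffic(NEW)             # step 4
    for region in remembered_regions: # step 5
        c.stop(region, OLD)
    for region in remembered_regions: # step 6
        c.start(region, NEW)
    # Step 7: the old system hosts nothing and can be shut down.
    assert all(hosts == {NEW} for hosts in c.running.values())
    return c


if __name__ == "__main__":
    for line in run_plan(["AggregateRoot"], ["LongRunningProcess"]).log:
        print(line)
```

The model makes one gap in the plan visible: between a stop and the matching start, commands for that region have nowhere to go, which is where the "almost" in almost-zero downtime comes from.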

So the questions are: 
1. Are there any improvements to the deployment process?
2. Wouldn't this scenario corrupt sharding-related or any other Akka-internal data in the journal?
3. How do you handle deployments in production?

Thanks a lot,
Denis.

Justin du coeur

Jun 9, 2016, 3:05:45 PM
to akka...@googlegroups.com
+1 to this question.  For the moment I'm coping with a few seconds of downtime for releases, but we're going to have to become downtime-intolerant before long.  And zero downtime does look challenging in a heavily-sharded application.

Personally, I've been wondering if I should be trying to deal with intermixed releases -- bringing the new release up on one node at a time, in the *same* cluster as the old one, and gradually shutting the old ones down.  That seems to make sense in theory, but also seems bloody dangerous -- it requires that the releases be 100% wire-compatible, which is hard to test and presents evolutionary challenges -- and I'm not sure if there are gotchas to be aware of...
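
One way to get some confidence in wire compatibility is a round-trip check in both directions, since in a mixed cluster old nodes receive messages from new nodes and vice versa. The sketch below is a hypothetical illustration in plain Python with JSON as the wire format; the `Deposit` command, the `currency` field, and its `"USD"` default are all invented for the example, and a real Akka application would exercise its actual serializers instead.

```python
import json


def encode_v1(cmd):
    """Old release: knows only account and amount."""
    return json.dumps({"type": "Deposit", "account": cmd["account"], "amount": cmd["amount"]})


def decode_v1(raw):
    data = json.loads(raw)
    # The old release must ignore fields it does not know about.
    return {"account": data["account"], "amount": data["amount"]}


def encode_v2(cmd):
    """New release: adds an optional currency field (hypothetical)."""
    return json.dumps({
        "type": "Deposit",
        "account": cmd["account"],
        "amount": cmd["amount"],
        "currency": cmd.get("currency", "USD"),
    })


def decode_v2(raw):
    data = json.loads(raw)
    # The new release must supply a default when an old node sent the message.
    return {
        "account": data["account"],
        "amount": data["amount"],
        "currency": data.get("currency", "USD"),
    }


def wire_compatible(sample):
    """Checks old -> new and new -> old for one sample command."""
    old_to_new = decode_v2(encode_v1(sample))
    new_to_old = decode_v1(encode_v2({**sample, "currency": "EUR"}))
    return old_to_new == {**sample, "currency": "USD"} and new_to_old == sample


if __name__ == "__main__":
    print(wire_compatible({"account": "a-1", "amount": 10}))
```

A check like this covers only the messages it is fed, so it narrows the risk rather than proving 100% compatibility.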

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.

Viktor Klang

Jun 9, 2016, 4:31:29 PM
to Akka User List
Can we please define "zero" in this context?
--
Cheers,

Denis Mikhaylov

Jun 9, 2016, 5:23:17 PM
to Akka User List
So that we don't lose external commands (or lose as few as possible).

On Thursday, June 9, 2016 at 23:31:29 UTC+3, √ wrote:

Marek Żebrowski

Jun 10, 2016, 3:00:29 AM
to Akka User List
We use sharding with our own persistence, and without the remember-entities feature.
We do rolling restarts (yes, not advised; lots of pain when the Akka cluster goes crazy during a restart),
so we just shut down shard nodes one by one and start new ones. Usually it works.
In theory it should be possible to start a new cluster in 'read-only' mode, reading data for persistent entities from the old cluster, and then somehow do the `switch`, but there are no tools for such a scenario, and it would be extremely hard to guarantee that writes from the 'old' cluster are persisted and read by the 'new' cluster for the switch. So we struggle with rolling restarts, with all the pain involved.

Justin du coeur

Jun 10, 2016, 2:30:27 PM
to akka...@googlegroups.com
Denis' definition is about right, although I think of it slightly differently: so that an end user doesn't see any hiccup beyond somewhat longer latency than normal.  I can think of various ways to achieve this, but all of them look like a fairly major pain in the tuchus in one way or another, so I'm curious about whether you have recommendations.

We're using ConductR, but I'm curious about approaches both with and without it.  Architecturally, you can think of Querki as a Play application, where 95% of the serious code is under sharded Actors in an Akka cluster, which is started under but separate from Play's built-in ActorSystem.  (That's wildly oversimplified, but a decent thousand-foot block diagram description.)

Gytis G

Dec 3, 2017, 8:32:58 PM
to Akka User List
Bumping an old thread. Justin du coeur, can you perhaps share what approach you took and how it worked out?

Justin du coeur

Dec 4, 2017, 8:57:07 AM
to akka...@googlegroups.com
No change -- for the time being, I'm still coping with a few seconds of downtime per deployment.  (On the grounds that that is better than risking shard duplication between old and new clusters.)  That's not a major pain point while the company is small, but will eventually prove to be more of an issue.

I've been thinking a bit about what might be required for a plausible zero-downtime deployment, but still don't have a design that I'm fully happy with.  (Which is part of why I haven't pursued it yet.)  It really is a pretty tricky problem to address in a theoretically-correct way; I suspect the application and the deployment system (and maybe the Cluster Sharding mechanism itself) have to work together to make it at all plausible.
