upgrading production cluster (sharded) system

Peter

Jan 14, 2015, 12:44:46 PM
to akka...@googlegroups.com
Hi

I wonder if anyone has experience or thoughts to share about upgrading production cluster systems?

I would ideally like to:
  • upgrade the cluster without downtime or a scheduled outage
  • not mutate infrastructure, in other words, deploy a new set of nodes with the new version
  • do a staged upgrade, starting with a single node that takes as little production traffic as possible - a canary in the coal mine

A little bit more about my specific environment
  • cluster runs as a single EC2 autoscale group
  • no akka roles (looking into this as a way to gain independence between functional areas within the application & facilitate independent upgrades - something akin to micro services to use the buzzword du jour) 
  • I don't use akka persistence, but each sharded actor is backed by my own distributed persistence mechanism based on DynamoDB
  • there is some tolerance for stale reads but there could be some cases where it's not acceptable

My understanding is that the number of cluster shards should be kept constant regardless of the number of cluster nodes, so that shard resolution also remains stable, as in the example in the documentation. It sounds like the bundled rebalancing strategy (LeastShardAllocationStrategy) should do the trick when adding the first node (the canary). Are there any suggestions for doing the rest?
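
To illustrate the point about a fixed shard count, here is a minimal sketch in plain Python (not Akka code). The names `NUMBER_OF_SHARDS`, `shard_id` and `allocate` are made up for illustration, and the round-robin placement is only a stand-in for a real allocation strategy. It shows that the entity-to-shard mapping depends only on the entity id and the shard count, while only the shard-to-node placement changes when a node joins.

```python
import zlib
from typing import Dict, List

NUMBER_OF_SHARDS = 100  # assumed value; kept constant for the lifetime of the system


def shard_id(entity_id: str, number_of_shards: int = NUMBER_OF_SHARDS) -> int:
    """Map an entity id to a shard with a hash that is stable across processes."""
    # zlib.crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(entity_id.encode("utf-8")) % number_of_shards


def allocate(number_of_shards: int, nodes: List[str]) -> Dict[int, str]:
    """Toy shard-to-node placement (round robin); only this part depends on the nodes."""
    return {shard: nodes[shard % len(nodes)] for shard in range(number_of_shards)}


entity_ids = [f"user-{i}" for i in range(1000)]

# Three old nodes.
shards_before = {e: shard_id(e) for e in entity_ids}
placement_before = allocate(NUMBER_OF_SHARDS, ["old-1", "old-2", "old-3"])

# A canary node joins: placement changes, the entity-to-shard mapping does not.
shards_after = {e: shard_id(e) for e in entity_ids}
placement_after = allocate(NUMBER_OF_SHARDS, ["old-1", "old-2", "old-3", "new-1"])

moved_shards = [s for s in range(NUMBER_OF_SHARDS)
                if placement_before[s] != placement_after[s]]
```

Round robin moves many more shards than necessary when a node joins; a real strategy would try to move as few as possible, which is part of what the questions below are about.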

  • start all of the remaining new cluster nodes 
    • at what point does rebalancing get kicked off? Is there a specific event that triggers a rebalance? Is it possible to delay it until all the new nodes (or X nodes) have joined, or Y time has passed, to minimize disruption (a single rebalance vs. a rebalance for every node)?
  • wait for period X to ensure rebalancing is complete and all messages buffered during rebalancing have been processed
    • is it possible to determine this programmatically?
  • stop all the old version nodes 
    • one by one with a period in between or all at once?
    • at this point, messages in flight are lost, so clients need to retry
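
The client-side retry in the last step could look roughly like this. It is a sketch in plain Python under stated assumptions: `send_with_retry`, `DeliveryFailed` and `flaky_send` are hypothetical names, and the `send` callable stands in for whatever request the client makes against the cluster.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class DeliveryFailed(Exception):
    """Raised by the (hypothetical) send function when no reply arrives in time."""


def send_with_retry(send: Callable[[], T], max_attempts: int = 5,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send() until it succeeds, doubling the delay after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except DeliveryFailed:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Usage: a send that fails twice (e.g. while old nodes are stopping), then succeeds.
attempts: List[int] = []
delays: List[float] = []


def flaky_send() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise DeliveryFailed("no reply")
    return "ack"


# Recording the delays instead of sleeping keeps the example fast.
result = send_with_retry(flaky_send, sleep=delays.append)
```

Retrying is only safe if the receiving side tolerates duplicates, since a message may have been processed even though the reply was lost.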

It gets progressively more hand-wavy towards the end as I'm still thinking about the details. I would love some input and feedback!

Thanks
Peter

Roland Kuhn

Jan 20, 2015, 7:10:56 AM
to akka-user
Hi Peter,

upgrading usually implies that some changes were made: bugs were fixed and features added. This usually also implies that some messages have changed in format or meaning (if only from “broken” to “works now”). Operating nodes with different understandings within the same cluster is a very risky proposition because it is very hard to get right: the new nodes must be fully capable of understanding the old ones, and they must also not confuse the old ones with new language. Other issues arise when talking to a shared data store: if new nodes write new data, will the old nodes be able to deal with it? Will they silently and unknowingly corrupt new records?
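
One way to make the shared data store problem fail loudly instead of silently is a version check before every write. The following is a rough sketch in plain Python, not something taken from Akka or DynamoDB: the `schema_version` field, `guarded_update` and the in-memory `store` dict are all assumptions made for illustration.

```python
from typing import Any, Dict

Record = Dict[str, Any]


class UnsupportedSchemaVersion(Exception):
    """The stored record was written by a newer version than this node understands."""


def guarded_update(store: Dict[str, Record], key: str, changes: Record,
                   node_schema_version: int) -> Record:
    """Apply changes unless the stored record is newer than this node's schema."""
    current = store.get(key, {})
    # Records without a version field are assumed to be version 1.
    stored_version = current.get("schema_version", 1)
    if stored_version > node_schema_version:
        raise UnsupportedSchemaVersion(
            f"{key}: stored v{stored_version}, node understands v{node_schema_version}")
    updated = {**current, **changes, "schema_version": node_schema_version}
    store[key] = updated
    return updated


store: Dict[str, Record] = {}

# A new node (schema v2) writes a record with a field the old nodes do not know.
guarded_update(store, "account-42", {"balance": 10, "currency": "EUR"},
               node_schema_version=2)

# An old node (schema v1) is refused instead of silently overwriting the record.
try:
    guarded_update(store, "account-42", {"balance": 5}, node_schema_version=1)
    old_node_write_rejected = False
except UnsupportedSchemaVersion:
    old_node_write_rejected = True
```

This does not make mixed versions safe. It only turns silent corruption into a visible error on the old nodes, and a real store would need the check and the write to happen atomically.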

For these reasons the stories from the field I have heard have all pointed towards doing red/blue deployments, starting a new cluster next to the old one and shifting traffic and deployment size between them in order to switch over.
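
Shifting traffic between the two clusters can be sketched as a weighted split at the entry point. This is an illustration in plain Python under assumptions of my own: `choose_cluster` and the percentage knob are hypothetical, and in practice the split would usually live in a load balancer or DNS rather than in application code.

```python
import zlib


def choose_cluster(request_key: str, new_cluster_percent: int) -> str:
    """Route a key to the "new" or "old" cluster based on a stable hash bucket."""
    if not 0 <= new_cluster_percent <= 100:
        raise ValueError("new_cluster_percent must be between 0 and 100")
    # The same key always lands in the same bucket, so a key that has moved
    # to the new cluster stays there as the percentage is raised.
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return "new" if bucket < new_cluster_percent else "old"


keys = [f"user-{i}" for i in range(10_000)]

# Start with a small canary share, then raise it step by step.
on_new_at_5 = {k for k in keys if choose_cluster(k, 5) == "new"}
on_new_at_50 = {k for k in keys if choose_cluster(k, 50) == "new"}
on_new_at_100 = {k for k in keys if choose_cluster(k, 100) == "new"}
```

Hashing on the key rather than choosing randomly per request keeps each entity on one cluster at a time. It does nothing for state that both clusters share.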

Regards,

Roland

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.



Dr. Roland Kuhn
Akka Tech Lead
Typesafe – Reactive apps on the JVM.
twitter: @rolandkuhn


09goral .

Jan 20, 2015, 7:32:41 AM
to akka...@googlegroups.com
I guess you mean blue/green ? :)

--
Pozdrawiam,
Mateusz Górski

Anders Båtstrand

Jan 21, 2015, 10:31:51 AM
to akka...@googlegroups.com
I would love to hear some more details about this! How do you keep the two clusters from writing to the same persistence-id? Not all my actors are divided into groups that I can use to separate the traffic stream. Some are global, and would (in my current application) immediately start to persist stuff from both clusters...

Best regards,

Anders

Denis Mikhaylov

May 19, 2016, 4:47:19 PM
to Akka User List
That is what my concerns are about! Any news, ideas, best practices?