IllegalArgumentException during ShardCoordinator failover


Richard Ney

Jan 13, 2017, 12:01:49 AM
to Akka User List
I'm seeing a scenario where Marathon kills a cluster member because of out-of-control memory growth, which happens when a persistent actor stops processing messages. After Marathon kills the member, the cluster role doesn't recover because the shard coordinators run into problems, and my log is full of exceptions. I'm wondering whether the first one, with its "requirement failed" message, is possibly a bug.

2017-01-13 04:47:25.626 [ERROR] [report-compute.07dbb0f4-d92e-11e6-80f8-0aedbb57963c] [reportCompute] [eplworkerslave12.lhr.manhattan.aspect-cloud.net:31871] [PersistentShardCoordinator] Exception in receiveRecover when replaying event type [akka.cluster.sharding.ShardCoordinator$Internal$ShardRegionRegistered] with sequence number [12] for persistenceId [/sharding/reportCompute.mr427.worktypesCoordinator].
java.lang.IllegalArgumentException: requirement failed: Region Actor[akka.tcp://manh...@eplworkerslave4.lhr.manhattan.aspect-cloud.net:31708/system/sharding/reportCompute.mr427.worktypes#447621404] already registered: State(Map(),Map(Actor[akka.tcp://manh...@eplworkerslave4.lhr.manhattan.aspect-cloud.net:31708/system/sharding/reportCompute.mr427.worktypes#447621404] -> Vector()),Set(),Set(),false)
    at scala.Predef$.require(Predef.scala:224)
    at akka.cluster.sharding.ShardCoordinator$Internal$State.updated(ShardCoordinator.scala:276)
    at akka.cluster.sharding.PersistentShardCoordinator$$anonfun$receiveRecover$1.applyOrElse(ShardCoordinator.scala:742)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at akka.persistence.Eventsourced$$anon$3$$anonfun$1.applyOrElse(Eventsourced.scala:481)

Along with that, I have a large number of these messages:

[ReplayFilter] Invalid replayed event [sequenceNr=5, writerUUID=59b217ca-b6f6-4725-aa92-e5930d734daa]. There was already a newer writer whose last replayed event was [sequenceNr=5, writerUUID=c3a77da7-08c4-489c-b158-6c717270713d] for the same persistenceId [/sharding/reportCompute.pens.interactionsCoordinator].Perhaps, the old writer kept journaling messages after the new writer created, or duplicate persistentId for different entities?

[ShardRegion] Trying to register to coordinator at [Some(ActorSelection[Anchor(akka://manhattan/), Path(/system/sharding/reportCompute.pens.worktypesCoordinator/singleton/coordinator)])], but no acknowledgement. Total [100000] buffered messages.

-Richard
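The ReplayFilter warning above fires when two writers (identified by writerUUID) have journaled events for the same persistenceId. The Java sketch below is a simplified, hypothetical model of the idea behind that check, not Akka's ReplayFilter: the classes `Replayed`, `WriterCheck` and `ReplayFilterDemo` are invented for illustration, and the real filter also buffers a window of events and has several modes. The sketch flags any event written by a writer that a newer writer had already superseded, which is the situation the warning describes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of one replayed journal entry (not an Akka class).
final class Replayed {
    final long seqNr;
    final String writerUuid;

    Replayed(long seqNr, String writerUuid) {
        this.seqNr = seqNr;
        this.writerUuid = writerUuid;
    }
}

// Simplified sketch of the idea behind the ReplayFilter warning; not Akka's implementation.
final class WriterCheck {
    // Describes every event written by a writer that a newer writer had already superseded.
    static List<String> invalidEvents(List<Replayed> events) {
        List<String> invalid = new ArrayList<>();
        Set<String> oldWriters = new HashSet<>(); // writers that have been replaced
        String currentWriter = null;              // most recent writer seen so far
        long lastSeqNr = 0L;                      // last sequence number from currentWriter

        for (Replayed e : events) {
            if (e.writerUuid.equals(currentWriter)) {
                lastSeqNr = e.seqNr;
            } else if (oldWriters.contains(e.writerUuid)) {
                // An old writer kept journaling after a newer writer took over.
                invalid.add("Invalid replayed event [sequenceNr=" + e.seqNr
                        + ", writerUUID=" + e.writerUuid + "]; newer writer's last event was [sequenceNr="
                        + lastSeqNr + ", writerUUID=" + currentWriter + "]");
            } else {
                // First event from a new writer: the previous writer becomes "old".
                if (currentWriter != null) {
                    oldWriters.add(currentWriter);
                }
                currentWriter = e.writerUuid;
                lastSeqNr = e.seqNr;
            }
        }
        return invalid;
    }
}

final class ReplayFilterDemo {
    public static void main(String[] args) {
        // Writer "A" wrote up to 4, writer "B" took over at 5, then "A" also wrote a 5.
        List<Replayed> journal = List.of(
                new Replayed(4, "A"), new Replayed(5, "B"), new Replayed(5, "A"));
        // Prints one line flagging [sequenceNr=5, writerUUID=A].
        WriterCheck.invalidEvents(journal).forEach(System.out::println);
    }
}
```

In the terms of this thread, writer "A" would be a coordinator that kept running and journaling after a newer coordinator "B" had taken over the same persistenceId.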

Patrik Nordwall

Jan 13, 2017, 4:38:03 AM
to akka...@googlegroups.com
That can happen if you had more than one active Sharding Coordinator writing to the same journal table, for example, if you had a network partition and used auto-downing, which caused the cluster to split into two separate clusters. These checks are there to detect this problem.

/Patrik

--

Patrik Nordwall
Akka Tech Lead
Lightbend - Reactive apps on the JVM
Twitter: @patriknw
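To make Patrik's explanation concrete: a coordinator rebuilds its state by replaying its journal, and the stack trace in the original post shows a `require` in `State.updated` rejecting a second registration of the same region. The Java sketch below is a hypothetical, simplified model of that invariant; `RegionRegistered`, `CoordinatorState`, `SplitBrainDemo` and the region name are invented for illustration and are not Akka's API. It assumes each side of a split cluster ran its own coordinator and that both recorded a registration for the same region in the shared journal. Each side's events replay cleanly on their own, but the combined journal fails recovery with the same kind of "requirement failed" error.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event: a shard region registered with the coordinator (not Akka's class).
final class RegionRegistered {
    final String region;

    RegionRegistered(String region) {
        this.region = region;
    }
}

// Simplified model of coordinator state; not Akka's ShardCoordinator code.
final class CoordinatorState {
    // region -> shards it hosts (always empty in this sketch)
    private final Map<String, List<String>> regions;

    CoordinatorState() {
        this(new LinkedHashMap<String, List<String>>());
    }

    private CoordinatorState(Map<String, List<String>> regions) {
        this.regions = regions;
    }

    // Same kind of check as the failing require in the stack trace:
    // a region may only be registered once.
    CoordinatorState updated(RegionRegistered event) {
        if (regions.containsKey(event.region)) {
            throw new IllegalArgumentException(
                    "requirement failed: Region " + event.region + " already registered");
        }
        Map<String, List<String>> next = new LinkedHashMap<>(regions);
        next.put(event.region, new ArrayList<>());
        return new CoordinatorState(next);
    }

    int regionCount() {
        return regions.size();
    }

    // Recovery: apply every journaled event, in order, to an empty state.
    static CoordinatorState replay(List<RegionRegistered> journal) {
        CoordinatorState state = new CoordinatorState();
        for (RegionRegistered event : journal) {
            state = state.updated(event);
        }
        return state;
    }
}

final class SplitBrainDemo {
    public static void main(String[] args) {
        String region = "worker4/sharding/worktypes"; // hypothetical region name
        // Each side's coordinator saw a consistent history, so each accepted the registration.
        List<RegionRegistered> sideA = List.of(new RegionRegistered(region));
        List<RegionRegistered> sideB = List.of(new RegionRegistered(region));
        System.out.println(CoordinatorState.replay(sideA).regionCount()); // prints 1

        // Both sides appended to the same journal, so the next coordinator replays both.
        List<RegionRegistered> shared = new ArrayList<>(sideA);
        shared.addAll(sideB);
        try {
            CoordinatorState.replay(shared);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // requirement failed: Region ... already registered
        }
    }
}
```

The check itself is not the bug in this model; it surfaces a journal that two active coordinators wrote to. Per Patrik's reply, the remedy is to avoid setups where two coordinators can be active at once, such as auto-downing during a network partition.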
