On 9 December 2014 at 15:29:53, Johannes Berg (jber...@gmail.com) wrote:
Hi! I'm running load tests against our system and hitting a problem where some of my nodes are marked as unreachable even though the processes are up. A node flips between reachable and unreachable a few times before staying unreachable, logging "connection gated for 5000 ms", and then goes silent.
Looking at the connections made to one of the seed nodes, I see several hundred connections from the other nodes, but none from the failing ones. Is this normal? There are hundreds of connections just between two nodes. When are connections formed between cluster nodes, and when are they torn down?
Also is there some limit on how many connections a node with default settings will accept?
We have auto-down-unreachable-after = 10s set in our config; does this mean that if a node is busy and doesn't respond within 10 seconds it becomes unreachable?
Is there any reason why it would stay unreachable and not re-try to join the cluster?
We are using Akka 2.3.6 and use cluster-aware routers quite heavily, with a lot of remote messages going around.
Can anyone shed some light on this, or point me to some documentation about these things?
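For reference, the auto-down setting mentioned above lives under akka.cluster in Akka 2.3.x; a minimal config sketch (the 10s value is the one quoted in this thread, not a recommendation):

```hocon
akka.cluster {
  # Automatically mark an unreachable node as DOWN after this duration.
  # Note: unreachability is decided by the failure detector (missed
  # heartbeats), not directly by how busy a node is.
  auto-down-unreachable-after = 10s
}
```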
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akka-user+...@googlegroups.com.
To post to this group, send email to akka...@googlegroups.com.
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
Patrik Nordwall
Typesafe - Reactive apps on the JVM
Twitter: @patriknw
I will try that, but it seems it will only help up to a point; when I push the load further I will hit the limit again.
I hit this within a minute of putting on the load, which is a bit annoying. I'm fine with the node becoming unreachable, as long as I can get it back to reachable once it has crunched through the load.
Will it still buffer up system messages even though it's unreachable?
At what rate are system messages typically sent?
Hi Johannes,
On Thu, Jan 22, 2015 at 4:53 PM, Johannes Berg <jber...@gmail.com> wrote:
> I will try that but it seems that will only help to a certain point and when I push the load further it will hit it again.

There is no system message traffic between two Akka systems by default; to have one system send system messages to another you need to use either remote deployment or deathwatch on remote actors. Which one are you using? What is the scenario?
The main issue is that, whatever the rate of system message delivery, we *cannot* backpressure remote deployment or the rate at which watched remote actors die. For any delivery buffer there is a large enough "mass actor extinction event" that will fill it up. You can, however, increase the buffer size up to the point where you expect the burst maximum to fit (for example, if you know the peak number of remote-watched actors and the peak rate at which they die).
> I hit this within a minute after I put on the load which is a bit annoying to me. I'm fine with it becoming unreachable as long as I can get it back to reachable when it has crunched through the load.

That means a higher buffer size. If there is no sysmsg buffer size that can absorb your load, then you have to rethink your remote deployment/watch strategy (whichever feature you use).
> Will it still buffer up system messages even though it's unreachable?

After quarantine there is no system message delivery; everything is dropped. There is no recovery from quarantine; that is its purpose. If any system messages are lost between two systems (and here they are dropped because the buffer is full), the two systems are in an undefined state, especially with remote deployment, so they quarantine each other.
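If raising the buffer is the route taken, the setting being referred to is (if I recall the 2.3.x reference.conf correctly) akka.remote.system-message-buffer-size; a hedged sketch, with the value chosen for an assumed burst peak rather than as a recommendation:

```hocon
akka.remote {
  # Max number of system messages (watch/unwatch, Terminated, remote
  # deployment lifecycle) buffered per association while awaiting
  # acknowledgement. Size it to absorb the worst-case burst of
  # Terminated messages from dying remote-watched actors.
  system-message-buffer-size = 40000
}
```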
As per our experience on the spray-socketio project, watching too many remote actors will cause the cluster to be quarantined very quickly. The default heartbeat interval for remote watching is:

akka.remote {
  watch-failure-detector {
    heartbeat-interval = 1 s
    threshold = 10.0
    acceptable-heartbeat-pause = 10 s
    unreachable-nodes-reaper-interval = 1s
    expected-response-after = 3 s
  }
}

i.e. 1 second. Thus, when "I do use deathwatch on remote actors and the amount of deathwatches I have is linear to the load", the number of heartbeats sent per second could be massive.
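For context, the remote deathwatch under discussion looks something like the following minimal Scala sketch (the Watcher class and the idea that each unit of load gets its own watch are hypothetical illustrations, not code from this thread):

```scala
import akka.actor.{Actor, ActorRef, Terminated}

// Watching an actor that lives on another node: the watch registration
// and the eventual Terminated delivery travel as *system messages* over
// the remoting association -- the acked-delivery traffic discussed above.
class Watcher(remoteWorker: ActorRef) extends Actor {
  context.watch(remoteWorker) // one remote watch per worker

  def receive = {
    case Terminated(ref) =>
      // A mass extinction of watched remote actors produces a burst of
      // Terminated system messages, which must fit in the buffer.
      context.stop(self)
  }
}
```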
...
Did you forget a NOT there? Did you mean "No, the number of heartbeat messages per second is NOT influenced by how many actors you watch"?
On Fri, Jan 23, 2015 at 10:12 AM, Caoyuan <dcao...@gmail.com> wrote:

> As per our experience on the spray-socketio project, watching too many remote actors will cause the cluster to be quarantined very quickly. The default heartbeat interval for remote watching is:
>
> akka.remote {
>   watch-failure-detector {
>     heartbeat-interval = 1 s
>     threshold = 10.0
>     acceptable-heartbeat-pause = 10 s
>     unreachable-nodes-reaper-interval = 1s
>     expected-response-after = 3 s
>   }
> }
>
> i.e. 1 second. Thus, when "I do use deathwatch on remote actors and the amount of deathwatches I have is linear to the load", the number of heartbeats sent per second could be massive.
No, the number of heartbeat messages per second is ***NOT*** influenced by how many actors you watch. The heartbeats are for monitoring the connection between two nodes.
On Fri, Jan 23, 2015 at 7:06 PM, Patrik Nordwall <patrik....@gmail.com> wrote:

> No, the number of heartbeat messages per second is ***NOT*** influenced by how many actors you watch. The heartbeats are for monitoring the connection between two nodes.

I looked at the code again; Patrik is right, the heartbeats are sent between nodes only. I'll reconsider the cause of what we encountered before, to see what happens when too many actors are remote-watched. Thanks for pointing out my mistake.
On 23 Jan 2015, at 08:39, Johannes Berg <jber...@gmail.com> wrote:

Thanks for the answers, this really explains a lot. I will go back to my abyss and rethink some things. See below for some answers/comments.

On Thursday, January 22, 2015 at 6:31:01 PM UTC+2, drewhk wrote:

> There is no system message traffic between two Akka systems by default; to have one system send system messages to another you need to use either remote deployment or deathwatch on remote actors. Which one are you using? What is the scenario?

I do use deathwatch on remote actors, and the number of deathwatches I have is linear in the load I put on the system, so I guess that explains the increased number of system messages under load.

> The main issue is that, whatever the rate of system message delivery, we *cannot* backpressure remote deployment or the rate at which watched remote actors die. For any delivery buffer there is a large enough "mass actor extinction event" that will fill it up. You can increase the buffer size up to the point where you expect the burst maximum to fit.

Thinking about these features more closely, I can see that they may require acked delivery, but I would have expected something that grows indefinitely until out-of-memory, like unbounded inboxes. It's not apparent from the documentation on deathwatch that you need to consider a buffer size if you are watching very many actors that may be created or die at a very fast rate (maybe a note about this could be added to the docs?). From a quick glance at the feature you don't expect it to be limited by anything other than normal actor message sending and receiving. Furthermore, I wouldn't have expected a buffer overflow due to deathwatching to cause a node to be quarantined and removed from the cluster; instead I would expect some deathwatches to fail to work correctly.

Causing the node to go down on buffer overflow seems a bit dangerous considering DDoS attacks, even though it maybe makes the system behave more consistently.
...
Hi Endre,
I didn't understand "There is no system message traffic between two Akka systems by default, to have a system send system messages to another you either need to use remote deployment or deathwatch on remote actors" very well.
Does remote deployment mean creating actors on remote systems?
Let's say two ActorSystems create the same actor hierarchy locally, from the same code, so that actor1 and actor2 in both systems have the same paths. In this case actor1 and actor2 can talk to each other using actorSelection, changing only the Address field.
I wonder, in this case, will any system messages be sent to keep the connection between actor1 and actor2, and can quarantine happen in this scenario? We assume there is no remote deathwatch.
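The scenario in the question can be sketched like this (the system name, host, and port are hypothetical, and remoting must be enabled in the config of both systems):

```scala
import akka.actor.ActorSystem

val system = ActorSystem("app") // the same code runs on both nodes

// Same local hierarchy on both systems; only the address part of the
// path differs. actorSelection sends ordinary user messages, not
// system messages, so by itself (with no remote deathwatch and no
// remote deployment) it triggers no watch traffic.
val remoteActor1 =
  system.actorSelection("akka.tcp://app@host2:2552/user/actor1")
remoteActor1 ! "hello"
```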