Geo-distributed WAN cluster - hourly spikes


John

Dec 18, 2016, 2:18:24 PM
to codership
Hello everyone,

I posted here a while back:
https://groups.google.com/forum/embed/?place=forum/codership-team&showsearch=true&showpopout=true&hl=en&parenturl=http%3A%2F%2Fgaleracluster.com%2Fcommunity%2F#!searchin/codership-team/geo$20distributed/codership-team/BAFId1pzXl0/6iKJ1v0kJgAJ

Those settings dramatically helped stabilize the cluster.
However, something we either did not notice before or that has only recently appeared is hourly spikes.

Here is what happens: every hour the DB gets very slow. We are on an SSD SAN, so I/O stays around 10-20%,
but everything slows down and queries start taking up to 80 seconds to finish.

Granted, it might be a problem with the application's operation or the DB itself, but is there anything we can do or optimize on the Galera side?
This happens once an hour and sometimes lasts 15-20 minutes or even longer.

The same application, under an even higher load, runs perfectly fine on a Postgres DB.

Could this have anything to do with the replication or replication settings?

my.cnf file attached.

As per Philip's recommendation, sysctl.conf was updated with the following on all nodes:

  $net_core_rmem_max        = "16777216",
  $net_core_wmem_max        = "16777216",
  $net_core_rmem_default    = "16777216",
  $net_core_wmem_default    = "16777216",
  $net_ipv4_tcp_rmem        = "4096 87380 16777216",
  $net_ipv4_tcp_wmem        = "4096 65536 16777216",
  $net_ipv4_tcp_slow_start_after_idle        = "0",
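
(For reference, those Puppet-style variables correspond to the following /etc/sysctl.conf entries; this is just a restatement of the same settings, loaded with sysctl -p:)

  # Larger socket buffers for high-latency WAN links
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  net.core.rmem_default = 16777216
  net.core.wmem_default = 16777216
  net.ipv4.tcp_rmem = 4096 87380 16777216
  net.ipv4.tcp_wmem = 4096 65536 16777216
  # Keep the congestion window open between bursts of replication traffic
  net.ipv4.tcp_slow_start_after_idle = 0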


Any help is much appreciated.

Thanks
John
Attachment: mycnf.txt

alexey.y...@galeracluster.com

Dec 22, 2016, 1:07:22 PM
to John, codership
You could observe the following status variables: wsrep_evs_repl_latency
and wsrep_flow_control_paused. That would narrow things down.
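
For example (standard MySQL syntax; sample these periodically while a spike is in progress):

  -- Group replication latency, reported as min/avg/max/stddev/sample_size
  SHOW GLOBAL STATUS LIKE 'wsrep_evs_repl_latency';
  -- Fraction of time since the last FLUSH STATUS that replication
  -- was paused by flow control
  SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';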

But most likely it is some periodic cleanup, either in the application
or elsewhere in the system.

John Test

Dec 23, 2016, 10:20:07 PM
to Alexey Yurchenko, codership
wsrep_evs_repl_latency | 0.0396777/0.041231/0.0928362/0.00678644/336
wsrep_flow_control_paused    | 0.000000

I ran a test to see if this was replication related: I shut down all the other DB nodes except the active one,
and the issue did not happen, which indicates it is replication related.
I have gone through all the WAN replication settings and Philip's doc again, and I am not sure what to do.

At this point we only have 2 nodes up.

John

Dec 29, 2016, 8:53:16 AM
to codership, alexey.y...@galeracluster.com
Hi Alexey,

Please see my earlier response above regarding shutting down all DB nodes except one.

Updating with the flow control settings I tried, as per
https://www.percona.com/blog/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/

Please see the parameters below:
wsrep_provider_options="gcache.size=9G; gcache.page_size=300M; gmcast.segment=5; evs.keepalive_period=PT3S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.install_timeout=PT1M; evs.join_retrans_period=PT0.5S; gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=256; evs.version=1; evs.auto_evict=5; gcs.fc_limit=2048; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
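
(As an aside, wsrep_provider_options is dynamic, so the effective values can be sanity-checked at runtime and individual options adjusted without a restart:)

  -- Show the options the provider actually loaded (merged with defaults)
  SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';
  -- Adjust a single option at runtime, e.g.:
  SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=2048';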

Results:
1. All nodes down except for one db node -> NO SPIKES
2. 2 nodes up in *same* datacenter -> NO SPIKES
3. 2 nodes up in *different* datacenters -> SPIKE!

This is definitely network related, and my disappointment is that Galera could not handle such a network connection.
Is there really nothing else to tweak in a product whose big selling point is geo-distribution?

After talking to a few other people, these are the options I was offered:
1. hybrid replication (plain mysql replication between datacenters)
2. plain mysql master/slave setup

This ended up being a disappointing scenario, and I am hoping the Galera team will work on it.
It does not look good that Galera cannot do something that plain MySQL replication can.

I honestly don't care how far the remote site falls behind (hence gcs.fc_limit=2048), but it should at least try to accommodate the situation.

Thanks
John




John Gerritt

Jan 2, 2017, 9:56:45 AM
to codership, Alexey Yurchenko
FYI, this issue is resolved. It could have been resolved sooner with better documentation.
Perhaps Philip Stoev might want to include this in his geo-distributed slides.

I had to look at wsrep_local_recv_queue_max to determine the right fc_limit.
I had been setting fc_limit too low: wsrep_local_recv_queue_max was over 5000 on some nodes.
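
In other words (a sketch of the check; the 8192 below is only an illustrative value, not one from this thread):

  -- Worst-case depth the local receive queue has reached on this node
  SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_max';
  -- If that peak exceeds gcs.fc_limit, flow control is throttling the
  -- cluster; raise the limit above the observed peak, e.g.:
  SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=8192';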

Thanks
John


