Cluster Time difference error


Joel Smith

unread,
Mar 29, 2016, 12:06:49 PM3/29/16
to Event Store
We're seeing a large number of errors on our cluster nodes like:

2016-03-18/192.168.147.211-2114-cluster-node.log:[PID:16535:026 2016.03.18 20:49:41.641 ERROR GossipServiceBase   ] Time difference between us and [192.168.147.213:2113] is too great! UTC now: 2016-03-18 20:49:41.641, peer's time stamp: 2016-03-18 20:48:26.765.

However, the nodes themselves are within milliseconds of each other, and the error only occurs if the time stamps are out by a minute or more.

The nodes are virtual machines (on VMware) running Ubuntu 14.04 LTS, and NTP synchronisation between them confirms that the nodes' clocks are in agreement.

It appears that the comparison of the timestamps is done after processing of the gossip messages. What is the processing path that occurs between receiving the gossip messages and the actual comparison?

The linux VMs are not reporting CPU issues and networking is not showing any problem. Where should we be looking for the performance bottleneck?

Cheers,

Joel

Greg Young

unread,
Mar 29, 2016, 12:17:32 PM3/29/16
to event...@googlegroups.com
There shouldn't be any performance bottleneck causing that (it's all in-memory processing for gossip messages).

It also certainly should not cause this big of a difference:

2016-03-18 20:49:41.641, peer's time stamp: 2016-03-18 20:48:26.765

The place where something could be happening is the send/receive taking a while, but a minute doesn't seem very likely.

If you bring up two instances of curl against http://yourhost:httpport/gossip and run them at about the same time, what do they come up with?

Also is it always the same nodes getting error messages?
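
A rough Python sketch of that comparison, for convenience (the node addresses and the gossip JSON field names such as "members" and "timeStamp" are assumptions to adjust for your cluster):

import json
import urllib.request
from datetime import datetime, timezone

# Assumed node list - replace with your own internal HTTP endpoints.
NODES = ["http://192.168.147.211:2113",
         "http://192.168.147.212:2113",
         "http://192.168.147.213:2113"]

def gossip(base):
    with urllib.request.urlopen(base + "/gossip?format=json", timeout=5) as r:
        return json.loads(r.read().decode())

for node in NODES:
    try:
        members = gossip(node).get("members", [])
    except Exception as e:
        print(node, "unreachable:", e)
        continue
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    for m in members:
        # timeStamp is assumed to look like 2016-03-18T20:49:41.641...; trim to seconds.
        ts = datetime.strptime(m["timeStamp"][:19], "%Y-%m-%dT%H:%M:%S")
        skew = (now - ts).total_seconds()
        print("%s sees %s:%s state=%s skew=%+.1fs" % (
            node, m.get("internalHttpIp"), m.get("internalHttpPort"),
            m.get("state"), skew))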



--
Studying for the Turing test

Greg Young

unread,
Mar 29, 2016, 12:19:57 PM3/29/16
to event...@googlegroups.com
Also can you send a log from around this time? Are these nodes under load?

Joel Smith

unread,
Mar 31, 2016, 6:02:31 AM3/31/16
to Event Store
If I curl the gossip port of node2 and node3 (from node1), the responses are identical bar millisecond differences in the timestamps from the delay between the two curl calls.

We have seen the error message on all three of the nodes at different times.

Although they are in use, the system stats don't suggest they are under stress. However, they are hosted on a VMware cluster, so there might be resource contention at that level.

I've attached logs for the day before and the day in question - this particular instance happens just after midnight, but some possible errors occurred around 23:45 on one of the nodes. The node that was having problems eventually got into a loop and memory usage climbed until the node died (the process is still there, but it no longer responds to gossip). Restarting the process brings everything back into line.

This is our dev cluster, but we have seen the same thing happen in production.

Cheers,

Joel
combined_logs.tgz

ahjohannessen

unread,
Mar 31, 2016, 12:14:36 PM3/31/16
to Event Store
Joel, how many ESX(i) hosts do you have, and are the development ES nodes pinned to a particular ESX(i) host?

ahjohannessen

unread,
Mar 31, 2016, 12:15:46 PM3/31/16
to Event Store
Furthermore, is any particular activity happening "underneath" at that point, e.g. backup of vmware machines?

Greg Young

unread,
Mar 31, 2016, 12:17:20 PM3/31/16
to event...@googlegroups.com
Weird, I don't have Joel's email.

Joel Smith

unread,
Apr 1, 2016, 4:36:39 AM4/1/16
to Event Store
I don't have visibility of this at the moment, but I have asked the infrastructure team who run the ESX platform.

Greg Young

unread,
Apr 1, 2016, 4:55:25 AM4/1/16
to event...@googlegroups.com
To be clear:

"We have seen the error message on all three of the nodes at different times."

One node having a clock issue will put messages on all the nodes (it is gossiping with them). The real question is whether it's one node with a weird clock or multiple. You can find this by looking in the logs: they say what their own time is, which node they were gossiping with, and what time they were given.
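
A quick sketch of that log check (assuming the "Time difference between us and [ip:port] is too great!" line format shown earlier; the script name below is hypothetical):

import re
import sys
from collections import Counter

# Tally which peer each log blames for the time difference, so it is easy
# to see whether it is always the same node with a weird clock.
pattern = re.compile(r"Time difference between us and \[([^\]]+)\] is too great!")

peers = Counter()
for path in sys.argv[1:]:
    with open(path, errors="replace") as f:
        for line in f:
            match = pattern.search(line)
            if match:
                peers[match.group(1)] += 1

for peer, count in peers.most_common():
    print("%6d  %s" % (count, peer))

Run it against the cluster-node logs from all three nodes, e.g. python3 blame_peers.py *-cluster-node.log.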

Cheers,

Greg

Joel Smith

unread,
Apr 1, 2016, 5:29:50 AM4/1/16
to Event Store
Again, I don't have visibility of that, but we have seen the problem occurring at various times of the day, so it's not going to be something as simple as backups - sometimes it happens, sometimes it doesn't.

Joel Smith

unread,
Apr 1, 2016, 5:31:28 AM4/1/16
to Event Store
5 hosts in the ESX cluster (for each environment), no pinning.

Greg Young

unread,
Apr 1, 2016, 5:34:25 AM4/1/16
to event...@googlegroups.com
I'm not that familiar with ESX, but is it possible a node is getting moved etc. and the clock is actually off for a period of time? If this happens with some frequency, it should be possible to write a little script that gets the clock and checks. There isn't any "fancy" code around gossip; it just takes the clock and sends it. An alternative, which I may add anyway, is to have gossip track what its expected time is and, if that jumps, start logging about it.
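
As a sketch of that kind of script (a minimal SNTP probe; the NTP server name, 10-second interval, and 1-second alert threshold are placeholders):

import socket
import struct
import time

NTP_SERVER = "pool.ntp.org"    # assumption: point this at your internal NTP source
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def ntp_time(server=NTP_SERVER):
    # Minimal SNTP client: send a mode-3 request and read the transmit timestamp.
    packet = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(5)
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
    seconds = struct.unpack("!I", data[40:44])[0]
    return seconds - NTP_EPOCH_OFFSET

while True:
    offset = time.time() - ntp_time()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    print("%s local-minus-ntp offset: %+.3fs" % (stamp, offset), flush=True)
    if abs(offset) > 1.0:
        print("%s WARNING: clock appears to have jumped" % stamp, flush=True)
    time.sleep(10)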

Joel Smith

unread,
Apr 1, 2016, 6:06:25 AM4/1/16
to Event Store
When I have seen the error message on all three nodes, it has been for separate errors, with different hosts reported as being out.

The nodes are VMs, so they don't have a real clock (and are in fact on the same ESX cluster). The NTP stats show that the clocks are in sync to within milliseconds, certainly nowhere near a minute of difference. We have also seen it in each environment, so the same symptoms across 3 ES clusters on 3 different ESX clusters. I don't think the issue is clock related, but rather something around performance and the point in the processing path where the comparison against the gossip timestamp happens.

We aren't seeing performance issues within the Linux metrics, and the idle time percentage on main queue and storage writer queue is nearly always 100%.

Joel Smith

unread,
Apr 1, 2016, 6:16:23 AM4/1/16
to Event Store
The NTP stats would pick this up - we graph every minute, and the offset is in the order of milliseconds, not seconds, and certainly not minutes.

It isn't a true time difference, but seems to be a delay in the way the gossips are processed.

Greg Young

unread,
Apr 1, 2016, 6:16:49 AM4/1/16
to event...@googlegroups.com
"I don't think the issue is clock related, but something around
performance and the way that the processing path happens to do the
comparison against the gossip timestamp."

What I am suggesting is that, due to something that occurs, the clock shifts and likely gets corrected afterwards. An example of this could be moving a node.

The path is all in memory. I would be amazed if you were seeing 2
minutes of a performance issue in handling a gossip message.

Can you provide log messages from all 3 nodes during such an incident? Also, as far as I have seen here, your checks against NTP are from when the messages are *not* happening - could you confirm this?

Greg Young

unread,
Apr 1, 2016, 6:17:28 AM4/1/16
to event...@googlegroups.com
How long do the messages last?

Joel Smith

unread,
Apr 1, 2016, 6:25:05 AM4/1/16
to Event Store
If you look at the log archive I posted, it includes logs from 2 days for all three nodes, before the event and during.

The NTP stats are continuously collected and graphed (that was our first assumption, that it was a genuine clock issue).

Greg Young

unread,
Apr 1, 2016, 6:27:46 AM4/1/16
to event...@googlegroups.com
I am not seeing this email?

Joel Smith

unread,
Apr 1, 2016, 6:31:30 AM4/1/16
to Event Store

Greg Young

unread,
Apr 1, 2016, 6:41:05 AM4/1/16
to event...@googlegroups.com
Give me about an hour. I can't open it on my mac so will open it when
I get back to office.

Joel Smith

unread,
Apr 1, 2016, 6:49:12 AM4/1/16
to Event Store
Sure - it should just be a tgz file...

tar -xzf combined_logs.tgz

should extract it on a Mac.

Joel Smith

unread,
Apr 1, 2016, 6:51:01 AM4/1/16
to Event Store
Thanks for your assistance...




Joel Smith

unread,
Apr 1, 2016, 6:56:22 AM4/1/16
to Event Store


Oddly enough, when we get this issue, restarting the Event Store process brings the nodes back into sync with no complaints about time. It's as if Event Store builds up a backlog of gossip messages, but my understanding was that gossip messages are discarded if they take too long to process.

Greg Young

unread,
Apr 1, 2016, 7:05:25 AM4/1/16
to event...@googlegroups.com
They are.

I am noticing there are a whole bunch of issues happening around midnight (I've only looked at .212 so far): heartbeat timeouts, election data not being able to be sent, etc.

There is a time difference in the middle of it.

Much of this log is showing pretty severe network issues (cause unknown at this point); as an example, you have run through 57,000 election cycles.

On the gossip, I think you are right that they are being queued. I just audited the code and the timeout removes the task but does not cancel the actual operation, so it could just be trying very slowly (and when the messages do come through, they come through every few ms).
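
To illustrate the shape of that problem (purely illustrative Python, not Event Store's code): a timeout that only forgets the pending task leaves the slow operation running, so its result - carrying the old timestamp - can still turn up a minute or more later.

import threading
import time

def slow_send(payload, on_done):
    # Stand-in for a gossip HTTP POST whose connect/send has stalled.
    def work():
        time.sleep(75)       # the operation eventually completes...
        on_done(payload)     # ...and delivers a now-stale message
    threading.Thread(target=work, daemon=True).start()

def send_with_timeout(payload, timeout=5.0):
    delivered = []
    slow_send(payload, lambda p: (delivered.append(p),
                                  print("late delivery of stale gossip:", p)))
    time.sleep(timeout)
    if not delivered:
        # The "timeout" only removes our bookkeeping; nothing cancels the
        # underlying send, so the stale gossip still arrives later.
        print("timed out, dropping our reference (operation still running)")

send_with_timeout({"timestamp": time.time()})
time.sleep(80)   # the late delivery fires during this window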

Joel Smith

unread,
Apr 1, 2016, 8:24:02 AM4/1/16
to Event Store
On 17/3 at 23:46, node 211 generates an error log with VERY SLOW BUS MSG alerts. This is just after:

2016.03.17 23:44:46.952 DEBUG ClusterVNodeControll] There is NO MASTER or MASTER is DEAD according to GOSSIP. Starting new elections.

Node 212 had been master.

This was when we got our first monitoring alerts:

checking gossip at 192.168.147.212:2113

Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 2 alive nodes, should be 3 alive

Host: ggh-d-evtapp02
Timestamp: 2016-03-17 23:46:29 +0000
Address:  192.168.147.212



checking gossip at 192.168.147.213:2113

Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 2 alive nodes, should be 3 alive

Host: ggh-d-evtapp03
Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.213


checking gossip at 192.168.147.211:2113

Checking for 3 nodes
Checking nodes for IsAlive state
CheckGossip CRITICAL: Only 1 alive nodes, should be 3 alive

Host: ggh-d-evtapp01

Timestamp: 2016-03-17 23:46:29 +0000
Address: 192.168.147.211


So, nodes 213 and 212 each see 2 alive nodes, and node 211 sees only one.

Everything recovers at the next check at 23:47:29.

Joel Smith

unread,
Apr 1, 2016, 8:53:08 AM4/1/16
to Event Store
Next alerts:

nodes found with states:
<State>PreReplica</State> when expected Master or Slave.


checking gossip at 192.168.147.212:2113
Checking for 3 nodes
Checking nodes for IsAlive state
Checking for exactly 1 master
Checking node state

Host: ggh-d-evtapp02
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.212
Status:  WARNING

nodes found with states:
<State>PreReplica</State> when expected Master or Slave.

checking gossip at 192.168.147.213:2113
Checking for 3 nodes
Checking nodes for IsAlive state
Checking for exactly 1 master
Checking node state

Host: ggh-d-evtapp03
Timestamp: 2016-03-18 00:01:29 +0000
Address: 192.168.147.213
Status:  WARNING

checking gossip at 192.168.147.211:2113
CheckGossip CRITICAL: Could not connect to http://192.168.147.211:2113/gossip?format=xml to check gossip, has event store fallen over on this node?

Host: ggh-d-evtapp01.dev.green.net
Timestamp: 2016-03-18 00:01:29 +0000
Address:  192.168.147.211
Status:  CRITICAL


Recovers at 00:10:29, repeats at 00:11:29, recovers at 00:38:20, repeats at 01:17:29.

It continues flapping thereafter until 20:01, when node 211 no longer recovers and again gets into a Time difference error state. This continues until 20:42, when it reports:


[PID:16535:017 2016.03.18 20:42:14.809 ERROR GossipController    ] Received as POST invalid ClusterInfo from [http://192.168.147.211:2113/gossip]. Content-Type: application/json.
[PID:16535:017 2016.03.18 20:42:14.809 ERROR GossipController    ] Received as POST invalid ClusterInfo from [http://192.168.147.211:2113/gossip]. Body: .


This continues until:

[PID:16535:007 2016.03.18 20:47:22.051 ERROR JsonCodec           ]

During this time, memory use was ballooning, and then:

[PID:16535:018 2016.03.18 22:21:00.433 DEBUG MonitoringService   ] Could not get free mem on linux, received memory info raw string: []

and it was completely fritzed.


Greg Young

unread,
Apr 1, 2016, 8:57:14 AM4/1/16
to event...@googlegroups.com
So there are two issues here. The first is that the old gossips aren't really cancelling and come through later. This is relatively straightforward to deal with at this point:
https://github.com/EventStore/EventStore/blob/release-v3.6.0/src/EventStore.Transport.Http/Client/HttpAsyncClient.cs#L19
You might enjoy the comment. Most likely it was failing on the create and blocking. There is now a compatible license and we can use the other code listed; I will put up an issue.

The second issue is what caused everything to get so backed up network-wise. On .212 it looks like all networking is having trouble, in particular with .211 - you can see it trying to come back, losing connections, etc. .211 doesn't see anyone else (and is having problems on all forms of networking, TCP and HTTP).

EG:

[PID:19643:021 2016.03.18 00:08:29.503 INFO TcpService ]
Internal TCP connection accepted: [Normal, 192.168.147.211:57344,
L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}].
[PID:19643:024 2016.03.18 00:08:32.504 TRACE TcpConnectionManager]
Closing connection 'internal-normal' [192.168.147.211:57344,
L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]
cleanly. Reason: HEARTBEAT TIMEOUT at msgNum 0
[PID:19643:024 2016.03.18 00:08:32.504 INFO TcpConnection ] ES
TcpConnection closed [00:08:32.505: N192.168.147.211:57344,
L192.168.147.212:1111,
{d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Received bytes: 0, Sent bytes:
22
[PID:19643:024 2016.03.18 00:08:32.505 INFO TcpConnection ] ES
TcpConnection closed [00:08:32.505: N192.168.147.211:57344,
L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Send
calls: 1, callbacks: 1
[PID:19643:024 2016.03.18 00:08:32.505 INFO TcpConnection ] ES
TcpConnection closed [00:08:32.505: N192.168.147.211:57344,
L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Receive
calls: 1, callbacks: 0
[PID:19643:024 2016.03.18 00:08:32.506 INFO TcpConnection ] ES
TcpConnection closed [00:08:32.506: N192.168.147.211:57344,
L192.168.147.212:1111, {d06ec4e7-e6b4-407a-b224-fc46286112f8}]:Close
reason: [Success] HEARTBEAT TIMEOUT at msgNum 0

.211

[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ]
ELECTIONS: (V=55714) TIMED OUT! (S=ElectingLeader, M=).
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ]
ELECTIONS: (V=55715) SHIFT TO LEADER ELECTION.
[PID:16535:026 2016.03.18 00:02:37.202 DEBUG ElectionsService ]
ELECTIONS: (V=55715) VIEWCHANGE FROM [192.168.147.211:2113,
{12c20f0a-fddd-41a9-abf2-25b17eac9ef0}].
[PID:16535:026 2016.03.18 00:02:37.607 INFO ReplicaService ]
Subscribing at LogPosition: 4680637529 (0x116FCDC59) to MASTER
[192.168.147.212:1111, {ca3cda3c-7619-4b57-9bbe-8151ea9e8082}] as
replica with SubscriptionId: {c40c4ee6-bdf3-425d-928c-fc6294b3e139},
ConnectionId: {a92bcebd-2269-47ea-a8ae-bb89a558e3cc}, LocalEndPoint:
[], Epochs:
E2483@4680637299:{d40f40f4-7b50-4cb6-9c22-c87382698475}

Greg Young

unread,
Apr 1, 2016, 9:08:13 AM4/1/16
to event...@googlegroups.com
During this time, memory use was ballooning, and then

[PID:16535:018 2016.03.18 22:21:00.433 DEBUG MonitoringService ]
Could not get free mem on linux, received memory info rawstring: []

This I think might be due to the queueing of those messages; I will audit that code more closely.


What's also interesting is that I have only seen anything like this a few times (and only in Azure). I wonder if there is something particular about how the virtualized networks are getting partitioned in these environments as opposed to others.

I'm also going to add monitoring to the statistics, as well as a histogram there, to get some deeper information.

Joel Smith

unread,
Apr 1, 2016, 9:14:54 AM4/1/16
to Event Store
We are going to try putting all three nodes in a group so that they live on the same ESX host (and so eliminate the networking issue).

This has the downside that if that ESX host fails, we lose the whole cluster. However, in Dev at least we can do this to rule networking out as part of the problem.

Joel Smith

unread,
Apr 1, 2016, 10:41:41 AM4/1/16
to Event Store
The networking issue doesn't seem to fit - We've not had any tx or rx errors:

# uptime
15:15:19 up 150 days,  4:16,  2 users,  load average: 0.81, 0.89, 0.68

# /opt/sensu/embedded/bin/metrics-net.rb
ggh-d-evtapp01.net.eth0.tx_packets 358948797 1459520042
ggh-d-evtapp01.net.eth0.rx_packets 409045650 1459520042
ggh-d-evtapp01.net.eth0.tx_bytes 117129928809 1459520042
ggh-d-evtapp01.net.eth0.rx_bytes 125511090261 1459520042
ggh-d-evtapp01.net.eth0.tx_errors 0 1459520042
ggh-d-evtapp01.net.eth0.rx_errors 0 1459520042
ggh-d-evtapp01.net.eth0.if_speed 10000 1459520042


These are counters, so that's no tx_errors or rx_errors in 150 days. There is no firewall between the nodes.

Indeed, this seems to be the clincher (for 211):

[root@ggh-d-evtapp01:2016-03-31]# netstat -s
Ip:
    375175628 total packets received
    0 forwarded
    0 incoming packets discarded
    375162614 incoming packets delivered
    359260474 requests sent out
    5 outgoing packets dropped
Icmp:
    31625 ICMP messages received
    116 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 2382
        echo requests: 29243
    30821 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1586
        echo replies: 29235
IcmpMsg:
        InType3: 2382
        InType8: 29243
        OutType0: 29235
        OutType3: 1586
Tcp:
    1313017 active connections openings
    748250 passive connection openings
    138137 failed connection attempts
    374700 connection resets received
    21 connections established
    353046858 segments received
    362162781 segments send out
    2510208 segments retransmited
    527276 bad segments received.
    996872 resets sent
Udp:
    7202637 packets received
    249 packets to unknown port received.
    0 packet receive errors
    7207436 packets sent
UdpLite:
TcpExt:
    166485 invalid SYN cookies received
    120761 resets received for embryonic SYN_RECV sockets
    712704 TCP sockets finished time wait in fast timer
    11972772 delayed acks sent
    4321 delayed acks further delayed because of locked socket
    Quick ack mode was activated 290061 times
    4584990 times the listen queue of a socket overflowed
    4584990 SYNs to LISTEN sockets dropped
    3738 packets directly queued to recvmsg prequeue.
    333960446 bytes directly in process context from backlog
    3023346 bytes directly received in process context from prequeue
    108203159 packet headers predicted
    480190 packets header predicted and directly queued to user
    41099609 acknowledgments not containing data payload received
    108090553 predicted acknowledgments
    304 times recovered from packet loss by selective acknowledgements
    56 congestion windows recovered without slow start by DSACK
    3707 congestion windows recovered without slow start after partial ack
    TCPLostRetransmit: 26
    51 timeouts after SACK recovery
    3 timeouts in loss state
    1558 fast retransmits
    295 forward retransmits
    16805 retransmits in slow start
    1672168 other TCP timeouts
    TCPLossProbes: 355970
    TCPLossProbeRecovery: 212066
    16 SACK retransmits failed
    290101 DSACKs sent for old packets
    1 DSACKs sent for out of order packets
    241161 DSACKs received
    1 DSACKs for out of order packets received
    54257 connections reset due to unexpected data
    202077 connections reset due to early user close
    15591 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 159022
    TCPSpuriousRTOs: 182
    TCPSackShifted: 643
    TCPSackMerged: 1418
    TCPSackShiftFallback: 6697
    TCPRetransFail: 39
    TCPRcvCoalesce: 44012879
    TCPOFOQueue: 1064
    TCPOFOMerge: 1
    TCPChallengeACK: 536811
    TCPSYNChallenge: 535124
    TCPSpuriousRtxHostQueues: 12215
IpExt:
    InNoRoutes: 2
    InMcastPkts: 739
    InBcastPkts: 14880564
    InOctets: -1302133305
    OutOctets: 1201931353
    InMcastOctets: 26604
    InBcastOctets: 1679428203
    InNoECTPkts: 374870674
    InECT0Pkts: 304957


In particular the lines:

    4584990 times the listen queue of a socket overflowed
    4584990 SYNs to LISTEN sockets dropped

This suggests that the problem is that the process is not able to accept the 
network traffic sufficiently fast and it is being dropped at this point.

On 212:
    3266737 times the listen queue of a socket overflowed
    3266737 SYNs to LISTEN sockets dropped

and on 213:

    736787 times the listen queue of a socket overflowed
    736787 SYNs to LISTEN sockets dropped

On my RabbitMQ servers, which have similar uptimes, I don't see any drops to LISTEN sockets. In fact, a quick grep across the whole of the dev Linux estate doesn't show this on anything other than the Event Store servers. I suspect the issue is in the way the network traffic is being dealt with inside the process, rather than in the networking itself.
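
For reference, a small sketch that reads those two counters straight from /proc/net/netstat (the TcpExt ListenOverflows/ListenDrops fields behind the netstat -s lines above), which makes them easy to graph alongside the NTP offset:

def tcp_ext_counters(path="/proc/net/netstat"):
    # /proc/net/netstat stores each group as a header line of field names
    # followed by a line of values, both prefixed with e.g. "TcpExt:".
    with open(path) as f:
        lines = f.read().splitlines()
    counters = {}
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            names = header.split()[1:]
            nums = [int(v) for v in values.split()[1:]]
            counters.update(zip(names, nums))
    return counters

if __name__ == "__main__":
    c = tcp_ext_counters()
    print("ListenOverflows:", c.get("ListenOverflows", 0))
    print("ListenDrops:", c.get("ListenDrops", 0))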


Greg Young

unread,
Apr 1, 2016, 11:00:28 AM4/1/16
to event...@googlegroups.com
This is listen sockets and the listen queue (e.g. accepting connections). It would not explain TCP connections all dying at the same time.

How long have your ES nodes been running (an estimate on each machine is fine)? And what are your uptimes?

Greg Young

unread,
Apr 1, 2016, 11:12:24 AM4/1/16
to event...@googlegroups.com
By the way, for listen sockets: I fixed a bug in mono a couple of weeks ago (which I think will be bundled with the next version of ES) where it would stop accepting all sockets (HTTP) until the process was restarted. Normally I could only make this happen by sending connect floods to the HTTP server. Have you seen, during the runtime of the servers, a time where all HTTP sockets were being refused?
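
For anyone wanting to try reproducing that against a throwaway test node (a crude sketch only; the host and port are placeholders, do not point it at anything you care about):

import socket

HOST, PORT = "127.0.0.1", 2113   # assumption: a local test node's HTTP port

# Open and immediately close a large number of connections, reporting the
# first failure, to see whether the node ever stops accepting sockets.
for i in range(20000):
    try:
        s = socket.create_connection((HOST, PORT), timeout=1)
        s.close()
    except OSError as e:
        print("connect %d failed: %s" % (i, e))
        break
else:
    print("all connects accepted")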

Joel Smith

unread,
Apr 1, 2016, 11:16:16 AM4/1/16
to Event Store


On Friday, 1 April 2016 16:00:28 UTC+1, Greg Young wrote:
This is listen sockets and listen queue (eg accepting connections).
This would not explain TCP connections all dying at the same time.

But if it is not able to deal with new connections, is it also possible that it is not able to process existing connections expeditiously? Then, after a connection is dropped due to a heartbeat timeout, it would not be able to re-establish it?

How long has your ES nodes been running (estimate on the machine)? And
what are your uptimes?


211# uptime
15:15:19 up 150 days,  4:16,  2 users,  load average: 0.81, 0.89, 0.68
eventst+  9556 16.7 13.4 1736832 545916 ?      Ssl  Mar20 2838:36 /usr/bin/eventstored

212# uptime
16:08:21 up 151 days,  4:55,  1 user,  load average: 0.18, 0.31, 0.27
eventst+ 29897 14.4 12.8 1717196 519212 ?      Ssl  Mar20 2445:11 /usr/bin/eventstored

213# uptime
16:10:32 up 151 days,  4:57,  1 user,  load average: 0.51, 0.53, 0.48
eventst+ 25009 15.3 13.0 1713800 530200 ?      Ssl  Mar31 189:04 /usr/bin/eventstored


213 crashed yesterday with these errors:

[PID:17377:035 2016.03.31 18:38:09.661 ERROR QueuedHandlerThreadP] Error while processing message EventStore.Core.Messages.SystemMessage+BecomeShuttingDown in queued handler 'Subscriptions'.
System.InvalidOperationException: out of sync
  at System.Collections.Generic.Dictionary`2+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].VerifyState () [0x00000] in <filename unknown>:0
  at System.Collections.Generic.Dictionary`2+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].MoveNext () [0x00000] in <filename unknown>:0
  at System.Collections.Generic.Dictionary`2+ValueCollection+Enumerator[System.Guid,EventStore.Core.Services.SubscriptionsService+Subscription].MoveNext () [0x00000] in <filename unknown>:0
  at EventStore.Core.Services.SubscriptionsService.Handle (EventStore.Core.Messages.BecomeShuttingDown message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.MessageHandler`1[EventStore.Core.Messages.SystemMessage+BecomeShuttingDown].TryHandle (EventStore.Core.Messaging.Message message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.InMemoryBus.Publish (EventStore.Core.Messaging.Message message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.InMemoryBus.Handle (EventStore.Core.Messaging.Message message) [0x00000] in <filename unknown>:0
  at EventStore.Core.Bus.QueuedHandlerThreadPool.ReadFromQueue (System.Object o) [0x00000] in <filename unknown>:0


 



Joel Smith

unread,
Apr 1, 2016, 11:18:52 AM4/1/16
to Event Store
I assume this is what happens sometimes when the process locks up - when we get the alert message 

checking gossip at 192.168.147.211:2113
CheckGossip CRITICAL: Could not connect to http://192.168.147.211:2113/gossip?format=xml to check gossip, has event store fallen over on this node?

Host: ggh-d-evtapp01.dev.green.net
Timestamp: 2016-03-18 00:01:29 +0000
Address:  192.168.147.211
Status:  CRITICAL

Greg Young

unread,
Apr 1, 2016, 11:38:27 AM4/1/16
to event...@googlegroups.com
One message can be associated with many things. There are a slew of reasons why this could happen. Is it transient or permanent? If it's permanent, it's likely what I am discussing.

TCP and HTTP are different subsystems.

What's interesting is that if you are getting permanent (until node restart) problems on gossip, they are probably fixed by my patch. I was only able to get them to occur with flooding connects, though, and we don't see this behaviour in many, many other configurations, so I am trying to figure out what may make an environment more susceptible to it (I can come up with a few things but would need more fine-grained testing to verify).

Let me squash my commits on it and I can send over a link to the branch, and you can run binaries built on top of it, if that's acceptable for you? I have a feeling it may resolve the issue.

Greg

Joel Smith

unread,
Apr 1, 2016, 11:48:26 AM4/1/16
to Event Store
When we alert on failing to connect to the gossip port, it either resolves itself relatively quickly or gets stuck. The process is still there, but it no longer responds to gossip (we check via the HTTP connection).

We have often seen this after a period with the time-out-of-sync issue, and we normally need to restart the process afterwards. We developed the monitoring check to spot the case where the cluster was still working because there were still two live nodes, but one node had died. This allows us to restart the node's service without loss of overall service.

We can have a play with the version in Dev to see if it resolves issues.
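
For anyone building a similar check, a rough sketch of the idea (Nagios-style exit codes; the gossip JSON field names "members", "isAlive" and "state" are assumptions based on the 3.x API):

import json
import sys
import urllib.request

EXPECTED_NODES = 3

def check(node):
    with urllib.request.urlopen("http://%s/gossip?format=json" % node, timeout=5) as r:
        gossip = json.loads(r.read().decode())
    alive = [m for m in gossip.get("members", []) if m.get("isAlive")]
    masters = [m for m in alive if m.get("state") == "Master"]
    if len(alive) < EXPECTED_NODES:
        return 2, "CRITICAL: only %d alive nodes, should be %d" % (len(alive), EXPECTED_NODES)
    if len(masters) != 1:
        return 1, "WARNING: %d masters, expected exactly 1" % len(masters)
    return 0, "OK: %d alive nodes, 1 master" % len(alive)

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "192.168.147.211:2113"
    try:
        status, message = check(target)
    except Exception as e:
        status, message = 2, "CRITICAL: could not fetch gossip from %s (%s)" % (target, e)
    print(message)
    sys.exit(status)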

Greg Young

unread,
Apr 1, 2016, 11:51:59 AM4/1/16
to event...@googlegroups.com
"We can have a play with the version in Dev to see if it resolves issues."

It won't this is a mono patch unless we build out packages etc for it.

Basically what was happening was they were breaking their
listen->accept->listen async loop which caused no further connections
to get accepted. It was however rather hard for me to reproduce it
though I can imagine ow environments could be more senstive to it.

Greg Young

unread,
Apr 1, 2016, 12:05:43 PM4/1/16
to event...@googlegroups.com
This is the patch:
https://github.com/gregoryyoung/mono/commit/123aaa5db07fe464d12ea178ecc3f27990cc1a72
Note that the original doesn't handle the case where the accept finishes synchronously, which breaks the loop. This would make total sense for rejected listens (and for HTTP going away until restart; normally TCP would still be working in this case). For transient failures it's likely something else (quite possibly even just packets not getting there, load, etc.).

Greg Young

unread,
Apr 4, 2016, 6:19:56 AM4/4/16
to event...@googlegroups.com
This change has been accepted into mono and can now be tested without using my branch. Make sure you have this commit:
https://github.com/mono/mono/commit/196497780f01da0d275fafa1524bc2100459ee27

to build mono locally
git clone http://github.com/mono/mono
cd mono
./autogen.sh
./configure --prefix=$PREFIX (/opt/mono is a reasonable one)
make
sudo make install

You can then run binaries from here:
https://geteventstore.com/downloads/ (without mono)
mono EventStore.ClusterNode.exe (whatever config)

Let me know if it helps/doesn't help.

Cheers,

Greg

Greg Young

unread,
Apr 18, 2016, 9:02:52 AM4/18/16
to event...@googlegroups.com
This patch is being brought into the binary releases (backported) for 3.6.0, which will be released soon.

Greg Young

unread,
Apr 19, 2016, 6:08:30 AM4/19/16
to event...@googlegroups.com
There are two additional changes aside from the mono patch being
introduced in 3.6.0.

The first is the introduction of a time to live on the gossip messages, so that if the queue gets blocked they will not be sent later (which was the cause of the time difference messages). This is not, however, the root cause, and it only addresses this symptom of the underlying issue.
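
The idea, illustrated as a pattern (not Event Store's actual implementation; the 2-second TTL is an arbitrary placeholder): each queued gossip message carries an expiry, and anything that has sat in a blocked queue past its expiry is dropped instead of being sent with a stale timestamp.

import time
from collections import deque

GOSSIP_TTL = 2.0   # seconds

queue = deque()

def enqueue(message):
    # Stamp the message with its expiry when it is queued.
    queue.append((time.monotonic() + GOSSIP_TTL, message))

def drain(send):
    while queue:
        expires, message = queue.popleft()
        if time.monotonic() > expires:
            continue   # expired while the queue was blocked: drop it
        send(message)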

The root cause of the blocking was WebRequest blocking. This is also
to be resolved in 3.6.0 with a move to httpclient.

This may or may not fix everything in the scenario. I can reproduce your scenario easily; what I do not know is the underlying root cause you were experiencing that caused the blocking. Likely my mono patch resolves this. The system will at least behave properly now in the case that things do end up blocking, and there is additional data available now in case it persists.

3.6.0 can be expected by next week.

Cheers,

Greg

Joel Smith

unread,
Jun 15, 2016, 5:07:27 AM6/15/16
to Event Store

The root cause of the blocking was WebRequest blocking. This is also
to be resolved in 3.6.0 with a move to httpclient.

Looking at https://github.com/EventStore/EventStore/releases, did the httpclient change only make it into 3.6.3?

Greg Young

unread,
Jun 15, 2016, 5:23:29 AM6/15/16
to event...@googlegroups.com
3.6.3 is the version that you want (it contains the PCL version). Earlier versions had a similar change, but there was a different issue where a connect was ignoring timeouts in mono; the PCL HTTP client resolved that issue.