Event Store server crashes repeatedly on Linux (Ubuntu 14.10)


Tao Wang

Jun 5, 2015, 4:29:50 PM
to event...@googlegroups.com
I have run several tests to find out how stable Event Store is. My goal is to see whether the Event Store server can stay up until my test fills up the disk. So far, the test causes Event Store to crash every time.

I am wondering whether there are configuration issues or known bugs causing these crashes. Are there any suggestions or measures I can take to prevent them? I am going to run the same test in a Windows environment, but I really do not want to go down that path even if Event Store works better on Windows.



Here are the details of my tests.

Test Environment:

Event Store Server:

Three-node cluster in AWS

CoreOS: Linux kernel 4.0.1

CPU(s): 8
CPU Model name: Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz

RAM: 16GB

Ubuntu 14.10 (container image)

AWS Node Type: c4.2xlarge


Test 1 

JMeter: TCP protocol

Batch Size: 5000

Function under test: writeEvent calls only

Event Data Payload Size: 500 Bytes

Number of concurrent users (JMeter threads): 1

Targeted Test Duration: 72 hours
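
For reference, the JMeter plan is roughly equivalent to this minimal .NET ClientAPI loop - a sketch only, with placeholder endpoint, stream name and event type rather than the actual test code:

using System;
using System.Collections.Generic;
using System.Net;
using EventStore.ClientAPI;

class LoadTest
{
    static void Main()
    {
        // Placeholder node endpoint; external TCP is on port 1113
        var conn = EventStoreConnection.Create(new IPEndPoint(IPAddress.Parse("10.243.0.211"), 1113));
        conn.ConnectAsync().Wait();
        var payload = new byte[500];                   // 500-byte event body, as in the test
        while (true)                                   // run until the disk fills
        {
            var batch = new List<EventData>();
            for (var i = 0; i < 5000; i++)             // batch size 5000
                batch.Add(new EventData(Guid.NewGuid(), "TestEvent", false, payload, null));
            conn.AppendToStreamAsync("load-test-stream", ExpectedVersion.Any, batch).Wait();
        }
    }
}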

Result 1

The Event Store server crashed after five minutes of the load test with the error message below. The error seems related to the "projections=all" setting in Event Store; see https://github.com/EventStore/EventStore/issues/459

[00001,06,17:20:36.361] Connection 'external-normal' [52.7.113.16:55853, {e67e5bdc-306d-4ec8-b499-cab44d28eb47}] closed: ConnectionReset.
[00001,22,17:20:36.390] External TCP connection accepted: [Normal, 52.7.113.16:55856, L10.243.4.120:1113, {7904b60c-3a43-445c-83bf-2617bdae0db2}].
Stacktrace:


Native stacktrace:

./clusternode() [0x612962]
./clusternode() [0x5beb0b]
./clusternode() [0x4584f3]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfc90) [0x7f4dce6f4c90]

Debug info from gdb:


=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

Test 2

Because of the error received in test 1, I disabled projections (projections=ALL) in Event Store.

Result 2

The Event Store server crashed within half an hour, with 65 million events written to disk. The error message is:

[00001,13,19:08:06.045] Error while processing message EventStore.Projections.Core.Messages.ReaderSubscriptionManagement+Unsubscribe in queued handler 'Projection Core #0'.
The given key was not present in the dictionary.
[00001,26,19:08:06.094] External TCP connection accepted: [Normal, 52.7.113.16:56487, L10.243.0.211:1113, {c7cbbcb3-f508-486a-9d02-833471f71dfb}].
[00001,10,19:08:06.105] SLOW BUS MSG [MainBus]: BecomeShuttingDown - 102ms. Handler: WideningHandler`2.

After this error, Event Store tried to restart all three nodes and run the master election process, but the election failed.

Test 3

Added a longer timeout using the GossipTimeoutMs=2000 option on the Event Store command line, to rule out the connection timeout errors seen in the previous test. GossipTimeoutMs is the timeout for gossip to another node, in milliseconds; the default is 500 ms.
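
On the command line this looks something like the following (flag spelling assumed from the 3.0.x option-naming convention - verify with --help):

./clusternode --gossip-timeout-ms=2000 <other options>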

Result 3

Event Store crashed with a heartbeat timeout after one hour of the test, or 13 million events entered into the system. The symptom is similar to the second test: Event Store could not complete the re-election process during the restart that followed the heartbeat-timeout crash.

 

[00001,10,23:19:46.037] ========== [10.243.4.120:2112] CLONE ASSIGNMENT RECEIVED FROM [10.243.8.160:1112,n/a,{cbe16861-7019-4797-9958-3617f07b3f80}].
[00001,10,23:19:46.037] ========== [10.243.4.120:2112] IS CLONE!!! SPARTA!!! MASTER IS [10.243.8.160:2112,{cbe16861-7019-4797-9958-3617f07b3f80}]
[00001,07,23:19:46.046] ONLINE TRUNCATION IS NEEDED. NOT IMPLEMENTED. OFFLINE TRUNCATION WILL BE PERFORMED. SHUTTING DOWN NODE.


James Nugent

Jun 5, 2015, 6:51:54 PM
to event...@googlegroups.com
The third test tells you exactly what the problem is: offline truncation is needed, which requires you to restart the node. The commercial manager acts as a supervisor for this - there is no equivalent in the OSS packages, though.

Did you build from source or use the downloadable packages? Which version of libc do you have?

Result 2 also looks suspect - even though an error is reported, it's handled. Isn't it more likely that all nodes were shut down and, not being supervised, never restarted?

Greg Young

Jun 6, 2015, 8:44:57 AM
to event...@googlegroups.com
1) We can look at it, but we need to know more about the environment - version of libc, etc.

2) You "disabled" projections, then had an error with projections.

3) Pretty self-explanatory: supervise the node. There are times it's
easier to just have a process commit suicide and restart than to try
to handle the edge condition.
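
For the OSS packages, a minimal supervision sketch under systemd (unit and binary paths are hypothetical). Note Restart=always rather than Restart=on-failure - the node exits with code 0 when it shuts down for truncation, so on-failure would not restart it:

[Unit]
Description=Event Store cluster node

[Service]
ExecStart=/opt/eventstore/clusternode --db /event-store-data/db --log /event-store-data/log
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target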



--
Studying for the Turing test

Dmitry Aleksandrov

Jun 7, 2015, 12:58:34 AM
to event...@googlegroups.com
Hi All,


Regards,
Dmitry

James Nugent

Jun 7, 2015, 4:31:49 PM
to event...@googlegroups.com
I think for future builds we'll default to building on the latest LTS version of Ubuntu.

Sam Kanai

Jun 10, 2015, 5:59:26 PM
to event...@googlegroups.com

On Sunday, June 7, 2015 at 4:31:49 PM UTC-4, James Nugent wrote:
I think for future builds we'll default to building on the latest LTS version of Ubuntu.

James, can you explain more about the root cause of the truncation problem?  Why is it an edge case that requires the node to commit suicide?  Is online truncation a future feature on your roadmap?

Thanks,
Sam

Greg Young

Jun 10, 2015, 6:33:28 PM
to event...@googlegroups.com

Supervise the node. This is not uncommon.


Sam Kanai

Jun 10, 2015, 8:15:27 PM
to event...@googlegroups.com
That's not what I was asking. I look dimly upon process crashes, even when they are semi-graceful. The presence of a monitoring agent is not relevant. Brass tacks: I'm questioning the robustness of the server implementation, and would like to know more. This seems a reasonable ask for an enterprise-grade product, yes?

-Sam

Greg Young

Jun 11, 2015, 4:09:01 AM
to event...@googlegroups.com
The situation is known as a deposed master. You had a master who is no
longer the master (he was partitioned). During this time of
partitioning he may have written transactions to his log - while he
still thought he was master - that the others do not have (as they
have moved on after elections). As such, this node's log needs to be
truncated. A log truncation is an operation that's much better and
safer to do on a restart than inside an active node (all sorts of
funky race conditions; it's far more robust to do a clean start at
this point - IIRC Raft specifies the same).

The commercial version comes with a supervisor.

Cheers,

Greg

João Bragança

Jun 11, 2015, 5:27:44 AM
to event...@googlegroups.com
Because the writes were not acknowledged by a majority of nodes, the client gets an error in this situation, right?

Greg Young

Jun 11, 2015, 6:28:55 AM
to event...@googlegroups.com
The client will have received a timeout.

The writes may or may not have reached a majority of nodes. The client
retries on timeouts: if the original write did reach a majority, the
retry is acknowledged idempotently; if it did not, the retry actually
writes.
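
Roughly, in .NET ClientAPI terms - a sketch with placeholder names, assuming an already-connected conn and a byte[] dataBytes. The retry is idempotent because the same event ID is reused on every attempt, so the server recognises the duplicate and acks without writing twice:

var eventId = Guid.NewGuid();                      // fixed across retries
var evt = new EventData(eventId, "OrderPlaced", true, dataBytes, null);
for (var attempt = 0; attempt < 5; attempt++)
{
    try
    {
        conn.AppendToStreamAsync("orders-1", ExpectedVersion.Any, evt).Wait();
        break;                                     // acked: fresh write or idempotent ack
    }
    catch (AggregateException)                     // e.g. a timeout surfaced by Wait()
    {
        // The write may or may not have reached a majority;
        // loop and retry with the SAME eventId.
    }
}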

Greg

James Nugent

Jun 11, 2015, 10:00:01 AM
to event...@googlegroups.com
Actually, no, it doesn't seem reasonable to me - and the presence of a monitoring agent is *highly* relevant. See Erlang.

It's not a crash, just an exit. This is a perfectly good strategy for dealing with lots of conditions, as has been proven over the years. The reason you run clusters of commodity hardware is precisely to allow this kind of behaviour.

Sam Kanai

Jun 11, 2015, 11:31:11 AM
to event...@googlegroups.com
@Greg - Thanks for the explanation, that helps.

@James - I'm terribly sorry you feel that questions about root causes aren't reasonable. You can quibble about semantics, but an unplanned exit is no different from a crash to an Ops person: both generate an incident and require investigation. Only when Ops has sufficient understanding can they disregard exit events. To be clear, I'm not arguing for or against the suicide approach; I just want to know the why.

Related topic: I work with Tao, the original poster, and no master has survived more than 30 minutes under load before losing the heartbeat and dying. This is a distressingly high failure rate and causes a lot of latency spikes on the client. So my question is: is this the best we can expect from GES? Or are we doing something wrong? For example, is this a side-effect of AWS network jitter, and would we see better results on-prem?
 
-Sam

Greg Young

Jun 11, 2015, 11:39:16 AM
to event...@googlegroups.com
"Related topic: I work with Tao, the original poster. and no master
has survived more than 30 minutes under load before losing the
heartbeat and dying. This is a distressingly high failure rate and
causes a lot of latency spikes on the client. So my question is: is
this the best we can expect from GES? Or are we doing something
wrong? For example, is this side-effect of AWS network jitter, and
would we see better results on-prem?"

I would also be interested in seeing what the CPU usage / TCP settings
etc were on the box.



Sam Kanai

Jun 11, 2015, 11:58:25 AM
to event...@googlegroups.com
I can gather that for you.  I'll also gather the OS/GES versions.  Thanks again!

Another question: I've already opened a query to your enterprise support team about onsite training.  If we book an onsite, can we use it to explore questions at this level of technical detail?  That would be incredibly helpful.  

Finally, I apologize if I'm coming across as overly demanding.  We're a conservative company, so I have to meet a very high bar for new software.

-Sam 

Sam Kanai

Jun 12, 2015, 2:24:50 PM
to event...@googlegroups.com
Greg,
 

Ubuntu 14.10 containers on CoreOS beta 647.0.0 with ES version 3.0.3, in AWS.


Still arranging another perf run, but IIRC the CPU has been very underloaded so far.

What TCP settings are you looking for, exactly?


-Sam

Ben Salem

Jun 12, 2015, 3:36:58 PM
to event...@googlegroups.com
Ok, we're running another copy of the load test and it looks like we're using about 20% of available CPU on two machines (approximately 1.5 cores each) and 6-7% of total CPU on the third machine (about 0.5 cores). As long as the process isn't CPU-bound (and I assume it isn't, seeing how it's using 1.5 cores on a couple of machines), we should be totally fine on CPU!

Ben Salem

Jun 12, 2015, 4:45:53 PM
to event...@googlegroups.com
We saw our crash again, and we captured a whole bunch of activity in the second around the crash. Hope you can find something in the log below!


[00001,31,20:09:31.571] External TCP connection accepted: [Normal, 52.7.113.16:59890, L10.243.0.211:1113, {f4f1d01b-21ee-4cb2-b818-79570baa5bc2}].
[00001,31,20:09:36.029] SLOW BUS MSG [Worker #4 Bus]: TcpSend - 4210ms. Handler: TcpSendService.
[00001,21,20:09:36.029] SLOW BUS MSG [Worker #5 Bus]: TcpSend - 4210ms. Handler: TcpSendService.
[00001,31,20:09:36.029] SLOW QUEUE MSG [Worker #4]: TcpSend - 4211ms. Q: 715/715.
[00001,21,20:09:36.029] SLOW QUEUE MSG [Worker #5]: TcpSend - 4211ms. Q: 41/62.
[00001,15,20:09:36.029] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 4222ms. Handler: UnwrapEnvelopeHandler.
[00001,26,20:09:36.029] PTables merge finished in 00:00:31.1839287 ([16004992, 16002064] entries merged into 32007056).
[00001,15,20:09:36.029] SLOW QUEUE MSG [Projection Core #2]: UnwrapEnvelopeMessage - 4222ms. Q: 61/313.
[00001,26,20:09:36.029] Loading PTable '7ab183a5-522b-4bf6-94df-632d3780d8c8' started...
[00001,19,20:09:36.029] SLOW QUEUE MSG [StorageReaderQueue #4]: ReadAllEventsForward - 4211ms. Q: 0/0.
[00001,13,20:09:36.029] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 4212ms. Handler: UnwrapEnvelopeHandler.
[00001,13,20:09:36.029] SLOW QUEUE MSG [Projection Core #0]: UnwrapEnvelopeMessage - 4212ms. Q: 0/253.
[00001,29,20:09:36.031] ES TcpConnection closed [20:09:36.031: N10.243.8.160:37026, L10.243.0.211:1112, {255fbf12-86bd-4ddc-9d5f-0fa2fed666ca}]:
Received bytes: 2042130, Sent bytes: 21826478575
Send calls: 2941073, callbacks: 2941072
Receive calls: 40978, callbacks: 40978
Close reason: [Success] Socket closed

[00001,07,20:09:36.031] ES TcpConnection closed [20:09:36.031: N52.7.113.16:59890, L10.243.0.211:1113, {f4f1d01b-21ee-4cb2-b818-79570baa5bc2}]:
Received bytes: 2868990, Sent bytes: 95
Send calls: 3, callbacks: 2
Receive calls: 358, callbacks: 358
Close reason: [Success] Socket closed

[00001,27,20:09:36.031] ES TcpConnection closed [20:09:36.031: N10.243.4.120:50437, L10.243.0.211:1112, {572717b1-13b8-482e-8912-2aa2e4c8941e}]:
Received bytes: 13683, Sent bytes: 116019785
Send calls: 17087, callbacks: 17086
Receive calls: 253, callbacks: 253
Close reason: [Success] Socket closed

[00001,29,20:09:36.031] Connection 'internal-normal' [10.243.8.160:37026, {255fbf12-86bd-4ddc-9d5f-0fa2fed666ca}] closed: Success.
[00001,07,20:09:36.031] Connection 'external-normal' [52.7.113.16:59890, {f4f1d01b-21ee-4cb2-b818-79570baa5bc2}] closed: Success.
[00001,27,20:09:36.031] Connection 'internal-normal' [10.243.4.120:50437, {572717b1-13b8-482e-8912-2aa2e4c8941e}] closed: Success.
[00001,11,20:09:36.032] Looks like node [10.243.4.120:1112] is DEAD (TCP connection lost).
[00001,24,20:09:36.036] External TCP connection accepted: [Normal, 52.7.113.16:59891, L10.243.0.211:1113, {a14bd876-d5a8-48ee-a278-1a97a270cc30}].
[00001,11,20:09:36.036] CLUSTER HAS CHANGED (TCP connection lost to [10.243.4.120:1112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.830
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.261
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.830
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <DEAD> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.036
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032
--------------------------------------------------------------------------------
[00001,11,20:09:36.036] Looks like node [10.243.8.160:1112] is DEAD (TCP connection lost).
[00001,11,20:09:36.036] CLUSTER HAS CHANGED (TCP connection lost to [10.243.8.160:1112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.830
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <DEAD> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.036
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <DEAD> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.037
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <DEAD> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.036
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032
--------------------------------------------------------------------------------
[00001,11,20:09:36.037] CLUSTER HAS CHANGED (gossip received from [10.243.8.160:2112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <DEAD> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.037
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <DEAD> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.036
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.032
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.832
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.037
--------------------------------------------------------------------------------
[00001,14,20:09:36.037] SLOW BUS MSG [bus]: UnwrapEnvelopeMessage - 4220ms. Handler: UnwrapEnvelopeHandler.
[00001,14,20:09:36.037] SLOW QUEUE MSG [Projection Core #1]: UnwrapEnvelopeMessage - 4220ms. Q: 159/411.
[00001,11,20:09:36.037] There are FEW MASTERS according to gossip, need to start elections. MASTER: [InstanceId: {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}, InternalTcp: 10.243.0.211:1112, InternalSecureTcp: , ExternalTcp: 0.0.0.0:1113, ExternalSecureTcp: , InternalHttp: 10.243.0.211:2112, ExternalHttp: 0.0.0.0:2113].
GOSSIP:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.832
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.037
[00001,11,20:09:36.038] ELECTIONS: STARTING ELECTIONS.
[00001,11,20:09:36.038] ELECTIONS: (V=65) SHIFT TO LEADER ELECTION.
[00001,11,20:09:36.038] ELECTIONS: (V=65) VIEWCHANGE FROM [10.243.0.211:2112, {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}].
[00001,11,20:09:36.042] There are FEW MASTERS according to gossip, need to start elections. MASTER: [InstanceId: {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}, InternalTcp: 10.243.0.211:1112, InternalSecureTcp: , ExternalTcp: 0.0.0.0:1113, ExternalSecureTcp: , InternalHttp: 10.243.0.211:2112, ExternalHttp: 0.0.0.0:2113].
GOSSIP:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:36.038
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.038
[00001,11,20:09:36.056] CLUSTER HAS CHANGED (gossip received from [10.243.8.160:2112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:36.038
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.038
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:33.831
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.056
--------------------------------------------------------------------------------
[00001,11,20:09:36.060] CLUSTER HAS CHANGED (gossip received from [10.243.4.120:2112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:31.829
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.056
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.265
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:33.262
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.056
--------------------------------------------------------------------------------
[00001,11,20:09:36.060] ELECTIONS: (V=65) VIEWCHANGE FROM [10.243.8.160:2112, {3e976baa-ed36-42dd-bc92-0e35396f1857}].
[00001,11,20:09:36.060] ELECTIONS: (V=65) MAJORITY OF VIEWCHANGE.
[00001,11,20:09:36.060] ELECTIONS: (V=65) SHIFT TO PREPARE PHASE.
[00001,11,20:09:36.060] ELECTIONS: (V=65) PREPARE_OK FROM [10.243.0.211:2112,{6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}](L=184836929759,W=184836930317,C=184836930317,E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883}).
[00001,11,20:09:36.060] There are FEW MASTERS according to gossip, need to start elections. MASTER: [InstanceId: {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}, InternalTcp: 10.243.0.211:1112, InternalSecureTcp: , ExternalTcp: 0.0.0.0:1113, ExternalSecureTcp: , InternalHttp: 10.243.0.211:2112, ExternalHttp: 0.0.0.0:2113].
GOSSIP:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:33.831
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.056
[00001,11,20:09:36.060] There are FEW MASTERS according to gossip, need to start elections. MASTER: [InstanceId: {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}, InternalTcp: 10.243.0.211:1112, InternalSecureTcp: , ExternalTcp: 0.0.0.0:1113, ExternalSecureTcp: , InternalHttp: 10.243.0.211:2112, ExternalHttp: 0.0.0.0:2113].
GOSSIP:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:31.829
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.268
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [Master, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.056
[00001,11,20:09:36.060] ELECTIONS: (V=67) VIEWCHANGE FROM [10.243.4.120:2112, {5230bdf1-2dce-457b-b891-0158f6cde4e4}].
[00001,11,20:09:36.060] ELECTIONS: (V=67) SHIFT TO LEADER ELECTION.
[00001,11,20:09:36.060] ELECTIONS: (V=67) VIEWCHANGE FROM [10.243.0.211:2112, {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}].
[00001,11,20:09:36.061] ELECTIONS: (V=67) MAJORITY OF VIEWCHANGE.
[00001,11,20:09:36.062] ELECTIONS: (V=67) VIEWCHANGE FROM [10.243.8.160:2112, {3e976baa-ed36-42dd-bc92-0e35396f1857}].
[00001,11,20:09:36.064] ELECTIONS: (V=67) PREPARE FROM [10.243.4.120:2112, {5230bdf1-2dce-457b-b891-0158f6cde4e4}].
[00001,11,20:09:36.068] ELECTIONS: (V=67) SHIFT TO REG_NONLEADER.
[00001,11,20:09:36.068] ELECTIONS: (V=67) PROPOSAL FROM [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}] M=[10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}](L=184836537844,W=184836538017,C=184836538017,E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23}). ME=[10.243.0.211:2112,{6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}](L=184836929759,W=184836930317,C=184836930317,E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883}).
[00001,11,20:09:36.068] ELECTIONS: (V=67) ACCEPT FROM [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}] M=[10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}]).
[00001,11,20:09:36.068] ELECTIONS: (V=67) ACCEPT FROM [10.243.0.211:2112,{6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}] M=[10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}]).
[00001,11,20:09:36.068] ELECTIONS: (V=67) DONE. ELECTED MASTER = [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}](L=184836537844,W=184836538017,C=184836538017,E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23}). ME=[10.243.0.211:2112,{6bd8270e-1bd7-4957-bbbd-7e22e02e71c1}](L=184836929759,W=184836930317,C=184836930317,E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883}).
[00001,11,20:09:36.098] ========== [10.243.0.211:2112] PRE-REPLICA STATE, WAITING FOR CHASER TO CATCH UP... MASTER IS [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}]
[00001,11,20:09:36.122] SLOW QUEUE MSG [MainQueue]: ElectionsDone - 52ms. Q: 2/5.
[00001,12,20:09:36.215] SLOW BUS MSG [manager input bus]: BecomePreReplica - 93ms. Handler: ProjectionCoreCoordinator.
[00001,31,20:09:36.223] Connection 'master-normal' ({f3589931-246a-47b3-845e-81e1f09731db}) to [10.243.4.120:1112] established.
[00001,12,20:09:36.230] SLOW QUEUE MSG [Projections Master]: BecomePreReplica - 108ms. Q: 0/1.
[00001,11,20:09:36.266] Subscribing at LogPosition: 184836930317 (0x2B0923BB0D) to MASTER [10.243.4.120:1112, {5230bdf1-2dce-457b-b891-0158f6cde4e4}] as replica with SubscriptionId: {d9283127-563b-4999-95ce-f6e2c3cfe872}, ConnectionId: {f3589931-246a-47b3-845e-81e1f09731db}, LocalEndPoint: [], Epochs:
E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883}
E44@184732660594:{8886e148-3161-430b-9632-f285e4ef2800}
E43@173017774736:{c33c7850-0506-418c-8fd3-a22ff61d14c7}
E42@164480837915:{350888f3-95ad-4bc4-88e0-9c4747468285}
E41@164480798584:{537c613d-12d3-4132-9e96-8f930f410840}
E40@164472846530:{900177f2-db66-45fc-b6b3-60b09b12ccb6}
E39@164472708848:{73a9355a-993e-41ac-bb45-40f7d67bc9c5}
E38@164472674145:{28af975d-9405-4c1f-b3a8-cef23e3a8995}
E37@164472600968:{0b94b04a-f69d-41dc-9f43-09dad835a791}
E36@164472599709:{c067ef54-0feb-4e9e-a476-7e04b42198e5}
E35@164472598718:{ace4e7cd-d1ec-4d9e-bd33-da8d1e9b3a02}
E34@143940134078:{bc19d4be-6705-45a9-8da3-bf2ea9a91b40}
E33@143933691512:{058dde06-251f-42eb-9651-3b880e8c2c88}
E32@143933673735:{c010de8b-74bc-4349-98d3-384f1df87ac1}
E31@123952434563:{d84ac9bf-df28-4c7f-94e6-fe6525cdec58}
E30@123400408367:{e673f45b-786c-4b75-9b17-c5ad1bf8377d}
E29@102863863210:{06a09761-7004-459a-81e8-5b765a1d0702}
E28@83960942217:{22fa25cb-5347-4a00-ad3c-99df0f2f9354}
E27@82898643972:{6ececb4c-a855-48f3-99ff-ffb6318702e6}
E26@82885807411:{a9e9c5cf-2e6a-429c-9bc0-0845d05ea782}...
.
[00001,11,20:09:36.266] SLOW BUS MSG [MainBus]: SubscribeToMaster - 143ms. Handler: ReplicaService.
[00001,11,20:09:36.266] SLOW QUEUE MSG [MainQueue]: ChaserCaughtUp - 144ms. Q: 0/6.
[00001,11,20:09:36.273] CLUSTER HAS CHANGED (gossip received from [10.243.4.120:2112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184833770652/184836185263/184836185263/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:32.265
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Slave, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184833770652/184836536655/184836536655/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.269
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [PreReplica, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.122
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.272
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538249/184836538249/E47@184836538017:{d220fe07-5c4f-4ecb-b38a-fe39a2905017} | 2015-06-12 20:09:36.270
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [PreReplica, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.273
--------------------------------------------------------------------------------
[00001,14,20:09:36.276] _projections is not empty after all the projections have been killed
[00001,13,20:09:36.276] _projections is not empty after all the projections have been killed
[00001,15,20:09:36.276] _projections is not empty after all the projections have been killed
[00001,13,20:09:36.276] SLOW BUS MSG [bus]: StopCore - 60ms. Handler: ProjectionCoreService.
[00001,15,20:09:36.276] SLOW BUS MSG [bus]: StopCore - 60ms. Handler: ProjectionCoreService.
[00001,14,20:09:36.276] SLOW BUS MSG [bus]: StopCore - 60ms. Handler: ProjectionCoreService.
[00001,13,20:09:36.277] SLOW QUEUE MSG [Projection Core #0]: StopCore - 61ms. Q: 1/7.
[00001,14,20:09:36.277] SLOW QUEUE MSG [Projection Core #1]: StopCore - 61ms. Q: 1/5.
[00001,15,20:09:36.277] SLOW QUEUE MSG [Projection Core #2]: StopCore - 61ms. Q: 1/3.
[00001,13,20:09:36.277] _subscriptions is not empty after all the projections have been killed
[00001,15,20:09:36.277] _eventReaders is not empty after all the projections have been killed
[00001,14,20:09:36.277] _eventReaders is not empty after all the projections have been killed
[00001,13,20:09:36.277] _eventReaders is not empty after all the projections have been killed
[00001,13,20:09:36.277] _subscriptionEventReaders is not empty after all the projections have been killed
[00001,11,20:09:36.279] ========== [10.243.0.211:2112] IS CATCHING UP!!! BANZAI!!! MASTER IS [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}]
[00001,11,20:09:36.279] ========== [10.243.0.211:2112] CLONE ASSIGNMENT RECEIVED FROM [10.243.4.120:1112,n/a,{5230bdf1-2dce-457b-b891-0158f6cde4e4}].
[00001,11,20:09:36.280] ========== [10.243.0.211:2112] IS CLONE!!! SPARTA!!! MASTER IS [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}]
[00001,11,20:09:36.280] ========== [10.243.0.211:2112] SLAVE ASSIGNMENT RECEIVED FROM [10.243.4.120:1112,n/a,{5230bdf1-2dce-457b-b891-0158f6cde4e4}].
[00001,11,20:09:36.280] ========== [10.243.0.211:2112] IS SLAVE!!! SPARTA!!! MASTER IS [10.243.4.120:2112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}]
[00001,26,20:09:36.280] Loading PTable '7ab183a5-522b-4bf6-94df-632d3780d8c8' (32007056 entries, cache depth 16) done in 00:00:00.2520366.
[00001,26,20:09:36.283] There are now 1 awaiting tables.
[00001,13,20:09:36.286] Error while processing message EventStore.Projections.Core.Messages.ReaderSubscriptionManagement+Unsubscribe in queued handler 'Projection Core #0'.

The given key was not present in the dictionary.
[00001,08,20:09:36.294] === SUBSCRIBED to [10.243.4.120:1112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}] at 184836536655 (0x2B091DB94F). SubscriptionId: {d9283127-563b-4999-95ce-f6e2c3cfe872}.
[00001,08,20:09:36.296] Master [10.243.4.120:1112,{5230bdf1-2dce-457b-b891-0158f6cde4e4}] subscribed us at 184836536655 (0x2B091DB94F), which is less than our writer checkpoint 184836930317 (0x2B0923BB0D). TRUNCATION IS NEEDED!
[00001,08,20:09:36.296] OFFLINE TRUNCATION IS NEEDED (SubscribedAt 184836536655 (0x2B091DB94F) <= LastCommitPosition 184836929759 (0x2B0923B8DF)). SHUTTING DOWN NODE.
[00001,24,20:09:36.297] ES TcpConnection closed [20:09:36.297: N52.7.113.16:59891, L10.243.0.211:1113, {a14bd876-d5a8-48ee-a278-1a97a270cc30}]:
Received bytes: 2868990, Sent bytes: 94
Send calls: 3, callbacks: 3
Receive calls: 352, callbacks: 352
Close reason: [ConnectionReset] Socket receive error

[00001,24,20:09:36.297] Connection 'external-normal' [52.7.113.16:59891, {a14bd876-d5a8-48ee-a278-1a97a270cc30}] closed: ConnectionReset.
[00001,11,20:09:36.297] ========== [10.243.0.211:2112] IS SHUTTING DOWN!!! FAREWELL, WORLD...
[00001,21,20:09:36.328] External TCP connection accepted: [Normal, 52.7.113.16:59894, L10.243.0.211:1113, {6f9397c5-1673-489e-bdb6-8071bed5d117}].
[00001,08,20:09:36.329] Blocking message DataChunkBulk in StorageWriterService. Message:
DataChunkBulk message: MasterId: 5230bdf1-2dce-457b-b891-0158f6cde4e4, SubscriptionId: d9283127-563b-4999-95ce-f6e2c3cfe872, ChunkStartNumber: 688, ChunkEndNumber: 688, SubscriptionPosition: 184836536655, DataBytes length: 1024, CompleteChunk: False
[00001,26,20:09:36.447] Awaiting tables queue size is: 1.
[00001,11,20:09:36.451] SLOW BUS MSG [MainBus]: BecomeShuttingDown - 121ms. Handler: StorageReaderService.
[00001,11,20:09:36.455] Closing connection 'master-normal' [10.243.4.120:1112, L10.243.0.211:50074, {f3589931-246a-47b3-845e-81e1f09731db}] cleanly. Reason: Node state changed to ShuttingDown. Closing replication connection.
[00001,11,20:09:36.455] ES TcpConnection closed [20:09:36.455: N10.243.4.120:1112, L10.243.0.211:50074, {f3589931-246a-47b3-845e-81e1f09731db}]:
Received bytes: 3361, Sent bytes: 1355
Send calls: 3, callbacks: 3
Receive calls: 11, callbacks: 10
Close reason: [Success] Node state changed to ShuttingDown. Closing replication connection.

[00001,11,20:09:36.455] Connection 'master-normal' [10.243.4.120:1112, {f3589931-246a-47b3-845e-81e1f09731db}] closed: Success.
[00001,11,20:09:36.455] SLOW QUEUE MSG [MainQueue]: RequestShutdown - 157ms. Q: 0/18.
[00001,13,20:09:36.455] _eventReaders is not empty after all the projections have been killed
[00001,11,20:09:36.457] ========== [10.243.0.211:2112] Service 'StorageWriter' has shut down.
[00001,11,20:09:36.457] ========== [10.243.0.211:2112] Service 'StorageReader' has shut down.
[00001,11,20:09:36.457] ========== [10.243.0.211:2112] Service 'HttpServer [http://*:2113/]' has shut down.
[00001,11,20:09:36.457] ========== [10.243.0.211:2112] Service 'HttpServer [http://10.243.0.211:2112/]' has shut down.
[00001,11,20:09:36.457] Looks like node [10.243.4.120:1112] is DEAD (TCP connection lost).
[00001,11,20:09:36.457] CLUSTER HAS CHANGED (TCP connection lost to [10.243.4.120:1112])
Old:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.272
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <LIVE> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538249/184836538249/E47@184836538017:{d220fe07-5c4f-4ecb-b38a-fe39a2905017} | 2015-06-12 20:09:36.270
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [ShuttingDown, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.455
New:
VND {3e976baa-ed36-42dd-bc92-0e35396f1857} <LIVE> [Slave, 10.243.8.160:1112, n/a, 0.0.0.0:1113, n/a, 10.243.8.160:2112, 0.0.0.0:2113] 184836537844/184836538017/184836538017/E46@184836536655:{71d62cd9-9b9b-450a-acb7-7f8536503c23} | 2015-06-12 20:09:35.272
VND {5230bdf1-2dce-457b-b891-0158f6cde4e4} <DEAD> [Master, 10.243.4.120:1112, n/a, 0.0.0.0:1113, n/a, 10.243.4.120:2112, 0.0.0.0:2113] 184836537844/184836538249/184836538249/E47@184836538017:{d220fe07-5c4f-4ecb-b38a-fe39a2905017} | 2015-06-12 20:09:36.457
VND {6bd8270e-1bd7-4957-bbbd-7e22e02e71c1} <LIVE> [ShuttingDown, 10.243.0.211:1112, n/a, 0.0.0.0:1113, n/a, 10.243.0.211:2112, 0.0.0.0:2113] 184836929759/184836930317/184836930317/E45@184742138273:{744ac94c-bfc2-4564-93cb-91d3fb408883} | 2015-06-12 20:09:36.455
--------------------------------------------------------------------------------
[00001,11,20:09:36.457] ========== [10.243.0.211:2112] Service 'Master Replication Service' has shut down.
Exiting with exit code: 0.
Exit reason: Shutdown with exiting from process was requested.
[00001,11,20:09:36.573] ========== [10.243.0.211:2112] Service 'Storage Chaser' has shut down.
[00001,11,20:09:36.573] ========== [10.243.0.211:2112] All Services Shutdown.
[00001,11,20:09:36.590] ========== [10.243.0.211:2112] IS SHUT DOWN!!! SWEET DREAMS!!!
[00001,14,20:09:36.600] SLOW BUS MSG [bus]: Faulted - 321ms. Handler: ProjectionCoreResponseWriter.
[00001,15,20:09:36.600] SLOW BUS MSG [bus]: Faulted - 322ms. Handler: ProjectionCoreResponseWriter.
[00001,14,20:09:36.600] SLOW QUEUE MSG [Projection Core #1]: Faulted - 321ms. Q: 3/8.
[00001,15,20:09:36.600] SLOW QUEUE MSG [Projection Core #2]: Faulted - 322ms. Q: 1/6.
[00001,14,20:09:36.600] _eventReaders is not empty after all the projections have been killed
[00001,15,20:09:36.600] _eventReaders is not empty after all the projections have been killed
[00001,11,20:09:36.600] Exiting with exit code: 0.
Exit reason: Shutdown with exiting from process was requested.

Greg Young

Jun 13, 2015, 9:05:34 AM
to event...@googlegroups.com
This is not a crash - just reading the log explains what happened: truncation is needed. This will always happen to a master that has been deposed.

I see, however, a "very slow message" that took 4 seconds to process!

Sam Kanai

Jun 13, 2015, 10:01:44 AM
to event...@googlegroups.com

Sure, we understand that truncation is happening. But do you have any idea why it is so frequent? This is a direct blocker to adoption, and we need to solve it, mitigate it, or else give up on GES.

We have multiple engineers on this, so if you have any suggestions, we'll do the legwork.  But we are stuck right now.


Greg Young

Jun 13, 2015, 10:37:40 AM
to event...@googlegroups.com
See my second comment: there are log messages showing a simple message taking 4 seconds to process! That would trip timeouts between nodes. As a side note, I'd be curious how many of these issues you see running on a local node. With virtualization there are countless things that could have happened (e.g. a shared node where something else kicked off, etc.).

James Nugent

Jun 13, 2015, 2:27:31 PM
to event...@googlegroups.com
It's frequent because you're seeing frequent elections, likely due to all the timeouts.

What are your config settings?

Sam Kanai

Jun 15, 2015, 4:03:07 PM
to event...@googlegroups.com
@Greg - It's always possible that virtualization is a factor.  Alas, none of the AWS metrics show anything unusual.  Can you give us an idea of the (kernel) resources accessed during simple message processing, so that we can dig deeper?

When we ran clusters on non-virtual hardware, we also saw very frequent elections.  In our experience, network congestion really does a number on the gossip traffic.  However, on these systems, the elections never caused suicides.  It's not quite apples to apples, since the non-virtual nodes were running Windows.  We'll start building a non-virtual Linux cluster.

@James - I'll work on getting those settings.

-Sam

Greg Young

Jun 15, 2015, 4:54:34 PM
to event...@googlegroups.com
"
When we ran clusters on non-virtual hardware, we also saw very
frequent elections. In our experience, network congestion really does
a number on the gossip traffic. However, on these systems, the
elections never caused suicides. It's not quite apples to apples,
since the non-virtual nodes were running Windows. We'll start
building a non-virtual Linux cluster."

It's dumb luck.

If you lose a master, it will deterministically produce the "crash" you see, every time.

Greg Young

Jun 15, 2015, 5:33:33 PM
to event...@googlegroups.com
We are looking at another possible root cause involving GC.

Can you print out what is printed with --what-if when you run it?

Ben Salem

Jun 15, 2015, 5:49:44 PM
to event...@googlegroups.com
What-if gives me the following:


Exit reason: WhatIf option specified
[00001,01,21:48:32.159]
ES VERSION:               3.0.3.0 (master/a1382252dd1ed0554ddb04015cdb2cbc1b0a65c1, Fri, 13 Mar 2015 13:56:05 -0400)
OS:                       Unknown (Unix 3.19.3.0)
RUNTIME:                  3.12.0 (tarball) (64-bit)
GC:                       2 GENERATIONS
LOGS:                     /event-store-data/log

HELP:                     False (<DEFAULT>)
VERSION:                  False (<DEFAULT>)
LOG:                      /event-store-data/log (Command Line)
CONFIG:                   <empty> (<DEFAULT>)
DEFINES:                  <empty> (<DEFAULT>)
WHAT IF:                  True (Command Line)
INT IP:                   127.0.0.1 (<DEFAULT>)
EXT IP:                   10.243.0.211 (Command Line)
INT HTTP PORT:            2112 (<DEFAULT>)
EXT HTTP PORT:            2113 (<DEFAULT>)
INT TCP PORT:             1112 (<DEFAULT>)
INT SECURE TCP PORT:      0 (<DEFAULT>)
EXT TCP PORT:             1113 (Command Line)
EXT SECURE TCP PORT:      0 (<DEFAULT>)
INT TCP HEARTBEAT TIMEOUT: 700 (<DEFAULT>)
EXT TCP HEARTBEAT TIMEOUT: 1000 (<DEFAULT>)
INT TCP HEARTBEAT INTERVAL: 700 (<DEFAULT>)
EXT TCP HEARTBEAT INTERVAL: 2000 (<DEFAULT>)
FORCE:                    False (<DEFAULT>)
CLUSTER SIZE:             3 (Command Line)
NODE PRIORITY:            0 (<DEFAULT>)
MIN FLUSH DELAY MS:       2 (<DEFAULT>)
COMMIT COUNT:             -1 (<DEFAULT>)
PREPARE COUNT:            -1 (<DEFAULT>)
ADMIN ON EXT:             True (<DEFAULT>)
STATS ON EXT:             True (<DEFAULT>)
GOSSIP ON EXT:            True (<DEFAULT>)
DISABLE SCAVENGE MERGING: False (<DEFAULT>)
DISCOVER VIA DNS:         False (Command Line)
CLUSTER DNS:              fake.dns (<DEFAULT>)
CLUSTER GOSSIP PORT:      30777 (<DEFAULT>)
GOSSIP SEED:              [ 10.243.4.120:2112, 10.243.8.160:2112 ] (Command Line)
STATS PERIOD SEC:         30 (<DEFAULT>)
CACHED CHUNKS:            -1 (<DEFAULT>)
CHUNKS CACHE SIZE:        536871424 (<DEFAULT>)
MAX MEM TABLE SIZE:       1000000 (<DEFAULT>)
DB:                       /event-store-data/db (Command Line)
MEM DB:                   False (<DEFAULT>)
SKIP DB VERIFY:           False (<DEFAULT>)
RUN PROJECTIONS:          All (Command Line)
PROJECTION THREADS:       3 (<DEFAULT>)
WORKER THREADS:           5 (<DEFAULT>)
HTTP PREFIXES:            [ http://*:2113/ ] (Command Line)
ENABLE TRUSTED AUTH:      False (<DEFAULT>)
CERTIFICATE STORE LOCATION: <empty> (<DEFAULT>)
CERTIFICATE STORE NAME:   <empty> (<DEFAULT>)
CERTIFICATE SUBJECT NAME: <empty> (<DEFAULT>)
CERTIFICATE THUMBPRINT:   <empty> (<DEFAULT>)
CERTIFICATE FILE:         <empty> (<DEFAULT>)
CERTIFICATE PASSWORD:     <empty> (<DEFAULT>)
USE INTERNAL SSL:         False (<DEFAULT>)
SSL TARGET HOST:          n/a (<DEFAULT>)
SSL VALIDATE SERVER:      True (<DEFAULT>)
AUTHENTICATION TYPE:      internal (<DEFAULT>)
PREPARE TIMEOUT MS:       2000 (<DEFAULT>)
COMMIT TIMEOUT MS:        2000 (<DEFAULT>)
UNSAFE DISABLE FLUSH TO DISK: False (<DEFAULT>)
GOSSIP INTERVAL MS:       1000 (<DEFAULT>)
GOSSIP ALLOWED DIFFERENCE MS: 60000 (<DEFAULT>)
GOSSIP TIMEOUT MS:        500 (<DEFAULT>)

[00001,01,21:48:32.176] Exiting with exit code: 0.
Exit reason: WhatIf option specified

Greg Young

Jun 15, 2015, 6:27:41 PM
to event...@googlegroups.com
Also, re: "When we ran clusters on non-virtual hardware, we also saw
very frequent elections" - I would have to see logs to have any comment.


" Can you give us an idea of the (kernel) resources accessed during
simple message processing, so that we can dig deeper?"

Depends on the message. Some of the ones above would appear to just be
CPU so I would watch gc pauses etc as a start.

Greg Young

Jun 16, 2015, 1:59:59 AM
to event...@googlegroups.com
BTW, you have internal timeouts set to 500/700 ms. That really isn't much headroom, especially under load in the cloud, and you will likely get timeouts. These are listed in the --what-if output above as the gossip and internal heartbeat timeouts.
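
For example (flag names assumed from the option names in the --what-if output - verify with --help):

./clusternode --int-tcp-heartbeat-timeout=2000 --gossip-timeout-ms=2000 <other options>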

Sam Kanai

Jun 16, 2015, 11:02:14 AM
to event...@googlegroups.com
Apologies - it looks like the changes were reverted after Tao's original run. We're correcting that now. We had never changed the internal timeout before. We'll bump both timeouts to 2000 ms and re-run.

As for your GC suspicion, if there is any data we can gather, let us know. Also, we can arrange access to our test cluster in AWS, if that would help.

-Sam

Sam Kanai

Jun 17, 2015, 3:38:33 PM
to event...@googlegroups.com
Update: we have gotten better runs, for sure - many hours under constant load. However, when a network partition does exceed the timeouts, we still get the suicides. I think we understand about 90% of this now, but there are some edge cases we have questions about, and I want to gather our thoughts before we ask. In the meantime, I think James said the behavior is deterministic given a precise outage pattern? Do you have docs on those behaviors?

Greg Young

Jun 17, 2015, 3:46:22 PM
to event...@googlegroups.com
The situation is this (I will explain with network cables, if that's
ok). It is a well-known (academic) edge condition known as a
"deposed master".

So bring up a cluster with three nodes: A, B, C.

The cluster decides: A master, B slave, C slave.

Start a client writing, say, 500 messages/second to the cluster.

At this point, pull the network cable on A to introduce a partition
(master in the minority). The other two nodes will form their own
cluster; let's imagine C becomes master (it's non-deterministic which
one becomes the new master). Wait about one minute, then plug the
network cable back in on A.

When A comes back up, he has some writes that were never acknowledged
by B or C (i.e. his log doesn't match). At this point he will say "I
need to go down for truncation". We could do this online, but it's
very, very complicated, while hara-kiri is quite simple here :)


HTH

Greg

Sam Kanai

Jun 17, 2015, 5:31:38 PM
to event...@googlegroups.com
Yep, that's the 90% part we get.

One edge case we saw: BOTH slaves suicided when the old master became visible again. This downs the cluster, and then the master falls back to an "unknown" state.

The outage was 2.8 seconds, and the timeouts were 2 seconds. We expected that the old master would "self-demote" because it was past the gossip window. We thought it would be the one that suicided, but the other boxes did instead. Any idea why?

(We can also send you logs from all three boxes, if that would help)

-Sam

Greg Young

Jun 17, 2015, 5:46:28 PM
to event...@googlegroups.com
I'm not entirely clear on how the slaves suicided (to be fair, we run
24/365 randomized testing on this). Logs would be good here.

Thanks!

Greg

Sam Kanai

Jun 17, 2015, 5:54:49 PM
to event...@googlegroups.com
Not a problem.  I can send them today.  Where shall I send them?

-Sam

Greg Young

Jun 17, 2015, 6:11:31 PM
to event...@googlegroups.com
Just send to my address.