Advice on correct 2-instance setup is required


Dennis O

May 17, 2016, 12:52:58 AM
to Neo4j
Hi,

Please advise on the required configuration for a 2-instance "HA" setup (AWS, Neo4j Enterprise 3.0.1).

Currently I have on both instances:
- dbms.mode=HA
- ha.initial_hosts=172.31.35.147:5001,172.31.33.173:5001
- ha.host.coordination is commented out
- ha.host.data is commented out
Ports 5001, 5002, 7474, and 6001 are open on both.
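(For anyone reproducing this on AWS: ports like these would typically be opened in a security group shared by the instances, roughly like this with the AWS CLI; the group ID below is a placeholder, and traffic is allowed only from the group itself.)

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 5001 --source-group sg-0123456789abcdef0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 6001 --source-group sg-0123456789abcdef0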

Differences:
1. One node (172.31.33.173) has ha.server_id=1, the other has ha.server_id=2.
2. The node with id=1 runs Debian 8.4, the one with id=2 runs CentOS 7.


With this setup, the node with id=1 starts without problems and is elected master; the second one, however, fails.

Some log extracts:


2016-05-17 04:50:51.781+0000 INFO  [o.n.k.h.MasterClient214] MasterClient214 communication channel created towards /127.0.0.1:6001
2016-05-17 04:50:51.790+0000 INFO  [o.n.k.h.c.SwitchToSlave] Copying store from master
2016-05-17 04:50:51.791+0000 INFO  [o.n.k.h.MasterClient214] Thread[31, HA Mode switcher-1] Trying to open a new channel from /172.31.35.147:0 to /127.0.0.1:6001
2016-05-17 04:50:51.791+0000 DEBUG [o.n.k.h.MasterClient214] MasterClient214 could not connect from /172.31.35.147:0 to /127.0.0.1:6001
2016-05-17 04:50:51.796+0000 INFO  [o.n.k.h.MasterClient214] MasterClient214[/127.0.0.1:6001] shutdown
2016-05-17 04:50:51.796+0000 ERROR [o.n.k.h.c.m.HighAvailabilityModeSwitcher] Error while trying to switch to slave MasterClient214 could not connect from /172.31.35.147:0 to /127.0.0.1:6001
org.neo4j.com.ComException: MasterClient214 could not connect from /172.31.35.147:0 to /127.0.0.1:6001
at org.neo4j.com.Client$2.create(Client.java:225)
at org.neo4j.com.Client$2.create(Client.java:202)
at org.neo4j.com.ResourcePool.acquire(ResourcePool.java:177)
at org.neo4j.com.Client.acquireChannelContext(Client.java:390)
at org.neo4j.com.Client.sendRequest(Client.java:296)
at org.neo4j.com.Client.sendRequest(Client.java:289)
at org.neo4j.kernel.ha.MasterClient210.copyStore(MasterClient210.java:311)
at org.neo4j.kernel.ha.cluster.SwitchToSlave$1.copyStore(SwitchToSlave.java:531)
at org.neo4j.com.storecopy.StoreCopyClient.copyStore(StoreCopyClient.java:191)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.copyStoreFromMaster(SwitchToSlave.java:525)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.copyStoreFromMasterIfNeeded(SwitchToSlave.java:348)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.switchToSlave(SwitchToSlave.java:272)
at org.neo4j.kernel.ha.cluster.modeswitch.HighAvailabilityModeSwitcher$1.run(HighAvailabilityModeSwitcher.java:348)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:104)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:148)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:104)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:78)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:41)
... 4 more
2016-05-17 04:50:51.797+0000 INFO  [o.n.k.h.c.m.HighAvailabilityModeSwitcher] Attempting to switch to slave in 7s
2016-05-17 04:50:58.799+0000 INFO  [o.n.k.i.f.CommunityFacadeFactory] No locking implementation specified, defaulting to 'forseti'
2016-05-17 04:50:58.799+0000 INFO  [o.n.k.h.c.SwitchToSlave] ServerId 2, moving to slave for master ha://0.0.0.0:6001?serverId=1



2016-05-17 04:30:57.535+0000 DEBUG [o.n.c.p.c.ClusterState$2] [AsyncLog @ 2016-05-17 04:30:57.534+0000]  ClusterState: discovery-[configurationTimeout]->discovery conversation-id:2/13# payload:ConfigurationTimeoutState{remainingPings=3}
2016-05-17 04:30:57.535+0000 DEBUG [o.n.c.p.h.HeartbeatState$1] [AsyncLog @ 2016-05-17 04:30:57.535+0000]  HeartbeatState: start-[reset_send_heartbeat]->start conversation-id:2/13#
2016-05-17 04:30:57.538+0000 INFO  [o.n.c.c.NetworkSender] [AsyncLog @ 2016-05-17 04:30:57.537+0000]  Attempting to connect from /172.31.35.147:0 to /172.31.33.173:5001
2016-05-17 04:30:57.540+0000 INFO  [o.n.c.c.NetworkSender] [AsyncLog @ 2016-05-17 04:30:57.540+0000]  Failed to connect to /172.31.33.173:5001 due to: java.net.ConnectException: Connection refused
2016-05-17 04:30:57.540+0000 DEBUG [o.n.c.p.c.ClusterState$2] [AsyncLog @ 2016-05-17 04:30:57.540+0000]  ClusterState: discovery-[configurationRequest]->discovery from:cluster://172.31.35.147:5001 conversation-id:2/13# payload:ConfigurationRequestState{joiningId=2, joiningUri=cluster://172.31.35.147:5001}
2016-05-17 04:30:58.420+0000 INFO  [o.n.c.c.NetworkReceiver] [AsyncLog @ 2016-05-17 04:30:58.420+0000]  cluster://172.31.35.147:47188 disconnected from me at cluster://172.31.35.147:5001
2016-05-17 04:30:58.420+0000 INFO  [o.n.c.c.NetworkReceiver] [AsyncLog @ 2016-05-17 04:30:58.420+0000]  cluster://172.31.35.147:47188 disconnected from me at cluster://172.31.35.147:5001
2016-05-17 04:30:58.434+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [1]:  Starting check pointing...
2016-05-17 04:30:58.438+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [1]:  Starting store flush...
2016-05-17 04:30:58.443+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [1]:  Store flush completed
2016-05-17 04:30:58.443+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [1]:  Starting appending check point entry into the tx log...
2016-05-17 04:30:58.447+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [1]:  Appending check point entry into the tx log completed
2016-05-17 04:30:58.447+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [1]:  Check pointing completed
2016-05-17 04:30:58.447+0000 INFO  [o.n.k.i.t.l.p.LogPruningImpl] Log Rotation [0]:  Starting log pruning.
2016-05-17 04:30:58.447+0000 INFO  [o.n.k.i.t.l.p.LogPruningImpl] Log Rotation [0]:  Log pruning complete.
2016-05-17 04:30:58.475+0000 INFO  [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics START ---
2016-05-17 04:30:58.475+0000 INFO  [o.n.k.i.DiagnosticsManager] High Availability diagnostics
Member state:PENDING
State machines:
   AtomicBroadcastMessage:start
   AcceptorMessage:start
   ProposerMessage:start
   LearnerMessage:start
   HeartbeatMessage:start
   ElectionMessage:start
   SnapshotMessage:start
   ClusterMessage:discovery
Current timeouts:
join:configurationTimeout{conversation-id=2/13#, timeout-count=29, created-by=2}
2016-05-17 04:30:58.475+0000 INFO  [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics END ---
2016-05-17 04:30:58.475+0000 INFO  [o.n.k.h.f.HighlyAvailableFacadeFactory] Shutdown started

etc. 
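In case it's useful for diagnosing the "Connection refused" above, a quick sanity check on the instances (using standard ss/nc tooling; nc flags may vary by netcat flavor) would be:

# On each instance: what is actually listening on the cluster ports, and on which interface?
ss -tlnp | grep -E ':(5001|6001)'

# From the other instance: are the peer's ports reachable over the internal network?
nc -vz 172.31.33.173 5001
nc -vz 172.31.33.173 6001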


Any insights are highly appreciated!!

Thank you!
Dennis

Yayati Sule

May 17, 2016, 12:56:40 AM
to ne...@googlegroups.com
Hi Dennis,
I am facing a similar problem, but in my case I have two machines as slaves and one master, and I cannot get them up and running. In your case, maybe you can try to set up a third instance and then start the cluster, as a 3-machine quorum is required for making an HA cluster.

Regards,
Yayati Sule
Associate Data Scientist
Innoplexus Consulting Services Pvt. Ltd.
www.innoplexus.com
Mob : +91-9527459407
Landline: +91-20-66527300


Sukaant Chaudhary

May 17, 2016, 1:12:49 AM
to ne...@googlegroups.com
Hi,
It seems like a configuration issue with setting up multiple instances.
I have a few questions regarding this:
1. How are you planning to make Neo4j scalable on AWS?
2. Which AWS service are you using?
3. Has Amazon started supporting Neo4j?


-Sukaant Chaudhary

Dennis O

May 17, 2016, 11:27:21 AM
to Neo4j
 
Update 

For maximum consistency, I've replaced the previous CentOS instance with a Debian-based one, i.e. I now have two identical Debian instances.

The neo4j service now launches on the second one; however, its debug log is filled with:

2016-05-17 15:25:29.651+0000 ERROR [o.n.k.h.c.m.HighAvailabilityModeSwitcher] Error while trying to switch to slave MasterClient214 could not connect from /172.31.42.94:0 to /127.0.0.1:6001
org.neo4j.com.ComException: MasterClient214 could not connect from /172.31.42.94:0 to /127.0.0.1:6001
at org.neo4j.com.Client$2.create(Client.java:225)
at org.neo4j.com.Client$2.create(Client.java:202)
at org.neo4j.com.ResourcePool.acquire(ResourcePool.java:177)
at org.neo4j.com.Client.acquireChannelContext(Client.java:390)
at org.neo4j.com.Client.sendRequest(Client.java:296)
at org.neo4j.kernel.ha.MasterClient210.handshake(MasterClient210.java:288)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.checkDataConsistencyWithMaster(SwitchToSlave.java:597)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.checkDataConsistency(SwitchToSlave.java:392)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.executeConsistencyChecks(SwitchToSlave.java:376)
at org.neo4j.kernel.ha.cluster.SwitchToSlave.switchToSlave(SwitchToSlave.java:293)
at org.neo4j.kernel.ha.cluster.modeswitch.HighAvailabilityModeSwitcher$1.run(HighAvailabilityModeSwitcher.java:348)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at org.neo4j.helpers.NamedThreadFactory$2.run(NamedThreadFactory.java:104)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:148)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:104)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:78)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:41)
... 4 more
2016-05-17 15:25:29.651+0000 INFO  [o.n.k.h.c.m.HighAvailabilityModeSwitcher] Attempting to switch to slave in 15s

Dennis O

May 17, 2016, 11:37:09 AM
to Neo4j
Hi Yayati,


>> as a 3-machine quorum is required for making an HA cluster

It looks like Neo4j does not require a third machine to function, as follows from "...When running Neo4j in HA mode there is always a single master and zero or more slaves..." at http://neo4j.com/docs/3.0.1/ha-architecture.html

Dennis

Dennis O

May 17, 2016, 11:08:41 PM
to Neo4j
RESOLVED

1. The tutorial ( http://neo4j.com/docs/3.0.1/ha-setup-tutorial.html ) is somewhat misleading. Thanks to Dave from the Neo4j team for the info: you NEED to set
ha.host.coordination
and
ha.host.data

I'm on AWS. In my config these values are set to the internal instance IP, with port 5001 for coordination and port 6001 for data.
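For example, on the instance with internal IP 172.31.33.173 (using the ports above; each instance uses its own internal IP here), the two lines look like:

ha.host.coordination=172.31.33.173:5001
ha.host.data=172.31.33.173:6001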

2. Another part of my problem: the deployment flow should be clarified in the docs.
While testing the setup, my config manager (Ansible) was doing a basic installation, and then I was logging in to each instance and adjusting configs manually.
This means that my instances were each initially launched in SINGLE mode.
Maybe I am wrong, but I got the impression that you cannot just switch to HA mode afterwards.
So I cleanly wiped Neo4j from the servers and updated the Ansible playbooks to prepare and upload the full HA config before the instances launched, so the very first launch read the HA-specific config.
That helped!

I do see some issues in the browser's console, plus there's a bug with Browser Sync not remembering my authentication, but at least I got my servers running.


HA part of my config:



#*****************************************************************
# HA configuration
#*****************************************************************

# Uncomment and specify these lines for running Neo4j in High Availability mode.
# See the High availability setup tutorial for more details on these settings

# Database mode
# Allowed values:
# HA - High Availability
# SINGLE - Single mode, default.
# To run in High Availability mode uncomment this line:
dbms.mode=HA

# ha.server_id is the number of each instance in the HA cluster. It should be
# an integer (e.g. 1), and should be unique for each cluster instance.
ha.server_id={ 1 }

# ha.initial_hosts is a comma-separated list (without spaces) of the host:port
# where the ha.host.coordination of all instances will be listening. Typically
# this will be the same for all cluster instances.
ha.initial_hosts={ Internal IP of server 1 }:5001,{ Internal IP of server 2 }:5001,{ Internal IP of server 3 }:5001

# IP and port for this instance to listen on, for communicating cluster status
# information with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
ha.host.coordination={ Internal IP of server 1 }:5001

# IP and port for this instance to listen on, for communicating transaction
# data with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
ha.host.data={ Internal IP of server 1 }:6001

# The interval at which slaves will pull updates from the master. Comment out
# the option to disable periodic pulling of updates. Unit is seconds.
ha.pull_interval=10

# Amount of slaves the master will try to push a transaction to upon commit
# (default is 1). The master will optimistically continue and not fail the
# transaction even if it fails to reach the push factor. Setting this to 0 will
# increase write performance when writing through master but could potentially
# lead to branched data (or loss of transaction) if the master goes down.
#ha.tx_push_factor=1

# Strategy the master will use when pushing data to slaves (if the push factor
# is greater than 0). There are three options available "fixed_ascending" (default),
# "fixed_descending" or "round_robin". Fixed strategies will start by pushing to
# slaves ordered by server id (accordingly with qualifier) and are useful when
# planning for a stable fail-over based on ids.
#ha.tx_push_strategy=fixed_ascending

# Policy for how to handle branched data.
#ha.branched_data_policy=keep_all

# How often heartbeat messages should be sent. Defaults to ha.default_timeout.
#ha.heartbeat_interval=5s

# Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
#ha.heartbeat_timeout=11s

# If you are using a load-balancer that doesn't support HTTP Auth, you may need to turn off authentication for the
# HA HTTP status endpoint by uncommenting the following line.
#dbms.security.ha_status_auth_enabled=false

# Whether this instance should only participate as slave in cluster. If set to
# true, it will never be elected as master.
#ha.slave_only=false


The config is almost identical between instances; the only differences are:
- ha.server_id,
- the ha.host.coordination host,
- the ha.host.data host
(they carry the IP of the "current" instance).
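For instance, on server 2 the only lines that change would be:

ha.server_id=2
ha.host.coordination={ Internal IP of server 2 }:5001
ha.host.data={ Internal IP of server 2 }:6001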


Hope this helps!
Dennis
