Greetings devs,
I am experimenting with QFS. I have a simple setup running and am now trying to add VR (viewstamped replication). I followed the guide and was able to add the nodes and activate them. However, I now get the following output when I query the status after sending the activate-nodes command:
vr.nodeId 0
vr.status -1008
vr.active 1
vr.state view_change
vr.primaryId -1
vr.epoch 4
vr.view 1
vr.log 3 1 8
vr.commit 3 1 8
vr.lastViewEnd 3 1 8
vr.quorum 2
vr.ignoreInvalidVrState 0
....
I am not sure what vr.status -1008 means, but I am guessing it isn't good. Error messages from the logs and the relevant config lines are below. It appears that node 0 is not able to maintain a connection with the log receivers on nodes 1 and 2. I have tried telneting directly from node 0 to nodes 1 and 2 on port 30200, and I am able to connect and enter text, which shows up in the logs on nodes 1 and 2, so I know the network connections themselves work. I also have a round-robin DNS record that resolves to 192.168.1.28, 192.168.1.29, and 192.168.1.30, and that name is what I used when I added and activated the nodes with qfsadmin.
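For reference, the connectivity and DNS checks described above were along these lines (metaservers.ourdomain.com being the round-robin record):

```shell
# Confirm the round-robin record resolves to all three metaserver addresses
host metaservers.ourdomain.com

# From node 0, confirm the log receiver port on each backup accepts connections
# (typed text shows up in the logs on nodes 1 and 2)
telnet 192.168.1.30 30200   # node 1
telnet 192.168.1.29 30200   # node 2
```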
I have spent many hours looking through the docs and source code before posting this message, and I would be very grateful for any ideas that might help me debug the error messages in the logs below and get nodes 1 and 2 talking to node 0 properly.
Node 0 is 192.168.1.28, Node 1 is 192.168.1.30, Node 2 is 192.168.1.29.
Add nodes commands:
qfsadmin -s metaservers.ourdomain.com -p 20000 -F op-type=add-node -F node-id=0 -F arg-count=1 -F args='192.168.1.28 30200' vr_reconfiguration
qfsadmin -s metaservers.ourdomain.com -p 20000 -F op-type=add-node -F node-id=1 -F arg-count=1 -F args='192.168.1.30 30200' vr_reconfiguration
qfsadmin -s metaservers.ourdomain.com -p 20000 -F op-type=add-node -F node-id=2 -F arg-count=1 -F args='192.168.1.29 30200' vr_reconfiguration
Activate nodes command:
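I am reconstructing the exact activation invocation from memory; based on the VR guide it was of this form (op-type=activate-nodes taken from the docs):

```shell
qfsadmin -s metaservers.ourdomain.com -p 20000 -F op-type=activate-nodes vr_reconfiguration
```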
Relevant lines from config files for metaservers:
metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.clusterKey = qfs-filesystem
metaServer.fileSystemIdRequired = 1
metaServer.vr.metaDataStoreLocation = 192.168.1.28 20000 [appropriate values for other nodes]
metaServer.metaDataSync.fileSystemId = [scrubbed]
metaServer.metaDataSync.servers = 192.168.1.28 20000 192.168.1.29 20000 192.168.1.30 20000
metaServer.metaDataSync.fetchOnRestartFileName = metadatafetch
metaServer.log.receiver.listenOn = 0.0.0.0 30200
The clocks on the servers are within a couple of seconds of each other, but are not precisely synced.
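In case the skew turns out to be relevant, this is roughly how I checked the offset on each node (assuming ntpd or chrony is running there; adjust for whichever daemon is installed):

```shell
# With ntpd: summary of sync state and estimated offset
ntpstat

# With chrony: detailed tracking report including "System time" offset
chronyc tracking
```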
I am receiving the following errors repeatedly on node 0 in the logs (details scrubbed):
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 INFO - (MetaVrSM.cc:4480) start view change: restart, node: 0 prior state: view_change initiated: 2368 sec. ago vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 INFO - (MetaVrSM.cc:2439) +seq: 3019535703744487120 status: 0 state: view_change DVC: 0 active: 1 epoch: 4 view: 1 last: 3 1 8 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 INFO - (MetaVrSM.cc:2439) =seq: 3019535703744487120 status: 0 state: view_change DVC: 0 active: 1 epoch: 4 view: 1 last: 3 1 8 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 ERROR - (MetaVrSM.cc:581) start view change: node inactive status: -2 node: 2 / 0 view: 2 1 / 4 1
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 ERROR - (MetaVrSM.cc:581) start view change: node inactive status: -2 node: 1 / 0 view: 2 1 / 4 1
Nov 29 13:34:17 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (LogTransmitter.cc:1564) 192.168.1.29 30200: connection timed out socket error: vr: seq: -1 op: null hello seq: -1 pending: blocks: 0 bytes: 0
Nov 29 13:34:18 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (LogTransmitter.cc:1564) 192.168.1.30 30200: connection timed out socket error: vr: seq: -1 op: null hello seq: -1 pending: blocks: 0 bytes: 0
Nov 29 13:34:18 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
Nov 29 13:34:18 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
Nov 29 13:34:20 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (LogTransmitter.cc:1564) 192.168.1.29 30200: connection timed out socket error: vr: seq: -1 op: null hello seq: -1 pending: blocks: 0 bytes: 0
Nov 29 13:34:21 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (LogTransmitter.cc:1564) 192.168.1.30 30200: connection timed out socket error: vr: seq: -1 op: null hello seq: -1 pending: blocks: 0 bytes: 0
Nov 29 13:34:21 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
Nov 29 13:34:21 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
And these kinds of errors on nodes 1 and 2 (details scrubbed):
Nov 29 13:34:12 node1_server metaserver[3738]: 11-29-2017 13:34:11.213 INFO - (MetaVrSM.cc:2439) seq: 231641259406251669 status: 0 state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-hello node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:14 node1_server metaserver[3738]: 11-29-2017 13:34:14.218 INFO - (MetaVrSM.cc:2439) +seq: 231641259406251670 status: 0 state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node1_server metaserver[3738]: 11-29-2017 13:34:14.218 ERROR - (MetaVrSM.cc:2439) =seq: 231641259406251670 status: -2 node inactive state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:17 node1_server metaserver[3738]: 11-29-2017 13:34:17.222 ERROR - (LogReceiver.cc:1176) 192.168.1.28 43328 id: 2 error: network error closing connection last block end: 3 1 8 socket error:
Nov 29 13:34:18 node1_server metaserver[3738]: 11-29-2017 13:34:17.223 INFO - (MetaVrSM.cc:2439) seq: 231641259406251671 status: 0 state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-hello node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:20 node1_server metaserver[3738]: 11-29-2017 13:34:20.228 ERROR - (LogReceiver.cc:1176) 192.168.1.28 43340 id: 2 error: network error closing connection last block end: 3 1 8 socket error:
Nov 29 13:34:21 node1_server metaserver[3738]: 11-29-2017 13:34:20.228 INFO - (MetaVrSM.cc:2439) seq: 231641259406251672 status: 0 state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-hello node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0