Seeking help debugging connection errors for viewstamped replication

Lloyd W

Nov 30, 2017, 3:26:36 PM11/30/17
to qfs-...@googlegroups.com
Greetings devs,

I am experimenting with QFS. Got a simple setup running, and now trying to add VR. I attempted to follow the guide, and was successful in adding nodes and activating them. However, I am now receiving this output when I query the status after sending the activate-nodes command:
vr.nodeId 0
vr.status -1008
vr.active 1
vr.state view_change
vr.primaryId -1
vr.epoch 4
vr.view 1
vr.log 3 1 8
vr.commit 3 1 8
vr.lastViewEnd 3 1 8
vr.quorum 2
vr.ignoreInvalidVrState 0
....
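
(For reference, the status query was along these lines; I am using the vr_get_status op here, though I may be misremembering the exact op name:)

qfsadmin -s metaservers.ourdomain.com -p 20000 vr_get_status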

I am not sure what vr.status -1008 means, but I guess it isn't good. Log error messages and config info are below. It seems that node 0 is not able to maintain a connection with the log receivers on nodes 1 and 2. I have tried telnetting directly from node 0 to nodes 1 and 2 on port 30200, and I am able to connect and enter text, which shows up in the logs of nodes 1 and 2, so I know the network connections work. I also have a round-robin DNS record set up that points 192.168.1.28, 192.168.1.29, and 192.168.1.30 to a single DNS name, which I used when I added the nodes and activated them with qfsadmin.
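
For reference, the connectivity checks I ran were essentially:

# from node 0; the text typed in shows up in the logs of nodes 1 and 2
telnet 192.168.1.29 30200
telnet 192.168.1.30 30200

# round-robin record used in the qfsadmin commands below
host metaservers.ourdomain.com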

I have spent many hours looking through the docs and source code before posting this message, and I would be very grateful for any ideas that might help me debug the error messages in the logs below and get nodes 1 and 2 talking with node 0 properly.

Node 0 is 192.168.1.28, Node 1 is 192.168.1.30, Node 2 is 192.168.1.29

Add nodes commands:

qfsadmin -s metaservers.ourdomain.com -p 20000  -F op-type=add-node -F node-id=0 -F arg-count=1 -F args='192.168.1.28 30200' vr_reconfiguration
qfsadmin -s metaservers.ourdomain.com -p 20000  -F op-type=add-node -F node-id=1 -F arg-count=1 -F args='192.168.1.30 30200' vr_reconfiguration
qfsadmin -s metaservers.ourdomain.com -p 20000  -F op-type=add-node -F node-id=2 -F arg-count=1 -F args='192.168.1.29 30200' vr_reconfiguration

Activate nodes command:
qfsadmin -s metaservers.ourdomain.com -p 20000  -F op-type=activate-nodes -F arg-count=3 -F args='0 1 2' vr_reconfiguration

Relevant lines from config files for metaservers:

metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.clusterKey = qfs-filesystem
metaServer.fileSystemIdRequired = 1
metaServer.vr.id = 0      [1 and 2 on other nodes]
metaServer.vr.metaDataStoreLocation = 192.168.1.28 20000 [appropriate values for other nodes]
metaServer.metaDataSync.fileSystemId = [scrubbed]
metaServer.metaDataSync.servers = 192.168.1.28 20000 192.168.1.29 20000 192.168.1.30 20000
metaServer.metaDataSync.fetchOnRestartFileName = metadatafetch
metaServer.log.receiver.listenOn = 0.0.0.0 30200
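
On the other nodes the per-node lines differ only in the id and address, e.g. on node 1 (192.168.1.30):

metaServer.vr.id = 1
metaServer.vr.metaDataStoreLocation = 192.168.1.30 20000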

The clocks on these servers are a couple of seconds apart, not precisely synced.

I am receiving the following errors repeatedly on node 0 in the logs (details scrubbed):

Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 INFO - (MetaVrSM.cc:4480) start view change: restart, node: 0 prior state: view_change initiated: 2368 sec. ago vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 INFO - (MetaVrSM.cc:2439) +seq: 3019535703744487120 status: 0  state: view_change DVC: 0 active: 1 epoch: 4 view: 1 last: 3 1 8 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 INFO - (MetaVrSM.cc:2439) =seq: 3019535703744487120 status: 0  state: view_change DVC: 0 active: 1 epoch: 4 view: 1 last: 3 1 8 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 ERROR - (MetaVrSM.cc:581) start view change: node inactive status: -2 node: 2 / 0 view: 2 1 / 4 1
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 ERROR - (MetaVrSM.cc:581) start view change: node inactive status: -2 node: 1 / 0 view: 2 1 / 4 1
Nov 29 13:34:17 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (LogTransmitter.cc:1564) 192.168.1.29 30200: connection timed out socket error:  vr: seq: -1 op: null hello seq: -1 pending:  blocks: 0 bytes: 0
Nov 29 13:34:18 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (LogTransmitter.cc:1564) 192.168.1.30 30200: connection timed out socket error:  vr: seq: -1 op: null hello seq: -1 pending:  blocks: 0 bytes: 0
Nov 29 13:34:18 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
Nov 29 13:34:18 node0_server metaserver[3634]: 11-29-2017 13:34:17.221 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
Nov 29 13:34:20 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (LogTransmitter.cc:1564) 192.168.1.29 30200: connection timed out socket error:  vr: seq: -1 op: null hello seq: -1 pending:  blocks: 0 bytes: 0
Nov 29 13:34:21 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (LogTransmitter.cc:1564) 192.168.1.30 30200: connection timed out socket error:  vr: seq: -1 op: null hello seq: -1 pending:  blocks: 0 bytes: 0
Nov 29 13:34:21 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107
Nov 29 13:34:21 node0_server metaserver[3634]: 11-29-2017 13:34:20.226 ERROR - (TcpSocket.cc:662) getpeername Transport endpoint is not connected 107

And these kinds of errors on nodes 1 and 2 (details scrubbed):

Nov 29 13:34:12 node1_server metaserver[3738]: 11-29-2017 13:34:11.213 INFO - (MetaVrSM.cc:2439) seq: 231641259406251669 status: 0  state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-hello node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:14 node1_server metaserver[3738]: 11-29-2017 13:34:14.218 INFO - (MetaVrSM.cc:2439) +seq: 231641259406251670 status: 0  state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:15 node1_server metaserver[3738]: 11-29-2017 13:34:14.218 ERROR - (MetaVrSM.cc:2439) =seq: 231641259406251670 status: -2 node inactive state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-start-view-change node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:17 node1_server metaserver[3738]: 11-29-2017 13:34:17.222 ERROR - (LogReceiver.cc:1176) 192.168.1.28 43328 id: 2 error: network error closing connection last block end: 3 1 8 socket error:
Nov 29 13:34:18 node1_server metaserver[3738]: 11-29-2017 13:34:17.223 INFO - (MetaVrSM.cc:2439) seq: 231641259406251671 status: 0  state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-hello node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0
Nov 29 13:34:20 node1_server metaserver[3738]: 11-29-2017 13:34:20.228 ERROR - (LogReceiver.cc:1176) 192.168.1.28 43340 id: 2 error: network error closing connection last block end: 3 1 8 socket error:
Nov 29 13:34:21 node1_server metaserver[3738]: 11-29-2017 13:34:20.228 INFO - (MetaVrSM.cc:2439) seq: 231641259406251672 status: 0  state: backup DVC: 0 active: 0 epoch: 2 view: 1 last: 2 1 1 vr-hello node: 0 state: view_change epoch: 4 view: 1 committed: 3 1 8 last: 3 1 8 lve: 3 1 8 fsid: [scrubbed] mds: 192.168.1.28 131072 ckey: qfs-filesystem md: [scrubbed] restart, node: 0




Michael Ovsiannikov

Dec 2, 2017, 9:00:57 PM12/2/17
to <qfs-devel@googlegroups.com>
Lloyd,

The following trace messages suggest that nodes 1 and 2 are not active:

Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 ERROR - (MetaVrSM.cc:581) start view change: node inactive status: -2 node: 2 / 0 view: 2 1 / 4 1
Nov 29 13:34:15 node0_server metaserver[3634]: 11-29-2017 13:34:14.217 ERROR - (MetaVrSM.cc:581) start view change: node inactive status: -2 node: 1 / 0 view: 2 1 / 4 1

In other words, nodes 1 and 2 are in a state where they are configured not to participate in the VR primary election. Given that the system has 3 nodes and two of them are inactive, the “view change” protocol that elects the primary node can never succeed, as at least 2 (a quorum of) active, functioning nodes are required for the election to complete successfully.
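
Spelling out the arithmetic (assuming the usual majority quorum, which matches the vr.quorum 2 in your status output):

quorum       = floor(3 / 2) + 1 = 2
active nodes = 3 - 2 inactive   = 1
1 < 2, so the view change can never complete and no primary is elected.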

I’d be curious to learn the exact sequence of steps that resulted in the current state of the system, where the epoch is 2 on nodes 1 and 2, and 4 on node 0.

Each successful VR reconfiguration (configuration change) that involves node activation or inactivation (including resetting the configuration by running the meta server with the -clear-vr-config or -vr-inactivate-all-nodes command line options) increments the VR epoch by one. The starting epoch, if the file system has never had VR configured, is 0. Normally, after node activation the epoch transitions from 0 to 1.
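
Purely as an illustration, a hypothetical history that would produce the epochs seen in your logs:

epoch 0      fresh file system, VR never configured
epoch 1      after the first successful activate-nodes
epoch 2      after one further reconfiguration (the last one nodes 1 and 2 saw)
epoch 3, 4   two more reconfigurations that only node 0 went through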

The first thing that comes to mind that would explain such a system state is that the VR configuration was changed on nodes 1 and 2 by invoking the meta server executable with the -vr-inactivate-all-nodes or -clear-vr-config options.

One way to get VR working might be to reset the VR configuration on node 0 by stopping the meta server, then running the meta server with the -clear-vr-config command line option, then clearing or creating new empty log and checkpoint directories on nodes 1 and 2, and then performing the qfsadmin steps described in the admin guide.
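
A rough sketch of those steps (the directory names and the exact meta server command line are assumptions on my part; please verify against the admin guide and your own configuration):

# node 0: stop the meta server, then reset its VR configuration
metaserver -clear-vr-config MetaServer.prp

# nodes 1 and 2: with their meta servers stopped, move the old log and
# checkpoint directories aside and create new empty ones
mv kfslog kfslog.prev && mkdir kfslog
mv kfscp  kfscp.prev  && mkdir kfscp

# then start the meta servers and repeat the qfsadmin add-node and
# activate-nodes steps from the admin guide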

Another way is, of course, to start from scratch by creating an empty file system.

There is a simple way to experiment with VR running all nodes on “localhost” by invoking the endurance test script from the build directory. For example, executing the following would build the release version of the code, then start 3 meta server nodes in a VR configuration with 9 chunk servers, and a meta server web UI on localhost:20050:

make && \
cd build/release && \
../../src/test-scripts/run_endurance_mc.sh -auth 0 -vr 3 -no-err-sim -mc-only -test-dirs-prefix vrtest/data

The script creates a QFS client configuration file in the test directory, with the list of all 3 meta server client listeners:

src/cc/tools/qfs -cfg vrtest/data3/*/test/cli/client.prp -fs qfs://localhost -ls /
Found 1 items
drwx------ - root wheel 0 2017-12-02 17:30 /dumpster

To stop meta and chunk servers:
../../src/test-scripts/run_endurance_mc.sh -auth 0 -vr 3 -no-err-sim -mc-only -test-dirs-prefix vrtest/data -stop

One caveat: I would not recommend using the configuration files created by this script as templates for real / production configuration, as the test configuration has a number of debug parameters that might have detrimental effects in non-test / “field” use.

— Mike.

