A deploy error


杨效振

May 16, 2016, 2:18:55 AM
to qfs-...@googlegroups.com
I have deployed a system with one metaserver and three chunkservers, each on a
separate machine (four machines in total). When I start the metaserver, it
reports the following errors:
05-15-2016 21:05:48.331 ERROR - (ChunkServer.cc:1093) 192.168.75.131:40935 file system id mismatch
05-15-2016 21:05:48.331 ERROR - (ChunkServer.cc:886) chunk server  -1/192.168.75.131:40935 down reason: hello authentication error, cluster key, or md5sum mismatch socket error: 
05-15-2016 21:05:48.331 DEBUG - (ChunkServer.cc:438)  -1 ~ChunkServer 0x17cffb0 total: 1


The metaserver and chunkserver config files are as follows:
#metaServer
metaServer.clientPort = 20000
metaServer.chunkServerPort = 20100
metaServer.rackPrefixes = 192.168.75.129 1 192.168.75.130 2 192.168.75.131 3
metaServer.createEmptyFs = 1
metaServer.cpDir = /home/yang/qfsbase/meta/checkpoints
metaServer.logDir = /home/yang/qfsbase/meta/logs
metaServer.clusterKey = myQfs

#chunkServer1
chunkServer.metaServer.hostname = 192.168.75.128
chunkServer.metaServer.port = 20100
chunkServer.clientPort = 21001
chunkServer.clusterKey = myQfs
chunkServer.chunkDir = /home/yang/qfsbase/chunk1/chunkdir11 /home/yang/qfsbase/chunk1/chunkdir12

#chunkServer2
chunkServer.metaServer.hostname = 192.168.75.128
chunkServer.metaServer.port = 20100
chunkServer.clientPort = 21002
chunkServer.clusterKey = myQfs
chunkServer.chunkDir = /home/yang/qfsbase/chunk2/chunkdir21

#chunkServer3
chunkServer.metaServer.hostname = 192.168.75.128
chunkServer.metaServer.port = 20100
chunkServer.clientPort = 21003
chunkServer.clusterKey = myQfs
chunkServer.chunkDir = /home/yang/qfsbase/chunk3/chunkdir31

Is there something wrong in my config files?

mcan...@quantcast.com

May 17, 2016, 6:40:53 PM
to qfs-...@googlegroups.com
Hi,

I think you need to specify the md5 checksum of the chunkserver binary in the metaserver configuration file.

That is:
1) Get the md5 checksum of the chunkserver binary by running "md5 <chunkserver-binary-file>" (on Linux the equivalent command is "md5sum").
2) Place the resulting checksum in the metaserver configuration file, for instance:
     metaServer.chunkServerMd5sums = 6d99c0d6fdac176eb3147c59276a4458

Also, if you take a look at https://github.com/quantcast/qfs/wiki/Configuration-Reference, the definition of metaServer.chunkServerMd5sums says:
"A whitelist of space separated chunk server md5sums which are allowed to connect to the metaserver. When left empty any chunk server can connect to the metaserver. The default is an empty string, which allows all connections."
So, alternatively, keeping the line but leaving its value empty should also work.
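
For anyone who prefers to script the two steps above, here is a minimal Python sketch. It assumes the checksum the metaserver expects is the plain MD5 of the chunkserver executable file, as suggested in this thread; the helper names file_md5 and md5sums_line are made up for illustration and are not part of QFS.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def md5sums_line(paths):
    """Build a metaServer.chunkServerMd5sums line (hypothetical helper).

    Several binaries may be listed; duplicate checksums are collapsed and the
    values are joined with spaces, matching the "space separated" whitelist
    format quoted from the configuration reference.
    """
    sums = sorted({file_md5(p) for p in paths})
    return "metaServer.chunkServerMd5sums = " + " ".join(sums)
```

Calling md5sums_line with the path of your own chunkserver binary returns a line that can be pasted into the metaserver configuration file. If chunkservers built from different sources have to connect, pass every binary so that all checksums end up in the whitelist.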

Please let me know the outcome.

Best,

study lp

May 31, 2016, 12:01:26 AM
to QFS Development
In my opinion, this can happen for two reasons:
1. File system id mismatch:
   Please refer to another question I posted: "file system id mismatch and hello authentication error".
   This can happen when the metaserver directories are not consistent with the chunkserver directories. (I discuss it in detail below.)
2. md5sum mismatch:
   I have not read how the md5sum is generated, but I guess it is computed from the chunkserver executable (this has been confirmed by mcan...@quantcast.com).

I guess these checks exist for the following reasons:
1. md5sum mismatch
   This makes sure that the metaserver and the chunkservers are built from the same code. For example, you modify some underlying code that is used by both the metaserver and the chunkserver. You compile the code on the metaserver node but forget to copy the new chunkserver executable to the other chunkserver nodes. You have also configured the list of accepted md5sums in metaServer.chunkServerMd5sums in conf/MetaServer.prp. When the metaserver receives the chunkserver's hello request, it checks that the chunkserver's md5sum matches one of the configured values. If it does not, it reports an md5sum mismatch.
2. Cluster key mismatch
   One machine can run multiple chunkservers that belong to different QFS file systems, so a label is needed to tell which servers belong to the same file system. Just make sure that the cluster key is the same on all servers of one QFS file system.
3. File system id mismatch
   This is the problem I ran into. Suppose you have deployed a QFS file system: you start it, upload files, and stop it. Then you want to create an empty file system again. You simply delete the contents of metaServer.cpDir and metaServer.logDir but forget to delete the contents of chunkServer.chunkDir. This leaves stale chunks behind, and the system should detect it. How does it do that? When the file system is created, the metaserver creates a file system ID, stores it locally, and passes it to every chunkserver that connects to it. A chunkserver's first hello request does not contain a file system ID because it does not have one yet. The hello reply carries the file system ID, and the chunkserver stores it locally. The next time QFS restarts, the chunkserver sends its stored file system ID and the metaserver checks whether it matches its own. If the metaserver's cpDir and logDir have been recreated, it has a new file system ID that differs from the one stored on the chunkservers, and the mismatch is detected.
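
The three checks described above can be summarized in a toy Python model. This is only a sketch of the behavior as I understand it, not QFS code: the class names, the order of the checks, the error strings, and the use of a simple counter for the file system ID are all assumptions made for illustration.

```python
import itertools


class MetaServerModel:
    """Toy metaserver: a cluster key, an md5sum whitelist, and a file system ID."""

    _next_fs_id = itertools.count(1)  # stand-in for however the real ID is generated

    def __init__(self, cluster_key, allowed_md5sums=()):
        self.cluster_key = cluster_key
        self.allowed_md5sums = set(allowed_md5sums)  # empty: accept any binary
        # A freshly created (empty) file system always gets a new ID.
        self.fs_id = next(MetaServerModel._next_fs_id)

    def hello(self, cluster_key, md5sum, fs_id):
        """Return (accepted, payload): the fs ID on success, a reason on failure."""
        if cluster_key != self.cluster_key:
            return False, "cluster key mismatch"
        if self.allowed_md5sums and md5sum not in self.allowed_md5sums:
            return False, "md5sum mismatch"
        # A chunkserver that has never connected sends no ID and is accepted.
        if fs_id is not None and fs_id != self.fs_id:
            return False, "file system id mismatch"
        return True, self.fs_id


class ChunkServerModel:
    """Toy chunkserver that remembers the file system ID from its first hello."""

    def __init__(self, cluster_key, md5sum):
        self.cluster_key = cluster_key
        self.md5sum = md5sum
        self.fs_id = None  # nothing stored in chunkDir yet

    def connect(self, meta):
        ok, payload = meta.hello(self.cluster_key, self.md5sum, self.fs_id)
        if ok:
            self.fs_id = payload  # store the ID received in the hello reply
            return "connected"
        return payload
```

With this model, a chunkserver that connects once and then meets a newly created MetaServerModel (the equivalent of wiping cpDir and logDir while keeping chunkDir) gets "file system id mismatch" from connect, which is the situation in the original post. Emptying the chunk directories corresponds to resetting fs_id to None, after which the chunkserver is accepted again.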

lipeng


On Monday, May 16, 2016 at 2:18:55 PM UTC+8, 杨效振 wrote:

mcan...@quantcast.com

Jun 13, 2016, 5:17:34 PM
to QFS Development
Right. These are all valid points for why we need md5sum, cluster key and filesystem id checks.
For other interested parties, the details can be found in https://github.com/quantcast/qfs/blob/master/conf/MetaServer.prp