Alluxio 1.2.0 does not delete previous leader znodes before taking leadership


Kaiming Wan

Sep 22, 2016, 1:13:27 AM
to Alluxio Users
I have started Alluxio with fault tolerance, and it works well with HDFS as the UFS. However, I find that the previous leader's znode still exists after a new leader takes over. The same issue is described in https://groups.google.com/forum/#!searchin/alluxio-users/leader%7Csort:relevance/alluxio-users/-iMXqp5PYlI/Xzw5JDm9AgAJ. I am confused: the issue seems to have been fixed in tachyon-961, but it still exists in Alluxio 1.2.0.


What's more, I have to say I have spent more than two days getting the Alluxio cluster to work with fault tolerance. The official docs on deploying Alluxio with fault tolerance are too brief.

For example, if you don't set alluxio.worker.block.heartbeat.timeout.ms and alluxio.security.authentication.socket.timeout.ms to much larger values, there will be many timeout exceptions. The official docs don't even mention this.

Gene Pang

Sep 22, 2016, 9:46:09 AM
to Alluxio Users
Hi,

Could you clarify your question? Do the existing znodes cause an error?

Thanks for the information on the fault tolerant docs. The section (http://www.alluxio.org/docs/master/en/Running-Alluxio-Fault-Tolerant.html#worker-configuration) already mentions changing the "alluxio.worker.block.heartbeat.timeout.ms" value.

The parameter "alluxio.security.authentication.socket.timeout.ms" is already set to 600 seconds by default. Is it true that you have to change it to a larger value? What happens if you keep it at 600 seconds?

Thanks,
Gene

Kaiming Wan

Sep 22, 2016, 9:52:35 PM
to Alluxio Users
Hi Gene Pang,

    The existing znodes didn't cause any error. I just think it is unreasonable that a node's znode still exists after that node has died.


    You are right that the English docs say we need to change the "alluxio.worker.block.heartbeat.timeout.ms" value. However, the Chinese docs, which I was following, still do not mention it. I had simply assumed that the Chinese docs were identical to the English ones.


    I have to set "alluxio.security.authentication.socket.timeout.ms" to a value much larger than the default, such as 3000 seconds. The default value causes this error:
java.io.IOException: java.net.SocketTimeoutException: Read timed out
        at alluxio.AbstractClient.checkVersion(AbstractClient.java:112)
        at alluxio.AbstractClient.connect(AbstractClient.java:175)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:291)
        at alluxio.worker.block.BlockMasterClient.getId(BlockMasterClient.java:109)
        at alluxio.worker.WorkerIdRegistry.registerWithBlockMaster(WorkerIdRegistry.java:60)
        at alluxio.worker.block.BlockWorker.start(BlockWorker.java:168)
        at alluxio.worker.AlluxioWorker.startWorkers(AlluxioWorker.java:354)
        at alluxio.worker.AlluxioWorker.start(AlluxioWorker.java:326)
        at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:84)
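
Both property names above end in `.ms`, so their values are in milliseconds: the 600-second default corresponds to 600000 and the 3000 seconds mentioned above to 3000000. A minimal sketch of that conversion follows. The helper functions are hypothetical (not part of Alluxio), and the 120-second heartbeat value is only an illustrative assumption, since no value for it is given in this thread.

```python
# Hypothetical helper, not part of Alluxio: renders timeouts given in
# seconds as alluxio-site.properties lines with values in milliseconds.


def to_ms(seconds):
    """Convert seconds to whole milliseconds."""
    return int(seconds * 1000)


def render_properties(timeouts_in_seconds):
    """Return one 'key=value' line per property, values in milliseconds."""
    return "\n".join(
        f"{key}={to_ms(seconds)}" for key, seconds in timeouts_in_seconds.items()
    )


settings = {
    # 3000 seconds is the value reported in this thread.
    "alluxio.security.authentication.socket.timeout.ms": 3000,
    # Illustrative assumption only; the thread gives no value for this one.
    "alluxio.worker.block.heartbeat.timeout.ms": 120,
}

# Prints lines that could be pasted into alluxio-site.properties.
print(render_properties(settings))
```

The printed lines would then go into the site properties file on the relevant nodes, followed by a restart of the affected processes.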





On Thursday, September 22, 2016 at 9:46:09 PM UTC+8, Gene Pang wrote:

Gene Pang

Sep 23, 2016, 9:00:39 AM
to Alluxio Users
Thanks for the explanation. Yes, unfortunately, the docs in other languages are not always in sync with the English version. Would you be willing to contribute a fix to improve the documentation? It would be a great opportunity to get involved with the open source community.

Thanks,
Gene

Kaiming Wan

Sep 23, 2016, 11:40:19 AM
to Alluxio Users
Yes, I am very glad to join the open source community. I have registered an Alluxio JIRA account, and I will try to take on some beginner-level docs translation tasks soon.

On Friday, September 23, 2016 at 9:00:39 PM UTC+8, Gene Pang wrote: