After master failover, the files In Memory before changed to Not In Memory

157 views
Skip to first unread message

Kun Li

unread,
Nov 15, 2016, 2:11:53 AM11/15/16
to Alluxio Users
After master failover, the files In Memory before changed to Not In Memory. I use glusterfs as the under filesystem, but this does not persistent inap the glusterfs, so when the master failover, this file disapeared. any hints?

Kun Li

unread,
Nov 15, 2016, 2:53:10 AM11/15/16
to Alluxio Users
Alluxio version is 1.3.0

We run alluxio in docker.  I killed master, the slave takeover. following is  the logs when failover happened:

slave takeover:


2016-11-15 05:56:05,011 INFO  logger.type (LeaderSelectorClient.java:takeLeadership) - deleting zk path: /alluxio/leader/aa02.prod.com:19998
2016-11-15 05:56:05,019 INFO  logger.type (LeaderSelectorClient.java:takeLeadership) - creating zk path: /alluxio/leader/aa02.prod.com:19998
2016-11-15 05:56:05,023 INFO  logger.type (LeaderSelectorClient.java:takeLeadership) - aa02.prod.com:19998 is now the leader.
2016-11-15 05:56:05,080 INFO  logger.type (AbstractMaster.java:stop) - BlockMaster: Stopping standby master.
2016-11-15 05:56:05,080 INFO  logger.type (JournalTailerThread.java:shutdown) - BlockMaster: Journal tailer shutdown has been initiated.
2016-11-15 05:56:11,034 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Journal tailer has been shutdown. No new logs for the quiet period.
2016-11-15 05:56:11,035 INFO  logger.type (AbstractMaster.java:stop) - FileSystemMaster: Stopping standby master.
2016-11-15 05:56:11,035 INFO  logger.type (JournalTailerThread.java:shutdown) - FileSystemMaster: Journal tailer shutdown has been initiated.
2016-11-15 05:56:17,684 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Journal tailer has been shutdown. No new logs for the quiet period.
2016-11-15 05:56:17,684 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: Starting leader master.
2016-11-15 05:56:17,686 INFO  logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2016-11-15 05:56:17,723 INFO  logger.type (JournalWriter.java:completeCurrentLog) - Completed current log: /mnt/glusterfs/alluxio/journal/BlockMaster/log.out to completed log: /mnt/glusterfs/alluxio/journal/BlockMaster/completed/log.00000000000000000001
2016-11-15 05:56:17,724 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: finish processing remaining journal entries (standby -> master).
2016-11-15 05:56:17,727 INFO  logger.type (JournalReader.java:getNextInputStream) - Opening journal log file: /mnt/glusterfs/alluxio/journal/BlockMaster/completed/log.00000000000000000001
2016-11-15 05:56:17,727 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - BlockMaster: Processing a completed log file.
2016-11-15 05:56:17,731 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - BlockMaster: Finished processing the log file.
2016-11-15 05:56:17,735 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data.tmp
2016-11-15 05:56:17,735 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 3148 Next journal sequence number: 3149
2016-11-15 05:56:17,821 INFO  logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data.tmp
2016-11-15 05:56:17,843 INFO  logger.type (JournalWriter.java:close) - Renamed checkpoint file /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data.tmp to /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-15 05:56:17,843 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2016-11-15 05:56:17,844 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting completed log: /mnt/glusterfs/alluxio/journal/BlockMaster/completed/log.00000000000000000001
2016-11-15 05:56:17,856 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2016-11-15 05:56:17,887 INFO  logger.type (MountTable.java:add) - Mounting /mnt/glusterfs at /
2016-11-15 05:56:17,888 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: Starting leader master.
2016-11-15 05:56:17,888 INFO  logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2016-11-15 05:56:17,920 INFO  logger.type (JournalWriter.java:completeCurrentLog) - Completed current log: /mnt/glusterfs/alluxio/journal/FileSystemMaster/log.out to completed log: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-15 05:56:17,920 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: finish processing remaining journal entries (standby -> master).
2016-11-15 05:56:17,927 INFO  logger.type (JournalReader.java:getNextInputStream) - Opening journal log file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-15 05:56:17,927 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Processing a completed log file.
2016-11-15 05:56:18,086 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Finished processing the log file.
2016-11-15 05:56:18,308 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data.tmp
2016-11-15 05:56:18,308 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 13146 Next journal sequence number: 13147
2016-11-15 05:56:18,331 INFO  logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data.tmp
2016-11-15 05:56:18,373 INFO  logger.type (JournalWriter.java:close) - Renamed checkpoint file /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data.tmp to /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-15 05:56:18,373 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2016-11-15 05:56:18,373 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting completed log: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-15 05:56:18,396 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2016-11-15 05:56:18,403 INFO  logger.type (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2016-11-15 05:56:18,410 INFO  util.log (Log.java:initialized) - Logging initialized @52046297ms
2016-11-15 05:56:18,505 INFO  server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2016-11-15 05:56:18,521 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@1da13f57{/metrics/json,null,AVAILABLE}
2016-11-15 05:56:24,357 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.w.WebAppContext@369c1699{/,file:/usr/local/alluxio-1.3.0/core/server/src/main/webapp/,AVAILABLE}{/usr/local/alluxio-1.3.0/core/server/src/main/webapp}
2016-11-15 05:56:24,360 INFO  server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@32b6e1b6{HTTP/1.1}{0.0.0.0:19999}
2016-11-15 05:56:24,361 INFO  server.Server (Server.java:doStart) - Started @52052247ms
2016-11-15 05:56:24,361 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Master Web service started @ /0.0.0.0:19999
2016-11-15 05:56:24,361 INFO  logger.type (AlluxioMaster.java:startServing) - Alluxio master version 1.3.0 started @ aa02.prod.com/10.3.3.5:19998 (gained leadership)
2016-11-15 05:56:47,127 INFO  logger.type (JournalWriter.java:openCurrentLog) - Opened current log file: /mnt/glusterfs/alluxio/journal/BlockMaster/log.out
2016-11-15 05:56:47,148 INFO  logger.type (JournalWriter.java:openCurrentLog) - Opened current log file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/log.out

original master restarted:


2016-11-15 05:55:52,820 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-15 05:55:52,830 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2016-11-15 05:55:52,832 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:host.name=10.3.3.15
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.version=1.7.0_91
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.vendor=Oracle Corporation
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.home=/usr/lib/jvm/java-1.7-openjdk/jre
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.class.path=/usr/local/alluxio-1.3.0/conf/::/usr/local/alluxio-1.3.0/assembly/target/alluxio-assemblies-1.3.0-jar-with-dependencies.jar
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.io.tmpdir=/tmp
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.compiler=<NA>
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.name=Linux
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.arch=amd64
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.version=4.6.3-coreos
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.name=root
2016-11-15 05:55:52,833 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.home=/root
2016-11-15 05:55:52,834 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.dir=/
2016-11-15 05:55:52,836 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@256a905c
2016-11-15 05:55:52,853 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-15 05:55:52,856 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-15 05:55:52,862 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x158447a840c003c, negotiated timeout = 60000
2016-11-15 05:55:52,864 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close) - Session: 0x158447a840c003c closed
2016-11-15 05:55:52,864 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run) - EventThread shut down
2016-11-15 05:55:52,864 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-15 05:55:52,864 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@79a70972
2016-11-15 05:55:52,866 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-15 05:55:52,867 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-15 05:55:52,869 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x358447a8e750039, negotiated timeout = 60000
2016-11-15 05:55:52,871 INFO  state.ConnectionStateManager (ConnectionStateManager.java:addStateChange) - State change: CONNECTED
2016-11-15 05:55:52,874 INFO  logger.type (AbstractMaster.java:stop) - BlockMaster: Stopping standby master.
2016-11-15 05:55:52,874 INFO  logger.type (AbstractMaster.java:stop) - FileSystemMaster: Stopping standby master.
2016-11-15 05:55:52,879 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: Starting standby master.
2016-11-15 05:55:52,881 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Journal tailer started.
2016-11-15 05:55:52,881 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Waiting to load the checkpoint file.
2016-11-15 05:55:52,887 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: Starting standby master.
2016-11-15 05:55:52,888 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Journal tailer started.
2016-11-15 05:55:52,888 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Waiting to load the checkpoint file.
2016-11-15 05:55:52,894 INFO  logger.type (LeaderSelectorClient.java:stateChanged) - The current leader is ab02.prod.com:19998
2016-11-15 05:55:52,904 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Start loading the checkpoint file.
2016-11-15 05:55:52,904 INFO  logger.type (JournalTailer.java:processJournalCheckpoint) - BlockMaster: Loading checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-15 05:55:52,904 INFO  logger.type (JournalReader.java:getCheckpointInputStream) - Opening journal checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-15 05:55:52,908 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Start loading the checkpoint file.
2016-11-15 05:55:52,908 INFO  logger.type (JournalTailer.java:processJournalCheckpoint) - FileSystemMaster: Loading checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-15 05:55:52,908 INFO  logger.type (JournalReader.java:getCheckpointInputStream) - Opening journal checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-15 05:55:52,961 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Checkpoint file has been loaded.
2016-11-15 05:55:52,985 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Checkpoint file has been loaded.
2016-11-15 05:56:17,954 ERROR logger.type (JournalTailerThread.java:run) - Error in journal tailer thread
java.io.IOException: Checkpoint file has been updated. This reader is no longer valid.
        at alluxio.master.journal.JournalReader.getNextInputStream(JournalReader.java:105)
        at alluxio.master.journal.JournalTailer.processNextJournalLogFiles(JournalTailer.java:118)
        at alluxio.master.journal.JournalTailerThread.run(JournalTailerThread.java:126)
2016-11-15 05:56:17,956 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Waiting to load the checkpoint file.
2016-11-15 05:56:17,956 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Start loading the checkpoint file.
2016-11-15 05:56:17,956 INFO  logger.type (JournalTailer.java:processJournalCheckpoint) - BlockMaster: Loading checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-15 05:56:17,956 INFO  logger.type (JournalReader.java:getCheckpointInputStream) - Opening journal checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-15 05:56:17,980 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Checkpoint file has been loaded.
2016-11-15 05:56:18,171 INFO  logger.type (JournalReader.java:getNextInputStream) - Opening journal log file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-15 05:56:18,171 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Processing a completed log file.
2016-11-15 05:56:18,644 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Finished processing the log file.
2016-11-15 05:56:19,733 ERROR logger.type (JournalTailerThread.java:run) - Error in journal tailer thread
java.io.IOException: Checkpoint file has been updated. This reader is no longer valid.
        at alluxio.master.journal.JournalReader.getNextInputStream(JournalReader.java:105)
        at alluxio.master.journal.JournalTailer.processNextJournalLogFiles(JournalTailer.java:118)
        at alluxio.master.journal.JournalTailerThread.run(JournalTailerThread.java:126)
2016-11-15 05:56:19,734 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Waiting to load the checkpoint file.
2016-11-15 05:56:19,734 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Start loading the checkpoint file.
2016-11-15 05:56:19,734 INFO  logger.type (JournalTailer.java:processJournalCheckpoint) - FileSystemMaster: Loading checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-15 05:56:19,734 INFO  logger.type (JournalReader.java:getCheckpointInputStream) - Opening journal checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-15 05:56:19,738 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Checkpoint file has been loaded.


Gene Pang

unread,
Nov 15, 2016, 9:31:44 AM11/15/16
to Alluxio Users
Hi,

When the first leader master is killed, does the standby eventually takeover successfully? Does the new webUI show the workers connected to it?

What are the files in question? Were they written to Alluxio, or were they read from the UFS through Alluxio?

Thanks,
Gene

Kun Li

unread,
Nov 16, 2016, 10:22:34 AM11/16/16
to Alluxio Users
From the logs, the standby did takeover, but my webUI doesn't work now, so I can't ensure it.

Then I restarted all the workers, and through spark-shells save a file onto alluxio, at this time, I can find all the sub-files under ramdisk directory.

Then I killed the leader master again, from the log , I could see some workers registered again:

2016-11-16 14:48:22,181 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=1, workerAddress=WorkerNetAddress{host=10.3.3.21, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=1275068060, lastUpdatedTimeMs=1479307702180, blocks=[70833405952, 71403831296, 70732742656, 71638712320]}
2016-11-16 14:48:24,926 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 14 for heartbeat.
2016-11-16 14:48:24,927 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.83, rpcPort=29998, dataPort=29999, webPort=30000} id: 2
2016-11-16 14:48:24,928 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=2, workerAddress=WorkerNetAddress{host=10.3.3.83, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=671088956, lastUpdatedTimeMs=1479307704928, blocks=[70632079360, 71571603456]}
2016-11-16 14:48:28,855 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 16 for heartbeat.
2016-11-16 14:48:28,856 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.23, rpcPort=29998, dataPort=29999, webPort=30000} id: 3
2016-11-16 14:48:28,857 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=3, workerAddress=WorkerNetAddress{host=10.3.3.23, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=603980016, lastUpdatedTimeMs=1479307708857, blocks=[71504494592, 70615302144]}
2016-11-16 14:48:36,981 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 8 for heartbeat.
2016-11-16 14:48:36,982 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.20, rpcPort=29998, dataPort=29999, webPort=30000} id: 4
2016-11-16 14:48:36,983 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=4, workerAddress=WorkerNetAddress{host=10.3.3.20, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=5314196380, lastUpdatedTimeMs=1479307716983, blocks=[70866960384, 71152173056, 71252836352, 71722598400, 71001178112, 71923924992, 71101841408, 71823261696, 71772930048, 70950846464, 71672266752, 71202504704, 71621935104, 71873593344, 71051509760, 71370276864]} 

But not all. On other workers  I could find the logs that have registered to leader master:

2016-11-16 14:56:37,807 WARN  logger.type (SleepingTimer.java:tick) - Worker Pin List Sync last execution took 602716 ms. Longer than the interval 1000
2016-11-16 14:56:37,809 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.pek.prod.ucarinc.com:19998, aa02.prod.com:19998]
2016-11-16 14:56:37,810 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-16 14:56:37,810 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-16 14:56:37,812 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998

From the time of logs , you can notice that this worker register to master at 14:56:37, but leader master didn't have logs about it, the log at 14:48:36 is the latest one.

and from alluxio client, I could see the strange output:

./alluxio fs ls /gps/test/testfile3
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:453  Not In Memory  /gps/test/testfile3/part-00008
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:28:461  Not In Memory  /gps/test/testfile3/part-00033
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:28:469  Not In Memory  /gps/test/testfile3/part-00039
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:472  Not In Memory  /gps/test/testfile3/part-00009
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:473  Not In Memory  /gps/test/testfile3/part-00016
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:473  Not In Memory  /gps/test/testfile3/part-00011
-rw-rw-rw-     root           root           294.01MB  11-16-2016 14:41:28:473  In Memory      /gps/test/testfile3/part-00001
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:28:477  In Memory      /gps/test/testfile3/part-00007
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:28:485  Not In Memory  /gps/test/testfile3/part-00025
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:485  Not In Memory  /gps/test/testfile3/part-00027
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:486  Not In Memory  /gps/test/testfile3/part-00013
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:486  Not In Memory  /gps/test/testfile3/part-00031
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:486  Not In Memory  /gps/test/testfile3/part-00022
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:28:490  In Memory      /gps/test/testfile3/part-00002
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:490  Not In Memory  /gps/test/testfile3/part-00000
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:28:494  Not In Memory  /gps/test/testfile3/part-00006
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:494  Not In Memory  /gps/test/testfile3/part-00005
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:498  In Memory      /gps/test/testfile3/part-00018
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:28:499  Not In Memory  /gps/test/testfile3/part-00003
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:30:160  Not In Memory  /gps/test/testfile3/part-00010
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:30:244  In Memory      /gps/test/testfile3/part-00004
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:30:311  In Memory      /gps/test/testfile3/part-00024
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:30:367  Not In Memory  /gps/test/testfile3/part-00014
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:32:251  In Memory      /gps/test/testfile3/part-00017
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:32:259  Not In Memory  /gps/test/testfile3/part-00015
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:32:391  Not In Memory  /gps/test/testfile3/part-00012
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:34:396  In Memory      /gps/test/testfile3/part-00019
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:34:629  Not In Memory  /gps/test/testfile3/part-00023
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:34:749  Not In Memory  /gps/test/testfile3/part-00020
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:36:666  In Memory      /gps/test/testfile3/part-00028
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:37:046  Not In Memory  /gps/test/testfile3/part-00026
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:37:046  Not In Memory  /gps/test/testfile3/part-00021
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:39:057  In Memory      /gps/test/testfile3/part-00030
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:39:353  Not In Memory  /gps/test/testfile3/part-00029
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:39:442  Not In Memory  /gps/test/testfile3/part-00032
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:41:292  In Memory      /gps/test/testfile3/part-00034
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:41:559  Not In Memory  /gps/test/testfile3/part-00035
-rw-rw-rw-     root           root           288.00MB  11-16-2016 14:41:41:764  Not In Memory  /gps/test/testfile3/part-00038
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:43:357  In Memory      /gps/test/testfile3/part-00037
-rw-rw-rw-     root           root           320.00MB  11-16-2016 14:41:43:890  Not In Memory  /gps/test/testfile3/part-00036
-rw-rw-rw-     gps            gps            0.00B     11-16-2016 14:41:46:557  In Memory      /gps/test/testfile3/_SUCCESS

What's the problem? any input will be appreciate.

likun

Gene Pang

unread,
Nov 17, 2016, 9:25:23 AM11/17/16
to Alluxio Users
Hi,

Which host was the old master and which host is the new master?

Also, if you go to the webUI of the new leader master, I think there is a workers page which shows which workers are "lost". Then, you can go to those workers to see why they are not connecting to the master. You may have to change the configuration value "alluxio.worker.block.heartbeat.timeout.ms" when in fault tolerant mode (http://www.alluxio.org/docs/master/en/Running-Alluxio-Fault-Tolerant.html#worker-configuration)

As for the "Not In Memory" output, that is probably happening because some of the workers have not connected to the master. This is not necessarily "strange", since if the data is persisted in the UFS, the next read of the file will read it from UFS.

Thanks,
Gene

Kun Li

unread,
Nov 20, 2016, 12:03:47 PM11/20/16
to Alluxio Users
ab02 is the new master.

I just created a docker images from jdk, and alluxio webui was working now. 
alluxio.worker.block.heartbeat.timeout.ms has been set to 3000000.

after failover, from webui I noticed there was 17 out 20 worker is registered succeed. the other 3 didn't display at the lost region, just didn't show.

the log from  the worker that was not registered:

2016-11-20 16:39:45,887 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-20 16:40:01,785 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-20 16:40:01,786 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-20 16:40:01,802 WARN  logger.type (SleepingTimer.java:tick) - Worker Pin List Sync last execution took 24368 ms. Longer than the interval 1000
2016-11-20 16:40:01,812 WARN  logger.type (SleepingTimer.java:tick) - Worker FileSystemMaster Sync last execution took 24076 ms. Longer than the interval 1000
2016-11-20 16:40:12,554 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-20 16:40:12,555 WARN  logger.type (SleepingTimer.java:tick) - Worker Block Sync last execution took 33766 ms. Longer than the interval 1000 

It said it was registered, but not. from webui and from master log , I could ensure it.

From zookeeper, I could only see masters' record, not workers', so workers didn't write zk, when master failover, how could it find all the workers?

any hints?

likun

Gene Pang

unread,
Nov 21, 2016, 9:20:14 AM11/21/16
to Alluxio Users
Hi,

From those log messages, it looks like it is trying to register with the old master (aa02), instead of the new master (ab02). What do the log messages say for the old master aa02?

When it fails over, the new master will write the master location into zookeeper, and the workers will use that to find the new master.

Thanks,
Gene

Kun Li

unread,
Nov 22, 2016, 10:22:53 PM11/22/16
to Alluxio Users
Sorry, it's my fault, aa02 is the master actually.

I did more test.

Now I have a 20-workers cluster, so I reduced workers to 10, then  did the master failover test. It worked, all workers reregistered in the new master. 

Then I raised the number of workers, until totally 17 workers. all were good, master switched, all workers registered, I repeated master failover about 10 times, all succeed.

After that,  I raised to 18 workers, workers register failure appeared, but no error logs, the register-failed worker said it has registered, but master didn't recognize it at all. from webui, I can't saw this worker. from the output of 'alluxio fs ls', I could see some 'Not In Memory‘ blocks.

Also I tcpdumped the interface of both master and the worker that failed to register, there was packages back and forth the worker and port 19998 of master.
 
does alluxio have some limits ? for now I setup 100GB memory for alluxio per worker.

Gene Pang

unread,
Nov 23, 2016, 9:06:21 AM11/23/16
to Alluxio Users
There shouldn't be any limitation for this failover.

When you hit the failure, what does the "old" master logs say? Does it think a worker registered with it?

Thanks,
Gene

Kun Li

unread,
Nov 24, 2016, 8:18:09 AM11/24/16
to Alluxio Users
I have setup alluxio to run in container, and I have only one alluxio master container setup. I killed this master container, then another one startup on another server. so there is no "old" master running.

Gene Pang

unread,
Nov 26, 2016, 2:42:52 PM11/26/16
to Alluxio Users
I see.

I am wondering why some of the workers do not connect to the new master. Could you show the logs for the workers that do not failover to the new master?

Thanks,
Gene
Message has been deleted
Message has been deleted

Kun Li

unread,
Nov 27, 2016, 10:12:01 PM11/27/16
to Alluxio Users
I raised number of  workers to 18, then did a master failover, The following was the log of the unregistered worker: 
the time master began to failover is 02:24:41, I excerpted the last part. If you need more, let me known.

2016-11-28 02:25:44,136 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998aa02.prod.com:19998]
2016-11-28 02:25:44,137 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-28 02:25:44,137 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @aa02.prod.com/10.3.3.5:19998
2016-11-28 02:25:44,138 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (16) to FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998 : java.net.ConnectException: Connection refused (Connection refused)
2016-11-28 02:25:44,985 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998aa02.prod.com:19998]
2016-11-28 02:25:44,986 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-28 02:25:44,986 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @aa02.prod.com/10.3.3.5:19998
2016-11-28 02:25:44,987 ERROR logger.type (AbstractClient.java:connect) - Failed to connect (16) to BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998 : java.net.ConnectException: Connection refused (Connection refused)
2016-11-28 02:25:45,121 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998aa02.prod.com:19998]
2016-11-28 02:25:45,122 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-28 02:25:45,123 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ab02.prod.com/10.3.3.15:19998
2016-11-28 02:25:45,139 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998aa02.prod.com:19998]
2016-11-28 02:25:45,140 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-28 02:25:45,140 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ab02.prod.com/10.3.3.15:19998
2016-11-28 02:25:45,988 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998aa02.prod.com:19998]
2016-11-28 02:25:45,989 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-28 02:25:45,989 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @ab02.prod.com/10.3.3.15:19998
2016-11-28 02:26:02,838 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-28 02:26:02,839 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-28 02:26:02,857 WARN  logger.type (SleepingTimer.java:tick) - Worker Pin List Sync last execution took 29963 ms. Longer than the interval 1000
2016-11-28 02:26:02,866 WARN  logger.type (SleepingTimer.java:tick) - Worker FileSystemMaster Sync last execution took 31794 ms. Longer than the interval 1000
2016-11-28 02:26:11,579 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-28 02:26:11,580 WARN  logger.type (SleepingTimer.java:tick) - Worker Block Sync last execution took 38891 ms. Longer than the interval 1000 

likun

Gene Pang

unread,
Nov 28, 2016, 9:29:33 AM11/28/16
to Alluxio Users
Hi,

I am confused with the log messages. Which is the master which is up? Earlier you said the master is aa02, and that the "old" master ab02 is no longer up. Therefore, I am confused why this worker is connecting to the "old" master ab02? Also, could you take a look at the log messages of the "old" master ab02?

Thanks,
Gene

Kun Li

unread,
Nov 29, 2016, 4:54:31 AM11/29/16
to Alluxio Users
That's ok.

Since I did mutiple failover test, these logs came from different test . 

The old master in the logs above was aa02, and the new master was ab02. as I have killed the master pod on aa02, worker failed to connect  aa02. After ab02 began to run, and registered to zk, worker got new master from zk, and began to connect ab02.  

Gene Pang

unread,
Nov 29, 2016, 6:30:24 PM11/29/16
to Alluxio Users
Hi,

Could you start a new failover test, and collect the logs for the different components:
- old master
- new master
- a worker which failed over successfully
- a worker which did not fail over successfully

This information would be useful to understand what is going on.

Thanks,
Gene

Kun Li

unread,
Nov 30, 2016, 3:36:12 AM11/30/16
to Alluxio Users
Ok. here it is.
Before failover, the master is aa02.prod.com
the alluxio-site.proterties has these parameters setup:

alluxio.security.authorization.permission.umask=000
alluxio
.worker.block.heartbeat.timeout.ms=300000
alluxio
.worker.filesystem.heartbeat.interval.ms=60000
alluxio
.worker.block.heartbeat.interval.ms=60000

The alluxio cluster status excerpted from webui:

Alluxio Summary


Master Address: aa02.prod.com/10.3.3.5:19998
Started: 11-30-2016 05:13:38:281
Uptime: 0 day(s), 2 hour(s), 19 minute(s), and 46 second(s)
Version: 1.3.0
Running Workers: 20
Cluster Usage Summary


Workers Capacity: 2000.00GB
Workers Free / Used: 2000.00GB / 0.00B
UnderFS Capacity: 78.28TB
UnderFS Free / Used: 69.53TB / 8.75TB




Live Workers


Node Name Last Heartbeat  State Workers Capacity Space Used Space Usage
10.3.3.10 38   In Service 100.00GB 0.00B 100%Free
10.3.3.11 30   In Service 100.00GB 0.00B 100%Free
10.3.3.12 38   In Service 100.00GB 0.00B 100%Free
10.3.3.13 38   In Service 100.00GB 0.00B 100%Free
10.3.3.19 39   In Service 100.00GB 0.00B 100%Free
10.3.3.20 39   In Service 100.00GB 0.00B 100%Free
10.3.3.21 39   In Service 100.00GB 0.00B 100%Free
10.3.3.23 38   In Service 100.00GB 0.00B 100%Free
10.3.3.73 38   In Service 100.00GB 0.00B 100%Free
10.3.3.74 39   In Service 100.00GB 0.00B 100%Free
10.3.3.75 38   In Service 100.00GB 0.00B 100%Free
10.3.3.76 39   In Service 100.00GB 0.00B 100%Free
10.3.3.77 39   In Service 100.00GB 0.00B 100%Free
10.3.3.83 38   In Service 100.00GB 0.00B 100%Free
10.3.3.84 38   In Service 100.00GB 0.00B 100%Free
10.3.3.85 39   In Service 100.00GB 0.00B 100%Free
10.3.3.86 39   In Service 100.00GB 0.00B 100%Free
10.3.3.87 39   In Service 100.00GB 0.00B 100%Free
10.3.3.9  36   In Service 100.00GB 0.00B 100%Free
ab09
.prod.com 38 In Service 100.00GB 0.00B 100%Free


Lost Workers


Node Name Last Heartbeat Workers Capacity

Zookeeper status:

[zk: localhost:2181(CONNECTED) 32] ls /leader                                                           
[zk: localhost:2181(CONNECTED) 33] get /leader/aa02.prod.com:19998
10.3.3.5
cZxid = 0x30000082e
ctime = Wed Nov 30 05:13:20 GMT 2016
mZxid = 0x30000082e
mtime = Wed Nov 30 05:13:20 GMT 2016
pZxid = 0x30000082e
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 9
numChildren = 0
[zk: localhost:2181(CONNECTED) 34] get /leader/ab02.prod.com:19998
10.3.3.15
cZxid = 0x300000826
ctime = Wed Nov 30 04:10:15 GMT 2016
mZxid = 0x300000826
mtime = Wed Nov 30 04:10:15 GMT 2016
pZxid = 0x300000826
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0
[zk: localhost:2181(CONNECTED) 35] ls /election                                                         
[_c_10613ed9-58f1-4633-8ee6-9450b3a991e6-lock-0000000002]
[zk: localhost:2181(CONNECTED) 36] get /election/_c_10613ed9-58f1-4633-8ee6-9450b3a991e6-lock-0000000002
cZxid = 0x30000082b
ctime = Wed Nov 30 05:12:00 GMT 2016
mZxid = 0x30000082b
mtime = Wed Nov 30 05:12:00 GMT 2016
pZxid = 0x30000082b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x15876ca48ee00b3
dataLength = 31
numChildren = 0
[zk: localhost:2181(CONNECTED) 37] 

When failover begin, the old master was killed, no log left.
the following is the log of new master:

2016-11-30 07:38:57,060 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-30 07:38:57,076 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2016-11-30 07:38:57,078 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:host.name=10.3.3.15
2016-11-30 07:38:57,078 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.version=1.7.0_121
2016-11-30 07:38:57,078 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.vendor=Oracle Corporation
2016-11-30 07:38:57,078 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.home=/usr/lib/jvm/java-1.7-openjdk/jre
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.class.path=/usr/local/alluxio-1.3.0/conf/::/usr/local/alluxio-1.3.0/assembly/target/alluxio-assemblies-1.3.0-jar-with-dependencies.jar
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.io.tmpdir=/tmp
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.compiler=<NA>
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.name=Linux
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.arch=amd64
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.version=4.6.3-coreos
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.name=root
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.home=/root
2016-11-30 07:38:57,079 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.dir=/
2016-11-30 07:38:57,082 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@10bc5c9
2016-11-30 07:38:57,099 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-30 07:38:57,104 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-30 07:38:57,110 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x25876ca490b00b7, negotiated timeout = 60000
2016-11-30 07:38:57,112 INFO  zookeeper.ZooKeeper (ZooKeeper.java:close) - Session: 0x25876ca490b00b7 closed
2016-11-30 07:38:57,112 INFO  zookeeper.ClientCnxn (ClientCnxn.java:run) - EventThread shut down
2016-11-30 07:38:57,112 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-30 07:38:57,112 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@65c98bf3
2016-11-30 07:38:57,114 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-30 07:38:57,114 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-30 07:38:57,116 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x25876ca490b00b8, negotiated timeout = 60000
2016-11-30 07:38:57,119 INFO  state.ConnectionStateManager (ConnectionStateManager.java:addStateChange) - State change: CONNECTED
2016-11-30 07:38:57,120 WARN  state.ConnectionStateManager (ConnectionStateManager.java:processEvents) - There are no ConnectionStateListeners registered.
2016-11-30 07:38:57,123 INFO  logger.type (AbstractMaster.java:stop) - BlockMaster: Stopping standby master.
2016-11-30 07:38:57,123 INFO  logger.type (AbstractMaster.java:stop) - FileSystemMaster: Stopping standby master.
2016-11-30 07:38:57,127 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: Starting standby master.
2016-11-30 07:38:57,129 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Journal tailer started.
2016-11-30 07:38:57,129 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Waiting to load the checkpoint file.
2016-11-30 07:38:57,136 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: Starting standby master.
2016-11-30 07:38:57,137 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Journal tailer started.
2016-11-30 07:38:57,137 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Waiting to load the checkpoint file.
2016-11-30 07:38:57,140 INFO  logger.type (LeaderSelectorClient.java:stateChanged) - The current leader is aa02.prod.com:19998
2016-11-30 07:38:57,169 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Start loading the checkpoint file.
2016-11-30 07:38:57,169 INFO  logger.type (JournalTailer.java:processJournalCheckpoint) - FileSystemMaster: Loading checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-30 07:38:57,169 INFO  logger.type (JournalReader.java:getCheckpointInputStream) - Opening journal checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-30 07:38:57,170 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Start loading the checkpoint file.
2016-11-30 07:38:57,170 INFO  logger.type (JournalTailer.java:processJournalCheckpoint) - BlockMaster: Loading checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-30 07:38:57,170 INFO  logger.type (JournalReader.java:getCheckpointInputStream) - Opening journal checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-30 07:38:57,202 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Checkpoint file has been loaded.
2016-11-30 07:38:57,218 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Checkpoint file has been loaded.
2016-11-30 07:40:10,005 INFO  logger.type (LeaderSelectorClient.java:takeLeadership) - deleting zk path: /leader/ab02.prod.com:19998
2016-11-30 07:40:10,008 INFO  logger.type (LeaderSelectorClient.java:takeLeadership) - creating zk path: /leader/ab02.prod.com:19998
2016-11-30 07:40:10,009 INFO  logger.type (LeaderSelectorClient.java:takeLeadership) - ab02.prod.com:19998 is now the leader.
2016-11-30 07:40:10,011 INFO  logger.type (AbstractMaster.java:stop) - BlockMaster: Stopping standby master.
2016-11-30 07:40:10,011 INFO  logger.type (JournalTailerThread.java:shutdown) - BlockMaster: Journal tailer shutdown has been initiated.
2016-11-30 07:40:15,767 INFO  logger.type (JournalTailerThread.java:run) - BlockMaster: Journal tailer has been shutdown. No new logs for the quiet period.
2016-11-30 07:40:15,767 INFO  logger.type (AbstractMaster.java:stop) - FileSystemMaster: Stopping standby master.
2016-11-30 07:40:15,767 INFO  logger.type (JournalTailerThread.java:shutdown) - FileSystemMaster: Journal tailer shutdown has been initiated.
2016-11-30 07:40:21,845 INFO  logger.type (JournalTailerThread.java:run) - FileSystemMaster: Journal tailer has been shutdown. No new logs for the quiet period.
2016-11-30 07:40:21,846 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: Starting leader master.
2016-11-30 07:40:21,847 INFO  logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2016-11-30 07:40:21,871 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: finish processing remaining journal entries (standby -> master).
2016-11-30 07:40:21,880 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data.tmp
2016-11-30 07:40:21,880 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 4 Next journal sequence number: 5
2016-11-30 07:40:21,962 INFO  logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data.tmp
2016-11-30 07:40:22,217 INFO  logger.type (JournalWriter.java:close) - Renamed checkpoint file /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data.tmp to /mnt/glusterfs/alluxio/journal/BlockMaster/checkpoint.data
2016-11-30 07:40:22,218 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2016-11-30 07:40:22,223 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2016-11-30 07:40:22,260 INFO  logger.type (MountTable.java:add) - Mounting /mnt/glusterfs at /
2016-11-30 07:40:22,261 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: Starting leader master.
2016-11-30 07:40:22,261 INFO  logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2016-11-30 07:40:23,125 INFO  logger.type (JournalWriter.java:completeCurrentLog) - Completed current log: /mnt/glusterfs/alluxio/journal/FileSystemMaster/log.out to completed log: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-30 07:40:23,126 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: finish processing remaining journal entries (standby -> master).
2016-11-30 07:40:23,134 INFO  logger.type (JournalReader.java:getNextInputStream) - Opening journal log file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-30 07:40:23,134 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Processing a completed log file.
2016-11-30 07:40:23,139 INFO  logger.type (JournalTailer.java:processNextJournalLogFiles) - FileSystemMaster: Finished processing the log file.
2016-11-30 07:40:23,159 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data.tmp
2016-11-30 07:40:23,159 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 21 Next journal sequence number: 22
2016-11-30 07:40:23,189 INFO  logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data.tmp
2016-11-30 07:40:23,282 INFO  logger.type (JournalWriter.java:close) - Renamed checkpoint file /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data.tmp to /mnt/glusterfs/alluxio/journal/FileSystemMaster/checkpoint.data
2016-11-30 07:40:23,282 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2016-11-30 07:40:23,282 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting completed log: /mnt/glusterfs/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
2016-11-30 07:40:23,347 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2016-11-30 07:40:23,357 INFO  logger.type (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2016-11-30 07:40:23,366 INFO  util.log (Log.java:initialized) - Logging initialized @87036ms
2016-11-30 07:40:23,487 INFO  server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2016-11-30 07:40:23,507 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@73a1eeaa{/metrics/json,null,AVAILABLE}
2016-11-30 07:40:29,353 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.w.WebAppContext@1246406b{/,file:/usr/local/alluxio-1.3.0/core/server/src/main/webapp/,AVAILABLE}{/usr/local/alluxio-1.3.0/core/server/src/main/webapp}
2016-11-30 07:40:29,359 INFO  server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@680bf396{HTTP/1.1}{0.0.0.0:19999}
2016-11-30 07:40:29,359 INFO  server.Server (Server.java:doStart) - Started @93029ms
2016-11-30 07:40:29,360 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Master Web service started @ /0.0.0.0:19999
2016-11-30 07:40:29,360 INFO  logger.type (AlluxioMaster.java:startServing) - Alluxio master version 1.3.0 started @ ab02.prod.com/10.3.3.15:19998 (gained leadership)
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 8 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 13 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 16 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 2 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 9 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 1 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 5 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 18 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 15 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 20 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 3 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 6 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 11 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 7 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 10 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 4 for heartbeat.
2016-11-30 07:40:29,451 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 12 for heartbeat.
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.87, rpcPort=29998, dataPort=29999, webPort=30000} id: 3
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.74, rpcPort=29998, dataPort=29999, webPort=30000} id: 6
2016-11-30 07:40:29,469 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.20, rpcPort=29998, dataPort=29999, webPort=30000} id: 14
2016-11-30 07:40:29,469 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.23, rpcPort=29998, dataPort=29999, webPort=30000} id: 16
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.10, rpcPort=29998, dataPort=29999, webPort=30000} id: 1
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.76, rpcPort=29998, dataPort=29999, webPort=30000} id: 4
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.12, rpcPort=29998, dataPort=29999, webPort=30000} id: 10
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.21, rpcPort=29998, dataPort=29999, webPort=30000} id: 8
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.86, rpcPort=29998, dataPort=29999, webPort=30000} id: 11
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.73, rpcPort=29998, dataPort=29999, webPort=30000} id: 2
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.75, rpcPort=29998, dataPort=29999, webPort=30000} id: 17
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=ab09.prod.com, rpcPort=29998, dataPort=29999, webPort=30000} id: 13
2016-11-30 07:40:29,468 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.85, rpcPort=29998, dataPort=29999, webPort=30000} id: 5
2016-11-30 07:40:29,469 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.83, rpcPort=29998, dataPort=29999, webPort=30000} id: 12
2016-11-30 07:40:29,469 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.84, rpcPort=29998, dataPort=29999, webPort=30000} id: 15
2016-11-30 07:40:29,469 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.19, rpcPort=29998, dataPort=29999, webPort=30000} id: 7
2016-11-30 07:40:29,469 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.77, rpcPort=29998, dataPort=29999, webPort=30000} id: 9
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=3, workerAddress=WorkerNetAddress{host=10.3.3.87, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=5, workerAddress=WorkerNetAddress{host=10.3.3.85, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=11, workerAddress=WorkerNetAddress{host=10.3.3.86, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=6, workerAddress=WorkerNetAddress{host=10.3.3.74, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=2, workerAddress=WorkerNetAddress{host=10.3.3.73, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=14, workerAddress=WorkerNetAddress{host=10.3.3.20, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=7, workerAddress=WorkerNetAddress{host=10.3.3.19, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=1, workerAddress=WorkerNetAddress{host=10.3.3.10, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=13, workerAddress=WorkerNetAddress{host=ab09.prod.com, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=10, workerAddress=WorkerNetAddress{host=10.3.3.12, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=16, workerAddress=WorkerNetAddress{host=10.3.3.23, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=4, workerAddress=WorkerNetAddress{host=10.3.3.76, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=8, workerAddress=WorkerNetAddress{host=10.3.3.21, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=17, workerAddress=WorkerNetAddress{host=10.3.3.75, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=9, workerAddress=WorkerNetAddress{host=10.3.3.77, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=15, workerAddress=WorkerNetAddress{host=10.3.3.84, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:29,477 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=12, workerAddress=WorkerNetAddress{host=10.3.3.83, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491629476, blocks=[]}
2016-11-30 07:40:39,654 WARN  logger.type (BlockMaster.java:workerHeartbeat) - Could not find worker id: 19 for heartbeat.
2016-11-30 07:40:39,657 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=10.3.3.9, rpcPort=29998, dataPort=29999, webPort=30000} id: 18
2016-11-30 07:40:39,660 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo{id=18, workerAddress=WorkerNetAddress{host=10.3.3.9, rpcPort=29998, dataPort=29999, webPort=30000}, capacityBytes=107374182400, usedBytes=0, lastUpdatedTimeMs=1480491639659, blocks=[]}
2016-11-30 07:42:04,598 INFO  logger.type (LeaderInquireClient.java:<init>) - create new zookeeper client. address: zookeeper:2181
2016-11-30 07:42:04,598 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-30 07:42:04,599 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@79111dfc
2016-11-30 07:42:04,601 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-30 07:42:04,601 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-30 07:42:04,603 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x15876ca48ee00bd, negotiated timeout = 60000
2016-11-30 07:42:04,604 INFO  state.ConnectionStateManager (ConnectionStateManager.java:addStateChange) - State change: CONNECTED
2016-11-30 07:42:04,604 WARN  state.ConnectionStateManager (ConnectionStateManager.java:processEvents) - There are no ConnectionStateListeners registered.
2016-11-30 07:42:04,605 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:42:04,606 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:42:05,797 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:42:05,798 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:42:07,778 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:42:07,779 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:42:10,636 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:42:10,638 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998



The log of a registed worker: from startup, register to aa02, then failover to ab02

2016-11-30 07:27:05,015 INFO  util.log (Log.java:initialized) - Logging initialized @772ms
2016-11-30 07:27:05,317 INFO  logger.type (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2016-11-30 07:27:05,321 INFO  server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2016-11-30 07:27:05,352 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@78d88d3e{/metrics/json,null,AVAILABLE}
2016-11-30 07:27:09,642 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.w.WebAppContext@54d9a624{/,file:/usr/local/alluxio-1.3.0/core/server/src/main/webapp/,AVAILABLE}{/usr/local/alluxio-1.3.0/core/server/src/main/webapp}
2016-11-30 07:27:09,649 INFO  server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@7272be94{HTTP/1.1}{0.0.0.0:30000}
2016-11-30 07:27:09,649 INFO  server.Server (Server.java:doStart) - Started @5408ms
2016-11-30 07:27:09,649 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio worker web service started @ /0.0.0.0:30000
2016-11-30 07:27:09,652 INFO  logger.type (LeaderInquireClient.java:<init>) - create new zookeeper client. address: zookeeper:2181
2016-11-30 07:27:09,708 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-30 07:27:09,716 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2016-11-30 07:27:09,717 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:host.name=10.3.3.86
2016-11-30 07:27:09,717 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.version=1.7.0_121
2016-11-30 07:27:09,717 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.vendor=Oracle Corporation
2016-11-30 07:27:09,717 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.home=/usr/lib/jvm/java-1.7-openjdk/jre
2016-11-30 07:27:09,717 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.class.path=/usr/local/alluxio-1.3.0/conf/::/usr/local/alluxio-1.3.0/assembly/target/alluxio-assemblies-1.3.0-jar-with-dependencies.jar
2016-11-30 07:27:09,718 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-11-30 07:27:09,718 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.io.tmpdir=/tmp
2016-11-30 07:27:09,718 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.compiler=<NA>
2016-11-30 07:27:09,719 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.name=Linux
2016-11-30 07:27:09,719 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.arch=amd64
2016-11-30 07:27:09,719 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.version=4.6.3-coreos
2016-11-30 07:27:09,719 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.name=root
2016-11-30 07:27:09,719 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.home=/root
2016-11-30 07:27:09,719 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.dir=/
2016-11-30 07:27:09,720 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@318603db
2016-11-30 07:27:09,736 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-30 07:27:09,740 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-30 07:27:09,747 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x35876ca7b4400cb, negotiated timeout = 60000
2016-11-30 07:27:09,750 INFO  state.ConnectionStateManager (ConnectionStateManager.java:addStateChange) - State change: CONNECTED
2016-11-30 07:27:09,750 WARN  state.ConnectionStateManager (ConnectionStateManager.java:processEvents) - There are no ConnectionStateListeners registered.
2016-11-30 07:27:10,768 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:27:10,769 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-30 07:27:10,771 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:27:10,794 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:27:10,837 INFO  logger.type (DefaultAlluxioWorker.java:start) - Started Alluxio worker with id 11
2016-11-30 07:27:10,837 INFO  logger.type (DefaultAlluxioWorker.java:start) - Alluxio worker version 1.3.0 started @ /0.0.0.0:29998
2016-11-30 07:27:10,839 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:27:10,841 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-30 07:27:10,841 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:27:10,844 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:27:10,845 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:27:10,845 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-30 07:27:10,845 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:27:10,848 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:40:10,838 ERROR logger.type (AbstractClient.java:retryRPC) - 
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at alluxio.thrift.BlockMasterWorkerService$Client.recv_heartbeat(BlockMasterWorkerService.java:201)
        at alluxio.thrift.BlockMasterWorkerService$Client.heartbeat(BlockMasterWorkerService.java:185)
        at alluxio.worker.block.BlockMasterClient$3.call(BlockMasterClient.java:135)
        at alluxio.worker.block.BlockMasterClient$3.call(BlockMasterClient.java:132)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:317)
        at alluxio.worker.block.BlockMasterClient.heartbeat(BlockMasterClient.java:132)
        at alluxio.worker.block.BlockMasterSync.heartbeat(BlockMasterSync.java:148)
        at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:82)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-11-30 07:40:10,838 ERROR logger.type (AbstractClient.java:retryRPC) - 
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.recv_heartbeat(FileSystemMasterWorkerService.java:162)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.heartbeat(FileSystemMasterWorkerService.java:148)
        at alluxio.worker.file.FileSystemMasterClient$3.call(FileSystemMasterClient.java:120)
        at alluxio.worker.file.FileSystemMasterClient$3.call(FileSystemMasterClient.java:117)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:348)
        at alluxio.worker.file.FileSystemMasterClient.heartbeat(FileSystemMasterClient.java:117)
        at alluxio.worker.file.FileWorkerMasterSyncExecutor.heartbeat(FileWorkerMasterSyncExecutor.java:88)
        at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:82)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-11-30 07:40:10,838 ERROR logger.type (AbstractClient.java:retryRPC) - 
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.recv_getPinIdList(FileSystemMasterWorkerService.java:135)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.getPinIdList(FileSystemMasterWorkerService.java:123)
        at alluxio.worker.file.FileSystemMasterClient$2.call(FileSystemMasterClient.java:101)
        at alluxio.worker.file.FileSystemMasterClient$2.call(FileSystemMasterClient.java:98)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:317)
        at alluxio.worker.file.FileSystemMasterClient.getPinList(FileSystemMasterClient.java:98)
        at alluxio.worker.block.PinListSync.heartbeat(PinListSync.java:55)
        at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:82)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-11-30 07:40:10,842 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:40:10,844 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:40:10,844 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:10,845 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:40:10,848 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:40:10,849 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:10,850 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:40:10,851 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:40:10,851 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:29,437 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:29,437 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:29,437 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ ab02.prod.com/10.3.3.15:19998



The log of the unregistered worker: from startup , register to aa02, then try to register ab02

2016-11-30 07:33:13,689 INFO  util.log (Log.java:initialized) - Logging initialized @939ms
2016-11-30 07:33:14,001 INFO  logger.type (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2016-11-30 07:33:14,007 INFO  server.Server (Server.java:doStart) - jetty-9.2.z-SNAPSHOT
2016-11-30 07:33:14,034 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@51211f6d{/metrics/json,null,AVAILABLE}
2016-11-30 07:33:18,382 INFO  handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.w.WebAppContext@5dcac79c{/,file:/usr/local/alluxio-1.3.0/core/server/src/main/webapp/,AVAILABLE}{/usr/local/alluxio-1.3.0/core/server/src/main/webapp}
2016-11-30 07:33:18,390 INFO  server.ServerConnector (AbstractConnector.java:doStart) - Started ServerConnector@1e62e899{HTTP/1.1}{0.0.0.0:30000}
2016-11-30 07:33:18,390 INFO  server.Server (Server.java:doStart) - Started @5643ms
2016-11-30 07:33:18,391 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio worker web service started @ /0.0.0.0:30000
2016-11-30 07:33:18,393 INFO  logger.type (LeaderInquireClient.java:<init>) - create new zookeeper client. address: zookeeper:2181
2016-11-30 07:33:18,454 INFO  imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:start) - Starting
2016-11-30 07:33:18,464 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
2016-11-30 07:33:18,464 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:host.name=10.3.3.11
2016-11-30 07:33:18,464 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.version=1.7.0_121
2016-11-30 07:33:18,464 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.vendor=Oracle Corporation
2016-11-30 07:33:18,464 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.home=/usr/lib/jvm/java-1.7-openjdk/jre
2016-11-30 07:33:18,464 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.class.path=/usr/local/alluxio-1.3.0/conf/::/usr/local/alluxio-1.3.0/assembly/target/alluxio-assemblies-1.3.0-jar-with-dependencies.jar
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.io.tmpdir=/tmp
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:java.compiler=<NA>
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.name=Linux
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.arch=amd64
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:os.version=4.6.3-coreos
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.name=root
2016-11-30 07:33:18,466 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.home=/root
2016-11-30 07:33:18,467 INFO  zookeeper.ZooKeeper (Environment.java:logEnv) - Client environment:user.dir=/
2016-11-30 07:33:18,468 INFO  zookeeper.ZooKeeper (ZooKeeper.java:<init>) - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@377c0e2a
2016-11-30 07:33:18,487 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect) - Opening socket connection to server 192.168.192.20/192.168.192.20:2181. Will not attempt to authenticate using SASL (unknown error)
2016-11-30 07:33:18,491 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection) - Socket connection established to 192.168.192.20/192.168.192.20:2181, initiating session
2016-11-30 07:33:18,500 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected) - Session establishment complete on server 192.168.192.20/192.168.192.20:2181, sessionid = 0x15876ca48ee00bc, negotiated timeout = 60000
2016-11-30 07:33:18,503 INFO  state.ConnectionStateManager (ConnectionStateManager.java:addStateChange) - State change: CONNECTED
2016-11-30 07:33:18,503 WARN  state.ConnectionStateManager (ConnectionStateManager.java:processEvents) - There are no ConnectionStateListeners registered.
2016-11-30 07:33:19,517 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:33:19,519 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-30 07:33:19,520 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:33:19,540 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:33:19,600 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:33:19,600 INFO  logger.type (DefaultAlluxioWorker.java:start) - Started Alluxio worker with id 17
2016-11-30 07:33:19,600 INFO  logger.type (DefaultAlluxioWorker.java:start) - Alluxio worker version 1.3.0 started @ /0.0.0.0:29998
2016-11-30 07:33:19,602 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-30 07:33:19,602 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:33:19,604 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:33:19,605 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:33:19,607 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: aa02.prod.com:19998
2016-11-30 07:33:19,607 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:33:19,609 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ aa02.prod.com/10.3.3.5:19998
2016-11-30 07:40:19,599 ERROR logger.type (AbstractClient.java:retryRPC) - 
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at alluxio.thrift.BlockMasterWorkerService$Client.recv_heartbeat(BlockMasterWorkerService.java:201)
        at alluxio.thrift.BlockMasterWorkerService$Client.heartbeat(BlockMasterWorkerService.java:185)
        at alluxio.worker.block.BlockMasterClient$3.call(BlockMasterClient.java:135)
        at alluxio.worker.block.BlockMasterClient$3.call(BlockMasterClient.java:132)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:317)
        at alluxio.worker.block.BlockMasterClient.heartbeat(BlockMasterClient.java:132)
        at alluxio.worker.block.BlockMasterSync.heartbeat(BlockMasterSync.java:148)
        at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:82)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-11-30 07:40:19,601 ERROR logger.type (AbstractClient.java:retryRPC) - 
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.recv_getPinIdList(FileSystemMasterWorkerService.java:135)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.getPinIdList(FileSystemMasterWorkerService.java:123)
        at alluxio.worker.file.FileSystemMasterClient$2.call(FileSystemMasterClient.java:101)
        at alluxio.worker.file.FileSystemMasterClient$2.call(FileSystemMasterClient.java:98)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:317)
        at alluxio.worker.file.FileSystemMasterClient.getPinList(FileSystemMasterClient.java:98)
        at alluxio.worker.block.PinListSync.heartbeat(PinListSync.java:55)
        at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:82)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-11-30 07:40:19,600 ERROR logger.type (AbstractClient.java:retryRPC) - 
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
        at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
        at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
        at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.protocol.TProtocolDecorator.readMessageBegin(TProtocolDecorator.java:135)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.recv_heartbeat(FileSystemMasterWorkerService.java:162)
        at alluxio.thrift.FileSystemMasterWorkerService$Client.heartbeat(FileSystemMasterWorkerService.java:148)
        at alluxio.worker.file.FileSystemMasterClient$3.call(FileSystemMasterClient.java:120)
        at alluxio.worker.file.FileSystemMasterClient$3.call(FileSystemMasterClient.java:117)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:348)
        at alluxio.worker.file.FileSystemMasterClient.heartbeat(FileSystemMasterClient.java:117)
        at alluxio.worker.file.FileWorkerMasterSyncExecutor.heartbeat(FileWorkerMasterSyncExecutor.java:88)
        at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:82)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-11-30 07:40:19,603 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:40:19,605 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:40:19,606 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with BlockMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:19,606 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:40:19,608 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:40:19,608 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:19,609 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - Master addresses: [ab02.prod.com:19998, aa02.prod.com:19998]
2016-11-30 07:40:19,611 INFO  logger.type (LeaderInquireClient.java:getMasterAddress) - The leader master: ab02.prod.com:19998
2016-11-30 07:40:19,611 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.3.0) is trying to connect with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:32,307 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:32,307 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-11-30 07:40:33,443 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998

The status excerpted from alluxio webui:

Alluxio Summary

Started: 11-30-2016 07:40:29:381
Uptime: 0 day(s), 0 hour(s), 3 minute(s), and 16 second(s)
Version: 1.3.0
Running Workers: 18
Cluster Usage Summary

Workers Capacity: 1800.00GB
Workers Free / Used: 1800.00GB / 0.00B
UnderFS Capacity: 78.28TB
UnderFS Free / Used: 69.52TB / 8.76TB


Live Workers

Node Name Last Heartbeat State Workers Capacity Space Used Space Usage
10.3.3.10 58 In Service 100.00GB 0.00B 100%Free
10.3.3.12 57 In Service 100.00GB 0.00B 100%Free
10.3.3.19 58 In Service 100.00GB 0.00B 100%Free
10.3.3.20 57 In Service 100.00GB 0.00B 100%Free
10.3.3.21 58 In Service 100.00GB 0.00B 100%Free
10.3.3.23 57 In Service 100.00GB 0.00B 100%Free
10.3.3.73 57 In Service 100.00GB 0.00B 100%Free
10.3.3.74 58 In Service 100.00GB 0.00B 100%Free
10.3.3.75 49 In Service 100.00GB 0.00B 100%Free
10.3.3.76 58 In Service 100.00GB 0.00B 100%Free
10.3.3.77 58 In Service 100.00GB 0.00B 100%Free
10.3.3.83 58 In Service 100.00GB 0.00B 100%Free
10.3.3.84 57 In Service 100.00GB 0.00B 100%Free
10.3.3.85 58 In Service 100.00GB 0.00B 100%Free
10.3.3.86 58 In Service 100.00GB 0.00B 100%Free
10.3.3.87 58 In Service 100.00GB 0.00B 100%Free
10.3.3.9 55 In Service 100.00GB 0.00B 100%Free
ab09.prod.com 58 In Service 100.00GB 0.00B 100%Free

Lost Workers

Node Name Last Heartbeat Workers Capacity



Zookeeper status after faliover:

[zk: localhost:2181(CONNECTED) 37] ls /leader                                                           
[zk: localhost:2181(CONNECTED) 38] get /leader/aa02.prod.com:19998                          
10.3.3.5
cZxid = 0x30000082e
ctime = Wed Nov 30 05:13:20 GMT 2016
mZxid = 0x30000082e
mtime = Wed Nov 30 05:13:20 GMT 2016
pZxid = 0x30000082e
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 9
numChildren = 0
[zk: localhost:2181(CONNECTED) 39] get /leader/ab02.prod.com:19998                          
10.3.3.15
cZxid = 0x300000860
ctime = Wed Nov 30 07:40:10 GMT 2016
mZxid = 0x300000860
mtime = Wed Nov 30 07:40:10 GMT 2016
pZxid = 0x300000860
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 10
numChildren = 0
[zk: localhost:2181(CONNECTED) 40] ls /election                                                         
[_c_7f58be2b-bf1c-4c0f-bf7c-221d8f95485b-lock-0000000003]
[zk: localhost:2181(CONNECTED) 41] get /election/_c_7f58be2b-bf1c-4c0f-bf7c-221d8f95485b-lock-0000000003
cZxid = 0x30000085d
ctime = Wed Nov 30 07:38:57 GMT 2016
mZxid = 0x30000085d
mtime = Wed Nov 30 07:38:57 GMT 2016
pZxid = 0x30000085d
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x25876ca490b00b8
dataLength = 31
numChildren = 0
[zk: localhost:2181(CONNECTED) 42] 


Thanks for your patient

Likun

Kun Li

unread,
Dec 1, 2016, 10:03:00 AM12/1/16
to Alluxio Users
Update some message.

I did some test:

1, I deleted alluxio.worker.block.heartbeat.timeout.ms=300000  from alluxio-site.properties, restarted the alluxio cluster. the same result as before : master failover succeed when I had 17 workers, all workers registered after master failover, failed when I had more than 17 workers, after master failover , there was always several workers not registered.  It seemed the parameter alluxio.worker.block.heartbeat.timeout.ms didn't take effect.

2, I shutdowned alluxio master , then waited ten minutes, started another master, all workers disappeared,  no worker could register. but still, the log from workers:

2016-12-01 11:19:49,000 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-12-01 11:19:49,004 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ ab02.prod.com/10.3.3.15:19998
2016-12-01 11:19:49,018 WARN  logger.type (SleepingTimer.java:tick) - Worker Pin List Sync last execution took 26462 ms. Longer than the interval 1000
2016-12-01 11:19:49,027 WARN  logger.type (SleepingTimer.java:tick) - Worker FileSystemMaster Sync last execution took 27382 ms. Longer than the interval 1000


So, obviously, the log of 'client registered' was somewhat misleading.

3, Then, I configured alluxio.worker.block.heartbeat.timeout.ms=1000000 in the alluxio-site.properties again, restarted the cluster, redid the test , the same result. from webui, I could ensure the parameter alluxio.worker.block.heartbeat.timeout.ms is 1000000. any other parameter needed ?

Likun

Gene Pang

unread,
Dec 1, 2016, 10:14:39 AM12/1/16
to Alluxio Users
Thanks for the info!

In your previous logs, you showed a worker log for a node that connected and one that did not connect. What are both of their hostnames, so we can see the master logs for those hostnames?

When configuring the alluxio-site.properties file, are you propagating the same configuration to all of the masters and workers? The master web UI will show the configuration for the master, but the workers will be using their own local configuration.

Thanks,
Gene

Kun Li

unread,
Dec 1, 2016, 8:06:16 PM12/1/16
to Alluxio Users
From the worker's log, you can see the following info:
Client environment:host.name=10.3.3.86
and
Client environment:host.name=10.3.3.11

That's the hostname that showed in the master's log.

And yes, all the master and workers share the same alluxio-site.properties file, since I deployed them with docker, they are using the same image.

Thanks for your time.

Likun

Kun Li

unread,
Dec 6, 2016, 9:22:34 AM12/6/16
to Alluxio Users
Some updates.

One of my colleague did some analysis, here it is:

There is a bug in the worker id management. Here is what happens.


BlockMaster maintains two queues: 1. the list of workers that have registered with the master: mWorkers (https://github.com/Alluxio/alluxio/blob/7c5bd6c101cd699c0da204c036878ad56c3615cf/core/server/src/main/java/alluxio/master/block/BlockMaster.java#L147 ) 2. the list of lost workers that have lost communication with the master, mLostWorkers.

 

1. Each worker when it first starts up, it will register with the master, and be put into the mWorkers list.

2. Each worker will heartbeat with the master, if master does not get the heartbeat from its worker for a configurable timeout, it will move the worker from mWorkers list to the mLostWorkers list

3. When the lost worker resumes sending heartbeat to the master, the master will move it from mLostWorkers to mWorkers

 

The above logic is quite straightforward. Each worker is also assigned a worker id. The id is the index of the worker in the mWorker list. The id is sent back from master to worker. so the worker knows what workerid it is assigned to. The worker will use the id to send heartbeat to master, and the master will use the worker id to keep track of which worker is alive, which is lost.

 

The mWorkers list and mLostWorkers list is *NOT* persisted. Hence, when a new master becomes leader, the new master does *NOT* know which worker has what id. It starts with an empty mWorkers and mLostWorkers list.

 

Now is the sequence of events that lead to the problem.

1. Start two masters, master A, and master B. master A is leader, master B is standby

2. Start 20 workers

3. the 20 workers get the leader master's address from zookeeper

4. the 20 workers register with the lead master A

5. master A has the 20 workers in into mWorkers list, each worker has got a worker id, assigned by master A

6. the 20 workers send heartbeat to the master periodically

7. kill master A

8. master B become leader

9. the 20 workers get the new leader master's address from zookeeper

10. the 20 workers send heartbeat to the master B

11. master B' mWorkers list is empty when it first take over the leadership, so in handling the worker's heartbeat, it will print out "Could not find worker id: {} for heartbeat."(https://github.com/Alluxio/alluxio/blob/7c5bd6c101cd699c0da204c036878ad56c3615cf/core/server/src/main/java/alluxio/master/block/BlockMaster.java#L674). After that, the master will ask the worker to re-register, and the worker will be assigned a new workerid from the new master.

12. However, after some workers (say 17) have re-registered with the new master, the new master B's mWorkers list is no longer empty. and when last three workers that heart beat with the new master, chances are the three worker's old worker id (id assigned by the prior master) is between 1 and 17, hence when the new master receives the heartbeat from the last three workers, it would think they have already registered. Hence, the new master would *NOT* ask the last three workers to re-register, and the last three workers would be permanently lost. The new master does *NOT* keep track of them, the new master does *NOT* know they exist, even though they continuously heartbeat with the master. The last three workers still think they are registered with the master, with the old worker id.

 

The bug is in this line below. It checks whether the worker id is present in mWorkers, if it does, it will think the worker has registered. it does not check whether the worker that sends the heart beat is the same worker that has registered with the master with that id.

MasterWorkerInfo worker = mWorkers.getFirstByField(ID_INDEX, workerId);(https://github.com/Alluxio/alluxio/blob/7c5bd6c101cd699c0da204c036878ad56c3615cf/core/server/src/main/java/alluxio/master/block/BlockMaster.java#L672)

 

How do you think about it, Gene.

Likun

Gene Pang

unread,
Dec 6, 2016, 9:58:51 AM12/6/16
to Alluxio Users
Hi,

This is a great analysis and investigation! Thank you so much for your effort!

I think your analysis is correct. Could you report this to the Alluxio Jira? https://alluxio.atlassian.net/projects/ALLUXIO/issues/?filter=allopenissues

We can continue the conversation about the solution in that Jira ticket.

Thanks!
Gene

Kun Li

unread,
Dec 7, 2016, 2:10:39 AM12/7/16
to Alluxio Users
Hi,

I have reported to the Alluxio Jira, id is ALLUXIO-2439. 
I‘m not familiar with the Jira system, if there is anything improper, pls let me know.

Likun

在 2016年12月6日星期二 UTC+8下午10:58:51,Gene Pang写道:

Gene Pang

unread,
Dec 7, 2016, 12:01:09 PM12/7/16
to Alluxio Users
Thanks for submitting the ticket! Here is the link to the ticket for reference: https://alluxio.atlassian.net/browse/ALLUXIO-2439

-Gene

Gene Pang

unread,
Jan 2, 2017, 9:10:54 AM1/2/17
to Alluxio Users
Hi,

The fix for the bug was merged, and it should be in the upcoming release. You can check out 1.4.0-RC (https://github.com/Alluxio/alluxio/tree/v1.4.0-RC2) to verify the fix.

Thanks,
Gene
Reply all
Reply to author
Forward
0 new messages