Xenserver 7.0 Reiniciando

390 views
Skip to first unread message

Tales Costa

unread,
Aug 4, 2016, 11:25:18 AM8/4/16
to xen-br
Bom dia srs,

Possuo 3 servidores Dell R410 com xenserver atualmente na versão 7 com todos os updates instalados.
Por algum motivo os servidores slave estão reiniciando aleatoriamente e sem motivo aparente. Em alguns casos o servidor chega a reiniciar mesmo sem nenhuma VM em funcionamento. 
Meses atrás utilizava apenas 2 servidores na versão 6.2 e funcionava perfeitamente. Ao atualizar para a versão 6.5 e incluir outro servidor o problema começou a ocorrer. 
Procurei alguma informação em logs do sistema mas não localizei nada. É como se o servidor fosse desligado de uma hora para outra porém nos logs da iDRAC não há nenhuma informação de queda ou problema com o hardware. Caso eu deixe somente 2 servidores funcionando (master + slave) roda por dias perfeitamente. 

Mais detalhes:

3 Dell R410
1 Storage MD3200i
2 LUN (VMs e Backup)
HA Ativo no pool
20 VMs (19 Linux e 1 Windows)

Alguém já passou por algum problema parecido ou possa indicar os arquivos de log do xen para troubleshooting. 

Agradeço desde já.

--
Tales

Marllus Lustosa

unread,
Aug 5, 2016, 10:50:54 AM8/5/16
to xen...@googlegroups.com
Desabilite temporariamente o HA e veja se eles param de reiniciar.

Após isso, prosseguiremos com o aprofundamento do problema.

--
--
Você recebeu esta mensagem porque está inscrito em Grupo "xen-br" do Grupos Google.
Para enviar mensagens para este grupo, envie um email para xen...@googlegroups.com
Para anular a inscrição neste grupo, envie um email para xen-br-unsubscribe@googlegroups.com
Para mais opções, visite este grupo em http://groups.google.com/group/xen-br?hl=pt-BR
Site do GU Xen-BR: http://www.xen-br.org
Antes de enviar sua primeira mensagem leia atentamente as regras para participação no site http://groups.google.com/group/xen-br/web/regras?hl=pt-BR

---
Você recebeu essa mensagem porque está inscrito no grupo "xen-br" dos Grupos do Google.
Para cancelar inscrição nesse grupo e parar de receber e-mails dele, envie um e-mail para xen-br+unsubscribe@googlegroups.com.
Para mais opções, acesse https://groups.google.com/d/optout.



--
att,

Marllus de Melo Lustosa 
Analista de Tecnologia da Informação / Infraestrutura - Universidade Federal do Ceará - UFC
Bacharel em Ciência da Computação - UFPI
Lattes -> http://lattes.cnpq.br/5163268610076256
HomePage -> http://www.marllus.com
+55 86 99159663
+55 85 97590092

Tales Rodarte

unread,
Aug 8, 2016, 9:57:04 AM8/8/16
to xen...@googlegroups.com
Olá Marllus,

Após a atualização desativei o HA para realizar os updates e mantive desativado até o momento.
Nenhum server reiniciou. Agora já não sei se foi por conta da atualização 6.5 para o 7.0 ou por conta do HA.

Agradeço a dica.

--
Tales
Para anular a inscrição neste grupo, envie um email para xen-br-un...@googlegroups.com

Para mais opções, visite este grupo em http://groups.google.com/group/xen-br?hl=pt-BR
Site do GU Xen-BR: http://www.xen-br.org
Antes de enviar sua primeira mensagem leia atentamente as regras para participação no site http://groups.google.com/group/xen-br/web/regras?hl=pt-BR

---
Você recebeu essa mensagem porque está inscrito no grupo "xen-br" dos Grupos do Google.
Para cancelar inscrição nesse grupo e parar de receber e-mails dele, envie um e-mail para xen-br+un...@googlegroups.com.

Marllus Lustosa

unread,
Aug 8, 2016, 10:11:40 AM8/8/16
to xen...@googlegroups.com
Existe alguma falha em alguma placa de rede?

Talvez ele esteja entrando em modo host fencing mesmo que realmente não esteja ocorrendo nada de errado.

Veja se o xapi está ok.
# service xapi status


Tales Rodarte

unread,
Aug 8, 2016, 4:13:30 PM8/8/16
to xen...@googlegroups.com
Marllus,

Nos logs do sistema e do switch não chega a apresentar desconexão.

Status do xapi:

[root@xen-01 ~]# service xapi status
● xapi.service - SYSV: XenServer server
   Loaded: loaded (/etc/rc.d/init.d/xapi)
  Drop-In: /etc/systemd/system/xapi.service.d
           └─local.conf, slice.conf
   Active: active (running) since Thu 2016-08-04 09:38:25 BRT; 4 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 1825 ExecStart=/etc/rc.d/init.d/xapi start (code=exited, status=0/SUCCESS)
 Main PID: 1901 (xapi)
   CGroup: /control.slice/xapi.service
           └─1901 /opt/xensource/bin/xapi -daemon true -writereadyfile /var/run/xapi_startup.cookie -writeinitcomplete /var/run/xapi_init_complete.cookie -onsystemboot

Nos outros o resultado é o mesmo.

--
Tales
Para anular a inscrição neste grupo, envie um email para xen-br-un...@googlegroups.com

Para mais opções, visite este grupo em http://groups.google.com/group/xen-br?hl=pt-BR
Site do GU Xen-BR: http://www.xen-br.org
Antes de enviar sua primeira mensagem leia atentamente as regras para participação no site http://groups.google.com/group/xen-br/web/regras?hl=pt-BR

---
Você recebeu essa mensagem porque está inscrito no grupo "xen-br" dos Grupos do Google.
Para cancelar inscrição nesse grupo e parar de receber e-mails dele, envie um e-mail para xen-br+un...@googlegroups.com.

Odilair Leão

unread,
Aug 12, 2016, 8:44:47 PM8/12/16
to xen-br, talesr...@gmail.com
Tales, revise as configurações de IP dos servidores. Pode ser que haja algum conflito de IP na sua rede, sendo assim, quando você liga o HA, ele percebe uma falha de comunicação e o servidor reinicia.

Tales Rodarte

unread,
Aug 17, 2016, 10:13:14 AM8/17/16
to Odilair Leão, xen-br
Olá Odilair,

Os IPs estão divididos corretamente e os hosts de cada nó configurados.
Já se passaram 13 dias e não ocorreu nenhum reboot repentino. \o

Estou pensando em ativar o HA novamente e monitorar durante os próximos dias para ter certeza se o problema seja com ele.

Agradeço as dicas.

--
Tales

Marllus Lustosa

unread,
Aug 18, 2016, 11:50:43 AM8/18/16
to xen...@googlegroups.com, Odilair Leão
Você poderia testar ativando o HA novamente e depois olhar os arquivos de log para ver qual o erro que ele está reconhecendo para ativar o fencing/failover.

Uma pergunta: Onde você está criando o HA heartbeat SR? Pode (a possibilidade é remota, mas pode...) acontecer de o sistema HA do pool não estar conseguindo se comunicar com o SR (timeout, por exemplo) ativando, assim, o modo fencing/failover e reiniciando o mesmo (pois ele irá pensar que o host morreu, pois o estado de vida não é retornado na requisição).
Foi um chute, mas em troubleshooting a gente considera tudo. Abraço.


Para anular a inscrição neste grupo, envie um email para xen-br-unsubscribe@googlegroups.com

Para mais opções, visite este grupo em http://groups.google.com/group/xen-br?hl=pt-BR
Site do GU Xen-BR: http://www.xen-br.org
Antes de enviar sua primeira mensagem leia atentamente as regras para participação no site http://groups.google.com/group/xen-br/web/regras?hl=pt-BR

---
Você recebeu essa mensagem porque está inscrito no grupo "xen-br" dos Grupos do Google.
Para cancelar inscrição nesse grupo e parar de receber e-mails dele, envie um e-mail para xen-br+unsubscribe@googlegroups.com.

Para mais opções, acesse https://groups.google.com/d/optout.

claytondf menezes

unread,
Aug 18, 2016, 2:08:20 PM8/18/16
to xen-br
Sofri do mesmo problema na versão 6.5, tenho dois servidores instalados e com todos os updates.
Foi no total de 3 ocorrências no periodo de 3 dias.
Fiz um ajuste no NTP que apresentava problemas de sincronismo...
Sem problemas aparente...
Já estou fazendo os testes para Migrar para o XenServer7.0

Tales Rodarte

unread,
Aug 18, 2016, 2:39:16 PM8/18/16
to xen...@googlegroups.com
Marllus,

Ativo o HA na lun onde as VMs estão. Não cheguei a ativar na outra LUN (backup) mas vou considerar fazer isso também.
O estranho é que cada hora ocorre com um server.
Vou ativar hoje o HA e amanhã posto novidades caso reinicie.

--
Tales

Tales Rodarte

unread,
Aug 18, 2016, 3:00:07 PM8/18/16
to xen...@googlegroups.com
Clayton,

Cheguei a considerar está falta de sincronia (ao migrar para o 6.5), só que mesmo sincronizados os servidores continuaram com a falha.

root@xx:/# for i in `echo 123; echo 124; echo 125`; do sshpass -p 'xxxx' ssh ro...@xxx.xx.x.$i 'date' ; done
Qui Ago 18 15:41:53 BRT 2016
Qui Ago 18 15:41:53 BRT 2016
Qui Ago 18 15:41:53 BRT 2016

--
Tales

Tales Rodarte

unread,
Aug 22, 2016, 1:06:13 PM8/22/16
to xen...@googlegroups.com
Pessoal,

Ativei o HA hoje e minutos depois um server já reiniciou. Fiz o teste ativando em ambas as LUNs que tenho.

Segue o xha.log do servidor master:

Aug 18 18:20:04 BRT 2016 [notice] HA daemon started - built at May  9 10:05:13 UTC 2016 -
Aug 18 18:20:04 BRT 2016 [info] CONF: my_index=2 num_host=3 t1=7 T1=60 t2=7 T2=60 Wh=60 Ws=75 Tboot=120 Tenable=120 tXapi=60 TXapi=120 RestartXapi=1 TRestartXapi=300 Tlicence=30
Aug 18 18:20:04 BRT 2016 [info] LOG: logmask = 100000
Aug 18 18:20:04 BRT 2016 [info] LOG:  OFF:18(DUMPPACKET)
Aug 18 18:20:04 BRT 2016 [info] LOG:  OFF:19(TRACE)
Aug 18 18:20:04 BRT 2016 [info] LOG:  ON :20(FH_TRACE)
Aug 18 18:20:04 BRT 2016 [info] LOG:  OFF:21(LM_TRACE)
Aug 18 18:20:04 BRT 2016 [info] LOG:  OFF:22(SCRIPT)
Aug 18 18:20:04 BRT 2016 [info] LOG:  OFF:23(SC_WARNING)
Aug 18 18:20:04 BRT 2016 [info] COM: com_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] SF: phase 0 initialization...
Aug 18 18:20:04 BRT 2016 [info] LM: lm_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] BM: bm_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] HB: hb_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] Xapimon: xapimon_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] SC: script_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] SC: hostweight_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] SM: sm_initialize(0).
Aug 18 18:20:04 BRT 2016 [info] COM: com_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] SF: phase 1 initialization...
Aug 18 18:20:04 BRT 2016 [info] WD: watchdog mode = HYPERVISOR.
Aug 18 18:20:04 BRT 2016 [info] WD: (watchdog_create) success label=statefile id=1.
Aug 18 18:20:04 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 18 18:20:04 BRT 2016 [info] LM: lm_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] BM: bm_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] HB: hb_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] WD: (watchdog_create) success label=heartbeat id=2.
Aug 18 18:20:04 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 18 18:20:04 BRT 2016 [info] Xapimon: xapimon_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] SC: script_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] SC: hostweight_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] SC: host weight has changed from 0 to 1
Aug 18 18:20:04 BRT 2016 [info] SC:     builtinclass=native weight=1
Aug 18 18:20:04 BRT 2016 [info] SM: sm_initialize(1).
Aug 18 18:20:04 BRT 2016 [info] Initial weight = 1.
Aug 18 18:20:04 BRT 2016 [info] State File read/write enabled.
Aug 18 18:20:04 BRT 2016 [info] Heartbeat send/receive enabled.
Aug 18 18:20:11 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (0@m)].
Aug 18 18:20:11 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (@1m)].
Aug 18 18:20:11 BRT 2016 [debug] SM: heartbeat is ready
Aug 18 18:20:11 BRT 2016 [notice] HA daemon successfuly started acceesing the State-File.
Aug 18 18:20:14 BRT 2016 [warn] BM: cannot open bonding status file (/proc/net/bonding/eth0). (2)
Aug 18 18:20:14 BRT 2016 [info] BM: this is not bonded. Terminating bonding thread.
Aug 18 18:20:26 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (@@@)].
Aug 18 18:20:26 BRT 2016 [debug] SM: statefile is ready
Aug 18 18:20:26 BRT 2016 [debug] SM: other hosts are ready
Aug 18 18:20:26 BRT 2016 [info] Start Criteria: pool state=INIT; excluded=FALSE; liveset=EMPTY
Aug 18 18:20:26 BRT 2016 [info] Start Criteria: Forming a new liveset with all configured hosts.
Aug 18 18:20:26 BRT 2016 [info] Start Criteria: current_liveset = (000)
Aug 18 18:20:26 BRT 2016 [info] Start Criteria: proposed_liveset = (111)
Aug 18 18:22:25 BRT 2016 [notice] Start Criteria: Start timeout (0x5)
Aug 18 18:22:25 BRT 2016 [err] Start timeout (1901) Cannot start HA daemon, because Start Criteria is not met for the local host.  - Abort.
Aug 18 18:22:25 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (11D)].
Aug 18 18:22:26 BRT 2016 [notice] HA daemon started shutdown process.
Aug 18 18:22:26 BRT 2016 [info] SM: sm_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] SC: hostweight_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] SC: script_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] Xapimon: xapimon_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] HB: hb_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] WD: (watchdog_close) label=heartbeat stopping watchdog timer.
Aug 18 18:22:26 BRT 2016 [info] WD: (watchdog_close) label=heartbeat watchdog timer has been stopped successfully.
Aug 18 18:22:26 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 18 18:22:26 BRT 2016 [info] BM: bm_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] LM: lm_initialize(-1).
Aug 18 18:22:26 BRT 2016 [info] WD: (watchdog_close) label=statefile stopping watchdog timer.
Aug 18 18:22:26 BRT 2016 [info] WD: (watchdog_close) label=statefile watchdog timer has been stopped successfully.
Aug 18 18:22:26 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is deleted.
Aug 18 18:22:26 BRT 2016 [info] COM: com_initialize(-1).
Aug 18 18:22:26 BRT 2016 [notice] HA daemon completed shutdown process.
Aug 22 12:18:14 BRT 2016 [notice] HA daemon started - built at May  9 10:05:13 UTC 2016 -
Aug 22 12:18:14 BRT 2016 [info] CONF: my_index=2 num_host=3 t1=7 T1=60 t2=7 T2=60 Wh=60 Ws=75 Tboot=120 Tenable=120 tXapi=60 TXapi=120 RestartXapi=1 TRestartXapi=300 Tlicence=30
Aug 22 12:18:14 BRT 2016 [info] LOG: logmask = 100000
Aug 22 12:18:14 BRT 2016 [info] LOG:  OFF:18(DUMPPACKET)
Aug 22 12:18:14 BRT 2016 [info] LOG:  OFF:19(TRACE)
Aug 22 12:18:14 BRT 2016 [info] LOG:  ON :20(FH_TRACE)
Aug 22 12:18:14 BRT 2016 [info] LOG:  OFF:21(LM_TRACE)
Aug 22 12:18:14 BRT 2016 [info] LOG:  OFF:22(SCRIPT)
Aug 22 12:18:14 BRT 2016 [info] LOG:  OFF:23(SC_WARNING)
Aug 22 12:18:14 BRT 2016 [info] COM: com_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] SF: phase 0 initialization...
Aug 22 12:18:14 BRT 2016 [info] LM: lm_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] BM: bm_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] HB: hb_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] Xapimon: xapimon_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] SC: script_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] SC: hostweight_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] SM: sm_initialize(0).
Aug 22 12:18:14 BRT 2016 [info] COM: com_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] SF: phase 1 initialization...
Aug 22 12:18:14 BRT 2016 [info] WD: watchdog mode = HYPERVISOR.
Aug 22 12:18:14 BRT 2016 [info] WD: (watchdog_create) success label=statefile id=1.
Aug 22 12:18:14 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:18:14 BRT 2016 [info] LM: lm_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] BM: bm_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] HB: hb_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] WD: (watchdog_create) success label=heartbeat id=2.
Aug 22 12:18:14 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:18:14 BRT 2016 [info] Xapimon: xapimon_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] SC: script_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] SC: hostweight_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] SC: host weight has changed from 0 to 1
Aug 22 12:18:14 BRT 2016 [info] SC:     builtinclass=native weight=1
Aug 22 12:18:14 BRT 2016 [info] SM: sm_initialize(1).
Aug 22 12:18:14 BRT 2016 [info] Initial weight = 1.
Aug 22 12:18:14 BRT 2016 [info] State File read/write enabled.
Aug 22 12:18:14 BRT 2016 [info] Heartbeat send/receive enabled.
Aug 22 12:18:21 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (0@m)].
Aug 22 12:18:21 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (@1m)].
Aug 22 12:18:21 BRT 2016 [debug] SM: heartbeat is ready
Aug 22 12:18:21 BRT 2016 [notice] HA daemon successfuly started acceesing the State-File.
Aug 22 12:18:24 BRT 2016 [warn] BM: cannot open bonding status file (/proc/net/bonding/eth0). (2)
Aug 22 12:18:24 BRT 2016 [info] BM: this is not bonded. Terminating bonding thread.
Aug 22 12:18:36 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (@@@)].
Aug 22 12:18:36 BRT 2016 [debug] SM: statefile is ready
Aug 22 12:18:36 BRT 2016 [debug] SM: other hosts are ready
Aug 22 12:18:36 BRT 2016 [info] Start Criteria: pool state=INIT; excluded=FALSE; liveset=EMPTY
Aug 22 12:18:36 BRT 2016 [info] Start Criteria: Forming a new liveset with all configured hosts.
Aug 22 12:18:36 BRT 2016 [info] Start Criteria: current_liveset = (000)
Aug 22 12:18:36 BRT 2016 [info] Start Criteria: proposed_liveset = (111)
Aug 22 12:20:35 BRT 2016 [notice] Start Criteria: Start timeout (0x5)
Aug 22 12:20:35 BRT 2016 [err] Start timeout (1901) Cannot start HA daemon, because Start Criteria is not met for the local host.  - Abort.
Aug 22 12:20:35 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (11D)].
Aug 22 12:20:36 BRT 2016 [notice] HA daemon started shutdown process.
Aug 22 12:20:36 BRT 2016 [info] SM: sm_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] SC: hostweight_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] SC: script_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] Xapimon: xapimon_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] HB: hb_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] WD: (watchdog_close) label=heartbeat stopping watchdog timer.
Aug 22 12:20:36 BRT 2016 [info] WD: (watchdog_close) label=heartbeat watchdog timer has been stopped successfully.
Aug 22 12:20:36 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:20:36 BRT 2016 [info] BM: bm_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] LM: lm_initialize(-1).
Aug 22 12:20:36 BRT 2016 [info] WD: (watchdog_close) label=statefile stopping watchdog timer.
Aug 22 12:20:36 BRT 2016 [info] WD: (watchdog_close) label=statefile watchdog timer has been stopped successfully.
Aug 22 12:20:36 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is deleted.
Aug 22 12:20:36 BRT 2016 [info] COM: com_initialize(-1).
Aug 22 12:20:36 BRT 2016 [notice] HA daemon completed shutdown process.
Aug 22 12:35:21 BRT 2016 [notice] HA daemon started - built at May  9 10:05:13 UTC 2016 -
Aug 22 12:35:21 BRT 2016 [info] CONF: my_index=2 num_host=3 t1=7 T1=60 t2=7 T2=60 Wh=60 Ws=75 Tboot=120 Tenable=120 tXapi=60 TXapi=120 RestartXapi=1 TRestartXapi=300 Tlicence=30
Aug 22 12:35:21 BRT 2016 [info] LOG: logmask = 100000
Aug 22 12:35:21 BRT 2016 [info] LOG:  OFF:18(DUMPPACKET)
Aug 22 12:35:21 BRT 2016 [info] LOG:  OFF:19(TRACE)
Aug 22 12:35:21 BRT 2016 [info] LOG:  ON :20(FH_TRACE)
Aug 22 12:35:21 BRT 2016 [info] LOG:  OFF:21(LM_TRACE)
Aug 22 12:35:21 BRT 2016 [info] LOG:  OFF:22(SCRIPT)
Aug 22 12:35:21 BRT 2016 [info] LOG:  OFF:23(SC_WARNING)
Aug 22 12:35:21 BRT 2016 [info] COM: com_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] SF: phase 0 initialization...
Aug 22 12:35:21 BRT 2016 [info] LM: lm_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] BM: bm_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] HB: hb_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] Xapimon: xapimon_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] SC: script_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] SC: hostweight_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] SM: sm_initialize(0).
Aug 22 12:35:21 BRT 2016 [info] COM: com_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] SF: phase 1 initialization...
Aug 22 12:35:21 BRT 2016 [info] WD: watchdog mode = HYPERVISOR.
Aug 22 12:35:21 BRT 2016 [info] WD: (watchdog_create) success label=statefile id=1.
Aug 22 12:35:21 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:35:21 BRT 2016 [info] LM: lm_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] BM: bm_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] HB: hb_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] WD: (watchdog_create) success label=heartbeat id=2.
Aug 22 12:35:21 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:35:21 BRT 2016 [info] Xapimon: xapimon_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] SC: script_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] SC: hostweight_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] SC: host weight has changed from 0 to 1
Aug 22 12:35:21 BRT 2016 [info] SC:     builtinclass=native weight=1
Aug 22 12:35:21 BRT 2016 [info] SM: sm_initialize(1).
Aug 22 12:35:21 BRT 2016 [info] Initial weight = 1.
Aug 22 12:35:21 BRT 2016 [info] State File read/write enabled.
Aug 22 12:35:21 BRT 2016 [info] Heartbeat send/receive enabled.
Aug 22 12:35:28 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (0@m)].
Aug 22 12:35:28 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (@1m)].
Aug 22 12:35:28 BRT 2016 [debug] SM: heartbeat is ready
Aug 22 12:35:28 BRT 2016 [notice] HA daemon successfuly started acceesing the State-File.
Aug 22 12:35:31 BRT 2016 [warn] BM: cannot open bonding status file (/proc/net/bonding/eth0). (2)
Aug 22 12:35:31 BRT 2016 [info] BM: this is not bonded. Terminating bonding thread.
Aug 22 12:35:43 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (@@@)].
Aug 22 12:35:43 BRT 2016 [debug] SM: statefile is ready
Aug 22 12:35:43 BRT 2016 [debug] SM: other hosts are ready
Aug 22 12:35:43 BRT 2016 [info] Start Criteria: pool state=INIT; excluded=FALSE; liveset=EMPTY
Aug 22 12:35:43 BRT 2016 [info] Start Criteria: Forming a new liveset with all configured hosts.
Aug 22 12:35:43 BRT 2016 [info] Start Criteria: current_liveset = (000)
Aug 22 12:35:43 BRT 2016 [info] Start Criteria: proposed_liveset = (111)
Aug 22 12:35:56 BRT 2016 [info] Start Criteria: all hosts are ready to start.
Aug 22 12:35:56 BRT 2016 [info] Start Criteria: current_liveset = (111)
Aug 22 12:35:56 BRT 2016 [info] Start Criteria: proposed_liveset = (111)
Aug 22 12:35:56 BRT 2016 [notice] Start Criteria: Changing the pool state to ACTIVE.
Aug 22 12:35:56 BRT 2016 [info] Start Criteria has been met for the local host.
Aug 22 12:35:56 BRT 2016 [info] Joined.
Aug 22 12:35:56 BRT 2016 [info] Xapi monitor started.
Aug 22 12:35:56 BRT 2016 [info] Fencing is armed.
Aug 22 12:35:56 BRT 2016 [notice] The local host has transitioned to online state.
Aug 22 12:35:56 BRT 2016 [warn] HA daemon set scheduler failed (sys 1) status=-1 policy=2 priority=50
Aug 22 12:35:56 BRT 2016 [debug] HA daemon mlockall succeeded
Aug 22 12:35:56 BRT 2016 [info] SC: propose_master enter.
Aug 22 12:35:58 BRT 2016 [notice] propose_master returns TRUE.
Aug 22 12:40:15 BRT 2016 [debug] FH: Start fault handler.
Aug 22 12:40:15 BRT 2016 [debug] FH: HB/SF state has become stable.
Aug 22 12:40:15 BRT 2016 [debug] FH: weight value [1] is commited.
Aug 22 12:40:18 BRT 2016 [debug] FH: waiting for consistent view...
Aug 22 12:41:18 BRT 2016 [warn] Host (0) and the local host do not agre on the view to the pool membership.
Aug 22 12:41:18 BRT 2016 [warn]     local HB domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     local SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     remote HB domain = (101)
Aug 22 12:41:18 BRT 2016 [warn]     remote SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     remote HB domain on SF = (101)
Aug 22 12:41:18 BRT 2016 [warn]     remote SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain = (101)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain = (011)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain on SF = (011)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain on SF = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     weight[0] = 1
Aug 22 12:41:18 BRT 2016 [warn]     weight[1] = 1
Aug 22 12:41:18 BRT 2016 [warn]     weight[2] = 1
Aug 22 12:41:18 BRT 2016 [warn] after merger:
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain = (101)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain = (011)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain on SF = (000)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain = (111)
Aug 22 12:41:18 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 12:41:18 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 12:41:18 BRT 2016 [warn]     weight[0] = 1
Aug 22 12:41:18 BRT 2016 [warn]     weight[1] = 1
Aug 22 12:41:18 BRT 2016 [warn]     weight[2] = 1
Aug 22 12:41:18 BRT 2016 [warn]     merged HB domain = (101)
Aug 22 12:41:18 BRT 2016 [warn]     merged SF domain = (111)
Aug 22 12:41:18 BRT 2016 [debug] FH: All hosts now have consistent view to the pool membership.
Aug 22 12:41:25 BRT 2016 [warn] SM: partition_size[0] = 2 2
Aug 22 12:41:25 BRT 2016 [warn] SM: partition_size[1] = 0 0
Aug 22 12:41:25 BRT 2016 [warn] SM: partition_size[2] = 2 2
Aug 22 12:41:25 BRT 2016 [warn] SM: winner_index = 0
Aug 22 12:41:25 BRT 2016 [debug] SM: I have won by the result of weight caluculation.
Aug 22 12:41:25 BRT 2016 [debug] FH: I have won.
Aug 22 12:41:25 BRT 2016 [debug] SM: Node (1) will be removed from liveset.
Aug 22 12:41:25 BRT 2016 [debug] SM: HB domain is updated = (101)
Aug 22 12:41:25 BRT 2016 [debug] SM: SF domain is updated = (101)
Aug 22 12:42:01 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=35746.
Aug 22 12:42:01 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=37822.
Aug 22 12:42:47 BRT 2016 [notice] Liveset has been updated.  new liveset = (101)
Aug 22 12:42:47 BRT 2016 [debug] FH: End fault handler.
Aug 22 12:43:01 BRT 2016 [info] SC: (script_service_do_query_liveset) "Heartbeat approaching timeout" turned FALSE
Aug 22 12:43:01 BRT 2016 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE
Aug 22 12:43:50 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (1@m)].
Aug 22 12:44:10 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (1@1)].
Aug 22 12:44:18 BRT 2016 [debug] Join Agent: Send ack to join request from host (1).
Aug 22 12:44:18 BRT 2016 [info] Join Agent: Join request from host (1) is accepted by the local host.
Aug 22 12:44:18 BRT 2016 [notice] Liveset has been updated.  new liveset = (111)
Aug 22 12:44:18 BRT 2016 [info] Join Agent: proposed_liveset = (111)
Aug 22 12:47:41 BRT 2016 [info] SC: set_pool_state "invalid" enter.
Aug 22 12:47:41 BRT 2016 [err] Invalid pool state (1906).
Aug 22 12:47:42 BRT 2016 [notice] HA daemon started shutdown process.
Aug 22 12:47:42 BRT 2016 [info] SM: sm_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] SC: hostweight_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] SC: script_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] Xapimon: xapimon_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] HB: hb_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] WD: (watchdog_close) label=heartbeat stopping watchdog timer.
Aug 22 12:47:42 BRT 2016 [info] WD: (watchdog_close) label=heartbeat watchdog timer has been stopped successfully.
Aug 22 12:47:42 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:47:42 BRT 2016 [info] BM: bm_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] LM: lm_initialize(-1).
Aug 22 12:47:42 BRT 2016 [info] WD: (watchdog_close) label=statefile stopping watchdog timer.
Aug 22 12:47:42 BRT 2016 [info] WD: (watchdog_close) label=statefile watchdog timer has been stopped successfully.
Aug 22 12:47:42 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is deleted.
Aug 22 12:47:42 BRT 2016 [info] COM: com_initialize(-1).
Aug 22 12:47:42 BRT 2016 [notice] HA daemon completed shutdown process.
Aug 22 12:55:02 BRT 2016 [notice] HA daemon started - built at May  9 10:05:13 UTC 2016 -
Aug 22 12:55:02 BRT 2016 [info] CONF: my_index=2 num_host=3 t1=7 T1=60 t2=7 T2=60 Wh=60 Ws=75 Tboot=120 Tenable=120 tXapi=60 TXapi=120 RestartXapi=1 TRestartXapi=300 Tlicence=30
Aug 22 12:55:02 BRT 2016 [info] LOG: logmask = 100000
Aug 22 12:55:02 BRT 2016 [info] LOG:  OFF:18(DUMPPACKET)
Aug 22 12:55:02 BRT 2016 [info] LOG:  OFF:19(TRACE)
Aug 22 12:55:02 BRT 2016 [info] LOG:  ON :20(FH_TRACE)
Aug 22 12:55:02 BRT 2016 [info] LOG:  OFF:21(LM_TRACE)
Aug 22 12:55:02 BRT 2016 [info] LOG:  OFF:22(SCRIPT)
Aug 22 12:55:02 BRT 2016 [info] LOG:  OFF:23(SC_WARNING)
Aug 22 12:55:02 BRT 2016 [info] COM: com_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] SF: phase 0 initialization...
Aug 22 12:55:02 BRT 2016 [info] LM: lm_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] BM: bm_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] HB: hb_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] Xapimon: xapimon_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] SC: script_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] SC: hostweight_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] SM: sm_initialize(0).
Aug 22 12:55:02 BRT 2016 [info] COM: com_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] SF: phase 1 initialization...
Aug 22 12:55:02 BRT 2016 [info] WD: watchdog mode = HYPERVISOR.
Aug 22 12:55:02 BRT 2016 [info] WD: (watchdog_create) success label=statefile id=1.
Aug 22 12:55:02 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:55:02 BRT 2016 [info] LM: lm_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] BM: bm_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] HB: hb_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] WD: (watchdog_create) success label=heartbeat id=2.
Aug 22 12:55:02 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 12:55:02 BRT 2016 [info] Xapimon: xapimon_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] SC: script_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] SC: hostweight_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] SC: host weight has changed from 0 to 1
Aug 22 12:55:02 BRT 2016 [info] SC:     builtinclass=native weight=1
Aug 22 12:55:02 BRT 2016 [info] SM: sm_initialize(1).
Aug 22 12:55:02 BRT 2016 [info] Initial weight = 1.
Aug 22 12:55:02 BRT 2016 [info] State File read/write enabled.
Aug 22 12:55:02 BRT 2016 [info] Heartbeat send/receive enabled.
Aug 22 12:55:09 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (0@m)].
Aug 22 12:55:09 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (@1m)].
Aug 22 12:55:09 BRT 2016 [debug] SM: heartbeat is ready
Aug 22 12:55:09 BRT 2016 [notice] HA daemon successfuly started acceesing the State-File.
Aug 22 12:55:12 BRT 2016 [warn] BM: cannot open bonding status file (/proc/net/bonding/eth0). (2)
Aug 22 12:55:12 BRT 2016 [info] BM: this is not bonded. Terminating bonding thread.
Aug 22 12:55:24 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (@@@)].
Aug 22 12:55:24 BRT 2016 [debug] SM: statefile is ready
Aug 22 12:55:24 BRT 2016 [debug] SM: other hosts are ready
Aug 22 12:55:24 BRT 2016 [info] Start Criteria: pool state=INIT; excluded=FALSE; liveset=EMPTY
Aug 22 12:55:24 BRT 2016 [info] Start Criteria: Forming a new liveset with all configured hosts.
Aug 22 12:55:24 BRT 2016 [info] Start Criteria: current_liveset = (000)
Aug 22 12:55:24 BRT 2016 [info] Start Criteria: proposed_liveset = (111)
Aug 22 12:55:37 BRT 2016 [info] Start Criteria: all hosts are ready to start.
Aug 22 12:55:37 BRT 2016 [info] Start Criteria: current_liveset = (111)
Aug 22 12:55:37 BRT 2016 [info] Start Criteria: proposed_liveset = (111)
Aug 22 12:55:37 BRT 2016 [notice] Start Criteria: Changing the pool state to ACTIVE.
Aug 22 12:55:37 BRT 2016 [info] Start Criteria has been met for the local host.
Aug 22 12:55:37 BRT 2016 [info] Joined.
Aug 22 12:55:37 BRT 2016 [info] Xapi monitor started.
Aug 22 12:55:37 BRT 2016 [info] Fencing is armed.
Aug 22 12:55:37 BRT 2016 [notice] The local host has transitioned to online state.
Aug 22 12:55:37 BRT 2016 [warn] HA daemon set scheduler failed (sys 1) status=-1 policy=2 priority=50
Aug 22 12:55:37 BRT 2016 [debug] HA daemon mlockall succeeded
Aug 22 12:55:37 BRT 2016 [info] SC: propose_master enter.
Aug 22 12:55:39 BRT 2016 [notice] propose_master returns TRUE.
Aug 22 13:23:52 BRT 2016 [debug] FH: Start fault handler.
Aug 22 13:23:53 BRT 2016 [debug] FH: HB/SF state has become stable.
Aug 22 13:23:53 BRT 2016 [debug] FH: weight value [1] is commited.
Aug 22 13:23:56 BRT 2016 [debug] FH: waiting for consistent view...
Aug 22 13:24:56 BRT 2016 [warn] Host (0) and the local host do not agre on the view to the pool membership.
Aug 22 13:24:56 BRT 2016 [warn]     local HB domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     local SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     remote HB domain = (101)
Aug 22 13:24:56 BRT 2016 [warn]     remote SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     remote HB domain on SF = (101)
Aug 22 13:24:56 BRT 2016 [warn]     remote SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain = (101)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain = (011)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain on SF = (011)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain on SF = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     weight[0] = 1
Aug 22 13:24:56 BRT 2016 [warn]     weight[1] = 1
Aug 22 13:24:56 BRT 2016 [warn]     weight[2] = 1
Aug 22 13:24:56 BRT 2016 [warn] after merger:
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain = (101)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain = (011)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain on SF = (000)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:24:56 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 13:24:56 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:24:56 BRT 2016 [warn]     weight[0] = 1
Aug 22 13:24:56 BRT 2016 [warn]     weight[1] = 1
Aug 22 13:24:56 BRT 2016 [warn]     weight[2] = 1
Aug 22 13:24:56 BRT 2016 [warn]     merged HB domain = (101)
Aug 22 13:24:56 BRT 2016 [warn]     merged SF domain = (111)
Aug 22 13:24:56 BRT 2016 [debug] FH: All hosts now have consistent view to the pool membership.
Aug 22 13:25:03 BRT 2016 [warn] SM: partition_size[0] = 2 2
Aug 22 13:25:03 BRT 2016 [warn] SM: partition_size[1] = 0 0
Aug 22 13:25:03 BRT 2016 [warn] SM: partition_size[2] = 2 2
Aug 22 13:25:03 BRT 2016 [warn] SM: winner_index = 0
Aug 22 13:25:03 BRT 2016 [debug] SM: I have won by the result of weight caluculation.
Aug 22 13:25:03 BRT 2016 [debug] FH: I have won.
Aug 22 13:25:03 BRT 2016 [debug] SM: Node (1) will be removed from liveset.
Aug 22 13:25:03 BRT 2016 [debug] SM: HB domain is updated = (101)
Aug 22 13:25:03 BRT 2016 [debug] SM: SF domain is updated = (101)
Aug 22 13:25:33 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=30322.
Aug 22 13:25:33 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=33350.
Aug 22 13:26:25 BRT 2016 [notice] Liveset has been updated.  new liveset = (101)
Aug 22 13:26:25 BRT 2016 [debug] FH: End fault handler.
Aug 22 13:26:33 BRT 2016 [info] SC: (script_service_do_query_liveset) "Heartbeat approaching timeout" turned FALSE
Aug 22 13:26:33 BRT 2016 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE
Aug 22 13:27:28 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (1@m)].
Aug 22 13:27:47 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (1@1)].
Aug 22 13:27:47 BRT 2016 [debug] Join Agent: Send ack to join request from host (1).
Aug 22 13:27:51 BRT 2016 [info] Join Agent: Join request from host (1) is accepted by the local host.
Aug 22 13:27:51 BRT 2016 [notice] Liveset has been updated.  new liveset = (111)
Aug 22 13:27:51 BRT 2016 [info] Join Agent: proposed_liveset = (111)
Aug 22 13:31:14 BRT 2016 [debug] FH: Start fault handler.
Aug 22 13:31:14 BRT 2016 [debug] FH: HB/SF state has become stable.
Aug 22 13:31:14 BRT 2016 [debug] FH: weight value [1] is commited.
Aug 22 13:31:16 BRT 2016 [debug] FH: waiting for consistent view...
Aug 22 13:32:16 BRT 2016 [warn] Host (0) and the local host do not agre on the view to the pool membership.
Aug 22 13:32:16 BRT 2016 [warn]     local HB domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     local SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     remote HB domain = (101)
Aug 22 13:32:16 BRT 2016 [warn]     remote SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     remote HB domain on SF = (101)
Aug 22 13:32:16 BRT 2016 [warn]     remote SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain = (101)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain = (011)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain on SF = (011)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain on SF = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     weight[0] = 1
Aug 22 13:32:16 BRT 2016 [warn]     weight[1] = 1
Aug 22 13:32:16 BRT 2016 [warn]     weight[2] = 1
Aug 22 13:32:16 BRT 2016 [warn] after merger:
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain = (101)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain = (011)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain on SF = (000)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain = (111)
Aug 22 13:32:16 BRT 2016 [warn]     other HB domain on SF = (101)
Aug 22 13:32:16 BRT 2016 [warn]     other SF domain on HB = (111)
Aug 22 13:32:16 BRT 2016 [warn]     weight[0] = 1
Aug 22 13:32:16 BRT 2016 [warn]     weight[1] = 1
Aug 22 13:32:16 BRT 2016 [warn]     weight[2] = 1
Aug 22 13:32:16 BRT 2016 [warn]     merged HB domain = (101)
Aug 22 13:32:16 BRT 2016 [warn]     merged SF domain = (111)
Aug 22 13:32:16 BRT 2016 [debug] FH: All hosts now have consistent view to the pool membership.
Aug 22 13:32:24 BRT 2016 [warn] SM: partition_size[0] = 2 2
Aug 22 13:32:24 BRT 2016 [warn] SM: partition_size[1] = 0 0
Aug 22 13:32:24 BRT 2016 [warn] SM: partition_size[2] = 2 2
Aug 22 13:32:24 BRT 2016 [warn] SM: winner_index = 0
Aug 22 13:32:24 BRT 2016 [debug] SM: I have won by the result of weight caluculation.
Aug 22 13:32:24 BRT 2016 [debug] FH: I have won.
Aug 22 13:32:24 BRT 2016 [debug] SM: Node (1) will be removed from liveset.
Aug 22 13:32:24 BRT 2016 [debug] SM: HB domain is updated = (101)
Aug 22 13:32:24 BRT 2016 [debug] SM: SF domain is updated = (101)
Aug 22 13:33:11 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=48050.
Aug 22 13:33:11 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=44300.
Aug 22 13:33:46 BRT 2016 [notice] Liveset has been updated.  new liveset = (101)
Aug 22 13:33:46 BRT 2016 [debug] FH: End fault handler.
Aug 22 13:33:51 BRT 2016 [info] SC: (script_service_do_query_liveset) "Heartbeat approaching timeout" turned FALSE
Aug 22 13:33:51 BRT 2016 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE
Aug 22 13:34:48 BRT 2016 [debug] HB: HB domain is updated [hbdomain = (1@m)].
Aug 22 13:35:04 BRT 2016 [debug] SM: SF domain is updated [sfdomain = (1@1)].
Aug 22 13:35:04 BRT 2016 [debug] Join Agent: Send ack to join request from host (1).
Aug 22 13:35:08 BRT 2016 [info] Join Agent: Join request from host (1) is accepted by the local host.
Aug 22 13:35:08 BRT 2016 [notice] Liveset has been updated.  new liveset = (111)
Aug 22 13:35:08 BRT 2016 [info] Join Agent: proposed_liveset = (111)
Aug 22 13:47:30 BRT 2016 [info] SC: set_pool_state "invalid" enter.
Aug 22 13:47:30 BRT 2016 [err] Invalid pool state (1906).
Aug 22 13:47:30 BRT 2016 [notice] HA daemon started shutdown process.
Aug 22 13:47:30 BRT 2016 [info] SM: sm_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] SC: hostweight_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] SC: script_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] Xapimon: xapimon_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] HB: hb_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] WD: (watchdog_close) label=heartbeat stopping watchdog timer.
Aug 22 13:47:30 BRT 2016 [info] WD: (watchdog_close) label=heartbeat watchdog timer has been stopped successfully.
Aug 22 13:47:30 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is updated.
Aug 22 13:47:30 BRT 2016 [info] BM: bm_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] LM: lm_initialize(-1).
Aug 22 13:47:30 BRT 2016 [info] WD: (watchdog_close) label=statefile stopping watchdog timer.
Aug 22 13:47:30 BRT 2016 [info] WD: (watchdog_close) label=statefile watchdog timer has been stopped successfully.
Aug 22 13:47:30 BRT 2016 [debug] WD: watchdog id file /var/run/xhad.wd.id is deleted.
Aug 22 13:47:30 BRT 2016 [info] COM: com_initialize(-1).
Aug 22 13:47:30 BRT 2016 [notice] HA daemon completed shutdown process.

--
Tales

Em 18-08-2016 12:37, Marllus Lustosa escreveu:

Marllus Lustosa

unread,
Aug 22, 2016, 2:29:59 PM8/22/16
to xen...@googlegroups.com
Tales, como vejo no log creio que minha suspeita seja verdadeira:

Está ocorrendo um timeout de uma solicitação do arquivo heartbeat, por isso o HA não sabe o estado do host.
Aug 22 13:33:11 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=48050.
Aug 22 13:33:11 BRT 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=44300.

Como ela não sabe o estado o HA inicia o processo de reboot do host (modo fencing):

Aug 22 13:47:30 BRT 2016 [notice] HA daemon started shutdown process.

O problema aparente é na comunicação entre host xen e SR do HA. Veja nos logs do kern.log e messages.log se algo está estranho com relação aos drivers das placas de rede ou se alguma está apresentando falhas.
Após isso tente substituir as placas de rede pra isolar algum problema físico.


De acordo com um manual de HA do xenserver:

Hosts will self-fence (power off and restart) in the event of any heartbeat failure except in the following cases: 

       • When the storage heartbeat is present for all hosts but the network has partitioned (so that there are now two groups of hosts). In this case, all of the hosts which are members of the largest network partition stay running, and the hosts in the smaller network partition selffence. The assumption here is that the network outage has isolated the VMs, and that they must be restarted on a host with working networking. If the network partitions are exactly the same size, then only one of them will self-fence.
      • When the storage heartbeat goes away but the network heartbeat remains. In this case, the hosts check to see if they can contact all other hosts over the network. If they can, the hosts remain running on the assumption that the storage heartbeat server has gone down. This does not compromise VM safety, but any network problems will result in fencing because that would mean both heartbeats have disappeared.



Marllus Lustosa

unread,
Aug 22, 2016, 2:29:59 PM8/22/16
to xen...@googlegroups.com

Tales Rodarte

unread,
Aug 22, 2016, 3:27:04 PM8/22/16
to xen...@googlegroups.com
Marllus,

Os logs do sistema não indica nenhum erro nas placas.
Vou realizar alguns testes físicos e com outro switch.

Obrigado

Tales
Reply all
Reply to author
Forward
0 new messages