Zwave binding - dying nodes and network healing


ph.lu...@gmail.com

Feb 4, 2014, 1:53:47 AM
to ope...@googlegroups.com
Hi,

Have been experimenting with openHAB for a while now. I'm using an Aeon USB stick to control ~10 Zwave devices around my house. Previously I used a MiCasaVerde Vera Lite controller for this purpose.

With the Zwave binding I experience a _lot_ of problems that at first glance seem to be related to Zwave network communication, but I'm not sure that is the real problem. All devices worked perfectly fine with the Vera; I never had any issues with nodes/devices that dropped out of the network and couldn't be contacted/controlled. With the openHAB Zwave binding this happens all the time.

I can leave openHAB running for a day or two, and at least one node always ends up marked as dead. Restarting openHAB brings up all nodes in, say, one of five attempts. The dying node is always the result of a node not responding within a few minutes. At startup this is very obvious, because there is a lot of network communication, and typically more than one node ends up dead after a restart.

As I understand it, it is currently not possible to wake up a node marked as dead in the Zwave binding. Also, there is no way to perform network maintenance, such as route updates or manually sending NOP packets to verify node communication, using the Zwave binding. Isn't that something that "should" be scheduled periodically, at least when nodes are unresponsive?

I have noticed the same problems using both the 1.3.0 and nightly 1.4.0 Zwave binding versions. My devices are Fibaro 2x1.5 kW switches, wall plugs and dimmers.

Any ideas or input on this? How do I get a reliable Zwave network with openHAB? Would such an easy thing as raising the timeout until a node is marked dead improve the situation?

/PH

Chris Jackson

Feb 4, 2014, 5:52:56 AM
to ope...@googlegroups.com

I agree that this is an area of the zwave binding that needs work. I added some functionality to get neighbor node information and started to add support for this into HABmin, but have currently paused this until 1.4 is released. I suspect that many of the issues are associated with routing, so a) finding out what the routes are, and b) being able to manipulate them, is probably quite important in my opinion. From what I could tell, there doesn't seem to be any way to read the return routes that are currently set in a node, so the only way to be sure of what they are is to explicitly set them - this is one of the next things on my list....

 

Some sort of network 'healing' (similar to what Vera does) might also be useful, but I think this is a secondary issue - first is to try and understand what's happening...

 

I don't understand what marks a node dead (and I don't see dead nodes, but I do have some that stop working). I think it's the binding that marks nodes dead, but I also know that the stick holds a list of nodes that are considered dead as well, so it could be happening there (??). I've just not looked into this area of the code (yet!), but if anyone knows more it would be useful information to share so we can look at combating it.
 

Sorry - that hasn't answered your question, but I do think it's important, and if anyone knows more about how this works it would be useful to get ideas so we can implement some fixes...

 

Cheers

Chris

 

Per-Henrik Lundblom

Feb 4, 2014, 6:07:05 AM
to ope...@googlegroups.com
Hi,

Didn't know that the Veras performed network healing activities. That may explain the smoother (in that and only that area) user experience.

It is definitely the binding that marks the nodes dead. Whether that in turn comes from the firmware/Zwave stack in the Aeon stick, I don't know. I have full Zwave binding debug logs where the behavior can be seen. Today I spent fifteen minutes of openHAB restarts to get all my nodes up and running. Extremely annoying - it took about 15 restarts to get them all up. Which node(s) end up dead varies, but it is never the ones closest to the Aeon USB stick.

What documentation or information is the Zwave binding implemented from?

Talking about the Aeon USB stick: in HABmin my stick isn't identified as an Aeon Labs USB stick. In the node properties under the Devices tab, Manufacturer and Device ID are 0, but looking at the Zwave binding logs, the correct values are read at some point during startup. No idea why they are lost or set to 0 later on.


Chris Jackson

Feb 4, 2014, 7:10:09 AM
to ope...@googlegroups.com
Didn't know that the Veras performed network healing activities. That may explain the smoother (in that and only that area) user experience.
Yes, the Vera can do a network heal at 2am each night. From memory it's configurable, but on by default. The network heal does some sort of ping test on all nodes to establish the routes and then configures appropriately. There's also a button somewhere in the Vera setup window that can manually start this...
 

It is definitely the binding that marks the nodes dead. If that in turn comes from the firmware/Zwave stack in the Aeon stick or not, I don't know. I have full Zwave binding debug logs where the behavior can be seen.
Sure - the binding certainly marks it dead within the binding; my question is who makes the decision in the first place - does the binding do that, or is it told by the stick that the node is dead? I added some functionality to remove dead nodes from the stick, and this functionality only works if the node is in the stick's "dead node" list, so the stick also keeps track of this... I suspect that the two are independent, but I've not looked...

 
Today I spent fifteen minutes of openHAB restarts to get all my nodes up and running. Extremely annoying, took about 15 restarts to get them all up. Which node(s) end up dead varies but it is never the ones closest to the Aeon USB stick.
Yep - my feeling is that this is a routing issue which is why setting up routes is high up my todo list...

 
On what documentation or information is the Zwave binding implemented from?
I think it's largely derived from OZW - certainly when I've been implementing parts of the binding, this has been my main reference. There is also command class information around, but most of the ZWave docs are not available.

 
Talking about the Aeon USB stick, in HABmin my stick isn't identified as an Aeon Labs USB stick.... 
No idea why they are lost or set to 0 later on
Yes - same here. I want to look into this at some stage, but it's not high on my list right now as it's superficial. I didn't want to mess with this without understanding why it was 0 - just in case this is used to indicate something within the binding...

Cheers
Chris

Gert Konemann

Feb 4, 2014, 8:44:29 AM
to ope...@googlegroups.com
I also experience zwave connection problems which do not exist in Homeseer. I configure the zwave routing with Homeseer. When Homeseer has found a good forward and return path, routing can be left static without further zwave errors in Homeseer. But with the openHAB binding, frequent errors are still reported for some nodes. Also, battery-powered nodes report errors during their sleep period. That is certainly not correct. I also sometimes have dead nodes after openHAB has run untouched for several days. Declaring nodes dead is a bad strategy; it should be self-healing indeed.

Gert

On Tuesday, February 4, 2014 1:10:09 PM UTC+1, Chris Jackson wrote:

Chris Jackson

Feb 4, 2014, 9:04:06 AM
to ope...@googlegroups.com
I also experience zwave connection problems which do not exist in Homeseer. I configure the zwave routing with Homeseer. When Homeseer has found a good forward and return path, routing can be left static without further zwave errors in Homeseer. But with the openHAB binding, frequent errors are still reported for some nodes.
I wonder if Homeseer still has communication issues with these nodes, but it doesn't mark them dead so it resolves itself later? On OH, once it's dead, I think it stays dead and that probably needs revisiting...
 
 
Also, battery-powered nodes report errors during their sleep period. That is certainly not correct.
I don't see this here - I have a few nodes that are too far from the controller to work correctly, and they never show as dead even though the controller can never talk to them (they are battery nodes). I wonder if it's because the controller has NEVER communicated with them and they haven't completed initialisation that they don't show as dead?
 
I also sometimes have dead nodes after openHAB has run untouched for several days. Declaring nodes dead is a bad strategy; it should be self-healing indeed. 
Completely agree - this needs investigation

Chris

Chris Jackson

Feb 4, 2014, 11:02:27 AM
to ope...@googlegroups.com
I just had a quick look at the code. The dead node check simply checks whether a node has completed to the DONE stage after 2 minutes. It will only set a node to dead, though, if the node is a 'listening' or 'frequently listening' node. Once dead, it stays there...
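To make that behaviour concrete, here is a rough sketch of the check as described (the class and method names are hypothetical and simplified, not the binding's actual code):

```java
import java.util.concurrent.TimeUnit;

// Sketch of the dead-node check described above: a 'listening' or
// 'frequently listening' node that has not reached the DONE
// initialisation stage within 2 minutes is marked DEAD, and there is
// no path back out of DEAD.
class DeadNodeCheck {
    enum Stage { INIT, DONE, DEAD }

    static final long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(2);

    static Stage check(Stage stage, boolean listening, boolean frequentlyListening,
                       long initStartedMs, long nowMs) {
        // Battery (non-listening) nodes are expected to be silent while
        // asleep, so they are never declared dead here.
        if (stage != Stage.DONE
                && (listening || frequentlyListening)
                && nowMs - initStartedMs > TIMEOUT_MS) {
            return Stage.DEAD; // once dead, it stays there
        }
        return stage;
    }
}
```

Note that, under this sketch, a battery node that never completes initialisation simply stays in its current stage, which would match the observation that sleeping devices never show as dead.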
 
The implementation in OZW seems to be the same, although as OZW is effectively just a driver it sends a notification that there are dead nodes and I guess assumes that the application will do something.  In the OH case, I don't think that this same functionality is applicable - or to be more precise, the handling of any top level 'retry' function also needs to be handled somewhere in the binding since there is no higher level application.
 
I suspect that we need to implement a "management layer" in the binding that handles this sort of thing, and any "daily heal" or whatever, in order to ensure that the network remains healthy....
 
Comments on this appreciated...
 
Chris

Per-Henrik Lundblom

Feb 4, 2014, 3:22:15 PM
to ope...@googlegroups.com


On Tuesday, February 4, 2014 5:02:27 PM UTC+1, Chris Jackson wrote:
I just had a quick look at the code. The dead node check simply checks whether a node has completed to the DONE stage after 2 minutes. It will only set a node to dead, though, if the node is a 'listening' or 'frequently listening' node. Once dead, it stays there...

Seems very consistent with my observations and quick digging in the code.

The implementation in OZW seems to be the same, although as OZW is effectively just a driver it sends a notification that there are dead nodes and I guess assumes that the application will do something.  In the OH case, I don't think that this same functionality is applicable - or to be more precise, the handling of any top level 'retry' function also needs to be handled somewhere in the binding since there is no higher level application.
 
I suspect that we need to implement a "management layer" in the binding that handles this sort of thing, and any "daily heal" or whatever, in order to ensure that the network remains healthy....

I agree such a management entity within the binding is necessary. As I see it, it needs to periodically verify network health and perform "healing" if nodes are dying. I also think there should be a way to manually trigger this network healing functionality (HABmin?).

I haven't done that much reading on the inner workings of the Zwave protocol and its network topology. I'm also aware of the limited public documentation on the issue. Still, I would assume these network healing mechanisms are something Zensys must have thought of and documented?

/PH

Chris Jackson

Feb 4, 2014, 3:31:30 PM
to ope...@googlegroups.com

> I haven't done that much reading on the inner workings of the Zwave protocol and its network topology. I'm also aware of the limited public documentation on the issue. Still, I would assume these network healing mechanisms are something Zensys must have thought of and documented?

Official ZWave documentation isn't available unless you purchase the developer's kit - which costs something like $25k!!! So we need to use other sources of information, like the OZW development. It's ridiculous, but that's the situation :(

There is some information around though - some old zwave docs, and some information on sites like MCV.

Chris

Per-Henrik Lundblom

Feb 4, 2014, 3:45:15 PM
to ope...@googlegroups.com

I know about the closed nature of the protocol. I have glanced through some of the old zwave docs you are referring to; I think I will go through them a second time to get a better idea of the whole thing.

/PH

Chris Jackson

Feb 4, 2014, 6:50:06 PM
to ope...@googlegroups.com
So it seems that the controller actually calculates the return routes - you just need to tell the controller to assign a route between two nodes. A network heal then goes through and sets a route from the source node to all the nodes it's associated with, and also to the controller, to ensure everything has a route...
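A sketch of that idea (hypothetical names - the underlying serial API call would be something like ZW_AssignReturnRoute): for a given node, plan one assign-return-route call per association target, plus one back to the controller:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the route assignment described above. The
// int pairs stand for AssignReturnRoute(sourceNode, destNode) requests
// that the controller would calculate routes for.
class ReturnRoutePlanner {
    static List<int[]> plan(int nodeId, Set<Integer> associations, int controllerId) {
        Set<Integer> targets = new LinkedHashSet<>(associations);
        targets.add(controllerId); // always ensure a route back to the controller
        List<int[]> calls = new ArrayList<>();
        for (int dest : targets) {
            if (dest != nodeId) {            // never assign a route to itself
                calls.add(new int[]{nodeId, dest});
            }
        }
        return calls;
    }
}
```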

There's no way, from what I can tell, to get the routes that have been set - you can just request the neighbours (this is a function I've already implemented).

I’ve started adding these extra calls into the binding - with this we can then think about what higher level ‘heal’ function we add above this to try and make the network more robust….

Chris

Chris Jackson

Feb 6, 2014, 5:32:38 PM
to ope...@googlegroups.com
I've had a play this evening and here's what I've implemented -:

Every minute (configurable) the binding will check for DEAD nodes. If it finds any, it will try to reset the route to the controller on that node and then set it back to "un-DEAD" - i.e. DONE. There may be a problem with nodes that go DEAD during initialisation, since the function I'm using to set them back to DONE won't work if the device didn't complete initialisation, so this may need to be changed.

Additionally, every night at 2am (configurable) the binding will do the following for ALL nodes in the network. 
  • Update the neighbor nodes so that all nodes know their neighbors
  • Request associations so that we know what devices each device needs to communicate with
  • Update the route to the controller
  • Update the routes to the associated nodes
  • Update the route to the wakeup notification node
  • Refresh the bindings view of the network
  • Save the XML files
From what I can tell from scanning around the internet, this seems to be the 'right' thing to do, but I'm open to other ideas.....  This certainly won't be the final solution here since I suspect this won't work for battery devices so we'll have to have a rethink there, but it's something to work on...
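Assuming the nightly run is driven by a timer, the only slightly fiddly part is computing the delay until the next 2am slot; a minimal sketch (hypothetical class name):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical sketch of the scheduling for the nightly heal described
// above: compute the millis from 'now' until the next occurrence of
// the configured heal time (e.g. 02:00).
class HealScheduler {
    static long delayUntil(LocalDateTime now, LocalTime healTime) {
        LocalDateTime next = now.toLocalDate().atTime(healTime);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // already past today's slot, run tomorrow
        }
        return Duration.between(now, next).toMillis();
    }
}
```

The result would then be fed to a timer or scheduled executor, which re-arms itself for the next day after each heal run.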

This is mostly coded, although I'm not completely sure how to check that it's working - I can check that the functions return correctly, but it's more difficult to know if the network is configured correctly.... 

I'm open to ideas and comments :)

Cheers
Chris

Surendra

Feb 6, 2014, 11:06:11 PM
to ope...@googlegroups.com
Hi Chris,

I am not an expert on the topic, but here are a few inputs from my side:
- We have to make sure we handle battery operated devices during the healing process, and I am not sure how this can be done.
- On the Fibaro forums, I have seen a thread on network healing, and there is an example of some excellent visualization of the current routing. Adding such a view to HABmin would really help. http://forum.fibaro.com/viewtopic.php?t=1714
- I was reading on the Razberry site, and the controller implements a network healing function (http://razberry.z-wave.me/index.php?id=10). I am not sure about other controllers supporting this. Can we periodically trigger this in the controller, or are we planning to implement such a function in openHAB?
- It would be great if we could see the current routing table (graphical or textual) in the classic UI and trigger manual updates.

Regards,
Surendra

Surendra

Feb 7, 2014, 4:09:17 AM
to ope...@googlegroups.com
Not sure if you have seen this: http://wiki.zwaveeurope.com/index.php?title=SDK_Versions_and_Explorer_Frames. It also talks of dealing with dead nodes/failed routes in various versions of the SDK.

Chris Jackson

Feb 7, 2014, 4:31:55 AM
to ope...@googlegroups.com

- We have to make sure we handle battery operated devices during the healing process, and I am not sure how this can be done.
I completely agree - this requires some more thought though. For now, my focus has been on getting the low-level functions that are required to perform the heal working - then we can string them together to ensure the robustness of the network. For mains-powered devices this is easy - battery devices will need more thought, as they possibly won't wake up during the process unless we account for the wakeup class. I've been thinking about this, but the first step is to get something working for 'normal' devices.

- On the Fibaro forums, I have seen a thread on network healing, and there is an example of some excellent visualization of the current routing. Adding such a view to HABmin would really help. http://forum.fibaro.com/viewtopic.php?t=1714
I did think about something like this, and I might ultimately do this, but for now the visualisation I have in HABmin is a little different. Maybe not quite as sexy, but I think just as useful?

- I was reading on the Razberry site, and the controller implements a network healing function (http://razberry.z-wave.me/index.php?id=10). I am not sure about other controllers supporting this. Can we periodically trigger this in the controller, or are we planning to implement such a function in openHAB?
My plan at least was to implement a heal function in the binding (as per the outline in my previous mail), and this is largely coded. It will take time to get this working, I'm sure, but the problem with using a function on the Pi is that it won't be available for other users with, for example, the ZStick. That's not to say that support for the Pi heal shouldn't be added as well, but we need something universal, I think.

- It would be great if we could see the current routing table (graphical or textual) in the classic UI and trigger manual updates.
Hmmm - personally, I'm not sure about this. I think it complicates the UI, which (in my opinion) should be there for 'users', not 'administrators'. That's my personal preference though… The other issue is that, as best I can tell (from looking at the OZW source, which is one of the few available sources of information for zwave), it's not possible to read the routes - only the neighbours. Reading the neighbours is simple and already implemented, but that doesn't actually tell you if a route is configured between two nodes (as far as I know - I don't claim to be an expert - this is just my understanding from what I've read).

Cheers
Chris

Chris Jackson

Feb 7, 2014, 4:44:05 AM
to ope...@googlegroups.com

Not sure if you have seen this: http://wiki.zwaveeurope.com/index.php?title=SDK_Versions_and_Explorer_Frames. It also talks of dealing with dead nodes/failed routes in various versions of the SDK.
No - I'd not seen that page. Thanks for the link. It has some interesting comments - unfortunately it's basically impossible to get the documentation on the API since this is restricted :( With the API, it would be a (relatively) simple task…

Cheers
Chris

Tom

Feb 10, 2014, 4:32:57 AM
to ope...@googlegroups.com
I am really happy to see this thread... when I first reported ZWave issues, everyone told me that ZWave and its openHAB binding worked completely reliably for them, and I am glad to see that I am not the only one with issues after all. In the meantime I have replaced nearly every single piece of hardware in my network, starting with the controller, did 1001 network neighborhood updates and test&heals, tried adding nodes as relays, etc. And while things are working more stably now, they are still not perfect. Just this morning one of my Fibaro dimmer-attached lamps refused to be switched on via radio again (openHAB considered it dead after running into a timeout). I was quite surprised, however, to see the node recover immediately after I pushed its local switch. That's progress (I installed the latest nightly last weekend)!

According to the logs, the dimmer responded to a first message but stopped responding to a second message 20s after the first one (I use it as a sunrise dimmer, so it gets many commands...), then did not respond to further messages, then recovered after I pressed its locally attached switch. So, between "working" and "dead" there are actually only 20s... Unfortunately, I have not been motivated enough to see if I can get messages through bypassing openHAB (the thing with openHAB is that once the zwave binding considers a node "dead", it stops trying to send messages to it, so you never know if the zwave network has recovered yet).

Things I learned in the meantime:

 - Forget ZWave sticks with firmware versions older than 3 years: in 2011 a firmware bug
   was discovered, and apparently some old firmware versions willingly sent messages via
   dead-end routes. That's why I replaced my old Tricklestar stick with a new Aeon Labs one.

 - Actually, once the topology has been discovered using network neighborhood updates,
   the controller should take care of routing itself (unless you tell the controller not
   to use routing via options), shouldn't it? Some documents claim that controllers with
   current firmware versions are supposed to use explorer frames themselves if they run
   into a routing problem.

 - The one node with the most regular issues is directly accessible from the stick
   according to the routing table. The issue might be caused by the node being on the
   edge of not being accessible and the controller not using alternative routes once it
   does not get through (alternative routes do exist according to the routing table).

 - I still do not understand return routes. Are they static, and what happens once the
   chosen return route becomes invalid (e.g. due to interference)?

 - I'll check if Aeon Labs has a newer firmware for my controller (their website was
   down again when I got the stick...)

Kind regards,
Tom

Ben Jones

Feb 10, 2014, 4:47:34 AM
to ope...@googlegroups.com
One thing I have found that was the biggest help was to run a 'node neighbour update' on the controller node itself in OZWCP. It often fails and can take 2-3 attempts before it completes successfully, but this seems to be the best way to ensure that 'out of range' nodes are not addressed directly by the controller. It makes sense if you think about it - the node neighbour list of the controller node has to be the most accurate, otherwise a node which is out of reach, but is listed as a neighbour, will continuously fail.

You probably need to ensure your 'far away' nodes are not neighbours, to ensure any messages are routed via other nodes in between. I recently had a device fail (hardware fault) in the middle of my home. Removing it caused all sorts of problems in my mesh network, which had been running smoothly for months with no issues. That failed node was obviously being used to route a fair bit of traffic to 2-3 other nodes further away, and I had to go through the node neighbour update process for all nodes, including the controller, before things started working smoothly again. It seems to be nice and reliable again now.

Hopefully the changes Chris is making to this stuff for HABmin will make this an automatic process in the binding, as it is definitely beyond the ability of most users and leads to a lot of frustration!

Chris Jackson

Feb 10, 2014, 8:05:14 AM
to ope...@googlegroups.com
I've got most of this coded, but have a few issues with reliability that make me think I've not got everything quite right…

I've also found that the node neighbour updates for the controller can fail - the binding will auto-retry up to 3 times, and it's always worked after 2 or 3 tries. It also takes quite a long time (up to a minute), where the other nodes seem to return almost immediately…
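That retry behaviour can be sketched as a simple bounded loop (hypothetical names; 'request' stands in for sending the neighbour update request and waiting for its completion callback):

```java
import java.util.function.Supplier;

// Hypothetical sketch of the auto-retry described above: the neighbour
// update for the controller often fails on the first try, so repeat
// the request up to a fixed number of attempts.
class NeighborUpdateRetry {
    static final int MAX_ATTEMPTS = 3;

    // Returns true as soon as one attempt succeeds, false after
    // MAX_ATTEMPTS consecutive failures.
    static boolean updateWithRetry(Supplier<Boolean> request) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (request.get()) {
                return true;
            }
        }
        return false; // give up; caller may mark the update as failed
    }
}
```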

I need to spend a bit more time looking at the serial issue, and then I’ll get something out to play with. It will no doubt be a first attempt as there’s little information around about the best thing to do to recover the system, but I hope it will help…

Cheers
Chris


Ben Jones

Feb 10, 2014, 2:19:14 PM
to ope...@googlegroups.com
I have definitely found the controller node neighbour update takes a long time to complete - and often fails. Whereas other nodes are very quick.

Sounds like you are making good progress though!

Chris Jackson

Feb 10, 2014, 4:37:49 PM
to ope...@googlegroups.com
Yep - it’s generally working, but I’m finding that sometimes the serial port kind of hangs - i.e. the controller stops responding to controller messages. It still works fine for node messages, and once it receives a node message the controller messages start working again. I’m not sure if there’s something I’m not sending to clear things down properly (??). A bit more time required…

(I’m in Germany this week - I’ve even brought a small zwave network with me to play with in the hotel - that’s how dull I am :) )

Joshua Colson

Feb 11, 2014, 9:27:07 AM
to ope...@googlegroups.com
On Monday, February 10, 2014 2:37:49 PM UTC-7, Chris Jackson wrote:
... 
(I’m in Germany this week - I’ve even brought a small zwave network with me to play with in the hotel - that’s how dull I am :) )


I think you mean: "That's how dedicated I am." :-)

--
Joshua Colson 

Tom

Mar 13, 2014, 11:17:08 AM
to ope...@googlegroups.com
I just thought I'd add my latest observations to this thread... which are actually more Z-Wave than openHAB related. After my Z-Wave network started behaving worse again, I decided to try a few things out. Result: after deleting return routes on the nodes that keep dying in openHAB, things run much more smoothly. E.g. while my sunrise dimmer kept locking up after a few updates every single morning, it has now worked correctly for 3 days. Even my script that restarts openHAB when it sees dead nodes appear in the logs is less busy lately (but still happens to be triggered from time to time).

Without knowing if the explanation is correct, I guess that with no static return routes the devices choose the (reverse) route the request went through, which is not a perfect choice but might be a better one than always trying the same static route. I always wondered how routes were chosen - probably neighborhood discovery does not take connection quality into account, and of course radio is something dynamic, with conditions changing all the time.

Chris Jackson

Mar 13, 2014, 6:16:11 PM
to ope...@googlegroups.com
Hi Tom,
Interesting observation. From what I’ve read though, you should have a return route - at least one reason is that just because the forward route works, doesn’t always mean that the reverse route will work. ie. Just because node A can talk directly to node B, doesn’t necessarily mean that node B can talk to node A (although - probably - most of the time this will be the case). 

Which version of the binding are you using? Are you using the self healing version - I have had some feedback that this has been successful for some users.

Cheers
Chris



Tom

Mar 14, 2014, 1:23:07 PM
to ope...@googlegroups.com
Hello Chris,

I am using a +/- recent 1.5.0 snapshot version and (still) do manual healing now and then. I should give the auto healing version a try ... it seems you did a great job there.

I am unsure whether deleting return routes really improved things or if this is just a temporary coincidence. I have had issues since then, just less often. Without knowing how z-wave routing really works, my concern is that the return route chosen might be one that becomes unavailable, so not necessarily the best and most reliable route. E.g. (with return routes set) the sunrise dimmer works almost reliably throughout the day, but fails nearly every single morning at about the same step and time. Of course, with (metal) shutters closed, WLAN and computers turned off, etc., the radio "weather" is different from during the day. The sparse Z-Wave documentation does not really tell us how the network adapts to such dynamic circumstances(?). Nodes dying and being healed some time later is a workaround - it sounds strange to me that it should be normal for nodes to not respond for a while due to routing issues.

Of course, you are right, relying on a route working in both directions is not reliable either. So deleting return routes might be a workaround under some circumstances, but probably makes things worse in other cases.

However, this is all Z-Wave criticism, not OpenHAB+Z-Wave specific. I really wonder if and how commercial products get this solved. Maybe there are even hidden and intentionally implemented obstacles with secret solutions in order to control who is able to get reliable products out. Sounds like a conspiracy theory, though :-)

Best regards,
Tom

Chris Jackson

Mar 15, 2014, 1:15:09 PM
to ope...@googlegroups.com
Hi Tom,
Interesting comments - your thoughts echo mine in many ways I think…

> I am using a +/- recent 1.5.0 snapshot version and (still) do manual healing now and then. I should give the auto healing version a try ... it seems you did a great job there.
It might be worth trying the auto heal version (??) just to see if it helps. I’d also like to merge this soon so feedback is welcome.

> I am unsure whether deleting return routes really improved things or if this is just a temporary coincidence. I have had issues since then, just less often. Without knowing how z-wave routing really works, my concern is that the return route chosen might be one that becomes unavailable, so not necessarily the best and most reliable route.
There's clearly no substitute for a 'good network' - i.e. fixing the network topology so that dead nodes don't happen. However, it does seem that getting to that point isn't easy :( In the meantime, I think the best we can do is throw a few ideas around (like we're doing) and see what we come up with. It's such a shame that all of this is 'closed' information held by zwave :(

> The sparse Z-Wave documentation does not really tell us how the network adapts to such dynamic circumstances(?).
No - and this is clearly an issue that they are trying to solve, as new features were added to try to discover routes better. The problem (I think) is that most of these devices have crappy antennas, and as the 'radio environment' changes during the day (computers on/off, shutters open, wifi……) marginal nodes stop working. The best thing (maybe) is to add a few more nodes as repeaters to fill the gaps and make the network reliable.

> Nodes dying and being healed some time later is a workaround - it sounds strange to me that it should be normal for nodes to not respond for a while due to routing issues.
I agree - I don’t really know why the binding has a “dead node” concept such that when a node is dead we don’t communicate with it. Maybe it’s to avoid locking up the network and slowing down all the ‘good’ nodes (?), but otherwise I don’t really see why we can’t send messages out rather than marking it dead...

> However, this is all Z-Wave criticism, not OpenHAB+Z-Wave specific. I really wonder if and how commercial products get this solved. Maybe there are even hidden and intentionally implemented obstacles with secret solutions in order to control who is able to get reliable products out. Sounds like a conspiracy theory, though :-)
:)
I think zwave tries to fix these sorts of things with the discovery frames etc, but they just don’t tell you anything about what they’re doing, or what we’re meant to do to make it work as well as it can…

My feeling, partly from experience with the MCV Vera, and partly from playing with the OH binding, is the following strategy -:
- Implement a low rate polling so that we find out if nodes stop working before we actually want to use them ‘in anger’ - maybe poll all nodes every 5 or 10 minutes.
- When nodes go DEAD, I think we should still send ‘required’ messages.
- Implement a nightly heal to keep routes up to date.
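These three measures could be wired up with a simple scheduler. The sketch below is only illustrative (the `pollAllNodes` and `healNetwork` callbacks are hypothetical placeholders, not the binding's actual API), but it shows the shape of the idea: a low-rate poll every 10 minutes plus a heal at a fixed hour each night.

```java
import java.util.Calendar;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only - not the binding's real API. The callbacks are
// placeholders for "ping every node" and "start a network heal".
public class HealScheduler {

    /** Milliseconds from 'now' until the next occurrence of hourOfDay:00. */
    public static long delayUntilHour(Calendar now, int hourOfDay) {
        Calendar next = (Calendar) now.clone();
        next.set(Calendar.HOUR_OF_DAY, hourOfDay);
        next.set(Calendar.MINUTE, 0);
        next.set(Calendar.SECOND, 0);
        next.set(Calendar.MILLISECOND, 0);
        if (!next.after(now)) {
            next.add(Calendar.DAY_OF_MONTH, 1); // already past today -> tomorrow
        }
        return next.getTimeInMillis() - now.getTimeInMillis();
    }

    public static void start(Runnable pollAllNodes, Runnable healNetwork, Calendar now) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Low-rate polling: find dead nodes before we want to use them 'in anger'.
        timer.scheduleAtFixedRate(pollAllNodes, 10, 10, TimeUnit.MINUTES);
        // Nightly heal at 02:00 to keep routes up to date.
        timer.scheduleAtFixedRate(healNetwork, delayUntilHour(now, 2),
                TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }
}
```

The fixed nightly time matters because a heal generates a burst of network traffic, so it's best run when nobody is switching lights.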

This is just my thoughts to make the system more robust, but I’m sure there are other ideas….

Cheers
Chris



mka...@gmail.com

Mar 17, 2014, 9:24:46 AM
to ope...@googlegroups.com
Hi Chris,

I have been having issues with various dead nodes over time in my 10+ z-wave installation when using OpenHab. With the #609 snapshot and a Network Heal using Habmin all units now nicely work as expected. 

So I basically just wanted to drop you a note to thank you for your great work with contributing to OpenHab and the initiative and creation of Habmin!

Regards
Martin

Tom

Mar 22, 2014, 1:37:18 PM
to ope...@googlegroups.com
Hello Chris,

so, in the meantime I got the healing version of the ZWave binding working (after correcting that silly mistake of trying to use hh:mm in zwave:healing setting) ...

At first glance this seems to improve things ... No long-term observations yet, but at least I can now get the network up with all nodes ready by starting OpenHAB and starting a "heal" from Habmin. I did not actually understand the "DEAD node check" messages yet: isn't this supposed to immediately try to heal nodes being marked DEAD? At least it does not for nodes that are DEAD from the start (it happens every now and then that after a restart one or two nodes do not come up). Maybe a heal should preventively be scheduled a few minutes after starting up (or at the moment the binding declares the ZWave network ready?)

So, for now I am operating with return routes again :-)


Am Samstag, 15. März 2014 18:15:09 UTC+1 schrieb Chris Jackson:
No - and this is clearly an issue that they are trying to solve as new features were added to try and discover routes better. The problem (I think) is that most of these devices have crappy antennas,

I guess this is due to ZWave declaring range issues unimportant because routing is supposed to handle them. Well, routing is not working that well, it seems ...

The best thing (maybe) is to add a few more nodes as repeaters to fill the gaps and make the network reliable.

I am not sure if this really helps. My network is rather dense in the meantime and nodes are dying rather arbitrarily. If neighbour discovery does not include connection quality, then things may still go wrong: the controller may prefer a route with bad connection quality over one with good quality, since it cannot distinguish between them. In order to choose a "good" route, a mere reachable/unreachable status is insufficient.

I agree - I don’t really know why the binding has a “dead node” concept such that when a node is dead we don’t communicate with it. Maybe it’s to avoid locking up the network and slowing down all the ‘good’ nodes (?), but otherwise I don’t really see why we can’t send messages out rather than marking it dead...

I fully agree.  ZWave *is* a radio network, and there is no such thing as a completely reliable radio network. So, using a radio network includes having a plan for what to do when things temporarily go wrong. Declaring a node DEAD forever just because it did not respond to a message is a very severe reaction.

- Implement a low rate polling so that we find out if nodes stop working before we actually want to use them ‘in anger’ - maybe poll all nodes every 5 or 10 minutes.
- When nodes go DEAD, I think we should still send ‘required’ messages.

Yes, or at least ping them in order to notice when they become reachable again.

Kind regards,
Tom
 

Chris Jackson

Mar 22, 2014, 5:15:46 PM
to ope...@googlegroups.com
Hey Tom,
Glad things look to be working...

At first glance this seems to improve things ... No long-term observations yet, but at least I can now get the network up with all nodes ready by starting OpenHAB and starting a "heal" from Habmin. I did not actually understand the "DEAD node check" messages yet: isn't this supposed to immediately try to heal nodes being marked DEAD?
Yes - it should. Every 2 minutes it should check for DEAD nodes and it should then try and heal them. I’m happy to look at some logs if you want to send them over...


At least it does not for nodes that are DEAD from the start (it happens every now and then that after a restart one or two nodes do not come up). Maybe a heal should preventively be scheduled a few minutes after starting up (or at the moment the binding declares the ZWave network ready?)
Effectively this is what should happen - every 2 minutes… It might not work though on devices that don’t complete initialisation - I’ll need to check this.


I fully agree.  ZWave *is* a radio network, and there is no such thing as a completely reliable radio network. So, using a radio network includes having a plan for what to do when things temporarily go wrong. Declaring a node DEAD forever just because it did not respond to a message is a very severe reaction.

- Implement a low rate polling so that we find out if nodes stop working before we actually want to use them ‘in anger’ - maybe poll all nodes every 5 or 10 minutes.
- When nodes go DEAD, I think we should still send ‘required’ messages.

Yes, or at least ping them in order to notice when they become reachable again.
When I get a chance I will take a look at these things. I’d like to get the current stuff merged soon since it seems that, at the very least, it’s working as well as the existing system.

Cheers
Chris

Tom

Mar 24, 2014, 6:08:32 AM
to ope...@googlegroups.com
Hi Chris,


Am Samstag, 22. März 2014 22:15:46 UTC+1 schrieb Chris Jackson:
Yes - it should. Every 2 minutes it should check for DEAD nodes and it should then try and heal them. I’m happy to look at some logs if you want to send them over...

You've got mail :-)
 

Effectively this is what should happen - every 2 minutes… It might not work though on devices that don’t complete initialisation - I’ll need to check this.

I am not sure if it is auto-healing at all. I see the regular dead node checks being logged, but in the meantime I have had not only the dead-on-start case but also the dead-during-operation case. Neither of them was answered by an automatic heal. Timed and HABmin healing seem to run, however.
 
When I get a chance I will take a look at these things. I’d like to get the current stuff merged soon since it seems that at the very least it’s working at least as well as the existing system.

As far as I can tell no new issues appeared and a few of the old ones are alleviated :-)

Kind regards,
Tom

Chris Jackson

Mar 24, 2014, 8:38:43 AM
to ope...@googlegroups.com
You've got mail :-)
Got it - thanks :) I'll take a look tonight and see what it shows...
 

Effectively this is what should happen - every 2 minutes… It might not work though on devices that don’t complete initialisation - I’ll need to check this.

I am not sure if it is auto-healing at all. I see the regular dead node checks being logged but in the meantime I had not only the dead-on-start case but also the dead-during-operation case. Neither of them was answered by an automatic heal.
Ok - I'll take a close look at this. I did look at it yesterday and it should work ok. I added some more debug logs, but maybe I'll add more after I've looked through your log.

 
Timed and HABmin healing seems to run, however.
As far as I can tell no new issues appeared and a few of the old ones are alleviated :-)
 Excellent - thanks for the feedback.

Cheers
Chris

Chris Jackson

Mar 24, 2014, 1:18:46 PM
to ope...@googlegroups.com
Hey Tom,
Unfortunately the logs didn’t show much - other than to confirm your observation that the dead node check isn’t working. I found one possible reason (the timer wasn’t being set!) although I’m not sure this is the full story so I’ve added some more debug to the log and uploaded a new version to the HABmin Github.

Another log would be great when you get a chance :)

Cheers
Chris

Tom

Mar 24, 2014, 3:17:15 PM
to ope...@googlegroups.com

Hi Chris,

ok, I have just installed the updated binding. Unfortunately, this time all nodes came up without any problems ... I'll see if the sunrise dimmer fails tomorrow morning and makes OpenHAB log interesting messages :-)

Kind regards,
Tom

Tom

Mar 26, 2014, 4:55:33 AM
to ope...@googlegroups.com
Hello Chris,

after installing the latest version I restarted OpenHAB until I got a node dead from the start ... and ... it was healed and came to life a few minutes later.

This is really great! A huge step forward! Many thanks for all your efforts!

I am waiting now for the next "dead during operation" event in order to see the node being healed then :-)

Kind regards,
Tom

Chris Jackson

Mar 26, 2014, 5:37:42 AM
to ope...@googlegroups.com
Hi Tom,
Many thanks for testing that - I'm pleased it's working (at last :) ).

I'm looking at changing the DEAD node detection from a 90 second 'dead node check' to use an event (an idea I got from JsW todo list on issue 431). This will speed things up with initiating the heal after a node goes dead, and make the code a little more efficient (no more need to poll).

I still want to take a look at some other measures (as mentioned previously), but I'd try and get this merged to the master branch first.

Again - thanks for your help...

Cheers
Chris

Chris Jackson

Mar 26, 2014, 12:32:33 PM
to ope...@googlegroups.com
I would like to propose an alternative method of handling DEAD nodes in the zWave binding. Nodes are considered dead when they don’t respond to a message - in the current binding, once dead, they stay dead and you can NEVER talk to them. I think this is not the best approach to managing dead nodes; however, they probably should be properly managed, since ignoring 'dodgy' nodes may also cause its own set of issues…

Options (As I see it) -:
  • Mark dead and don’t do anything. This is the current approach - it means that a node’s gone until you restart the binding. 
  • Don’t ever mark a node dead. This has one drawback I think - when trying to communicate with a node that’s not talking, we will ‘lock up’ the binding for 15 seconds. During this time, no communications will be possible to other nodes. 
  • Perform a network heal when the node goes dead and mark it alive again. This is what the new ‘HABmin’ binding does now - it’s a reasonable way forward, but as above, since we mark the node alive, we’ll instantly get into the 15 second lockup issue (which may not be a problem in most instances).
  • Perform a network heal when the node goes dead, but don’t mark it alive. When a node is dead, we reduce the number of retries to 1, and only mark the node alive when we receive a response. 

The last option seems to me to be a good compromise. It gets nodes working again - it doesn't discard nodes from the network, and it will ALWAYS mean we send a command to a node. However, if the node is considered DEAD, then we just send once rather than the 3 retries that are allowed for working nodes to avoid too much downtime. This has the advantage that we’ll always talk to a node, but we limit the impact on other nodes by not locking up the binding for 15 seconds (15 seconds is 3 retries, each waiting up to 5 seconds). It also means that by marking the node ALIVE only when we receive a response from it, the DEAD indicator is useful in that it tells us this node is a bit dodgy.

One issue that we face is that only 1 transaction is allowed to be outstanding at once - if we can change this (which I think is possible, but is not a trivial change) then the impact of retries would be considerably reduced. This is a story for another day though!

The other thing that I think needs to be implemented is a periodic PING of all nodes (except battery nodes). With the default implementation in the binding, we don’t poll nodes, so the only time we find out they are dead is when we want to actually talk to them (e.g. when you turn the lights on). I think it’s better to periodically poll all nodes (say, every 10 or 20 minutes or so) - not at a rate that’s going to congest the network, but often enough that the binding can find out that a node has disappeared, and attempt to perform a network heal to rectify the situation BEFORE you want to turn the lights on…
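As a rough illustration of such a keep-alive poll, the sketch below selects the mains-powered nodes that have been silent for longer than the poll interval; `Node` here is a hypothetical stand-in for the binding's node class, not real openHAB API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a periodic keep-alive check that pings
// mains-powered nodes so failures are noticed BEFORE a real command is sent.
public class KeepAlivePinger {

    // Hypothetical minimal node representation.
    public static class Node {
        public final int nodeId;
        public final boolean batteryPowered;
        public long lastHeardMs; // last time any frame was received from this node

        public Node(int nodeId, boolean batteryPowered, long lastHeardMs) {
            this.nodeId = nodeId;
            this.batteryPowered = batteryPowered;
            this.lastHeardMs = lastHeardMs;
        }
    }

    /**
     * Select nodes due for a NOP ping: mains powered (battery nodes sleep and
     * must not be pinged) and silent for longer than the poll interval
     * (e.g. 10-20 minutes, as suggested above).
     */
    public static List<Node> nodesToPing(List<Node> nodes, long nowMs, long intervalMs) {
        List<Node> due = new ArrayList<>();
        for (Node n : nodes) {
            if (!n.batteryPowered && nowMs - n.lastHeardMs > intervalMs) {
                due.add(n);
            }
        }
        return due;
    }
}
```

Skipping nodes heard from recently keeps the extra traffic near zero on a busy network, since regular reports already count as proof of life.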

So, in summary the following would be my proposed implementation -:
  • Alive nodes are normally allowed 3 retries 
  • A periodic PING will be used at a slow rate to ensure a node is still ALIVE 
  • If a node fails to respond after 3 retries it gets marked DEAD 
  • A notification will be sent to start a network heal on the DEAD node 
  • Any further commands etc sent to the DEAD node will still be sent, but no retries will be supported to avoid locking up the network 
  • As soon as a node responds, it gets marked ALIVE, and has full retry privileges restored 
The above is relatively simple to implement from the current HABmin version of the binding and I think would make things more robust. Given that the changes are small (probably only an hour's work) it might be better to implement them before the current PR is merged into the master branch.
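For what it's worth, the proposed handling boils down to a small amount of bookkeeping per node. The sketch below is hypothetical (none of these names exist in the binding) but captures the rules above: ALIVE nodes get 3 attempts, DEAD nodes get 1 so the network isn't locked up, and any response restores full retry privileges.

```java
// Hypothetical sketch of the proposed DEAD-node policy; class and method
// names are illustrative, not the binding's actual API.
public class NodeRetryPolicy {
    public enum State { ALIVE, DEAD }

    private static final int ALIVE_RETRIES = 3; // full retry privileges
    private static final int DEAD_RETRIES = 1;  // single attempt, no 15s lock-up

    private State state = State.ALIVE;

    /** Number of transmit attempts allowed for the next command. */
    public int allowedAttempts() {
        return state == State.ALIVE ? ALIVE_RETRIES : DEAD_RETRIES;
    }

    /** Called when all attempts for a command have failed. */
    public void onSendFailed() {
        if (state == State.ALIVE) {
            state = State.DEAD;
            // here the binding would send a notification to start a
            // network heal on this node
        }
    }

    /** Any response marks the node ALIVE again with full retries. */
    public void onResponse() {
        state = State.ALIVE;
    }

    public State getState() {
        return state;
    }
}
```

Because commands are still always sent, the DEAD flag becomes an advisory "this node is a bit dodgy" indicator rather than a death sentence.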

I welcome any feedback and alternative thoughts on this…

Cheers
Chris

Ben Jones

Mar 26, 2014, 3:10:05 PM
to ope...@googlegroups.com
I think JwS is back on deck so hopefully he will see this and put in his 2c. Sounds good to me though.

Gert Konemann

Mar 26, 2014, 6:28:03 PM
to ope...@googlegroups.com
Chris,

I support your proposal. 

Gert



Dave Hock

Jul 8, 2014, 12:29:42 AM
to ope...@googlegroups.com
For a long time I have seen all nodes suddenly go red. One thing I have correlated is that a 4in1 sensor is a common neighbor to most nodes. When I force a wakeup with its button, all nodes come back to life.