OpenHAB 1.5 w/ Habmin Z-Wave - Unstable?

1,261 views
Skip to first unread message

Nolan Garrett

unread,
Apr 13, 2014, 12:49:16 AM4/13/14
to ope...@googlegroups.com
I'll preface this with stating that I understand that OpenHAB 1.5 with the associated latest releases of habmin are not considered release quality. That said, I think I am experiencing instability on my Raspberry Pi installation with the Razberry (the addon Z-Wave module for the Pi) that I am not sure is considered normal. 

I have a 32 node network, and regularly most nodes show offline. Triggers based upon events from my Aeon 4-in-1 multisensors unreliably reach my controller (as far as I can tell), and the network is sometimes either slow, or just outright unreliable. I'm struggling with troubleshooting the problem effectively. The reason I think that OpenHAB or HABmin is the issue is that the Z-Way module seems extremely reliable. I can start it and in seconds everything is controllable without issue, reliably and without latency. Rebuild routes as I may like, turning off Z-Way and restarting OpenHAB and I get stuck with a network that doesn't work most of the time.

Any thoughts on how to tackle this? I have a background in programming and have little fear of digging deep - I just need to know where to start.

Thanks!

Nolan

Chris Jackson

unread,
Apr 13, 2014, 5:23:16 AM4/13/14
to ope...@googlegroups.com
Hi Nolan,
I’d be happy to take a look at a log if you want to provide one? If the devices always go dead during the initialisation, then maybe a log for the first 2 or 3 minutes after startup should be good? Maybe also including the XML files from the /etc/zwave folder would also be useful...

The 4in1 devices seem to feature prominently on the problem list :( Can I ask how you are powering the devices (USB or battery). In theory, I don’t think that a battery device should be deemed dead by the binding during initialisation.

How many other devices do you have (i.e. devices that are permanently powered by mains - or are your 4in1s all USB powered?

Also, is this a problem that you only see with the version on the HABmin site, or does it also happen with the 1.5 snapshot that comes with openHAB?

Cheers
Chris

Vanisher

unread,
Apr 13, 2014, 2:01:51 PM4/13/14
to ope...@googlegroups.com
I Have the same problem.

Also triggers that are time-related are not working anymore.

I tried the latest snapshot and the latest habmin 1.50 z-wave version.


Op zondag 13 april 2014 06:49:16 UTC+2 schreef Nolan Garrett:

Chris Jackson

unread,
Apr 13, 2014, 2:10:47 PM4/13/14
to ope...@googlegroups.com

> Also triggers that are time-related are not working anymore.
What do you mean by triggers? Are you talking about rules? If so, I’d be very surprised if that’s related to the zwave binding - it sounds more like the generic startup problem.

I’m more than happy to try and solve any issues, but it would be helpful if you can provide a log, and information on your environment - it’s very difficult to debug a problem without some sort of information.

Chris

Nolan Garrett

unread,
Apr 13, 2014, 2:32:50 PM4/13/14
to ope...@googlegroups.com
Hi Chris,

I appreciate the offer! Here is a log dump. I set the zwave binding into debug mode, and let it run for a couple of hours. I know it's a lot to sort through but I figured more info is better than none.

Right now 9 of my 32 devices (all mains powered) are not showing online or are showing in the yellow "pending" state in HABmin. I am less concerned about the 4in1 sensors right now - I get the feeling they have their own set of problems. That said, I am having a difficult time with mains powered devices. I can pretty much guarantee that if I let openhab run for another 24 hours, the next time I check most devices will be offline and I may or may not be able to control them. Even for ones that show online, my scheduled events (time based) may or may not actually trigger the devices to switch on/off, even though in the openhab interface it'll show the state of the device as expected. (An example would be I have a sunrise action configured to turn off a light. The light will still be on after sunrise, but openhab will show it is off. Same for my sunset action that turns a light on).

I've definitely noticed the problem seems exacerbated by 1.5 and/or the latest HABmin releases, but it was never quite stable under 1.4 either. I am using the latest snapshots of HABmin and OpenHAB.

I'm happy to assist in debugging any way I can!
logs.tar.gz
nodes.tar.gz

Chris Jackson

unread,
Apr 13, 2014, 4:16:37 PM4/13/14
to ope...@googlegroups.com

> I appreciate the offer! Here is a log dump. I set the zwave binding into debug mode, and let it run for a couple of hours. I know it's a lot to sort through but I figured more info is better than none.
Absolutely - many thanks :)

Can you send me your item configuration for the zwave devices? Also, what node is your controller?

Cheers
Chris

Nolan Garrett

unread,
Apr 13, 2014, 5:02:30 PM4/13/14
to ope...@googlegroups.com
Here is my item file. My controller is node 1.
garrett.items

Chris Jackson

unread,
Apr 13, 2014, 6:03:57 PM4/13/14
to ope...@googlegroups.com
Thanks.

One thing I’ve noticed is that you’ve got 25 nodes with the refresh_interval set, with a 30 second update rate. The problem I see in the log with this is that as soon as anything ‘bad’ happens (like, one node timing out) and you’re quickly getting into trouble. If one node times out, that locks up the binding for 15 seconds, so then your 25 refresh commands all have to be sent in the remaining 15 seconds…. Even with everything going well, you’ve got nearly 1 frame per second being taken up with the refresh polling. While this is a reasonably heavy load, I wouldn’t think it would be a problem - it’s a bit difficult to see from the log how long a request is taking on your Pi since the log isn’t logging milliseconds - it looks like you’re getting a few responses per second, so you should be able to get all these 25 requests back in maybe 8 to 10 seconds if everything goes well and there’s no timeouts.

However, with 32 nodes in your system, I just wonder if it’s overloading? I see the queue getting VERY long when there’s a timeout - like 100+ frames in the queue! Version 1.3 handled the polling differently so maybe this is part of the reason things appear worse now (that’s just speculation - I’m not sure how the old binding worked as it was changed to the current implementation before my time).

The other thing is that in the HABmin version of the binding, it doesn’t treat DEAD nodes quite as ‘dead’ as the previous version. When a node goes dead, a) it tries to repair it, and b) it will still try and send messages to it, but it won’t do any retries. In my way of thinking, this is better than the previous system, which once a node went dead, it would ignore it until the binding is restarted. However, in your case, with all the polling, the healing might well be slowing things down such that the polling then starts to back things up. Maybe we need to change this slightly, however dead node handling should be a last resort function that only triggers ‘very occasionally’… I’ll have a think about this.

Anyway…. I would try increasing the refresh_interval for starters and see if this helps.

I think your switches are the GE - there was an effort a few months back to try and only poll these switches when they change state. I think these switches send the App Update message when they change state, and we can then poll only at this time - this would cut down your network traffic considerably as you wouldn’t need to use the refresh_interval at all on those switches. So, if you wanted to try and hack into the source, this would be a really useful thing to try and debug. In theory (!!) it’s implemented in the HABmin version, but as I can’t test it, and I had to reimplement it when I refactored the code a few months back, I can’t be sure. If you fancy taking this one on, I can point you at the file etc to look at…

I also note some of your XML files show device/type of 0, so somethings up there. I’ll take a look at this as if the file not’s ‘correct’ then it probably should reinitialise the data.

I’ll take a look at a few more things, but that’s my first thoughts anyway :)

Cheers
Chris




Nolan Garrett

unread,
Apr 13, 2014, 6:30:12 PM4/13/14
to ope...@googlegroups.com
Good feedback! I will remove all of the refresh intervals for now. I'd added them as I was noticing "sync" issues, but that was back in 1.3 and without HABmin.

From: Chris Jackson
Sent: ‎4/‎13/‎2014 3:04 PM
To: ope...@googlegroups.com
Subject: Re: [openhab] Re: OpenHAB 1.5 w/ Habmin Z-Wave - Unstable?

Thanks.

One thing I've noticed is that you've got 25 nodes with the refresh_interval set, with a 30 second update rate. The problem I see in the log with this is that as soon as anything 'bad' happens (like, one node timing out) and you're quickly getting into trouble. If one node times out, that locks up the binding for 15 seconds, so then your 25 refresh commands all have to be sent in the remaining 15 seconds.... Even with everything going well, you've got nearly 1 frame per second being taken up with the refresh polling. While this is a reasonably heavy load, I wouldn't think it would be a problem - it's a bit difficult to see from the log how long a request is taking on your Pi since the log isn't logging milliseconds - it looks like you're getting a few responses per second, so you should be able to get all these 25 requests back in maybe 8 to 10 seconds if everything goes well and there's no timeouts.


However, with 32 nodes in your system, I just wonder if it's overloading? I see the queue getting VERY long when there's a timeout - like 100+ frames in the queue! Version 1.3 handled the polling differently so maybe this is part of the reason things appear worse now (that's just speculation - I'm not sure how the old binding worked as it was changed to the current implementation before my time).

The other thing is that in the HABmin version of the binding, it doesn't treat DEAD nodes quite as 'dead' as the previous version. When a node goes dead, a) it tries to repair it, and b) it will still try and send messages to it, but it won't do any retries.  In my way of thinking, this is better than the previous system, which once a node went dead, it would ignore it until the binding is restarted. However, in your case, with all the polling, the healing might well be slowing things down such that the polling then starts to back things up. Maybe we need to change this slightly, however dead node handling should be a last resort function that only triggers 'very occasionally'... I'll have a think about this.

Anyway.... I would try increasing the refresh_interval for starters and see if this helps.

I think your switches are the GE - there was an effort a few months back to try and only poll these switches when they change state. I think these switches send the App Update message when they change state, and we can then poll only at this time - this would cut down your network traffic considerably as you wouldn't need to use the refresh_interval at all on those switches. So, if you wanted to try and hack into the source, this would be a really useful thing to try and debug. In theory (!!) it's implemented in the HABmin version, but as I can't test it, and I had to reimplement it when I refactored the code a few months back, I can't be sure.  If you fancy taking this one on, I can point you at the file etc to look at...


I also note some of your XML files show device/type of 0, so somethings up there. I'll take a look at this as if the file not's 'correct' then it probably should reinitialise the data.

I'll take a look at a few more things, but that's my first thoughts anyway :)

Cheers
Chris




--
You received this message because you are subscribed to a topic in the Google Groups "openhab" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openhab/_Wh_iurrIIo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openhab+u...@googlegroups.com.
To post to this group, send email to ope...@googlegroups.com.
Visit this group at http://groups.google.com/group/openhab.
For more options, visit https://groups.google.com/d/optout.

Nolan Garrett

unread,
Apr 13, 2014, 8:59:33 PM4/13/14
to ope...@googlegroups.com
I would be interested in where to look in the code for the GE App Update refresh poll. I could track this down, since most of my devices are the GE switches or dimmers.

Nolan Garrett

unread,
Apr 14, 2014, 1:20:26 AM4/14/14
to ope...@googlegroups.com
Even after removing the refresh_interval, the issue continues. After waiting for several hours, the network just never comes online. I even tried removing the nodeXX.xml files to let those try to rebuild. Logs attached. After this session which you see in the logs, I stopped OpenHAB and started Z-Way, which was able to control everything immediately without issue.
logs.tar.gz

Chris Jackson

unread,
Apr 14, 2014, 3:26:10 AM4/14/14
to ope...@googlegroups.com
Thanks Nolan,
I'll take a look at the logs this evening - hopefully with the polling out of the way it might show something more. I'll also get back to you about the GE issue...

Cheers
Chris

Dave Hock

unread,
Apr 14, 2014, 10:58:38 AM4/14/14
to ope...@googlegroups.com
Hi Chris,

These are my log files too. Still seeing issues....

I do notice a lot of the CT30 classes are not implemented. I could help out there but I am a bit clueless as to the openhab zwave architecture and the zwave protocol. I guess implementing a class is just a matter of assembling/disassembling the message parameter bits?

Dave Hock

unread,
Apr 14, 2014, 10:59:29 AM4/14/14
to ope...@googlegroups.com
zwave.zip

Chris Jackson

unread,
Apr 14, 2014, 11:47:18 AM4/14/14
to ope...@googlegroups.com

> Even after removing the refresh_interval, the issue continues. After waiting for several hours, the network just never comes online. I even tried removing the nodeXX.xml files to let those try to rebuild. Logs attached. After this session which you see in the logs, I stopped OpenHAB and started Z-Way, which was able to control everything immediately without issue.
Now this is really strange….. It looks like the RPi (or something!) is returning multiple responses to a single request -:

For example, right at the beginning of the log, we send out this request -:
2014-04-13 20:13:44 DEBUG o.o.b.z.i.p.ZWaveController$ZWaveSendThread[:763]- Sending Message = 01 03 00 07 FB

This is the only time that this request is sent, but the response comes in 4 times
2014-04-13 20:13:44 DEBUG o.o.b.z.i.p.ZWaveController$ZWaveReceiveThread[:913]- Receive Message = 01 2B 01 07 …
2014-04-13 20:13:46 DEBUG o.o.b.z.i.p.ZWaveController$ZWaveReceiveThread[:913]- Receive Message = 01 2B 01 07 …
2014-04-13 20:13:49 DEBUG o.o.b.z.i.p.ZWaveController$ZWaveReceiveThread[:913]- Receive Message = 01 2B 01 07 …
2014-04-13 20:13:50 DEBUG o.o.b.z.i.p.ZWaveController$ZWaveReceiveThread[:913]- Receive Message = 01 2B 01 07 …

The problem gets worse as I think there’s a cascade effect. The response to the initial request would normally cause the subsequent message to be queued, but because the initial message gets 4 responses, the next message gets requested 4 times - and I think each on of these 4 requests gets 4 responses and by the time we get to the MessageSerialApiGetInitData, we have 30 or so responses….

Because the MessageSerialApiGetInitData is received around 30 times, we then go on to request all the node data - 30 times more than we need to! So, the send queue ends up with nearly 900 frames in it! i haven’t really looked further down the log to see what happens, but with this as a starting point, I think it’s heading south very quickly :(

At some point, the queue does get down to 0, so this does kind of sort itself out (it takes about 20 minutes - from 20:13 to 20:34), but by this time nodes are dead and it’s trying to do a heal and I think we’re probably fighting a loosing battle.

I haven’t tried to correlate the messages further down the log (i.e. after the queue has returned to 0), but it looks like the duplication continues. For sure (at least from the logging), during the initialisation, the messages are only being sent once and replied to multiple times.

I doubt it’s a problem with the binding - this send/receive part of the code has been static since 1.3, and it’s not something that I’ve seen before. The “Sending Message =“ log entry is only 1 line of code before the write to the COM port, so as best I can tell, the binding is only sending once, but getting multiple responses...

Given that the binding does (I think?) work for others with the RPi, there must be something different with your setup I guess? I wonder if you have a different version of Java to others with the Pi, or something like that?

Let me know what you can find - I’m happy to help if I can...

Cheers
Chris


Jack B

unread,
Apr 14, 2014, 12:06:44 PM4/14/14
to ope...@googlegroups.com
Hey Guys,

Fwiw I was seeing the multiple messages thing with 1.3 and 1.4 as well, and found reference to what I think is related in the z-wave wiki page, under known issues for the razberry board:


Tried switching to an older java, but can't remember if it fixed the issue (it definitely made it run like I hog which I presumed was down to the lack of hf support). Haven't tried with 1.5 as I switched to hacking around on top of the Z-Way server as a better short term base and have been keeping an eye on the list since.

Cheers
Jack

Chris Jackson

unread,
Apr 14, 2014, 12:19:49 PM4/14/14
to ope...@googlegroups.com
Thanks Jack,
That does indeed look like the same issue - maybe the ‘this is under investigation’ might be a stretch though :(

I’m not sure how you’d go about debugging this. As per my previous email, the logged message that says it’s sending is only 1 line before the write (ok, there’s a try{ in the way, but that’s all) so there’s no real scope for adding more debug. Maybe there’s something that can be done on the Pi itself to sniff the serial comms?

The other thing to look at is to see if this correlates with versions of Java, or some other platform specific ‘thing’ (rev of RPi, Java version, OS version?). I’m sure people are successfully using the Pi as I’ve had people mention other HABmin issues (i.e. the ‘online’ / ‘offline’ messages repeating as the Pi can’t quite keep up). It might be worth trying to get a list of what is and isn’t working to see if there’s any commonality?

Chris

Nolan Garrett

unread,
Apr 14, 2014, 1:31:41 PM4/14/14
to ope...@googlegroups.com
So I am running this Java:

java version "1.7.0_51"
Java(TM) SE Embedded Runtime Environment (build 1.7.0_51-b13, headless)
Java HotSpot(TM) Embedded Client VM (build 24.51-b03, mixed mode)

My Razberry is on 1.5.0-rc1

Kernel: Linux ha 3.10.25+ #622 PREEMPT Fri Jan 3 18:41:00 GMT 2014 armv6l GNU/Linux

I think you're on to something with Java being the root issue. I fired up OZWCP and I find the network is responsive and functional without issue. Since neither OZWCP nor Z-Way is Java-based, it certainly indicates a Java issue.

I'll see if I can figure out how to sniff the serial communications.



Chris

Chris Jackson

unread,
Apr 14, 2014, 1:35:28 PM4/14/14
to ope...@googlegroups.com
Dave,
Thanks for the log - it’s interesting :)

I haven’t digested it yet, but there may be some clues in here to the problems with the 4in1. I suggest though that maybe we do this in another thread and leave this to the RPi issue (since Nolan started this thread)…

I’ll start another thread as I have some questions…

Chris

Nolan Garrett

unread,
Apr 14, 2014, 1:48:24 PM4/14/14
to ope...@googlegroups.com
I've replaced the Java version I am using and have switched to the repo version. Going to see if there is any improvement.


On Mon, Apr 14, 2014 at 10:35 AM, Chris Jackson <ch...@cd-jackson.com> wrote:
Dave,
Thanks for the log - it's interesting :)

I haven't digested it yet, but there may be some clues in here to the problems with the 4in1. I suggest though that maybe we do this in another thread and leave this to the RPi issue (since Nolan started this thread)...

I'll start another thread as I have some questions...

Nolan Garrett

unread,
Apr 14, 2014, 4:16:21 PM4/14/14
to ope...@googlegroups.com
Here are the new logs now that I am on the repo version of the JDK:

java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) Client VM (build 24.0-b56, mixed mode)

It seems like there are fewer retransmissions. I've noticed the queue grows large but quickly cuts down to around 35 messages, where it then sits for quite a while before dropping to 0 (and steadily marks everything as dead I think). Most devices show uninitialized, dead or pending. 

If I had any other Z-Wave card to test with, I would to eliminate or indicate this Raspberry Pi as the issue.

It's interesting that at least some portion of the binding does seem to be working even though most of my network shows offline. For example, two garage door sensors I have trigger an email rule to be when I start OpenHAB that they are closed, though those devices always show dead.
logs.tar.gz

Chris Jackson

unread,
Apr 14, 2014, 4:21:18 PM4/14/14
to ope...@googlegroups.com
It’s better, but still no good :(

The node data frame is still repeated 10 or 15 times, so it’s still not working correctly unfortunately.

Cheers
Chris

Nolan Garrett

unread,
Apr 14, 2014, 9:03:01 PM4/14/14
to ope...@googlegroups.com
Just posted this on the Razberry forum to see if anyone has seen this behavior. I'll continue troubleshooting on my end as much as I can. Still trying to figure out how to effective sniff the serial port.

Nolan Garrett

unread,
Apr 17, 2014, 12:34:52 AM4/17/14
to ope...@googlegroups.com
Well, I am having a hard time diagnosing this. Serial sniffing is giving me limited results (probably because I don't know what I am doing). At the razberry forum, he said he thinks that OpenHAB isn't ACKing the packet.

Anyone have ideas on how to figure this one out?

Chris Jackson

unread,
Apr 17, 2014, 4:19:16 AM4/17/14
to ope...@googlegroups.com

On Thursday, 17 April 2014 05:34:52 UTC+1, Nolan Garrett wrote:
Well, I am having a hard time diagnosing this. Serial sniffing is giving me limited results (probably because I don't know what I am doing). At the razberry forum, he said he thinks that OpenHAB isn't ACKing the packet.

Anyone have ideas on how to figure this one out?

Hi Nolan,
ACKs are (or should be) sent - if not, everyone would have this problem. However, it's an interesting point, and I would recommend increasing your logging level to TRACE. Here you should then get messages logged to show that the ACK is being sent. The ACK gets sent after the message checksum is validated, so maybe there's something 'interesting' with math libraries or something (although this is . Anyway, if you enable TRACE logging, then it will either log a message saying that the ACK is being sent, or it will log a message saying that the frame is being discarded. That will give us one more bit of information and may help with the direction...

I should add though that I doubt this is the issue since the frame is being processed, and if the checksum has a problem, then the frame wouldn't be processed. Basically, if the frame is processed, it's very difficult to see how the ACK wouldn't be sent -:


     private void processIncomingMessage(byte[] buffer) {
     SerialMessage serialMessage = new SerialMessage(buffer);
     if (serialMessage.isValid) {
     logger.trace("Message is valid, sending ACK");
     sendResponse(ACK);
     } else {
     logger.error("Message is not valid, discarding");
     return;
     }
    
     handleIncomingMessage(serialMessage);
        }

Cheers
Chris 

Manolis N

unread,
Apr 17, 2014, 9:58:24 AM4/17/14
to ope...@googlegroups.com
Hello Chris,
I am having similar problems, with an AEON HEM and 3 Fibaro plugs which work unreliably, as they are on two different floors. I have tried the habmin "healing" without much luck, some devices apear as dead some times, eventhough they have "neighbours". However using z-tool "optimize" all devices can be reached, so maybe that's a hint that healing can be improved (?)

Anyway, I have attached my zwave trace logs, and node xmls, maybe you will find something interesting in there.
System starts clean with no node.xmls and "heal" button is pressed from habmin a few minutes later.
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

By the way,I am the only one using a three phase HEM as far as I know, and you do not have it in your product explorer.
With Bens help the 3 phase readings work fine, but I can not get the "total" readings.
Here is the discussion and it's manual if you are interested having a look and adding better support for it.

Let me know if I can help in any other way!

Thanks
Manolis
zwave.zip

Chris Jackson

unread,
Apr 17, 2014, 10:17:17 AM4/17/14
to ope...@googlegroups.com
Thanks Manolis,
I think the problem you're having is likely different (unless you have a RPi?).

Anyway, I'll take a look at your log. For sure the heal may need some refinement - there's no real information available on the 'best' way to handle this. According to docs I've seen, the zwave protocol should just sort itself out, but from what we've seen, it doesn't... I'm definately open to ideas on how to improve this if anyone has any though :)

Cheers
Chris

Nolan Garrett

unread,
Apr 17, 2014, 5:00:13 PM4/17/14
to ope...@googlegroups.com
On it! I really want to figure this out, both for myself and since it's listed as known issue on the OpenHAB Z-Wave binding site, at least one other person must experience this.


--

Manolis N

unread,
Apr 18, 2014, 7:20:21 AM4/18/14
to ope...@googlegroups.com
(No, I am not running on RPi​.)

I just tried your latest commit (Update zwave to fix problem with dead node sequencing)

Deleted node xmls, started and it is seems better.. I got 90% of the items in a few seconds! (this normally takes minutes/hours)

However, on habmin HEM seems dead (eventhough it did populate values) and I also now have a dead fibaro plug, which was working fine a few minutes ago with previous version.

Btw is the heal button per node selected (hence need to press for every dead one), or is it global for all nodes? 
I pressed heal a few times, but nothing changed. I am attaching trace logs, with new and old node xmls.

Thanks
Manolis



--
zwave.zip

Chris Jackson

unread,
Apr 18, 2014, 7:32:08 AM4/18/14
to ope...@googlegroups.com

I just tried your latest commit (Update zwave to fix problem with dead node sequencing)
Note that I made an error and committed the same version earlier this morning and only noticed this 10 minutes ago, so you may have still been running the same version. You might want to reload with the ‘real’ new version as your log is from before I fixed this (again - sorry).

Btw is the heal button per node selected (hence need to press for every dead one), or is it global for all nodes? 
I pressed heal a few times, but nothing changed. I am attaching trace logs, with new and old node xmls.
This is a global setting, but it’s a good idea to make it ‘per node’. 

I’ll have a quick look at your logs, but if you have more after the version update that would be good.

Cheers
Chris

Manolis N

unread,
Apr 18, 2014, 7:36:06 AM4/18/14
to ope...@googlegroups.com
Ok, I will re-update... however I have noticed something interesting with current version (which ever that is).

The second startup I did, without deleting the node.xmls, is working PERFECT !!!
I have ALL items refreshing at 60secs..
I am keeping the log writing, I will re-download your jar, delete nodes, start one time... observe and then start again without deleting the nodes. I have a feeling you are doing something in initialization, which might not go well the first time, but if you load the nodes it goes way better :-)





--

Charles AMPEAU

unread,
Apr 18, 2014, 8:40:11 AM4/18/14
to ope...@googlegroups.com
Hi Chris and Nolan,

If it could help, here is a zwave log with trace level on my RPI with Razberry module + SRT321 + SSR303.

I'm running lastest OH 1.5 snapshot with latest ZWave addon from HABmin.

My 2 cents

Charles


Le dimanche 13 avril 2014 06:49:16 UTC+2, Nolan Garrett a écrit :
I'll preface this with stating that I understand that OpenHAB 1.5 with the associated latest releases of habmin are not considered release quality. That said, I think I am experiencing instability on my Raspberry Pi installation with the Razberry (the addon Z-Wave module for the Pi) that I am not sure is considered normal. 

I have a 32 node network, and regularly most nodes show offline. Triggers based upon events from my Aeon 4-in-1 multisensors unreliably reach my controller (as far as I can tell), and the network is sometimes either slow, or just outright unreliable. I'm struggling with troubleshooting the problem effectively. The reason I think that OpenHAB or HABmin is the issue is that the Z-Way module seems extremely reliable. I can start it and in seconds everything is controllable without issue, reliably and without latency. Rebuild routes as I may like, turning off Z-Way and restarting OpenHAB and I get stuck with a network that doesn't work most of the time.

Any thoughts on how to tackle this? I have a background in programming and have little fear of digging deep - I just need to know where to start.

Thanks!

Nolan
zwave.zip

Manolis N

unread,
Apr 18, 2014, 9:20:58 AM4/18/14
to ope...@googlegroups.com
I think you nailed it Chris, cudos! :-)
I updated the bundle and started with no nodexmls, (log time 14:43)
HEM worked right away, one fibaro (node 4) marked orange but was reporting values 
(what does orange mean ?)

So everything seem to be working fine, from first start...
I did a restart at 14:51 , everything continued working fine, and orange fibaro became green after a while. I have it runing for almost two hours now, everything is reporting every 60secs so it looks that mesh problems are solved!

Now the only problem is the AEON HEM, is not reporting the "3 phase sum" values, but thats a diferent issue :-)

Logs attached anyway

Thanks Chris!

zwave.zip

Nolan Garrett

unread,
Apr 18, 2014, 11:35:10 AM4/18/14
to ope...@googlegroups.com
I'll give this a test in a few hours and post back with my results and logs.

From: Manolis N
Sent: ‎4/‎18/‎2014 6:21 AM

To: ope...@googlegroups.com
Subject: Re: [openhab] Re: OpenHAB 1.5 w/ Habmin Z-Wave - Unstable?

Chris Jackson

unread,
Apr 18, 2014, 11:58:12 AM4/18/14
to ope...@googlegroups.com
Thanks Manolis,
Well see :)

I think one of the issues with this whole ‘dead node’ stuff is that there’s a lot of variability. Radio waves don’t propagate the same every time, it’s time of day dependant, what other stuff is on in the house dependant, and who knows what else dependant :)

The latest version should be better - previously there was a bug that meant that the system would potentially spend a lot of time trying to recover a dead node. This could impact the rest of the system, which can propagate problems. However, the underlying issue is still that dead nodes should be something that only happens very seldom - if it’s happening lots, then probably there’s an issue with the network (or of course a bug in the software).

Your comment earlier about startup issues is interesting. I was looking at some logs earlier in the week (Nolans I think) and even on a good day (i.e. everything working) when you have 30+ nodes in the network, the way the binding starts will mean a lot of frames get queued. This shouldn’t (probably!) cause any problems, but it might in some cases so maybe this is something to look at at some stage. However, it probably means changing more than a few lines, and I’d prefer to avoid that right now…….

Let’s see how things pan out with this version - hopefully it will be better at least :)

Cheers
Chris

Nolan Garrett

unread,
Apr 18, 2014, 3:02:21 PM4/18/14
to ope...@googlegroups.com
Well, less luck for me unfortunately. Here are my logs. I am at work so haven't had a chance to review them at all, but it appears to be the same.
logs.tar.gz

Chris Jackson

unread,
Apr 18, 2014, 3:08:56 PM4/18/14
to ope...@googlegroups.com
Sorry Nolan,
I didn’t really expect too much change for your problem unfortunately. I can confirm though (as you probably already did) that the ACK messages are being sent… I’m not too sure where to next though….

I also had a quick look at Charles’ log, and it shows (basically) the same problem :(

Chris

Nolan Garrett

unread,
Apr 18, 2014, 3:34:05 PM4/18/14
to ope...@googlegroups.com
Me either. I suppose abandoning the RPi as a platform is an option. Not one I like, but I like having my network work even less. ;)

joe bob

unread,
Apr 18, 2014, 10:38:05 PM4/18/14
to ope...@googlegroups.com
One more thing to try - run without any log debugging. If the rpi is having trouble keeping up, the logging overhead may be too much. Also i assume you're running the rpi headless?

Brian Crosby

unread,
Apr 19, 2014, 6:02:15 PM4/19/14
to ope...@googlegroups.com
All

The latest (621 snapshot) is not usable with OH and Habmin? It is slowly starting up the discovery of my devices. But it doesn't seem to be working as it did...thoughts?

Chris Jackson

unread,
Apr 19, 2014, 6:09:27 PM4/19/14
to ope...@googlegroups.com
I don’t think the zwave binding has changed for quite a while, so maybe it’s just compiled badly or something?

Chris

Nolan Garrett

unread,
Apr 19, 2014, 9:33:57 PM4/19/14
to ope...@googlegroups.com
I'm guessing something changed. 1.4 was far more stable for me. I may try reverting.

From: Chris Jackson
Sent: ‎4/‎19/‎2014 3:09 PM

To: ope...@googlegroups.com
Subject: Re: [openhab] Re: OpenHAB 1.5 w/ Habmin Z-Wave - Unstable?

I don't think the zwave binding has changed for quite a while, so maybe it's just compiled badly or something?

Chris

Nolan Garrett

unread,
Apr 20, 2014, 1:42:47 AM4/20/14
to ope...@googlegroups.com
Yes, I am running headless, and trying without debugging didn't yield much difference. I'm thinking of reverting to 1.4, then doing 1.4 with the latest HABmin to see if I get different results.

Chris Jackson

unread,
Apr 20, 2014, 4:01:41 AM4/20/14
to ope...@googlegroups.com
For sure rhere have been some changes since 1.4, but I interpreted Brians question as being did something change in the latest version.  The binding hasn't changed much for a few months now (pretty much only database updates).

I think its quite clear that the problem on the pi has been around for a while - probably since 1.3. 

Nolan. Have you added more devices recently? This would make is less stable with the problem that seems to exist on the pi - 32 devices is a reasonably large network to have your messages repeated 4 or 5 times. 

Maybe you'll find that the 1.4 version was more stable on the pi for some reason so it might be wort a look, but unfortunately don't think it will ever work very well with the duplicated packet problem.

Chris


Sent from Samsung Mobile
You received this message because you are subscribed to the Google Groups "openhab" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openhab+u...@googlegroups.com.

Brian Crosby

unread,
Apr 20, 2014, 8:18:15 AM4/20/14
to ope...@googlegroups.com

You are correct. 1.4 was stable. 1.5 I can't use. Half the time it doesn't register. The other half it crashes my OH system.

This is a big problem as OH runs many critical functions in my home.

Chris Jackson

unread,
Apr 20, 2014, 8:43:20 AM4/20/14
to ope...@googlegroups.com
Hi Brian,
If I understand correctly (based on your email about using version 621 snapshot) is you’re using the openhab master, and not the HABmin build? As best as I can tell, there’s only been a very small number of commits since 1.4 (9th Feb) - 4 I think. -:

It’s strange that it crashes the system since most of the changes are pretty small. What makes you think that it’s the zwave binding that’s crashing your system? Have you tried just putting the 1.4 binding into your 1.5 OH to see if that fixes everything?

The only ‘significant’ change since 1.4 appears to be the handling of the wakeup class - the other changes are mostly logging or minor settings changes. Even this change is reasonably limited, but it does add a timer, and maybe this causes problems on your system?

Maybe you can supply some logs and I’ll try and take a look at it, although the codebase is quite old now compared to the HABmin version. I thought that the HABmin version was going to be merged a week or so back, but not so far...

Chris

Nolan Garrett

unread,
Apr 20, 2014, 12:02:38 PM4/20/14
to ope...@googlegroups.com
I did basically double my devices.

I ordered a Z-stick last night. I am going to try that on a different computer and see if I get improved results.

I may also try 1.4 as well, depending on how much time I get today.

From: Chris Jackson
Sent: ‎4/‎20/‎2014 1:01 AM

Nolan Garrett

unread,
Apr 20, 2014, 2:55:16 PM4/20/14
to ope...@googlegroups.com
So interestingly, I tried something a little different. I took the RPi completely down, power cycled it, and brought it back. I then let OpenHAB load and stayed out of HABmin. The result was that several minutes later when I did launch HABmin, most of my devices were online. As soon as I launched HABmin and the CPU starts hitting near 100%, my devices get flakey.

I am starting to wonder if the act of too much logging and running HABmin is actually maxing the CPU on this little device.

How can I tell what devices OpenHAB sees as online without launching HABmin, which seems to destroy my CPU on this RPi?

Chris Jackson

unread,
Apr 20, 2014, 3:51:26 PM4/20/14
to ope...@googlegroups.com
Hi Nolan,
I’d be surprised if HABmin is maxing out the CPU - it should only be polling once every 5 seconds.  However, to check this, you could try the following. If you have Chrome, then you can run up the debug console (from the menu, click Tools, then Developer tools). If you then select the Network tab, you should be able to see the requests (and responses) to/from HABmin. It will also tell you how long each request is taking. Between these requests, HABmin is doing nothing to your Pi, so you can tell how much it’s loading it…

The other thing to do is to look at the log and see if the retries are still happening even if HABmin isn’t running (basically do what you’ve just done, but check the log to see if the repetition is still there).  I’m going to guess that it is since the statement in the Wiki about this problem was written before HABmin was in the binding (the note was in the Wiki in November, but HABmin was only merged into the zwave binding in mid January).

One other point I noticed in the wiki - it states “There seems to be an issue with the binding running on the latest oracle VM Beta” - are there other VMs - other than from Oracle? Or older ones? It just reads as though this is a problem that only occurs with particular software (or maybe I’m just reading too much into this).

I suspect that until this ‘multiple receive’ problem is solved, things will be pretty dodgy unfortunately :(

Cheers
Chris


Nolan Garrett

unread,
Apr 20, 2014, 7:10:30 PM4/20/14
to ope...@googlegroups.com
I think you are right, to some degree. I am definitely finding a correlation to HABmin launches and specifically navigating the Z-Wave binding in HABmin to high CPU, but it does subside after a minute or so. I am noticing that with logging turned way down and if I stay away from launching HABmin or any other CPU intensive process during the initial OpenHAB run and binding initialization (meaning 5-10 minutes for best results), far more of my nodes are online.

I don't know a lot about how the Razberry and the Z-Wave binding interact, but I wonder - could an overloaded CPU cause this multiple receive problem?

It's hard for me to troubleshoot because the more logging I turn on, the worse my network seems to become.

I'll post more as I continue troubleshooting. 

Chris Jackson

unread,
Apr 20, 2014, 7:22:42 PM4/20/14
to ope...@googlegroups.com

> I think you are right, to some degree. I am definitely finding a correlation to HABmin launches and specifically navigating the Z-Wave binding in HABmin to high CPU, but it does subside after a minute or so. I am noticing that with logging turned way down and if I stay away from launching HABmin or any other CPU intensive process during the initial OpenHAB run and binding initialization (meaning 5-10 minutes for best results), far more of my nodes are online.
Now, when HABmin starts - that’s a different story since it downloads a bunch of data in one hit (all the HTML and JS files, and then information on bindings and sitemaps, items etc) so this might cause some issues to a poor RPi.


> I don't know a lot about how the Razberry and the Z-Wave binding interact, but I wonder - could an overloaded CPU cause this multiple receive problem?
I wouldn’t have thought so, but who knows. Given that the logs I looked at had maybe 4 or 5 repeats, it means that 3 or 4 acks are getting lost. The ack is 1 byte long - so we’re saying that 1 byte is getting lost due to overloaded CPU, but the normal 20 byte packets get through ok?


> I'll post more as I continue troubleshooting.
Good luck.

Cheer
Chris

Chris Jackson

unread,
Apr 22, 2014, 1:05:39 PM4/22/14
to ope...@googlegroups.com
Hi Nolan,
I’ve found a bug in the zwave binding - the handling of the send data response is looking at the incorrect part of the response. In the code below, the response is looking a message buffer byte 1, but this is actually always == 1 for a response (this byte is in fact the response/request byte, which is 1 for a response).

private void handleSendDataResponse(SerialMessage incomingMessage) {
logger.trace("Handle Message Send Data Response");
if(incomingMessage.getMessageBuffer()[2] != 0x00)
logger.debug("Sent Data successfully placed on stack.");
else
logger.error("Sent Data was not placed on stack due to error.");
}

Looking at your logs, when this occurs, we always seem to end up with a timeout, which is ‘correct' since the failure actually means the controller ignored this frame.  This can be seen in the log output below. The code is looking at the 01 in red, and saying that the request was successful, but it should be looking at the 00 in blue (also underlined!), which would have shown this was actually unsuccessful.

2014-04-18 11:28:53.796 DEBUG o.o.b.z.i.p.SerialMessage[:216]- Assembled message buffer = 01 04 01 60 00 9A 
2014-04-18 11:28:53.799 DEBUG o.o.b.z.i.p.s.RequestNodeInfoMessageClass[:38]- Request node info successfully placed on stack.
2014-04-18 11:28:58.764 ERROR o.o.b.z.i.p.ZWaveController$ZWaveSendThread[:800]- NODE 31: Timeout while sending message. Requeueing

The other slight downside from calling the getMessageBuffer method is it regenerates the frame again, which actually takes quite a bit of processing (regeneration of frame data, plus checksum) - on the RPi, this might be significant (maybe not that significant, but it won’t help!).  Looking back at the code, this bug has always been in the stack (at least since 1.3)

Looking at one of your logs from the RPi, the log is full of these mishandled errors. To be fair, this probably isn’t anything to do with the Pi since the processing of the protocol is handled in the zwave interface board - it may be to do with the large network (possibly). Correcting this error will at least allow us to properly interpret what’s going on!

At the end of the day, this is actually just a logging issue since (currently) there’s no return from the above check, and therefore either way the frame will timeout and be re-transmitted in 5 seconds. I’ve had a look at the OZW implementation, and it’s doing the same thing, and I can’t think of a better way to handle this at the moment.

I’ve created a new version with this fix in it, and a couple of other changes. The only other one of note is that I’ve added a ‘controller soft reset’ when the binding starts - I found after running fine for 4 or 5 weeks, my controller needed to be reset or it wouldn’t handle the update neighbour node requests. It seems like a useful thing to do on startup, but I’ll probably make it configurable before I push this to the master.

I’d recommend upgrading to this to see a) if it’s any better (I’m not too hopeful), and b) it will show the extent of these errors by correcting the logging.

Cheers
Chris

Nolan Garrett

unread,
Apr 22, 2014, 6:01:19 PM4/22/14
to ope...@googlegroups.com
Here is the logs with the latest HABmin and OpenHAB versions. Haven't had a chance to review yet.
zwave.tar.gz

Chris Jackson

unread,
Apr 22, 2014, 6:17:25 PM4/22/14
to ope...@googlegroups.com
I’ve not looked at this in detail, but…
amazingly, there’s (almost) no obvious duplication of packets in this log. The only duplicate packet I can find is right at the beginning and that might be a startup issue as I see there is a timeout and the receive thread gets restarted - this might have caused the first few frames to be resent… The maximum queue length is only 36 frames (down from something around 300 in the first log I looked at!) - 36 frames isn’t far off the mark given you have 37 nodes!

There are still an awful lot of timeouts, and dead nodes, so it’s not exactly working ‘well’, but I’m most surprised that the duplication is (seemingly) gone. If you change back to a previous version of the binding, does the duplication start again or is this down to something else you’ve changed?

Cheers
Chris

Nolan Garrett

unread,
Apr 22, 2014, 8:17:20 PM4/22/14
to ope...@googlegroups.com
Great question. Sorry for my noobness - what's the best way to get the previous revision of the binding?

Nolan Garrett

unread,
Apr 22, 2014, 10:34:33 PM4/22/14
to ope...@googlegroups.com
On another note, I got my Z-Stick today. On the same RPi, the issue persists. Implies to me it's not a Razberry problem, since I still have lots of dead nodes. Log attached of that run.

Next up, I am moving OpenHAB to high-performance Windows box. I'll use the same Z-Stick and see if I suddenly get a functional network.
zwave.log

Nolan Garrett

unread,
Apr 22, 2014, 10:49:00 PM4/22/14
to ope...@googlegroups.com
Wow, that was incredibly easy. Network seems up (couple of dead nodes, looking into those).  Can't get the UI to work but I think it's unrelated.

So I guess this is evidence that the RPi just can't keep a functional Z-Wave network at a certain point.

Chris Jackson

unread,
Apr 23, 2014, 4:13:52 AM4/23/14
to ope...@googlegroups.com
One easy way is to use the version from the openhab snapshot on CloudBees (https://openhab.ci.cloudbees.com/).  

Alternatively, in github, you can select the zwave file in the addons directory, and then click HISTORY in the top corner. This will give you a list of the updates to this file, and if ou click RAW it will download the file.

Cheer
Chris

Chris Jackson

unread,
Apr 23, 2014, 4:26:24 AM4/23/14
to ope...@googlegroups.com
Again, there’s no duplication in this log (at least, not obviously). It’s interesting as to why this seems to have gone away…. The other thing I have added to the new version of the HABmin build is to reset the controller when the binding starts - maybe (??) this is also helping. It would definitely be interesting to see if you change to an older version if this makes things go back to the duplication.


On 23 Apr 2014, at 03:34, Nolan Garrett <nolan....@gmail.com> wrote:

> <zwave.log>

Nolan Garrett

unread,
Apr 23, 2014, 10:12:10 PM4/23/14
to ope...@googlegroups.com
Did a quick test - didn't snag the log (I was working on a different issue). I found duplicate in the logs on the old version. Go back to the latest where you fixed the code you reference above and the duplication ends.

The network is far more stable on my Windows box than the RPi. I think the RPi just can't handle the load of the network. Having a different issue I'm going to start a new post about.

Also, I have a feature request for HABmin - it would be awesome to know what state a heal is in (or if it is even still running) from the network screen, without having to click through every device to find if any are "waiting". Also the ability to single out a single device for a heal would be awesome as well, but I think you have that on your list somewhere.

Chris Jackson

unread,
Apr 24, 2014, 7:29:49 AM4/24/14
to ope...@googlegroups.com
Also, I have a feature request for HABmin - it would be awesome to know what state a heal is in (or if it is even still running) from the network screen, without having to click through every device to find if any are "waiting". Also the ability to single out a single device for a heal would be awesome as well, but I think you have that on your list somewhere.
Feel free to add these as issues on the HABmin site. They are things I've thought about, so agree it is nice - I just need to think about how to add them...
The original concept with this particular part of HABmin was not to make this binding/device specific so that the interface was able to be used for different bindings with minimal change. So, adding an indicator that a device is in a HEAL state (preferably) needs to be done in an abstract way so that it doesn't make this zwave specific...

Ultimately, I hope that an interface similar to this will be implemented in ESH (OHv2) and I'd look to change HABmin to that interface, so I really want to keep to this philosophy.  That said, we also want to make it useful, and I agree some of this is useful, so if you raise an issue, I'll have a look at how it can be done...

Cheers
Chris

Chris Jackson

unread,
Apr 24, 2014, 10:08:45 AM4/24/14
to ope...@googlegroups.com
Did a quick test - didn't snag the log (I was working on a different issue). I found duplicate in the logs on the old version. Go back to the latest where you fixed the code you reference above and the duplication ends.
I missed this comment, or didn't quite click to its relevance first time round...  So, the new binding somehow fixes the RPi duplication error - that's interesting (and good!).  

I think I will now create a PR for this version since it fixes a few bugs and also adds the controller reset. I was going to make the controller reset configurable, but as no-one has reported any issues with it I might just leave it as a standard startup action since it ensures the controller is always in a known state when we start (which can't be a bad thing).

Cheers
Chris

Manolis N

unread,
Apr 24, 2014, 11:08:23 AM4/24/14
to ope...@googlegroups.com
Hello again Chris,
after having your latest zwave bundle running for a day or so, the mesh failed again :-(
It has been discussed many times in the forum, that this could be "normal" depending on the network topology, interfearence etc.. so I am not complaining :-) 
I am just reporting my findings.

I run the following tests in an attempt to diagnose what's happening and what might have an effect on the network stability when in a "most nodes dead" situation:
if I restart openhab, nothing changes
if I press "heal" on habmin, this time it did not fix anything (at least in a few minutes time) 
if I exit openhab, do some z-tool network heals (after 2 or 3 attempts I get a success mesages on all nodes) and reopen openhab, I have most of the nodes up!
now if I restart openhab it seems things are more dynamic, also if I replace the node.xmls with old ones when system was working, it seems I have everything running smooth again (or it could just happen anyway after some time or multiple re-starts ?)
(logs attached in they interest you)

I was wondering if starting with node xmls, that contain correct nodeNeighbors has an effect on the end result ? is there a point deleting/replacing node.xmls ? 

Furthermore, I have a new more important problem with this installation (dont know if it's related to this zwave  bundle)
I receive wrong "total kWh" values upon initialization. It "stabilized" to the correct number after an iteration or so
(the result is spikes on the charts, and wrong statistics of daily consumption which is based on max malues)

I had this problem again in the past, it turned out it was possible to have a "ghost" openhab process running, so upon restart I had two java process and things were getting seriously mixed up (I had submitted some logs then too). However this time, I am making sure there are no left-over java process upon restart, so it must be something else (?)
If you look at line 7921:
2014-04-24 16:18:29 - Meter Value = (59.078000)
this is a wrong value, it was 9.16 and returned to 9.16 again later on (not captured in this attached log)
If this persists, I will try to capture a clean minimum log, and re-submit, but based on the charts, it has happened twice during these tests

Thanks for all the efford you put in to this! 
Regards,
Manolis




--
You received this message because you are subscribed to a topic in the Google Groups "openhab" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openhab/_Wh_iurrIIo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openhab+u...@googlegroups.com.
zwave.zip

Chris Jackson

unread,
Apr 24, 2014, 11:17:47 AM4/24/14
to ope...@googlegroups.com
I am just reporting my findings.
Thanks - it's always useful to hear what's going on :) 

(logs attached in they interest you)
I'll take a look at these later and see if there's anything useful in them - thanks.
 
I was wondering if starting with node xmls, that contain correct nodeNeighbors has an effect on the end result ? is there a point deleting/replacing node.xmls ? 
No - the nodeNeighbor information is not used at all - other than by the network graph in HABmin. All the routing is done by the controller - we don't get to find out what routes it has set - we can only ask what neighbors are reported.

Starting the binding with an XML file is useful though as it will significantly reduce traffic during startup since the binding doesn't request a lot of static data (things like the manufacturer ID etc).

Cheers
Chris 
Reply all
Reply to author
Forward
0 new messages