Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Wait Timeout
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 51 - 74 of 74 - Collapse all  -  Translate all to Translated (View all originals) < Older 
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Greg Satz  
View profile  
 More options Nov 14 2012, 1:58 am
From: Greg Satz <s...@iranger.com>
Date: Tue, 13 Nov 2012 23:58:34 -0700
Local: Wed, Nov 14 2012 1:58 am
Subject: Re: [openzwave] Wait Timeout
Scott, I am not sure if the change to ignore m_expectedCallbackId  is going to resolve the issue. Here is what I see is going on.

First SET transaction works like we (and the code) expects it to. ZW_SEND_DATA request with callback ID is received in .76 seconds.

2012-11-13 20:43:00:564 Node012, Sending command (Callback ID=0x68, Expected Reply=0x13) - Basic Set (Node=12): 0x01, 0x0a, 0x00, 0x13, 0x0c, 0x03, 0x20, 0x01, 0x00, 0x05, 0x68, 0xa5
2012-11-13 20:43:00:565 Node012, Queuing command: BasicCmd_Get (Node=12): 0x01, 0x09, 0x00, 0x13, 0x0c, 0x02, 0x20, 0x02, 0x05, 0x69, 0xa5
2012-11-13 20:43:00:566 Node012,   ACK received CallbackId 0x68 Reply 0x13
2012-11-13 20:43:00:570 Node012,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2012-11-13 20:43:00:570 Node012, ZW_SEND_DATA delivered to Z-Wave stack
2012-11-13 20:43:01:262 Node012,   Received: 0x01, 0x09, 0x00, 0x04, 0x00, 0x0c, 0x03, 0x20, 0x03, 0x00, 0xde
2012-11-13 20:43:01:262
2012-11-13 20:43:01:262 Node012, Received Basic report from node 12: level=0
2012-11-13 20:43:01:263 Node012, Refreshed Value: old value=255, new value=0, type=uint8
2012-11-13 20:43:01:263 Node012, Changes to this value are not verified
2012-11-13 20:43:01:330 Node012,   Received: 0x01, 0x05, 0x00, 0x13, 0x68, 0x00, 0x81
2012-11-13 20:43:01:330 Node012,   ZW_SEND_DATA Request with callback ID 0x68 received (expected 0x68)
2012-11-13 20:43:01:330 Node012,   Expected reply was received
2012-11-13 20:43:01:330 Node012,   Message transaction complete

The second GET transaction is messy but does complete.

2012-11-13 20:43:01:331 Node012, Sending command (Callback ID=0x69, Expected Reply=0x04) - BasicCmd_Get (Node=12): 0x01, 0x09, 0x00, 0x13, 0x0c, 0x02, 0x20, 0x02, 0x05, 0x69, 0xa5
2012-11-13 20:43:01:333 Node012,   ACK received CallbackId 0x69 Reply 0x04
2012-11-13 20:43:01:336 Node012,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2012-11-13 20:43:01:336 Node012, ZW_SEND_DATA delivered to Z-Wave stack
2012-11-13 20:43:01:399 Node012,   Received: 0x01, 0x09, 0x00, 0x04, 0x00, 0x0c, 0x03, 0x20, 0x03, 0x00, 0xde
2012-11-13 20:43:01:399
2012-11-13 20:43:01:399 Node012, Received Basic report from node 12: level=0
2012-11-13 20:43:01:400 Node012, Refreshed Value: old value=0, new value=0, type=uint8
2012-11-13 20:43:01:400 Node012, Changes to this value are not verified
2012-11-13 20:43:03:332 Node012, WriteMsg Wait Timeout m_currentMsg=b0534730
2012-11-13 20:43:03:332

We see the basic report response come back from the remote device pretty quickly (.068 seconds) but we never receive the ZW_SEND_DATA request. The big question is what happened to that final ZW_SEND_DATA? Note that all the retransmissions use the same callback ID of 0x69.

2012-11-13 20:43:03:332 Node012, Sending command (Attempt 2, Callback ID=0x69, Expected Reply=0x04) - BasicCmd_Get (Node=12): 0x01, 0x09, 0x00, 0x13, 0x0c, 0x02, 0x20, 0x02, 0x05, 0x69, 0xa5
2012-11-13 20:43:03:334 Node012,   ACK received CallbackId 0x69 Reply 0x04
2012-11-13 20:43:03:336 Node012,   Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8
2012-11-13 20:43:03:336 Node012, ZW_SEND_DATA delivered to Z-Wave stack
2012-11-13 20:43:05:333 Node012, WriteMsg Wait Timeout m_currentMsg=b0534730
2012-11-13 20:43:05:333

We time out and retransmit again. So now we have two outstanding ZW_SEND_DATAs out there. Still no final ZW_SEND_DATA request with the final transmit status. Same callback ID of 0x69.

2012-11-13 20:43:05:333 Node012, Sending command (Attempt 3, Callback ID=0x69, Expected Reply=0x04) - BasicCmd_Get (Node=12): 0x01, 0x09, 0x00, 0x13, 0x0c, 0x02, 0x20, 0x02, 0x05, 0x69, 0xa5
2012-11-13 20:43:05:461 Node012,   ACK received CallbackId 0x69 Reply 0x04
2012-11-13 20:43:05:463 Node012,   Received: 0x01, 0x04, 0x01, 0x13, 0x00, 0xe9
2012-11-13 20:43:05:463 Node012, ERROR: ZW_SEND_DATA could not be delivered to Z-Wave stack
2012-11-13 20:43:05:633 Node012,   Received: 0x01, 0x05, 0x00, 0x13, 0x69, 0x00, 0x80
2012-11-13 20:43:05:633 Node012,   ZW_SEND_DATA Request with callback ID 0x69 received (expected 0x69)
2012-11-13 20:43:05:947 Node012,   Received: 0x01, 0x09, 0x00, 0x04, 0x00, 0x0c, 0x03, 0x20, 0x03, 0x00, 0xde
2012-11-13 20:43:05:947

So we transmit a third and final time based on the current code using 3 transmissions. This time the message isn't transmitted (0x00 response). However the final ZW_SEND_DATA request with callback ID of 0x69 is received. I am going to claim this is the first ZW_SEND_DATA request that just took this long to get back to the controller and our code wasn't patient enough to wait for it. We already received the basic report at 20:43:01:399. It just took 4.2 seconds for the ZW_SEND_DATA request to get there!

2012-11-13 20:43:05:948 Node012, Received Basic report from node 12: level=0
2012-11-13 20:43:05:948 Node012, Refreshed Value: old value=0, new value=0, type=uint8
2012-11-13 20:43:05:948 Node012, Changes to this value are not verified

We process the next read message which is the second (attempt 2) basic report.

2012-11-13 20:43:05:948 Node012,   Expected reply and command class was received
2012-11-13 20:43:05:948 Node012,   Message transaction complete
2012-11-13 20:43:05:948

Then we complete the transaction based on the ZW_SEND_DATA request callback id 0x69. It is done but more responses are flowing in from the retransmissions.

2012-11-13 20:43:09:679   Received: 0x01, 0x05, 0x00, 0x13, 0x69, 0x00, 0x80
2012-11-13 20:43:09:679   ZW_SEND_DATA Request with callback ID 0x69 received (expected 0x00)
2012-11-13 20:43:09:679 WARNING: Unexpected Callback ID received

The second retransmission's ZW_SEND_DATA request with the same callback ID of 0x69 arrives but the code is no longer expecting it.

2012-11-13 20:43:09:757 Node012,   Received: 0x01, 0x09, 0x00, 0x04, 0x00, 0x0c, 0x03, 0x20, 0x03, 0x00, 0xde
2012-11-13 20:43:09:757
2012-11-13 20:43:09:757 Node012, Received Basic report from node 12: level=0
2012-11-13 20:43:09:757 Node012, Refreshed Value: old value=0, new value=0, type=uint8
2012-11-13 20:43:09:757 Node012, Changes to this value are not verified
2012-11-13 20:43:09:958 Node012,   Received: 0x01, 0x09, 0x00, 0x04, 0x00, 0x0c, 0x03, 0x20, 0x03, 0x00, 0xde
2012-11-13 20:43:09:959
2012-11-13 20:43:09:959 Node012, Received Basic report from node 12: level=0
2012-11-13 20:43:09:959 Node012, Refreshed Value: old value=0, new value=0, type=uint8
2012-11-13 20:43:09:959 Node012, Changes to this value are not verified

Then two more basic reports flow in. Unless the device is set up to asynchronously deliver basic reports I am at a loss to explain this one other then they are retransmissions from the remove device because it never saw an ACK from the controller.

So the big question for me is: why does the first round trip sequence (SET) complete in under 1 second but the second transaction (GET) takes over 4 seconds? This is the smoking gun here. I would bet on RF interference or nodes that are barely within range. It is claimed the Vera did a better job but I would love to see more message sequences from the Vera as it is quite possible it had the same issue and their timeouts were more adaptive then ours.

I don't think relaxing the m_expectedCallbackId test will change anything since all the retransmitted frames use the same callback id.

I have been adding code to measure lost messages and round trip times in an effort to categorize device reliability. Right now there is no clear feedback on how reliable the communications are between devices. It also isn't clear why things are unreliable. Getting some more data will help us determine if it is environment, library limitations (or bugs), user error, etc. Also better RTT data will help us improve our timeout algorithms. TCP has solved all these problems so there is no doubt we can too.

Thanks,
Greg

On Nov 13, 2012, at 7:56 PM, dscott wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
dscott  
View profile  
 More options Nov 14 2012, 6:59 am
From: dscott <dscottmill...@gmail.com>
Date: Wed, 14 Nov 2012 03:59:01 -0800 (PST)
Local: Wed, Nov 14 2012 6:59 am
Subject: Re: [openzwave] Wait Timeout

Greg:

I agree with your analysis...and I began a similar analysis of the log
sequence as a possible example of the controller "hanging."  It would be
great if we can really fix the issue over time (and, like you, I'm focused
on it), but the change I've made does address it in the near term.  I've
been thinking about changing the code to use a different callback id for
retries (this would involve changing that byte as well as the checksum
before resending), which would help with diagnostics.

It isn't clear whether the frame received at 43:01.399 is a duplicate
response to the 43:00.564 request or a really fast response to the
43:01.331 request.  If it's the latter, then I'm not sure we really care
what happened to the second ZW_SEND_DATA message in the transaction...we
asked for data and received it.  (That was the logic behind relaxing the
condition.)  But if it is a duplicate response to the first message (and I
have seen a fair number of "dupes" on my network over time), then we
haven't really re-read the value and we have to wait for the second
ZW_SEND_DATA to come through.

I'll try to prioritize changing the callback ID for each retry and see what
that tells us about which messages belong to which transaction.

Scott


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Nov 14 2012, 9:01 am
From: utz <utz.roe...@gmail.com>
Date: Wed, 14 Nov 2012 06:01:25 -0800 (PST)
Local: Wed, Nov 14 2012 9:01 am
Subject: Re: [openzwave] Wait Timeout

The trace above is just switching on node 12. I have a lot of these retries
as well in other parts (e.g. at system start up). I will try to add a more
complete (and large) logfile later (Where is a good place to dump these?)

I wonder if it would be possible to add a configuration flag which would
change message transmission behavior. As I understand Basic Set is sent
with TRANSMIT_OPTION_ACK which requires the message to be acknowledged at
each hop towards the destination. However, openzwave uses immediately after
the Basic Set a Basic Get to confirm if the transmission was successful.
Thus, two mechanisms ensuring reliability are used concurrently. In my
opinion either we should send a message with TRANSMIT_OPTION_ACK and in
this case we would not need the Basic Get (as we should know from the Basic
Set with TRANSMIT_OPTION_ACK if it was delivered or not) or we should use  
Basic Set without TRANSMIT_OPTION_ACK followed by Basic Get. (Which seems
to be more what the vera controller does; vera knows as well if messages
were properly processed but it does that without the Basic Get after the
Basic Set).

I do not believe it is a network problem with lost messages as I can use
the vera controller included in the same network to send messages to node
12 without problem (both controllers use the same network and same routing,
one works one has some problems). I will try to add more traces to show
similarities/differences here.  


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
dscott  
View profile  
 More options Nov 14 2012, 11:28 am
From: dscott <dscottmill...@gmail.com>
Date: Wed, 14 Nov 2012 08:28:57 -0800 (PST)
Local: Wed, Nov 14 2012 11:28 am
Subject: Re: [openzwave] Wait Timeout

Don't do too much testing just yet. I'm going to try to write the code that
changes the callback ID for retries today or tomorrow. That will
(hopefully) give us better insight into what's going on.
On your other point about changing the TRANSMIT_OPTION, I'm open to doing
that for efficiency purposes once we know what's going on.  You may not
have observed the following behavior on your network, but others (including
me) have seen situations in which the "Set" is executed and acknowledged
flawlessly but the value doesn't actually change.  Similarly, we have seen
devices report spurious changes (like temperature going from 68F to 0F to
68F in three sequential polls within 60 seconds).  This abnormal behavior
was the impetus behind "verifying" changed values and only trusting a value
change when we've actually read it (sometimes twice!) from the device.  
Otherwise, we found that we couldn't "trust" the values held in OZW Value
objects.

But we may be able to be more sparing in the use of TRANSMIT_OPTION_ACK...

I'll try to commit code with the changing callback IDs soon.

Scott


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Nov 14 2012, 11:45 am
From: utz <utz.roe...@gmail.com>
Date: Wed, 14 Nov 2012 08:45:08 -0800 (PST)
Local: Wed, Nov 14 2012 11:45 am
Subject: Re: [openzwave] Wait Timeout

Ok, I wait a bit with testing to see what your changes may do.

I think it might be useful to be able to disable TRANSMIT_OPTION_ACK. I do
not see the need to have two concurrent end-to-end reliability mechanisms
on the go.

If you google "zwave routing" you find a document describing the ack
process. It seems to me that a message requiring an ACK transported over 1
intermediate hop requires 8 individual messages. Without ack would be 2.
Given that we confirm anyway after we should maybe reduce network load.

Utz


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Nov 14 2012, 11:56 am
From: Greg Satz <s...@iranger.com>
Date: Wed, 14 Nov 2012 09:55:57 -0700
Local: Wed, Nov 14 2012 11:55 am
Subject: Re: [openzwave] Wait Timeout

Staring at the Vera logs Utz sent revealed some different behavior then what we see from the library. It is why I was asking what firmware his Vera's controller is using.

With the vera the get data applicationcontroller update shows up after the ZW_SEND_DATA request. With the library it shows up in between the ZW_SEND_DATA response and the request. All of the Vera log's show round trips less then .2 seconds though some of his health checks didn't retrieve an ack.

Also the Vera's basic sets didn't return a basic report so I can see why you think the basic report at 20:43:01:262 may be from another transaction.

What I do see is that our transactions are really messy and lots of data is flying about so keeping track of it all is very hard. The Vera transaction is pretty straight forward with simple command/response interactions. Is it the controller firmware that is different or what we are doing?

Is it possible to pop out the usb stick from the Vera and use it against the library to see if anything is different?

Thanks,
Greg

On Nov 14, 2012, at 4:59 AM, dscott wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Nov 14 2012, 1:24 pm
From: utz <utz.roe...@gmail.com>
Date: Wed, 14 Nov 2012 10:24:55 -0800 (PST)
Local: Wed, Nov 14 2012 1:24 pm
Subject: Re: [openzwave] Wait Timeout

I agree that the Vera transactions for the same thing look much simpler. If this way works it might be beneficial to try this (simpler is better).

The Vera2 I have has the zwave chip integrated, I can't take it out. However, Vera 1 had it external and it was the same as a zstick. The chip is used via serial in Vera so I assume it is the same as a USB zstick just integrated. Vera is an openwrt Linux router with odd processor. Compiling openzwave for this hardware is generally possible but not easy (tool chain setup).

Generally I am quite convinced it is the same hardware, not with a particular special firmware. I have a look if I can check firmware numbers...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
dscott  
View profile  
 More options Nov 15 2012, 7:03 pm
From: dscott <dscottmill...@gmail.com>
Date: Thu, 15 Nov 2012 16:03:23 -0800 (PST)
Local: Thurs, Nov 15 2012 7:03 pm
Subject: Re: [openzwave] Wait Timeout

I've been testing the code that changes the callback ID for retries of the
same message.  Attached is an interesting log snippet that highlights some
issues with our code.  Some observations below, then a summary of the
sequencing under the five attempts for this message (attempt 5 is a "clean"
transaction, executing as expected).  I'll be investigating some of the
issues cited below and encourage others to do the same.  I'll commit a
change that changes the callback ID shortly...I haven't seen any adverse
effects.

Observations:
1) interspersed second and third attempts demonstrate that the "report"
may(?) arrive before the ZW_SEND_DATA Request message...or was the report
associated with Attempt 1?
2) not sure it's a coincidence that the "ZW_SEND_DATA_Request" message
appears to arrive from the controller at the same instant (05:39:14:078)
the timeout occurs--need to look at the code more carefully
3) note the overlapping "Put"s (data received from the controller and
"Write".  Even though some data comes in from the controller, it isn't
passed to the application until  later(see 05:39:14:098)
4) further effect of the same issue, the CAN sent by the controller isn't
provided to the application until 100ms later (received at 05:39:14:158,
processed at :278)
5) we need to focus on the cause of the delay between receipt of bytes from
the controller and the time they are supplied to the application (maybe
make sure there's nothing in the "read from controller" buffer before
writing?)
6) I have to look at my code to evaluate the message at 05:39:14:448 (not
sure what makes it think the callback ID is 0xd8)
7) As the summary (below) shows, when all was complete, three messages were
sent without a CAN and three ZW_SEND_DATA Receiveds and 3 Reports came back

First attempt
05:39:03:342 Message written to controller (attempt 1, callback 0xd5)
05:39:03:523 ZW_SEND_DATA delivered (presumably 0xd5)
05:39:03:783 ZW_SEND_DATA Request (callback 0xd5)
05:39:08:350 Timeout

Second and third attempts (interspersed)
05:39:08:390 Message written to controller (attempt 2, callback 0xd6)
05:39:08:620 ZW_SEND_DATA delivered (presumably 0xd6)
05:39:10:523 Report received (unclear whether this is 0xd5 or 0xd6)
05:39:14:078 Timeout
05:39:14:098 ZW_SEND_DATA Request message received from controller, but not
processed until 05:39:14:208
05:39:14:138 Message written to controller (attempt 3, callback 0xd7)
05:39:14:158 Controller sends CAN , but not processed until 05:39:14:278
05:39:14:208 ZW_SEND_DATA Request (callback 0xd6) from 05:39:14:098
processed (but now expecting 0xd7)
05:39:14:238 Report received from controller, but not processed until
05:39:14:398
05:39:14:278 Application processes the 05:39:14:158 CAN, triggering resend

Fourth attempt
05:39:14:308 Message written to controller (attempt 4, callback 0xd8)
05:39:14:328 Controller sends CAN , but not processed until 05:39:14:488
05:39:14:398 Report received from controller
05:39:14:488 Application processes the 05:39:14:328 CAN, triggering resend

Fifth attempt
05:39:14:528 Message written to controller (attempt 5, callback 0xd9)
05:39:14:639 ZW_SEND_DATA delivered (presumably 0xd9)
05:39:14:759 ZW_SEND_DATA Request (callback 0xd9)
05:39:14:949 Report received from controller
  Message transaction complete

Summary
5 messages written
3 ZW_SEND_DATA delivered messages received, 2 CANs received
3 ZW_SEND_DATA Request messages received (0xd5, 0xd6, 0xd9...the three that
weren't CANned)
3 reports received

Scott

  Log Snippet.txt
14K Download

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Dec 5 2012, 6:16 pm
From: utz <utz.roe...@gmail.com>
Date: Wed, 5 Dec 2012 15:16:43 -0800 (PST)
Local: Wed, Dec 5 2012 6:16 pm
Subject: Re: [openzwave] Wait Timeout

Here another theory:

when using FUNC_ID_ZW_SEND_DATA the following things happen:
1) an ACK is received from the controller
2) then a message starting with SOF is coming in, normally of length 4
(e.g. 0x01, 0x04, 0x01, 0x13, 0x00, 0xe9). I believe this message just
indicates that the controller is processing the FUNC_ID_ZW_SEND_DATA
request. This message does not contain the CallbackId. I guess we need to
send an ACK to the controller for this message (which I think is done).
3) then another message starting with SOF is coming in, normally of length
5 (e.g. 0x01, 0x05, 0x00, 0x13, 0x9f, 0x00, 0x76). This message confirms
that the FUNC_ID_ZW_SEND_DATA request has been processed by the controller.
Here the CallbackId is included.  I guess we need to send an ACK to the
controller for this message (which I think is done).

I think that openzwave gets mixed up with the messages in step 2 and 3.
Message in 2 is interpreted as it would be 3, the callbackid is not found
(as not included) and the message is marked as failed delivered and a
resend is triggered. While this is happening the message of step 3 might
come in. Depending on timing this message may still be processed and all is
good, if not we have 3 retries and a fail at the end.

Can someone have a look at the code that understands it better and see if
this might be the culprit regarding the behaviour described here?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Dec 5 2012, 6:48 pm
From: Greg Satz <s...@iranger.com>
Date: Wed, 5 Dec 2012 16:48:49 -0700
Local: Wed, Dec 5 2012 6:48 pm
Subject: Re: [openzwave] Wait Timeout
Try increasing your timeout. In the file cpp/src/Defs.h at:

#define RETRY_TIMEOUT   2000            // Retry send after two seconds        

change the 2000 to 10000 and report back what happens.

Thanks,
Greg

On Dec 5, 2012, at 4:16 PM, utz wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Dec 5 2012, 7:29 pm
From: utz <utz.roe...@gmail.com>
Date: Wed, 5 Dec 2012 16:29:18 -0800 (PST)
Local: Wed, Dec 5 2012 7:29 pm
Subject: Re: [openzwave] Wait Timeout

I can try this tomorrow. I assume it will only help partly as timing is one issue, interpreting a field in a message that is not there causes other issues as well (this is what I assume). I can't check it with the code in trunk as this currently spins on a printf when errors occur (I assume recent changes to deal with this).


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
D. Scott Miller  
View profile  
 More options Dec 5 2012, 9:52 pm
From: "D. Scott Miller" <sc...@ustiger.net>
Date: Thu, 6 Dec 2012 02:52:06 +0000
Local: Wed, Dec 5 2012 9:52 pm
Subject: RE: [openzwave] Wait Timeout
This thread has been quiet for a while so I'm not certain of the context, but:
1) The theory offered today by Utz doesn't hold water.  The library recognizes the difference between the two incoming messages (the 4 byte one and the 5 byte one).  So the sequencing issue (which I think is the basis for this conjecture) remains an anomaly.  I can provide logs to demonstrate this.

I can also confirm that extending the wait timeout doesn't solve the issue.

But maybe someone else (Linux?) will get a different result, so it's probably still worth a try.

Scott


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Dec 6 2012, 12:34 am
From: Greg Satz <s...@iranger.com>
Date: Wed, 5 Dec 2012 22:34:59 -0700
Local: Thurs, Dec 6 2012 12:34 am
Subject: Re: [openzwave] Wait Timeout
I agree with Scott. The second message comes from the controller as a response. The 0x01 in the third byte is checked by the code before the the function code in the fourth byte. The third message containing the callback ID has a 0x00 in the third byte and is a request. So you can see the code is able to distinguish these two messages correctly.

Scott, are the logs utz posted earlier in this thread show these anomalies? How does increasing the timeout change (or not change) the behavior? I would think it should minimize the issues at least.

Thanks,
Greg

On Dec 5, 2012, at 7:52 PM, D. Scott Miller wrote:

Utz's original message:

Here another theory:

when using FUNC_ID_ZW_SEND_DATA the following things happen:
1) an ACK is received from the controller
2) then a message starting with SOF is coming in, normally of length 4 (e.g. 0x01, 0x04, 0x01, 0x13, 0x00, 0xe9). I believe this message just indicates that the controller is processing the FUNC_ID_ZW_SEND_DATA request. This message does not contain the CallbackId. I guess we need to send an ACK to the controller for this message (which I think is done).
3) then another message starting with SOF is coming in, normally of length 5 (e.g. 0x01, 0x05, 0x00, 0x13, 0x9f, 0x00, 0x76). This message confirms that the FUNC_ID_ZW_SEND_DATA request has been processed by the controller. Here the CallbackId is included.  I guess we need to send an ACK to the controller for this message (which I think is done).

I think that openzwave gets mixed up with the messages in step 2 and 3. Message in 2 is interpreted as it would be 3, the callbackid is not found (as not included) and the message is marked as failed delivered and a resend is triggered. While this is happening the message of step 3 might come in. Depending on timing this message may still be processed and all is good, if not we have 3 retries and a fail at the end.

Can someone have a look at the code that understands it better and see if this might be the culprit regarding the behaviour described here?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Dec 6 2012, 5:05 am
From: utz <utz.roe...@gmail.com>
Date: Thu, 6 Dec 2012 02:05:49 -0800 (PST)
Local: Thurs, Dec 6 2012 5:05 am
Subject: Re: [openzwave] Wait Timeout

Scott, Greg,

yep, my assessment of the situation is crap; that is not the problem
indeed. I think I was staring too much on program code last night and got
lost.

The problem is shown in the first post of the thread. Two times the same
case of transmitting to a node. In one case it works, the other one I get
the 3 retries and then the stack gives up to deliver the message. I have
this issue a lot in my setup, basically communicating with all nodes. As a
result I have messages sometimes delivered and sometimes not. This is bad
if you control your heating at home as setpoint changes might get missed
(heating running all night). So I have at the moment 3 options : 1) build a
layer on top of openzwave which monitors success of transmissions and then
retransmit on this level if needed. 2) fix openzwave 3) write my own little
stack that can do far less than openzwave but is more reliable (for the few
things I want to do).  I would prefer 2) as 1) is a hack and 3) is
duplication of work already done.

I try the timer setting, I run on linux, maybe its different there.

Utz


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Dec 6 2012, 11:54 am
From: Greg Satz <s...@iranger.com>
Date: Thu, 6 Dec 2012 09:54:30 -0700
Local: Thurs, Dec 6 2012 11:54 am
Subject: Re: [openzwave] Wait Timeout

I am beginning to wonder if using SUC has something to do with more varying delays. The fact that it works more consistently with Vera is what keeps me coming back to what we are doing differently in the library.

I agree the way the library handles delays needs to change. Changing your delay should change the behavior toward the better. For some reason Scott doesn't see that and this may be part of the clue too. In a soon to be committed patch a notification for timeouts will be generated and counters will track that interaction. You folks with larger networks are testing the library further and it will require more changes to keep up with it.

Thanks,
Greg

On Dec 6, 2012, at 3:05 AM, utz wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
D. Scott Miller  
View profile  
 More options Dec 6 2012, 1:38 pm
From: "D. Scott Miller" <sc...@ustiger.net>
Date: Thu, 6 Dec 2012 18:38:02 +0000
Local: Thurs, Dec 6 2012 1:38 pm
Subject: RE: [openzwave] Wait Timeout

I don't use a SUC on the network and have the problem (as you know).

From: openzwave@googlegroups.com [mailto:openzwave@googlegroups.com] On Behalf Of Greg Satz
Sent: Thursday, December 06, 2012 11:55 AM
To: openzwave@googlegroups.com
Subject: Re: [openzwave] Wait Timeout

I am beginning to wonder if using SUC has something to do with more varying delays. The fact that it works more consistently with Vera is what keeps me coming back to what we are doing differently in the library.

I agree the way the library handles delays needs to change. Changing your delay should change the behavior toward the better. For some reason Scott doesn't see that and this may be part of the clue too. In a soon to be committed patch a notification for timeouts will be generated and counters will track that interaction. You folks with larger networks are testing the library further and it will require more changes to keep up with it.

Thanks,
Greg

On Dec 6, 2012, at 3:05 AM, utz wrote:

Scott, Greg,

yep, my assessment of the situation is crap; that is not the problem indeed. I think I was staring too much on program code last night and got lost.

The problem is shown in the first post of the thread. Two times the same case of transmitting to a node. In one case it works, the other one I get the 3 retries and then the stack gives up to deliver the message. I have this issue a lot in my setup, basically communicating with all nodes. As a result I have messages sometimes delivered and sometimes not. This is bad if you control your heating at home as setpoint changes might get missed (heating running all night). So I have at the moment 3 options : 1) build a layer on top of openzwave which monitors success of transmissions and then retransmit on this level if needed. 2) fix openzwave 3) write my own little stack that can do far less than openzwave but is more reliable (for the few things I want to do).  I would prefer 2) as 1) is a hack and 3) is duplication of work already done.

I try the timer setting, I run on linux, maybe its different there.

Utz

On Thursday, December 6, 2012 5:34:59 AM UTC, satz wrote:

I agree with Scott. The second message comes from the controller as a response. The 0x01 in the third byte is checked by the code before the the function code in the fourth byte. The third message containing the callback ID has a 0x00 in the third byte and is a request. So you can see the code is able to distinguish these two messages correctly.

Scott, are the logs utz posted earlier in this thread show these anomalies? How does increasing the timeout change (or not change) the behavior? I would think it should minimize the issues at least.

Thanks,
Greg

On Dec 5, 2012, at 7:52 PM, D. Scott Miller wrote:

Utz's original message:

Here another theory:

when using FUNC_ID_ZW_SEND_DATA the following things happen:
1) an ACK is received from the controller
2) then a message starting with SOF is coming in, normally of length 4 (e.g. 0x01, 0x04, 0x01, 0x13, 0x00, 0xe9). I believe this message just indicates that the controller is processing the FUNC_ID_ZW_SEND_DATA request. This message does not contain the CallbackId. I guess we need to send an ACK to the controller for this message (which I think is done).
3) then another message starting with SOF is coming in, normally of length 5 (e.g. 0x01, 0x05, 0x00, 0x13, 0x9f, 0x00, 0x76). This message confirms that the FUNC_ID_ZW_SEND_DATA request has been processed by the controller. Here the CallbackId is included.  I guess we need to send an ACK to the controller for this message (which I think is done).

I think that openzwave gets mixed up with the messages in step 2 and 3. Message in 2 is interpreted as it would be 3, the callbackid is not found (as not included) and the message is marked as failed delivered and a resend is triggered. While this is happening the message of step 3 might come in.. Depending on timing this message may still be processed and all is good, if not we have 3 retries and a fail at the end.

Can someone have a look at the code that understands it better and see if this might be the culprit regarding the behaviour described here?

!DSPAM:50c06e06646833165249928!


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Dec 6 2012, 3:42 pm
From: utz <utz.roe...@gmail.com>
Date: Thu, 6 Dec 2012 12:42:12 -0800 (PST)
Local: Thurs, Dec 6 2012 3:42 pm
Subject: Re: [openzwave] Wait Timeout

I did this and it improved things. Network seems to operate better (Still
looking at it). (I am not using the most recent version from trunk btw)

However, still see some of these errors. Below an example sending to node
9. It says here that it expected 0x00 as callbackid and it received 0x6e,
therefore it assumes it must be retransmitted. However, 0x6e seems
perfectly ok to me as a callback. Why would it expect 0x00?

2012-12-06 20:33:36:894 Node009, Queuing command: SwitchBinary Set
(Node=9): 0x01, 0x0a, 0x00, 0x13, 0x09, 0x03, 0x25, 0x01, 0xff, 0x05, 0x6e,
0x5c

2012-12-06 20:33:36:894 Node009, Queuing command: SwitchBinaryCmd_Get
(Node=9): 0x01, 0x09, 0x00, 0x13, 0x09, 0x02, 0x25, 0x02, 0x05, 0x6f, 0xa3

2012-12-06 20:33:36:899 Node009, WriteMsg WriteNextMsg m_currentMsg=af68e398

2012-12-06 20:33:36:899

2012-12-06 20:33:36:899 Node009, Sending command (Callback ID=0x6e,
Expected Reply=0x13) - SwitchBinary Set (Node=9): 0x01, 0x0a, 0x00, 0x13,
0x09, 0x03, 0x25, 0x01, 0xff, 0x05, 0x6e, 0x5c

2012-12-06 20:33:36:901 Node009,   ACK received CallbackId 0x6e Reply 0x13

2012-12-06 20:33:36:904 Node009,   Received: 0x01, 0x04, 0x01, 0x13, 0x01,
0xe8

2012-12-06 20:33:36:904 Node009, ZW_SEND_DATA delivered to Z-Wave stack

2012-12-06 20:33:40:998 Node009,   Received: 0x01, 0x05, 0x00, 0x13, 0x6e,
0x01, 0x86

2012-12-06 20:33:40:999 Node009,   ZW_SEND_DATA Request with callback ID
0x6e received (expected 0x6e)

2012-12-06 20:33:40:999 Node009, ERROR: ZW_SEND_DATA failed. No ACK
received - device may be asleep.

2012-12-06 20:33:40:999 Node009,   WARNING: Device is not a sleeping node -
retrying the send.

2012-12-06 20:33:40:999 Node009,   Expected callbackId was received

2012-12-06 20:33:40:999 IsExpectedReply: m_expectedNodeId = 9
m_expectedReply = 13

2012-12-06 20:33:41:010 Node009,   Received: 0x01, 0x05, 0x00, 0x13, 0x6e,
0x00, 0x87

2012-12-06 20:33:41:011 Node009,   ZW_SEND_DATA Request with callback ID
0x6e received (expected 0x00)

2012-12-06 20:33:41:011 Node009, WARNING: Unexpected Callback ID received

2012-12-06 20:33:41:011 Node009,   Expected reply was received

2012-12-06 20:33:41:011 Node009,   Message transaction complete

2012-12-06 20:33:41:011

2012-12-06 20:33:41:011 Node009, WriteMsg WriteNextMsg m_currentMsg=a9c41a88

2012-12-06 20:33:41:012

2012-12-06 20:33:41:012 Node009, Sending command (Callback ID=0x6f,
Expected Reply=0x04) - SwitchBinaryCmd_Get (Node=9): 0x01, 0x09, 0x00,
0x13, 0x09, 0x02, 0x25, 0x02, 0x05, 0x6f, 0xa3

2012-12-06 20:33:41:014 Node009,   ACK received CallbackId 0x6f Reply 0x04

2012-12-06 20:33:41:017 Node009,   Received: 0x01, 0x04, 0x01, 0x13, 0x01,
0xe8

2012-12-06 20:33:41:018 Node009, ZW_SEND_DATA delivered to Z-Wave stack

2012-12-06 20:33:41:029 Node009,   Received: 0x01, 0x05, 0x00, 0x13, 0x6f,
0x00, 0x86

2012-12-06 20:33:41:030 Node009,   ZW_SEND_DATA Request with callback ID
0x6f received (expected 0x6f)

2012-12-06 20:33:41:037 Node009,   Received: 0x01, 0x09, 0x00, 0x04, 0x00,
0x09, 0x03, 0x25, 0x03, 0xff, 0x21

2012-12-06 20:33:41:037

2012-12-06 20:33:41:037 Node009, Received SwitchBinary report from node 9:
level=On

2012-12-06 20:33:41:037 Node009, Refreshed Value: old value=false, new
value=true, type=bool

2012-12-06 20:33:41:038 Node009, Changes to this value are not verified

2012-12-06 20:33:41:038 Node009,   Expected reply and command class was
received

2012-12-06 20:33:41:038 Node009,   Message transaction complete


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Dec 7 2012, 2:46 am
From: Greg Satz <s...@iranger.com>
Date: Fri, 7 Dec 2012 00:46:09 -0700
Local: Fri, Dec 7 2012 2:46 am
Subject: Re: [openzwave] Wait Timeout

What I am seeing in this log sequence is a set request not getting an ACK in 4 seconds. Prior to increasing the timeout the library would have tried to retransmit after two seconds. With the increased timer the controller gets a chance to do its radio layer retransmissions until it decides to give up and return a No Ack error. What is even more interesting is the controller returns a success status 11ms later. We get the unexpected callback ID because the No Ack response satisfied the library's state machine that the transaction has completed and the callback ID is now unexpected. I haven't seen this fail/success scenario before because our timeout timer got in the way. The succeeding get responds in 12ms and the resulting data arrives 7ms after that.

Wireless networks aren't going to be 100% reliable. The question is how do we recover and whether we lose data. In this case the set actually worked but it took a little more then 4 seconds to figure it out. We cannot control how the controller and its radios work. So I wouldn't call this a bug and the following get should show that the 'on' state is accurate.

To the question of why one transaction takes over 4 seconds and another takes less then quarter second is a good question. It could be environmental, such as radio interference or it could still be something to do with the library though it isn't clear how yet. It could also have to do with the difference in location of your usb stick vs. where the vera is/was situated. Even a few feet one direction or another can make a difference depending on your topology. It would help if we had more statistics in the library that can track trends and show where the network may be experiencing problems. That is what we are trying to work on now.

Thanks,
Greg

On Dec 6, 2012, at 1:42 PM, utz wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
D. Scott Miller  
View profile  
 More options Dec 7 2012, 6:44 am
From: "D. Scott Miller" <sc...@ustiger.net>
Date: Fri, 7 Dec 2012 11:44:43 +0000
Local: Fri, Dec 7 2012 6:44 am
Subject: RE: [openzwave] Wait Timeout

I assume that you changed the timeout to 10000 as Greg suggested.  If you can, please be on the lookout for messages received right at the end of the timeout threshold.  That was a symptom I noted…the longer I set the timeout, the longer it took for some responses to come back.  If you see the same behavior, that would suggest that there’s a problem with OZW’s event/wait mechanics.  If you don’t see it, then I must have been dreaming!

Scott


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Dec 7 2012, 12:16 pm
From: Greg Satz <s...@iranger.com>
Date: Fri, 7 Dec 2012 10:16:36 -0700
Local: Fri, Dec 7 2012 12:16 pm
Subject: Re: [openzwave] Wait Timeout

He had to have changed the timeout to something longer then the default 2000ms or the timeout would have fired prior to the no ack failure. I recall you having issues with the wait code since Mal wrote it. We spent quite some time trying to reproduce what you saw and never did. Seems like that is still an issue for you. I don't recall how it looked in the logs but it might be useful to post a snippet so others know what to look for. Are you saying that elongating the timeout timer causes the problem to occur?

Thanks,
Greg

On Dec 7, 2012, at 4:44 AM, D. Scott Miller wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
D. Scott Miller  
View profile  
 More options Dec 7 2012, 3:48 pm
From: "D. Scott Miller" <sc...@ustiger.net>
Date: Fri, 7 Dec 2012 20:48:28 +0000
Local: Fri, Dec 7 2012 3:48 pm
Subject: RE: [openzwave] Wait Timeout

I was asking because the behavior I saw was that nothing would come back during the timeout waiting period, but then it would (somehow!) show up when the timeout elapsed.

I asked the question because his log would have been consistent with what I have seen if he had set the timeout to 4000 instead of 10000.

I'll locate a snippet and post it.  If you don't see something soon, it'll be because I've had to reset things to regenerate the error (which can take some time to show up).

From: openzwave@googlegroups.com [mailto:openzwave@googlegroups.com] On Behalf Of Greg Satz
Sent: Friday, December 07, 2012 12:17 PM
To: openzwave@googlegroups.com
Subject: Re: [openzwave] Wait Timeout

He had to have changed the timeout to something longer then the default 2000ms or the timeout would have fired prior to the no ack failure. I recall you having issues with the wait code since Mal wrote it. We spent quite some time trying to reproduce what you saw and never did. Seems like that is still an issue for you. I don't recall how it looked in the logs but it might be useful to post a snippet so others know what to look for. Are you saying that elongating the timeout timer causes the problem to occur?

Thanks,
Greg

On Dec 7, 2012, at 4:44 AM, D. Scott Miller wrote:

I assume that you changed the timeout to 10000 as Greg suggested.  If you can, please be on the lookout for messages received right at the end of the timeout threshold.  That was a symptom I noted...the longer I set the timeout, the longer it took for some responses to come back.  If you see the same behavior, that would suggest that there's a problem with OZW's event/wait mechanics.  If you don't see it, then I must have been dreaming!

Scott


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
utz  
View profile  
 More options Dec 7 2012, 8:41 pm
From: utz <utz.roe...@gmail.com>
Date: Fri, 7 Dec 2012 17:41:36 -0800 (PST)
Local: Fri, Dec 7 2012 8:41 pm
Subject: RE: [openzwave] Wait Timeout

I set it to 10000. I will look for the things you point out and post any findings here.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Alexie  
View profile  
 More options Dec 14 2012, 2:13 pm
From: Alexie <uale...@gmail.com>
Date: Fri, 14 Dec 2012 20:13:06 +0100
Local: Fri, Dec 14 2012 2:13 pm
Subject: Re: [openzwave] Wait Timeout

Hi Greg,

I was testing during start-up of the open-zwave (r585-588) and i noticed
that my sensors (sleeping devices) are always (tried to be) polled. This
results in time-outs (resend attempts), but that's not the major problem.
After those resend attempts, the NEXT poll requests towards none-sleeping
devices are going wrong - and results also in failed resend attempts.

Asfar i am aware off, (all or almost all?) sleeping devices have a WakeUp
commandclass, thus we should queue up the query of the device until it
wakes up.

I modified "cpp/src/command_classes/WakeUp.cpp" from:
m_awake( true ),
to:
m_awake( false ),

Then the initial value if a sleeping device is awake, will be "false". See
the attached OZW_Log.txt* files for the pre and post patch. All looks to be
working perfectly now (during start-up). Can this patch be applied in the
main code line?

Regards,
Alex

2012/12/8 utz <utz.roe...@gmail.com>

  OZW_Log.txt.wakeup-queue
223K Download

  OZW_Log.txt.retries
328K Download

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Greg Satz  
View profile  
 More options Jan 5, 11:37 pm
From: Greg Satz <s...@iranger.com>
Date: Sat, 5 Jan 2013 21:37:31 -0700
Local: Sat, Jan 5 2013 11:37 pm
Subject: Re: [openzwave] Wait Timeout

The latest version of the library has changes to address this thread, including the sleeping nodes behavior. The retransmit timer has been increased to 40 seconds. The No Ack (or any other failure return from Send Data) will be treated as a final conclusion to that particular transaction. A No Operation will be sent to nodes during discovery that will help determine if they are alive or asleep. This patch shouldn't be needed. It would be good to get some feedback on these changes. There are some other things that can be done to improve the reliability of the message handling of the library. However before continuing it would be nice to know how the current software is working.

Thanks,
Greg

On Dec 14, 2012, at 12:13 PM, Alexie wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages < Older 
« Back to Discussions « Newer topic     Older topic »