
error detection rate with crc-16 CCITT


Shane williams

Mar 27, 2011, 4:58:32 AM
Hi

We're using the 68302 micro with DDCMP serial protocol over two wire
RS485. According to the user manual, this uses CRC16-CCITT - X**16 +
X**12 + X**5 + 1.

Does anyone have any idea what the chance of getting an undetected
error is with this protocol? I know all single bit errors are
detected. Supposing we run a point to point connection at slightly
faster than it's really capable of and we get 10% of messages with
more than a single bit error. What percentage of these will go
undetected by the CRC check?

Suppose we run the connection at a "normal" baud rate with almost no
errors. What is the likelihood of getting undetected errors now?

Thanks for any help.

Rich Webb

Mar 27, 2011, 6:35:02 AM

The Wikipedia article on the "Mathematics of CRC" is short and a good
place to start. The paper it references
<http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf>
has the analysis you are looking for. Note (as mentioned in the
wikipedia article) that the paper's convention for representing the
polynomial differs from the usual method.

--
Rich Webb Norfolk, VA

Michael Karas

Mar 27, 2011, 6:53:26 AM
In article <13c95ff0-d9ca-4f0b-92a4-d21fe6c36c55
@j35g2000prb.googlegroups.com>, shane....@gmail.com says...


The CRC-16 will be able to detect errors in 99.9984 percent of cases.
This stems from a corrupted frame having only one chance in 2^16 of
matching the 16-bit check value.

65535 / 65536 = 0.999984, i.e. 99.9984 percent

See:
http://automationwiki.com/index.php?title=CRC-16-CCITT

for some implementation ideas.
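
A minimal bit-wise version in C, as a sketch (this assumes the usual
0x1021 polynomial; the initial value, 0xFFFF here and 0x0000 in some
"CCITT" variants, differs between implementations, so check what the
68302 hardware actually does):

#include <stdint.h>
#include <stddef.h>

/* CRC-16-CCITT, x^16 + x^12 + x^5 + 1, bytes processed MSB-first. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;        /* init value: implementation-dependent */

    while (len--) {
        crc ^= (uint16_t)(*data++) << 8;
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc <<= 1;
        }
    }
    return crc;
}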

-------------

Are you getting some of the errors in your transmission path
due to distortion of the RS485 waveform caused by non-equal propagation
delays through your logic on the "0"-->"1" transition versus the
one from "1"-->"0"? A common problem with certain optocouplers. ;-)


--

Michael Karas
Carousel Design Solutions
http://www.carousel-design.com

Shane williams

Mar 27, 2011, 7:39:15 AM
On Mar 27, 11:53 pm, Michael Karas <mka...@carousel-design.com> wrote:
> In article <13c95ff0-d9ca-4f0b-92a4-d21fe6c36c55
> @j35g2000prb.googlegroups.com>, shane.2471...@gmail.com says...

Thanks. I'm trying to figure out whether it's possible/viable to
dynamically determine the fastest baud rate we can use by checking the
error rate. The cable lengths and types of wire used when our systems
are installed vary, and I was hoping we could automatically work out
what speed a particular connection can run at. The spec for the
MOC5007 Optocoupler seems a bit vague so I was trying to find a better
one.


Vladimir Vassilevsky

Mar 27, 2011, 10:21:18 AM

Shane williams wrote:


> Thanks. I'm trying to figure out whether it's possible/viable to
> dynamically determine the fastest baud rate we can use by checking the
> error rate.

Yes. But:

1) It is easier, faster and more reliable to evaluate the channel by
transmitting a known pseudo-random test pattern rather than the actual data.

2) If the baud rate is changed dynamically, how would the receivers know
the baud rate of the transmitters?

3) Since the system is intended to be operable even at the lowest baud,
why not always use the lowest baud?


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

Rafael Deliano

Mar 27, 2011, 10:46:09 AM
> I'm trying to figure out whether it's possible/viable to
> dynamically determine the fastest baud rate we can use by checking the
> error rate.

Packet length for a 16-bit CRC should be limited to 4 kbytes.
The CRC doesn't know at which baud rate the packets are coming.
Your assumption (which may well be true) is that the error pattern
shifts from single-bit errors to bursts and more errors will go undetected.
But the detected error rate would go way up too. By counting
retransmissions now and later at the higher baud rate one could
easily see if that has happened and switch to a 24-bit or 32-bit CRC.
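
A sketch of that bookkeeping, with the names and the scaling invented
here:

#include <stdint.h>

/* Per-link counters, reset each time the baud rate changes. */
struct link_stats {
    uint32_t frames_sent;
    uint32_t retransmits;
};

/* Retransmissions per 10000 frames sent (0 if nothing sent yet). */
uint32_t retx_per_10k(const struct link_stats *s)
{
    return s->frames_sent ? (s->retransmits * 10000u) / s->frames_sent : 0;
}

/* If the figure measured at the higher baud rate is far above the
   baseline taken at the lower rate, the error pattern has shifted
   toward bursts and a longer (24/32 bit) CRC is worth considering. */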

MfG JRD


Tim Wescott

Mar 27, 2011, 1:44:00 PM
On 03/27/2011 03:53 AM, Michael Karas wrote:
> In article<13c95ff0-d9ca-4f0b-92a4-d21fe6c36c55
> @j35g2000prb.googlegroups.com>, shane....@gmail.com says...
>>
>> Hi
>>
>> We're using the 68302 micro with DDCMP serial protocol over two wire
>> RS485. According to the user manual, this uses CRC16-CCITT - X**16 +
>> X**12 + X**5 + 1.
>>
>> Does anyone have any idea what the chance of getting an undetected
>> error is with this protocol? I know all single bit errors are
>> detected. Supposing we run a point to point connection at slightly
>> faster than it's really capable of and we get 10% of messages with
>> more than a single bit error. What percentage of these will go
>> undetected by the CRC check?
>>
>> Suppose we run the connection at a "normal" baud rate with almost no
>> errors. What is the likelihood of getting undetected errors now?
>>
>> Thanks for any help.
>
>
> The CRC-16 will be able to detect errors in 99.9984 percent of cases.
> This stems from a corrupted frame having only one chance in 2^16 of
> matching the 16-bit check value.
>
> 65535 / 65536 = 0.999984, i.e. 99.9984 percent

It isn't that simple. CRC-16 will be able to detect _all_ 1, 2 and 3
bit errors, and some 4-bit errors. How many 'cases' of four bit errors
in a message depends on the message length and your error rate, so right
there your fixed percentage of errors detected goes right out the window.

Read the article cited by Rich Webb.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" was written for you.
See details at http://www.wescottdesign.com/actfes/actfes.html

Tim Wescott

Mar 27, 2011, 1:47:08 PM

If you creep up on things, looking for one or two bit errors per packet
and backing off, then you should do OK. I'm with Vladimir, however,
that if you can you should consider just sending pseudo-random
sequences. Error counting with those is easy-peasy, and if you know
it's coming down the pike you don't have to worry about corrupting data
that you depend on.

Tim Wescott

Mar 27, 2011, 1:51:29 PM
On 03/27/2011 07:21 AM, Vladimir Vassilevsky wrote:
>
>
> Shane williams wrote:
>
>
>> Thanks. I'm trying to figure out whether it's possible/viable to
>> dynamically determine the fastest baud rate we can use by checking the
>> error rate.
>
> Yes. But:
>
> 1) It is easier, faster and more reliable to evaluate the channel by
> transmitting a known pseudo-random test pattern rather than the actual
> data.

I've done this -- and it is.

> 2) If the baud rate is changed dynamically, how would the receivers know
> the baud rate of the transmitters?

There's ways. Any good embedded programmer should be able to figure out
half a dozen before they even put pen to napkin.

> 3) Since the system is intended to be operable even at the lowest baud,
> why not always use the lowest baud?

If it's like ones that I've worked with, the data over the link is a
combination of high-priority "gotta haves" like operational data, and
lower-priority "dang this would be nice" things like diagnostics, faster
status updates, and that sort of thing.

So the advantages of going up in speed are obvious. For that matter,
there may be advantages to being able to tell a maintenance guy what
not-quite-fast-enough speed can be achieved, so he can make an informed
choice about what faults to look for.

Jim Stewart

Mar 27, 2011, 2:36:18 PM
Tim Wescott wrote:

> It isn't that simple. CRC-16 will be able to detect _all_ 1, 2 and 3 bit
> errors, and some 4-bit errors.

I've often wondered about that statement. Suppose
you get a 1-bit error in the message and an error
in the CRC remainder that results in a "good" message?

Is there an implicit guarantee in the algorithm
that it will take more than 3 bits to "fix" the
remainder?

My apologies if this is covered in the Webb article,
running late today and don't have time to read it.

D Yuniskis

Mar 27, 2011, 3:29:50 PM
Hi Shane,

On 3/27/2011 4:39 AM, Shane williams wrote:
> On Mar 27, 11:53 pm, Michael Karas<mka...@carousel-design.com> wrote:

[8<]

>> Are you getting some of the errors in your transmission path
>> due to distortion of the RS485 waveform due to non-equal propagation
>> delays through your logic on the "0"-->"1" transition versus the
>> one from "1"-->"0"? Common problem with certain optocouplers. ;-)

And some devices degrade with age.

> Thanks. I'm trying to figure out whether it's possible/viable to
> dynamically determine the fastest baud rate we can use by checking the
> error rate. The cable lengths and types of wire used when our systems
> are installed varies and I was hoping we could automatically work out
> what speed a particular connection can run at. The spec for the
> MOC5007 Optocoupler seems a bit vague so I was trying to find a better
> one.

<frown> You might, instead, want to think of this from the
"engineering" standpoint -- what are the likely/expected
*sources* of your errors? I.e., how is the channel typically [1]
going to be corrupted.

First, think of the medium by itself. With a given type of
cable (including "crap" that someone might fabricate on-the-spot),
how will your system likely behave (waveform distortions,
sampling skew in the receiver, component aging, etc.).

Then, think of the likely noise sources that might interfere
with your signal. Is there some synchronous source nearby that
will periodically be bouncing your grounds or coupling directly
to your signals (i.e., will your cable be routed alongside
something noisy)? [this assumes you have identified any
sources of "noise" that your system imposes on *itself*! e.g.,
each time *you* command the VFD to engage the 10HP motor you
might notice glitches in your data...]

Then, think of what aperiodic/transient/"random" disturbances
are likely to be encountered in your environment.

In each case, think of the impact on the data stream AT ALL
THE DATA RATES YOU *MIGHT* BE LIKELY TO HAVE IN USE. Are
you likely to see lots of dispersed single bit errors? How
far apart (temporally) are they likely to be (far enough
that two different code words can cover them?) Or, will
you encounter a burst of consecutive errors? (if so, how
wide?)

Finally, regarding your hinted algorithm: note that the
time constant you use in determining when/if to change rates
has to take into consideration these observations on the likely
environment. E.g., if errors are likely to creep in "slowly"
(beginning with low probability, low error rate), then you
can "notice" the errors and start anticipating more (?) and
back off on your data rate -- hopefully, quick enough that the
error rate doesn't grow to exceed your *continued* ability
for your CRC to remain effective.

OTOH, if the error rate ever "grows" (instantaneously) faster
than your CRC is able to detect the increased error rate,
you run the risk of accepting bad data "as good". And, sitting
"fat, happy and glorious" all the while you are doing so!
(i.e., sort of like a PLL locking on a harmonic outside the
intended capture range).

Can you, instead, figure out how to *ensure* a reliable channel?

--------------------
[1] and *atypically*!

Shane williams

Mar 27, 2011, 6:01:53 PM

Didn't think about that.

You're exactly right about the need for speed. Background data is
fine at the slower rate but when an operator is doing something on the
system we want the response to be faster than the slowest rate gives
us.

Switching rates seems fairly easy to me. One end tells the other what
rate they're switching to, the other acknowledges, if no ack then
retry a couple of times. If one end switches and the other doesn't,
after one second or so of no communication, they both switch back to
the slowest rate.
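
As a rough sketch in C (all the message and driver primitives here are
hypothetical, just to show the shape of the handshake):

/* Propose a new rate: ask, wait for an ack, retry a couple of times. */
int propose_rate(long new_baud)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        send_switch_request(new_baud);        /* hypothetical primitive */
        if (wait_for_ack(ACK_TIMEOUT_MS)) {   /* hypothetical primitive */
            set_uart_baud(new_baud);
            return 1;                         /* both ends switched */
        }
    }
    return 0;                                 /* no ack: stay put */
}

/* If either end then hears nothing for about a second, both fall
   back to the slowest rate and resynchronize there. */
void on_silence_timeout(void)
{
    set_uart_baud(LOWEST_BAUD);
}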

Shane williams

Mar 27, 2011, 6:31:34 PM

Interesting points, thanks. The environment can be just about
anything. I suspect we'll back off the baud rate fairly quickly once
errors start occurring. I'm also thinking we could raise the security
for some of the critical messages, like double transmissions perhaps.


Shane williams

Mar 27, 2011, 6:34:53 PM
On Mar 28, 3:46 am, Rafael Deliano <Rafael_DelianoENTFER...@t-

Packet length is max 270 bytes / 2700 bits or so but critical messages
are more like about 50 bytes / 500 bits.

Tim Wescott

Mar 27, 2011, 6:49:22 PM

One bit error in the message and one in the CRC counts as two bit
errors. It's the number of bit errors in _both_ the CRC _and_ the
message that you need to count.

D Yuniskis

Mar 27, 2011, 7:22:15 PM
Hi Shane,

On 3/27/2011 3:31 PM, Shane williams wrote:

> Interesting points, thanks. The environment can be just about
> anything. I suspect we'll back off the baud rate fairly quickly once
> errors start occurring. I'm also thinking we could raise the security
> for some of the critical messages, like double transmissions perhaps.

Consider carefully what sort of "encoding" you use. E.g.,
"double transmissions" might add lots of overhead for very
little gain in "reliability".

You can [1] also consider dynamically varying the data rate in
a TDM sort of scheme -- so, in this timeslot, you run at a slow,
reliable rate transfering critical messages; then, in this other
timeslot, you run "flat out" pushing data that would be "nice to
have" but not critical to proper operation.

Again, you really need to look hard at what you are likely to
encounter "in the field" before you can come to any expectations
regarding likely performance. I've seen (and have been guilty,
myself!) some pretty mangled patches to deployed systems "just
to get by until the FedEx replacement parts delivery arrives".
If you *might* be running on the bleeding edge in some configuration,
the last thing you want is a guy in the field to *think* things
are OK when, in fact, they are not.

[e.g., you might want to add a switch that forces communications
to stay in the "degraded/secure" mode if you suspect you are not
catching all the communication errors in a particular installation...
because the tech made a cable out of "bell wire"]

----------------------------

[1] Depends on what is on the other end of the link, of course.
But, if you can autobaud dynamically, then that suggests you have
some control over both ends of the link!

Paul

Mar 27, 2011, 7:22:59 PM
In article <14a46afd-a5a4-4d6b-be24-de552c289027
@l14g2000pre.googlegroups.com>, shane....@gmail.com says...
> Subject: Re: error detection rate with crc-16 CCITT
> Date: Sun, 27 Mar 2011 15:01:53 -0700 (PDT)
> From: Shane williams <shane....@gmail.com>
> Newsgroups: comp.arch.embedded

Have you thought about simple heartbeat loopback data packets?

If you get to the situation where too many error bits cannot be detected,
how will you know everything is alright?

Every once in a while send a small varying pseudo-random data packet at
highest speed to various nodes, which will just echo the packet back if
decoded correctly. Once received check every bit is correct.

This way you are less likely to have false-positives about data being
correct when it is not.

You can change speeds and retry on failures. If you don't see an echo
back, you have more problems to resolve.

Sending larger data packets at higher speeds helps to thoroughly check
data integrity, with more chance of exercising the data switching
frequencies that may or may not be affected.
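
A cheap way to generate the varying pseudo-random payload is a
maximal-length LFSR; here is a sketch (16-bit Galois LFSR, taps
0xB400). Both ends use the same seed, so the receiver can regenerate
the sequence and compare the echo byte-for-byte:

#include <stdint.h>

static uint16_t lfsr = 0xACE1u;   /* any nonzero seed, shared by both ends */

uint8_t prbs_next_byte(void)
{
    uint8_t out = 0;

    for (int i = 0; i < 8; i++) {
        unsigned lsb = lfsr & 1u;

        lfsr >>= 1;
        if (lsb)
            lfsr ^= 0xB400u;      /* taps for a maximal-length sequence */
        out = (uint8_t)((out << 1) | lsb);
    }
    return out;
}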

--
Paul Carpenter | pa...@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate

Shane williams

Mar 27, 2011, 8:36:45 PM

Yep, it's the same device at both ends.

Regarding double transmissions, what do you mean by "encoding"? We
could complement all bits in the second transmission I guess.

TDM might not be viable and probably too much hassle I suspect. The
baud rate behavior will be user configurable with probably a system
wide switch to allow the faster baud rate.

Thanks

Shane williams

Mar 27, 2011, 8:41:41 PM
On Mar 28, 12:22 pm, Paul <p...@pcserviceselectronics.co.uk> wrote:
> In article <14a46afd-a5a4-4d6b-be24-de552c289027
> @l14g2000pre.googlegroups.com>, shane.2471...@gmail.com says...

>
>
> Have you thought about simple heartbeat loopback data packets?
>
> If you get to the situation where too many error bits cannot be detected,
> how will you know everything is alright?
>
> Every once in a while send a small varying pseudo-random data packet at
> highest speed to various nodes, which will just echo the packet back if
> decoded correctly. Once received check every bit is correct.
>
> This way you are less likely to have false-positives about data being
> correct when it is not.
>
> You can change speeds and retry on failures. If you don't see an echo
> back, you have more problems to resolve.
>
> Sending larger data packets at higher speeds helps to thoroughly check
> data integrity, with more chance of exercising the data switching
> frequencies that may or may not be affected.

Thanks for the idea about loop-back data packets. That sounds useful.

The system is a ring of devices with each connection point to point
with one device at each end.

Rich Webb

Mar 27, 2011, 9:22:00 PM
On Sun, 27 Mar 2011 17:36:45 -0700 (PDT), Shane williams
<shane....@gmail.com> wrote:

<snippety snip>


>Regarding double transmissions, what do you mean by "encoding"? We
>could complement all bits in the second transmission I guess.

One approach that I've used in the past is to require an ack/nak for
each message sent. If the ack includes the CRC portion of the message
that's being acknowledged, then a simple match by the originator against
the CRC that it sent gives pretty good confidence that the receiver got
a correct message.

The returned CRC is, of course, part of the message body that the remote
unit sends which is in its turn used to build that message's CRC.
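
As a sketch (field names invented here), the ack simply carries the
CRC of the frame it acknowledges:

#include <stdint.h>

struct ack_msg {
    uint8_t  type;        /* ACK or NAK */
    uint16_t echoed_crc;  /* CRC of the frame being acknowledged */
    uint16_t crc;         /* CRC protecting this ack itself */
};

/* The originator keeps the CRC it transmitted and compares on ack. */
int ack_confirms(const struct ack_msg *ack, uint16_t sent_crc)
{
    return ack->echoed_crc == sent_crc;
}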

D Yuniskis

Mar 27, 2011, 10:15:51 PM

You are sending 2*n bits to encode n bits of data.
Yet, that encoding will only *detect* a single bit
error. Won't *correct* ANY errors. Won't *see*
(certain) two bit errors. etc.

I.e., your choice of message encoding has lots of
overhead (twice as many bits!) but doesn't give you
a corresponding increase in "reliability".

Without understanding what sorts of errors you are likely
to encounter, it is hard to design a protocol and encoding
scheme that will be resilient to *those* errors.

> TDM might not be viable and probably too much hassle I suspect. The
> baud rate behavior will be user configurable with probably a system
> wide switch to allow the faster baud rate.

You can also opt to run at the slower (more reliable) rate
ALL THE TIME and encode command messages more robustly than
"less important messages". I.e., so command messages have
greater Hamming distances (require more bandwidth per bit,
so to speak) while less important messages are *compressed*
so there is more "data" per bit -- and less protection against
corrupted transmission. As such, the compressed data appears
to have a higher bandwidth -- at reduced reliability -- even
though it is being sent over the same "bit rate" channel.

D Yuniskis

Mar 28, 2011, 1:23:49 AM
On 3/27/2011 5:41 PM, Shane williams wrote:

> The system is a ring of devices with each connection point to point
> with one device at each end.

Do you *literally* mean a ring topology? I.e., (excuse the
crappy ASCII art)


AAAA ----> BBBB ----> CCCC ----> DDDD
AAAA BBBB CCCC DDDD
AAAA <----------<----------<---- DDDD

So, for A to send to D, B and C act as intermediaries?

Now, hold that thought...

How does C send to A? I.e., is the "bottom" connection
simply a pass-thru connection from the downstream node?
Or, is it an active connection (like a second comm channel)?
Asked another way, can C send to A *without* going through
D (i.e., by going through B, instead)?

Regardless... consider that if you twiddle with the baud rate
on any link, you will either need to make sure *all* links
"simultaneously" update their baud-rates (taking into
consideration any packets "in the pipe")

-- or --

you have to provide an elastic store in each node and some
smarts to decide what data that node can *drop* (since its
outbound connection may not? be at the same rate as its
inbound connection)

[this last bit applies iff there is a real second channel
in each node like:

AAAA ----> BBBB ----> CCCC ----> DDDD
AAAA BBBB CCCC DDDD
AAAA <---- BBBB <---- CCCC <---- DDDD

robert...@yahoo.com

Mar 28, 2011, 1:54:59 AM


Use a proper forward error correction scheme. You'll be able to
monitor the increase in error rate while still getting most packets
through. A Reed-Solomon code will allow you to (for example) add 20
bytes to a 235 byte message and correct any 10 bad bytes (and
detect all bad messages with no more than 19 bad bytes). If you're
getting a bit corrected every few dozen packets, it's probably safe to
bump up the data rate. If it's a couple dozen bits in every packet,
it's time to back off. In fact, this can substantially increase your
effective data rate, as you can continue to run in the presence of a
moderate number of errors (disk drives, for instance, run well into
that region, and it's relatively rare these days that *any* sector
actually reads "clean," and a very heavy duty ECC code is used to
compensate).

You can also improve things by using a multi level scheme, which could
be a simple duplication (think disk RAID-1), or some combined code
over multiple packets (simply parity like RAID-5, or Reed-Solomon-ish
like RAID-6), which would provide added recovery, at the expense of
added latency (mainly in the presence of errors). Since you mentioned
that you have at least two classes of data (critical and nice to
have), apply the second level of FEC to just the critical data (after
protecting each packet with an appropriate RS code), and even with a
substantial spike in error rate, you're likely to get the critical
stuff through.
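
The distance arithmetic behind figures like those, as a quick check
(this is just the standard Reed-Solomon bound, nothing specific to any
particular implementation):

#include <stdio.h>

int main(void)
{
    int n = 255, k = 235;        /* 235 data bytes + 20 check bytes */
    int d = n - k + 1;           /* minimum distance: 21 */

    printf("correct-only: t = %d symbols\n", (d - 1) / 2);  /* 10 */
    printf("detect-only:  e = %d symbols\n", d - 1);        /* 20 */
    /* any split of t corrected plus e further detected with
       t + e <= d - 1 is also achievable */
    return 0;
}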

Shane williams

Mar 28, 2011, 6:28:58 AM

It's physically a 2 wire half duplex ring with messages going in both
directions around the ring to provide redundancy. Say 8 nodes 1 to
8. Node 1 talks to nodes 2 and 8, node 2 talks to nodes 1 and 3 etc.

However we may end up with 3 ports per node making it a collection of
rings or a mesh. The loading at the slowest baud rate is approx 10%
for 64 nodes. If we decide to allow mixed baud rates, each node will
have the ability to tell its adjacent nodes to slow down when its
message queue gets to a certain level, allowing it to cope with a
brief surge in messages. Also to help the propagation delay, we might
split long messages to a max of 50 bytes or so.

ChrisQ

Mar 28, 2011, 7:04:58 AM
Shane williams wrote:

> It's physically a 2 wire half duplex ring with messages going in both
> directions around the ring to provide redundancy. Say 8 nodes 1 to
> 8. Node 1 talks to nodes 2 and 8, node 2 talks to nodes 1 and 3 etc.
>
> However we may end up with 3 ports per node making it a collection of
> rings or a mesh. The loading at the slowest baud rate is approx 10%
> for 64 nodes. If we decide to allow mixed baud rates, each node will
> have the ability to tell its adjacent nodes to slow down when its
> message queue gets to a certain level, allowing it to cope with a
> brief surge in messages. Also to help the propagation delay, we might
> split long messages to a max of 50 bytes or so.
>

I think ddcmp dates to the mid 70's and was originally designed by
digital / dec for their decnet network, then updated later for ethernet.
Fwir, it is a connection oriented protocol implemented as a multilayer
stack, that provided reliable comms between nodes. It had error
detection, retries etc much as tcp/ip does. It's a long time since I
used decnet, but I know that there are ddcmp protocol specs and other
docs out there which describe the whole stack. There is, I think, even a
linux decnet protocol driver which might be a useful bit of code to
look at, even if the complete stack is too much for the application...

Regards,

Chris

Vladimir Vassilevsky

Mar 28, 2011, 11:04:29 AM

Some people are just looking to find trouble for their ass. Perhaps,
they are masochists; they like to be fucked. Good luck with that; there
are almost limitless possibilities for the protocol malfunctioning.

VLV


Simon Clubley

Mar 28, 2011, 1:17:47 PM
On 2011-03-28, ChrisQ <me...@devnull.com> wrote:
>
> I think ddcmp dates to the mid 70's and was originally designed by
> digital / dec for their decnet network, then updated later for ethernet.

Yes, it was before the VAX days. (VMS is a part of my day job, so I am
familiar with DEC history.)

> Fwir, it is a connection oriented protocol implemented as a multilayer
> stack, that provided reliable comms between nodes. It had error
> detection, retries etc much as tcp/ip does. It's a long time since I
> used decnet, but I know that there are ddcmp protocol specs and other
> docs out there which describe the whole stack. There is, I think, even a
> linux decnet protocol driver which might be a usefull bit of code to
> look at, even if the complete stack is too much for the application...
>

The Phase IV documents can be found at:

http://linux-decnet.sourceforge.net/docs/doc_index.html

I don't know what the current status of the DECnet code in Linux is however
as I never use it.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world

ChrisQ

Mar 28, 2011, 4:51:41 PM
Simon Clubley wrote:

>
> The Phase IV documents can be found at:
>
> http://linux-decnet.sourceforge.net/docs/doc_index.html
>
> I don't know what the current status of the DECnet code in Linux is however
> as I never use it.
>
> Simon.
>

A dec document describing the low level protocol, crc, retries and
states etc can be found at:

http://decnet.ipv7.net/docs/dundas/aa-d599a-tc.pdf

I had a previous life working with dec kit and thought I recognised the
name, perhaps from the vms group, but were you by any chance a
contractor in the mid to late 80's ?...

Regards,

Chris

Shane williams

Mar 28, 2011, 6:48:53 PM
On Mar 28, 6:54 pm, "robertwess...@yahoo.com"

Thanks. Error correction sounds like it would be too CPU intensive.
I'd be happy just to detect errors.

Do you have any idea how many bytes we would have to add to a 60 byte
message to detect 19 bad bytes or less and how CPU intensive it is?

Shane williams

Mar 28, 2011, 7:19:26 PM

Can you describe just one possibility?


Simon Clubley

Mar 28, 2011, 7:25:20 PM
On 2011-03-28, ChrisQ <me...@devnull.com> wrote:
> Simon Clubley wrote:
>>
>> The Phase IV documents can be found at:
>>
>> http://linux-decnet.sourceforge.net/docs/doc_index.html
>>
>> I don't know what the current status of the DECnet code in Linux is however
>> as I never use it.
>>
>
> A dec document describing the low level protocol, crc, retries and
> states etc can be found at:
>
> http://decnet.ipv7.net/docs/dundas/aa-d599a-tc.pdf
>

I'll have a read through it thanks; it's been a long time since I really
did anything with DECnet Phase IV.

> I had a previous life working with dec kit and thought I recognised the
> name, perhaps from the vms group, but were you by any chance a
> contractor in the mid to late 80's ?...
>

No, but late 80s/early 90s was the start of my career and I was writing
code for the PDP-11 before moving onto VAX then Alpha and taking in a
range of other environments along the way.

It's quite possible you ran across me as part of that, especially if
you attended the annual DECUS conferences.

robert...@yahoo.com

Mar 28, 2011, 11:54:28 PM


To detect (but not correct) all errors of 152 (19*8) bits or fewer, you'd
have to add at least 152 bits of check code. If you're only looking
to detect errors occurring in no more than 19 bytes of the message, it
would be a bit less, but not hugely so. Remember that to detect n
bits of error, the block has to be different enough from any other
valid block that errors in n bits do not make it look like a different
valid block.

If you're asking about a RS code as I described above, the short
message really doesn't buy you anything, since you need about twice
the number of bits worth of RS symbols as the number of error bits you
hope to correct.

RS is moderately computationally intensive, but that clearly depends
on your data rates, and the hardware you're running on. In fact it
has a worse reputation than it really deserves. But to toss some
numbers out there, a decent implementation in C, on a 1GHz x86, for a
RS(255, 239) encoding (239 bytes of data, plus 16 bytes of check code,
or a bit weaker than what was discussed above – that’s a commonly used
code in broadcasting, so is well studied and you should be able to
find plenty of benchmarks and samples and whatnot), should come in at
100-200Mb/s for encoding (or 10K-20K cycles per block), about half
that for decoding blocks without errors, and about a fifth the
encoding rate for decoding blocks with the maximum correctable amounts
of error. Shorter blocks require less work to process, but it's sub-
linear, so your net data rate for a fixed CPU load will go down as
block size decreases. And note that 255 bytes is the longest possible
block for RS with 8 bit symbols.

On something like an ARM 9, quadruple the cycle counts.

Vladimir Vassilevsky

Mar 29, 2011, 1:03:01 AM

Shane williams wrote:

For starters: for the efficient operation, the transmit and the receive
chains should be buffered. In order to change the rate, you have to make
sure the buffers are flushed. If you are planning on switching the rate
back and forth, that would incur a significant penalty in efficiency.

VLV

Shane williams

Mar 29, 2011, 1:49:57 AM
On Mar 29, 6:03 pm, Vladimir Vassilevsky <nos...@nowhere.com> wrote:
>
> For starters: for the efficient operation, the transmit and the receive
> chains should be buffered. In order to change the rate, you have to make
> sure the buffers are flushed. If you are planning switching the rate
> back and forth, that would incur significant penalty in efficiency.

The hardware handles the sending of a whole message at a time. The
software gives the hardware a whole message to send and gets told when
a whole message has been received. This is done by an interrupt
routine. The interrupt routine will decide when to switch baud rates
or check when the other end is asking to switch so the only penalty is
a couple of extra messages and a short delay if the switch works. If
the switch doesn't work there's a slightly bigger penalty but we won't
be switching often enough for it to matter.

upsid...@downunder.com

Mar 29, 2011, 4:43:18 AM


In the CAN environment (at least when using some sensible controllers
like SJA1000 listen only mode) autobauding is trivial.

Adding some Modbus RTU slaves to some existing RS-485 Modbus network
only requires listening for the traffic for a second or two.

upsid...@downunder.com

Mar 29, 2011, 6:21:01 AM
On Tue, 29 Mar 2011 00:03:01 -0500, Vladimir Vassilevsky
<nos...@nowhere.com> wrote:

>For starters: for the efficient operation, the transmit and the receive
>chains should be buffered.

Most industrial protocols (like Modbus) are simple half duplex
request/response systems.

At any significant line rates, the throughput is severely limited by
the line turn-around delays at both ends.

The additional delay (due to autobauding) at new device insertion
should not be a significant issue.

Philip Koopman

Mar 29, 2011, 8:16:33 AM
Shane williams <shane....@gmail.com> wrote:

>Packet length is max 270 bytes / 2700 bits or so but critical messages
>are more like about 50 bytes / 500 bits.

As someone else has previously noted you can get CRC performance data
from this paper:
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

You are interested in CRC CCITT-16 x^16 + x^12 + x^5 + 1
At 2700 or fewer bits you get Hamming Distance of 4, which means it detects
all 1, 2, & 3 bit errors but not all 4 bit errors. Because of the
factorization of this polynomial it also detects all odd numbers of bit
errors, but at the price that all even numbers of bit errors are twice
as likely to escape detection (roughly 1 - 2^-15 probability of
detection). It also detects any burst error that corrupts any
combination of 16 or fewer bits in a row, assuming you get endian-ness
and order of the CRC right.

At 500 bits the properties are pretty much the same.

To clarify an earlier discussion point, the number of errors you are
guaranteed to detect depends on the polynomial and the message length.
So you can't just say any particular polynomial detects all x-number of
bit errors without giving a maximum length. In the case of CCITT-16
you get this performance (HD=4) up to 32751 data bits (not counting CRC
bits) and after that it only detects odd numbers of bit errors (2-bit
errors will be undetected if they are more than about 32 K bits apart).
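
To put rough numbers on that, here is a back-of-envelope estimate in C
(a crude sketch, not from the paper: it assumes independent bit errors
at rate p, takes 4-bit patterns as the dominant undetected case at
HD=4, and uses the 2^-15 figure above as the undetectable fraction;
the paper's weight distributions give the exact counts):

#include <math.h>
#include <stdio.h>

/* Estimated probability that an n_bits frame arrives with an
   undetected error, given bit error rate ber. */
double p_undetected_est(double ber, int n_bits)
{
    /* C(n,4): number of possible 4-bit error patterns */
    double c4 = (double)n_bits * (n_bits - 1) * (n_bits - 2)
              * (n_bits - 3) / 24.0;
    double p4 = c4 * pow(ber, 4.0) * pow(1.0 - ber, n_bits - 4);

    return p4 * pow(2.0, -15.0);  /* crude undetectable fraction */
}

int main(void)
{
    printf("%g\n", p_undetected_est(1e-4, 2700));  /* a 270-byte frame */
    return 0;
}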

-- Phil Koopman
http://betterembsw.blogspot.com/

Phil Koopman -- koo...@cmu.edu -- http://www.ece.cmu.edu/~koopman

Vladimir Vassilevsky

Mar 29, 2011, 9:15:17 AM

upsid...@downunder.com wrote:


> In the CAN environment (at least when using some sensible controllers
> like SJA1000 listen only mode) autobauding is trivial.

The whole point of using CAN is the hardware arbitration and collision
avoidance of the bus. This won't work with autobauding.


Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com


Vladimir Vassilevsky

Mar 29, 2011, 9:49:21 AM

Shane williams wrote:

* Having protocol logic in the Tx/Rx ISRs is a bad idea already.
* How would a hub-type device work?
* What if a node somehow missed the correct baud rate, receiving garbage
and responding to it?
* How would you verify, troubleshoot and prove the operation?

VLV

upsid...@downunder.com

Mar 29, 2011, 10:09:08 AM
On Tue, 29 Mar 2011 08:15:17 -0500, Vladimir Vassilevsky
<nos...@nowhere.com> wrote:

>
>
>upsid...@downunder.com wrote:
>
>
>> In the CAN environment (at least when using some sensible controllers
>> like SJA1000 listen only mode) autobauding is trivial.
>
>The whole point of using CAN is the hardware arbitration and collision
>avoidance of the bus. This won't work with autobauding.

CAN bus standard does not include any autobauding feature. However,
SJA1000 style controllers can be put into listen mode (in which case
they do not disrupt the network traffic if a CRC error is detected).

A new autobauding CAN slave can listen for the bus traffic at various
speeds, until it receives messages without CRC errors and hence
concludes that the correct speed has been found.

In any serial line protocol with CRC, this can be used by slaves for
autobaud detection.
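
A sketch of that scan loop (the rate table and driver calls here are
made up):

/* Listen-only autobaud: step through candidate rates until a few
   consecutive frames pass the CRC check. */
static const long rates[] = { 9600, 19200, 38400, 57600, 115200 };

long autobaud_scan(void)
{
    for (;;) {
        for (unsigned i = 0; i < sizeof rates / sizeof rates[0]; i++) {
            int good = 0;

            set_rx_baud(rates[i]);               /* hypothetical */
            for (int f = 0; f < 3; f++)
                if (rx_frame_crc_ok(LISTEN_MS))  /* hypothetical */
                    good++;
            if (good == 3)
                return rates[i];                 /* locked on */
        }
    }
}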

ChrisQ

Mar 29, 2011, 10:20:46 AM
Vladimir Vassilevsky wrote:
>

>
> * Having protocol logic in the Tx/Rx ISRs is a bad idea already.

Absolutely. Lower level comms drivers should always be transparent to
data. You build protocol layers on top of that.

Maybe I'm missing something, but I don't understand what all the fuss is
about in this thread. All this kind of thing has been done to death in
the past. It would help the op to have a look at one of the original
ddcmp protocol specs to see how it should be done, with message flow,
state transitions etc. Why keep on reinventing the wheel ?...

Regards,

Chris

upsid...@downunder.com

Mar 29, 2011, 10:25:07 AM

From the system point of view, including the end of frame detection
(and hence also autobauding) in the ISR makes a lot of sense, since it
reduces the number of times the RT task needs to be rescheduled.

ChrisQ

Mar 29, 2011, 10:28:56 AM
Simon Clubley wrote:

>
> No, but late 80s/early 90s was the start of my career and I was writing
> code for the PDP-11 before moving onto VAX then Alpha and taking in a
> range of other environments along the way.
>
> It's quite possible you ran across me as part of that, especially if
> you attended the annual DECUS conferences.
>
> Simon.
>

I just thought the name sounded familiar. I too spent several years
doing systems engineering, programming macro and C on pdp and vax. Never
attended decus meetings, but was a member and still have some tapes. Worked
at dec park, racal, smiths and others during the good old 80's...

Regards,

Chris

ChrisQ

Mar 29, 2011, 11:01:28 AM
upsid...@downunder.com wrote:

>
> From the system point of view, including the end of frame detection
> (and hence also autobauding) in the ISR makes a lot of sense, since it
> reduces the number of times the RT tasks needs to be rescheduled.
>

There are arguments for and against and there are tradeoffs, different
if, for example, you are running a state driven loop, rather than an
rtos. One way to get round the problem is to have an incoming fifo big
enough to handle data between polling, then a polling function within
the fifo module that takes two args, start of frame and end of frame.
The function just scans the fifo from time to time, skipping over
duplicate sof until a complete frame is recognised, which is then passed
to the deframer / protocol handler. The interrupt handler never needs to
be disabled and in the unlikely event of fifo full, the data just wraps
round, deleting the least recent byte. Acks, nacks, timeouts and retries
at the next level up then keep the link reliable. You can drive this
sort of thing in many ways, for example, from a timer callback. This is
more or less how it's been done for decades...
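
Roughly, the polled scan looks like this (the FIFO primitive and names
are invented; sof/eof are the frame delimiters):

#include <stdint.h>

/* Scan the rx FIFO for a complete sof ... eof frame, restarting on
   duplicate sofs. Returns the frame length, or 0 if incomplete. */
int fifo_scan_frame(uint8_t sof, uint8_t eof, uint8_t *buf, int max)
{
    static int n, in_frame;            /* persists across polls */
    int c;

    while ((c = fifo_get()) >= 0) {    /* hypothetical: -1 if empty */
        if (c == sof) {                /* (re)start on every sof */
            in_frame = 1;
            n = 0;
            buf[n++] = (uint8_t)c;
        } else if (in_frame) {
            if (n < max)
                buf[n++] = (uint8_t)c;
            if (c == eof) {
                in_frame = 0;
                return n;              /* hand to the deframer */
            }
        }
    }
    return 0;                          /* keep polling */
}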

Regards,

Chris

D Yuniskis

Mar 29, 2011, 11:09:43 AM
Hi Shane,

On 3/28/2011 3:28 AM, Shane williams wrote:
> On Mar 28, 6:23 pm, D Yuniskis<not.going.to...@seen.com> wrote:
>> Regardless... consider that if you twiddle with the baud rate
>> on any link, you will either need to make sure *all* links
>> "simultaneously" update their baud-rates (taking into
>> consideration any packets "in the pipe")
>>
>> -- or --
>>
>> you have to provide an elastic store in each node and some
>> smarts to decide what data that node can *drop* (since it's
>> outbound connection may not? be at the same rate as it's
>> inbound connection)
>>
>> [this last bit applies iff there is a real second channel
>> in each node like:
>>
>> AAAA ----> BBBB ----> CCCC ----> DDDD
>> AAAA BBBB CCCC DDDD
>> AAAA <---- BBBB <---- CCCC <---- DDDD
>
> It's physically a 2 wire half duplex ring with messages going in both
> directions around the ring to provide redundancy. Say 8 nodes 1 to
> 8. Node 1 talks to nodes 2 and 8, node 2 talks to nodes 1 and 3 etc.

Is this a synchronous protocol? Or, are you just using a pair
of UARTs on each device to implement the CW & CCW links?

If that's the case, you have to remember to include all the
"overhead bit(-time)s" in your evaluation of the error rate
and your performance thereunder.

E.g., a start bit error is considerably different than a
*data* bit error (think about it).

> However we may end up with 3 ports per node making it a collection of
> rings or a mesh. The loading at the slowest baud rate is approx 10%

[scratches head] then why are you worrying about running at
a higher rate? Latency might be a reason -- assuming you
don't circulate messages effectively as they pass *through*
a node. But, recall that you only have to pass through
32 nodes, worst case, to get *a* copy of a message to any
other node...

> for 64 nodes. If we decide to allow mixed baud rates, each node will
> have the ability to tell its adjacent nodes to slow down when its
> message queue gets to a certain level, allowing it to cope with a
> brief surge in messages.

Depending on how you chose to allocate the Tx&Rx devices in each
link -- and, whether or not your baudrate generator allows
the Tx and Rx to run at different baudrates -- you have to:
* make sure your Tx FIFO (hardware and software) is empty before
changing Tx baudrate
* make sure your "neighbor" isn't sending data to you when you
change your Rx baudrate (!)

Consider that a link (a connection to *a* neighbor) that "gives you
problems" will probably (?) cause problems in all communications
with that neighbor (Tx & Rx). So, you probably want to tie the
Tx and Rx channels of *one* device to that neighbor (vs. splitting
the Rx with the upstream and Tx with the downstream IN A GIVEN RING)

[this may seem intuitive -- or not! For the *other* case, see end]

Now, when you change the Rx baudrate for the upstream CW neighbor,
you are also (?) changing the Tx baudrate for the downstream CCW
neighbor (the "neighbor" is the same physical node in each case).
Also, you have to consider if you will be changing the baudrate
for the "other" ring simultaneously (so you have to consider the
RTT in your switching calculations).

Chances are (bet dollars to donuts?), the two rings are in different
points of their message exchange (since the distance from message
originator to that particular node is different in the CW ring
vs. the CCW ring). I.e., this may be a convenient time to change
the baudrate (thereby INTERRUPTING the flow of data around the ring)
for the CW ring -- but, probably *not* for the CCW ring.

[recall, changing baudrate is probably going to result in lots
of errors for the communications to/from the affected neighbor(s)]

So, you really have to wait for the entire ring to become idle
before you change baudrates -- and then must have all nodes do
so more or less concurrently (for that ring). If you've split the
Tx and Rx like I described, then this must also happen on the
"other" ring at the same time.

Regarding the "other" way to split the Tx&Rx... have the Tx
always talk to the downstream neighbor and Rx the upstream
IN THE SAME RING. In this case, changes to Tx+Rx baudrates
apply only to a certain ring. So, you can change baudrate
when it is convenient (temporally) for that *ring*.

But, now the two rings are potentially operating at different
rates. So, the "other" ring will eventually ALSO have to
have its baudrate adjusted to match (or, pass different traffic)

D Yuniskis

Mar 29, 2011, 11:17:18 AM
Hi Chris,

I think the OP is pretty much stuck with having logic in the ISR.
Consider, the upstream Rx IRQ has to push the received packet into
the downstream Rx FIFO to propagate the message around that ring.
At the very least, the Rx ISR needs to decide if the message
"looks" correct -- at least, "correct enough" to warrant passing
it along (presumably, you don't want to propagate errors that
were picked up locally -- since those errors can then become
UNerrors on the next op...)

If the Rx ISR had to require the intervention of user-land code
(or, any higher level of the protocol stack) to move the received
packet "out", then the time spent *in* the node would increase
dramatically which is reflected in the RTT of the entire ring(s).

ChrisQ

Mar 29, 2011, 1:39:12 PM
D Yuniskis wrote:
> Hi Chris,

>
>
> I think the OP is pretty much stuck with having logic in the ISR.
> Consider, the upstream Rx IRQ has to push the received packet into
> the downstream Rx FIFO to propagate the message around that ring.
> At the very least, the Rx ISR needs to decide if the message
> "looks" correct -- at least, "correct enough" to warrant passing
> it along (presumably, you don't want to propagate errors that
> were picked up locally -- since those errors can then become
> UNerrors on the next op...)
>
> If the Rx ISR had to require the intervention of user-land code
> (or, any higher level of the protocol stack) to move the received
> packet "out", then the time spent *in* the node would increase
> dramatically which is reflected in the RTT of the entire ring(s).

One could ask about the wisdom of using a ring topology, which will
always involve more latency than a multidrop network using some sort of
poll / select or request / response protocol. You must have more than one
comms link for redundancy, as any break in the ring isolates any node
past the fault. You need double the comms hardware, as each node needs
an rx and tx uart. In the presence of faults, a ring topology doesn't
degrade anything like as gracefully as multidrop, either. Finally, where
does ddcmp fit into the picture? Ddcmp is more than just a frame
format, it's a complete protocol spec with defined message flows, state
transitions, error recovery etc...

Regards,

Chris

D Yuniskis

Mar 29, 2011, 2:16:28 PM
Hi Chris,

On 3/29/2011 10:39 AM, ChrisQ wrote:


> D Yuniskis wrote:
>
>> I think the OP is pretty much stuck with having logic in the ISR.
>> Consider, the upstream Rx IRQ has to push the received packet into
>> the downstream Rx FIFO to propagate the message around that ring.
>> At the very least, the Rx ISR needs to decide if the message
>> "looks" correct -- at least, "correct enough" to warrant passing
>> it along (presumably, you don't want to propagate errors that
>> were picked up locally -- since those errors can then become
>> UNerrors on the next op...)
>>
>> If the Rx ISR had to require the intervention of user-land code
>> (or, any higher level of the protocol stack) to move the received
>> packet "out", then the time spent *in* the node would increase
>> dramatically which is reflected in the RTT of the entire ring(s).
>
> One could ask about the wisdom of using a ring topology, which will
> always involve more latency than a multidrop network using some sort of
> poll / select or request / response protocol.

Yup. I'm *guessing* this was just a "free" (hardware) approach.
You'd have to ask the OP for his specific reasons for this approach...

> You must have more than one
> comms link for redundancy, as any break in the ring isolates any node
> past the fault.

Hence the double ring approach -- with twice the "cost".

> You need double the comms hardware, as each node needs
> an rx and tx uart. In the presence of faults, a ring topology doesn't
> degrade anything like as gracefully as multidrop, either.

Well, to be fair, with a bus topology, anything that can
toast/busy the bus indefinitely will shut down *all*
communications. With a ring, one can at least talk to
one's neighbor (even if you can't get a reply).

I.e., a node can say "the bozo upstream from me is spouting
endless gibberish" (or, "hasn't said anything in *DAYS*")
and, therefore, you can hope folks downstream from you propagate
this and, as a result, know that they should move to a safe/secure
state *and* know where the fault likely lies.

In a bus topology, everyone has to monitor the bus's health
independently and there is *no* communication in the event of
a failure.

I know of at least one such design in which the "master(s)" had
the ability to impress a high voltage on the bus with the
intent that a malfunctioning node wouldn't be smart enough to
galvanically isolate itself from the bus during this event
(this would blow fuses that would, thereafter, isolate the
offending node :> ).

No idea how well this worked, in practice. Amusing idea, though!
If only *all* problems could be solved with a suitably high
voltage ;)

ChrisQ

Mar 29, 2011, 6:31:52 PM
D Yuniskis wrote:
> Hi Chris,
>
> Well, to be fair, with a bus topology, anything that can
> toast/busy the bus indefinitely will shutdown *all*
> communications. With a ring, one can at least talk to
> one's neighbor (even if you can't get a reply).
>

That's true, say with a line turnaround after tx that doesn't happen, thereby
pulling the line down. It's still a lower cost solution than the ring
approach. Of course, ethernet fixes that problem, but at the cost of
increased software complexity, so there really is no free lunch.

> In a bus topology, everyone has to monitor the bus's health
> independently and there is *no* communication in the event of
> a failure.

You often need a keep alive packet for ongoing health check,
which would fail to get an ack on a hardware failure, but that's still
less hw in the path than ring topology, so better mtbf (in theory) to
start with.

>
> I know of at least one such design in which the "master(s)" had
> the ability to impress a high voltage on the bus with the
> intent that a malfunctioning node wouldn't be smart enough to
> galvanically isolate itself from the bus during this event
> (this would blow fuses that would, thereafter, isolate the
> offending node :> ).

Horrifying, though I guess you could implement that using zener diodes and
limiting r on each node for protection, placed after, say, 50mA fuses,
which would be easy to open circuit. Sounds a bit extreme though...

Regards,

Chris

robert...@yahoo.com

Mar 29, 2011, 6:40:25 PM


The traditional way to lower latency in a ring is to start
transmitting to the next node early - at least as soon as you see that
the address (hopefully at the front of the frame) isn't yours. If the
frame is bad, you can force that to be passed on by making sure the
ending delimiter is transmitted with an indication of error. If you
do it right, then in the worst case the bad frame will be pruned at
each node, so even if the address has been damaged* (or it was
addressed to a non-existent node), it'll get removed in a reasonable
amount of time.


*And exactly where the frame is removed from the ring is a design
question. Often the sender removes frames it had sent, when they make
their way back around, in which case the critical item for removal is
the source address (and usually in that case the destination node sets
a "copied" bit in the trailer, thus verifying physical transmission of
the frame to the destination).
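
A sketch of the per-byte forwarding decision (addresses, offsets and
driver calls invented here):

#include <stdint.h>

#define ADDR_OFFSET 0    /* assumes the destination address leads the frame */
#define MY_ADDRESS  0x12 /* example node address */

static int offset;       /* byte position within the current frame */
static int relaying;     /* set once the frame is known not to be ours */

void on_rx_byte(uint8_t b)
{
    if (offset == ADDR_OFFSET)
        relaying = (b != MY_ADDRESS);  /* decide as early as possible */
    if (relaying)
        tx_downstream(b);              /* hypothetical driver call */
    offset++;
    /* at end of frame: reset offset/relaying; if the local CRC check
       failed, forward the ending delimiter marked "errored" so the
       frame gets pruned downstream */
}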

robert...@yahoo.com

Mar 29, 2011, 7:07:31 PM
On Mar 29, 12:39 pm, ChrisQ <m...@devnull.com> wrote:
> D Yuniskis wrote:
> > Hi Chris,
>
> > I think the OP is pretty much stuck with having logic in the ISR.
> > Consider, the upstream Rx IRQ has to push the received packet into
> > the downstream Rx FIFO to propagate the message around that ring.
> > At the very least, the Rx ISR needs to decide if the message
> > "looks" correct -- at least, "correct enough" to warrant passing
> > it along (presumably, you don't want to propagate errors that
> > were picked up locally -- since those errors can then become
> > UNerrors on the next op...)
>
> > If the Rx ISR had to require the intervention of user-land code
> > (or, any higher level of the protocol stack) to move the received
> > packet "out", then the time spent *in* the node would increase
> > dramatically which is reflected in the RTT of the entire ring(s).
>
> One could ask about the wisdom of using a ring topology, which will
> always involve more latency than a multidrop network using some sort of
> poll / select or request / response protocol. You must have more than one
> comms link for redundancy, as any break in the ring isolates any node
> past the fault. You need double the comms hardware, as each node needs
> an rx and tx uart.

You only need double the comm hardware if you want redundant rings
(I'm not sure if that was the point you were making or not). On a
single ring (or one of a pair of double rings), you just need a
funky cable, and connect Tx clockwise, and Rx counterclockwise (or
vice versa) on the ring. You can add some redundancy by adding some
relays to physically bypass inactive nodes from the ring, and deal
with wire breaks too with an additional pair of wires, and some more
switch gear, but...

...at what point does one just give up on all that custom work, and
just toss on a cheap Ethernet port, and invest in a basic switch?
After all, you can get single quantities of a PIC with Ethernet for
$4.50 (including pretty much everything except the magnetics). And
that neatly solves bandwidth issues too (if 10Mb isn't enough for some
nodes, you can always give those nodes faster interfaces, and two
nodes talking don't take bandwidth away from other nodes that might
want to talk, assuming a proper switch).

D Yuniskis

Mar 29, 2011, 8:38:27 PM
Hi Robert,

On 3/29/2011 3:40 PM, robert...@yahoo.com wrote:

[attributions elided]

Yes, but this only makes sense in an environment where errors
are "infrequent". Note that the OP is talking of error rates
as high as "19 bytes out of 60". In that case, there's just too
great a chance that you will start passing a corrupted message
with little chance to RELIABLY mark it as such "on the tail
end" (i.e., your error indication stands a good chance of
being corrupted -- what I playfully called "UNerrors")

As a policy, I don't like (implicitly) "blessing" anything
that I don't have a high degree of confidence in ("I" being
code that I write).

D Yuniskis

Mar 29, 2011, 8:43:25 PM
Hi Chris,

On 3/29/2011 3:31 PM, ChrisQ wrote:
> D Yuniskis wrote:

>> I know of at least one such design in which the "master(s)" had
>> the ability to impress a high voltage on the bus with the
>> intent that a malfunctioning node wouldn't be smart enough to
>> galvanically isolate itself from the bus during this event
>> (this would blow fuses that would, thereafter, isolate the
>> offending node :> ).
>
> Horrifying, though I guess you could implement that using zener diodes and
> limiting r on each node for protection, placed after, say, 50mA fuses,
> which would be easy to open circuit. Sounds a bit extreme though...

I don't know if it was a result of a client requirement or just
paranoia on the part of the system designer(s) -- the system was
designed for 24/7/365 unattended operation in a hostile physical
environment (with no on-site personnel to even determine if something
was "malfunctioning"). <shrug> When you have no *practical*
alternatives, I guess you just do the best you can!

Shane williams

Mar 30, 2011, 7:41:21 AM
On Mar 30, 6:39 am, ChrisQ <m...@devnull.com> wrote:
>
> One could ask about the wisdom of using a ring topology, which will
> always involve more latency than a multidrop network using some sort of
> poll / select or request / response protocol. You must have more than one
> comms link for redundancy, as any break in the ring isolates any node
> past the fault. You need double the comms hardware, as each node needs
> an rx and tx uart. In the presence of faults, a ring topology doesn't
> degrade anything like as gracefully as multidrop, either. Finally, where
> does ddcmp fit into the picture? Ddcmp is more than just a frame
> format, it's a complete protocol spec with defined message flows, state
> transitions, error recovery etc...

I found out today ddcmp was used purely because it calculated the CRC
for us. All it does is the framing. All the state transition and
error recovery stuff is turned off. Using ddcmp was probably a
mistake because the ccitt crc can be calculated quickly enough, and soon
we'll be doing a new version of this device with a different micro
which will have to be compatible with the existing device, so it will
still have to use ddcmp but without the hardware support.

I'm trying to reduce the propagation delay of messages around the
ring without requiring the customer to fit twisted pair cable
everywhere and I'm also trying to improve the error monitoring so we
can signal when a connection isn't performing well enough without
creating nuisance faults, hence my interest in the error detection
capability of crc16-ccitt.

We actually already do have an RS485 multi-drop version of this
protocol but it's non-deterministic and doesn't work very well. I
don't really want to go into that...

Shane williams

Mar 30, 2011, 8:12:03 AM
On Mar 30, 4:09 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/28/2011 3:28 AM, Shane williams wrote:
>
> > On Mar 28, 6:23 pm, D Yuniskis<not.going.to...@seen.com>  wrote:
> >> Regardless...  consider that if you twiddle with the baud rate
> >> on any link, you will either need to make sure *all* links
> >> "simultaneously" update their baud-rates (taking into
> >> consideration any packets "in the pipe")
>
> >> -- or --
>
> >> you have to provide an elastic store in each node and some
> >> smarts to decide what data that node can *drop* (since it's
> >> outbound connection may not? be at the same rate as it's
> >> inbound connection)
>
> >> [this last bit applies iff there is a real second channel
> >> in each node like:
>
> >>    AAAA ----> BBBB ----> CCCC ----> DDDD
> >>    AAAA       BBBB       CCCC       DDDD
> >>    AAAA <---- BBBB <---- CCCC <---- DDDD
>
> > It's physically a 2 wire half duplex ring with messages going in both
> > directions around the ring to provide redundancy.  Say 8 nodes 1 to
> > 8.  Node 1 talks to nodes 2 and 8, node 2 talks to nodes 1 and 3 etc.
>
> Is this a synchronous protocol?  Or, are you just using a pair
> of UARTs on each device to implement the CW & CCW links?


Asynchronous with a pair of uarts, one for clockwise, one for counter-
clockwise.

>
> If that's the case, you have to remember to include all the
> "overhead bit(-time)s" in your evaluation of the error rate
> and your performance thereunder.
>
> E.g., a start bit error is considerably different than a
> *data* bit error (think about it).

Hmm. I forgot about that.  A start or stop bit error means the whole
message is rejected, which is good.


>
> > However we may end up with 3 ports per node making it a collection of
> > rings or a mesh.  The loading at the slowest baud rate is approx 10%
>
> [scratches head] then why are you worrying about running at
> a higher rate?  

Because not all sites can wire as a mesh. The third port is optional
but helps the propagation delay a lot.


> Latency might be a reason -- assuming you
> don't circulate messages effectively as they pass *through*
> a node.  But, recall that you only have to pass through
> 32 nodes, worst case, to get *a* copy of a message to any
> other node...
>
> > for 64 nodes.  If we decide to allow mixed baud rates, each node will
> > have the ability to tell its adjacent nodes to slow down when its
> > message queue gets to a certain level, allowing it to cope with a
> > brief surge in messages.
>
> Depending on how you chose to allocate the Tx&Rx devices in each
> link -- and, whether or not your baudrate generator allows
> the Tx and Rx to run at different baudrates -- you have to:
> * make sure your Tx FIFO (hardware and software) is empty before
>    changing Tx baudrate
> * make sure your "neighbor" isn't sending data to you when you
>    change your Rx baudrate (!)

This is assured. It's half duplex and the hardware sends a whole
message at a time.


>


> Consider that a link (a connection to *a* neighbor) that "gives you
> problems" will probably (?) cause problems in all communications
> with that neighbor (Tx & Rx).  So, you probably want to tie the
> Tx and Rx channels of *one* device to that neighbor (vs. splitting
> the Rx with the upstream and Tx with the downstream IN A GIVEN RING)

Not sure I follow but a single uart does both the tx and rx to the
same neighbor.

>
> [this may seem intuitive -- or not!  For the *other* case, see end]
>
> Now, when you change the Rx baudrate for the upstream CW neighbor,
> you are also (?) changing the Tx baudrate for the downstream CCW
> neighbor (the "neighbor" is the same physical node in each case).

Yes

> Also, you have to consider if you will be changing the baudrate
> for the "other" ring simultaneously (so you have to consider the
> RTT in your switching calculations).

What is RTT?


>
> Chances are (bet dollars to donuts?), the two rings are in different
> points of their message exchange (since the distance from message
> originator to that particular node is different in the CW ring
> vs. the CCW ring).  I.e., this may be a convenient time to change
> the baudrate (thereby INTERRUPTING the flow of data around the ring)
> for the CW ring -- but, probably *not* for the CCW ring.


I'm lost here.

>
> [recall, changing baudrate is probably going to result in lots
> of errors for the communications to/from the affected neighbor(s)]
>
> So, you really have to wait for the entire ring to become idle
> before you change baudrates -- and then must have all nodes do
> so more or less concurrently (for that ring).  If you've split the
> Tx and Rx like I described, then this must also happen on the
> "other" ring at the same time.
>
> Regarding the "other" way to split the Tx&Rx... have the Tx
> always talk to the downstream neighbor and Rx the upstream
> IN THE SAME RING.  In this case, changes to Tx+Rx baudrates
> apply only to a certain ring.  So, you can change baudrate
> when it is convenient (temporally) for that *ring*.
>
> But, now the two rings are potentially operating at different
> rates.  So, the "other" ring will eventually ALSO have to
> have its baudrate adjusted to match (or, pass different traffic)
>

I think there must be a misunderstanding somewhere - not sure where.

Shane williams

Mar 30, 2011, 8:20:09 AM
On Mar 31, 1:12 am, Shane williams <shane.2471...@gmail.com> wrote:
>
> > > It's physically a 2 wire half duplex ring with messages going in both
> > > directions around the ring to provide redundancy.  Say 8 nodes 1 to
> > > 8.  Node 1 talks to nodes 2 and 8, node 2 talks to nodes 1 and 3 etc.
>
> > Is this a synchronous protocol?  Or, are you just using a pair
> > of UARTs on each device to implement the CW & CCW links?
>
> Asynchronous with a pair of uarts, one for clockwise, one for counter-
> clockwise.
>

Oops, I made a mistake here - one uart for neighbour 1, another uart
for neighbour 2.  Tx to neighbour 1 is the CW data (say) and tx to
neighbour 2 is the CCW data.


ChrisQ

Mar 30, 2011, 8:31:31 AM
Shane williams wrote:

>
> I found out today ddcmp was used purely because it calculated the CRC
> for us. All it does is the framing. All the state transition and
> error recovery stuff is turned off. Using ddcmp was probably a
> mistake because ccitt crc can be calculated quickly enough and soon
> we'll be doing a new version of this device with a different micro
> which will have to be compatible with the existing device so will
> still have to ddcmp but without the hardware support.
>

So you're using some hardware device with internal crc hw ?. Just
curious, which device ?.

>
> I'm trying to improve the propagation delay of messages around the
> ring without requiring the customer to fit twisted pair cable
> everywhere and I'm also trying to improve the error monitoring so we
> can signal when a connection isn't performing well enough without
> creating nuisance faults, hence my interest in the error detection
> capability of crc16-ccitt.
>

Two unknowns: max cable length between nodes and max baud rate ?. I
assume that you are currently running unbalanced rs232 style cabling ?.

If you are limited on baud rate due to cable length, you might be able
to compress the data. A recent project for led road signs was limited
to 9600 baud, but the screen update requirement of 1 second max meant
that we had no option but to use compression.

>
> We actually already do have an RS485 multi-drop version of this
> protocol but it's non-deterministic and doesn't work very well. I
> don't really want to go into that...
>

Sounds like a better place to start, from a technical point of view :-)...


Regards,

Chris

Shane williams

Mar 30, 2011, 8:34:50 AM
On Mar 30, 11:40 am, "robertwess...@yahoo.com" <robertwess...@yahoo.com> wrote:
>
> The traditional way to lower latency in a ring is to start
> transmitting to the next node early - at least as soon as you see that
> the address (hopefully at the front of the frame) isn't yours.  If the
> frame is bad, you can force that to be passed on by making sure the
> ending delimiter is transmitted with an indication of error.  If you
> do it right, then in the worst case the bad frame will be pruned at
> each node, so even if the address has been damaged* (or it was
> addressed to a non-existent node), it'll get removed in a reasonable
> amount of time.

It's half duplex so we can't start transmitting early.


>
> *And exactly where the frame is removed from the ring is a design
> question.  Often the sender removes frames it had sent, when they make
> their way back around, in which case the critical item for removal is
> the source address (and usually in that case the destination node sets
> a "copied" bit in the trailer, thus verifying physical transmission of
> the frame to the destination).

In our case, there are two logical rings and each message placed on
the ring is sent in both directions. When the two messages meet up
approximately half-way round they annihilate each other - but if the
annihilation fails, the sender removes them as well.

Shane williams

Mar 30, 2011, 8:53:02 AM
On Mar 31, 1:31 am, ChrisQ <m...@devnull.com> wrote:
> Shane williams wrote:
>
> > I found out today ddcmp was used purely because it calculated the CRC
> > for us.  All it does is the framing.  All the state transition and
> > error recovery stuff is turned off.  Using ddcmp was probably a
> > mistake because ccitt crc can be calculated quickly enough and soon
> > we'll be doing a new version of this device with a different micro
> > which will have to be compatible with the existing device so will
> > still have to ddcmp but without the hardware support.
>
> So you using some hardware device with internal crc hw ?. Just
> curious, which device ?.
>

Motorola 68302


>
>
> > I'm trying to improve the propagation delay of messages around the
> > ring without requiring the customer to fit twisted pair cable
> > everywhere and I'm also trying to improve the error monitoring so we
> > can signal when a connection isn't performing well enough without
> > creating nuisance faults, hence my interest in the error detection
> > capability of crc16-ccitt.
>
> Two unknowns: Max cable length between nodes and max baud rate ?. Assume
> that you are currently running unbalanced rs232 style cabling ?.

It's RS485 but apparently a variety of cable gets used, not always
twisted pair.


>
> If you are limited on baud rate due to cable length, you might be able to
> compress the data. A recent project for led road signs was limited to 9600
> bauds, but the screen update requirement of 1 second max meant that we
> had no
> option but to use compression.
>

I don't know much about compression but it sounds too CPU intensive
for the 68302? What micro are you using?

D Yuniskis

Mar 30, 2011, 10:33:24 AM
Hi Shane,

On 3/30/2011 5:34 AM, Shane williams wrote:
> On Mar 30, 11:40 am, "robertwess...@yahoo.com"
> <robertwess...@yahoo.com> wrote:
>>
>> The traditional way to lower latency in a ring is to start
>> transmitting to the next node early - at least as soon as you see that
>> the address (hopefully at the front of the frame) isn't yours. If the
>> frame is bad, you can force that to be passed on by making sure the
>> ending delimiter is transmitted with an indication of error. If you
>> do it right, then in the worst case the bad frame will be pruned at
>> each node, so even if the address has been damaged* (or it was
>> addressed to a non-existent node), it'll get removed in a reasonable
>> amount of time.
>
> It's half duplex so we can't start transmitting early.

So, you have to buffer a message, verify its integrity and then
push it on to the next node. This suggests it is either done
*in* the Rx ISR *or* an ASR running tightly coupled to it
(else you risk adding processing delays to the propagation delay).

I.e., the time a message takes to circumnavigate the ring is
~K * n where K reflects message size, baud rate and per node
processing.

>> *And exactly where the frame is removed from the ring is a design
>> question. Often the sender removes frames it had sent, when they make
>> their way back around, in which case the critical item for removal is
>> the source address (and usually in that case the destination node sets
>> a "copied" bit in the trailer, thus verifying physical transmission of
>> the frame to the destination).
>
> In our case, there are two logical rings and each message placed on
> the ring is sent in both directions. When the two messages meet up
> approximately half-way round they annihilate each other - but if the
> annihilation fails, the sender removes them as well.

How does the sender *recognize* them as "his to annihilate"?
I.e., if the data can be corrupted, so can the "sender ID"!
The problem with a ring is that it has no "end" so things have
the potential to go 'round and 'round and 'round and...

If you unilaterally drop any message found to be corrupted, then
you have no way of knowing if it was received by its intended
recipient (since you don't know who its recipient is -- or was).
If you await acknowledgment (and retry until you receive it),
then you run the risk of a message being processed more than
once. etc.

ChrisQ

Mar 30, 2011, 12:26:20 PM
Shane williams wrote:

> It's RS485 but apparently a variety of cable gets used, not always
> twisted pair.
>

Not wishing to offend, but this sounds like a legacy project that was
originally ill thought out and a bit of a hack to start with. You used
the ddcmp frame format, but didn't implement the full protocol.
The system wiring is not RS485 conforming, so it's susceptible to
noise related errors and line drive problems. Data reliability is
exactly what protocol definitions like ddcmp are designed to address.

I think you will have to at least rewire with twisted pair before
addressing any sw issues. If the hardware is bad, then no amount of
software will fix the problem...

>>
>
> I don't know much about compression but it sounds too CPU intensive
> for the 68302? What micro are you using?
>

The project used the Renesas 32C87 series from Hitachi. Not such an
elegant arch as 68k, but almost certainly faster than a '302. Depending
on the data, simple compression like Huffman encoding can work quite
well, but another way might be to simplify / reorganise the frame format
or data within it, so you can send fewer bytes...
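
To give a feel for how small the encode loop can be, here's a sketch
given a precomputed code table.  The struct and names are invented for
illustration; building the table from symbol frequencies is the other
half of the job...

    /* Bit-packing encoder for a precomputed Huffman code table. */
    #include <stdint.h>
    #include <stddef.h>

    struct hcode {
        uint16_t code;  /* code value, right-justified in low 'len' bits */
        uint8_t  len;   /* code length in bits (1..16) */
    };

    size_t huff_encode(const uint8_t *in, size_t n,
                       const struct hcode table[256], uint8_t *out)
    {
        size_t   obytes = 0;
        uint32_t acc    = 0;   /* bit accumulator */
        int      nbits  = 0;   /* valid bits currently in acc */
        size_t   i;

        for (i = 0; i < n; i++) {
            const struct hcode *c = &table[in[i]];
            acc    = (acc << c->len) | c->code;
            nbits += c->len;
            while (nbits >= 8) {            /* flush whole bytes */
                nbits -= 8;
                out[obytes++] = (uint8_t)(acc >> nbits);
            }
        }
        if (nbits)                          /* pad the final partial byte */
            out[obytes++] = (uint8_t)(acc << (8 - nbits));
        return obytes;
    }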

Regards,

Chris

D Yuniskis

Mar 30, 2011, 1:14:52 PM
Hi Shane,

On 3/30/2011 5:12 AM, Shane williams wrote:
> On Mar 30, 4:09 am, D Yuniskis<not.going.to...@seen.com> wrote:

>> Is this a synchronous protocol? Or, are you just using a pair
>> of UARTs on each device to implement the CW & CCW links?


>
> Asynchronous with a pair of uarts, one for clockwise, one for counter-
> clockwise.

OK. Been there, done that, T-shirt to prove it...

>> If that's the case, you have to remember to include all the
>> "overhead bit(-time)s" in your evaluation of the error rate
>> and your performance thereunder.
>>
>> E.g., a start bit error is considerably different than a
>> *data* bit error (think about it).
>
> Hmm. I forgot about that. A start or stop bit error means the whole
> message is rejected which is good.

My point was that if you *miss* a start bit, then you have -- at
the very least -- missed the "first" bit of the message (because,
if it was MARKING, the UART just ignored it and, if it was SPACING,
the UART thought *it* was the start bit). If you are pushing
bytes (characters) down the wire at the maximum data rate (minimal
stop time between characters), then you run the risk of part of
the *next* character being "shifted" into this "misaligned" first
character. I.e., it gets really difficult to figure out *if*
your code will be able to detect an error (because the received
byte "looks wrong") or if, BY CHANCE, the bit patterns can conspire
to look like a valid "something else".
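
If it helps, here's a toy demonstration of that slip -- just 10-bit
frames in an array, not a real UART.  Re-framing the identical bit
stream one bit late yields one byte with a framing error followed by a
wrong byte that frames perfectly cleanly:

    #include <stdio.h>
    #include <stdint.h>

    /* Emit one async frame: start(0), 8 data bits LSB-first, stop(1). */
    static int put_frame(uint8_t *bits, int pos, uint8_t byte)
    {
        int i;
        bits[pos++] = 0;                     /* start bit */
        for (i = 0; i < 8; i++)
            bits[pos++] = (byte >> i) & 1;   /* data, LSB first */
        bits[pos++] = 1;                     /* stop bit */
        return pos;
    }

    /* Re-frame from 'start': hunt for a 0, take 8 data bits, check stop. */
    static void reframe(const uint8_t *bits, int start, int len)
    {
        int i = start;
        while (i + 10 <= len) {
            uint8_t b = 0;
            int k;
            if (bits[i] != 0) { i++; continue; }   /* line marking: skip */
            for (k = 0; k < 8; k++)
                b |= (uint8_t)(bits[i + 1 + k] << k);
            printf("0x%02X%s ", (unsigned)b,
                   bits[i + 9] ? "" : "(framing error)");
            i += 10;
        }
        printf("\n");
    }

    int main(void)
    {
        uint8_t bits[64];
        int n = 0;
        n = put_frame(bits, n, 0x31);
        n = put_frame(bits, n, 0x32);
        n = put_frame(bits, n, 0x33);

        reframe(bits, 0, n);  /* aligned: 0x31 0x32 0x33              */
        reframe(bits, 1, n);  /* one bit late: one framing error,     */
                              /* then a wrong byte that frames cleanly */
        return 0;
    }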

>>> However we may end up with 3 ports per node making it a collection of
>>> rings or a mesh. The loading at the slowest baud rate is approx 10%
>>
>> [scratches head] then why are you worrying about running at
>> a higher rate?
>
> Because not all sites can wire as a mesh. The third port is optional
> but helps the propagation delay a lot.

Sorry, the subject wasn't clear in my question <:-(
I mean, if you were to stick with the slowest rate, your
"10%" number *suggests* you have lots of margin -- why
push for a higher rate with the potential for more
problems?

>> Latency might be a reason -- assuming you
>> don't circulate messages effectively as they pass *through*
>> a node. But, recall that you only have to pass through
>> 32 nodes, worst case, to get *a* copy of a message to any
>> other node...
>>
>>> for 64 nodes. If we decide to allow mixed baud rates, each node will
>>> have the ability to tell its adjacent nodes to slow down when its
>>> message queue gets to a certain level, allowing it to cope with a
>>> brief surge in messages.
>>
>> Depending on how you chose to allocate the Tx&Rx devices in each
>> link -- and, whether or not your baudrate generator allows
>> the Tx and Rx to run at different baudrates -- you have to:
>> * make sure your Tx FIFO (hardware and software) is empty before
>> changing Tx baudrate
>> * make sure your "neighbor" isn't sending data to you when you
>> change your Rx baudrate (!)
>
> This is assured. It's half duplex and the hardware sends a whole
> message at a time.

So, for each ring, you WON'T receive a message until you have
transmitted any previous message? Alternatively, you won't
transmit a message until your receiver is finished?

What prevents two messages from being "in a ring" at the same
time (by accident)? I.e., without violating the above, it
seems possible that node 18 can be sending to node 19 (while
19 is NOT sending to 20 and 17 is not sending to 18) at the
same time that node 3 is sending to node 4 (while neither 2
nor 4 are actively transmitting).

Since this *seems* possible, how can you be sure one message
doesn't get delayed slightly so that the second message ends
up catching up to it? (i.e., node 23 has no way of knowing
that node 24 is transmitting to 25 so 23 *could* start sending
a message to 24 that 24 fails to notice -- in whole or in
part -- because 24 is preoccupied with its outbound message)

>> Consider that a link (a connection to *a* neighbor) that "gives you
>> problems" will probably (?) cause problems in all communications
>> with that neighbor (Tx & Rx).  So, you probably want to tie the
>> Tx and Rx channels of *one* device to that neighbor (vs. splitting
>> the Rx with the upstream and Tx with the downstream IN A GIVEN RING)
>
> Not sure I follow but a single uart does both the tx and rx to the
> same neighbor.
>
>> [this may seem intuitive -- or not! For the *other* case, see end]
>
>> Now, when you change the Rx baudrate for the upstream CW neighbor,
>> you are also (?) changing the Tx baudrate for the downstream CCW
>> neighbor (the "neighbor" is the same physical node in each case).
>
> Yes
>
>> Also, you have to consider if you will be changing the baudrate
>> for the "other" ring simultaneously (so you have to consider the
>> RTT in your switching calculations).
>
> What is RTT?

Round Trip Time (sorry :< ) I.e., you (each of your nodes) has
to be aware of the time it takes a message to (hopefully) make
it around the ring.

>> Chances are (bet dollars to donuts?), the two rings are in different
>> points of their message exchange (since the distance from message
>> originator to that particular node is different in the CW ring
>> vs. the CCW ring). I.e., this may be a convenient time to change
>> the baudrate (thereby INTERRUPTING the flow of data around the ring)
>> for the CW ring -- but, probably *not* for the CCW ring.
>
> I'm lost here.

Number the nodes 1 - 10 (sequentially).
The CW ring has 1 sending to 2, 2 sending to 3, ... 10 sending to 1.
The CCW ring has 10 sending to 9, 9 sending to 8, ... 1 sending to 10.
The nodes operate concurrently.

So, assume 7 originates a message -- destined for 3. In the CW ring,
it is routed as 7, 8, 9, 10, 1, 2, 3. In the CCW ring, it is routed
(simultaneously) as 7, 6, 5, 4, 3.

*If* it progresses node to node at the exact same rates in each
ring (this isn't guaranteed but "close enough for gummit work"),
then it arrives in 8 & 6 at the same time, 9 & 5, 10 & 4, 1 & 3,
2 & 2 (though different "rings"), 3 & 1, etc. (note I have assumed,
here, that it continues around until reaching its originator...
but, that's not important).

Now, at node 9, if the CW ring decides that the baudrate needs to be
changed and it thinks "now is a good time to do so" (because it has
*just* passed its CW message on to node 10), that action effectively
interrupts any traffic in the CW ring (until the other nodes make
the similar baudrate adjustment in the CW direction).

But, there is a message circulating in the CCW ring -- it was just
transmitted from node 5 to 4 (while 9 was sending to 10). It will
eventually be routed to node 9 as it continues its way around the
CCW ring. But, *it* is moving at the original baudrate (in the CCW
ring) while node 9 is now operating at the *new* baudrate (in the
CW ring). So, any new traffic in the CW ring will run around
that ring at a different rate than the CCW traffic. If you only
allow one message to be active in each ring at any given time, then
this will "resolve itself" one RTT later. But, if the "other"
ring never decides to change baudrates... ?

And, if it *does* change baudrates at the same time as the "first"
ring, then you have to wait for the CW message to have been
completely propagated *and* the CCW message as well before making
the change. I.e., you have to let both rings go idle before
risking the switch (or, take considerable care to ensure that
a switch doesn't happen DOWNstream of a circulating message)

>> [recall, changing baudrate is probably going to result in lots
>> of errors for the communications to/from the affected neighbor(s)]
>>
>> So, you really have to wait for the entire ring to become idle
>> before you change baudrates -- and then must have all nodes do
>> so more or less concurrently (for that ring). If you've split the
>> Tx and Rx like I described, then this must also happen on the
>> "other" ring at the same time.
>>
>> Regarding the "other" way to split the Tx&Rx... have the Tx
>> always talk to the downstream neighbor and Rx the upstream
>> IN THE SAME RING. In this case, changes to Tx+Rx baudrates
>> apply only to a certain ring. So, you can change baudrate
>> when it is convenient (temporally) for that *ring*.
>>
>> But, now the two rings are potentially operating at different
>> rates. So, the "other" ring will eventually ALSO have to
>> have its baudrate adjusted to match (or, pass different traffic)
>
> I think there must be a misunderstanding somewhere - not sure where.

You can wire two UARTs to give you two rings in TWO DIFFERENT WAYS.
Look at a segment of the ring with three nodes:

------> 1 AAAA 1 --------> 1 BBBB 1 --------> 1 CCCC 1 ------->
          AAAA               BBBB               CCCC
<------ 2 AAAA 2 <-------- 2 BBBB 2 <-------- 2 CCCC 2 <-------

vs.

------> 1 AAAA 2 --------> 1 BBBB 2 --------> 1 CCCC 2 ------->
          AAAA               BBBB               CCCC
<------ 1 AAAA 2 <-------- 1 BBBB 2 <-------- 1 CCCC 2 <-------

where the numbers identify the UARTs associated with each signal.

[assume tx and rx baudrates are driven by the same baudrate generator
so there is a BRG1 and BRG2 in each node]

In the first case, when you change the baudrate of a UART at some
particular node, the baudrate for that segment in *the* ring that
the UART services (left-to-right ring vs right-to-left ring) changes.
So, you must change *after* you have finished transmitting and you
will no longer be able to receive until the node upstream from you
also changes baudrate.

In the second case, when you change the baudrate of a UART at some
particular node, the baudrate for all communications with that
particular neighbor (to the left or to the right) changes. So,
*both* rings are "broken" until that neighbor makes the comparable
change.

Look at each scenario and its consequences while messages are
circulating (in both rings!). Changing data rates can be a very
disruptive thing as it forces the ring(s) to be emptied; some
minimum guaranteed quiescent period to provide a safety factor
(that no messages are still in transit); the actual change
to be effected; a quiescent period to ensure all nodes are
at the new speed; *then* you can start up again.
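
I.e., the sequencing looks something like this (the names are mine,
purely to make the ordering explicit):

    /* Phases of a ring-wide baud rate change, per the steps above.
     * Only a sketch of the required ordering, not working code. */
    enum baud_change_phase {
        DRAIN,         /* stop originating; let in-flight traffic finish */
        QUIET_BEFORE,  /* guaranteed quiescent period (safety factor)    */
        SWITCH,        /* reprogram the baud rate generator(s)           */
        QUIET_AFTER,   /* wait until ALL nodes are at the new speed      */
        RESUME         /* normal traffic again                           */
    };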

While it sounds "child-like", you might find making a drawing
and moving some coins (tokens) around the rings as if they were
messages. It helps to picture what changes to the rings'
operation you can make and *when*.

Either try NOT to change baudrates *or* change them at times
that can be determined a priori.

D Yuniskis

Mar 30, 2011, 1:49:12 PM
Hi Shane,

On 3/30/2011 4:41 AM, Shane williams wrote:
> On Mar 30, 6:39 am, ChrisQ<m...@devnull.com> wrote:
>>
>> One could ask about the wisdom of using a ring topology, which will
>> always involve more latency than a multidrop network using some sort of
>> poll / select or request / response protocol. You must have more than one
>> comms link for redundancy, as any break in the ring isolates any node
>> past the fault. You need double the comms hardware, as each node needs
>> an rx and tx uart. In the presence of faults, a ring topology doesn't
>> degrade anything like as gracefully, as multidrop either. Finally, where
>> does ddcmp fit into the picture ?. Ddcmp is more than just a frame
>> format, it's a complete protocol spec with defined messages flows, state
>> transitions, error recovery etc...
>
> I found out today ddcmp was used purely because it calculated the CRC
> for us. All it does is the framing. All the state transition and
> error recovery stuff is turned off. Using ddcmp was probably a
> mistake because ccitt crc can be calculated quickly enough and soon
> we'll be doing a new version of this device with a different micro
> which will have to be compatible with the existing device so will
> still have to ddcmp but without the hardware support.
>
> I'm trying to improve the propagation delay of messages around the

What sort of times are you seeing, presently? At which baudrates?
How much *better* do they need to be (or, would you *like* them
to be)?

> ring without requiring the customer to fit twisted pair cable
> everywhere and I'm also trying to improve the error monitoring so we
> can signal when a connection isn't performing well enough without
> creating nuisance faults, hence my interest in the error detection
> capability of crc16-ccitt.
>
> We actually already do have an RS485 multi-drop version of this
> protocol but it's non-deterministic and doesn't work very well. I
> don't really want to go into that...

It's relatively easy to get deterministic behavior from a 485
deployment. And, depending on the *actual* operating conditions
of the current ring implementation, could probably achieve
lower latencies at lower baudrates.

Shane williams

Mar 30, 2011, 6:39:07 PM
On Mar 31, 3:33 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/30/2011 5:34 AM, Shane williams wrote:
>
>
> > It's half duplex so we can't start transmitting early.
>
> So, you have to buffer a message, verify it's integrity and then
> push it on to the next node.  This suggests it is either done
> *in* the Rx ISR *or* an ASR running tightly coupled to it
> (else you risk adding processing delays to the propagation delay).

It's done in the Rx ISR.  Actually, as well as being half duplex, the
hardware sends and receives whole messages, though we could change that.


>
> I.e., the time a message takes to circumnavigate the ring is
> ~K * n where K reflects message size, baud rate and per node
> processing.

Yes.

>
> >> *And exactly where the frame is removed from the ring is a design
> >> question.  Often the sender removes frames it had sent, when they make
> >> their way back around, in which case the critical item for removal is
> >> the source address (and usually in that case the destination node sets
> >> a "copied" bit in the trailer, thus verifying physical transmission of
> >> the frame to the destination).
>
> > In our case, there are two logical rings and each message placed on
> > the ring is sent in both directions.  When the two messages meet up
> > approximately half-way round they annihilate each other - but if the
> > annihilation fails, the sender removes them as well.
>
> How does the sender *recognize* them as "his to annihilate"?
> I.e., if the data can be corrupted, so can the "sender ID"!
> The problem with a ring is that it has no "end" so things have
> the potential to go 'round and 'round and 'round and...

The message would have to get corrupted undetected every time around
the ring to go round forever.

Each device that puts a message on the ring puts his own address at
the start plus a one byte incrementing sequence number. Each node
keeps a list of address/ sequence #/ received time of the last X
messages received. If it's seen the address/ seq# before within a
certain time, it removes the message from the ring.
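
In sketch form it's roughly this -- RECENT_MAX and DUP_WINDOW_MS are
invented names/values, since the real size and timeout are tuning
decisions:

    #include <stdint.h>
    #include <stdbool.h>

    #define RECENT_MAX     64      /* "last X messages" -- a guess    */
    #define DUP_WINDOW_MS  5000    /* "a certain time" -- also a guess */

    struct recent {
        uint8_t  src;    /* originating node address         */
        uint8_t  seq;    /* one-byte incrementing sequence # */
        uint32_t when;   /* receive time, milliseconds       */
    };

    static struct recent table[RECENT_MAX];
    static unsigned next_slot;

    /* True if (src, seq) was seen recently, i.e. this copy has
     * already been around the ring and should be removed. */
    bool seen_before(uint8_t src, uint8_t seq, uint32_t now_ms)
    {
        unsigned i;
        for (i = 0; i < RECENT_MAX; i++) {
            if (table[i].src == src && table[i].seq == seq &&
                (uint32_t)(now_ms - table[i].when) < DUP_WINDOW_MS)
                return true;
        }
        table[next_slot].src  = src;    /* new: overwrite oldest slot */
        table[next_slot].seq  = seq;
        table[next_slot].when = now_ms;
        next_slot = (next_slot + 1) % RECENT_MAX;
        return false;
    }

(The all-zeros initial table could false-match a genuine (0, 0) entry
early on; a real implementation would add a valid flag.)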

>
> If you unilaterally drop any message found to be corrupted, then
> you have no way of knowing if it was received by its intended
> recipient (since you don't know who it's recipient is -- or was).
> If you await acknowledgment (and retry until you receive it),
> then you run the risk of a message being processed more than
> once.   etc.

Every message a node transmits has an incrementing "ack" byte that the
next node sends back in its next message. If the ack byte doesn't
come back correctly the message is sent again. If the ack is lost and
a retry is sent, the receiver throws the message away because he's
already seen it.
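
Per link, the bookkeeping is roughly this (names invented):

    #include <stdint.h>
    #include <stdbool.h>

    /* Per-neighbour state for the ack-byte handshake described above. */
    struct link_state {
        uint8_t tx_ack;     /* ack byte stamped on our last transmit   */
        bool    awaiting;   /* true until the neighbour echoes it back */
    };

    void stamp_outgoing(struct link_state *ls, uint8_t *ack_field)
    {
        *ack_field   = ++ls->tx_ack;   /* incrementing ack byte */
        ls->awaiting = true;
    }

    /* Called with the echo carried in the neighbour's next message.
     * Returns true if the last transmit is confirmed delivered; false
     * means resend (and the receiver discards the duplicate because it
     * has already seen that address/seq#). */
    bool echo_confirms(struct link_state *ls, uint8_t echoed)
    {
        if (ls->awaiting && echoed == ls->tx_ack) {
            ls->awaiting = false;
            return true;
        }
        return false;
    }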

Shane williams

Mar 30, 2011, 6:56:24 PM
On Mar 31, 6:49 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
>
> > I'm trying to improve the propagation delay of messages around the
>
> What sort of times are you seeing, presently?  At which baudrates?
> How much *better* do they need to be (or, would you *like* them
> to be)?

We've limited the ring to 32 nodes up till now at 57600 baud. The
nominal target now is 64 nodes for which I've calculated a worst case
request response time of approx 1.8 seconds with no retries. I would
like it to be about one second. 64 nodes max is a bit arbitrary so
what I'm really trying to do is get the best performance that's
reasonably achievable. Some sites have well over 64 devices but not
all on the same ring.

>
> > ring without requiring the customer to fit twisted pair cable
> > everywhere and I'm also trying to improve the error monitoring so we
> > can signal when a connection isn't performing well enough without
> > creating nuisance faults, hence my interest in the error detection
> > capability of crc16-ccitt.
>
> > We actually already do have an RS485 multi-drop version of this
> > protocol but it's non-deterministic and doesn't work very well.  I
> > don't really want to go into that...
>
> It's relatively easy to get deterministic behavior from a 485
> deployment.  And, depending on the *actual* operating conditions
> of the current ring implementation, could probably achieve
> lower latencies at lower baudrates.

Some of the devices that connect to the multi-drop network are old and
low-powered and a token ring was too much overhead at the time. Also
we require redundancy which needs 4 wires for multi-drop but only 2
wires for the ring.

Shane williams

Mar 30, 2011, 8:56:51 PM
On Mar 31, 6:14 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/30/2011 5:12 AM, Shane williams wrote:
>
> > On Mar 30, 4:09 am, D Yuniskis<not.going.to...@seen.com>  wrote:
> >> Is this a synchronous protocol?  Or, are you just using a pair
> >> of UARTs on each device to implement the CW & CCW links?
>
> > Asynchronous with a pair of uarts, one for clockwise, one for counter-
> > clockwise.
>
> OK.  Been there, done that, T-shirt to prove it...

Hi, I'm out of time today. I'll get back to this tomorrow.
Thanks.

D Yuniskis

Mar 31, 2011, 2:14:58 AM
Hi Shane,

On 3/30/2011 3:56 PM, Shane williams wrote:

[8<]

>>> I'm trying to improve the propagation delay of messages around the
>>
>> What sort of times are you seeing, presently? At which baudrates?
>> How much *better* do they need to be (or, would you *like* them
>> to be)?
>
> We've limited the ring to 32 nodes up till now at 57600 baud. The
> nominal target now is 64 nodes for which I've calculated a worst case
> request response time of approx 1.8 seconds with no retries. I would
> like it to be about one second.

OK, back-of-napkin guesstimates...

Assume 10b character frames transmitted "flat out" (no time between
end of stop bit and beginning of next start bit). So, 5760 characters
per second is the data rate.

If we assume N is size of packet (in "characters"), then RTT is
64 * [(N / 5760) + P] where P is time spent processing message
packet on each node before passing it along.

1.8s/64 = [(N / 5760) + P] = ~30ms

Guessing at a message size of ~100 bytes (characters) suggests the
"transmission time" component of this is ~17ms -- leaving 13ms as
a guess at the processing time, P.

If you cut this to ~0, then you achieve your 1 sec goal (almost
exactly).

This suggests that eliminating/simplifying any error detection
so that incoming messages can *easily* be propagated is a goal
to pursue. If you can improve the reliability of the comm link
so that errors are the *exception* (i.e., unexpected), then
you can simplify the effort required to "handle" those errors.

Furthermore, if errors *are* The Exception, then you can consider
running the interface(s) in full duplex mode and starting to pass
a packet along to your successor *before* it is completely
received. This effectively reduces the size of the message (N)
in the above calculation.

E.g., if you can hold just *10* bytes of the message before
deciding to pass it along, then the "transmission time"
component drops to 1.7ms. Your RTT is then 0.1 sec!

Alternatively, you can spend a few ms processing in each node
and still beat your 1 sec goal -- *or*, drop the data rate by
a factor of 5 or 6 and still hit the 1 sec goal!

[remember, this is back-of-napkin calculation so I don't claim
it accurately reflects *your* operating environment. rather, it
puts some options in perspective...]
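
For what it's worth, the same napkin arithmetic as a tiny compilable
sketch -- every constant is one of the guesses above, not a
measurement:

    #include <stdio.h>

    int main(void)
    {
        const double cps   = 57600.0 / 10.0; /* chars/sec (10 bits each) */
        const int    nodes = 64;
        const double msg   = 100.0;          /* guessed packet size      */
        const double proc  = 0.013;          /* guessed per-node proc, s */

        double per_hop = msg / cps + proc;
        printf("store-and-forward: %.1f ms/hop, RTT %.2f s\n",
               per_hop * 1e3, per_hop * nodes);

        /* cut-through: hold only ~10 bytes per node, processing ~0 */
        double cut = 10.0 / cps;
        printf("cut-through:       %.1f ms/hop, RTT %.2f s\n",
               cut * 1e3, cut * nodes);
        return 0;
    }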

> 64 nodes max is a bit arbitrary so
> what I'm really trying to do is get the best performance that's
> reasonably achievable. Some sites have well over 64 devices but not
> all on the same ring.
>
>>> ring without requiring the customer to fit twisted pair cable
>>> everywhere and I'm also trying to improve the error monitoring so we
>>> can signal when a connection isn't performing well enough without
>>> creating nuisance faults, hence my interest in the error detection
>>> capability of crc16-ccitt.
>>
>>> We actually already do have an RS485 multi-drop version of this
>>> protocol but it's non-deterministic and doesn't work very well. I
>>> don't really want to go into that...
>>
>> It's relatively easy to get deterministic behavior from a 485
>> deployment. And, depending on the *actual* operating conditions
>> of the current ring implementation, could probably achieve
>> lower latencies at lower baudrates.
>
> Some of the devices that connect to the multi-drop network are old and
> low-powered and a token ring was too much overhead at the time. Also
> we require redundancy which needs 4 wires for multi-drop but only 2
> wires for the ring.

How do you mean "4 wires" vs. "2 wires"? You can run a 485 network
with a single differential pair. A 232-ish approach requires a Tx
and Rx conductor (for each ring). So, you could implement two
485 busses for the same conductor count as your dual UART rings.

D Yuniskis

Mar 31, 2011, 4:04:41 AM
Hi Shane,

On 3/30/2011 3:39 PM, Shane williams wrote:

>>>> *And exactly where the frame is removed from the ring is a design
>>>> question. Often the sender removes frames it had sent, when they make
>>>> their way back around, in which case the critical item for removal is
>>>> the source address (and usually in that case the destination node sets
>>>> a "copied" bit in the trailer, thus verifying physical transmission of
>>>> the frame to the destination).
>>
>>> In our case, there are two logical rings and each message placed on
>>> the ring is sent in both directions. When the two messages meet up
>>> approximately half-way round they annihilate each other - but if the
>>> annihilation fails, the sender removes them as well.
>>
>> How does the sender *recognize* them as "his to annihilate"?
>> I.e., if the data can be corrupted, so can the "sender ID"!
>> The problem with a ring is that it has no "end" so things have
>> the potential to go 'round and 'round and 'round and...
>
> The message would have to get corrupted undetected every time around
> the ring to go round forever.

No. Once a frame is corrupted, it is no longer recognizable as
its original *intent*. (see below)

> Each device that puts a message on the ring puts his own address at
> the start plus a one byte incrementing sequence number. Each node

Right. So what happens if the address gets corrupted? Or the
sequence number? Once it is corrupted, each successive node will
pass along the corrupted version AS IF it was a regular message
(best case, you can detect it as "corrupted" and remove it from
the ring -- but you won't know how to decide *which* message was
then "deleted")

> keeps a list of address/ sequence #/ received time of the last X
> messages received. If it's seen the address/ seq# before within a
> certain time, it removes the message from the ring.

But you don't know how any of these things will be "corrupted".
You can only opt to remove a message that you are "suspicious of".
And that implies that your error detection scheme is robust enough
that *all* errors (signs of corruption) are detectable.

If you are setting out with the expectation that you *will* be
operating with a real (non zero) error rate, what can you do
to assure yourself that you are catching *all* errors?

>> If you unilaterally drop any message found to be corrupted, then
>> you have no way of knowing if it was received by its intended
>> recipient (since you don't know who it's recipient is -- or was).
>> If you await acknowledgment (and retry until you receive it),
>> then you run the risk of a message being processed more than
>> once. etc.
>
> Every message a node transmits has an incrementing "ack" byte that the
> next node sends back in its next message. If the ack byte doesn't
> come back correctly the message is sent again.

Again, how do you know that the message isn't corrupted to distort
the ACK and "whatever else"? I.e., so that the message is no longer
recognizable as its original form -- yet looks like a valid message
(or *not*).

How do you know that this is not a case of the message arriving
correctly but the ACK being corrupted?

> If the ack is lost and
> a retry is sent, the receiver throws the message away because he's
> already seen it.

All of these things are predicated on the assumption that
errors are rare. So, the chance of a message (forward or ACK)
being corrupted *and* a followup/reply being corrupted is
"highly unlikely".

If you're looking at error rates high enough that you are
trying to detect errors as significant as "19 bytes out of 60",
then how much confidence can you have in *any* of the messages?

Shane williams

Mar 31, 2011, 7:04:10 AM
On Mar 31, 9:04 pm, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/30/2011 3:39 PM, Shane williams wrote:
>
> >>>> *And exactly where the frame is removed from the ring is a design
> >>>> question.  Often the sender removes frames it had sent, when they make
> >>>> their way back around, in which case the critical item for removal is
> >>>> the source address (and usually in that case the destination node sets
> >>>> a "copied" bit in the trailer, thus verifying physical transmission of
> >>>> the frame to the destination).
>
> >>> In our case, there are two logical rings and each message placed on
> >>> the ring is sent in both directions.  When the two messages meet up
> >>> approximately half-way round they annihilate each other - but if the
> >>> annihilation fails, the sender removes them as well.
>
> >> How does the sender *recognize* them as "his to annihilate"?
> >> I.e., if the data can be corrupted, so can the "sender ID"!
> >> The problem with a ring is that it has no "end" so things have
> >> the potential to go 'round and 'round and 'round and...
>
> > The message would have to get corrupted undetected every time around
> > the ring to go round forever.
>
> No.  Once a frame is corrupted, it is no longer recognizable as
> its original *intent*.  (see below)

I'm not following.  If a message is corrupted but still looks like a
valid message, then the corrupted message will still get removed from
the ring after it's been around once, because each node will find it
in its list of "recent" messages.


>
> > Each device that puts a message on the ring puts his own address at
> > the start plus a one byte incrementing sequence number.  Each node
>
> Right.  So what happens if the address gets corrupted?  Or the
> sequence number?  Once it is corrupted, each successive node will
> pass along the corrupted version AS IF it was a regular message
> (best case, you can detect it as "corrupted" and remove it from
> the ring -- but you won't know how to decide *which* message was
> then "deleted")

Each message placed on the ring is duplicated so that one copy goes CW
and one goes CCW until they meet up. Both copies would have to be
lost/ damaged for the message not to get right around. Critical data
is actually refreshed in this system.


>
> > keeps a list of address/ sequence #/ received time of the last X
> > messages received.  If it's seen the address/ seq# before within a
> > certain time, it removes the message from the ring.
>
> But you don't know how any of these things will be "corrupted".
> You can only opt to remove a message that you are "suspicious of".
> And that implies that your error detection scheme is robust enough
> that *all* errors (signs of corruption) are detectable.
>
> If you are setting out with the expectation that you *will* be
> operating with a real (non zero) error rate, what can you do
> to assure yourself that you are catching *all* errors?
>
> >> If you unilaterally drop any message found to be corrupted, then
> >> you have no way of knowing if it was received by its intended
> >> recipient (since you don't know who it's recipient is -- or was).
> >> If you await acknowledgment (and retry until you receive it),
> >> then you run the risk of a message being processed more than
> >> once.   etc.
>
> > Every message a node transmits has an incrementing "ack" byte that the
> > next node sends back in its next message.  If the ack byte doesn't
> > come back correctly the message is sent again.
>
> Again, how do you know that the message isn't corrupted to distort
> the ACK and "whatever else"?  I.e., so that the message is no longer
> recognizable as it's original form -- yet looks like a valid message
> (or *not*).

Do you mean that the message gets ack'd but it was actually damaged?
The other copy of the message would have to get damaged too.

>
> How do you know that this is not a case of the message arriving
> correctly but the ACK being corrupted?
>
> > If the ack is lost and
> > a retry is sent, the receiver throws the message away because he's
> > already seen it.
>
> All of these things are predicated on the assumption that
> errors are rare.  So, the chance of a message (forward or ACK)
> being corrupted *and* a followup/reply being corrupted is
> "highly unlikely".
>
> If you're looking at error rates high enough that you are
> trying to detect errors as significant as "19 bytes out of 60",
> then how much confidence can you have in *any* of the messages?

I'm not sure where I said 19 out of 60 - it was something to do with
the Reed Solomon thing, I think.  I think I was trying to find out how
many bytes of overhead there would be for significantly better error
detection; however, the CPU overhead is too great.

A fellow engineer had speculated that we could live with 10% errors at
the faster baud rate.  My recommendation is now going to be that we
have to see no increase in errors (i.e. more or less no errors) to
stay at the faster rate - if we do this at all.  It's been suggested
that the nature of the errors due to running too fast on untwisted,
unshielded cable will make it immediately obvious that we're going
too fast.  We'll also add some 0x55 and 0xAA bytes to the "idle"
packets exchanged between nodes to help detect errors.  I guess if
real noise occurs and we're running at the faster rate, we won't know
whether to drop back to the slower rate or not.
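
On the receive side that check is trivial - something like this, with
the pattern length left arbitrary:

    /* Check the proposed 0x55/0xAA test bytes in an "idle" packet;
     * alternating 0x55/0xAA exercises adjacent-bit transitions. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    bool idle_pattern_ok(const uint8_t *p, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            uint8_t expect = (i & 1) ? 0xAA : 0x55;
            if (p[i] != expect)
                return false;   /* count against this link's quality */
        }
        return true;
    }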

Shane williams

Mar 31, 2011, 7:49:21 AM
On Mar 31, 7:14 pm, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/30/2011 3:56 PM, Shane williams wrote:
>
> [8<]
>
> >>> I'm trying to improve the propagation delay of messages around the
>
> >> What sort of times are you seeing, presently?  At which baudrates?
> >> How much *better* do they need to be (or, would you *like* them
> >> to be)?
>
> > We've limited the ring to 32 nodes up till now at 57600 baud.  The
> > nominal target now is 64 nodes for which I've calculated a worst case
> > request response time of approx 1.8 seconds with no retries.  I would
> > like it to be about one second.
>
> OK, back-of-napkin guesstimates...
>
> Assume 10b character frames transmitted "flat out" (no time between
> end of stop bit and beginning of next start bit).  So, 5760 characters
> per second is data rate.


Yes.

>
> If we assume N is size of packet (in "characters"), then RTT is
> 64 * [(N / 5760) + P] where P is time spent processing message
> packet on each node before passing it along.
>
> 1.8s/64 = [(N / 5760) + P] = ~30ms
>
> Guessing at a message size of ~100 bytes (characters) suggests the
> "transmission time" component of this is ~17ms -- leaving 13ms as
> a guess at the processing time, P.

No. The request/ response we need to speed up has perhaps 30 bytes
out-going and 200 bytes returned.
I estimated 10 milliseconds per node for the out-going message and 40
ms for the return message. 32 times 10 plus 32 times 40 is approx 1.6
seconds plus some extra processing time at each end. My estimate is
probably a little bit low.


>
> If you cut this to ~0, then you achieve your 1 sec goal (almost
> exactly).
>
> This suggests that eliminating/simplifying any error detection
> so that incoming messages can *easily* be propagated is a goal
> to pursue.  If you can improve the reliability of the comm link
> so that errors are the *exception* (i.e., unexpected), then
> you can simplify the effort required to "handle" those errors.
>
> Furthermore, if errors *are* The Exception, then you can consider
> running the interface(s) in full duplex mode and starting to pass
> a packet along to your successor *before* it is completely
> received.  This effectively reduces the size of the message (N)
> in the above calculation.

Yep, we looked at full duplex but we need to allow 2 wire
connections. We're also considering splitting the longer messages
into shorter ones.

>
> E.g., if you can hold just *10* bytes of the message before
> deciding to pass it along, then the "transmission time"
> component drops to 1.7ms.  Your RTT is then 0.1 sec!
>
> Alternatively, you can spend a few ms processing in each node
> and still beat your 1 sec goal -- *or*, drop the data rate by
> a factor of 5 or 6 and still hit the 1 sec goal!
>
> [remember, this is back-of-napkin calculation so I don't claim
> it accurately reflects *your* operating environment.  rather, it
> puts some options in perspective...]
>
>

[snip]

>
> >>> We actually already do have an RS485 multi-drop version of this
> >>> protocol but it's non-deterministic and doesn't work very well.  I
> >>> don't really want to go into that...
>
> >> It's relatively easy to get deterministic behavior from a 485
> >> deployment.  And, depending on the *actual* operating conditions
> >> of the current ring implementation, could probably achieve
> >> lower latencies at lower baudrates.
>
> > Some of the devices that connect to the multi-drop network are old and
> > low-powered and a token ring was too much overhead at the time.  Also
> > we require redundancy which needs 4 wires for multi-drop but only 2
> > wires for the ring.
>
> How do you mean "4 wires" vs. "2 wires"?  You can run a 485 network
> with a single differential pair.  A 232-ish approach requires a Tx
> and Rx conductor (for each ring).  So, you could implement two
> 485 busses for the same conductor count as your dual UART rings.

With the ring, the system still operates when there is a break or
short somewhere. With multidrop and 2 wires, a short takes down the
whole bus so we have 4 wires between devices with each pair supposed
to be routed on a different path. The signal is duplicated on both
pairs of wire, except for when we're checking the integrity of each
pair individually.

Shane williams

Mar 31, 2011, 8:45:59 AM
On Mar 31, 6:14 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/30/2011 5:12 AM, Shane williams wrote:
>
>
> >> E.g., a start bit error is considerably different than a
> >> *data* bit error (think about it).
>
> > Hmm.  I forgot about that.  A start or stop bit error means the whole
> > message is rejected which is good.
>
> My point was that if you *miss* a start bit, then you have -- at
> the very least -- missed the "first" bit of the message (because,
> if it was MARKING, the UART just ignored it and, if it was SPACING,
> the UART thought *it* was the start bit).  If you are pushing
> bytes (characters) down the wire at the maximum data rate (minimal
> stop time between characters), then you run the risk of part of
> the *next* character being "shifted" into this "misaligned" first
> character.  I.e., it gets really difficult to figure out *if*
> your code will be able to detect an error (because the received
> byte "looks wrong") or if, BY CHANCE, the bit patterns can conspire
> to look like a valid "something else".

Hmm, ok, if the byte count goes wrong as well I guess it could - I
didn't think of that.  The ddcmp protocol actually has a 10-byte
header (can't remember if I mentioned this) with a separate crc for
the header.  The count byte for the data is in the header.  I suspect
the chance of it morphing into something valid would be pretty low in
our case - e.g. one particular byte must always have the value 0x01.
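
The check before trusting the count field would look something like
this.  The field positions and CRC byte order are illustrative only,
not the actual ddcmp layout:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    uint16_t crc16_ccitt(const uint8_t *data, size_t len); /* as earlier */

    #define HDR_LEN 10   /* 8 header bytes + 2-byte header CRC */

    /* Validate the header (fixed 0x01 byte and separate header CRC)
     * before believing the embedded count. */
    bool header_ok(const uint8_t hdr[HDR_LEN])
    {
        if (hdr[0] != 0x01)     /* the byte that must always be 0x01 */
            return false;
        return crc16_ccitt(hdr, 8) ==
               (uint16_t)((hdr[8] << 8) | hdr[9]); /* byte order assumed */
    }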

So the detection of 1, 2 and 3 bit errors by crc16-ccitt doesn't allow
for start bit and stop bit errors?  I never thought of that either.


>
> >>> However we may end up with 3 ports per node making it a collection of
> >>> rings or a mesh.  The loading at the slowest baud rate is approx 10%
>
> >> [scratches head] then why are you worrying about running at
> >> a higher rate?
>
> > Because not all sites can wire as a mesh.  The third port is optional
> > but helps the propagation delay a lot.
>
> Sorry, the subject wasn't clear in my question  <:-(
> I mean, if you were to stick with the slowest rate, your
> "10%" number *suggests* you have lots of margin -- why
> push for a higher rate with the potential for more
> problems?


To get faster request/ response - a shorter propagation delay.


>
> >> Latency might be a reason -- assuming you
> >> don't circulate messages effectively as they pass *through*
> >> a node.  But, recall that you only have to pass through
> >> 32 nodes, worst case, to get *a* copy of a message to any
> >> other node...
>
> >>> for 64 nodes.  If we decide to allow mixed baud rates, each node will
> >>> have the ability to tell its adjacent nodes to slow down when its
> >>> message queue gets to a certain level, allowing it to cope with a
> >>> brief surge in messages.
>
> >> Depending on how you chose to allocate the Tx&Rx devices in each
> >> link -- and, whether or not your baudrate generator allows
> >> the Tx and Rx to run at different baudrates -- you have to:
> >> * make sure your Tx FIFO (hardware and software) is empty before
> >>     changing Tx baudrate
> >> * make sure your "neighbor" isn't sending data to you when you
> >>     change your Rx baudrate (!)
>
> > This is assured.  It's half duplex and the hardware sends a whole
> > message at a time.
>
> So, for each ring, you WON'T receive a message until you have
> transmitted any previous message?  Alternatively, you won't
> transmit a message until your receiver is finished?

This is true for each uart.


>
> What prevents two messages from being "in a ring" at the same
> time (by accident)?  I.e., without violating the above, it
> seems possible that node 18 can be sending to node 19 (while
> 19 is NOT sending to 20 and 17 is not sending to 18) at the
> same time that node 3 is sending to node 4 (while neither 2
> nor 4 are actively transmitting).

I don't follow this. It's not a bus. 18 and 19 can talk to each
other and no-one else hears.


>
> Since this *seems* possible, how can you be sure one message
> doesn't get delayed slightly so that the second message ends
> up catching up to it?  (i.e., node 23 has no way of knowing
> that node 24 is transmitting to 25 so 23 *could* start sending
> a message to 24 that 24 fails to notice -- in whole or in
> part -- because 24 is preoccupied with its outbound message)


I have a feeling there's a misunderstanding here - not sure what
though.


Yes, they do.

>
> So, assume 7 originates a message -- destined for 3.  In the CW ring,
> it is routed as 7, 8, 9, 10, 1, 2, 3.  In the CCW ring, it is routed
> (simultaneously) as 7, 6, 5, 4, 3.
>
> *If* it progresses node to node at the exact same rates in each
> ring (this isn't guaranteed but "close enough for gummit work"),
> then it arrives in 8 & 6 at the same time, 9 & 5, 10 & 4, 1 & 3,
> 2 & 2 (though different "rings"), 3 & 1, etc. (note I have assumed,
> here, that it continues around until reaching it's originator...
> but, that's not important).

ok - it actually dies at around the 2&2, 3&1 stage


>
> Now, at node 9, if the CW ring decides that the baudrate needs to be
> changed and it thinks "now is a good time to do so" (because it has
> *just* passed it's CW message on to node 10), that action effectively
> interrupts any traffic in the CW ring (until the other nodes make
> the similar baudrate adjustment in the CW direction).


No, the baud rate between any two nodes is independent of any other
two nodes. I'm missing something here.


We do the second case, except sometimes they mis-wire it so that uart
2 on B connects to uart 2 on C when it should connect to uart 1 on C -
but this doesn't matter (much) currently.


>
> [assume tx and rx baudrates are driven by the same baudrate generator
> so there is a BRG1 and BRG2 in each node]

Yes, there is.


>
> In the first case, when you change the baudrate of a UART at some
> particular node, the baudrate for that segment in *the* ring that
> the UART services (left-to-right ring vs right-to-left ring) changes.
> So, you must change *after* you have finished transmitting and you
> will no longer be able to receive until the node upstream from you
> also changes baudrate.
>
> In the second case, when you change the baudrate of a UART at some
> particular node, the baudrate for all communications with that
> particular neighbor (to the left or to the right) changes.  So,
> *both* rings are "broken" until that neighbor makes the comparable
> change.

Yes.


>
> Look at each scenario and its consequences while messages are
> circulating (in both rings!).  Changing data rates can be a very
> disruptive thing as it forces the ring(s) to be emptied; some
> minimum guaranteed quiescent period to provide a safety factor
> (that no messages are still in transit); the actual change
> to be effected; a quiescent period to ensure all nodes are
> at the new speed; *then* you can start up again.


Why do all nodes have to be at the same speed?

Shane williams

Apr 1, 2011, 8:53:50 AM
On Mar 31, 5:26 am, ChrisQ <m...@devnull.com> wrote:
> Shane williams wrote:

> > I don't know much about compression but it sounds too CPU intensive
> > for the 68302?  What micro are you using?
>
> The project used the Renesas 32C87 series from Hitachi. Not such an
> elegant arch as 68k, but almost certainly faster than a '302. Depending
> on the data, simple compression like huffman encoding can work quite
> well, but another way might be to simplify / reorganise the frame format
> or data within it, so you can send fewer bytes...
>

Off the top of your head, do you have any idea what the execution time
to do huffman compression of 200 bytes of text would be on the 32C87?

ChrisQ

unread,
Apr 1, 2011, 11:40:05 AM4/1/11
to
Shane williams wrote:

>
> Off the top of your head, do you have any idea what the execution time
> to do huffman compression of 200 bytes of text would be on the 32C87?
>

We did profile the code at one stage, using a scope on a port line
that was triggered on entry and exit from the function, but we don't
have the results to hand - only that it was fast enough...

Regards,

Chris
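
For a rough sense of why the encode side can be "fast enough" even on
a small micro: with a prebuilt (static) code table, Huffman encoding
is one table lookup plus a few shifts per input byte. Below is a
minimal C sketch -- the table here is just a placeholder so the code
runs, and all names and sizes are illustrative assumptions, not
anything from the project; a real table would be built from the
statistics of the actual message traffic.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint16_t bits;  /* code bits, right-aligned */
        uint8_t  len;   /* code length in bits (1..16) */
    } hcode_t;

    /* Placeholder table: identity 8-bit codes, so the sketch compiles
       and runs; a real table (built from symbol frequencies) is what
       would actually yield compression. */
    static hcode_t huff_table[256];

    static void init_placeholder_table(void)
    {
        for (int s = 0; s < 256; s++) {
            huff_table[s].bits = (uint16_t)s;
            huff_table[s].len  = 8;
        }
    }

    /* Encode n input bytes; returns the encoded length in bytes.
       Per input byte: one table lookup, a shift-or, and a flush loop. */
    static size_t huffman_encode(const uint8_t *in, size_t n, uint8_t *out)
    {
        uint32_t acc = 0;   /* bit accumulator */
        int nbits = 0;      /* bits pending in acc */
        size_t o = 0;

        for (size_t i = 0; i < n; i++) {
            const hcode_t *c = &huff_table[in[i]];
            acc = (acc << c->len) | c->bits;   /* append this symbol's code */
            nbits += c->len;
            while (nbits >= 8) {               /* flush whole output bytes */
                nbits -= 8;
                out[o++] = (uint8_t)(acc >> nbits);
            }
        }
        if (nbits > 0)                         /* pad the final partial byte */
            out[o++] = (uint8_t)(acc << (8 - nbits));
        return o;
    }

    int main(void)
    {
        uint8_t msg[200], enc[400];
        memset(msg, 'A', sizeof msg);   /* stand-in for 200 bytes of text */
        init_placeholder_table();
        printf("encoded %zu -> %zu bytes\n",
               sizeof msg, huffman_encode(msg, sizeof msg, enc));
        return 0;
    }

The decode side is the more expensive direction (a bit-by-bit tree or
table walk), which is worth keeping in mind if the low-powered nodes
have to decompress.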

D Yuniskis

unread,
Apr 1, 2011, 2:18:43 PM4/1/11
to
Hi Shane,

On 3/31/2011 4:49 AM, Shane williams wrote:
> On Mar 31, 7:14 pm, D Yuniskis<not.going.to...@seen.com> wrote:

>> If we assume N is size of packet (in "characters"), then RTT is
>> 64 * [(N / 5760) + P] where P is time spent processing message
>> packet on each node before passing it along.
>>
>> 1.8s/64 = [(N / 5760) + P] = ~30ms
>>
>> Guessing at a message size of ~100 bytes (characters) suggests the
>> "transmission time" component of this is ~17ms -- leaving 13ms as
>> a guess at the processing time, P.
>
> No. The request/ response we need to speed up has perhaps 30 bytes
> out-going and 200 bytes returned.
> I estimated 10 milliseconds per node for the out-going message and 40
> ms for the return message. 32 times 10 plus 32 times 40 is approx 1.6
> seconds plus some extra processing time at each end. My estimate is
> probably a little bit low.

As message size goes up (transmission time increases), you have all the
more incentive to pass the message along before it is completely
received. If, e.g., you can reduce the effective "hold-over" time
at each node to 10 bytes (from 200), then you can trim 30ms from that
40 you have estimated (190 bytes at 5760 bytes/sec). Since this
savings happens at each node, your RTT drops by almost a second
(30 ms * 32 nodes).
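
As a back-of-the-envelope check on those figures, a minimal C sketch
(the constants are the ones used in this thread -- 5760 bytes/sec,
32 hops, a 200-byte message, a 10-byte hold-over -- and per-node
processing time P is ignored):

    #include <stdio.h>

    int main(void)
    {
        const double byte_time = 1.0 / 5760.0; /* sec/byte: 57600 baud, 10 bits/byte */
        const int hops = 32;
        const int msg_bytes = 200;             /* size of the (larger) reply */
        const int holdover  = 10;              /* bytes buffered before forwarding */

        /* store-and-forward: every hop waits for the whole message */
        double store_fwd = hops * (msg_bytes * byte_time);

        /* cut-through: every hop waits only for the hold-over, then the
           tail of the message streams through behind the header */
        double cut_thru = hops * (holdover * byte_time)
                        + (msg_bytes - holdover) * byte_time;

        printf("store-and-forward: %.2f s\n", store_fwd); /* ~1.11 s */
        printf("cut-through:       %.2f s\n", cut_thru);  /* ~0.09 s */
        return 0;
    }

This lands on the same ~0.1 sec figure quoted below for a 10-byte
hold-over.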

>> If you cut this to ~0, then you achieve your 1 sec goal (almost
>> exactly).
>>
>> This suggests that eliminating/simplifying any error detection
>> so that incoming messages can *easily* be propagated is a goal
>> to pursue. If you can improve the reliability of the comm link
>> so that errors are the *exception* (i.e., unexpected), then
>> you can simplify the effort required to "handle" those errors.
>>
>> Furthermore, if errors *are* The Exception, then you can consider
>> running the interface(s) in full duplex mode and starting to pass
>> a packet along to your successor *before* it is completely
>> received. This effectively reduces the size of the message (N)
>> in the above calculation.
>
> Yep, we looked at full duplex but we need to allow 2 wire
> connections. We're also considering splitting the longer messages
> into shorter ones.

I'm confused. Why do you think "full duplex" and "2 wire" are
contradictions?

I am not telling you to add or change any wiring/hardware.
Each node has two inputs (one for the CW ring and another for
the CCW ring) and two outputs (ditto).

What I am saying is that you start propagating an incoming packet
*before* it is completely received!

So, instead of (effectively):

    count = 0;
    do {
        buffer[count++] = get_byte();
    } while (count < MESSAGE_SIZE);

    // have now gobbled up entire incoming message!

    if (message_is_for_me(buffer)) {
        process_message(buffer, MESSAGE_SIZE);
    } else {
        // not for me so pass the whole message on...
        transmit_message(buffer, MESSAGE_SIZE);
    }

do something like:

    count = 0;
    do {
        buffer[count++] = get_byte();
    } while (count < HEADER_SIZE);

    // now have JUST the header/routing portion of the message

    if (message_is_for_me(buffer)) {
        // gather up the rest of this message as it belongs to me!
        do {
            buffer[count++] = get_byte();
        } while (count < MESSAGE_SIZE);
        // have now gobbled up entire incoming message so deal with it!
        process_message(buffer, MESSAGE_SIZE);
    } else {
        // not intended for me so pass what I have, so far, along
        transmit_message(buffer, HEADER_SIZE);
        do {
            // and, pass along each subsequent byte as it is received
            transmit(get_byte());
        } while (++count < MESSAGE_SIZE);
    }

[this is simplified just to illustrate what should be happening]

You are running Tx and Rx at the same time but not really
in the traditional application of "full duplex". You overlap
transmission (propagation) of the message with its reception.

>> E.g., if you can hold just *10* bytes of the message before
>> deciding to pass it along, then the "transmission time"
>> component drops to 1.7ms. Your RTT is then 0.1 sec!

>>>>> We actually already do have an RS485 multi-drop version of this


>>>>> protocol but it's non-deterministic and doesn't work very well. I
>>>>> don't really want to go into that...
>>
>>>> It's relatively easy to get deterministic behavior from a 485
>>>> deployment. And, depending on the *actual* operating conditions
>>>> of the current ring implementation, could probably achieve
>>>> lower latencies at lower baudrates.
>>
>>> Some of the devices that connect to the multi-drop network are old and
>>> low-powered and a token ring was too much overhead at the time. Also
>>> we require redundancy which needs 4 wires for multi-drop but only 2
>>> wires for the ring.
>>
>> How do you mean "4 wires" vs. "2 wires"? You can run a 485 network
>> with a single differential pair. A 232-ish approach requires a Tx
>> and Rx conductor (for each ring). So, you could implement two
>> 485 busses for the same conductor count as your dual UART rings.
>
> With the ring, the system still operates when there is a break or
> short somewhere.

With a SINGLE RING, you can't make that claim -- since there
is no way to get messages "across" the break. I.e., if there
is a break between nodes 3 & 4, then 1 can talk to 2 and 2 can
talk to 3 -- but 3 can't talk to 4 *and* 2 can't REPLY to 1
nor can 3 reply to 2 (or 1), etc.

You only get continued operation if *both* rings are "wired" and
a break is confined to a single ring (you can support some breaks
in *both* rings if you allow messages to transit from one ring to
the other -- but this gets hokey)

> With multidrop and 2 wires, a short takes down the whole bus

It takes out that *one* bus. But, you have a second -- using
a second pair of conductors (same number of wires that your
dual ring requires!)

> so we have 4 wires between devices with each pair supposed
> to be routed on a different path.

You can run the second multidrop bus "on a different path"
just as well as you can run the CCW ring's cabling on that
same "different path". I don't see why you think multidrop
is more vulnerable or takes more wires/hardware?

> The signal is duplicated on both
> pairs of wire, except for when we're checking the integrity of each
> pair individually.

Huh???

Shane williams

unread,
Apr 1, 2011, 6:26:36 PM4/1/11
to
On Apr 2, 7:18 am, D Yuniskis <not.going.to...@seen.com> wrote:
> Hi Shane,
>
> On 3/31/2011 4:49 AM, Shane williams wrote:
>
> >> Furthermore, if errors *are* The Exception, then you can consider
> >> running the interface(s) in full duplex mode and starting to pass
> >> a packet along to your successor *before* it is completely
> >> received.  This effectively reduces the size of the message (N)
> >> in the above calculation.
>
> > Yep, we looked at full duplex but we need to allow 2 wire
> > connections.  We're also considering splitting the longer messages
> > into shorter ones.
>
> I'm confused.  Why do you think "full duplex" and "2 wire" are
> contradictions?


I'm confused too. Perhaps I haven't explained well enough.

We have two "logical rings" and one physical ring. So with your
diagram of nodes A,B,C, there are two wires going from B to C and two
completely separate wires going from B to A. Each device receives its
own transmission. Between B and C, only one device can transmit at a
time i.e. half duplex. B's transmissions to C are for the CCW ring
and C's transmissions to B are for the CW ring.


>
> I am not telling you to add or change any wiring/hardware.
> Each node has two inputs (one for the CW ring and another for
> the CCW ring) and two outputs (ditto).
>
> What I am saying is that you start propagating an incoming packet
> *before* it is completely received!
>

[snip]

>
> You are running Tx and Rx at the same time but not really
> in the traditional application of "full duplex".  You overlap
> transmission (propagation) of the message with its reception.
>

The hardware transmits and receives a whole message at a time for us,
which saves a lot of interrupt overhead, so we can't start transmitting
early with the current hardware.


> >>> Some of the devices that connect to the multi-drop network are old and
> >>> low-powered and a token ring was too much overhead at the time.  Also
> >>> we require redundancy which needs 4 wires for multi-drop but only 2
> >>> wires for the ring.
>
> >> How do you mean "4 wires" vs. "2 wires"?  You can run a 485 network
> >> with a single differential pair.  A 232-ish approach requires a Tx
> >> and Rx conductor (for each ring).  So, you could implement two
> >> 485 busses for the same conductor count as your dual UART rings.
>
> > With the ring, the system still operates when there is a break or
> > short somewhere.
>
> With a SINGLE RING, you can't make that claim -- since there
> is no way to get messages "across" the break.  I.e., if there
> is a break between nodes 3 & 4, then 1 can talk to 2 and 2 can
> talk to 3 -- but 3 can't talk to 4 *and* 2 can't REPLY to 1
> nor can 3 reply to 2 (or 1), etc.

Messages go in both directions around the ring even though it's only 2
wire. If there's a break between 3 and 4, messages still get from 3
to 4 because they go round the other way as well. Every message
placed on the ring goes in both the CW and CCW directions, except when
two nodes are having a conversation with each other.


>
> You only get continued operation if *both* rings are "wired" and
> a break is confined to a single ring (you can support some breaks
> in *both* rings if you allow messages to transit from one ring to
> the other -- but this gets hokey)
>
> > With multidrop and 2 wires, a short takes down the whole bus
>
> It takes out that *one* bus.  But, you have a second -- using
> a second pair of conductors (same number of wires that your
> dual ring requires!)

No, our "dual" ring only requires 2 wires.

>
> > so we have 4 wires between devices with each pair supposed
> > to be routed on a different path.
>
> You can run the second multidrop bus "on a different path"
> just as well as you can run the CCW ring's cabling on that
> same "different path".  I don't see why you think multidrop
> is more vulnerable or takes more wires/hardware?
>
> >  The signal is duplicated on both
> > pairs of wire, except for when we're checking the integrity of each
> > pair individually.
>
> Huh???

Sorry, we have a special board that we drive from a single uart. It
duplicates the tx onto each pair of wires, and for rx it combines the
two signals to give the uart rx a single character stream.

D Yuniskis

unread,
Apr 4, 2011, 5:13:35 PM4/4/11
to
Hi Shane,

On 3/31/2011 5:45 AM, Shane williams wrote:

[much elided]

> Hmm, ok, if the byte count goes wrong as well I guess it could - I

It can change -- or *not*! When you miss a start bit, all
bets are off because your receiver is no longer in sync with
your data. E.g., if you miss a start bit while the value 0xFF
(with no parity) is being transmitted, then the line just looks
COMPLETELY IDLE for one whole character time. OTOH, if you
miss the start bit while 0x55 is being sent, then you could
"receive" any of a number of different values in place
of that 55...
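
To make the slip concrete, here is a small self-contained C sketch.
It serializes two back-to-back frames (start=0, eight data bits
LSB-first, stop=1), drops the true start bit, and lets a naive
receiver resync on the next 0 it sees. The resync rule and the
example value 0x0F are illustrative assumptions (and the sketch
doesn't check the stop bit); with 0x0F the receiver reads back 0xE8:

    #include <stdio.h>

    /* Serialize one frame: start=0, 8 data bits LSB first, stop=1. */
    static int build_frame(unsigned char v, int *bits)
    {
        int n = 0;
        bits[n++] = 0;                    /* start bit */
        for (int i = 0; i < 8; i++)
            bits[n++] = (v >> i) & 1;     /* data bits, LSB first */
        bits[n++] = 1;                    /* stop bit */
        return n;
    }

    int main(void)
    {
        int line[32], n = 0;
        unsigned char v = 0x0F;

        n += build_frame(v, line + n);    /* first frame (start bit missed) */
        n += build_frame(v, line + n);    /* second frame, back-to-back */

        /* The receiver missed the real start bit: scan onward from bit 1
           and treat the next 0 on the line as a start bit. */
        int i = 1;
        while (i < n && line[i] != 0)
            i++;

        /* Assemble the 8 "data" bits that follow the false start bit;
           they straddle the real stop/start boundary. */
        unsigned char rx = 0;
        for (int b = 0; b < 8 && (i + 1 + b) < n; b++)
            rx |= (unsigned char)(line[i + 1 + b] << b);

        printf("sent 0x%02X twice, slipped receiver sees 0x%02X\n", v, rx);
        return 0;
    }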

> didn't think of that. The ddcmp protocol actually has a 10 byte
> header (can't remember if I mentioned this) with a separate crc for
> the header. The count byte for the data is in the header. I suspect
> the chance of it morphing into something valid would be pretty low in
> our case - e.g. one particular byte must always have the value 0x01.

When you are operating in an environment in which errors are
not The Exception, it is hard to make *any* assumptions.

> So the detection of 1,2 and 3 bit errors by crc16-ccitt doesn't allow
> for start bit and stop bit errors? I never thought of that either.

Because the start and stop bits are "out of band" (unless missing
one puts them *in* band -- for another character time!)

>>>>> However we may end up with 3 ports per node making it a collection of
>>>>> rings or a mesh. The loading at the slowest baud rate is approx 10%
>>

>> Sorry, the subject wasn't clear in my question<:-(
>> I mean, if you were to stick with the slowest rate, your
>> "10%" number *suggests* you have lots of margin -- why
>> push for a higher rate with the potential for more
>> problems?
>
> To get faster request/ response - a shorter propagation delay.

But there are other ways to do that. E.g., passing along the
message before it is completely received, etc.

>> So, for each ring, you WON'T receive a message until you have
>> transmitted any previous message? Alternatively, you won't
>> transmit a message until your receiver is finished?
>
> This is true for each uart.

But, is it true for each *ring*? I.e., in A->B->C->D->E->
you have stated that D won't be receiving while it is *sending*
to E. This implies C won't be sending (to D) in this time.
But, that doesn't preclude *B* from sending to C in this time!
I.e., can there be more than one message circulating in each
ring? If so, and the baud rate can be changed, how can you
guarantee that messages don't start "rear ending" the ones
ahead of them? I.e., if D downgrades its baudrate, any
message that B is sending (to C) looks like it is "speeding"...

>> What prevents two messages from being "in a ring" at the same
>> time (by accident)? I.e., without violating the above, it
>> seems possible that node 18 can be sending to node 19 (while
>> 19 is NOT sending to 20 and 17 is not sending to 18) at the
>> same time that node 3 is sending to node 4 (while neither 2
>> nor 4 are actively transmitting).
>
> I don't follow this. It's not a bus. 18 and 19 can talk to each
> other and no-one else hears.

See above.

>> Since this *seems* possible, how can you be sure one message
>> doesn't get delayed slightly so that the second message ends
>> up catching up to it? (i.e., node 23 has no way of knowing
>> that node 24 is transmitting to 25 so 23 *could* start sending
>> a message to 24 that 24 fails to notice -- in whole or in
>> part -- because 24 is preoccupied with its outbound message)
>
> I have a feeling there's a misunderstanding here - not sure what
> though.

See above. This is where the use of coins/tokens on a graph
can be useful -- you can see how the messages can potentially
interact with each other.

>> Number the nodes 1 - 10 (sequentially).
>> The CW node has 1 sending to 2, 2 sending to 3, ... 10 sending to 1.
>> The CW node has 10 sending to 9, 9 sending to 8, ... 1 sending to 10.
>> The nodes operate concurrently.
>
> Yes, they do.
>
>> So, assume 7 originates a message -- destined for 3. In the CW ring,
>> it is routed as 7, 8, 9, 10, 1, 2, 3. In the CCW ring, it is routed
>> (simultaneously) as 7, 6, 5, 4, 3.
>>
>> *If* it progresses node to node at the exact same rates in each
>> ring (this isn't guaranteed but "close enough for gummit work"),

>> then it arrives in 8 & 6 at the same time, 9 & 5, 10 & 4, 1 & 3,
>> 2 & 2 (though different "rings"), 3 & 1, etc. (note I have assumed,


>> here, that it continues around until reaching its originator...
>> but, that's not important).
>
> ok - it actually dies at around the 2&2, 3&1 stage

So there is no way for a sender to know that a recipient got
a message intended for it?

>> Now, at node 9, if the CW ring decides that the baudrate needs to be
>> changed and it thinks "now is a good time to do so" (because it has
>> *just* passed its CW message on to node 10), that action effectively
>> interrupts any traffic in the CW ring (until the other nodes make
>> the similar baudrate adjustment in the CW direction).
>
> No, the baud rate between any two nodes is independent of any other
> two nodes. I'm missing something here.

When the baud rate changes between two particular (adjacent)
nodes, there is effectively a discontinuity introduced.
As you said, "the baud rate between any two nodes is independent
of any other two nodes" so other nodes can be talking at FASTER
(or slower) speeds. The time it takes to pass a message between
any two nodes can then vary. Time is universally shared among
all nodes. If D->E runs at 1200 baud and all other nodes are
running at 57600 baud, then a message from A can get to B and
then to C and then ... in the time it takes D to push a
similarly sized message out to *E*. I.e., C has no way of
knowing if D is ready to *listen* to C, yet, since C has
no way of knowing if D has finished transmitting to E.
C can't rely on the fact that the time required for it
to receive its incoming message (from B) would be sufficient
for D to have passed *its* message along!
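
For scale, a quick C calculation of that mismatch, assuming 10 bits
per character and the example rates above:

    #include <stdio.h>

    int main(void)
    {
        const double slow = 1200.0;         /* baud on the one "downgraded" link */
        const double fast = 57600.0;        /* baud everywhere else */
        const double bits_per_char = 10.0;  /* start + 8 data + stop */

        double t_slow = bits_per_char / slow;  /* sec per character, slow link */
        double t_fast = bits_per_char / fast;  /* sec per character, fast links */

        printf("slow char time: %.2f ms\n", t_slow * 1e3);           /* ~8.33 ms  */
        printf("fast char time: %.3f ms\n", t_fast * 1e3);           /* ~0.174 ms */
        printf("fast chars per slow char: %.0f\n", t_slow / t_fast); /* 48 */
        return 0;
    }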

>>>> Regarding the "other" way to split the Tx&Rx... have the Tx
>>>> always talk to the downstream neighbor and Rx the upstream
>>>> IN THE SAME RING. In this case, changes to Tx+Rx baudrates
>>>> apply only to a certain ring. So, you can change baudrate
>>>> when it is convenient (temporally) for that *ring*.
>>
>>>> But, now the two rings are potentially operating at different
>>>> rates. So, the "other" ring will eventually ALSO have to
>>>> have its baudrate adjusted to match (or, pass different traffic)
>>
>>> I think there must be a misunderstanding somewhere - not sure where.
>>
>> You can wire two UARTs to give you two rings in TWO DIFFERENT WAYS.
>> Look at a segment of the ring with three nodes:
>>
>> ------> 1 AAAA 1 --------> 1 BBBB 1 --------> 1 CCCC 1 ------->
>>           AAAA               BBBB               CCCC
>> <------ 2 AAAA 2 <-------- 2 BBBB 2 <-------- 2 CCCC 2 <-------
>>
>> vs.
>>
>> ------> 1 AAAA 2 --------> 1 BBBB 2 --------> 1 CCCC 2 ------->
>>           AAAA               BBBB               CCCC
>> <------ 1 AAAA 2 <-------- 1 BBBB 2 <-------- 1 CCCC 2 <-------
>>
>> where the numbers identify the UARTs associated with each signal.
>
> We do the second case, except sometimes they mis-wire it so that uart
> 2 on B connects to uart 2 on C when it should connect to uart 1 on C -
> but this doesn't matter (much) currently.

OK, so when you change the baudrate on a UART, you interrupt
traffic in *both* rings between that node and its neighbor.
E.g., when the *one* UART that connects B to C changes baudrate,
then nothing can be flowing from B to C *or* C to B (i.e.,
*both* rings are involved)

>> [assume tx and rx baudrates are driven by the same baudrate generator
>> so there is a BRG1 and BRG2 in each node]
>
> Yes, there is.
>
>> In the first case, when you change the baudrate of a UART at some
>> particular node, the baudrate for that segment in *the* ring that
>> the UART services (left-to-right ring vs right-to-left ring) changes.
>> So, you must change *after* you have finished transmitting and you
>> will no longer be able to receive until the node upstream from you
>> also changes baudrate.
>>
>> In the second case, when you change the baudrate of a UART at some
>> particular node, the baudrate for all communications with that
>> particular neighbor (to the left or to the right) changes. So,
>> *both* rings are "broken" until that neighbor makes the comparable
>> change.
>
> Yes.
>
>> Look at each scenario and its consequences while messages are
>> circulating (in both rings!). Changing data rates can be a very
>> disruptive thing as it forces the ring(s) to be emptied; some
>> minimum guaranteed quiescent period to provide a safety factor
>> (that no messages are still in transit); the actual change
>> to be effected; a quiescent period to ensure all nodes are
>> at the new speed; *then* you can start up again.
>
> Why do all nodes have to be at the same speed?

They don't! But, if they aren't, then it's more difficult to
ensure that messages don't "collide". I.e., if someone
downstream from you starts operating at a slower rate,
then messages that *you* are sending can end up "there"
before it is ready for them.

You *can* make this work. But, there are lots of ways it
can *break*. That was Vladimir's point (elsewhere, up-thread).
Especially if you are *expecting* to be operating (even
temporarily) at the fringe of reliable communication!
