
Dec 6, 2023, 5:31:58 AM

Bit sequences with bounded number of continuous '1' bits are useful for

serialization of packetized data.

For example, if we have a way of converting arbitrary n-bit words to

unique (n+1)-bit extended words with at most L continuous leading '1'

bits, at most T continuous trailing '1' bits and at most M=L+T

continuous inner '1' bits then we can transfer multi-word packets

starting with a header consisting of M+1 '1' bits followed by a single

'0' bit and then by desired number of extended words. Any gaps between

packets are filled with '1' bits.

The receiver can easily synchronize itself to incoming packets and will quickly and robustly recover after media errors.

Such conversions have other applications as well; they always do.

For me, a case of n=16 (16-bit words, 17-bit extended words) is of

particular practical interest.

Brute force counting proves that both for (L=1,T=3) and for (L=2,T=2)

there exist more than 2**16 extended words with desired properties.

For those interested, 76735 and 83665 extended words, respectively.

So, desired 16b-to-17b conversions certainly exist for M=4.
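For the curious, the brute-force count takes only a few lines. The sketch below is mine, not part of the original post; count_codewords enumerates all 2**17 candidates and applies the leading/trailing/inner run limits as stated above:

```c
/* Returns 1 iff the n-bit word w has a leading run of '1's of at most L,
   a trailing run of at most T, and no run of '1's anywhere longer than
   M = L + T.  Bit n-1 is taken as the MS (first transmitted) bit. */
static int run_limits_ok(unsigned w, int n, int L, int T)
{
    int lead = 0, trail = 0, run = 0, maxrun = 0;
    for (int i = n - 1; i >= 0 && ((w >> i) & 1); --i)
        ++lead;                         /* leading '1' run */
    for (int i = 0; i < n && ((w >> i) & 1); ++i)
        ++trail;                        /* trailing '1' run */
    for (int i = 0; i < n; ++i) {       /* longest '1' run anywhere */
        run = ((w >> i) & 1) ? run + 1 : 0;
        if (run > maxrun)
            maxrun = run;
    }
    return lead <= L && trail <= T && maxrun <= L + T;
}

/* Brute-force count of legal n-bit extended words. */
unsigned long count_codewords(int n, int L, int T)
{
    unsigned long cnt = 0;
    for (unsigned w = 0; w < (1u << n); ++w)
        cnt += run_limits_ok(w, n, L, T);
    return cnt;
}
```

With n=17 this should reproduce the 76735 and 83665 figures quoted above.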

The problem is that I know of no simple "algorithmic" way of

implementing such conversion short of using big look-up tables. Neither

in software nor in simple compact hardware.

That is in strong contrast to case of M=5 (both L=1,T=4 and L=2,T=3)

where I know many such algorithms.

Any ideas?

Maybe somebody has seen such methods in patents, preferably old and expired ones?

Or, better, in books or articles?


Dec 6, 2023, 7:16:32 AM

You need to limit the runs of ones throughout the data, not just the start and end. After all, it's not going to help

anyone if the start and end of the word are limited to no more than 2

consecutive ones if you have a run of 10 ones in the middle of the word.

It is also common to want to combine other encoding features at the same

time. Limiting the lengths of both zeros and ones means you have

transitions, which are needed for clock recovery and synchronisation.

For many transmissions, approximately 0 DC balance is highly desirable.

And you may also want forward error correction or at least error detection.

Maybe you have some specialised use in mind?

Otherwise, these links might be of interest to you, on related ideas :

<https://en.wikipedia.org/wiki/64b/66b_encoding>

<https://en.wikipedia.org/wiki/8b/10b_encoding>

<https://en.wikipedia.org/wiki/Bit_stuffing>

Dec 6, 2023, 8:43:28 AM

>

> It is also common to want to combine other encoding features at the

> same time. Limiting the lengths of both zeros and ones means you

> have transitions, which are needed for clock recovery and

> synchronisation. For many transmissions, approximately 0 DC balance

> is highly desirable. And you may also want forward error correction

> or at least error detection.

>

In my case of interest the bits are encoded as NRZI, so mere presence of '1' bits in headers and gaps guarantees that there are at least M

transitions per packet. It is sufficient for clock phase recovery

at receiver (a.k.a Data Phase Alignment). Clock frequency recovery is

not necessary, because clocks at both sides of the link are derived from

the same source.

Anyway, when somewhat higher density of transitions is desirable, say,

one transition (i.e. logical 1) per 25 or 30 bits, it's absolutely

trivial to add at no additional coding cost, since 76735 is sufficiently larger than 65536.

DC balance is not required in my case. When DC balance is required at all, the balance requirements tend to be rather strict; weak balance is rarely of interest. Typically disparity should be bounded to a few percent over rather short intervals of a few dozen bit times, or better. In order to achieve that you need a rather higher coding rate than 17/16, or rather longer coding blocks than 16 bits, or sometimes both. In other words, you have to play a completely different game.

> Maybe you have some specialised use in mind?

>

I do have a specialized use in mind, but for my specialized use, schemes that limit the number of continuous '1' bits to 4 are not necessary. A limit of 5 is perfectly OK and, as I mentioned in the original post, it's easy. In fact, if 5 were not so easy, from a practical perspective I would be satisfied with a significantly higher limit, like 7 or 8.

The question about a bound of 4 is more of theoretical than practical interest. Yesterday it surprised me to find out that such codes exist not just for 16b-to-17b, but for much longer words as well, up to 30b-to-31b. But there is a big difference between knowing that the code exists and finding a way to implement it in an environment with limited resources.
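For larger n, brute-force enumeration gets slow; the existence claim can instead be checked with a small dynamic program. The sketch below is mine (the poster does not describe their counting method): it scans bit positions, tracking the current run of '1's and whether a '0' has been seen yet, so the leading-run cap L applies only before the first '0':

```c
#include <string.h>

/* Count m-bit words with leading '1' run <= L, trailing run <= T,
   and every run <= M = L+T, via DP over bit positions.
   dp[seen0][r] = number of prefixes ending in a run of r ones,
   where seen0 says whether a '0' has occurred yet. */
unsigned long long count_rll(int m, int L, int T)
{
    int M = L + T;
    unsigned long long dp[2][32] = {{0}}, nx[2][32];
    dp[0][0] = 1;                        /* empty prefix */
    for (int pos = 0; pos < m; ++pos) {
        memset(nx, 0, sizeof nx);
        for (int s = 0; s < 2; ++s) {
            for (int r = 0; r <= M; ++r) {
                unsigned long long c = dp[s][r];
                if (!c) continue;
                nx[1][0] += c;           /* append '0' */
                int limit = s ? M : L;   /* leading run capped at L */
                if (r + 1 <= limit)
                    nx[s][r + 1] += c;   /* append '1' */
            }
        }
        memcpy(dp, nx, sizeof dp);
    }
    /* accept: trailing run <= T (all-ones words also need run <= L) */
    unsigned long long total = 0;
    for (int s = 0; s < 2; ++s)
        for (int r = 0; r <= (s ? T : (L < T ? L : T)); ++r)
            total += dp[s][r];
    return total;
}
```

By the thread's numbers, count_rll(31, 2, 2) should still exceed 2**30, while count_rll(32, 2, 2) should fall short of 2**31, matching the 30b-to-31b limit mentioned above.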

> Otherwise, these links might be of interest to you, on related ideas :

>

> <https://en.wikipedia.org/wiki/64b/66b_encoding>

> <https://en.wikipedia.org/wiki/8b/10b_encoding>

> <https://en.wikipedia.org/wiki/Bit_stuffing>

>

>

Don't you know me long enough to guess that I had read that much and more?

Dec 6, 2023, 9:47:41 AM

Dec 6, 2023, 10:51:28 AM

have not understood me. Correct me if I am wrong below.

For L = 2, T = 2 (as an example), your coding would not allow

1100 1110 0101 1011 1

because it ends in 3 ones. (The spaces are just to make it easier to

read.) Your aim is to avoid the join of two packets having more than 4

ones in a row, so that if the receive sees 5 or more ones in a row, it

knows it is an inter-packet idle period.

But you /would/ allow :

1000 1111 1111 1000 1

This has no more than 1 leading or trailing one. But it has 8 ones in a

row. A receiver that is trying to synchronise would see a run of 8 ones

and interpret it as an inter-packet idle gap.

Additionally, how is the receiver supposed to identify the number of

ones at the start of a packet, after a gap? If it sees a sequence of 10

ones, then a zero (and then other values), how does it know if the

packet is starting with 0, 10, or 110 ?

Synchronisation for the start of a frame in communication is always done

with specific patterns containing at least one transition. For example,

with good-old-fashioned UART communication, break (inter-character

pauses) are a string of at least 10 ones, and the start of a character

frame is always a zero with the line going from high to low (either

after a break, or after the previous character's stop bit).

>

>>

>> It is also common to want to combine other encoding features at the

>> same time. Limiting the lengths of both zeros and ones means you

>> have transitions, which are needed for clock recovery and

>> synchronisation. For many transmissions, approximately 0 DC balance

>> is highly desirable. And you may also want forward error correction

>> or at least error detection.

>>

>

> In my case of interest the bits are encoded as NRZI so mere presence of

> '1' bits in headers and gaps guarantees that there are at least M

> transitions per packet.

Then you would also need a minimum number of ones per frame. As it stands, a run of 17 zeros would fit your

requirements.

> It is sufficient for clock phase recovery

> at receiver (a.k.a Data Phase Alignment). Clock frequency recovery is

> not necessary, because clocks at both sides of the link are derived from

> the same source.

> Anyway, when somewhat higher density of transitions is desirable, say,

> one transition (i.e. logical 1) per 25 or 30 bits, it's absolutely

> trivial to add at no additional coding cost since 76735 is sufficiently

> large than 65736.

(I would not say it is "trivial" until you have found a simple algorithm

for coding and decoding to match your requirements - but it is unlikely

to make the problem any harder.)

>

> DC balance is not required in my case. When DC balance is required at

> all, the balance requirements tend to be rather strict. Weak balance is

> rarely of interest.

Weak balance - a statistical or probabilistic balance - is certainly of interest in some systems. PCIe communication, for instance, maintains

an approximate balance by using a scrambler. Some DC "leakage" through

the terminators is enough to deal with the remainder, if I have

understood things correctly.

Other systems use active DC balance tracking to keep a stricter control,

and transmit different encodings to minimise the DC imbalance. But even

then, it is never perfect - no line drivers are equally strong for high

and low drives, and there are always differences in the latencies for

high and low transitions. Even with an ideal 50% duty cycle coming in,

there will be a small DC bias that must be leaked away. So DC balancing

in the encoding is /always/ weak, if you look closely enough.

> Typically disparity should be bounded to few

> percents over rather short intervals of few dozens of bit times. Or

> better. In order to achieve that you need rather higher coding rates

> than 17/16 or rather longer coding blocks than 16 bits. Or sometimes

> both. In other words, you have to play completely different game.

>

>> Maybe you have some specialised use in mind?

>>

>

> I do have a specialized use in mind, but for my specialized use the

> schemes that limit the number of continuous '1' bits to 4 are not

> necessary. Limit of 5 is perfectly o.k. and as I mentioned in original

> post, it's easy. In fact, if 5 was not so easy, from practical

> perspective I would be satisfied with significantly higher limit, like 7

> or 8.

>

> The question about bound of 4 is more of theoretical than of

> practical interest. Yesterday it surprised me to find out that such

> codes exists not just for 16b-to-17b, but for much longer words as

> well, up to 30b-to-31b. But there is a big difference between knowing

> that the code exists and finding a way to implement it in

> environment with limited resources.

>

>> Otherwise, these links might be of interest to you, on related ideas :

>>

>> <https://en.wikipedia.org/wiki/64b/66b_encoding>

>> <https://en.wikipedia.org/wiki/8b/10b_encoding>

>> <https://en.wikipedia.org/wiki/Bit_stuffing>

>>

>>

>

> Don't you know me long enough to guess that I had read that much and

> more?

>

I can't keep track of what everyone knows.

When I see a post that looks like someone is trying to invent a new

technique in an existing field, and their ideas are noticeably different

from standard solutions, there are various possible explanations. One

is that they have a very niche use-case so the standard ideas don't

suit. Another is that they have come up with something truly original

and exciting. Most common, however, is that they don't know much about

the field and are not very aware of existing methods, and are currently

just playing about with ideas for solutions. Option 3 is the first

assumption until demonstrated otherwise. It looks like option 1 applies

to you here.

I can't say that I really understand why you have the requirements you

are asking for here - they don't make sense to me. Obviously I don't

have a full view of your system, but I cannot currently imagine a system

for which your requirements would be the best way to solve the problem.

Dec 6, 2023, 11:27:06 AM

LEs of the sort we have in old-fashioned FPGAs, where each LE contains 1 register, one 4-input LUT, carry-in/carry-out for an adder, and little else. Use of one or two 8 Kbit SRAM blocks is acceptable on the transmitter side, but not on the receiver side.

Note that it's acceptable and even expected for the implementation to be serial, i.e. transmitting/receiving one bit per clock and completing the full transform every 17 clock cycles.

Dec 6, 2023, 12:05:11 PM

I am not sure I understand what you are getting at here. I get that you want to transform an arbitrary

16-bit value into a 17-bit value that satisfies a certain

property (such that the 16-bit value can be reconstructed), but I

don't get what the property is exactly or how a 17-bit value is

transformed back into a 16-bit value. I do realize that what

you're looking for is a nice way of turning a 16-bit value into a

17-bit value (assuming I'm not completely confused, which is

always a possibility). Can you spell it out for me in a bit more

detail?

Dec 6, 2023, 12:44:43 PM

On Wed, 6 Dec 2023 16:48:51 +0100, David Brown <david...@hesbynett.no> wrote:

You still refused to read my original post without skipping words.

In particular you obviously did not read this part "and at most M=L+T

continuous inner '1' bits".

In the meantime I found an algorithm that works, sort of.

A modification that causes it to generate no more than 5 leading zeros and no more than 16 trailing zeros per extended word is indeed trivial.

I am not fully satisfied with this algorithm for two reasons:

A. It's sort of ad hoc. It works for 16->17, but I am not sure that

similar algorithms are going to work for, say 24->25 or 28->29.

B. The hardware implementation (in FPGA) is not as compact as I would

like. There are too many non-trivial constants involved and non-trivial

constants consume LEs.

> >

> > DC balance is not required in my case. When DC balance is required

> > at all, the balance requirements tend to be rather strict. Weak

> > balance is rarely of interest.

>

> Weak balance - a statistical or probabilistic balance - is certainly

> of interest in some systems. PCIe communication, for instance,

> maintains an approximate balance by using a scrambler. Some DC

> "leakage" through the terminators is enough to deal with the

> remainder, if I have understood things correctly.

>

PCIe Gen1 and Gen2 apply 8b-10b encoding that guarantees rather strong

DC balance. As observed on 10T intervals, disparity never exceeds two

bits. It is possible due to the high coding rate of 5/4.

For later PCIe generations DC balance is indeed non-robust. I suspect

that hardware here is designed to transmit/receive totally non-balanced

streams. With more balanced streams it just works better == has higher

noise margin.

But I admit that I have not studied newer PCIe PHYs to the same degree as earlier generations.

> Other systems use active DC balance tracking to keep a stricter

> control, and transmit different encodings to minimise the DC

> imbalance. But even then, it is never perfect - no line drivers are

> equally strong for high and low drives, and there are always

> differences in the latencies for high and low transitions. Even with

> an ideal 50% duty cycle coming in, there will be a small DC bias that

> must be leaked away. So DC balancing in the encoding is /always/

> weak, if you look closely enough.

>

N/A for differential, which happens to dominate high speed outside of

DRAM buses.

I tend to think about coding techniques 'for transparency' as

intellectual assets. You want to have more of them even in the absence

of immediate application.

My absolute favorite is Consistent Overhead Byte Stuffing, a.k.a. COBS. It was featured in a few proposals but never made it into any major standard. Despite that, it's a work of beauty, and I use variants of COBS at work rather intensively, typically for communication between parts of our embedded systems.

BTW, the ideas of COBS are applicable, among other approaches, to avoiding 5-long '1' runs. Not so much to 4-long runs.

The difference here is that avoidance of 5-long runs can be attacked by splitting the word into 3-bit and 4-bit symbols, and as long as there are symbols, COBS has a chance.

For 4-long runs, partitioning into 4-bit symbols, even when interleaved with 3-bit symbols, does not work for an obvious reason. And partitioning [of the 17-bit extended word] into 3-bit symbols does not work for an even more obvious reason: 7**5 * 3 < 2**16.
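For readers who have not met COBS, here is a minimal sketch of the classic byte-oriented algorithm (my own illustration; the bit-level variants discussed in this thread differ). Each output block begins with a code byte giving the distance to the next zero, so zero bytes never appear in the encoded stream:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* COBS encode: removes all 0x00 bytes at a cost of at most one
   overhead byte per 254 payload bytes.  Code byte 0xFF marks a
   maximal block with no implied zero after it. */
size_t cobs_encode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t out_i = 1, code_i = 0;
    uint8_t code = 1;
    for (size_t i = 0; i < len; ++i) {
        if (in[i] == 0) {
            out[code_i] = code;         /* close block at the zero */
            code = 1;
            code_i = out_i++;
        } else {
            out[out_i++] = in[i];
            if (++code == 0xFF) {       /* 254 non-zero bytes: restart */
                out[code_i] = code;
                code = 1;
                code_i = out_i++;
            }
        }
    }
    out[code_i] = code;
    return out_i;
}

size_t cobs_decode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t out_i = 0;
    for (size_t i = 0; i < len; ) {
        uint8_t code = in[i++];
        for (uint8_t j = 1; j < code && i < len; ++j)
            out[out_i++] = in[i++];
        if (code != 0xFF && i < len)    /* implied zero, except after 0xFF */
            out[out_i++] = 0;
    }
    return out_i;
}

/* Round-trip a frame containing a zero byte; returns 1 on success. */
int cobs_demo(void)
{
    const uint8_t msg[4] = { 0x11, 0x22, 0x00, 0x33 };
    uint8_t enc[8], dec[8];
    size_t n = cobs_encode(msg, 4, enc);
    return n == 5 && enc[0] == 3 && enc[3] == 2
        && cobs_decode(enc, n, dec) == 4
        && memcmp(dec, msg, 4) == 0;
}
```

The demo encodes {11 22 00 33} into {03 11 22 02 33} and back, showing how the zero byte is replaced by a link in the code-byte chain.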


Dec 6, 2023, 1:01:25 PM

So let's do it again.

1. Reversibility.

No two 16-bit input patterns correspond to the same 17-bit output pattern.

2. Leading '1' bits.

Output patterns that have L+1 or more consecutive '1' MS bits are

illegal.

3. Trailing '1' bits.

Output patterns that have T+1 or more consecutive '1' LS bits are

illegal.

4. Inner '1' bits.

Output patterns that have L+T+1 or more consecutive '1' bits starting

at any position are illegal.

(L,T) = either (1,3) or (2,2).
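Rules 2-4 transcribe directly into a predicate on a single output pattern; a sketch (mine, not part of the post), with bit 16 taken as the MS bit:

```c
/* Check rules 2-4 for one 17-bit output pattern w (bit 16 = MS bit):
   at most L leading '1's, at most T trailing '1's, and no run of
   L+T+1 or more '1's starting at any position.  Rule 1 (reversibility)
   is a property of the whole mapping, not of a single word. */
int output_is_legal(unsigned w, int L, int T)
{
    int lead = 0, trail = 0, run = 0, maxrun = 0;
    for (int i = 16; i >= 0 && ((w >> i) & 1); --i)
        ++lead;
    for (int i = 0; i <= 16 && ((w >> i) & 1); ++i)
        ++trail;
    for (int i = 0; i <= 16; ++i) {
        run = ((w >> i) & 1) ? run + 1 : 0;
        if (run > maxrun)
            maxrun = run;
    }
    return lead <= L && trail <= T && maxrun <= L + T;
}
```

For (L,T) = (2,2): the all-zero word is legal, a word with three leading ones violates rule 2, three trailing ones violates rule 3, and any inner run of five ones violates rule 4.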

Dec 6, 2023, 1:22:53 PM

As mentioned in the other post, it is unsatisfactory for two reasons:

A. It's sort of ad hoc. It works for 16->17, but I am not sure that

similar algorithms are going to work for, say 24->25 or 28->29.

B. The hardware implementation (in FPGA) is not as compact as I would

like. There are too many non-trivial constants involved. Non-trivial constants consume LEs.

static const unsigned bv_tab[17] = { // bv_tab[k]/bv_tab[k+1] <= 2
    42557, 21647, 11011, 5601, 2849, 1449, 737, 375, 191,
    97, 49, 25, 13, 7, 4, 2, 1,
};

// encode 16->17
unsigned enc(unsigned x)
{
    // avoid long strings of zeros by mapping inputs in range
    // [0:2**13-1] to [2**16:2**16+2**13-1]
    if (x < 8*1024)
        x += 0x10000;
    // convert x to sub-binary non-constant base
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {
        y += y;
        const unsigned bv = bv_tab[i];
        if (x >= bv) {
            x -= bv;
            y += 1;
        }
    }
    return y & 0x1FFFF;
}

// decode 17->16
unsigned dec(unsigned x)
{
    // convert x from sub-binary non-constant base back to binary
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {
        if (x & 0x10000) {
            y += bv_tab[i];
        }
        x += x;
    }
    return y & 0xFFFF;
}
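Reversibility of this scheme is easy to check exhaustively. The harness below copies the routines verbatim and adds a checker (enc_is_reversible is my addition): the greedy conversion round-trips because each bv_tab entry is at most one more than the sum of the entries after it, so the greedy subtraction always reaches zero:

```c
static const unsigned bv_tab[17] = { /* bv_tab[k]/bv_tab[k+1] <= 2 */
    42557, 21647, 11011, 5601, 2849, 1449, 737, 375, 191,
    97, 49, 25, 13, 7, 4, 2, 1,
};

unsigned enc(unsigned x)
{
    if (x < 8 * 1024)
        x += 0x10000;           /* avoid long strings of zeros */
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {
        y += y;                 /* shift output left */
        if (x >= bv_tab[i]) {   /* greedy digit selection */
            x -= bv_tab[i];
            y += 1;
        }
    }
    return y & 0x1FFFF;
}

unsigned dec(unsigned x)
{
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {
        if (x & 0x10000)        /* read MS bit first, as enc emitted it */
            y += bv_tab[i];
        x += x;
    }
    return y & 0xFFFF;          /* the mask also strips the 0x10000 offset */
}

/* Returns 1 iff dec(enc(x)) == x for all 2**16 inputs. */
int enc_is_reversible(void)
{
    for (unsigned x = 0; x < 0x10000; ++x)
        if (dec(enc(x)) != x)
            return 0;
    return 1;
}
```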

Dec 6, 2023, 4:01:36 PM

exactly those purposes.

Terje

--

- <Terje.Mathisen at tmsw.no>

"almost all programming can be viewed as an exercise in caching"

Dec 6, 2023, 5:17:49 PM

Or, maybe cheaper, could be a 4b to 5b mapping.

Say, 4 to 5 bit:

0000 -> 01010

0001 -> 01001

0010 -> 01110

0011 -> 01101

0101 -> 00001

0110 -> 00110

0111 -> 00101

1000 -> 11010

1001 -> 11001

1010 -> 11110

1011 -> 11100

1101 -> 10001

1110 -> 10110

1111 -> 10101

Better schemes are likely possible, this was basically "input^0101" with

the last bit as a "balancing bit".

Could encode two such values for 8 bits, or 4 for 16 bits.

This could be reduced to 16->18 or 16->17 by selecting an additional

XOR mask for the other bits, so:

(bits^0x5555)^(bal?MASK1:MASK2);

(bits^0x5555)^(bal1?(bal0?MASK1:MASK2):(bal0?MASK3:MASK4));

This would avoid needing giant lookup tables, and with appropriate masks

could remain roughly DC-biased (though, a 16->18 bit version, with 4

possible masks, could be better for DC balancing).

Say, XOR masks (2b):

00: 111011101110

01: 011001100110

10: 100110011001

11: 000100010001

One other possible feature that could be investigated (more likely if

using the 16 bits via 20 bits scheme) would be to tweak the encoding to

make it easier to re-synchronize the bitstream to an 8 or 16-bit

boundary if the link is disrupted for whatever reason.

Likely easiest way to do this would be to use an 0x55AA mask vs 0x5555

(for 16-bit alignment), or 0x5A5A for 8-bit alignment, with the

balancing bits tweaked such that the two cases are (mostly) mutually

exclusive.

By looking at a chunk of, say, 20 or 40 bits, it would be possible to

estimate within a reasonable probability whether or not the bitstream is

correctly aligned.

Say, 4 to 5 bit (5 and A):

0000 -> 01010, 10100

0001 -> 01001, 10110

0010 -> 01110, 10000

0011 -> 01101, 10011

0101 -> 00001, 11110

0110 -> 00110, 11000

0111 -> 00101, 11011

1000 -> 11010, 00100

1001 -> 11001, 00111

1010 -> 11110, 00001

1011 -> 11100, 00011

1101 -> 10001, 01111

1110 -> 10110, 01001

1111 -> 10101, 01011

Or, inverses:

00000: Invalid

00001: 0101(5), 1010(A)

00010: Invalid

00011: 1011(A), Invalid(5)

00100: 1000(A), Invalid(5)

00101: 0111(5), Invalid(A)

00110: 0110(5), Invalid(A)

00111: 1001(A), Invalid(5)

01000: Invalid

01001: 0001(5), 1110(A)

... (Could fill out the rest, but should be obvious enough).

11111: Invalid

Better is likely possible, but this was just what came to mind at the

moment.

Would prefer to keep away from the NRZI/Bit-Stuffing scheme, as this

"kinda sucks" IMHO (bad enough experiences with this trying to implement

hardware interfaces for USB).

...

Dec 6, 2023, 7:11:44 PM

In 1GbE, 8B10B is used by one variant of MAC-to-PHY interface (SGMII) and by optical media (all sub-variants

of 1000BaseX).

The dominant 1GbE media standard, 1000BASE-T, does not use 8B10B. It applies a much more complicated 8-bit to 12-bit mapping, with multiple mappings corresponding to each input octet.

Strictly speaking, classic 8B10B also has multiple (2) mappings for some of the input octets, but by comparison it is very simple.

If we are going to believe Wikipedia, another extant copper 1GbE

variant, 1000BASE-T1, uses 80->81 bit code.

Dec 6, 2023, 7:37:49 PM

On 12/6/2023 2:30 AM, Michael S wrote:

> Bit sequences with bounded number of continuous '1' bits are useful for

> serialization of packetized data.

>

> For example, if we have a way of converting arbitrary n-bit words to

> unique (n+1)-bit extended words with at most L continuous leading '1'

> bits, at most T continuous trailing '1' bits and at most M=L+T

> continuous inner '1' bits then we can transfer multi-word packets

> starting with a header consisting of M+1 '1' bits followed by a single

> '0' bit and then by desired number of extended words. Any gaps between

> packets are filled with '1' bits.

> Receiver can easily synchronize itself to incoming packets and will

> quickly and robustly recovers after media errors.

> Such conversions have are other application as well. They always have.

>

> For me, a case of n=16 (16-bit words, 17-bit extended words) is of

> particular practical interest.

>

> Brute force counting proves that both for (L=1,T=3) and for (L=2,T=2)

> there exist more than 2**16 extended words with desired properties.

> For those interested, 76735 and 83665 extended words, respectively.

> So, desired 16b-to-17b conversions certainly exist for M=4.

> The problem is that I know of no simple "algorithmic" way of

> implementing such conversion short of using big look-up tables.

There almost has to be a "direct" (algorithmic) method without using
external lookup tables... Humm...

> Neither

> in software nor in simple compact hardware.

> That is in strong contrast to case of M=5 (both L=1,T=4 and L=2,T=3)

> where I know many such algorithms.

>

> Any ideas?

> May be, somebody had seen such methods in patents, preferably old and

> outdated?

> Or, better, in books or articles?

>

>

>

https://groups.google.com/g/comp.lang.asm.x86/c/FScbTaQEYLc

Not exactly sure why yet...

Dec 6, 2023, 10:28:27 PM

Here is an idea, such as it is. Clunky for software, might

be okay for hardware.

Subdivide the input space by looking at the top six bits. There

are three ranges (with (L,T) = (2,2))

000000... to 100011... maps to 0xxxxx...

100100... to 110110... maps to 10xxxx...

110111... to 111111... maps to 110xxx...

which are

36864 values of 47338 code points (77.87 percent)

19456 values of 24079 code points (80.80 percent)

9216 values of 12248 code points (75.24 percent)

===== =====

65536 83665

Continue subdividing recursively, choosing nice value ranges in

binary to correspond to subdivisions in the output space. After

the first level of dividing, subdivisions have five output ranges

to choose from rather than just three (until we reach the end, of

course). For example, values in the range

[ 100000.. to 100100.. ) could be mapped to 01110xxx and 011110xxx

(4096 values of 3169 + 1612 = 4781 code points, 85.67 percent)

All the above just playing around by hand after the initial idea

of subdividing based on a small number of leading bits.

Not sure if this approach is good enough for you. I admit it's

not pretty. My sense is it's plausible, even if rather ugly.

And who knows, it may spark a different idea or a better approach.

Dec 7, 2023, 4:36:59 AM

The size of my packets varies between 4 and 33 16-bit words.

A protocol that was in use in the previous generation of this product was a simple one - each 16-bit word is prefixed by a single '0' bit, and gaps between packets are filled with 1s. This protocol requires at least a 17-bit gap between packets, so the maximal packet rate for a given number of words N is: Rate = 1/(17*N+17).

With 8B10B the formula is: Rate = 1/(20*N+10).

One can see that 8B10B is strictly slower than the old simple scheme for all possible packet sizes. It happens to be just a little slower (1/85 vs 1/90) for the smallest packets, but quite a lot slower (1/578 vs 1/670) for the biggest packets. It is also more complicated, especially on the receiver side.
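The two formulas are easy to tabulate; a trivial sketch (helper names mine) for the packet sizes discussed here:

```c
/* Bits per packet: old scheme = 17 bits per word plus a 17-bit gap;
   8b10b = 20 bits per word (two 8b->10b groups) plus a 10-bit gap. */
int old_scheme_bits(int n_words) { return 17 * n_words + 17; }
int enc8b10b_bits(int n_words)   { return 20 * n_words + 10; }
```

For N=4 this gives 85 vs 90 bits per packet, and for N=33, 578 vs 670, matching the rates quoted above.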

Another disadvantage of 8B10B is that after link is established packets

can be started only on 10-bit boundaries. When the rate of packets is

not a multiple of 10 clocks (which is most of the time) it adds jitter

to link delay.

Conclusion: 8B10B is not an improvement on the old solution. In fact, it is the opposite of an improvement.

Now you can ask why I want an improvement at all. The answer: mostly because I want the ability to transfer the shortest packets (4 words) at a rate of 1/84 clocks; my old protocol gives 1/85. From a practical perspective very little is missing, and a rather simple improvement in encoding will do the job. The desire to produce encoded streams with at most 4 continuous 1s is rooted in the pursuit of theoretical perfection rather than the real-world requirements of the moment.

Dec 7, 2023, 5:13:14 AM

On Wed, 06 Dec 2023 19:28:15 -0800

I don't understand the "recursively" part.

Maybe a more detailed description of the second stage will help.

The key question I don't understand is whether the boundaries used at stage n+1 depend on the output bits produced at stage n, or whether the boundaries applied to the residue of the input word at a given stage are always the same.

If the former, it seems that we end up with the same 64 Kword table size as in the direct approach.

If the latter, it can be nice. But for that to work, at each stage you probably want to subdivide into sub-spaces with approximately equal numbers of code points. So, maybe, 00x, 01x and 1xx at the first step? Or just 0x and 1x, but then it degenerates into my current "conversion to sub-binary non-constant base" algorithm.

And for my current algorithm I know for sure that "nice value ranges", as you put it, can be used only at the 5 initial stages. "Non-nice" ranges are required at stage 6 and later.


Dec 7, 2023, 7:03:09â€¯AM12/7/23

to

On 06/12/2023 18:43, Michael S wrote:

> On Wed, 6 Dec 2023 16:48:51 +0100

> David Brown <david...@hesbynett.no> wrote:

>

>> On 06/12/2023 14:41, Michael S wrote:

>>> On Wed, 6 Dec 2023 13:06:29 +0100

>>> David Brown <david...@hesbynett.no> wrote:

>>>

>

> You still refused to read my original post without skipping words.

> In particular you obviously did not read this part "and at most M=L+T

> continuous inner '1' bits".

I did not "refuse" to do anything, but now that you have pointed this out, I see I misread a couple of things from the first post. Sorry for

my confusion there, leading to some unnecessary concerns in my post as

you had already covered the points.

>>

>>> Anyway, when somewhat higher density of transitions is desirable,

>>> say, one transition (i.e. logical 1) per 25 or 30 bits, it's

>>> absolutely trivial to add at no additional coding cost since 76735

>>> is sufficiently larger than 65536.

>>

>> Certainly you can make sure you have enough ones in the packet. (I

>> would not say it is "trivial" until you have found a simple algorithm

>> for coding and decoding to match your requirements - but it is

>> unlikely to make the problem any harder.)

>>

>

> In the mean time I found an algorithm that works. Sort of.

> Modification that causes it to generate no more than 5 leading zeros

> and no more than 16 trailing zeros per extended word is indeed trivial.

>

> I am not fully satisfied with this algorithm for two reasons:

> A. It's sort of ad hoc. It works for 16->17, but I am not sure that

> similar algorithms are going to work for, say 24->25 or 28->29.

> B. The hardware implementation (in FPGA) is not as compact as I would

> like. There are too many non-trivial constants involved and non-trivial

> constants consume LEs.

>

right? I'm guessing you are looking at a linear shift register

arrangement of some kind, since they are amenable to compact FPGA

implementations (unlike a big lookup table - which would give a simple

solution if you could afford the space).

But it does all sound somewhat arbitrary here. Why do you pick these

specific numbers? It seems you are not concerned about errors other

than synchronisation (presumably you handle any required error checking

or correcting at a higher level). Since you have a common clock source,

are we talking about a pair of FPGAs on the same board here?

What sort of packet sizes are you using, and how critical is the

bandwidth utilisation? Could you not massively simplify it by picking M

as 16? Each 16 bits of data could be transferred as a frame consisting

of a zero followed by the original data. Inter-packet gaps will be all

ones, and you need a minimum gap of 17 ones before a new packet.

Compared to your earlier suggestions this would increase the minimum gap

time, but would make the implementation trivial.

Of course if this is all motivated more by interest and the fun and

challenge of trying to find such algorithms, rather than practical

necessity, then I'll stop trying to look at the practicalities and try

to think more about the coding patterns. Fun and interest is always a

good reason.

>>>

>>> DC balance is not required in my case. When DC balance is required

>>> at all, the balance requirements tend to be rather strict. Weak

>>> balance is rarely of interest.

>>

>> Weak balance - a statistical or probabilistic balance - is certainly

>> of interest in some systems. PCIe communication, for instance,

>> maintains an approximate balance by using a scrambler. Some DC

>> "leakage" through the terminators is enough to deal with the

>> remainder, if I have understood things correctly.

>>

>

> PCIe Gen1 and Gen2 apply 8b-10b encoding that guarantees rather strong

> DC balance. As observed on 10T intervals, disparity never exceeds two

> bits. It is possible due to the high coding rate of 5/4.

> For later PCIe generations DC balance is indeed non-robust. I suspect

> that hardware here is designed to transmit/receive totally non-balanced

> streams. With more balanced streams it just works better == has higher

> noise margin.

> But I admit to not learning newer PCIe phys to the same degree I

> learned earlier generations.

checking mechanisms for later PCIe versions. So if you are unlucky

enough to have data patterns that give too high DC bias, causing errors,

these will lead to retransmissions. And I would assume the scrambler

phase would be different even when re-transmitting the same original

data, avoiding a repeat of the problem. Anyway, your transmission

always has to cope with /some/ DC bias, even if it is small, because

it's all analogue and never entirely precise. Leaking away the bias

through terminators is easy enough - you just don't want to have to

handle too much, because it costs power and slows the signal.

>>

>> I can't say that I really understand why you have the requirements

>> you are asking for here - they don't make sense to me. Obviously I

>> don't have a full view of your system, but I cannot currently imagine

>> a system for which your requirements would be the best way to solve

>> the problem.

>>

>>

>

> I tend to think about coding techniques 'for transparency' as

> intellectual assets. You want to have more of them even in the absence

> of immediate application.

> My absolute favorite is Consistent Overhead Byte Stuffing a.k.a. COBS.

> It featured in few proposals, but never made it to any major standard.

> Despite it, it's a work of beauty and I use variants of COBS at work

> rather intensely, typically for communication between parts of our

> embedded systems.

It has been a long time since I heard about COBS - thank you for reminding me of it!
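For readers who have not met it: COBS removes all zero bytes from a buffer by replacing each zero with the distance to the next one, at a worst-case cost of one extra byte per 254 input bytes. A minimal encoder sketch (my code, following the classic Cheshire/Baker formulation, not anything from this thread):

```c
#include <stddef.h>
#include <stdint.h>

/* COBS-encode len bytes from in[] into out[]; the output contains no
   0x00 bytes.  out[] must have room for len + len/254 + 1 bytes.
   Returns the number of bytes written. */
static size_t cobs_encode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t out_i = 1, code_i = 0;   /* code_i: where the pending code byte goes */
    uint8_t code = 1;               /* distance from code byte to next zero */
    for (size_t i = 0; i < len; ++i) {
        if (in[i] == 0) {
            out[code_i] = code;     /* finish the block at this zero */
            code_i = out_i++;
            code = 1;
        } else {
            out[out_i++] = in[i];
            if (++code == 0xFF) {   /* block full: emit code, start anew */
                out[code_i] = code;
                code_i = out_i++;
                code = 1;
            }
        }
    }
    out[code_i] = code;             /* finish the final block */
    return out_i;
}
```

Decoding simply walks the code bytes back, which is what makes COBS so cheap on both ends.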

Dec 7, 2023, 10:21:26â€¯AM12/7/23

to

BGB <cr8...@gmail.com> writes:

>On 12/6/2023 4:30 AM, Michael S wrote:

>> May be, somebody had seen such methods in patents, preferably old and

>> outdated?

>> Or, better, in books or articles?

>>

>

>For hardware, 8b/10b looks like a more sane option.

Although that has 25% overhead.

Using 64b/66b reduces the overhead to 3.125%.

Dec 7, 2023, 10:23:44â€¯AM12/7/23

to

Michael S <already...@yahoo.com> writes:

>On Wed, 6 Dec 2023 22:01:31 +0100
>Terje Mathisen <terje.m...@tmsw.no> wrote:
>
>> David Brown wrote:
>> > On 06/12/2023 11:30, Michael S wrote:
>> >> Bit sequences with bounded number of continuous '1' bits are
>> >> useful for serialization of packetized data.
>> >>
>> >> For example, if we have a way of converting arbitrary n-bit words
>> >> to unique (n+1)-bit extended words with at most L continuous
>> >> leading '1' bits, at most T continuous trailing '1' bits and at
>> >> most M=L+T continuous inner '1' bits then we can transfer
>> >> multi-word packets starting with a header consisting of M+1 '1'
>> >> bits followed by a single '0' bit and then by desired number of
>> >> extended words. Any gaps between packets are filled with '1' bits.
>> >> Receiver can easily synchronize itself to incoming packets and will
>> >> quickly and robustly recover after media errors.
>> >> Such conversions have other applications as well.
>> >>
>> >> For me, a case of n=16 (16-bit words, 17-bit extended words) is of
>> >> particular practical interest.
>> >>
>> >> Brute force counting proves that both for (L=1,T=3) and for
>> >> (L=2,T=2) there exist more than 2**16 extended words with desired
>> >> properties. For those interested, 76735 and 83665 extended words,
>> >> respectively. So, desired 16b-to-17b conversions certainly exist
>> >> for M=4. The problem is that I know of no simple "algorithmic" way
>> >> of implementing such conversion short of using big look-up tables.
>> >> Neither in software nor in simple compact hardware.
>> >> That is in strong contrast to case of M=5 (both L=1,T=4 and
>> >> L=2,T=3) where I know many such algorithms.
>> >>
>> >> Any ideas?
>> >> May be, somebody had seen such methods in patents, preferably old
>> >> and outdated?
>> >> Or, better, in books or articles?
>> >>
>> >
>> > Usually the interest is in limiting the length of repeats at any
>> > point in the data, not just the start and end. After all, it's not
>> > going to help anyone if the start and end of the word are limited
>> > to no more than 2 consecutive ones if you have a run of 10 ones in
>> > the middle of the word.
>> >
>> > It is also common to want to combine other encoding features at the
>> > same time. Limiting the lengths of both zeros and ones means you
>> > have transitions, which are needed for clock recovery and
>> > synchronisation. For many transmissions, approximately 0 DC balance
>> > is highly desirable. And you may also want forward error correction
>> > or at least error detection.
>>
>> AFAIR an 8->10 encoding is used for all gigabit ethernet links, for
>> exactly those purposes.
>>
>> Terje
>>
>
>In 1GbE world 8B10B is used only for one of several available variants
>of MAC-to-PHY interface (SGMII) and by optical media (all sub-variants
>of 1000BaseX).
>
>The dominant 1GbE media standard, 1000BASE-T, does not use 8B10B. It
>applies much more complicated 8bit to 12-bit mapping with multiple
>mappings corresponding to each input octet.

And 10Gbe uses 64b/66b which just requires a two-bit prefix that
can only have the values 0b01 or 0b10.

PCIe 3.0 uses 128b/130b.

Dec 7, 2023, 10:40:17â€¯AM12/7/23

to

On Thu, 7 Dec 2023 13:01:40 +0100

An LFSR that does the work would be great. But I don't think that it is even remotely possible. Or maybe I am not enough of a Galois algebraist to see how it would be possible.

So far all my approaches were either logical (chained symbol

substitution) or arithmetic (conversion to numeric systems with

non-power-of-two base). In the specific case of L+T=4, all algorithms that I tried so far were of the latter variety.

FPGAs, except for *very* old ones, are actually quite good at addition

and subtraction as long as your adder is not too wide. 16-20 bits are

always o.k.

> But it does all sound somewhat arbitrary here. Why do you pick these

> specific numbers? It seems you are not concerned about errors other

> than synchronisation (presumably you handle any required error

> checking or correcting at a higher level). Since you have a common

> clock source, are we talking about a pair of FPGAs on the same board

> here?

>

Different boards in the same relatively small rack.

Transmitter size is not important, because there is only one

transmitter per (data acquisition) board.

Receiver size is relatively important, because there are many receivers

per (concentrator) board.

> What sort of packet sizes are you using, and how critical is the

> bandwidth utilisation? Could you not massively simplify it by

> picking M as 16? Each 16 bits of data could be transferred as a

> frame consisting of a zero followed by the original data.

> Inter-packet gaps will be all ones, and you need a minimum gap of 17

> ones before a new packet. Compared to your earlier suggestions this

> would increase the minimum gap time, but would make the

> implementation trivial.

>

>

> Of course if this is all motivated more by interest and the fun and

> challenge of trying to find such algorithms, rather than practical

> necessity, then I'll stop trying to look at the practicalities and

> try to think more about the coding patterns. Fun and interest is

> always a good reason.

>

Most of your above questions/suggestions answered above in my reply to

BGB.


Dec 7, 2023, 11:01:18â€¯AM12/7/23

to

a shift register in an FPGA DDR output block. The bus used three

simultaneous channels like HDMI. The transceivers are modified HDMI

cores. IIRC it transferred data in 36-bit packets. Got the bus spec'd

out and the cores written, but never put into use. I was going to use it

to interconnect multiple FPGA boards at high speed.

Dec 7, 2023, 12:11:23â€¯PM12/7/23

to

channel, ESCON, Infiniband, etc.

But also according to Wikipedia, for 10Gb Ethernet, it has been replaced

by 64B/66B, which is obviously much more space efficient.

https://en.wikipedia.org/wiki/64b/66b_encoding

But the part potentially relevant to the OP is that it uses an algorithm

called "scrambling" to address the encoding/decoding problem, as a

look-up table is obviously impractical. Perhaps the scrambling

algorithm could be adapted for a 16B/17B application. The article

includes links to descriptions, etc.
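For concreteness, the 64b/66b payload scrambler is a self-synchronising LFSR with polynomial x^58 + x^39 + 1; bit-serially it is just two XOR taps. A sketch (my function names; a real implementation would process a word at a time):

```c
#include <stdint.h>

/* Self-synchronising scrambler of the 64b/66b kind (x^58 + x^39 + 1).
   'state' accumulates the last 58 *scrambled* bits: bit 38 is the
   output from 39 steps ago, bit 57 the output from 58 steps ago. */
static unsigned scramble_bit(uint64_t *state, unsigned in_bit)
{
    unsigned out = in_bit ^ ((unsigned)(*state >> 38) & 1)
                          ^ ((unsigned)(*state >> 57) & 1);
    *state = (*state << 1) | out;
    return out;
}

/* The descrambler taps the *received* (scrambled) bits, which is what
   makes the pair self-synchronising after bit errors. */
static unsigned descramble_bit(uint64_t *state, unsigned in_bit)
{
    unsigned out = in_bit ^ ((unsigned)(*state >> 38) & 1)
                          ^ ((unsigned)(*state >> 57) & 1);
    *state = (*state << 1) | in_bit;
    return out;
}
```

Note that this bounds run lengths only statistically, not with the hard guarantee being sought in this thread.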

--

- Stephen Fuld

(e-mail address disguised to prevent spam)

Dec 7, 2023, 1:04:01â€¯PM12/7/23

to

On Thu, 7 Dec 2023 09:10:40 -0800

Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:

>

> But the part potentially relevant to the OP is that it uses an

> algorithm called "scrambling" to address the encoding/decoding

> problem, as a look-up table is obviously impractical. Perhaps the

> scrambling algorithm could be adapted for a 16B/17B application. The

> article includes links to descriptions, etc.

>

I don't believe that scrambling could be useful.
Scrambling is great statistically, much less great when you want to

guarantee any particular property with absolute certainty.

Dec 7, 2023, 1:24:54â€¯PM12/7/23

to

Michael S <already...@yahoo.com> writes:

>On Thu, 7 Dec 2023 09:10:40 -0800

>Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:

>>

>> But the part potentially relevant to the OP is that it uses an

>> algorithm called "scrambling" to address the encoding/decoding

>> problem, as a look-up table is obviously impractical. Perhaps the

>> scrambling algorithm could be adapted for a 16B/17B application. The

>> article includes links to descriptions, etc.

>>

>

>I don't believe that scrambling could be useful.

10/100G ethernet seems to disagree with that characterization.
Generally scrambled using an LFSR.

100G/200G/400G use PAM-4 (four voltage levels).

Dec 7, 2023, 4:13:05â€¯PM12/7/23

to

On Thu, 07 Dec 2023 18:24:49 GMT

sc...@slp53.sl.home (Scott Lurndal) wrote:

> Michael S <already...@yahoo.com> writes:

> >On Thu, 7 Dec 2023 09:10:40 -0800

> >Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:

> >>

> >> But the part potentially relevant to the OP is that it uses an

> >> algorithm called "scrambling" to address the encoding/decoding

> >> problem, as a look-up table is obviously impractical. Perhaps the

> >> scrambling algorithm could be adapted for a 16B/17B application.

> >> The article includes links to descriptions, etc.

> >>

> >

> >I don't believe that scrambling could be useful.

>

> 10/100G ethernet seems to disagree with that characterization.

>

O.k. Personally for you I'd spell it differently: I don't believe that scrambling could be useful for generation of 16-bit to 17-bit code words with a strict bound on the number of consecutive '1' bits.

Dec 8, 2023, 8:28:56â€¯AM12/8/23

to

redundancy and error correction for RAID, rather than this kind of

thing. My thought is that some kind of scrambler pattern might be able

to generate the patterns you need, or at least something close as a

first step. But I have no idea how to look for such encodings other

than by trial and error.

> So far all my approaches were either logical (chained symbol

> substitution) or arithmetic (conversion to numeric systems with

> non-power-of-two base). In specific case of L+T=4 all algorithms that

> I tried so far were of later variety.

I can see how that could work, but I would expect it to be too big for what you want in your FPGAs?

> FPGAs, except of *very* old ones, are actually quite good at addition

> and subtraction as long as your adder is not too wide. 16-20 bits are

> always o.k.

>

>

>> But it does all sound somewhat arbitrary here. Why do you pick these

>> specific numbers? It seems you are not concerned about errors other

>> than synchronisation (presumably you handle any required error

>> checking or correcting at a higher level). Since you have a common

>> clock source, are we talking about a pair of FPGAs on the same board

>> here?

>>

>

> Different boards in the same relatively small rack.

> Transmitter size is not important, because there is only one

> transmitter per (data acquisition) board.

> Receiver size is relatively important, because there are many receivers

> per (concentrator) board.

>

with needing something like lookup tables for the transmission encoding,

as long as reception decoding can be small? (I had been assuming that

you have more symmetrical hardware requirements.)

>

>> What sort of packet sizes are you using, and how critical is the

>> bandwidth utilisation? Could you not massively simplify it by

>> picking M as 16? Each 16 bits of data could be transferred as a

>> frame consisting of a zero followed by the original data.

>> Inter-packet gaps will be all ones, and you need a minimum gap of 17

>> ones before a new packet. Compared to your earlier suggestions this

>> would increase the minimum gap time, but would make the

>> implementation trivial.

>>

>>

>> Of course if this is all motivated more by interest and the fun and

>> challenge of trying to find such algorithms, rather than practical

>> necessity, then I'll stop trying to look at the practicalities and

>> try to think more about the coding patterns. Fun and interest is

>> always a good reason.

>>

>

> Most of your above questions/suggestions answered above in my reply to

> BGB.

>

Dec 9, 2023, 1:10:15â€¯AM12/9/23

to

First level decomposition (values on left, codes on the right). In all

cases values are the lower end of a range.

0 000 00x xxx xxx xxx 0................

1 001 00x xxx xxx xxx 10...............

1 101 11x xxx xxx xxx 110..............

We give an example of decomposing the third range. The third range has

14 code bits remaining, which gives 12248 codes available for use.

The 12248 codes are subdivided as follows

0............. 6230

10............ 3169

110........... 1612

1110.......... 820

11110......... 417

Second level decomposition of the third range above. First we use the

last two prefixes to take care of the odd-shaped portion up to a value

point of 1 110 000 000 000 000

1 101 11x xxx xxx xxx 1024 values with 1237 = 820+417 codes

0 000 xxx xxx 110 1110... 704 values of 820 codes

1 011 xxx xxx 110 11110.. 320 values of 417 codes

(1 110 000 000 000 000 boundary)

After nibbling off the odd-shaped portion, we are left with 8192 values:

1 11x xxx xxx xxx xxx

We subdivide these 8192 values among the first three prefixes

1 11x xxx xxx xxx xxx

0 xxx xxx xxx xxx 110 0...... 4096+512 = 4608 : 6230 codes

1 001 xxx xxx xxx 110 10..... 2048+512 = 2560 : 3169

1 11x xxx xxx xxx 110 110.... 1024 = 1024 : 1612

(10 000 000 000 000 000 boundary)

The principle being used here is to subdivide the value range along

lines that are determined by high order bit positions, chosen so that

each subrange maps onto a code point range that is roughly 4/3 the size

of the value range being mapped.
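The bookkeeping in the decomposition above is easy to cross-check; restated as code (numbers copied from the ranges above):

```c
/* Per-prefix code counts (0..., 10..., 110..., 1110..., 11110...) for
   the 14 remaining code bits of the third first-level range. */
static const unsigned prefix_codes[5] = { 6230, 3169, 1612, 820, 417 };

/* Total codes available in that range; should equal 12248. */
static unsigned total_codes(void)
{
    unsigned sum = 0;
    for (int i = 0; i < 5; ++i)
        sum += prefix_codes[i];
    return sum;
}
```

The odd-shaped portion (704+320 = 1024 values against 820+417 = 1237 codes) and the remaining 4608+2560+1024 = 8192 values also add up as stated.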

> The key question I don't understand is whether boundaries used at

> the stage n+1 depend on the output bits produced at stage n or

> boundaries applied to residue of the input word at given stage are

> always the same.

> If the former, it seems that we end up with the same 64 Kword table

> size as in direct approach.

at stage n+1 depends on the upstream value bits. The tests

are done at fixed bit positions, and the results combined by

logic so there is a lot of sharing. I think the growth is

more like linear than exponential.

> If the later, it can be nice. But for that to work at each stage you

> probably want to subdivide to sub-spaces with approximately equal

> numbers of code points.

are roughly proportional to the set of code points for the code

prefix used for that set of values.

> So, may be, 01x, 01x and 1xx at first step?

> Or just 0x and 1x, but then it degenerates into my current

> "conversion to sub-binary non-constant base" algorithm.

> And for my current algorithm I know for sure that "nice value ranges"

> as you put it, can be used only at 5 initial stages. "Non nice" ranges

> are required at stage 6 and later.

The scheme I've been describing is not iterative, it's cascading logic.

Now that I have investigated the code space more carefully I understand

the problem a lot better. I think it's easy to construct a solution,

but the challenge is to construct a "good" solution. I independently

developed an iterative approach similar to your table-driven algorithm

shown in another posting, and it was pretty straightforward (and also

generalizes without difficulty to different numbers of bits). Although

it was easy to produce, it's not as compact as your table-driven

algorithm. The scheme I have outlined above takes a different approach,

and looks much better suited to hardware than software.

If any of the above is confusing, probably it's best to just forget

about it. I suspect the qualities of a "good" solution depend a lot on

aspects of the application that are outside the given problem statement.

Dec 10, 2023, 7:50:18â€¯AM12/10/23

to

On Fri, 08 Dec 2023 22:10:06 -0800

May be.

For me "nice value range" means that all values can be represented

as M(i)*2**E(i) where min(M(i)) is small.

At the end I looked for such nice values with brute force search.

It turned out that there exist multiple sets that are nice enough for my purposes, i.e. they enable implementation of a de-serializer that, together with additional logic, fits in 50 LEs.

And finding it only took ~3 minutes of compute time on a single core of

an old PC.

That's my current and likely final algorithm.

static const unsigned bv_tab[17] = {
    46, 47, 48, 49,
    50, 51, 52, 53,
    54, 55, 56, 56,
    56, 64, 64, 64,
    64,
};

// encode 16->17
unsigned enc(unsigned x)
{
    if (x < (1<<12))
        x += 0x10000; // avoid long strings of zeros
    // convert x to non-binary base
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {
        unsigned bv = bv_tab[i]*1024;
        y += y;
        if (x >= bv) {
            x -= bv;
            y += 1;
        }
        x += x;
    }
    return y & 0x1FFFF;
}

// decode 17->16
unsigned dec(unsigned x)
{
    // convert x from non-binary base to binary
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {
        y += y;
        if (x & 0x10000)
            y += bv_tab[i];
        x += x;
    }
    return (y >> 6) & 0xFFFF;
}
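The encoder/decoder pair above round-trips exhaustively over all 2**16 inputs. A self-contained check (the routines are restated with a _chk suffix so the snippet compiles on its own and does not clash with the definitions above; the harness is mine, not part of the post):

```c
static const unsigned bv_tab_chk[17] = {
    46, 47, 48, 49, 50, 51, 52, 53, 54,
    55, 56, 56, 56, 64, 64, 64, 64,
};

/* Copy of the 16->17 encoder above. */
static unsigned enc_chk(unsigned x)
{
    if (x < (1 << 12))
        x += 0x10000;                 /* avoid long strings of zeros */
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {    /* convert to the non-binary base */
        unsigned bv = bv_tab_chk[i] * 1024;
        y += y;
        if (x >= bv) {
            x -= bv;
            y += 1;
        }
        x += x;
    }
    return y & 0x1FFFF;
}

/* Copy of the 17->16 decoder above. */
static unsigned dec_chk(unsigned x)
{
    unsigned y = 0;
    for (int i = 0; i < 17; ++i) {    /* back to binary */
        y += y;
        if (x & 0x10000)
            y += bv_tab_chk[i];
        x += x;
    }
    return (y >> 6) & 0xFFFF;
}
```

The round trip is exact because the base values are "nice": every bv_tab[i]*1024*2**(17-i) is a multiple of 2**17, so the greedy conversion leaves a zero remainder for every input in range.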

> Now that I have investigated the code space more carefully I

> understand the problem a lot better. I think it's easy to construct

> a solution, but the challenge is to construct a "good" solution. I

> independently developed an iterative approach similar to your

> table-driven algorithm shown in another posting, and it was pretty

> straightforward (and also generalizes without difficulty to different

> numbers of bits). Although it was easy to produce, it's not as

> compact as your table-driven algorithm. The scheme I have outlined

> above takes a different approach, and looks much better suited to

> hardware then software.

>

> If any of the above is confusing, probably it's best to just forget

> about it. I suspect the qualities of a "good" solution depend a lot

> on aspects of the application that are outside the given problem

> statement.

Yes, that sounds true.

For example, a good solution that encodes/decodes one word per clock is probably quite different from my algorithm above, which is oriented toward producing/consuming one bit per clock.

Dec 10, 2023, 8:19:40 AM

On Fri, 8 Dec 2023 14:28:37 +0100

And suppose that by trial and error I found a polynomial that does a decent job. How am I then going to do the reverse transform? I mean, without an algebraic understanding? It sounds like a much harder task than finding a scrambler :(

> > So far all my approaches were either logical (chained symbol

> > substitution) or arithmetic (conversion to numeric systems with

> > non-power-of-two base). In specific case of L+T=4 all algorithms

> > that I tried so far were of later variety.

>

> I can see how that could work, but I would expect it to be too big

> for what you want in your FPGAs?

>

The logical cores of circuits that do such transforms at a rate of one bit per clock are rather small. A converter from binary to a sub-binary base is very similar to an integer divider, and the reverse converter is very similar to an integer multiplier. Both belong to the class of shift-and-add algorithms.

The difference from DIV/MUL is that in DIV/MUL one of the arguments remains in a register, whereas in numeric-system transforms this argument is delivered from a look-up table. The compactness of both circuits in an FPGA depends mainly on the depth and width of the look-up tables.
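As an aside for readers unfamiliar with the idea: the best-known example of bounding '1' runs via a non-power-of-two number system is the Zeckendorf (Fibonacci-base) representation, where a greedy choice of Fibonacci weights yields a digit string with no two adjacent '1' bits. A minimal sketch - illustrative only, the function name is mine and the codes discussed in this thread use different weight tables:

```c
/* Greedy Zeckendorf encoding: writes n as a sum of non-consecutive
   Fibonacci numbers (1, 2, 3, 5, 8, ...).  The resulting bit mask maps
   fib[i] -> bit i and, by construction, has no two adjacent '1' bits.
   Intended for small (e.g. 16-bit) n; weights are built up to n. */
unsigned zeckendorf( unsigned n ){
    unsigned fib[64];
    int k = 0;
    unsigned a = 1, b = 2;
    while( a <= n ){                    /* build the weight table */
        fib[k++] = a;
        unsigned t = a + b; a = b; b = t;
    }
    unsigned r = 0;
    for( int i = k - 1; i >= 0; i-- )   /* greedy: largest weight first */
        if( fib[i] <= n ){
            n -= fib[i];
            r |= 1u << i;
        }
    return r;
}
```

For example, zeckendorf(7) gives 0b1010 (5 + 2). Of course this expands words far more than the 16b-to-17b target here; it only illustrates the arithmetic flavour of the approach.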

> > FPGAs, except of *very* old ones, are actually quite good at

> > addition and subtraction as long as your adder is not too wide.

> > 16-20 bits are always o.k.

> >

> >

> >> But it does all sound somewhat arbitrary here. Why do you pick

> >> these specific numbers? It seems you are not concerned about

> >> errors other than synchronisation (presumably you handle any

> >> required error checking or correcting at a higher level). Since

> >> you have a common clock source, are we talking about a pair of

> >> FPGAs on the same board here?

> >>

> >

> > Different boards in the same relatively small rack.

> > Transmitter size is not important, because there is only one

> > transmitter per (data acquisition) board.

> > Receiver size is relatively important, because there are many

> > receivers per (concentrator) board.

> >

>

> Interesting - that might allow for alternative ideas. So you'd be

> okay with needing something like lookup tables for the transmission

> encoding, as long as reception decoding can be small? (I had been

> assuming that you have more symmetrical hardware requirements.)

>

I guess I would be ok with that.

But by now I found a solution that is sufficiently compact on both

sides.

Dec 10, 2023, 1:46:08 PM

Michael S <already...@yahoo.com> writes:

> On Fri, 08 Dec 2023 22:10:06 -0800

> Tim Rentsch <tr.1...@z991.linuxsc.com> wrote:

[...]

Excellent!

The regularity of values in bv_tab[] makes it possible to get rid

of the array entirely. Here is a version of the encode function

that produces identical values to enc() above. Please ignore

minor differences in coding style; the important things to focus

on are the variables 'm' and 'd', with successive values of 'm'

taking on values of elements from the bv_tab[] array.

unsigned
encode_without_bvtab( unsigned x0 ){
    unsigned x = x0 < 1<<12 ? x0 + 0x10000 : x0;
    unsigned r = 0;
    unsigned m = 46;
    unsigned d = 011777; // note that this value is in octal
    for( int i = 0; i < 17; i++, x += x, d >>= 1 ){
        r <<= 1;
        if( x >= m*1024 ) x -= m*1024, r |= 1;
        m += (d&1) + (d==1)*7;
    }
    return r;
}

A similar transformation can be made to the dec() decoding function.
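For what it's worth, here is one way that transformation might look. The function name and formulation below are mine, not from the thread: it mirrors the same m/d weight schedule as encode_without_bvtab() above, but accumulates a weighted sum of the incoming bits instead of subtracting thresholds:

```c
/* Sketch of an inverse of encode_without_bvtab() above (name mine).
   y accumulates the sum of b_i * m_i * 2^(16-i) over the 17 input
   bits, using the same m/d recurrence as the encoder; y >> 6 undoes
   the *1024 scaling, and & 0xFFFF drops the +0x10000 offset that the
   encoder applies to small inputs. */
unsigned
decode_without_bvtab( unsigned r ){
    unsigned y = 0;
    unsigned m = 46;
    unsigned d = 011777;                  // octal, as in the encoder
    for( int i = 0; i < 17; i++, d >>= 1 ){
        y += y;                           // shift the accumulated sum
        if( r & (1u << (16-i)) ) y += m;  // add this bit's weight
        m += (d&1) + (d==1)*7;            // same schedule as the encoder
    }
    return (y >> 6) & 0xFFFF;
}
```

For example, encode_without_bvtab(0x8000) is 43960, and decode_without_bvtab(43960) gives back 0x8000.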

Dec 11, 2023, 2:59:55 AM

very wrong. But with a digital scrambler, you are xor'ing the data

sequence with a pseudo-random generated sequence, giving a result that

is statistically noise with very close to 50% one/zero ratio.

To get the data back, you simply scramble again with the same

pseudo-random sequence. As long as your data and pseudo-random

sequences are synchronised, it's all easy - scrambling and unscrambling

are the same thing.

At least, that's the theory as far as I understand it.

(Real data is always biased in its original form - typically far more

zeros than ones. Though if the data is encrypted or compressed, it will

be much more like unbiased noise.)

None of this helps find a scrambling algorithm or polynomial that meets

your needs, but I /think/ it means it would work well if you found such

a scrambler.
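To make the mechanism concrete, here is a minimal additive (synchronous) scrambler sketch. The PRBS7 polynomial x^7 + x^6 + 1 and all the names here are my own illustrative choices, not something proposed in the thread:

```c
#include <stdint.h>

/* Additive (synchronous) scrambler sketch: XOR the data bit stream
   with the PRBS7 sequence generated by the LFSR x^7 + x^6 + 1.
   Because it is a plain XOR, running the same function again with the
   same seed descrambles - both sides only need synchronised state. */
void scramble_bits( uint8_t *bits, int n, unsigned seed ){
    unsigned lfsr = seed & 0x7F;               /* 7-bit state, must be nonzero */
    for( int i = 0; i < n; i++ ){
        unsigned prbs = (lfsr >> 6) & 1;       /* next pseudo-random bit */
        unsigned fb = ((lfsr >> 6) ^ (lfsr >> 5)) & 1;  /* taps 7 and 6 */
        lfsr = ((lfsr << 1) | fb) & 0x7F;
        bits[i] ^= prbs;                       /* one data bit per step */
    }
}
```

Applying scramble_bits() twice with the same seed returns the original data, which is exactly the "scrambling and unscrambling are the same thing" property described above.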

>

>>> So far all my approaches were either logical (chained symbol

>>> substitution) or arithmetic (conversion to numeric systems with

>>> non-power-of-two base). In specific case of L+T=4 all algorithms

>>> that I tried so far were of later variety.

>>

>> I can see how that could work, but I would expect it to be too big

>> for what you want in your FPGAs?

>>

>

> A logical cores of circuits that do such transforms at rate one bit

> per clock are rather small. Converter from binary to sub-binary base

> is very similar to integer divider and reverse converter is very

> similar to integer multiplier. Both of them belong to class of

> shift-and-add algorithms.

multiplication, which is often supported single-cycle in hardware blocks of FPGAs) but of course when you don't need more than one bit out per clock, it's not that bad.

Dec 11, 2023, 5:36:08 PM

On Mon, 11 Dec 2023 08:59:49 +0100

A scrambler of this type is guaranteed not to work at all for my purpose. All it achieves is making the output string look like a random sequence. And a random sequence has roughly a 35% probability of containing 4 consecutive '1' bits per 16-bit or 17-bit word - way above a 50% probability for my smallest 64-bit packet.

A scrambler like that cannot even take advantage of the additional bit that I am willing to add per every 16 bits.

It is believable that scramblers of this sort are great for reducing the probability of *long* bursts of '1's (or zeros) down to insignificance. Say, bursts of 40 repetitions, or at least of 25. But not of 4 or 5 or 10. Not even of 16.
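The per-word figure is easy to check by brute force. The sketch below (names mine) counts 16-bit words containing a run of four or more consecutive '1' bits; it reports 25888 of 65536, i.e. about 40% when runs spanning word boundaries are ignored - the same ballpark as the estimate above:

```c
/* Count 16-bit words that contain a run of >= 4 consecutive '1' bits.
   w & (w>>1) & (w>>2) & (w>>3) is nonzero exactly when some bit
   position starts four '1's in a row.  Runs continuing across word
   boundaries are not counted, so the streamwise probability is higher. */
unsigned count_words_with_run4( void ){
    unsigned count = 0;
    for( unsigned w = 0; w < 0x10000; w++ )
        if( w & (w >> 1) & (w >> 2) & (w >> 3) )
            count++;
    return count;   /* 25888 */
}
```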

> >

> >>> So far all my approaches were either logical (chained symbol

> >>> substitution) or arithmetic (conversion to numeric systems with

> >>> non-power-of-two base). In specific case of L+T=4 all algorithms

> >>> that I tried so far were of later variety.

> >>

> >> I can see how that could work, but I would expect it to be too big

> >> for what you want in your FPGAs?

> >>

> >

> > A logical cores of circuits that do such transforms at rate one bit

> > per clock are rather small. Converter from binary to sub-binary base

> > is very similar to integer divider and reverse converter is very

> > similar to integer multiplier. Both of them belong to class of

> > shift-and-add algorithms.

>

> I keep thinking of division as being big and/or slow (in comparison

> to multiplication, which is often supported single-cycle in hardware

> blocks of FPGAs) but of course when you don't need more than one bit

> out per clock, it's not that bad.

>

In an FPGA, at 1 bit per clock and without using HW multipliers (which are of little use at 1 bit per clock anyway), I am not at all sure that a multiplier would be smaller than a divider. More likely, the divider would be smaller, but a little (a few percent) slower in terms of achievable clock frequency.
