SP framing draft


Martin Sustrik

Aug 17, 2011, 4:19:52 AM
to sp-discu...@googlegroups.com
Hi all,

I've assembled an SP framing I-D which can be found here:

https://raw.github.com/sustrik/sp-docs/master/draft-sustrik-spframing.txt

It's by no means finished; it's meant rather as a starting point for
the discussion.

It contains the requirements for the SP framing layer as discussed on this
list. As proposed by Tony Garnock-Jones, the requirements are separated
into hard requirements and "additional considerations".

Furthermore, there's a proposal for framing on top of TCP, SCTP and UDP
multicast (the latter based on experience with the PGM transport in
0MQ), which seems to be quite a representative sample of underlying
protocols. Optionally, we can add an unreliable unicast transport such
as DCCP.

Comments are welcome!
Martin

Martin Sustrik

Aug 17, 2011, 4:25:59 AM
to sp-discu...@googlegroups.com
On 08/17/2011 10:19 AM, Martin Sustrik wrote:

> I've assembled an SP framing I-D which can be found here:
>
> https://raw.github.com/sustrik/sp-docs/master/draft-sustrik-spframing.txt

I forgot to mention that I would really appreciate it if anyone with HW
experience would look at the data formats and point out any problems
that may occur in HW processing of them.

Martin

Alexis Richardson

Aug 17, 2011, 11:12:24 AM
to sp-discu...@googlegroups.com

Sanjay is having a look.


> Martin
>
> --
> Note Well: This discussion group is meant to become an IETF working group in
> the future. Thus, the posts to this discussion should comply with IETF
> contribution policy as explained here:
> http://www.ietf.org/about/note-well.html
>

Martin Sustrik

Aug 17, 2011, 11:17:21 AM
to sp-discu...@googlegroups.com, Alexis Richardson
On 08/17/2011 05:12 PM, Alexis Richardson wrote:

> Sanjay is having a look.

Great!

AFAICS UDP multicast framing can add some padding to ensure that message
headers are aligned to 32 bits.

In theory, TCP framing could do something similar to help DPI; however,
given that re-fragmentation of TCP packets can occur at basically any
level of the stack, I guess there's no way to keep the messages aligned
all the way through the network.
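The alignment padding Martin mentions is a simple modular computation; a minimal sketch (the 32-bit boundary is the only assumption taken from the discussion):

```python
def pad_to_32bits(offset: int) -> int:
    """Bytes of padding needed so the next message header starts
    on a 32-bit boundary within the packet."""
    return (4 - offset % 4) % 4

# e.g. a header that would start at byte offset 6 needs 2 padding bytes,
# while offset 8 is already aligned and needs none
assert pad_to_32bits(6) == 2
assert pad_to_32bits(8) == 0
```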

Martin

Pieter Hintjens

Aug 17, 2011, 5:15:33 PM
to sp-discu...@googlegroups.com
Nice document, Martin.

For what it's worth, there is a UDP unicast protocol running as an
extension to 0MQ; the WLP is minimal and documented here:
https://github.com/imatix/vtx/blob/master/v3/vtx_udp.c

If you want to cover this, I'm willing to write it up as suitable
text. The UDP unicast driver is used for discovery and name resolution
and supports the basic 0MQ socket types (2.1 style, not 3.0 style).

-Pieter

Sanjay Aiyagari

Aug 17, 2011, 8:41:39 PM
to sp-discu...@googlegroups.com, Alexis Richardson
Hi Martin,

Thanks for writing up this document. I think you have illustrated a
lot of good concepts in the documents you have written in the sp-docs
tree.

You asked to evaluate this from a hardware perspective. The use cases
of hardware-assisted messaging fall into these categories (in
increasing order of difficulty):

1. Monitoring
2. Transforming (encryption/compression/conversion)
3. Mapping network features to SP features
4. Routing/redirection

We should decide which of these use cases we want to address with SP
before we assess it from a hardware perspective.

For monitoring, it should be possible to identify the number of
complete messages sent and received on an SP connection at a very,
very low cost, such that one would likely leave the counter on all the
time. To achieve this it would be nice to have a "number of new
messages in this packet" count, so that monitoring becomes a simple
addition of a counter field sitting at a fixed offset.
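The fixed-offset count Sanjay describes makes the monitoring loop trivial. A sketch, assuming a hypothetical header in which a 16-bit big-endian count sits at byte offset 4 (neither the offset nor the width is specified in the thread):

```python
import struct

MSG_COUNT_OFFSET = 4  # hypothetical fixed offset of the 16-bit count

def messages_in_packet(packet: bytes) -> int:
    """Read the per-packet message count without parsing any framing."""
    (count,) = struct.unpack_from(">H", packet, MSG_COUNT_OFFSET)
    return count

# a monitoring counter then becomes one addition per packet
total = 0
for pkt in (b"\x00\x00\x00\x00\x00\x03rest", b"\x00\x00\x00\x00\x00\x01rest"):
    total += messages_in_packet(pkt)
assert total == 4
```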

Bonus - even better if counts can be per-SP-application. Nothing is
currently stated about multiple SP applications running -- do you
intend to simply run these on separate ports? Or would this only be
possible using a multi-streaming protocol like SCTP?

Transforming - a hardware engine needs the number of bytes available
to transform now, not at some unspecified time in the future. So a
fragment-sequence-plus-size-of-fragment approach would be better than
the "eventual size of message" approach used here. Otherwise you may
not achieve your zero-copy goals because everything would need to be
copied anyway.
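A sketch of what the fragment-sequence-plus-fragment-size framing Sanjay suggests might look like (all field widths here are hypothetical, chosen only for illustration):

```python
import struct

# hypothetical fragment header: 32-bit fragment sequence number,
# 32-bit size of *this* fragment (not the eventual message size),
# so a transform engine always knows how many bytes it has right now
FRAG_HDR = struct.Struct(">II")

def make_fragment(seq: int, payload: bytes) -> bytes:
    return FRAG_HDR.pack(seq, len(payload)) + payload

def parse_fragment(data: bytes):
    seq, size = FRAG_HDR.unpack_from(data, 0)
    body = data[FRAG_HDR.size:FRAG_HDR.size + size]
    return seq, body

frag = make_fragment(7, b"encrypt-me")
assert parse_fragment(frag) == (7, b"encrypt-me")
```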

Mapping - for example setting QoS of the packet based on some property
of the message. Message start needs to be easily accessible - this is
not ideal with the offset being valid very deep into the packet. If
we think this is valuable, then we need to think more deeply on this
one.

Routing/redirection - This is probably the hardest use case to ensure
and it means a lot of tradeoffs - so it would be good to understand
what you expect to see in SP - the sizes of messages you think will
actually be routed in hardware, etc. I don't think the current
framing lends itself to this use case very well because there is no
guarantee of alignment of packets to messages - i.e. it is legal to
put an offset of 9000 in a jumbo packet and I don't see a hardware
engine buffering that to make a routing decision. This has to be
traded off with the use of more packets to enable hardware based
routing.

Some other random comments on the document:

In 1.3, you mention segmentation -- it's not ideal for SCTP to do its
own segmentation for messages up to 2^64 in size, especially if you
are using the "multiple streams" capability. The reason is that this
results in head-of-line blocking since segmented messages cannot be
interleaved in SCTP. If you make a statement explicitly not to use
the multiple streams capability, then it is possible this is OK, but I
can't be certain -- would have to refresh myself on implications of
SSNs vs TSNs in SCTP.

In 4.1 is there any consideration given as to how to deal with large
message transfer in the client? Simplistic client implementations
could be blocking for hours on a large message, leaving their users
staring at a frozen screen. At least an encouragement of
multithreaded programming here would be useful. Or are we intending
to use something like SIP at a higher layer to "connect" a control
stream to one or more data streams?

In 4.3 you could possibly acknowledge that path MTU discovery is
relatively common (since it's required for IPv6 anyway) and therefore
the MTU should be known in advance even if not statically configured
for the entire network. I found an interesting analysis of this:
http://ripe60.ripe.net/presentations/Stasiewicz-Measurements_of_IPv6_Path_MTU_Discovery_Behaviour.pdf

It would also be nice to ensure that the "size" field is atomic and
not itself split across packets. Using the same argument as you did
for the 64-bit size to begin with, one can also argue that this would
not significantly waste bandwidth.

Is there a reason to start at sequence number 0? TCP used to work
this way until people figured out how to hijack connections -- now the
sequence numbers are chosen randomly.
--
Sanjay

Martin Sustrik

Aug 18, 2011, 2:33:27 AM
to sp-discu...@googlegroups.com, Pieter Hintjens
Hi Pieter,

> For what it's worth there is a UDP unicast protocol running as an
> extension to 0MQ, the WLP is minimal and documented here:
> https://github.com/imatix/vtx/blob/master/v3/vtx_udp.c
>
> If you want to cover this, I'm willing to write it up as suitable
> text. The UDP unicast driver is used for discovery and name resolution
> and supports the basic 0MQ socket types (2.1 style, not 3.0 style).

AFAIU the framing is done here:

NOM = version flags %b0111 sequence zmq-payload
sequence = 4BIT ; Request sequencing
zmq-payload = 1*zmq-frame
zmq-frame = tiny-frame / short-frame / long-frame
tiny-frame = 1OCTET frame-body
short-frame = %xFE 2OCTET frame-body
long-frame = %xFF 4OCTET frame-body
frame-body = *OCTET
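Assuming the leading octet(s) of each zmq-frame carry the body length in network byte order (the grammar above doesn't spell this out), the frame-size logic decodes roughly like this:

```python
import struct

def read_frame(data: bytes, pos: int = 0):
    """Decode one zmq-frame per the grammar above:
    tiny-frame  = one length octet (< 0xFE) + body
    short-frame = 0xFE + 2-octet length + body
    long-frame  = 0xFF + 4-octet length + body
    Returns (body, position of the next frame)."""
    first = data[pos]
    if first == 0xFF:                       # long-frame
        (size,) = struct.unpack_from(">I", data, pos + 1)
        start = pos + 5
    elif first == 0xFE:                     # short-frame
        (size,) = struct.unpack_from(">H", data, pos + 1)
        start = pos + 3
    else:                                   # tiny-frame
        size, start = first, pos + 1
    return data[start:start + size], start + size

body, nxt = read_frame(b"\x05hello")
assert body == b"hello" and nxt == 6
```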

Some comments:

1. Note the assumptions made for UDP multicast framing (single LAN etc.).
Those assumptions can be made because most multicast usage is on a LAN
anyway. With UDP unicast this is not the case and such restrictions are
overly prohibitive. Thinking beyond the scope of a single LAN then brings
additional concerns such as MTU discovery, congestion control etc.
That's why it's preferable to build unreliable unicast on top of some
existing solution that handles these problems. DCCP seems to be the
best option at the moment.

2. Why a 4-bit sequence number? With only 16 values, any reordering that
happens to shift a packet by 16 positions will result in broken messages.

3. The tiny/short/long size field doesn't seem to be necessary. See the
comment in the TCP framing section of the framing I-D.

Martin

Martin Sustrik

Aug 18, 2011, 2:42:18 AM
to sp-discu...@googlegroups.com, Pieter Hintjens
On 08/18/2011 08:33 AM, Martin Sustrik wrote:

> AFAIU the framing is done here:
>
> NOM = version flags %b0111 sequence zmq-payload
> sequence = 4BIT ; Request sequencing
> zmq-payload = 1*zmq-frame
> zmq-frame = tiny-frame / short-frame / long-frame
> tiny-frame = 1OCTET frame-body
> short-frame = %xFE 2OCTET frame-body
> long-frame = %xFF 4OCTET frame-body
> frame-body = *OCTET

One additional thought: From the spec it's not clear how the messages
are aligned with UDP packets.

Martin

Martin Sustrik

Aug 19, 2011, 2:00:28 AM
to sp-discu...@googlegroups.com, Sanjay Aiyagari, Alexis Richardson
Hi Sanjay,

> You asked to evaluate this from a hardware perspective. The use cases
> of hardware-assisted messaging fall into these categories (in
> increasing order of difficulty):
>
> 1. Monitoring
> 2. Transforming (encryption/compression/conversion)
> 3. Mapping network features to SP features
> 4. Routing/redirection
>
> We should decide which of these use cases we want to address with SP
> before we decide to assess it from a hardware perspective.

Yes. Two notes though:

1. In addition to addressing specific use cases we should strive to make
the parsing as simple as possible so that even the use cases we don't
consider at the moment would be easier to implement.

2. We should clearly separate per-message use cases from per-application
use cases. The latter can be bound to a specific TCP port (or something
similar), which makes them visible even to existing network equipment.
The former is more complex and requires new equipment (FPGAs etc.).
Thus, I would say, when in doubt, opt for per-application behaviour.

> For monitoring, it should be possible to identify the number of
> complete messages sent and received on an SP connection, with very,
> very low cost, such that one would likely leave the counter on all the
> time. To achieve this it would be nice to have a message count of
> "number of new messages in the packet" so it becomes a simple addition
> of a counter field which is at a fixed offset.

This can't be done with TCP. First, applications have no API to
actually craft TCP packets; second, TCP packets can be re-fragmented on
the way, breaking the packet layout.

It can't be done with SCTP either. Applications have no API to craft
SCTP packets themselves, and the SCTP packet header doesn't seem to
contain a count of the messages in the packet.

It can be done for UDP multicast. Currently the header is 6 bytes long.
There are 2 bytes remaining to make the header nicely aligned which can
be used to convey message count.
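A sketch of what the resulting 8-byte header might look like. The field layout here is guessed from the discussion (a 4-byte sequence number plus a 2-byte offset making up the current 6 bytes), not taken from the draft:

```python
import struct

# hypothetical 8-byte UDP multicast packet header:
# 32-bit sequence number, 16-bit offset of the first message header,
# 16-bit count of new messages in the packet (the 2 spare bytes)
PKT_HDR = struct.Struct(">IHH")

def pack_header(seq: int, first_msg_offset: int, msg_count: int) -> bytes:
    return PKT_HDR.pack(seq, first_msg_offset, msg_count)

hdr = pack_header(seq=1, first_msg_offset=8, msg_count=3)
assert len(hdr) == 8                      # header stays 32-bit aligned
assert PKT_HDR.unpack(hdr) == (1, 8, 3)
```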

> Bonus - even better if counts can be per-SP-application. Nothing is
> currently stated about multiple SP applications running -- do you
> intend to simply run these on separate ports? Or would this only be
> possible using a multi-streaming protocol like SCTP?

Yes, different applications are meant to run on different ports.

I've written a theoretical whitepaper where I explain the concept.
Basically, an application is split into "topologies", where a "topology"
is a graph of nodes sharing the same business logic and communicating
using the same wire-level protocol. Each topology could run on a
different port, thus making the monitoring relatively simple. Actually,
if you are interested in byte counts rather than message counts, you can
perform the monitoring even with today's network equipment.

(http://www.250bpm.com/concepts#toc1)

> Transforming - a hardware engine needs the number of bytes available
> to transform now, not at some unspecified time in the future. So a
> fragment sequence + size of fragment approach would be better than
> "eventual size of message" approach used here. Otherwise you may not
> achieve your goals around zero-copy because everything would need to
> be copied anyway.

I guess I don't follow. Can you explain it in more detail, please?

> Mapping - for example setting QoS of the packet based on some property
> of the message. Message start needs to be easily accessible - this is
> not ideal with the offset being valid very deep into the packet. If
> we think this is valuable, then we need to think more deeply on this
> one.

Is there a use case for setting QoS on a per-message basis? I would say
that QoS is more of a per-application (topology) concern.

Moreover, it's not clear how to combine per-message QoS with message
bundling.

> Routing/redirection - This is probably the hardest use case to ensure
> and it means a lot of tradeoffs - so it would be good to understand
> what you expect to see in SP - the sizes of messages you think will
> actually be routed in hardware, etc. I don't think the current
> framing lends itself to this use case very well because there is no
> guarantee of alignment of packets to messages - i.e. it is legal to
> put an offset of 9000 in a jumbo packet and I don't see a hardware
> engine buffering that to make a routing decision. This has to be
> traded off with the use of more packets to enable hardware based
> routing.

I see.

The UDP multicast design was aimed at market-data scenario, where
messages have to be bundled in a packet to achieve high throughput.

With TCP and SCTP the message header can occur at any place in the
packet, which makes them ill-suited to hardware routing as well.

Maybe, if we want to experiment with HW routing, we can give it a try
with DCCP framing?

> In 1.3, you mention segmentation -- it's not ideal for SCTP to do its
> own segmentation for messages up to 2^64 in size, especially if you
> are using the "multiple streams" capability. The reason is that this
> results in head-of-line blocking since segmented messages cannot be
> interleaved in SCTP. If you make a statement explicitly not to use
> the multiple streams capability, then it is possible this is OK, but I
> can't be certain -- would have to refresh myself on implications of
> SSNs vs TSNs in SCTP.

Ugh! I had forgotten about that.

Anyway, I would say that SCTP itself suffers from this problem and thus
it would be reasonable if SP/SCTP exhibited the same behaviour.

See beginning of section 3:

Framing layer should respect the design decisions made in the
underlying protocol and not try to fix the -- whether real or
perceived -- problems in the protocol.

At some point in the future we may be forced to compensate for
underlying protocol deficiencies on the SP level, however, for now I
would rather try to keep it clean from that and focus on the messaging
functionality proper.

> In 4.1 is there any consideration given as to how to deal with large
> message transfer in the client? Simplistic client implementations
> could be blocking for hours on a large message, leaving their users
> staring at a frozen screen. At least an encouragement of
> multithreaded programming here would be useful. Or are we intending
> to use something like SIP at a higher layer to "connect" a control
> stream to one or more data streams?

That's an interesting question. It seems to me that the atomicity
principle and "notify about the progress of an incoming message" are in
direct contradiction.

However, from the user's point of view, receiving a large message should
look exactly the same as if there were no message at all. For example,
if you call receive() and the call blocks, it can be either because
there's no message or because a large message is being received in the
background.

In a more sophisticated way, you can have a file descriptor that you can
poll on. The descriptor can then signal a POLLIN event when there's a
complete message to be received.
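The pollable-descriptor idea can be illustrated with an ordinary pipe standing in for the SP socket's signalling fd (illustrative pattern only, not the actual SP API):

```python
import os, select, threading, time

# internal signalling pipe: a background receiver thread writes one byte
# only once a *complete* message has been reassembled; the user polls the
# read end, so a half-received large message never wakes them up
r, w = os.pipe()
messages = []

def receiver():
    time.sleep(0.1)          # stand-in for receiving a large message
    messages.append(b"big message")
    os.write(w, b"\x01")     # raise POLLIN only now

t = threading.Thread(target=receiver)
t.start()

p = select.poll()
p.register(r, select.POLLIN)
events = p.poll(5000)        # the user can poll this fd in their own loop
os.read(r, 1)                # consume the signal
t.join()
assert events and messages == [b"big message"]
```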

> In 4.3 you could possibly acknowledge that path MTU discovery is
> relatively common (since it's required for IPv6 anyway) and therefore
> the MTU should be known in advance even if not statically configured
> for the entire network. I found an interesting analysis of this:
> http://ripe60.ripe.net/presentations/Stasiewicz-Measurements_of_IPv6_Path_MTU_Discovery_Behaviour.pdf

Would you like to propose the exact wording?

Nice presentation BTW.

> It would also be nice to ensure that the "size" field is atomic and
> not itself split across packets. Using the same argument as you did
> for the 64-bit size to begin with, one can also argue that this would
> not significantly waste bandwidth.

Yes. That's a good idea. Would it also be helpful if it were aligned to
32 bits or some such?

> Is there a reason to start at sequence number 0? TCP used to work
> this way until people figured out how to hijack connections -- now the
> sequence numbers are chosen randomly.

Hm, I hadn't thought of that. At the beginning of the section there's a
statement that the proposed framing would work only on a LAN though,
which makes the hijacking scenario relatively improbable AFAICT.

The original motivation for starting at 0 was that the only situation
where a new instance of the sender can accidentally generate a sequence
number following the last sequence number issued by the previous
instance -- and thus corrupt the messages -- is if the first instance
had already sent precisely 2^32 messages, wrapped the sequence number
back to 0, and then failed. Given that most applications don't even
send that many messages, it's pretty much impossible. On the other
hand, starting with a random number would make one in every 2^32
restarts accidentally fit into the sequence of the previous sender
instance and corrupt the messages. Improbable, but possible.

However, I had totally forgotten about dropped and re-ordered packets.
Taking those into account, if the first instance of the sender dies
after sending one packet, the next instance may accidentally fit into
the previous sequence -- if its first packet is either dropped or
re-ordered.

So, I guess, starting with a random number is better.
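A sketch of the random-start approach, with wraparound-safe sequence comparison (serial-number arithmetic in the style of RFC 1982):

```python
import secrets

def initial_seq() -> int:
    """Random 32-bit initial sequence number (as TCP does for its ISN)."""
    return secrets.randbits(32)

def seq_delta(a: int, b: int) -> int:
    """Signed distance from a to b modulo 2^32, so comparisons keep
    working across the 2^32 -> 0 wraparound."""
    return ((b - a + 2**31) % 2**32) - 2**31

assert seq_delta(0xFFFFFFFF, 0) == 1     # wraparound handled
assert seq_delta(5, 3) == -2
```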

Martin

Tony Garnock-Jones

Aug 22, 2011, 11:51:23 AM
to sp-discu...@googlegroups.com
On 17 August 2011 04:19, Martin Sustrik <sus...@250bpm.com> wrote:
I've assembled an SP framing I-D which can be found here:
https://raw.github.com/sustrik/sp-docs/master/draft-sustrik-spframing.txt

One minor quibble: you're calling a 64-bit binary length prefixed blob a "netstring", but I'd always understood netstrings to be as defined by djb here: http://cr.yp.to/proto/netstrings.txt

Catching up on the newer messages in the group now.

Tony

Tony Garnock-Jones

Aug 22, 2011, 1:04:30 PM
to sp-discu...@googlegroups.com
On 17 August 2011 04:19, Martin Sustrik <sus...@250bpm.com> wrote:
I've assembled an SP framing I-D which can be found here:
https://raw.github.com/sustrik/sp-docs/master/draft-sustrik-spframing.txt

Do you think writing down a mapping to WebSockets (such as they are) would be useful? It'd be nice to have a transport usable for communicating with browser instances.

Regards,
  Tony

Martin Sustrik

Aug 22, 2011, 1:07:58 PM
to sp-discu...@googlegroups.com, Tony Garnock-Jones
Hi Tony,

> Do you think writing down a mapping to WebSockets (such as they are)
> would be useful? It'd be nice to have a transport usable for
> communicating with browser instances.

Definitely. If WebSockets catch on, it can become as important as the
TCP mapping. IIRC there's been a last call for the WebSockets RFC
recently, so the spec should be more or less stable. I have no
experience in the area though. Would you like to give it a try?

Martin

Martin Sustrik

Aug 23, 2011, 4:02:36 AM
to sp-discu...@googlegroups.com, Tony Garnock-Jones
On 08/22/2011 05:51 PM, Tony Garnock-Jones wrote:

> One minor quibble: you're calling a 64-bit binary length prefixed blob a
> "netstring", but I'd always understood netstrings to be as defined by
> djb here: http://cr.yp.to/proto/netstrings.txt

Oops. I didn't know there was a formal definition of netstring. Let's
call it a size-prefixed BLOB then?

Martin

Tony Garnock-Jones

Aug 25, 2011, 9:55:42 AM
to Martin Sustrik, sp-discu...@googlegroups.com
On 22 August 2011 13:07, Martin Sustrik <sus...@250bpm.com> wrote:
Definitely. If WebSockets catch on, it can become as important as the TCP mapping. IIRC there's been a last call for the WebSockets RFC recently, so the spec should be more or less stable. I have no experience in the area though. Would you like to give it a try?

I'm afraid I don't have the requisite experience either. Looking at the current draft, though, it's a frame-based binary-capable protocol supporting up to 64-bit message lengths. There's a bunch of noise around subprotocol selection and so forth that I wouldn't know whether to address in an SP mapping or not. It supports message fragmentation but that's irrelevant to SP so can be ignored.

Regards,
  Tony

Sanjay Aiyagari

Sep 23, 2011, 2:55:38 AM
to Martin Sustrik, sp-discu...@googlegroups.com, Alexis Richardson
Hi Martin,

I've been extremely busy over the last month at work and I'm just
catching up on things... sorry if some of these items have already
been covered, but I haven't seen anything else on this thread yet.

Per-application behavior is an interesting idea -- I think it makes
sense as long as you have some sort of control layer to associate the
applications with the actual ports used. Sort of like how SIP
associates multiple distinct streams (each possibly with their own
protocol and data format) to a single session, and the actual ports
used can be negotiated at run time. Come to think of it, we could
even use SIP as the control layer for this -- we would just need a
content-type for SP.

On the message count -- agreed on TCP -- we can't do anything about
that so it is what it is. With UDP it's easy as you suggest. I think
with SCTP it is still possible if you don't do batching -- will have
to think about this.

Transformations - let me crystallize this with the example of an
encryption engine. You need a size and a number of bytes to encrypt.
If the message is fragmented, then you need the size of each fragment.
If the intent is to use a hardware encryption engine, it would be
good to make sure that the data it would need is easily accessible.

QoS - yes, an application can have messages that are more important
than others. You might have application layer "heartbeat" messages
that need to be high priority because when they are lost the
application will think it needs to initiate failover to another node,
for example. I would say that bundling implies that the packet gets
the highest priority of all the messages bundled in the packet (i.e.
providing better QoS than you advertise is never a problem).

Routing - the market data scenario is point-to-point so routing
doesn't matter, right? Or are you trying to do routing to different
endpoints of the UDP packets? If the latter, we need to think about
how to make this achievable, especially if you are also doing batching
in which case it requires the intelligent fragmentation of the bundled
packets. In SCTP, yes it is possible for a message header to show up
at different places, but they are findable places that only require
state from the start of the packet -- not from the start of the whole
connection as with TCP. Also the common case is for there simply to
be a DATA segment inside the SCTP packet, so the average speed would
be predictable. Agreed that the extra logic means a switch would
probably not inspect SCTP, but a router might.
--
Sanjay

Martin Sustrik

Sep 30, 2011, 6:58:46 AM
to Sanjay Aiyagari, sp-discu...@googlegroups.com, Alexis Richardson
Hi Sanjay,

> I've been extremely busy over the last month at work and I'm just
> catching up on things...

Same thing here :)

> Per-application behavior is an interesting idea -- I think it makes
> sense as long as you have some sort of control layer to associate the
> applications with the actual ports used. Sort of like how SIP
> associates multiple distinct streams (each possibly with their own
> protocol and data format) to a single session, and the actual ports
> used can be negotiated at run time. Come to think of it, we could
> even use SIP as the control layer for this -- we would just need a
> content-type for SP.

Interesting. I'll have a look at SIP.

The important thing here is the clear separation of the data layer and
the control layer. Applications should not be aware that the topology
named "X" runs on port 5678. Actually, different parts of the topology
can use different ports. The point is that the application uses the
string "X" to refer to the topology, while the underlying layers take
care of associating the topology with a port number, monitoring it, etc.

I've tried to make this point in my recent talk at SkillsMatter:

http://skillsmatter.com/podcast/home/future-messaging/js-2728

Short summary: clear separation of the application interface and the
administration interface makes writing the apps easy for developers
while at the same time creating a market for administrative solutions &
services.

> On the message count -- agreed on TCP -- we can't do anything about
> that so it is what it is. With UDP it's easy as you suggest. I think
> with SCTP it is still possible if you don't do batching -- will have
> to think about this.

Agreed.

> Transformations - let me crystallize this with the example of an
> encryption engine. You need a size and a number of bytes to encrypt.
> If the message is fragmented, then you need the size of each fragment.
> If the intent is to use a hardware encryption engine, it would be
> good to make sure that the data it would need is easily accessible.

Ok. Understood.

> QoS - yes, an application can have messages that are more important
> than others. You might have application layer "heartbeat" messages
> that need to be high priority because when they are lost the
> application will think it needs to initiate failover to another node,
> for example. I would say that bundling implies that the packet gets
> the highest priority of all the messages bundled in the packet (i.e.
> providing better QoS than you advertise is never a problem).

I am not an expert, but is that kind of system used in the real world?
It just sounds a bit creaky. I mean things like certain low-priority
messages arriving at the destination faster because they happen to be
bundled with high-priority messages. That in turn may cause RX
reordering buffer overflow and result in disconnection etc.

> Routing - the market data scenario is point-to-point so routing
> doesn't matter, right? Or are you trying to do routing to different
> endpoints of the UDP packets? If the latter, we need to think about
> how to make this achievable, especially if you are also doing batching
> in which case it requires the intelligent fragmentation of the bundled
> packets. In SCTP, yes it is possible for a message header to show up
> at different places, but they are findable places that only require
> state from the start of the packet -- not from the start of the whole
> connection as with TCP. Also the common case is for there simply to
> be a DATA segment inside the SCTP packet, so the average speed would
> be predictable. Agreed that the extra logic means a switch would
> probably not inspect SCTP, but a router might.

I think we may be speaking about different layers here. What I meant was
routing at the SP level rather than L2/L3 routing.

Let's take an example of simple broker based pub/sub topology. Client A
publishes message "X". Clients C and E happen to be subscribed to "X"
while clients B and D are not.

What I meant by routing is the algorithm that transfers the message from
A to broker and then from broker to C and to E.

Martin
