Unlimited message size

Martin Sustrik

May 23, 2011, 3:50:18 AM
to sp-discu...@googlegroups.com
Hi all,

In the original framing requirements there was a call for providing
message segmentation where the underlying protocol doesn't provide it
itself:

"In case the message doesn't fit into L4 transport unit (such as packet)
the framing layer should segment the messages into appropriately sized
pieces."

This requirement seems to contradict the desire to keep the underlying
protocol's semantics untampered. Protocols like UDP have an inherent limit
on the size of data that can be sent as an atomic unit (the MTU) and this
should (?) be preserved.

By forcing the "large message" requirement on UDP we would force the
framing to add sequence numbers to packets so that the receiving side
can re-assemble the message correctly. What's worse, we would force
state per peer to be held on the receiver (at least the message being
assembled). Holding state per peer may not be acceptable for UDP
scenarios with a very large number of potential peers. But maybe the
problem could be solved by rapid expiration of the state...
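
To make it concrete, a rough sketch of what that would look like (the header format, the TTL and all names here are invented for illustration, not a proposal):

import struct
import time

MTU = 1500
HEADER = struct.Struct("!QHH")   # message id, fragment index, fragment count
PAYLOAD = MTU - HEADER.size

def fragment(msg_id, data):
    """Split one message into MTU-sized datagrams with sequence numbers."""
    chunks = [data[i:i + PAYLOAD] for i in range(0, len(data), PAYLOAD)] or [b""]
    return [HEADER.pack(msg_id, i, len(chunks)) + c for i, c in enumerate(chunks)]

class Reassembler:
    """Per-peer reassembly state, rapidly expired to keep memory bounded."""
    def __init__(self, ttl=2.0):
        self.ttl = ttl
        self.partial = {}   # (peer, msg id) -> (deadline, count, {index: chunk})

    def feed(self, peer, datagram):
        msg_id, index, count = HEADER.unpack_from(datagram)
        key = (peer, msg_id)
        deadline, _, chunks = self.partial.setdefault(
            key, (time.monotonic() + self.ttl, count, {}))
        chunks[index] = datagram[HEADER.size:]
        if len(chunks) == count:            # all fragments arrived
            del self.partial[key]
            return b"".join(chunks[i] for i in range(count))
        self.expire()
        return None

    def expire(self):
        """The rapid expiration mentioned above: drop stale partial messages."""
        now = time.monotonic()
        self.partial = {k: v for k, v in self.partial.items() if v[0] > now}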

On the other hand, if the "large message" requirement is not enforced,
the layers above would have to account for the limited message size (the
most typical MTU size being ~1500 bytes). We can get rid of the
responsibility for application data by transferring it to the SP user:
"When using UDP, never send a message exceeding the MTU size." However,
we would have to keep the SP message envelope within the MTU size as
well. The question is whether fixed-size SP envelopes are even possible,
specifically in cases where a message is transferred via an unlimited
number of intermediate nodes.

Thoughts?
Martin

brugeman

May 23, 2011, 6:52:45 AM
to sp-discu...@googlegroups.com
Hi,

+1 for setting no limit on message size. It feels that SP, being above the transport layer, should provide a new abstraction level. If I'm aware of UDP's limits, and I need messages to be delivered atomically and w/o overhead on receivers - I'll use UDP. If I need more abstraction, and interoperability with other transports, I'll need SP.

Also, non-limited message size does not mean that, say, UDP's semantics will _always_ be tampered with. If I know about UDP's pros and cons, and would like to avoid additional overhead - I'll use small messages, which will (should) behave like raw UDP.

So, if an application cares about some feature specific to a particular transport, it should use message sizes as required by that transport. But if what it cares about more is simplicity and interoperability between transports, it can use arbitrarily sized messages (and live with the additional overhead). That way one could use SP for all use cases.

Artur Brugeman

Gary Berger

May 23, 2011, 8:45:16 AM
to sp-discu...@googlegroups.com
The higher abstraction layers should not ignore the limitations of the underlying medium. One of the problems today is dealing with out-of-order packets and the reassembly buffers they require. This makes rearrangeable Clos networks difficult to design. The SP should help the developer design for certain network events such as congestion collapse, transient loops and variable delay.

It may be necessary to introduce some new semantics into the scalability protocol which operate with different characteristics. 

Types of Service
  • Edge driven - Primary use case is to populate local edge-caches for realtime entertainment (e.g. Netflix). Highly impacted by BW*Delay Product and AIMD congestion algorithms.
  • Peer-to-Peer Services - Modeled on multi-dimensional hypercubes (i.e. Skype, Gnutella) which make up the predominant traffic across the Internet core. Maybe SP provides a scalable overlay network if one chooses to build a next generation Skype.
  • Data Center Services which require tolerance of congestion spikes (i.e. Incast) due to NoSQL Query/Join execution frameworks, multiple geographically dispersed data center resources and "extreme" partitioning events. The tradeoffs here are typically around consistency vs. availability (i.e. weaker consistency models chosen in place of complete service outage)

I would also be curious whether SP can deal with the Location/Identity Split problem which is so pervasive today in mobile computing and data center resources. Dealing with this problem at the SP layer might provide a better approach than refactoring the protocol bindings (i.e. Egress-Interface->ARP->IP->Egress Interface) with alternatives like LISP, TRILL, etc.

tks


.:|:.:|:.  Gary Berger | Technical Leader, Engineering| Cisco Systems

--
Note Well: This discussion group is meant to become an IETF working group in the future. Thus, the posts to this discussion should comply with IETF contribution policy as explained here: http://www.ietf.org/about/note-well.html

Martin Sustrik

May 23, 2011, 4:16:43 PM
to sp-discu...@googlegroups.com, Gary Berger
Hi Gary,

Nice to have you here!

> The higher abstraction layers should not ignore the limitations of the
> underlying medium. One of the problems today is dealing with
> out-of-order packets and the reassembly buffers required.. This makes
> rearrangeable CLOS networks difficult to design.. The SP should help the
> developer design for certain network events such as congestion collapse,
> transient loops and variable delay..

Congestion collapse: My suggestion would be to rely on CC algorithms as
provided by underlying layers. SP is meant to live more or less at L5,
way above where you manage congestion control. However, the messaging
patterns it provides should have clearly defined behaviour when
congestion hits (pushback, message dropping etc.)
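
To illustrate what "clearly defined behaviour" could mean per pattern, a toy sketch (not the 0MQ or SP API; all names are invented):

from collections import deque

class Pipe:
    """A bounded queue standing in for one peer's outgoing buffer."""
    def __init__(self, hwm=1000):
        self.hwm, self.q = hwm, deque()

    def full(self):
        return len(self.q) >= self.hwm

def send_push(pipe, msg):
    """Push-style pattern: apply pushback rather than lose work."""
    if pipe.full():
        return False            # caller must retry later (backpressure)
    pipe.q.append(msg)
    return True

def send_pub(pipe, msg):
    """Pub/sub-style pattern: drop for slow subscribers, never stall."""
    if not pipe.full():
        pipe.q.append(msg)
    return True                 # publisher never blocks; overflow is dropped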

Transient loops: There was absolutely no research in this area. For
example, with 0MQ we currently assume that there are no loops. Which is
not viable in the long run, obviously.

Variable delay: You mean large latency spread, right? One thing often
done on messaging level that is heavily impacted by this is
heartbeating. Still, I believe the heartbeating should be done on L4 (as
SCTP does it), not in the scalability layer (we have no roundtrip time
estimates or any other metrics to drive the heartbeat algorithm above
the L4).

> It may be necessary to introduce some new semantics into the scalability
> protocol which operate with different characteristics.
>
> Types of Service
>
> * Edge driven - Primary use case is to populate local edge-caches
>   for realtime entertainment (e.g. Netflix). Highly impacted by
>   BW*Delay Product and AIMD congestion algorithms.
> * Peer-to-Peer Services - Modeled on multi-dimensional hypercubes
>   (i.e. Skype, Gnutella) which make up the predominant traffic across
>   the Internet core. Maybe SP provides a scalable overlay network if
>   one chooses to build a next generation Skype.
> * Data Center Services which require tolerance of congestion
>   spikes (i.e. Incast) due to NoSQL Query/Join execution frameworks,
>   multiple geographically dispersed data center resources and
>   "extreme" partitioning events. The tradeoffs here are typically
>   around consistency vs. availability (i.e. weaker consistency
>   models chosen in place of complete service outage)

I would say we have to dig further into these use cases to find out how
they interact with the scalability layer. Can you elaborate? Providing
pointers to existing work/analysis would be helpful as well.

> I would also be curious if SP can deal with the Location/Identity Split
> problem which is so pervasive today in mobile computing and data center
> resources. Dealing with this problem at the SP layer might provide a
> better approach than refactoring the protocol bindings (I.e.
> Egress-Interface->ARP->IP->Egress Interface) with alternatives like
> LISP, TRILL, etc...

For me it's not yet clear how scalability patterns and identity
interact. To use 0MQ as an example once again, we've deliberately
avoided the problem of identity/naming as too hard to solve at the
moment. With SP we will have to dig into it though. It's extremely
interesting stuff and I myself am quite curious about where it will lead us.

Martin

Martin Sustrik

May 23, 2011, 4:36:05 PM
to sp-discu...@googlegroups.com, Gary Berger
On 05/23/2011 02:45 PM, Gary Berger wrote:

> The higher abstraction layers should not ignore the limitations of the
> underlying medium.

The problem here, IMO, should rather be restated as: should SP rely on
the underlying layer to fragment large messages into chunks (as is the
case with TCP), or should it try to provide fragmentation itself if the
underlying protocol has limited transfer units (as is the case with UDP)?

The question seems to be pretty trivial, but it has some profound
implications.

Naive reasoning goes like this: SP should not provide fragmentation. If
you need to run SP on top of a protocol that does not support
fragmentation, build fragmentation on top of it.

The problem with this reasoning arises when SP metadata (message headers,
labels, whatever you prefer to call them) exceed the transfer unit.

This is bound to happen if every node on a path between endpoints is
allowed to add metadata to the message. However concise the metadata,
choose a path long enough and the transfer unit size will be exceeded.

Thus, to not provide fragmentation at the SP level we would have to be
very clear about not allowing the intermediary nodes to add metadata to
the message. Contrast this with, say, the current implementation of the
request/reply pattern in 0MQ, where the whole backwards path (all the
nodes on it) to the original requester is held in the request message.
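
To make the growth concrete, an illustrative sketch (not the actual 0MQ wire format): each intermediary prepends its label to the envelope, so the envelope grows linearly with the path length.

def forward_request(envelope, payload, hop_label):
    """An intermediary pushes its connection label for the return path."""
    return [hop_label] + envelope, payload

def route_reply(envelope, payload):
    """On the way back, each hop pops the label addressing the next one."""
    return envelope[0], envelope[1:], payload

# A request crossing three hops accumulates three labels:
env, data = [], b"request"
for label in (b"hop-a", b"hop-b", b"hop-c"):
    env, data = forward_request(env, data, label)
assert env == [b"hop-c", b"hop-b", b"hop-a"]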

The state has to be stored somewhere though. If it's not in the message,
it would have to be stored on intermediate nodes. This is technically
possible; however, the problem is that the amount of memory on the
intermediate nodes defines the total throughput of the system -- you
can't pass more messages than you are able to store info about.

This in turn limits the scalability of the system as a whole, which
seems to contradict the ultimate goal of the whole SP thing.

Martin

Gary Berger

May 23, 2011, 5:10:10 PM
to Martin Sustrik, sp-discu...@googlegroups.com

On 5/23/11 4:16 PM, "Martin Sustrik" <sus...@250bpm.com> wrote:

>Hi Gary,
>
>Nice to have you here!
>
>> The higher abstraction layers should not ignore the limitations of the
>> underlying medium. One of the problems today is dealing with
>> out-of-order packets and the reassembly buffers required.. This makes
>> rearrangeable CLOS networks difficult to design.. The SP should help the
>> developer design for certain network events such as congestion collapse,
>> transient loops and variable delay..
>
>Congestion collapse: My suggestion would be to rely on CC algorithms as
>provided by underlying layers. SP is meant to live more or less at L5,
>way above where you manage congestion control. However, the messaging
>patterns it provides should have clearly defined behaviour when
>congestion hits (pushback, message dropping etc.)

@gaberger: OK, congestion is intimately coupled with queue occupancy and
therefore linked to the size of the flow. "Long-lived, greedy TCP flows
will cause the length of the bottleneck queue to grow until packets are
dropped, resulting in the familiar sawtooth pattern" [DCTCP]. How would
SP deal with, say, back-pressure messages from QCN?


>Transient loops: There was absolutely no research in this area. For
>example, with 0MQ we currently assume that there are no loops. Which is
>not viable in the long run, obviously.

@gaberger: Since networks are getting dense (100K server instances in a
DC), you have a state-synchronization problem.

>
>Variable delay: You mean large latency spread, right? One thing often
>done on messaging level that is heavily impacted by this is
>heartbeating. Still, I believe the heartbeating should be done on L4 (as
>SCTP does it), not in the scalability layer (we have no roundtrip time
>estimates or any other metrics to drive the heartbeat algorithm above
>the L4).


@gaberger: Maybe this is more of an affinity designation for locality
(I.e. Rack-affinity)


>
>> It may be necessary to introduce some new semantics into the scalability
>> protocol which operate with different characteristics.
>>
>> Types of Service
>>
>> * Edge driven - Primary use case is to populate local edge-caches
>>   for realtime entertainment (e.g. Netflix). Highly impacted by
>>   BW*Delay Product and AIMD congestion algorithms.
>> * Peer-to-Peer Services - Modeled on multi-dimensional hypercubes
>>   (i.e. Skype, Gnutella) which make up the predominant traffic across
>>   the Internet core. Maybe SP provides a scalable overlay network if
>>   one chooses to build a next generation Skype.
>> * Data Center Services which require tolerance of congestion
>>   spikes (i.e. Incast) due to NoSQL Query/Join execution frameworks,
>>   multiple geographically dispersed data center resources and
>>   "extreme" partitioning events. The tradeoffs here are typically
>>   around consistency vs. availability (i.e. weaker consistency
>>   models chosen in place of complete service outage)
>
>I would say we have to dig further into these use cases to find out how
>they interact with scalability layer. Can you elaborate? Providing
>pointers to existing work/analysis would be helpful as well.

@gaberger: So there is some work here. There is no one-size-fits-all,
which is back to why there is a need for a scalability protocol.
Sensitivities to oversubscription rates, distance, user response time, etc.

"On the massive increase in mobile data use and the dangers that creates:
The real risk of everything is collapse. Nobody utters this loudly enough,
but the real issue for the world is a collapse of the network or some local
collapses.[FRANCETEL]

Another interesting IETF project is [ALTO]. How would a Scalability
Protocol leverage such a system?


>
>> I would also be curious if SP can deal with the Location/Identity Split
>> problem which is so pervasive today in mobile computing and data center
>> resources. Dealing with this problem at the SP layer might provide a
>> better approach than refactoring the protocol bindings (I.e.
>> Egress-Interface->ARP->IP->Egress Interface) with alternatives like
>> LISP, TRILL, etc...
>
>For me it's not yet clear how scalability patterns and identity
>interact. To use 0MQ as an example once again, we've deliberately
>avoided the problem of identity/naming as too hard to solve at the
>moment. With SP we will have to dig into it though. It's extremely
>interesting stuff and I myself am quite curious about where it will lead
>us.
>
>Martin

@gaberger: Yes this would be a dive into some dark territory.

-----


[DCTCP] http://portal.acm.org/citation.cfm?id=1851275.1851192

[FRANCETEL] http://allthingsd.com/20110523/exclusive-france-telecom-ceo-on-apple-android-and-how-you-can-kiss-your-unlimited-plan-goodbye/

[ALTO] http://tools.ietf.org/html/draft-ietf-alto-protocol-08

-g


Paul Colomiets

May 23, 2011, 5:30:16 PM
to sp-discu...@googlegroups.com
Hi,

On Mon, May 23, 2011 at 11:36 PM, Martin Sustrik <sus...@250bpm.com> wrote:

> This is bound to happen if every node on a path between endpoints is allowed to add metadata to the message. However concise the metadata, choose a path long enough and the transfer unit size will be exceeded.
>
> Thus, to not provide fragmentation at the SP level we would have to be very clear about not allowing the intermediary nodes to add metadata to the message. Contrast this with, say, the current implementation of the request/reply pattern in 0MQ, where the whole backwards path (all the nodes on it) to the original requester is held in the request message.

I think it's OK to add metadata. There are inherently very short messages, e.g. change notifications (which probably consist only of an object id or path or URI). If a message is 1000 bytes (which is OK in most circumstances, even for instant messaging or games), there is plenty of space for headers. Given an MTU of 500 bytes and a single address of 20 bytes (zmq uses 16 + 4 bytes for some header), you can add 25 intermediate nodes. If this is not enough, some kind of gateway can be added, which consumes several labels and produces a single one. This gateway must hold all the paths to the nodes behind it, not data for all the messages passed, so it should scale well. This is quite similar to what NAT does.
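
Roughly like this (just a sketch of the idea; the 8-byte label and all names are invented for illustration):

import itertools

class LabelGateway:
    """Swaps an accumulated label stack for one short label, like NAT:
    state is kept per path behind the gateway, not per message."""
    def __init__(self):
        self.next_id = itertools.count()
        self.by_path = {}    # tuple of labels -> compact label
        self.by_label = {}   # compact label -> original label stack

    def compress(self, envelope):
        """Forward direction: replace the whole stack with one label."""
        path = tuple(envelope)
        label = self.by_path.get(path)
        if label is None:
            label = next(self.next_id).to_bytes(8, "big")
            self.by_path[path] = label
            self.by_label[label] = path
        return [label]

    def expand(self, envelope):
        """Reply direction: restore the original stack."""
        return list(self.by_label[envelope[0]])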

The actual problem I see is that it's hard to predict the exact allowed message size. If I think that 1200 is good for my application, and I won't have more than 15 intermediate nodes, it's OK. But what if the network later grows unexpectedly, or the MTU is lower (which is common for VPNs, because they add a header of their own)?

Probably for such protocols there should be some minimum reserved space, some maximum size of data intermediaries can attach, and probably some minimum number of servers supported by the protocol. For the latter it means that if you want to have more than N intermediaries you should use the "gateways" in the sense I've defined them above. Or if it's your own application and you know your maximum message size exactly, then you can optimize better.

And anyway, because it's an L5 protocol, we aren't going to have an enormous number of intermediaries, because most forwarding work is already done at the IP level. The problem arises if you want to use labels not as the address of the node a message should go to or came from, but as the chain of nodes the message should visit (there are similar patterns for zeromq), but that's probably a wrong use case.
 

> The state has to be stored somewhere though. If it's not in the message, it would have to be stored on intermediate nodes. This is technically possible; however, the problem is that the amount of memory on the intermediate nodes defines the total throughput of the system -- you can't pass more messages than you are able to store info about.
>
> This in turn limits the scalability of the system as a whole, which seems to contradict the ultimate goal of the whole SP thing.
 
Yes. I'm almost sure it's not the way it should be done.

--
Paul

Pieter Hintjens

May 24, 2011, 2:14:06 AM
to sp-discu...@googlegroups.com
On Mon, May 23, 2011 at 11:30 PM, Paul Colomiets <pa...@colomiets.name> wrote:

> Given an MTU of 500 bytes and a single address of 20 bytes (zmq uses 16 + 4
> bytes for some header), you can add 25 intermediate nodes...

The 20 bytes used in 0MQ seems excessive. At any point a node only
needs to uniquely identify connections for a return path, and not
reuse this identification. So an 8-byte integer would work with
minimum fuss, as would a shorter integer with a little care.
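
For instance (an illustrative sketch, not a 0MQ patch; all names are invented):

import itertools

class ConnectionTable:
    """Per-node table mapping never-reused 8-byte ids to connections."""
    def __init__(self):
        self._ids = itertools.count(1)   # monotonically increasing, never reused
        self._conns = {}

    def register(self, conn):
        cid = next(self._ids).to_bytes(8, "big")
        self._conns[cid] = conn
        return cid                       # 8 bytes on the wire per hop, not 20

    def lookup(self, cid):
        return self._conns[cid]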

-Pieter

Martin Sustrik

May 30, 2011, 5:33:43 AM
to Gary Berger, sp-discu...@googlegroups.com
Hi Gary,

Sorry for the delay. I've been busy with 0MQ last week.

>> Congestion collapse: My suggestion would be to rely on CC algorithms as
>> provided by underlying layers. SP is meant to live more or less at L5,
>> way above where you manage congestion control. However, the messaging
>> patterns it provides should have clearly defined behaviour when
>> congestion hits (pushback, message dropping etc.)
>
> @gaberger: OK, congestion is intimately coupled with queue occupancy and
> therefore linked to the size of the flow. "Long-lived, greedy TCP flows
> will cause the length of the bottleneck queue to grow until packets are
> dropped, resulting in the familiar sawtooth pattern" [DCTCP]. How would
> SP deal with, say, back-pressure messages from QCN?

My take on it would be as follows: explicitly avoid dealing with CC in
the first phase and assume that L4 and the layers below take care of it.

Once the scalability patterns at L5 are in place, we can investigate
whether the extra information provided by the user (by choosing a
particular pattern and by applying different options) can be used to
optimise the CC algorithm (eg. using different router settings for
pub/sub and req/rep).

>> Transient loops: There was absolutely no research in this area. For
>> example, with 0MQ we currently assume that there are no loops. Which is
>> not viable in the long run, obviously.
>
> @gaberger: Since networks are getting dense (100K server instances in a
> DC), you have a state-synchronization problem.

Definitely. More research is needed here.

>> Variable delay: You mean large latency spread, right? One thing often
>> done on messaging level that is heavily impacted by this is
>> heartbeating. Still, I believe the heartbeating should be done on L4 (as
>> SCTP does it), not in the scalability layer (we have no roundtrip time
>> estimates or any other metrics to drive the heartbeat algorithm above
>> the L4).
>
> @gaberger: Maybe this is more of an affinity designation for locality
> (I.e. Rack-affinity)

Which specific features do you believe are directly impacted by location
affinity? Just the heartbeating/presence or anything else?

> @gaberger: So there is some work here. There is no one-size-fits-all,
> which is back to why there is a need for a scalability protocol.
> Sensitivities to oversubscription rates, distance, user response time, etc.

Yes. We are seeing some of that even today. What we do so far with 0MQ
is try to be naively fair to all participants, but that won't do in the
long run. For example, the algorithms for distributing work over the
network should be more sophisticated than a simple "round robin among
peers that don't apply backpressure at the moment".
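
That naive strategy is essentially this (a toy sketch, not the actual 0MQ code; names are invented):

def pick_peer(peers, start):
    """Round-robin over peers, skipping any that currently apply
    backpressure. `peers` are objects with a boolean `backpressure`."""
    n = len(peers)
    for i in range(n):
        candidate = peers[(start + i) % n]
        if not candidate.backpressure:
            return candidate, (start + i + 1) % n   # next starting index
    return None, start   # everyone is pushing back; caller must wait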

> Another interesting IETF project is [ALTO]. How would a Scalability
> Protocol leverage such a system?

It has to do with the discussion above. If there's something like ALTO
in place, the algorithms inside SP can take advantage of the
ALTO-provided metrics and, say, steer the load-balancing in such a way
as to reflect the preferences of the middle nodes, be they network
devices, ISPs or whatever.

>>> I would also be curious if SP can deal with the Location/Identity Split
>>> problem which is so pervasive today in mobile computing and data center
>>> resources. Dealing with this problem at the SP layer might provide a
>>> better approach than refactoring the protocol bindings (I.e.
>>> Egress-Interface->ARP->IP->Egress Interface) with alternatives like
>>> LISP, TRILL, etc...
>>
>> For me it's not yet clear how scalability patterns and identity
>> interact. To use 0MQ as an example once again, we've deliberately
>> avoided the problem of identity/naming as too hard to solve at the
>> moment. With SP we will have to dig into it though. It's extremely
>> interesting stuff and I myself am quite curious about where it will lead
>> us.

> @gaberger: Yes this would be a dive into some dark territory.

Ack. It's definitely tempting though.

Martin

Martin Sustrik

May 30, 2011, 6:02:18 AM
to sp-discu...@googlegroups.com, Paul Colomiets
Hi Paul,

> I think it's OK to add metadata. There are inherently very short
> messages, e.g. change notifications (which probably consist only of an
> object id or path or URI). If a message is 1000 bytes (which is OK in
> most circumstances, even for instant messaging or games), there is
> plenty of space for headers. Given an MTU of 500 bytes and a single
> address of 20 bytes (zmq uses 16 + 4 bytes for some header), you can add
> 25 intermediate nodes. If this is not enough, some kind of gateway can
> be added, which consumes several labels and produces a single one. This
> gateway must hold all the paths to the nodes behind it, not data for all
> the messages passed, so it should scale well. This is quite similar to
> what NAT does.

There's one difference to take into account though. While a NAT gateway
works at the connection (port) level, the SP gateway would have to work
at the message level (ie. each message can have different metadata).
Which means that the SP gateway would have to store orders of magnitude
more data than a NAT gateway. It could get especially bad in high-volume,
high-latency scenarios. I would expect up to, say, a 10,000-fold increase
in memory requirements there.
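
Back-of-envelope, just to show the shape of the problem (all inputs here are assumptions, not measurements):

msg_rate = 100_000     # messages/second through the gateway (assumed)
rtt = 1.0              # seconds until the reply releases the entry (assumed)
entry_size = 100       # bytes of stored path metadata per message (assumed)

in_flight = msg_rate * rtt
print(f"{in_flight:,.0f} entries, ~{in_flight * entry_size / 1e6:.0f} MB")
# -> 100,000 entries, ~10 MB of pure routing state; a NAT-like gateway
#    with the same traffic would hold one entry per connection instead.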

> The actual problem I see is that it's hard to predict the exact allowed
> message size. If I think that 1200 is good for my application, and I
> won't have more than 15 intermediate nodes, it's OK. But what if the
> network later grows unexpectedly, or the MTU is lower (which is common
> for VPNs, because they add a header of their own)?

Plus if someone using your application adds an intermediary node later
on to scale it up, the whole thing can crash.

> Probably for such protocols there should be some minimum reserved space,
> some maximum size of data intermediaries can attach, and probably
> some minimum number of servers supported by the protocol. For the latter
> it means that if you want to have more than N intermediaries you should
> use the "gateways" in the sense I've defined them above. Or if it's your
> own application and you know your maximum message size exactly, then you
> can optimize better.
>
> And anyway, because it's an L5 protocol, we aren't going to have an
> enormous number of intermediaries, because most forwarding work is
> already done at the IP level.

That's an interesting question. It depends on how people use the
intermediaries. If I go beyond the idea of intermediaries as dumb
routers, I can imagine quite a lot of use cases for
filtering/transformations in intermediaries. There's no guarantee that
the number of hops will stay low forever, IMO.

> The problem arises if you want to use labels not as the
> address of the node a message should go to or came
> from, but as the chain of nodes the message should visit (there are
> similar patterns for zeromq), but that's probably a wrong use case.

These concerns are specific to individual messaging patterns. If a
messaging pattern requires an enormous envelope to be added to the
message, the pattern is probably ill-designed.

> The state has to be stored somewhere though. If it's not in the
> message, it would have to be stored on intermediate nodes. This is
> technically possible, however, the problem is that the amount of
> memory on the intermediate nodes defines the total throughput of the
> system -- you can't pass more messages than you are able to store
> info about.
>
> This in turn limits the scalability of the system as a whole. Which
> seems to contradict the ultimate goal of the whole SP thing.
>
> Yes. I'm almost sure it's not the way it should be done.

However, note that your "NAT" proposal is exactly that -- storing the
state at the intermediary node.

Martin

Martin Sustrik

May 30, 2011, 6:04:29 AM
to sp-discu...@googlegroups.com, Pieter Hintjens
On 05/24/2011 08:14 AM, Pieter Hintjens wrote:

> The 20 bytes used in 0MQ seems excessive. At any point a node only
> needs to uniquely identify connections for a return path, and not
> reuse this identification. So an 8-byte integer would work with
> minimum fuss, as would a shorter integer with a little care.

Yes. 4 bytes should do. Maybe even 2. As a comparison, MPLS labels are
4 bytes long IIRC.

Martin
