Snabb Switch and the End-to-End principle


Luke Gorrie

Mar 24, 2015, 10:03:51 AM
to snabb...@googlegroups.com
Howdy!

Here is a pleasant observation: Snabb Switch's recent evolution has been bringing our internal software design more closely into line with the End-to-End Principle.

The end-to-end principle says to put complexity at the edges of a network and to keep the bits in the middle simple. Prime example: the Internet has scaled by putting a lot of complexity in the endpoints (TCP, HTTP, SSL) while keeping the intermediate hops simple (Ethernet, IP, GRE). This has allowed applications and infrastructure to evolve independently, and indeed the Internet has scaled up remarkably well in all respects.

Our app network has recently been pushing complexity to the edges too. The programming model for an app is almost exactly like the programming model for a real device: you have some links that carry a stream of packets, each packet is a blob of binary data that you can process however you want, and there is really not much more to it than that.

This is unusual among networking stacks. It is more common for packets to be big data structures with things like mutexes, reference counts, cached protocol headers, extra reserved space, checksum validity information, multiple data buffers chained together, and so on. This information can be useful, but if every app in the network has to carefully maintain it then it can also be complex and expensive. My reflection is that making all code that deals with packets worry about these details is contrary to the end-to-end principle, like making every internet switch and router deal with TCP/HTTP/SSL just because some endpoints use those protocols.

End of reflection :). But if you are feeling curious then below is a little summary of how our packet structure has evolved and a comparison with a more traditional one.

This is what our packet structure looks like today (simple, spartan, SIMD-friendly):

struct packet {
    unsigned char data[PACKET_PAYLOAD_SIZE];
    uint16_t length;
};

Can't be much simpler than this, can it? (Indeed, I won't be shocked if we find a good reason to extend this again in the future, but that is not the current trend.)

Our previous one was also quite simple, but it included checksum metadata (a "checksum is already done" flag and a "checksum needs to be done before transmission" flag). That made it possible for endpoint apps to communicate information to each other (e.g. between the Intel10G app and the Virtio-net app), but at the expense of every app on the path needing to deal with it (e.g. when encapsulating/decapsulating packets):

struct packet {
    unsigned char data[PACKET_PAYLOAD_SIZE];
    uint16_t length;           // data payload length
    uint16_t flags;            // see packet_flags enum below
    uint16_t csum_start;       // position where checksum starts
    uint16_t csum_offset;      // offset (after start) to store checksum
};

enum packet_flags {
  PACKET_NEEDS_CSUM = 1, // Layer-4 checksum needs to be computed
  PACKET_CSUM_VALID = 2  // checksums are known to be correct
};

... and this one was already much simpler than our previous version, which allowed multiple buffers to be chained together and kept track of where each buffer's memory was allocated (e.g. within a particular virtual machine):

struct buffer;

struct buffer_origin {
  enum buffer_origin_type {
    BUFFER_ORIGIN_UNKNOWN = 0,
    BUFFER_ORIGIN_VIRTIO  = 1
    // NUMA...
  } type;
  union buffer_origin_info {
    struct buffer_origin_info_virtio {
      int16_t device_id;
      int16_t ring_id;
      int16_t header_id;
      char    *header_pointer;  // virtual address in this process
      uint32_t total_size;      // how many bytes in all buffers
    } virtio;
  } info;
};

// A packet_iovec describes a portion of a buffer.
struct packet_iovec {
  struct buffer *buffer;
  uint32_t offset;
  uint32_t length;
};

struct packet_info {
  uint8_t flags;        // see below
  uint8_t gso_flags;    // see below
  uint16_t hdr_len;     // ethernet + ip + tcp/udp header length
  uint16_t gso_size;    // bytes of post-header payload per segment
  uint16_t csum_start;  // position where checksum starts
  uint16_t csum_offset; // offset (after start) to store checksum
};

struct packet {
  int32_t refcount;
  int32_t color;
  struct packet_info info;
  int niovecs;
  int length;
  struct packet_iovec iovecs[PACKET_IOVEC_MAX];
} __attribute__ ((aligned(64)));

... which was pretty complex but still considerably more spartan than a genuine traditional one like the Linux struct sk_buff:

struct sk_buff {
        /* These two members must be first. */
        struct sk_buff          *next;
        struct sk_buff          *prev;

        union {
                ktime_t         tstamp;
                struct skb_mstamp skb_mstamp;
        };

        struct sock             *sk;
        struct net_device       *dev;

        /*
         * This is the control buffer. It is free to use for every
         * layer. Please put your private variables there. If you
         * want to keep them across layers you have to do a skb_clone()
         * first. This is owned by whoever has the skb queued ATM.
         */
        char                    cb[48] __aligned(8);

        unsigned long           _skb_refdst;
        void                    (*destructor)(struct sk_buff *skb);
#ifdef CONFIG_XFRM
        struct sec_path         *sp;
#endif
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
        struct nf_conntrack     *nfct;
#endif
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
        struct nf_bridge_info   *nf_bridge;
#endif
        unsigned int            len,
                                data_len;
        __u16                   mac_len,
                                hdr_len;

        /* Following fields are _not_ copied in __copy_skb_header()
         * Note that queue_mapping is here mostly to fill a hole.
         */
        kmemcheck_bitfield_begin(flags1);
        __u16                   queue_mapping;
        __u8                    cloned:1,
                                nohdr:1,
                                fclone:2,
                                peeked:1,
                                head_frag:1,
                                xmit_more:1;
        /* one bit hole */
        kmemcheck_bitfield_end(flags1);

        /* fields enclosed in headers_start/headers_end are copied
         * using a single memcpy() in __copy_skb_header()
         */
        /* private: */
        __u32                   headers_start[0];
        /* public: */

/* if you move pkt_type around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_TYPE_MAX    (7 << 5)
#else
#define PKT_TYPE_MAX    7
#endif
#define PKT_TYPE_OFFSET()       offsetof(struct sk_buff, __pkt_type_offset)

        __u8                    __pkt_type_offset[0];
        __u8                    pkt_type:3;
        __u8                    pfmemalloc:1;
        __u8                    ignore_df:1;
        __u8                    nfctinfo:3;

        __u8                    nf_trace:1;
        __u8                    ip_summed:2;
        __u8                    ooo_okay:1;
        __u8                    l4_hash:1;
        __u8                    sw_hash:1;
        __u8                    wifi_acked_valid:1;
        __u8                    wifi_acked:1;

        __u8                    no_fcs:1;
        /* Indicates the inner headers are valid in the skbuff. */
        __u8                    encapsulation:1;
        __u8                    encap_hdr_csum:1;
        __u8                    csum_valid:1;
        __u8                    csum_complete_sw:1;
        __u8                    csum_level:2;
        __u8                    csum_bad:1;

#ifdef CONFIG_IPV6_NDISC_NODETYPE
        __u8                    ndisc_nodetype:2;
#endif
        __u8                    ipvs_property:1;
        __u8                    inner_protocol_type:1;
        /* 4 or 6 bit hole */

#ifdef CONFIG_NET_SCHED
        __u16                   tc_index;       /* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
        __u16                   tc_verd;        /* traffic control verdict */
#endif
#endif

        union {
                __wsum          csum;
                struct {
                        __u16   csum_start;
                        __u16   csum_offset;
                };
        };
        __u32                   priority;
        int                     skb_iif;
        __u32                   hash;
        __be16                  vlan_proto;
        __u16                   vlan_tci;
#ifdef CONFIG_NET_RX_BUSY_POLL
        unsigned int            napi_id;
#endif
#ifdef CONFIG_NETWORK_SECMARK
        __u32                   secmark;
#endif
        union {
                __u32           mark;
                __u32           dropcount;
                __u32           reserved_tailroom;
        };

        union {
                __be16          inner_protocol;
                __u8            inner_ipproto;
        };

        __u16                   inner_transport_header;
        __u16                   inner_network_header;
        __u16                   inner_mac_header;

        __be16                  protocol;
        __u16                   transport_header;
        __u16                   network_header;
        __u16                   mac_header;

        /* private: */
        __u32                   headers_end[0];
        /* public: */

        /* These elements must be at the end, see alloc_skb() for details.  */
        sk_buff_data_t          tail;
        sk_buff_data_t          end;
        unsigned char           *head,
                                *data;
        unsigned int            truesize;
        atomic_t                users;
};

Max Rottenkolber

Mar 24, 2015, 11:16:42 AM
to snabb...@googlegroups.com
On Tue, 24 Mar 2015 15:03:50 +0100, Luke Gorrie wrote:

> below is a little summary of how our packet structure has evolved and a
> comparison with a more traditional one.

Awesome! =) I think minimal design deserves more positive reinforcement.
I always operated under the assumption that good code must be "cute".
E.g. cute enough to be presentable in a children's textbook. I originally
learned this from K&R's "The C Programming Language" (I think), which was
the programming book that changed my mind from "wtf is uint 16" to "ok,
it's a positive fixed-size integer, but that's really not the point".

(You cannot possibly imagine how confusing a generic C book is to a kid
who has just written his first web CMS in Bash and wants to learn "real
programming".)

So now to my point: when you religiously write cute code and make cute
designs, and then look at the code base of a big successful open source
project (say, the Linux kernel), you might think (I at least constantly
do): "Oh wow, am I just a really naive dude living in a fairy world,
writing fairy-world code that couldn't ever compete in the real world?"

In reality, however, I think that aspiring to build "cute systems" yields
tremendous value that's really hard to measure. It's value that doesn't
directly translate into performance or features but instead enables
efficient sharing of ideas and cooperative reasoning about the properties
of a system. It lessens the costs of maintenance, documentation,
redesigns and feature additions. It's the difference between source code
as an artifact and source code as a living theorem.

Does that make any sense?

Disclaimer: I sell cute code for a living.


Javier Guerra Giraldez

Mar 24, 2015, 11:59:50 AM
to snabb...@googlegroups.com
On 24 March 2015 at 09:03, Luke Gorrie <lu...@snabb.co> wrote:
> This is what our packet structure looks like today (simple, spartan,
> SIMD-friendly):
>
> struct packet {
> unsigned char data[PACKET_PAYLOAD_SIZE];
> uint16_t length;
> };


Somehow I hadn't noticed this before... but is there any reason to put
the data before the length? My first intuition would be to put the
length first, so that reading the size (typically the first operation
on a packet) could start the memory prefetching. On very small packets
that could make a difference.

Or maybe it's _desirable_ to have the length on a different cache
line? Modern chips are just too weird...

--
Javier

Nikolay Nikolaev

Mar 24, 2015, 1:25:54 PM
to snabb...@googlegroups.com
On Tue, Mar 24, 2015 at 5:59 PM, Javier Guerra Giraldez <jav...@snabb.co> wrote:
> On 24 March 2015 at 09:03, Luke Gorrie <lu...@snabb.co> wrote:
>> This is what our packet structure looks like today (simple, spartan,
>> SIMD-friendly):
>>
>> struct packet {
>> unsigned char data[PACKET_PAYLOAD_SIZE];
>> uint16_t length;
>> };
>
>
> somehow i hadn't noticed this before... but is there any reason to put
> the data before the length? my first intuition would be to put it
> before, so reading the size (typically the first operation on a
> packet) could start the memory prefetching. on very small packets
> that could make a difference.

memcpy, alignments and such :)

regards,
Nikolay Nikolaev
>
> or maybe it's _desirable_ to have the length on a different cache
> line? modern chips are just too weird...
>
> --
> Javier

Andy Wingo

Mar 24, 2015, 1:53:40 PM
to Javier Guerra Giraldez, snabb...@googlegroups.com
On Tue 24 Mar 2015 16:59, Javier Guerra Giraldez <jav...@snabb.co> writes:

> somehow i hadn't noticed this before... but is there any reason to put
> the data before the length?

A guess: SIMD-friendly alignment of the data.

A

Luke Gorrie

Mar 25, 2015, 2:12:03 AM
to snabb...@googlegroups.com
On 24 March 2015 at 16:59, Javier Guerra Giraldez <jav...@snabb.co> wrote:
somehow i hadn't noticed this before... but is there any reason to put
the data before the length?  my first intuition would be to put it
before, so reading the size (typically the first operation on a
packet) could start the memory prefetching.

Your intuition may well be correct. I'd love to have a way to quickly answer these questions.

Idea: we could have a "micro-optimizations" branch where ideas like this struct change could be submitted as PRs and described for future reference. Then we could browse and cherry-pick from that branch during optimization efforts.

In practice it currently takes us significant effort to verify optimizations across applications and workloads, and so it has been more effective to do many at the same time (everything needed to meet a specific target) rather than small isolated ones.

The current struct layout was adopted during the optimization effort to make straightline match the performance of its predecessor. That is, as part of a major rewrite rather than as part of an isolated experiment.

I also have a few ideas for SIMD optimization that I would like to dump out of my brain into some useful place like a microoptimization branch. Using the mailing list for now...

There are a couple of useful properties of our packet struct with respect to SIMD:

- The start of the packet is aligned how we like (e.g. 16/32/64 byte).
- We can safely overwrite the bytes following the end of the payload.

This could potentially eliminate a bunch of special cases in functions like memcpy and cksum.

For example, memcpy has a bunch of extra code to make the stores aligned (we know that ours are) and to handle copy sizes that aren't a multiple of the SIMD register size (we could just "round up" when copying into packet structs). So we could copy a 111-byte packet using four 32-byte aligned store instructions (or two 64-byte aligned stores once we get AVX-512).

Similarly for checksums: the original SSE2 code has to deal specially with unaligned bytes at the start and the end of the packet using non-SIMD code. On AVX2 we should be able to process all the data in 32-byte chunks because (a) at the beginning, unaligned loads are supported and fast, and (b) at the end, we can safely write zeros into the trailing bytes after the packet (up to 31 bytes) to exclude them from the checksum (suggested by Tony Rogvall, who wrote the original SSE2 code).

Coming up with ideas to test is easy; the trouble is that testing is more work, and so few ideas turn out to be practical that discussing them in the abstract is mostly an unproductive distraction (mea culpa). It's fun though, and beats browsing Facebook at least :)

Cheers!
-Luke


Luke Gorrie

Mar 25, 2015, 2:23:47 AM
to snabb...@googlegroups.com
On 24 March 2015 at 16:16, Max Rottenkolber <m...@mr.gy> wrote:
So now to my point: When you are religiously writing cute code and do
cute designs, and then look at a the code base of a big successful open
source project (say the Linux kernel), you might think (I at least
constantly do): "Oh wow, am I just a really naive dude living in a fairy-
world, writing fairy-world code that couldn't ever compete in the real
world?"

I find it fascinating that the code we are writing today is extremely practical, but that it has only very recently become possible to write practical code like this.

Consider the tools that we have:
- Entire cores to use as workers (no interrupt-based work multiplexing).
- A SIMD engine ramping up to 64 bytes per cycle of throughput.
- A high-level scripting language that is competitive with C.
- The best hardware platforms available cheap and off the shelf.
- NICs with publicly available driver programming documentation.

So a cute code merchant is in the right place at the right time in networking in 2015 :-)

