[akaros] Incorrect TCP checksum with block extra data (#23)

6 views
Skip to first unread message

Xiao Jia

unread,
Oct 16, 2015, 7:51:02 PM10/16/15
to brho/akaros

I have been looking at why netperf throughput is so low (<1Mbps) on my test machines. It was netperf TCP_STREAM from Akaros to Linux. Both machines have mlx4 NIC and I disabled offloads on Linux.

After some debugging with tcpdump and wireshark, it turns out to be

  1. Akaros sending packets with incorrect TCP checksum after some number (hundreds, exact number seems to be random) of packets; but not all the checksum are wrong, only some of them; the first packet with incorrect checksum is 1460 bytes long (tcp len).
  2. Linux responds with duplicate ACKs.
  3. Akaros starts a fast retransmission, but still with the incorrect checksum so it's dropped.
  4. Akaros starts retransmissions, but all of them have the incorrect checksum (the same packet is just retransmitted, and since the checksum is wrongly computed, retransmission of course will not solve the problem).
  5. Retransmission takes time, hence the very low throughput.

With block extra data disabled, I can get decent throughput.

Now I'm looking at the code...


Reply to this email directly or view it on GitHub.

Xiao Jia

unread,
Oct 16, 2015, 9:16:41 PM10/16/15
to brho/akaros

I was guessing it has something to do with odd, so I had the patch below. And I'm seeing outputs like this:

csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 186 lens 71 365 nodd 2 nzero 183 nnull 180 noff 4
csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 848 lens 71 365 nodd 2 nzero 845 nnull 842 noff 4
csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 848 lens 71 365 nodd 2 nzero 845 nnull 842 noff 4
csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 848 lens 71 365 nodd 2 nzero 845 nnull 842 noff 4
csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 848 lens 71 365 nodd 2 nzero 845 nnull 842 noff 4
csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 848 lens 71 365 nodd 2 nzero 845 nnull 842 noff 4
csum 2e58 tcpcksum 2e 58 extra_len 1460 nr_extra_bufs 848 lens 71 365 nodd 2 nzero 845 nnull 842 noff 4

This is interesting because we see most of the extra_data have len=0 and base=0. This is only a 1460 byte packet, where are these extra_data's coming from???

diff --git a/kern/src/net/tcp.c b/kern/src/net/tcp.c
index fd80f87..29fe83a 100644
--- a/kern/src/net/tcp.c
+++ b/kern/src/net/tcp.c
@@ -1069,8 +1069,29 @@ struct block *htontcp4(Tcp * tcph, struct block *data, Tcp4hdr * ph,
        if (tcb != NULL && tcb->nochecksum) {
                h->tcpcksum[0] = h->tcpcksum[1] = 0;
        } else {
+               int seq_lt(uint32_t x, uint32_t y);
                csum = ~ptclcsum(data, TCP4_IPLEN, TCP4_PHDRSIZE);
                hnputs(h->tcpcksum, csum);
+               if (tcb && seq_lt(tcb->snd.ptr, tcb->snd.nxt)) {
+                       int nodd = 0, nzero = 0, nnull = 0, noff = 0;
+                       for (int i = 0; i < data->nr_extra_bufs; i++) {
+                               if (data->extra_data[i].len % 2)
+                                       nodd++;
+                               if (data->extra_data[i].len == 0)
+                                       nzero++;
+                               if (!data->extra_data[i].base)
+                                       nnull++;
+                               if (data->extra_data[i].off > 0)
+                                       noff++;
+                       }
+                       printk("csum %4x tcpcksum %2x %2x extra_len %d nr_extra_bufs %d lens",
+                                  csum, h->tcpcksum[0], h->tcpcksum[1], data->extra_len, data->nr_extra_bufs);
+                       for (int i = 0; i < data->nr_extra_bufs; i++) {
+                               if (data->extra_data[i].len % 2)
+                                       printk(" %d", data->extra_data[i].len);
+                       }
+                       printk(" nodd %d nzero %d nnull %d noff %d\n", nodd, nzero, nnull, noff);
+               }
                data->checksum_start = TCP4_IPLEN + TCP4_PHDRSIZE;
                data->checksum_offset = ph->tcpcksum - ph->tcpsport;
                data->flag |= Btcpck;

Davide Libenzi

unread,
Oct 16, 2015, 9:47:46 PM10/16/15
to Akaros, brho/akaros
I see there is a printblock API, which could give better view of the troubled block?
Looks like there are a lot of extra_bdata entries with zero size, so even if you have a low of extra_bdata entries, the total size might still be 1460.
Some part of the code keeps piling up new extra_bdata blocks with zero length?


--
You received this message because you are subscribed to the Google Groups "Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akaros+un...@googlegroups.com.
To post to this group, send email to aka...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Xiao Jia

unread,
Oct 16, 2015, 9:48:45 PM10/16/15
to brho/akaros

It seems that we're missing an adjustment to odd here:

diff --git a/kern/src/net/ipaux.c b/kern/src/net/ipaux.c
index 5c070b8..98a7bb9 100644
--- a/kern/src/net/ipaux.c
+++ b/kern/src/net/ipaux.c
@@ -252,8 +252,8 @@ uint16_t ptclcsum_one(struct block *bp, int offset, int len)
                        hisum += ptclbsum(addr, x);
                else
                        losum += ptclbsum(addr, x);
+               odd = (odd + x) & 1;
                len -= x;
-
        }
        losum += hisum >> 8;
        losum += (hisum & 0xff) << 8;

but even with this patch there're still checksum errors, though much fewer.

Xiao Jia

unread,
Oct 16, 2015, 9:57:55 PM10/16/15
to aka...@googlegroups.com, brho/akaros
On Fri, Oct 16, 2015 at 6:47 PM, 'Davide Libenzi' via Akaros
<aka...@googlegroups.com> wrote:
> I see there is a printblock API, which could give better view of the troubled block?

Right, but the csum seems to be touched by something else, so I
couldn't even see the wrong checksum received in my printk outputs...
I need to read and understand the code better anyway, so I plan to
first try to find errors by just looking at them :-)

> Looks like there are a lot of extra_bdata entries with zero size, so even if you have a low of extra_bdata entries, the total size might still be 1460.

Yes. I printed non-zero-len extra data's, and those with checksum
errors usually have two or three non-zero entries, and their lengths
sum up to 1460. For example

csum 2e58 extra_len 1460 nr_extra_bufs 848 lens 585 875 nodd 2 nzero
846 nnull 844 noff 3
csum 2e58 extra_len 1460 nr_extra_bufs 848 lens 213 1024 223 nodd 2
nzero 845 nnull 842 noff 4

> Some part of the code keeps piling up new extra_bdata blocks with zero length?

I believe so. I plan to work through the extra data stuff anyway,
so... more code reading

btw I cannot see your reply on Github UI. maybe Github doesn't like
being in CC list?

Kevin Klues

unread,
Oct 16, 2015, 10:07:55 PM10/16/15
to aka...@googlegroups.com, brho/akaros
It looks like he might have responded via the groups website rather than his mail client, so the message sender was akaros@googlegroups, not dlibenzi... 
--
You received this message because you are subscribed to the Google Groups "Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akaros+un...@googlegroups.com.
To post to this group, send email to aka...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
~Kevin

Davide Libenzi

unread,
Oct 16, 2015, 10:09:04 PM10/16/15
to Akaros, brho/akaros
Mind, I have absolutely no idea of that code, but this might be troubling (line 231):

    if (offset < BHLEN(bp)) {
        x = MIN(len, BHLEN(bp) - offset);
If len < BHLEN(bp) then MIN() can be lower than offset, and x a huge number.
Maybe an assert() in there, during debugging (or always)?

PS: No idea why Github did not get my reply. I included in my To:


Davide Libenzi

unread,
Oct 16, 2015, 10:10:21 PM10/16/15
to Akaros, brho/akaros
On Fri, Oct 16, 2015 at 7:09 PM, Davide Libenzi <dlib...@google.com> wrote:
Mind, I have absolutely no idea of that code, but this might be troubling (line 231):

    if (offset < BHLEN(bp)) {
        x = MIN(len, BHLEN(bp) - offset);
If len < BHLEN(bp) then MIN() can be lower than offset, and x a huge number.
Forget about it, misinterpreted parenthesis ☺


barret rhoden

unread,
Oct 17, 2015, 10:37:54 AM10/17/15
to reply+00e5874dfcc73526755edbf1e9204a76763ec1b...@reply.github.com, aka...@googlegroups.com
I'll have to think on it, but this whole thing sounds familiar - we
might have gone through this with earlier work with the checksum
offloading and block extra data.

Barret

Xiao Jia

unread,
Oct 19, 2015, 6:31:35 PM10/19/15
to brho/akaros, akaros-notifier

I tried to disable the tx checksum offload code path, but that didn't help. So I just implemented and use a separate checksum function instead of ptclcsum, and now I don't get checksum errors. So there is something wrong with ptclcsum. I'm taking a closer look at the code.

The one I'm currently using is

+static uint16_t my_csum(struct block *bp, int offset, int len)
+{
+       uint64_t hi = 0, lo = 0, sum;
+       int curpos_offset;
+
+       if (bp->next)
+               return 0;
+       for (int i = 0; i < BHLEN(bp); i++) {
+               int curpos = i;
+               if (curpos < offset)
+                       continue;
+               int relative_index = curpos - offset;
+               if (relative_index >= len)
+                       goto scanned;
+               if (relative_index % 2 == 0)
+                       hi += bp->rp[i];
+               else
+                       lo += bp->rp[i];
+       }
+       curpos_offset = BHLEN(bp);
+       for (int i = 0; i < bp->nr_extra_bufs; i++) {
+               struct extra_bdata *ebd = &bp->extra_data[i];
+               const uint8_t *buf = (const uint8_t *)(ebd->base) + ebd->off;
+               for (int j = 0; j < ebd->len; j++) {
+                       int curpos = curpos_offset + j;
+                       if (curpos < offset)
+                               continue;
+                       int relative_index = curpos - offset;
+                       if (relative_index >= len)
+                               goto scanned;
+                       if (relative_index % 2 == 0)
+                               hi += buf[j];
+                       else
+                               lo += buf[j];
+               }
+               curpos_offset += ebd->len;
+       }
+       assert(curpos_offset == BLEN(bp));
+scanned:
+       sum = (hi << 8) + lo;
+       while (sum >> 16)
+               sum = (sum >> 16) + (sum & 0xffff);
+       return (~sum) & 0xffff;
+}

Davide Libenzi

unread,
Oct 20, 2015, 10:59:50 AM10/20/15
to Akaros
For what I saw from my quick view this weekend, it is trying to optimize it and checksumming two bytes at a time.
I would look at its odd/even handling, which might not be done correctly when dealing with extra data.
Unless we always plan to offload them, it would be better to keep the current two-at-a-time optimization, by fixing its logic.
Try to pass to ptclcsum_one() pointers to low/high sums, and odd value, and strip the sum compression at the end of its function.
Do the sum compression (between low and high) at the end of ptclcsum() only.



--

Xiao Jia

unread,
Oct 20, 2015, 1:26:56 PM10/20/15
to aka...@googlegroups.com
On Tue, Oct 20, 2015 at 7:59 AM, 'Davide Libenzi' via Akaros
<aka...@googlegroups.com> wrote:
> For what I saw from my quick view this weekend, it is trying to optimize it and checksumming two bytes at a time.
> I would look at its odd/even handling, which might not be done correctly when dealing with extra data.
> Unless we always plan to offload them, it would be better to keep the current two-at-a-time optimization, by fixing its logic.
> Try to pass to ptclcsum_one() pointers to low/high sums, and odd value, and strip the sum compression at the end of its function.
> Do the sum compression (between low and high) at the end of ptclcsum() only.

Thanks Davide. The overall logic of the ptclcsum* functions seems
fine, though I haven't checked if the final bit fiddling is correct.

What I discovered yesterday is that ptclbsum (note, that's ptcl_b_sum,
not _c_) is buggy. Specifically, the one coming from NetBSD looks
wrong. If I change #ifdef CONFIG_X86 to #if 0 in
kern/src/net/ptclbsum.c, I don't get checksum errors.

By looking at the code, I think the in_masks handling is incorrect.
For example, when offset=2, leading bytes are placed at wrong hi/lo
bytes. For the same reason, it's incorrect again when len=1 or 3 at
the end of in_cksumdata. I'm working on a fix to verify what I said
is true.

I've heard that Alpha doesn't like unaligned memory access, so I
probably understand why it needs in_masks here. But for x86 I believe
calculating the checksum directly in 64-bit increments is fast and
much simpler.

A more general question now is, do we have a mechanism to write unit
tests for in-kernel functions? I want to write some for the checksum
functions.

Kevin Klues

unread,
Oct 20, 2015, 1:33:08 PM10/20/15
to aka...@googlegroups.com
I put together a unit testing framework in the kernel a few years ago.
It's called ktest and it's in:

kern/src/ktest
kern/include/ktest.h

You can turn it on in make menuconfig
> --
> You received this message because you are subscribed to the Google Groups "Akaros" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to akaros+un...@googlegroups.com.
> To post to this group, send email to aka...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
~Kevin

Barret Rhoden

unread,
Oct 20, 2015, 1:37:08 PM10/20/15
to aka...@googlegroups.com
On 2015-10-20 at 10:26 Xiao Jia <stf...@gmail.com> wrote:

I haven't looked at any of the checksum stuff yet, but I can comment on
these:

> I've heard that Alpha doesn't like unaligned memory access, so I
> probably understand why it needs in_masks here. But for x86 I believe
> calculating the checksum directly in 64-bit increments is fast and
> much simpler.

I think unaligned accesses are legal, but slow in x86.

> A more general question now is, do we have a mechanism to write unit
> tests for in-kernel functions? I want to write some for the checksum
> functions.

Check out kern/src/ktests/

The PB ktests are a catch-all for the tests we used to run manually.

It might be nice to have another one called net_ktests.c that does all
of the networking related things.

Barret



ron minnich

unread,
Oct 20, 2015, 1:57:14 PM10/20/15
to aka...@googlegroups.com
We need to be careful about unaligned accesses. It's a serious botch in various versions of gcc, most recently in ARM on 5.2, so it's best to assume you have to handle them manually. I would not write code that assumes unaligned accesses are handled correctly either in the kernel or any part of the toolchain or the architecture for that matter ...

ron

Barret Rhoden

unread,
Oct 20, 2015, 2:06:09 PM10/20/15
to aka...@googlegroups.com
On 2015-10-20 at 17:57 ron minnich <rmin...@gmail.com> wrote:
> We need to be careful about unaligned accesses. It's a serious botch
> in various versions of gcc, most recently in ARM on 5.2, so it's best
> to assume you have to handle them manually. I would not write code
> that assumes unaligned accesses are handled correctly either in the
> kernel or any part of the toolchain or the architecture for that
> matter ...


Also keep this commit in mind:

commit b5b3361af8803a01f26aef1392f595bd9446ef11
Author: Andrew Gallatin <gall...@google.com>
Date: Mon Jun 23 13:20:35 2014 -0700

Assume natural alignment for IP & ether addrs

Assume natural alignment for IP addresses and ethernet addresses in
the critical path. By doing so, we can compare and copy via direct
4-byte load / store rather than either calling memcmp()/memmove(), or
doing hand-unrolled byte-by-byte operations.

This gains about 750Mb/s on a netperf TCP receive workload with a
10GbE NIC in a low-end x86_64.

Signed-off-by: Andrew Gallatin <gall...@google.com>

diff --git a/kern/drivers/dev/ether.c b/kern/drivers/dev/ether.c
index 37cd2767432d..dbdc8e020926 100644
--- a/kern/drivers/dev/ether.c
+++ b/kern/drivers/dev/ether.c
@@ -251,6 +251,18 @@ static void etherrtrace(struct netfile *f, struct etherpkt *pkt, int len)
qpass(f->in, bp);
}

+#ifdef CONFIG_RISCV
+#warning "Potentially unaligned ethernet addrs!"
+#endif
+
+static inline int eaddrcmp(uint8_t *x, uint8_t *y)
+{
+ uint16_t *a = (uint16_t *)x;
+ uint16_t *b = (uint16_t *)y;
+
+ return (a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2]);
+}
+
struct block *etheriq(struct ether *ether, struct block *bp, int fromwire)
{
struct etherpkt *pkt;
@@ -285,7 +297,7 @@ struct block *etheriq(struct ether *ether, struct block *bp, int fromwire)

multi = pkt->d[0] & 1;
/* check for valid multcast addresses */
- if (multi && memcmp(pkt->d, ether->netif.bcast, sizeof(pkt->d)) != 0
+ if (multi && eaddrcmp(pkt->d, ether->netif.bcast) != 0
&& ether->netif.prom == 0) {
if (!activemulti(&ether->netif, pkt->d, sizeof(pkt->d))) {
if (fromwire) {
@@ -297,8 +309,8 @@ struct block *etheriq(struct ether *ether, struct block *bp, int fromwire)
}

/* is it for me? */
- tome = memcmp(pkt->d, ether->ea, sizeof(pkt->d)) == 0;
- fromme = memcmp(pkt->s, ether->ea, sizeof(pkt->s)) == 0;
+ tome = eaddrcmp(pkt->d, ether->ea) == 0;
+ fromme = eaddrcmp(pkt->s, ether->ea) == 0;

/*
* Multiplex the packet to all the connections which want it.
@@ -359,8 +371,8 @@ static int etheroq(struct ether *ether, struct block *bp)
*/
pkt = (struct etherpkt *)bp->rp;
len = BLEN(bp);
- loopback = memcmp(pkt->d, ether->ea, sizeof(pkt->d)) == 0;
- if (loopback || memcmp(pkt->d, ether->netif.bcast, sizeof(pkt->d)) == 0
+ loopback = eaddrcmp(pkt->d, ether->ea) == 0;
+ if (loopback || eaddrcmp(pkt->d, ether->netif.bcast) == 0
|| ether->netif.prom) {
disable_irqsave(&irq_state);
etheriq(ether, bp, 0);
diff --git a/kern/include/ip.h b/kern/include/ip.h
index 2bf7887d36a8..2cb2c354e5c5 100644
--- a/kern/include/ip.h
+++ b/kern/include/ip.h
@@ -531,8 +531,29 @@ extern void v4tov6(uint8_t * v6, uint8_t * v4);
extern int v6tov4(uint8_t * v4, uint8_t * v6);
//extern int eipfmt(Fmt*);

-#define ipmove(x, y) memmove(x, y, IPaddrlen)
-#define ipcmp(x, y) ( (x)[IPaddrlen-1] != (y)[IPaddrlen-1] || memcmp(x, y, IPaddrlen) )
+
+#ifdef CONFIG_RISCV
+#warning "Potentially unaligned IP addrs!"
+#endif
+static inline void ipmove(unsigned char *x, unsigned char *y)
+{
+ uint32_t *a = (uint32_t *)x;
+ uint32_t *b = (uint32_t *)y;
+
+ a[0] = b[0];
+ a[1] = b[1];
+ a[2] = b[2];
+ a[3] = b[3];
+}
+
+static inline long ipcmp(unsigned char *x, unsigned char *y)
+{
+ uint32_t *a = (uint32_t *)x;
+ uint32_t *b = (uint32_t *)y;
+ return (a[0] ^ b[0]) | (a[1] ^ b[1]) |
+ (a[2] ^ b[2]) | (a[3] ^ b[3]);
+}
+

extern uint8_t IPv4bcast[IPaddrlen];
extern uint8_t IPv4bcastobs[IPaddrlen];
diff --git a/kern/src/net/arp.c b/kern/src/net/arp.c
index 0f45fa8f61af..00e07b053b2b 100644
--- a/kern/src/net/arp.c
+++ b/kern/src/net/arp.c
@@ -206,10 +206,11 @@ void cleanarpent(struct arp *arp, struct arpent *a)
struct arpent *arpget(struct arp *arp, struct block *bp, int version,
struct Ipifc *ifc, uint8_t * ip, uint8_t * mac)
{
- int hash;
+ int hash, len;
struct arpent *a;
struct medium *type = ifc->m;
uint8_t v6ip[IPaddrlen];
+ uint16_t *s, *d;

if (version == V4) {
v4tov6(v6ip, ip);
@@ -219,7 +220,7 @@ struct arpent *arpget(struct arp *arp, struct block *bp, int version,
qlock(&arp->qlock);
hash = haship(ip);
for (a = arp->hash[hash]; a; a = a->hash) {
- if (memcmp(ip, a->ip, sizeof(a->ip)) == 0)
+ if (ipcmp(ip, a->ip) == 0)
if (type == a->type)
break;
}
@@ -241,7 +242,13 @@ struct arpent *arpget(struct arp *arp, struct block *bp, int version,
return a; /* return with arp qlocked */
}

- memmove(mac, a->mac, a->type->maclen);
+ s = (uint16_t *)a->mac;
+ d = (uint16_t *)mac;
+ len = a->type->maclen / 2;
+ while (len) {
+ *d++ = *s++;
+ len--;
+ }

/* remove old entries */
if (NOW - a->ctime > 15 * 60 * 1000)
diff --git a/kern/src/net/ethermedium.c b/kern/src/net/ethermedium.c
index 5d654cff2a85..15ae69b2b582 100644
--- a/kern/src/net/ethermedium.c
+++ b/kern/src/net/ethermedium.c
@@ -288,6 +288,19 @@ static void etherunbind(struct Ipifc *ifc)
}

/*
+ * copy ethernet address
+ */
+static inline void etherfilladdr(uint16_t *pkt, uint16_t *dst, uint16_t *src)
+{
+ *pkt++ = *dst++;
+ *pkt++ = *dst++;
+ *pkt++ = *dst++;
+ *pkt++ = *src++;
+ *pkt++ = *src++;
+ *pkt = *src;
+}
+
+/*
* called by ipoput with a single block to write with ifc rlock'd
*/
static void
@@ -325,8 +338,7 @@ etherbwrite(struct Ipifc *ifc, struct block *bp, int version, uint8_t * ip)
eh = (Etherhdr *) bp->rp;

/* copy in mac addresses and ether type */
- memmove(eh->s, ifc->mac, sizeof(eh->s));
- memmove(eh->d, mac, sizeof(eh->d));
+ etherfilladdr((uint16_t *)bp->rp, (uint16_t *)mac, (uint16_t *)ifc->mac);

switch (version) {
case V4:
diff --git a/kern/src/net/ipaux.c b/kern/src/net/ipaux.c
index 0c6ca02c1a50..5ac48727556d 100644
--- a/kern/src/net/ipaux.c
+++ b/kern/src/net/ipaux.c
@@ -12,6 +12,7 @@
#include <pmap.h>
#include <smp.h>
#include <ip.h>
+#include <endian.h>

/*
* well known IP addresses
@@ -324,51 +325,37 @@ extern char *v4parseip(uint8_t * to, char *from)

int isv4(uint8_t * ip)
{
- return memcmp(ip, v4prefix, IPv4off) == 0;
+ unsigned short *ips = (unsigned short *)ip;
+ return 0xffff == ips[5];
}

/*
* the following routines are unrolled with no memset's to speed
- * up the usual case
+ * up the usual case. They assume IP addresses are naturally aligned.
*/
void v4tov6(uint8_t * v6, uint8_t * v4)
{
- v6[0] = 0;
- v6[1] = 0;
- v6[2] = 0;
- v6[3] = 0;
- v6[4] = 0;
- v6[5] = 0;
- v6[6] = 0;
- v6[7] = 0;
- v6[8] = 0;
- v6[9] = 0;
- v6[10] = 0xff;
- v6[11] = 0xff;
- v6[12] = v4[0];
- v6[13] = v4[1];
- v6[14] = v4[2];
- v6[15] = v4[3];
+ uint32_t *v6p = (uint32_t *)v6;
+ uint32_t *v4p = (uint32_t *)v4;
+
+ v6p[0] = 0;
+ v6p[1] = 0;
+ v6p[2] = (unsigned int)PP_NTOHL(0xffff);
+ v6p[3] = v4p[0];
}

int v6tov4(uint8_t * v4, uint8_t * v6)
{
- if (v6[0] == 0
- && v6[1] == 0
- && v6[2] == 0
- && v6[3] == 0
- && v6[4] == 0
- && v6[5] == 0
- && v6[6] == 0
- && v6[7] == 0
- && v6[8] == 0 && v6[9] == 0 && v6[10] == 0xff && v6[11] == 0xff) {
- v4[0] = v6[12];
- v4[1] = v6[13];
- v4[2] = v6[14];
- v4[3] = v6[15];
+
+ uint32_t *v6p = (uint32_t *)v6;
+ uint32_t *v4p = (uint32_t *)v4;
+
+ if (v6p[0] == 0 && v6p[1] == 0 &&
+ v6p[2] == (unsigned int)PP_NTOHL(0xffff)) {
+ v4p[0] = v6p[3];
return 0;
} else {
- memset(v4, 0, 4);
+ v4p[0] = 0;
return -1;
}
}
diff --git a/kern/src/net/ipifc.c b/kern/src/net/ipifc.c
index ec4176a72781..428b7d2769b5 100644
--- a/kern/src/net/ipifc.c
+++ b/kern/src/net/ipifc.c
@@ -1258,7 +1258,7 @@ void findlocalip(struct Fs *f, uint8_t * local, uint8_t * remote)

qlock(&f->ipifc->qlock);
r = v6lookup(f, remote, NULL);
- version = (memcmp(remote, v4prefix, IPv4off) == 0) ? V4 : V6;
+ version = isv4(remote) ? V4 : V6;

if (r != NULL) {
ifc = r->rt.ifc;

Davide Libenzi

unread,
Oct 20, 2015, 2:06:58 PM10/20/15
to Akaros
In x86_64 world, Intel claims now (since Sandy IIRC) there is no extra cost for unaligned access.

Dan Cross

unread,
Oct 21, 2015, 9:27:01 AM10/21/15
to aka...@googlegroups.com
We must anticipate that x86_64 is not the only architecture we'll run on, though.

Davide Libenzi

unread,
Oct 21, 2015, 9:49:22 AM10/21/15
to Akaros
On Wed, Oct 21, 2015 at 6:26 AM, Dan Cross <cro...@gmail.com> wrote:
We must anticipate that x86_64 is not the only architecture we'll run on, though.

Another one is ARM v8/64.
When building a node software in trying to verify TCO on a vendor platform, we already had in our SW a memory access API layer [*], and when we asked the vendor if their performance figures would look better if we open coded (smaller loads plus shifts - in a dedicated ARM v8 layer) the u16, u32, and u64 unaligned accesses, they told us that even though there was a little price for unaligned access on v8, that would still be smaller than the total insns necessary for the open coded solution.
Basically they have microcode which does exactly that within the core, which ends up, for obvious reasons, to be faster.
We did not try it, but we trusted them, given than they were trying to scavenge clock cycles everywhere, in trying to make their TCO improve WRT Intel.

[*] In our SW we had always_inline APIs like: read_le16, read_le32, read_le64, read_be16, ..., write_le16, ...
    Many of them would just be return *ptr, or *ptr = value.



Xiao Jia

unread,
Oct 21, 2015, 6:08:31 PM10/21/15
to brho/akaros, akaros-notifier

I wrote a unit test to compare the results from ptclbsum and a much simpler (slower) checksum function. After reading the code again and again, I think it can be fixed by simply

diff --git a/kern/src/net/ptclbsum.c b/kern/src/net/ptclbsum.c
index eed828d..837d121 100644
--- a/kern/src/net/ptclbsum.c
+++ b/kern/src/net/ptclbsum.c
@@ -185,6 +185,8 @@ uint16_t ptclbsum(uint8_t * addr, int len)
     uint64_t sum = in_cksumdata(addr, len);
     union q_util q_util;
     union l_util l_util;
+    if ((uintptr_t)addr & 1)
+        sum <<= 8;
     REDUCE16;
     return cpu_to_be16(sum);
 }

At least this passes the unit test, and convinced myself.

Xiao Jia

unread,
Nov 12, 2015, 1:05:09 PM11/12/15
to brho/akaros, akaros-notifier

Fixes and tests were merged at f922e80...1165c2b

Xiao Jia

unread,
Nov 12, 2015, 1:05:09 PM11/12/15
to brho/akaros, akaros-notifier

Closed #23.

Reply all
Reply to author
Forward
0 new messages