tcp on mlx4

Xiao Jia

Oct 27, 2015, 8:37:10 PM
to aka...@googlegroups.com
This is a series of patches to run TCP on mlx4.

Current state of netperf TCP_STREAM from Akaros to Linux:

TCP STREAM TEST from (null) (0.0.0.0) port 0 AF_INET to (null) () port 0 AF_INET
netperf: create_data_socket: SO_REUSEADDR failed 92
[kernel] sys_close failed: proc 22 fd 5. Check your rets.
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380   1024   1024    10.00     688.33

The next step is to improve throughput. I will probably start with checksum
offloads, then move on to segmentation offloads and the Linux xmit_more
mechanism.

After that I will look at the RX side.

Xiao Jia

Oct 27, 2015, 8:37:49 PM
to aka...@googlegroups.com, Xiao Jia
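I was getting "Lock XXX tried to spin when it shouldn't" errors when
the kernel was compiled with spinlock debugging. Initialize the affected
locks with spinlock_init() instead of spinlock_init_irqsave(), and use
sem_init() instead of sema_init() for the command semaphores.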

Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/drivers/net/mlx4/alloc.c | 4 ++--
kern/drivers/net/mlx4/cmd.c | 6 +++---
kern/drivers/net/mlx4/mr.c | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kern/drivers/net/mlx4/alloc.c b/kern/drivers/net/mlx4/alloc.c
index da0b158..fdb26ee 100644
--- a/kern/drivers/net/mlx4/alloc.c
+++ b/kern/drivers/net/mlx4/alloc.c
@@ -179,7 +179,7 @@ int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, uint32_t num, uint32_t mask,
bitmap->reserved_top = reserved_top;
bitmap->avail = num - reserved_top - reserved_bot;
bitmap->effective_len = bitmap->avail;
- spinlock_init_irqsave(&bitmap->lock);
+ spinlock_init(&bitmap->lock);
bitmap->table = kzmalloc(BITS_TO_LONGS(bitmap->max) * sizeof(long),
KMALLOC_WAIT);
if (!bitmap->table)
@@ -227,7 +227,7 @@ struct mlx4_zone_allocator *mlx4_zone_allocator_create(enum mlx4_zone_alloc_flag

INIT_LIST_HEAD(&zones->entries);
INIT_LIST_HEAD(&zones->prios);
- spinlock_init_irqsave(&zones->lock);
+ spinlock_init(&zones->lock);
zones->last_uid = 0;
zones->mask = 0;
zones->flags = flags;
diff --git a/kern/drivers/net/mlx4/cmd.c b/kern/drivers/net/mlx4/cmd.c
index b75f9ab..5f86dbe 100644
--- a/kern/drivers/net/mlx4/cmd.c
+++ b/kern/drivers/net/mlx4/cmd.c
@@ -2483,7 +2483,7 @@ int mlx4_cmd_init(struct mlx4_dev *dev)

if (!priv->cmd.initialized) {
qlock_init(&priv->cmd.slave_cmd_mutex);
- sema_init(&priv->cmd.poll_sem, 1);
+ sem_init(&priv->cmd.poll_sem, 1);
priv->cmd.use_events = 0;
priv->cmd.toggle = 1;
priv->cmd.initialized = 1;
@@ -2624,8 +2624,8 @@ int mlx4_cmd_use_events(struct mlx4_dev *dev)
priv->cmd.context[priv->cmd.max_cmds - 1].next = -1;
priv->cmd.free_head = 0;

- sema_init(&priv->cmd.event_sem, priv->cmd.max_cmds);
- spinlock_init_irqsave(&priv->cmd.context_lock);
+ sem_init(&priv->cmd.event_sem, priv->cmd.max_cmds);
+ spinlock_init(&priv->cmd.context_lock);

for (priv->cmd.token_mask = 1;
priv->cmd.token_mask < priv->cmd.max_cmds;
diff --git a/kern/drivers/net/mlx4/mr.c b/kern/drivers/net/mlx4/mr.c
index 927cf70..1eddbb8 100644
--- a/kern/drivers/net/mlx4/mr.c
+++ b/kern/drivers/net/mlx4/mr.c
@@ -99,7 +99,7 @@ static int mlx4_buddy_init(struct mlx4_buddy *buddy, int max_order)
int i, s;

buddy->max_order = max_order;
- spinlock_init_irqsave(&buddy->lock);
+ spinlock_init(&buddy->lock);

buddy->bits = kzmalloc((buddy->max_order + 1) * (sizeof(long *)),
KMALLOC_WAIT);
--
2.6.0.rc2.230.g3dd15c0

Xiao Jia

Oct 27, 2015, 8:37:50 PM
to aka...@googlegroups.com, Xiao Jia
The original mlx4 driver in Linux uses module parameters to override
configuration values such as MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE. We don't
have module_param (yet), so for now let's use Kconfig for that purpose.

Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/drivers/net/Kconfig | 7 +++++++
kern/drivers/net/mlx4/mlx4.h | 4 ++++
2 files changed, 11 insertions(+)

diff --git a/kern/drivers/net/Kconfig b/kern/drivers/net/Kconfig
index 47b5869..1d43190 100644
--- a/kern/drivers/net/Kconfig
+++ b/kern/drivers/net/Kconfig
@@ -22,3 +22,10 @@ config MLX4_EN
config MLX4_CORE
tristate
default n
+
+config MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE
+ int "Default log mgm size (num of qp per mcg)"
+ depends on MLX4_CORE
+ default 10
+ help
+ To activate device managed flow steering when available, set to -1.
diff --git a/kern/drivers/net/mlx4/mlx4.h b/kern/drivers/net/mlx4/mlx4.h
index 6c34d04..84026e7 100644
--- a/kern/drivers/net/mlx4/mlx4.h
+++ b/kern/drivers/net/mlx4/mlx4.h
@@ -68,7 +68,11 @@ enum {
};

enum {
+#ifdef CONFIG_MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE
+ MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE = CONFIG_MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE,
+#else
MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE = 10,
+#endif
MLX4_MIN_MGM_LOG_ENTRY_SIZE = 7,
MLX4_MAX_MGM_LOG_ENTRY_SIZE = 12,
MLX4_MAX_QP_PER_MGM = 4 * ((1 << MLX4_MAX_MGM_LOG_ENTRY_SIZE) / 16 - 2),
--
2.6.0.rc2.230.g3dd15c0
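
The fallback pattern in this patch can be tried outside the kernel. Below is a minimal sketch, assuming the build passes CONFIG_MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE as a preprocessor define when the Kconfig option is set. The helper mgm_log_entry_size_valid() and the compile-time check are hypothetical additions for illustration; they are not the driver's actual validation logic.

```c
#include <assert.h>

/* Same fallback as the patch: use the Kconfig value when the build defines
 * it, otherwise keep the driver's built-in default of 10. */
enum {
#ifdef CONFIG_MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE
	MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE = CONFIG_MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE,
#else
	MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE = 10,
#endif
	MLX4_MIN_MGM_LOG_ENTRY_SIZE = 7,
	MLX4_MAX_MGM_LOG_ENTRY_SIZE = 12,
};

/* Hypothetical helper (not in the driver): accept -1, which the Kconfig help
 * text reserves for device managed flow steering, or a value in [MIN, MAX]. */
int mgm_log_entry_size_valid(int v)
{
	return v == -1 ||
	       (v >= MLX4_MIN_MGM_LOG_ENTRY_SIZE &&
	        v <= MLX4_MAX_MGM_LOG_ENTRY_SIZE);
}

/* Illustrative compile-time guard: a bad Kconfig value fails the build
 * instead of surfacing at runtime. */
static_assert(MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE == -1 ||
              (MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE >= MLX4_MIN_MGM_LOG_ENTRY_SIZE &&
               MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE <= MLX4_MAX_MGM_LOG_ENTRY_SIZE),
              "MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE out of range");
```

Compiling with -DCONFIG_MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE=-1 exercises the override path; without the define, the enum keeps the value 10.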

Xiao Jia

Oct 27, 2015, 8:37:51 PM
to aka...@googlegroups.com, Xiao Jia
Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/src/net/netif.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kern/src/net/netif.c b/kern/src/net/netif.c
index 0509090..ce6d586 100644
--- a/kern/src/net/netif.c
+++ b/kern/src/net/netif.c
@@ -281,7 +281,7 @@ netifread(struct ether *nif, struct chan *c, void *a, long n,
if (nif->feat & NETF_UDPCK)
j += snprintf(p + j, READSTR - j, "udpck ");
if (nif->feat & NETF_TCPCK)
- j += snprintf(p + j, READSTR - j, "tcppck ");
+ j += snprintf(p + j, READSTR - j, "tcpck ");
if (nif->feat & NETF_PADMIN)
j += snprintf(p + j, READSTR - j, "padmin ");
if (nif->feat & NETF_SG)
--
2.6.0.rc2.230.g3dd15c0
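
For readers unfamiliar with how netifread builds its feature list, here is a standalone sketch of the same append-with-snprintf pattern. The flag values and the READSTR size below are placeholders chosen for this example (the real NETF_* constants and READSTR live in the Akaros headers and may differ), and feat_to_str() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Placeholder flag values for this sketch only. */
enum {
	NETF_UDPCK  = 1 << 0,
	NETF_TCPCK  = 1 << 1,
	NETF_PADMIN = 1 << 2,
};

#define READSTR 64	/* assumed buffer size for this sketch */

/* Append one space-terminated token per feature bit, tracking the write
 * position in j as netifread does. Returns the number of characters written.
 * The buffer must be large enough for all tokens. */
int feat_to_str(unsigned int feat, char *p, size_t size)
{
	int j = 0;

	assert(size > 0);
	p[0] = '\0';
	if (feat & NETF_UDPCK)
		j += snprintf(p + j, size - j, "udpck ");
	if (feat & NETF_TCPCK)
		j += snprintf(p + j, size - j, "tcpck ");
	if (feat & NETF_PADMIN)
		j += snprintf(p + j, size - j, "padmin ");
	return j;
}
```

Any consumer that matches these tokens textually would have missed the misspelled "tcppck", which is presumably why the one-character fix matters.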

Xiao Jia

Oct 27, 2015, 8:37:51 PM
to aka...@googlegroups.com, Xiao Jia
Temporarily disable offload features until they are implemented.
This also lets us test TCP in the meantime.

Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/drivers/net/mlx4/en_netdev.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kern/drivers/net/mlx4/en_netdev.c b/kern/drivers/net/mlx4/en_netdev.c
index 39b0d69..e660376 100644
--- a/kern/drivers/net/mlx4/en_netdev.c
+++ b/kern/drivers/net/mlx4/en_netdev.c
@@ -2990,9 +2990,13 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
dev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;

dev->hw_features |= NETIF_F_RXCSUM | NETIF_F_RXHASH;
+#if 0 // AKAROS_PORT
dev->feat = dev->hw_features | NETIF_F_HIGHDMA |
NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_HW_VLAN_CTAG_FILTER;
+#else
+ dev->feat = NETIF_F_SG;
+#endif
dev->hw_features |= NETIF_F_LOOPBACK |
NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX;

--
2.6.0.rc2.230.g3dd15c0

Xiao Jia

Oct 27, 2015, 8:37:52 PM
to aka...@googlegroups.com, Xiao Jia
Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/src/net/ipaux.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kern/src/net/ipaux.c b/kern/src/net/ipaux.c
index 5c070b8..98a7bb9 100644
--- a/kern/src/net/ipaux.c
+++ b/kern/src/net/ipaux.c
@@ -252,8 +252,8 @@ uint16_t ptclcsum_one(struct block *bp, int offset, int len)
hisum += ptclbsum(addr, x);
else
losum += ptclbsum(addr, x);
+ odd = (odd + x) & 1;
len -= x;
-
}
losum += hisum >> 8;
losum += (hisum & 0xff) << 8;
--
2.6.0.rc2.230.g3dd15c0
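
To see why the parity has to be carried across fragments, here is a userspace sketch of the same hisum/losum scheme. It is a simplified model, not the Akaros code: frag_sum() stands in for ptclbsum, and struct frag and chain_sum() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct frag {
	const uint8_t *data;
	size_t len;
};

/* Big-endian 16-bit ones-complement sum of one contiguous fragment, folded
 * to 16 bits. The fragment's first byte is always treated as a high byte. */
static uint32_t frag_sum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)
		sum += (uint32_t)p[i] << 8;	/* trailing byte, zero padded */
	while (sum >> 16)
		sum = (sum >> 16) + (sum & 0xffff);
	return sum;
}

/* Sum a chain of fragments as if they were one contiguous buffer. */
uint16_t chain_sum(const struct frag *f, size_t n)
{
	uint32_t losum = 0, hisum = 0;
	int odd = 0;	/* parity of the byte offset where the next fragment starts */
	size_t k;

	assert(f != NULL || n == 0);
	for (k = 0; k < n; k++) {
		if (odd)
			hisum += frag_sum(f[k].data, f[k].len);
		else
			losum += frag_sum(f[k].data, f[k].len);
		/* The line the patch adds: without it, odd never changes. */
		odd = (odd + f[k].len) & 1;
	}
	/* Fragments that started at an odd offset have their bytes in the
	 * opposite lanes, so swap them while merging into losum. */
	losum += hisum >> 8;
	losum += (hisum & 0xff) << 8;
	while (losum >> 16)
		losum = (losum >> 16) + (losum & 0xffff);
	return (uint16_t)losum;
}
```

Splitting the seven bytes 0..6 into fragments of 3, 1, and 3 bytes gives 0x0c09, the same as summing the buffer in one piece. If odd is left at 0 for every fragment, the same split yields 0x0f06.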

Xiao Jia

Oct 27, 2015, 8:37:53 PM
to aka...@googlegroups.com, Xiao Jia
Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/src/net/ptclbsum.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kern/src/net/ptclbsum.c b/kern/src/net/ptclbsum.c
index eed828d..837d121 100644
--- a/kern/src/net/ptclbsum.c
+++ b/kern/src/net/ptclbsum.c
@@ -185,6 +185,8 @@ uint16_t ptclbsum(uint8_t * addr, int len)
uint64_t sum = in_cksumdata(addr, len);
union q_util q_util;
union l_util l_util;
+ if ((uintptr_t)addr & 1)
+ sum <<= 8;
REDUCE16;
return cpu_to_be16(sum);
}
--
2.6.0.rc2.230.g3dd15c0
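
The "sum <<= 8" before REDUCE16 relies on a property of ones-complement arithmetic: shifting an accumulator left by 8 bits and then folding gives the same result as folding first and byte-swapping afterwards. The sketch below checks that property in userspace. It assumes, without having confirmed it against in_cksumdata, that a buffer starting on an odd address ends up with its bytes accumulated in the opposite lanes; fold16(), swap16(), and fix_odd_start() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a wide accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)sum;
}

static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Correct a sum whose bytes landed in the opposite lanes: shift, then fold.
 * The top byte of the accumulator must be clear so the shift cannot overflow. */
uint16_t fix_odd_start(uint64_t sum)
{
	assert((sum >> 56) == 0);
	return fold16(sum << 8);
}

/* Returns 1 if shift-then-fold equals a plain byte swap for every 16-bit sum. */
int shift_matches_swap(void)
{
	uint32_t v;

	for (v = 0; v <= 0xffff; v++)
		if (fix_odd_start(v) != swap16((uint16_t)v))
			return 0;
	return 1;
}
```

For example, fix_odd_start(0x0102) yields 0x0201, and the unfolded value 0x1abcd yields 0xceab, which is the byte swap of its folded form 0xabce.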

Xiao Jia

Oct 27, 2015, 8:37:54 PM
to aka...@googlegroups.com, Xiao Jia
Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/include/ip.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kern/include/ip.h b/kern/include/ip.h
index dd79a6f..c1df2db 100644
--- a/kern/include/ip.h
+++ b/kern/include/ip.h
@@ -645,6 +645,10 @@ static inline void ptclcsum_finalize(struct block *bp, unsigned int feat)

if (flag && (flag & feat) != flag) {
csum_store = bp->rp + bp->checksum_start + bp->checksum_offset;
+ /* NOTE pseudo-header partial checksum (if any) is already placed at
+ * csum_store (e.g. tcpcksum), and the ptclcsum() below will include
+ * that partial checksum as part of the calculation.
+ */
hnputs((uint16_t *)csum_store,
ptclcsum(bp, bp->checksum_start,
BLEN(bp) - bp->checksum_start));
--
2.6.0.rc2.230.g3dd15c0
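
The NOTE in this patch can be checked with a small userspace model. The sketch below computes a TCP-style checksum two ways: once by zeroing the checksum field and summing pseudo-header plus segment, and once by first storing the folded pseudo-header sum in the checksum field and then summing only the segment. It assumes the stored partial sum is the uncomplemented folded value and that the final result is complemented; the function names are hypothetical, and none of this is the Akaros implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate big-endian 16-bit words onto sum. Suitable for short buffers
 * only; an odd trailing byte is zero padded, so chain even-length pieces. */
static uint32_t sum_be16(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)
		sum += (uint32_t)p[i] << 8;
	return sum;
}

static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)sum;
}

/* Reference: zero the checksum field, sum pseudo-header and segment. */
uint16_t csum_reference(const uint8_t *pseudo, size_t plen,
                        uint8_t *seg, size_t slen, size_t csum_off)
{
	assert((csum_off & 1) == 0 && csum_off + 1 < slen);
	seg[csum_off] = 0;
	seg[csum_off + 1] = 0;
	return (uint16_t)~fold16(sum_be16(seg, slen, sum_be16(pseudo, plen, 0)));
}

/* Seeded: place the pseudo-header partial sum in the checksum field, then
 * sum the segment alone. The partial sum rides along as ordinary data. */
uint16_t csum_seeded(const uint8_t *pseudo, size_t plen,
                     uint8_t *seg, size_t slen, size_t csum_off)
{
	uint16_t partial = fold16(sum_be16(pseudo, plen, 0));

	assert((csum_off & 1) == 0 && csum_off + 1 < slen);
	seg[csum_off] = (uint8_t)(partial >> 8);
	seg[csum_off + 1] = (uint8_t)(partial & 0xff);
	return (uint16_t)~fold16(sum_be16(seg, slen, 0));
}
```

Both functions return the same value for the same inputs, which is the point of the comment: a checksum pass over the segment that includes the seeded field already accounts for the pseudo-header.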

Xiao Jia

Oct 27, 2015, 8:37:54 PM
to aka...@googlegroups.com, Xiao Jia
KERNEL_POSTBOOT_TESTING is never defined, so when USERSPACE_TESTING is
disabled we don't run any unit tests at all, which is not right.

Fix it by letting the tests run whenever KERNEL_TESTING is present.

Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/src/manager.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kern/src/manager.c b/kern/src/manager.c
index cd3c5bf..e87cc0d 100644
--- a/kern/src/manager.c
+++ b/kern/src/manager.c
@@ -42,7 +42,7 @@ void manager(void)
#define MANAGER_FUNC(dev) PASTE(manager_,dev)

#if !defined(DEVELOPER_NAME) && \
- (defined(CONFIG_KERNEL_POSTBOOT_TESTING) || \
+ (defined(CONFIG_KERNEL_TESTING) || \
defined(CONFIG_USERSPACE_TESTING))
#define DEVELOPER_NAME jenkins
#endif
--
2.6.0.rc2.230.g3dd15c0

Xiao Jia

Oct 27, 2015, 8:37:55 PM
to aka...@googlegroups.com, Xiao Jia
Currently it includes a unit test for ptclbsum and two checksum
benchmarks: a simple baseline and ptclbsum itself.

Signed-off-by: Xiao Jia <stf...@gmail.com>
---
kern/src/ktest/Kbuild | 1 +
kern/src/ktest/Kconfig.kernel | 2 +-
kern/src/ktest/Kconfig.net | 19 +++++++++
kern/src/ktest/net_ktests.c | 94 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 115 insertions(+), 1 deletion(-)
create mode 100644 kern/src/ktest/Kconfig.net
create mode 100644 kern/src/ktest/net_ktests.c

diff --git a/kern/src/ktest/Kbuild b/kern/src/ktest/Kbuild
index 514bc6e..88e4403 100644
--- a/kern/src/ktest/Kbuild
+++ b/kern/src/ktest/Kbuild
@@ -1,2 +1,3 @@
obj-y += ktest.o
obj-$(CONFIG_PB_KTESTS) += pb_ktests.o
+obj-$(CONFIG_NET_KTESTS) += net_ktests.o
diff --git a/kern/src/ktest/Kconfig.kernel b/kern/src/ktest/Kconfig.kernel
index ad96486..b2e333c 100644
--- a/kern/src/ktest/Kconfig.kernel
+++ b/kern/src/ktest/Kconfig.kernel
@@ -5,4 +5,4 @@ menuconfig KERNEL_TESTING
Run unit tests for the kernel

source "kern/src/ktest/Kconfig.postboot"
-
+source "kern/src/ktest/Kconfig.net"
diff --git a/kern/src/ktest/Kconfig.net b/kern/src/ktest/Kconfig.net
new file mode 100644
index 0000000..0b2f593
--- /dev/null
+++ b/kern/src/ktest/Kconfig.net
@@ -0,0 +1,19 @@
+menuconfig NET_KTESTS
+ depends on KERNEL_TESTING
+ bool "Networking unit tests"
+ default y
+
+config TEST_ptclbsum
+ depends on NET_KTESTS
+ bool "Unit tests for ptclbsum"
+ default y
+
+config TEST_simplesum_bench
+ depends on NET_KTESTS
+ bool "Checksum benchmark: baseline"
+ default y
+
+config TEST_ptclbsum_bench
+ depends on NET_KTESTS
+ bool "Checksum benchmark: ptclbsum"
+ default y
diff --git a/kern/src/ktest/net_ktests.c b/kern/src/ktest/net_ktests.c
new file mode 100644
index 0000000..ba4a720
--- /dev/null
+++ b/kern/src/ktest/net_ktests.c
@@ -0,0 +1,94 @@
+#include <ip.h>
+#include <ktest.h>
+#include <linker_func.h>
+
+KTEST_SUITE("NET")
+
+static uint16_t simplesum(const uint8_t *buf, int len)
+{
+ uint64_t hi = 0, lo = 0, sum;
+ int i;
+
+ for (i = 0; i < len; i++) {
+ if (i % 2 == 0)
+ hi += buf[i];
+ else
+ lo += buf[i];
+ }
+ sum = (hi << 8) + lo;
+ while (sum >> 16)
+ sum = (sum >> 16) + (sum & 0xffff);
+ return sum & 0xffff;
+}
+
+bool test_ptclbsum(void)
+{
+ uint16_t csum, expected;
+ uint8_t buf[100];
+ int i, j, len;
+
+ for (i = 0; i < sizeof(buf); i++)
+ buf[i] = i & 0xff;
+ for (i = 0; i < sizeof(buf); i++) {
+ for (j = i; j < sizeof(buf); j++) {
+ len = j - i + 1;
+ csum = ptclbsum(buf + i, len);
+ expected = simplesum(buf + i, len);
+ if (csum != expected) {
+ printk("i %d j %d len %d csum %04x expected %04x\n",
+ i, j, len, csum, expected);
+ return false;
+ }
+ }
+ }
+ return true;
+}
+
+#define CSUM_BENCH_BUFSIZE 4000
+
+bool test_simplesum_bench(void)
+{
+ uint8_t buf[CSUM_BENCH_BUFSIZE];
+ uint16_t csum = 0;
+ int i, j, len;
+
+ for (i = 0; i < sizeof(buf); i++)
+ buf[i] = i & 0xff;
+ for (i = 0; i < sizeof(buf); i++) {
+ for (j = i; j < sizeof(buf); j++) {
+ len = j - i + 1;
+ csum += simplesum(buf + i, len);
+ }
+ }
+ return true;
+}
+
+bool test_ptclbsum_bench(void)
+{
+ uint8_t buf[CSUM_BENCH_BUFSIZE];
+ uint16_t csum = 0;
+ int i, j, len;
+
+ for (i = 0; i < sizeof(buf); i++)
+ buf[i] = i & 0xff;
+ for (i = 0; i < sizeof(buf); i++) {
+ for (j = i; j < sizeof(buf); j++) {
+ len = j - i + 1;
+ csum += ptclbsum(buf + i, len);
+ }
+ }
+ return true;
+}
+
+static struct ktest ktests[] = {
+ KTEST_REG(ptclbsum, CONFIG_TEST_ptclbsum),
+ KTEST_REG(simplesum_bench, CONFIG_TEST_simplesum_bench),
+ KTEST_REG(ptclbsum_bench, CONFIG_TEST_ptclbsum_bench),
+};
+
+static int num_ktests = sizeof(ktests) / sizeof(struct ktest);
+
+linker_func_1(register_net_ktests)
+{
+ REGISTER_KTESTS(ktests, num_ktests);
+}
--
2.6.0.rc2.230.g3dd15c0

Xiao Jia

Oct 27, 2015, 8:41:23 PM
to aka...@googlegroups.com
Hmm, this format looks different from what Barret was sending.

I was using

git send-email --compose --no-chain-reply-to --to=aka...@googlegroups.com origin/master..stfairy/tcp

Below is the output from git request-pull

---

The following changes since commit 6f3723cd8f883260a78fdf411911d7469464caa5:

  Update file-posix.c utest (2015-10-15 12:07:00 -0400)

are available in the git repository at:

  stfairy/tcp 

for you to fetch changes up to 7c9f9d267fd6826c077cc1e5b8bd9b131e72395c:

  Add networking unit tests (2015-10-27 17:18:49 -0700)

----------------------------------------------------------------
Xiao Jia (9):
      mlx4: Fix lock initializations
      mlx4: Allow override MLX4_DEFAULT_MGM_LOG_ENTRY_SIZE
      mlx4: Temporarily disable offload features
      Fix typo for TCP checksum offload feature
      Fix ptclcsum_one to adjust odd
      Fix ptclbsum to handle odd offsets
      Explain why ptclcsum_finalize is correct
      Fix manager to run tests if KERNEL_TESTING is set
      Add networking unit tests

 kern/drivers/net/Kconfig          |  7 +++++++
 kern/drivers/net/mlx4/alloc.c     |  4 ++--
 kern/drivers/net/mlx4/cmd.c       |  6 +++---
 kern/drivers/net/mlx4/en_netdev.c |  4 ++++
 kern/drivers/net/mlx4/mlx4.h      |  4 ++++
 kern/drivers/net/mlx4/mr.c        |  2 +-
 kern/include/ip.h                 |  4 ++++
 kern/src/ktest/Kbuild             |  1 +
 kern/src/ktest/Kconfig.kernel     |  2 +-
 kern/src/ktest/Kconfig.net        | 19 +++++++++++++++++++
 kern/src/ktest/net_ktests.c       | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kern/src/manager.c                |  2 +-
 kern/src/net/ipaux.c              |  2 +-
 kern/src/net/netif.c              |  2 +-
 kern/src/net/ptclbsum.c           |  2 ++
 15 files changed, 145 insertions(+), 10 deletions(-)
 create mode 100644 kern/src/ktest/Kconfig.net
 create mode 100644 kern/src/ktest/net_ktests.c

Xiao Jia

Oct 27, 2015, 8:46:42 PM
to aka...@googlegroups.com
To view it on GitHub: https://github.com/brho/akaros/compare/6f3723...7c9f9d

I created the link manually.  Is there a way to automate this?

Barret Rhoden

Oct 28, 2015, 5:52:55 PM
to aka...@googlegroups.com
On 2015-10-27 at 17:41 Xiao Jia <stf...@gmail.com> wrote:
> Hmm, this format looks different from what Barret was sending.
>
> I was using
>
> git send-email --compose --no-chain-reply-to
> --to=aka...@googlegroups.com origin/master..stfairy/tcp

I used --cover-letter --annotate instead of --compose. --annotate is
in my gitconfig. --cover-letter is something I did manually.

I also did a -M (very minor, it detects renames).

As far as automating it goes, I use the attached script. The guts of
it are:

FROM_SHA=`git log --format=format:%h -1 $1`
TO_SHA=`git log --format=format:%h -1 $2`

from which you can build the github URL or anything you'd like.


Example:

~/scripts/pre-se.sh master staging

------------
You can also find these patches at:
g...@github.com:brho/akaros.git
FROM: 5025b06858b5 master
TO: 925438d89d76 staging

And view them at:
https://github.com/brho/akaros/compare/5025b06858b5...925438d89d76


I also dropped the "ts=4" option from the URL (compared to my email
from yesterday). That option didn't seem to work for me after I clicked
on a commit.

Barret

pre-se.sh

Barret Rhoden

Nov 3, 2015, 12:02:00 PM
to aka...@googlegroups.com
Thanks, merged to staging at 6237090137f8..1165c2bda44b (from, to]

You can see the entire diff with 'git diff' or at
https://github.com/brho/akaros/compare/6237090137f8...1165c2bda44b