in_pcb: extract union in_dependaddr

Waldemar Kozaczuk

Aug 22, 2021, 12:23:22 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
This patch is a small subset of the original FreeBSD change -
https://reviews.freebsd.org/rS334719 - which added support for
the new SO_REUSEPORT_LB socket option.

The changes below do not affect any functionality; they simply
extract a new union in_dependaddr, which will be used by the
load balancing group struct defined in the forthcoming patch.

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
bsd/sys/netinet/in_pcb.h | 25 +++++++++++--------------
bsd/sys/netinet/tcp_input.cc | 4 ++--
2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/bsd/sys/netinet/in_pcb.h b/bsd/sys/netinet/in_pcb.h
index 92716691..85df54d6 100644
--- a/bsd/sys/netinet/in_pcb.h
+++ b/bsd/sys/netinet/in_pcb.h
@@ -79,6 +79,11 @@ struct in_addr_4in6 {
struct in_addr ia46_addr4;
};

+union in_dependaddr {
+ struct in_addr_4in6 id46_addr;
+ struct in6_addr id6_addr;
+};
+
/*
* NOTE: ipv6 addrs should be 64-bit aligned, per RFC 2553. in_conninfo has
* some extra padding to accomplish this.
@@ -87,21 +92,13 @@ struct in_endpoints {
u_int16_t ie_fport; /* foreign port */
u_int16_t ie_lport; /* local port */
/* protocol dependent part, local and foreign addr */
- union {
- /* foreign host table entry */
- struct in_addr_4in6 ie46_foreign;
- struct in6_addr ie6_foreign;
- } ie_dependfaddr;
- union {
- /* local host table entry */
- struct in_addr_4in6 ie46_local;
- struct in6_addr ie6_local;
- } ie_dependladdr;
+ union in_dependaddr ie_dependfaddr; /* foreign host table entry */
+ union in_dependaddr ie_dependladdr; /* local host table entry */
+#define ie_faddr ie_dependfaddr.id46_addr.ia46_addr4
+#define ie_laddr ie_dependladdr.id46_addr.ia46_addr4
+#define ie6_faddr ie_dependfaddr.id6_addr
+#define ie6_laddr ie_dependladdr.id6_addr
};
-#define ie_faddr ie_dependfaddr.ie46_foreign.ia46_addr4
-#define ie_laddr ie_dependladdr.ie46_local.ia46_addr4
-#define ie6_faddr ie_dependfaddr.ie6_foreign
-#define ie6_laddr ie_dependladdr.ie6_local

/*
* XXX The defines for inc_* are hacks and should be changed to direct
diff --git a/bsd/sys/netinet/tcp_input.cc b/bsd/sys/netinet/tcp_input.cc
index c7100827..a0aa4f1f 100644
--- a/bsd/sys/netinet/tcp_input.cc
+++ b/bsd/sys/netinet/tcp_input.cc
@@ -3218,8 +3218,8 @@ static ipv4_tcp_conn_id tcp_connection_id(tcpcb* tp)
{
auto& conn = tp->t_inpcb->inp_inc.inc_ie;
return {
- conn.ie_dependfaddr.ie46_foreign.ia46_addr4,
- conn.ie_dependladdr.ie46_local.ia46_addr4,
+ conn.ie_dependfaddr.id46_addr.ia46_addr4,
+ conn.ie_dependladdr.id46_addr.ia46_addr4,
ntohs(conn.ie_fport),
ntohs(conn.ie_lport)
};
--
2.31.1

Waldemar Kozaczuk

Aug 22, 2021, 12:23:25 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
This patch is a manual back-port of the original FreeBSD
patch https://reviews.freebsd.org/rS334719. The FreeBSD patch
adds support for the SO_REUSEPORT_LB socket option, whereas the one
below implements the Linux flavor of SO_REUSEPORT, which in effect
borrows a good chunk of the FreeBSD implementation.

Please note that the FreeBSD committers decided to retain support for the
original SO_REUSEPORT option and add a new one, SO_REUSEPORT_LB. The new
option exhibits the same behavior as the older one but adds an important new
feature: load balancing across listener sockets sharing the same port. The
FreeBSD manual states this:

"SO_REUSEPORT_LB allows completely duplicate bindings by multiple
processes if they all set SO_REUSEPORT_LB before binding the port. Incoming
TCP and UDP connections are distributed among the sharing processes based
on a hash function of local port number, foreign IP address and port
number. A maximum of 256 processes can share one socket."

So most of the original patch was back-ported as-is, except for the
conditional logic that accounts for both SO_REUSEPORT and SO_REUSEPORT_LB.
That logic is unnecessary for OSv, which implements the Linux API, and Linux
only supports the SO_REUSEPORT option. In addition, in some places I had to
change C code to use C++ constructs, just like in other places in in_pcb.cc.

The bulk of the patch below adds the definition of struct inpcblbgroup and
the functions that allocate, deallocate and manipulate it in order to manage
load balancing groups, including adding and removing member sockets or, more
specifically, their PCBs (Protocol Control Blocks):

(Internal API)
- struct inpcblbgroup *in_pcblbgroup_alloc() - allocates new LB group

- void in_pcblbgroup_free(struct inpcblbgroup *grp) - frees existing LB group

- struct inpcblbgroup *in_pcblbgroup_resize(struct inpcblbgrouphead *hdr, struct inpcblbgroup *old_grp, int size) - creates new LB group that is a resized version of the old one

- void in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp, int i) - removes the PCB at index 'i' from the group, pulls up the ones below il_inp[i] and shrinks the group if possible

(Public API)
- int in_pcbinslbgrouphash(struct inpcb *inp) - adds PCB member to the LB group for SO_REUSEPORT option (allocate new LB group if necessary)

- void in_pcbremlbgrouphash(struct inpcb *inp) - removes PCB from load balance group (free existing LB group if last member)

- struct inpcb *in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo, const struct in_addr *laddr, uint16_t lport, const struct in_addr *faddr, uint16_t fport, int lookupflags) - looks up inpcb in a load balancing group
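
The sizing policy that in_pcbinslbgrouphash() and in_pcblbgroup_reorder() apply to a group can be summarized with a small stand-alone model. This is only a sketch of the rules as they appear in the diff below; the lb_size_after_insert() and lb_size_after_remove() helpers and the LBGROUP_* names are invented for this illustration and do not exist in the patch.

```c
#define LBGROUP_SIZMIN 8     /* mirrors INPCBLBGROUP_SIZMIN */
#define LBGROUP_SIZMAX 256   /* mirrors INPCBLBGROUP_SIZMAX */

/*
 * Capacity of a group after one more member is added.
 * 'size' is the current capacity, 'count' the current number of members.
 * Returns 0 when the group is full at the maximum size, in which case the
 * new member is not added to the group.
 */
static unsigned lb_size_after_insert(unsigned size, unsigned count)
{
    if (count < size)
        return size;            /* still room, no resize needed */
    if (size >= LBGROUP_SIZMAX)
        return 0;               /* limit reached */
    return size * 2;            /* full: double the capacity */
}

/*
 * Capacity of a group after one member is removed.
 * 'count' is the number of members before the removal and must be at
 * least 2, because removing the last member frees the whole group.
 */
static unsigned lb_size_after_remove(unsigned size, unsigned count)
{
    unsigned remaining = count - 1;

    if (size > LBGROUP_SIZMIN && remaining <= size / 4)
        return size / 2;        /* at most a quarter used: halve */
    return size;
}
```

For example, a group of capacity 16 that drops from 5 to 4 members is halved to 8, while a group already at the minimum capacity of 8 is never shrunk further.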

The remaining part of the patch modifies the relevant parts of in_pcb.cc to:

1) add logic to add and remove inpcb members to/from LB groups by
delegating to in_pcbinslbgrouphash() and in_pcbremlbgrouphash() during
setup and teardown of sockets and their PCBs

2) add logic to look up PCBs (and the relevant sockets) by delegating to
in_pcblookup_lbgroup()
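
How the lookup spreads connections across group members can be sketched in isolation as follows. The hash expression mirrors the INP_PCBLBGROUP_PKTHASH macro from the diff below; the function names lb_pkthash() and lb_pick_member() are hypothetical and used only for this illustration. Addresses and ports are assumed to be in network byte order, as they are in the PCB.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hash of the foreign address and both ports, same expression as
 * INP_PCBLBGROUP_PKTHASH in the patch. */
static uint32_t lb_pkthash(uint32_t faddr, uint16_t lport, uint16_t fport)
{
    return faddr ^ (faddr >> 16) ^ ntohs((uint16_t)(lport ^ fport));
}

/*
 * Index of the group member that receives a given connection.
 * 'member_count' must be greater than zero. The same 3-tuple always maps
 * to the same index as long as the number of members does not change.
 */
static uint32_t lb_pick_member(uint32_t faddr, uint16_t lport,
    uint16_t fport, uint32_t member_count)
{
    return lb_pkthash(faddr, lport, fport) % member_count;
}
```

Because the index depends on the current member count, adding or removing a listener can remap connections from existing peers to a different member.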

This patch does not add any new locking, apart from assertions in some
places that verify certain locks are held when the functions are called.

Please note that at some point during the review process, the original
version of the FreeBSD patch contained logic originating from
DragonFlyBSD (https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272)
to handle a drawback where a crash of a process/thread using SO_REUSEPORT
would cause pending sockets in its completed and incomplete connection queues
to be dropped. But due to concerns about the locking logic, this part
was removed from the patch (https://reviews.freebsd.org/D11003#326149)
and is therefore also absent from the patch below. I believe Linux
also does not handle this drawback correctly as of now.

From a practical standpoint, this patch greatly improves the throughput
of applications using SO_REUSEPORT. More specifically, this HTTP
server example implemented in Rust -
https://gist.github.com/alexcrichton/7b97beda66d5e9b10321207cd69afbbc -
yields much better performance in SMP mode. The difference is most
pronounced with 4 CPUs (about 50% more requests per second), while the
2 CPU numbers are essentially unchanged:

Req/sec BEFORE this patch:
2 CPU - 82199.52
4 CPU - 97982.16

AFTER this patch:
2 CPU - 82361.77
4 CPU - 147389.79

Finally, note that this patch does not change any of the non-load-balancing
aspects of the SO_REUSEPORT option that were already in place
before this patch but inactive. More specifically, these relate
to how the SO_REUSEADDR and/or SO_REUSEPORT flags drive the
address and/or port collision logic.

Some articles about SO_REUSEPORT:
- https://lwn.net/Articles/542629/
- https://linuxjedi.co.uk/2020/04/25/socket-so_reuseport-and-kernel-implementations/

Fixes #1170

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
bsd/sys/compat/linux/linux.h | 1 +
bsd/sys/compat/linux/linux_socket.cc | 2 +
bsd/sys/netinet/in_pcb.cc | 285 +++++++++++++++++++++++++++
bsd/sys/netinet/in_pcb.h | 32 +++
4 files changed, 320 insertions(+)

diff --git a/bsd/sys/compat/linux/linux.h b/bsd/sys/compat/linux/linux.h
index 7bc8c509..1e6116aa 100644
--- a/bsd/sys/compat/linux/linux.h
+++ b/bsd/sys/compat/linux/linux.h
@@ -89,6 +89,7 @@ typedef struct {
#define LINUX_SO_NO_CHECK 11
#define LINUX_SO_PRIORITY 12
#define LINUX_SO_LINGER 13
+#define LINUX_SO_REUSEPORT 15
#define LINUX_SO_PEERCRED 17
#define LINUX_SO_RCVLOWAT 18
#define LINUX_SO_SNDLOWAT 19
diff --git a/bsd/sys/compat/linux/linux_socket.cc b/bsd/sys/compat/linux/linux_socket.cc
index 540b5477..cee3993b 100644
--- a/bsd/sys/compat/linux/linux_socket.cc
+++ b/bsd/sys/compat/linux/linux_socket.cc
@@ -340,6 +340,8 @@ linux_to_bsd_so_sockopt(int opt)
return (SO_OOBINLINE);
case LINUX_SO_LINGER:
return (SO_LINGER);
+ case LINUX_SO_REUSEPORT:
+ return (SO_REUSEPORT);
case LINUX_SO_RCVLOWAT:
return (SO_RCVLOWAT);
case LINUX_SO_SNDLOWAT:
diff --git a/bsd/sys/netinet/in_pcb.cc b/bsd/sys/netinet/in_pcb.cc
index 0f62561b..530464c2 100644
--- a/bsd/sys/netinet/in_pcb.cc
+++ b/bsd/sys/netinet/in_pcb.cc
@@ -85,6 +85,9 @@

#include <osv/trace.hh>

+#define INPCBLBGROUP_SIZMIN 8
+#define INPCBLBGROUP_SIZMAX 256
+
TRACEPOINT(trace_inpcb_ref, "inp=%x", struct inpcb *);
TRACEPOINT(trace_inpcb_rele, "inp=%x", struct inpcb *);
TRACEPOINT(trace_inpcb_free, "inp=%x", struct inpcb *);
@@ -199,6 +202,202 @@ SYSCTL_VNET_INT(_net_inet_ip_portrange, OID_AUTO, randomtime, CTLFLAG_RW,
* functions often modify hash chains or addresses in pcbs.
*/

+static struct inpcblbgroup *
+in_pcblbgroup_alloc(struct inpcblbgrouphead *hdr, u_char vflag,
+ uint16_t port, const union in_dependaddr *addr, int size)
+{
+ struct inpcblbgroup *grp;
+ size_t bytes;
+
+ bytes = __offsetof(struct inpcblbgroup, il_inp[size]);
+ grp = (struct inpcblbgroup *)malloc(bytes);
+ if (!grp)
+ return (NULL);
+ grp->il_vflag = vflag;
+ grp->il_lport = port;
+ grp->il_dependladdr = *addr;
+ grp->il_inpsiz = size;
+ LIST_INSERT_HEAD(hdr, grp, il_list);
+ return (grp);
+}
+
+static void
+in_pcblbgroup_free(struct inpcblbgroup *grp)
+{
+
+ LIST_REMOVE(grp, il_list);
+ free(grp);
+}
+
+static struct inpcblbgroup *
+in_pcblbgroup_resize(struct inpcblbgrouphead *hdr,
+ struct inpcblbgroup *old_grp, int size)
+{
+ struct inpcblbgroup *grp;
+ int i;
+
+ grp = in_pcblbgroup_alloc(hdr, old_grp->il_vflag,
+ old_grp->il_lport, &old_grp->il_dependladdr, size);
+ if (!grp)
+ return (NULL);
+
+ KASSERT(old_grp->il_inpcnt < grp->il_inpsiz,
+ ("invalid new local group size %d and old local group count %d",
+ grp->il_inpsiz, old_grp->il_inpcnt));
+
+ for (i = 0; i < old_grp->il_inpcnt; ++i)
+ grp->il_inp[i] = old_grp->il_inp[i];
+ grp->il_inpcnt = old_grp->il_inpcnt;
+ in_pcblbgroup_free(old_grp);
+ return (grp);
+}
+
+/*
+ * PCB at index 'i' is removed from the group. Pull up the ones below il_inp[i]
+ * and shrink group if possible.
+ */
+static void
+in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp,
+ int i)
+{
+ struct inpcblbgroup *grp = *grpp;
+
+ for (; i + 1 < grp->il_inpcnt; ++i)
+ grp->il_inp[i] = grp->il_inp[i + 1];
+ grp->il_inpcnt--;
+
+ if (grp->il_inpsiz > INPCBLBGROUP_SIZMIN &&
+ grp->il_inpcnt <= (grp->il_inpsiz / 4)) {
+ /* Shrink this group. */
+ struct inpcblbgroup *new_grp =
+ in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz / 2);
+ if (new_grp)
+ *grpp = new_grp;
+ }
+ return;
+}
+
+/*
+ * Add PCB to load balance group for SO_REUSEPORT option.
+ */
+static int
+in_pcbinslbgrouphash(struct inpcb *inp)
+{
+ struct inpcbinfo *pcbinfo;
+ struct inpcblbgrouphead *hdr;
+ struct inpcblbgroup *grp;
+ uint16_t hashmask, lport;
+ uint32_t group_index;
+ static int limit_logged = 0;
+
+ pcbinfo = inp->inp_pcbinfo;
+
+ INP_LOCK_ASSERT(inp);
+ INP_HASH_WLOCK_ASSERT(pcbinfo);
+
+ if (pcbinfo->ipi_lbgrouphashbase == NULL)
+ return (0);
+
+ hashmask = pcbinfo->ipi_lbgrouphashmask;
+ lport = inp->inp_lport;
+ group_index = INP_PCBLBGROUP_PORTHASH(lport, hashmask);
+ hdr = &pcbinfo->ipi_lbgrouphashbase[group_index];
+
+#ifdef INET6
+ /*
+ * Don't allow IPv4 mapped INET6 wild socket.
+ */
+ if ((inp->inp_vflag & INP_IPV4) &&
+ inp->inp_laddr.s_addr == INADDR_ANY &&
+ INP_CHECK_SOCKAF(inp->inp_socket, AF_INET6)) {
+ return (0);
+ }
+#endif
+
+ hdr = &pcbinfo->ipi_lbgrouphashbase[
+ INP_PCBLBGROUP_PORTHASH(inp->inp_lport,
+ pcbinfo->ipi_lbgrouphashmask)];
+ LIST_FOREACH(grp, hdr, il_list) {
+ if (grp->il_vflag == inp->inp_vflag &&
+ grp->il_lport == inp->inp_lport &&
+ memcmp(&grp->il_dependladdr,
+ &inp->inp_inc.inc_ie.ie_dependladdr,
+ sizeof(grp->il_dependladdr)) == 0) {
+ break;
+ }
+ }
+ if (grp == NULL) {
+ /* Create new load balance group. */
+ grp = in_pcblbgroup_alloc(hdr, inp->inp_vflag,
+ inp->inp_lport, &inp->inp_inc.inc_ie.ie_dependladdr,
+ INPCBLBGROUP_SIZMIN);
+ if (!grp)
+ return (ENOBUFS);
+ } else if (grp->il_inpcnt == grp->il_inpsiz) {
+ if (grp->il_inpsiz >= INPCBLBGROUP_SIZMAX) {
+ if (!limit_logged) {
+ limit_logged = 1;
+ printf("lb group port %d, limit reached\n",
+ ntohs(grp->il_lport));
+ }
+ return (0);
+ }
+
+ /* Expand this local group. */
+ grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz * 2);
+ if (!grp)
+ return (ENOBUFS);
+ }
+
+ KASSERT(grp->il_inpcnt < grp->il_inpsiz,
+ ("invalid local group size %d and count %d",
+ grp->il_inpsiz, grp->il_inpcnt));
+
+ grp->il_inp[grp->il_inpcnt] = inp;
+ grp->il_inpcnt++;
+ return (0);
+}
+
+/*
+ * Remove PCB from load balance group.
+ */
+static void
+in_pcbremlbgrouphash(struct inpcb *inp)
+{
+ struct inpcbinfo *pcbinfo;
+ struct inpcblbgrouphead *hdr;
+ struct inpcblbgroup *grp;
+ int i;
+
+ pcbinfo = inp->inp_pcbinfo;
+
+ INP_LOCK_ASSERT(inp);
+ INP_HASH_WLOCK_ASSERT(pcbinfo);
+
+ if (pcbinfo->ipi_lbgrouphashbase == NULL)
+ return;
+
+ hdr = &pcbinfo->ipi_lbgrouphashbase[
+ INP_PCBLBGROUP_PORTHASH(inp->inp_lport,
+ pcbinfo->ipi_lbgrouphashmask)];
+
+ LIST_FOREACH(grp, hdr, il_list) {
+ for (i = 0; i < grp->il_inpcnt; ++i) {
+ if (grp->il_inp[i] != inp)
+ continue;
+
+ if (grp->il_inpcnt == 1) {
+ /* We are the last, free this local group. */
+ in_pcblbgroup_free(grp);
+ } else {
+ /* Pull up inpcbs, shrink group if possible. */
+ in_pcblbgroup_reorder(hdr, &grp, i);
+ }
+ return;
+ }
+ }
+}
+
/*
* Initialize an inpcbinfo -- we should be able to reduce the number of
* arguments in time.
@@ -221,6 +420,8 @@ in_pcbinfo_init(struct inpcbinfo *pcbinfo, const char *name,
&pcbinfo->ipi_hashmask);
pcbinfo->ipi_porthashbase = (inpcbporthead *)hashinit(porthash_nelements, 0,
&pcbinfo->ipi_porthashmask);
+ pcbinfo->ipi_lbgrouphashbase = (inpcblbgrouphead *)hashinit(hash_nelements, 0,
+ &pcbinfo->ipi_lbgrouphashmask);
// FIXME: uma_zone_set_max(pcbinfo->ipi_zone, maxsockets);
}

@@ -1090,6 +1291,7 @@ in_pcbdrop(struct inpcb *inp)
struct inpcbport *phd = inp->inp_phd;

INP_HASH_WLOCK(inp->inp_pcbinfo);
+ in_pcbremlbgrouphash(inp);
LIST_REMOVE(inp, inp_hash);
LIST_REMOVE(inp, inp_portlist);
if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
@@ -1340,6 +1542,61 @@ in_pcblookup_local(struct inpcbinfo *pcbinfo, struct in_addr laddr,
}
#undef INP_LOOKUP_MAPPED_PCB_COST

+static struct inpcb *
+in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo,
+ const struct in_addr *laddr, uint16_t lport, const struct in_addr *faddr,
+ uint16_t fport, int lookupflags)
+{
+ struct inpcb *local_wild = NULL;
+ const struct inpcblbgrouphead *hdr;
+ struct inpcblbgroup *grp;
+ struct inpcblbgroup *grp_local_wild;
+
+ INP_HASH_LOCK_ASSERT(pcbinfo);
+
+ hdr = &pcbinfo->ipi_lbgrouphashbase[
+ INP_PCBLBGROUP_PORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)];
+
+ /*
+ * Order of socket selection:
+ * 1. non-wild.
+ * 2. wild (if lookupflags contains INPLOOKUP_WILDCARD).
+ *
+ * NOTE:
+ * - Load balanced group does not contain jailed sockets
+ * - Load balanced group does not contain IPv4 mapped INET6 wild sockets
+ */
+ LIST_FOREACH(grp, hdr, il_list) {
+#ifdef INET6
+ if (!(grp->il_vflag & INP_IPV4))
+ continue;
+#endif
+
+ if (grp->il_lport == lport) {
+
+ uint32_t idx = 0;
+ int pkt_hash = INP_PCBLBGROUP_PKTHASH(faddr->s_addr,
+ lport, fport);
+
+ idx = pkt_hash % grp->il_inpcnt;
+
+ if (grp->il_laddr.s_addr == laddr->s_addr) {
+ return (grp->il_inp[idx]);
+ } else {
+ if (grp->il_laddr.s_addr == INADDR_ANY &&
+ (lookupflags & INPLOOKUP_WILDCARD)) {
+ local_wild = grp->il_inp[idx];
+ grp_local_wild = grp;
+ }
+ }
+ }
+ }
+ if (local_wild != NULL) {
+ return (local_wild);
+ }
+ return (NULL);
+}
+
/*
* Lookup PCB in hash list, using pcbinfo tables. This variation assumes
* that the caller has locked the hash list, and will not perform any further
@@ -1387,6 +1644,18 @@ in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
if (tmpinp != NULL)
return (tmpinp);

+ /*
+ * Then look in lb group (for wildcard match).
+ */
+ if (pcbinfo->ipi_lbgrouphashbase != NULL &&
+ (lookupflags & INPLOOKUP_WILDCARD)) {
+ inp = in_pcblookup_lbgroup(pcbinfo, &laddr, lport, &faddr,
+ fport, lookupflags);
+ if (inp != NULL) {
+ return (inp);
+ }
+ }
+
/*
* Then look for a wildcard match, if requested.
*/
@@ -1552,6 +1821,18 @@ in_pcbinshash_internal(struct inpcb *inp)
pcbporthash = &pcbinfo->ipi_porthashbase[
INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_porthashmask)];

+ /*
+ * Add entry to load balance group.
+ * Only do this if INP_REUSEPORT is set.
+ */
+ if (inp->inp_flags2 & INP_REUSEPORT) {
+ int ret = in_pcbinslbgrouphash(inp);
+ if (ret) {
+ /* pcb lb group malloc fail (ret=ENOBUFS). */
+ return (ret);
+ }
+ }
+
/*
* Go through port list and look for a head for this lport.
*/
@@ -1642,6 +1923,10 @@ in_pcbremlists(struct inpcb *inp)
struct inpcbport *phd = inp->inp_phd;

INP_HASH_WLOCK(pcbinfo);
+
+ /* XXX: Only do if SO_REUSEPORT set? */
+ in_pcbremlbgrouphash(inp);
+
LIST_REMOVE(inp, inp_hash);
LIST_REMOVE(inp, inp_portlist);
if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
diff --git a/bsd/sys/netinet/in_pcb.h b/bsd/sys/netinet/in_pcb.h
index 85df54d6..a3f7a77a 100644
--- a/bsd/sys/netinet/in_pcb.h
+++ b/bsd/sys/netinet/in_pcb.h
@@ -318,6 +318,13 @@ struct inpcbinfo {
struct inpcbporthead *ipi_porthashbase; /* (h) */
u_long ipi_porthashmask; /* (h) */

+ /*
+ * Load balance groups used for the SO_REUSEPORT option,
+ * hashed by local port.
+ */
+ struct inpcblbgrouphead *ipi_lbgrouphashbase; /* (h) */
+ u_long ipi_lbgrouphashmask; /* (h) */
+
/*
* Pointer to network stack instance
*/
@@ -331,6 +338,27 @@ struct inpcbinfo {

#ifdef _KERNEL

+/*
+ * Load balance groups used for the SO_REUSEPORT socket option. Each group
+ * (or unique address:port combination) can be re-used at most
+ * INPCBLBGROUP_SIZMAX (256) times. The inpcbs are stored in il_inp which
+ * is dynamically resized as processes bind/unbind to that specific group.
+ */
+struct inpcblbgroup {
+ LIST_ENTRY(inpcblbgroup) il_list;
+ uint16_t il_lport; /* (c) */
+ u_char il_vflag; /* (c) */
+ u_char il_pad;
+ uint32_t il_pad2;
+ union in_dependaddr il_dependladdr; /* (c) */
+#define il_laddr il_dependladdr.id46_addr.ia46_addr4
+#define il6_laddr il_dependladdr.id6_addr
+ uint32_t il_inpsiz; /* max count in il_inp[] (h) */
+ uint32_t il_inpcnt; /* cur count in il_inp[] (h) */
+ struct inpcb *il_inp[]; /* (h) */
+};
+LIST_HEAD(inpcblbgrouphead, inpcblbgroup);
+
// No need to do any initialization to the lock, if the inp object was
// created in C++ and the constructor ran (i.e., with new)
//#define INP_LOCK_INIT(inp, d, t) mutex_init(&(inp)->inp_lock)
@@ -398,6 +426,10 @@ void inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp,
(((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
#define INP_PCBPORTHASH(lport, mask) \
(ntohs((lport)) & (mask))
+#define INP_PCBLBGROUP_PORTHASH(lport, mask) \
+ (ntohs((lport)) & (mask))
+#define INP_PCBLBGROUP_PKTHASH(faddr, lport, fport) \
+ ((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport)))

/*
* Flags for inp_vflags -- historically version flags only
--
2.31.1

Commit Bot

Aug 22, 2021, 8:32:11 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
From: Waldemar Kozaczuk <jwkoz...@gmail.com>
Committer: Nadav Har'El <n...@scylladb.com>
Branch: master

in_pcb: extract union in_dependaddr

This patch is a small subset of the original FreeBSD change -
https://reviews.freebsd.org/rS334719 - which added support for
the new SO_REUSEPORT_LB socket option.

The changes below do not affect any functionality; they simply
extract a new union in_dependaddr, which will be used by the
load balancing group struct defined in the forthcoming patch.

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
Message-Id: <20210822042314.16...@gmail.com>

---
diff --git a/bsd/sys/netinet/in_pcb.h b/bsd/sys/netinet/in_pcb.h

Nadav Har'El

Aug 22, 2021, 9:38:15 AM
to Waldemar Kozaczuk, Osv Dev
Thanks! Looks good.

One thought that crossed my mind while reading this was that maybe it made sense to keep the FreeBSD name SO_REUSEPORT_LB, and then translate LINUX_SO_REUSEPORT to SO_REUSEPORT_LB, because staying closer to FreeBSD might make it easier to port more patches in the future. But on the other hand, it's probably not worth it, and what you did is perfectly fine.

So I'll commit this patch.

--
Nadav Har'El
n...@scylladb.com



Nadav Har'El

Aug 22, 2021, 9:43:38 AM
to Waldemar Kozaczuk, Osv Dev
Unfortunately, I can't commit this yet, because it fails compilation on my gcc 11.2.1:

In file included from bsd/sys/netinet/in_pcb.cc:40:
bsd/sys/netinet/in_pcb.cc: In function ‘inpcblbgroup* in_pcblbgroup_alloc(inpcblbgrouphead*, u_char, uint16_t, const in_dependaddr*, int)’:
bsd/sys/netinet/in_pcb.cc:212:56: error: ‘size’ is not a constant expression
  212 |         bytes = __offsetof(struct inpcblbgroup, il_inp[size]);
      |                                                        ^~~~
./bsd/porting/netport.h:45:59: note: in definition of macro ‘__offsetof’
   45 | #define __offsetof(type, field)  __builtin_offsetof(type, field)
      |                                                           ^~~~~


Maybe instead of
        bytes = __offsetof(struct inpcblbgroup, il_inp[size]);
we should use
        bytes = __offsetof(struct inpcblbgroup, il_inp[0]) + size;
?
(but I'm not sure, please check it really does the same...)


--
Nadav Har'El
n...@scylladb.com


Waldemar Kozaczuk

Aug 22, 2021, 11:38:41 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
V2: Compared to V1, this patch slightly changes the expression
that calculates the size of the allocated memory in in_pcblbgroup_alloc()
so that it compiles with GCC 11 (see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95942).

So I changed this:

bytes = __offsetof(struct inpcblbgroup, il_inp[size]);

to:

bytes = __offsetof(struct inpcblbgroup, il_inp) + sizeof(inpcblbgroup::il_inp[0]) * size;
index 0f62561b..ac4e5f3b 100644
--- a/bsd/sys/netinet/in_pcb.cc
+++ b/bsd/sys/netinet/in_pcb.cc
@@ -85,6 +85,9 @@

#include <osv/trace.hh>

+#define INPCBLBGROUP_SIZMIN 8
+#define INPCBLBGROUP_SIZMAX 256
+
TRACEPOINT(trace_inpcb_ref, "inp=%x", struct inpcb *);
TRACEPOINT(trace_inpcb_rele, "inp=%x", struct inpcb *);
TRACEPOINT(trace_inpcb_free, "inp=%x", struct inpcb *);
@@ -199,6 +202,202 @@ SYSCTL_VNET_INT(_net_inet_ip_portrange, OID_AUTO, randomtime, CTLFLAG_RW,
* functions often modify hash chains or addresses in pcbs.
*/

+static struct inpcblbgroup *
+in_pcblbgroup_alloc(struct inpcblbgrouphead *hdr, u_char vflag,
+ uint16_t port, const union in_dependaddr *addr, int size)
+{
+ struct inpcblbgroup *grp;
+ size_t bytes;
+
+ bytes = __offsetof(struct inpcblbgroup, il_inp) + sizeof(inpcblbgroup::il_inp[0]) * size;

Waldek Kozaczuk

Aug 22, 2021, 11:39:59 AM
to Nadav Har'El, Osv Dev
I believe the right way to fix it is to change this:

bytes = __offsetof(struct inpcblbgroup, il_inp[size]);

to:

bytes = __offsetof(struct inpcblbgroup, il_inp) + sizeof(inpcblbgroup::il_inp[0]) * size;
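
To check that the two candidate expressions are not equivalent, here is a stand-alone sketch using a simplified stand-in for struct inpcblbgroup. The names struct lbgroup, struct member, lbgroup_bytes() and lbgroup_bytes_plus_size() are hypothetical and exist only for this comparison. Adding plain 'size' to the offset reserves 'size' bytes, whereas the array needs room for 'size' pointers.

```c
#include <stddef.h>

struct member;                      /* stands in for struct inpcb */

struct lbgroup {                    /* simplified struct inpcblbgroup */
    unsigned size;                  /* capacity of il_inp[] */
    unsigned count;                 /* entries currently in use */
    struct member *il_inp[];        /* flexible array member */
};

/* Bytes needed for a group with room for n member pointers: the offset of
 * the array plus n times the size of one element. */
static size_t lbgroup_bytes(size_t n)
{
    return offsetof(struct lbgroup, il_inp) + sizeof(struct member *) * n;
}

/* The "offset + size" variant: reserves only n extra bytes, which is too
 * small whenever a pointer is wider than one byte and n is non-zero. */
static size_t lbgroup_bytes_plus_size(size_t n)
{
    return offsetof(struct lbgroup, il_inp) + n;
}
```

With 8-byte pointers and n = 8, the first function asks for 64 bytes of array storage while the second asks for only 8, so the scaled form is the one that matches the original il_inp[size] intent.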

Commit Bot

Aug 22, 2021, 12:04:06 PM
to osv...@googlegroups.com, Waldemar Kozaczuk
From: Waldemar Kozaczuk <jwkoz...@gmail.com>
Committer: Nadav Har'El <n...@scylladb.com>
Branch: master

socket: implement Linux flavor of SO_REUSE_PORT option
Message-Id: <20210822153834.18...@gmail.com>

---
diff --git a/bsd/sys/compat/linux/linux.h b/bsd/sys/compat/linux/linux.h
--- a/bsd/sys/compat/linux/linux.h
+++ b/bsd/sys/compat/linux/linux.h
@@ -89,6 +89,7 @@ typedef struct {
#define LINUX_SO_NO_CHECK 11
#define LINUX_SO_PRIORITY 12
#define LINUX_SO_LINGER 13
+#define LINUX_SO_REUSEPORT 15
#define LINUX_SO_PEERCRED 17
#define LINUX_SO_RCVLOWAT 18
#define LINUX_SO_SNDLOWAT 19
diff --git a/bsd/sys/compat/linux/linux_socket.cc b/bsd/sys/compat/linux/linux_socket.cc
--- a/bsd/sys/compat/linux/linux_socket.cc
+++ b/bsd/sys/compat/linux/linux_socket.cc
@@ -340,6 +340,8 @@ linux_to_bsd_so_sockopt(int opt)
return (SO_OOBINLINE);
case LINUX_SO_LINGER:
return (SO_LINGER);
+ case LINUX_SO_REUSEPORT:
+ return (SO_REUSEPORT);
case LINUX_SO_RCVLOWAT:
return (SO_RCVLOWAT);
case LINUX_SO_SNDLOWAT:
diff --git a/bsd/sys/netinet/in_pcb.cc b/bsd/sys/netinet/in_pcb.cc